
I have a pipe delimited file that I am trying to import into SQL Server. The column data is unfortunately not text qualified and I am unable to get a new copy of the file sent to me.

There is a field in the data that looks as though it was a free-text box in the software this file was pulled from (a very old Oracle-database-based system). Unfortunately, this means users were able to enter carriage returns/line breaks when entering data into the field.

As a result, the text file contains line breaks that Microsoft SQL Server's BULK INSERT treats as new rows, thus "breaking" the data once it has been imported.

Here's an example of the data. This should be only 2 records, but it imports as 4 records due to the line breaks (or whatever hidden character is being used).

CONTACT_SOURCE_ID|LAST_UPDATED|LAST_UPDATED_BY|T92|D10|T94|T95|T96|T97|T98|T99|T100|T101|T124|T125|T126
72|05/01/2006 14:48:38|13Mr|S|BLOGGS|1 Random Building
Random Street|Random Town||Random County|RN1 2DM
74|05/01/2006 15:48:38|31Dr|Delta|Who|The
Tardis|SpaceAndTime||Universe|D1 0WW

I need to somehow remove these hidden line breaks without removing the line breaks acting as the real row terminators (otherwise it'll result in just one very long record). So essentially I need a method for removing all line breaks contained between two column delimiters.

I have tried reading the file line by line, counting the number of column delimiters in the row (I know there will be 15), and if it is less than 15, doing a replace on the line to remove \n, \r and Environment.NewLine, then writing the lines back to a new file.

const Int32 BufferSize = 128;
string fileName = @"C:\Work\OriginalFile.txt";
string outputFile = @"C:\Work\NewFile.txt";
int numColBreaks = 0;

using (var fileStream = File.OpenRead(fileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
{
    String line;

    while ((line = streamReader.ReadLine()) != null)
    {
        numColBreaks = 0;

        foreach (char c in line) 
        {
            if (c == '|')
            {
                numColBreaks++;
            }
        }

        if (numColBreaks != 15)
        {
            line.Replace(Environment.NewLine, " ").Replace("\n", " ").Replace("\r", " ");
        }
        
        using (StreamWriter sw = File.AppendText(outputFile))
        {
            sw.WriteLine(line);
        }
    }
}

But this hasn't worked.

I'm still pretty new to C# so am probably overlooking something here.

If I could text qualify the data somehow, I think SQL Server BULK INSERT would work like this:

BULK INSERT dbo.TargetTable
FROM 'C:\Work\OriginalFile.txt'
WITH (
    FORMAT = 'CSV'
    , FIELDTERMINATOR = '|'
    , ROWTERMINATOR = '\n'
    , FIRSTROW = 2
)

But I don't know how I'm going to do that, especially given this text file has over 600,000 rows of data in it.


  • 1 And what happens if a pesky user entered | in your free-text field? You're pretty smoked here, but you probably know that already :) Anyway, I'd probably do something like: 1) read a row and count its | characters; 2) if fewer than desired, read the next row and count its | again; 3) if together they reach 15, you're done, otherwise go to 2. If you have more than 15, then a user probably entered a pipe – siggemannen Commented Mar 14 at 15:40
  • NOT tested but... Can you open the file in VS Code and do a search-and-replace using regex, replacing \|[\s\r\n]*\| with an empty string or space? Obviously make a backup first! – Alan Schofield Commented Mar 14 at 15:44
  • 2 And then, most important part, walk over to whoever created that dump and put salt in their coffee. /sarcasm But seriously: tell them that this was the first and last time you fixed their mess. (Really, do it. I found it works more often than you might think, especially if you help them create reasonably formatted data dumps) – Fildor Commented Mar 14 at 15:54
  • 1 @Fildor The only line breaks I need to preserve are the end of row line breaks. If there is a line break inside the column data, I would rather replace these with a space. – welshsteve147 Commented Mar 14 at 16:21
  • 2 For the love of dog do not attempt to use Excel to fix this mess. Excel loves its native format files but has a particularly care-free attitude to data ingested from delimited text files that causes many insidious breakages: messing with international and scientific characters; inferring incorrect data types and so breaking data; etc. – AlwaysLearning Commented Mar 15 at 0:03

3 Answers


Assuming (as you asserted in your question) that none of the fields are currently quote-delimited, you should be able to use regular expressions to add quotes around all fields in your data file. The pipe delimiters can also be changed to commas in the process. This can be done using any good text editor (like Notepad++) or language (like C#) that supports regular expressions.

The resulting file can then be successfully imported using the BULK INSERT command with the FORMAT=CSV option (SQL Server 2017 and later).
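
For reference, the final import might then look roughly like the statement below. (This is my sketch rather than part of the original answer; it assumes the converted file is saved as C:\Work\NewFile.csv and the dbo.TargetTable from the question.)

-- Sketch only: assumes the converted, quoted, comma-delimited file at C:\Work\NewFile.csv
BULK INSERT dbo.TargetTable
FROM 'C:\Work\NewFile.csv'
WITH (
    FORMAT = 'CSV'            -- SQL Server 2017+; parses quoted fields, including embedded line breaks
    , FIELDTERMINATOR = ','
    , FIELDQUOTE = '"'        -- '"' is already the default with FORMAT = 'CSV'; shown here for clarity
    , ROWTERMINATOR = '\n'
    , FIRSTROW = 2
)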

First you should double up any quotes that might be a part of the data:

  • Replace all: "
  • With: ""

Next you can use a regex to extract groups of 16 pipe-delimited fields and replace them with quoted, comma-delimited fields:

  • Replace all regex: ^([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)$
  • With: "$1","$2","$3","$4","$5","$6","$7","$8","$9","${10}","${11}","${12}","${13}","${14}","${15}","${16}"

Each ([^|]*) matches and captures a sequence of zero or more non-pipe characters in a field. Each \| matches a pipe itself. The ^ and $ match the beginning and end of a line, respectively. In the replacement string, the $1 through ${16} sequences are replaced with the original captured field values. (Some editors may use a slightly different syntax for the replacement substitutions.)

All this assumes 16 fields per record.

The result would be something like:

"CONTACT_SOURCE_ID","LAST_UPDATED","LAST_UPDATED_BY","T92","D10","T94","T95","T96","T97","T98","T99","T100","T101","T124","T125","T126"
"72","05/01/2006 14:48:38","13","","","Mr","S","BLOGGS","1 Random Building
Random Street","Random Town","","Random County","RN1 2DM","","",""
"74","05/01/2006 15:48:38","31","","","Dr","Delta","Who","The
Tardis","SpaceAndTime","","Universe","D1 0WW","","",""

Even though all fields are quoted (both those with newlines and those without), this is allowed in CSV files, and the BULK INSERT ... FORMAT=CSV command will remove the quotes and un-double any previously doubled-up quotes. The result should even be readable by Excel for visual verification.

See also RFC 4180.

If you prefer to use a C# program, you can:

  • Read the entire file into a single string using File.ReadAllText().
  • Apply the above regular expressions.
  • Write the result back out.

You can even generate the regex search and replace strings on-the-fly to make the code reusable for files with different field counts.

Something like:

using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

string fileName = @"C:\Work\OriginalFile.txt";
string outputFile = @"C:\Work\NewFile.txt";
int fieldCount = 16;

// Read unquoted pipe-delimited source data
string unquotedPipeDelimited = File.ReadAllText(fileName, Encoding.UTF8);

// Double up any quotes in the data: " becomes ""
string search1 = @"""";
string replace1 = @"""""";
string doubledQuotes = Regex.Replace(unquotedPipeDelimited, search1, replace1);

// Quote the fields and replace the delimiter: abc|def becomes "abc","def"
string search2 =
    "^"
    + string.Join("\\|", Enumerable.Range(1, fieldCount).Select(i => "([^|]*)"))
    + "$";
string replace2 =
    string.Join(",", Enumerable.Range(1, fieldCount).Select(i => $"\"${i}\""));
RegexOptions options2 = RegexOptions.Multiline;
string csv = Regex.Replace(doubledQuotes, search2, replace2, options2);

// Write CSV
File.WriteAllText(outputFile, csv, Encoding.UTF8);

See this .NET Fiddle for a demo.

I believe this technique is solid, unless the last field has line breaks in the data. In that case, the data ambiguity is difficult if not impossible to resolve. (Is that line break part of the data or does it mark the end of the record?)

(This is not really an answer, mainly because you wouldn't like it, but I needed more space than a comment allows.)

IMHO, having CRLF in text fields is quite normal, and I wouldn't blame Oracle for that; I'd blame MS SQL Server for not playing well with CRLF. Nowadays I am writing all sorts of things to migrate away from MS SQL Server to PostgreSQL. That involves moving not only the tables and data but also stored procedures, functions, etc. To keep a long story short, text import into MS SQL Server is something I would avoid.

For bulk copying, if you have the opportunity, use Oracle directly as the source and MS SQL Server as the target with the SqlBulkCopy class in C#. That works pretty well.
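
Just to illustrate the idea, here is a rough, untested sketch (the connection strings, source query, and target table name are placeholders, and it assumes the Oracle.ManagedDataAccess and Microsoft.Data.SqlClient packages):

using Microsoft.Data.SqlClient;
using Oracle.ManagedDataAccess.Client;

// Placeholder connection strings and object names: adjust to your environment.
var oracleConnStr = "User Id=...;Password=...;Data Source=...";
var sqlServerConnStr = "Server=...;Database=...;Integrated Security=true;TrustServerCertificate=true";

using (var source = new OracleConnection(oracleConnStr))
using (var cmd = new OracleCommand("SELECT * FROM SOURCE_TABLE", source))
{
    source.Open();

    using (var reader = cmd.ExecuteReader())
    using (var bulkCopy = new SqlBulkCopy(sqlServerConnStr))
    {
        bulkCopy.DestinationTableName = "dbo.TargetTable";
        bulkCopy.BulkCopyTimeout = 0;   // no timeout for large copies
        bulkCopy.BatchSize = 10000;     // commit in batches rather than one huge transaction

        // Streams rows straight from the Oracle reader into SQL Server,
        // so embedded line breaks in text columns stay inside their columns.
        bulkCopy.WriteToServer(reader);
    }
}

Because the rows stream straight from the Oracle reader into SQL Server, the text never passes through a delimited file, so the whole line-break problem disappears.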

If you need to use text imports (and this likely isn't a recurring task), then IMHO the best database for text imports is PostgreSQL. Just install PostgreSQL or run it with Docker (inserting locally is easier for this job) and do your imports using PostgreSQL's wonderful COPY command; it can handle data like that. Once you have your data in PostgreSQL (staying on PostgreSQL would be best), you could transfer it to MS SQL Server by using an FDW in PostgreSQL, or PostgreSQL as a linked server in MS SQL Server, or, perhaps easiest for you, by using the SqlBulkCopy class in C# with PostgreSQL as the source and MS SQL Server as the target.

Speed-wise, assuming your data looks like what you supplied, it should take under 15-20 seconds to get this data into PostgreSQL, and then less than another 15-20 seconds to get it into MS SQL Server.

Or, if you weren't new to C#, you could sanitize that data using regex and push it to MS SQL Server with the SqlBulkCopy class (but the PostgreSQL approach is still easier to complete).

This is very much untested code because I'm not at a C# computer at the moment, but something like the following should probably work:

// fileName, outputFile and BufferSize as declared in the question's code
using (var fileStream = File.OpenRead(fileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
using (StreamWriter sw = File.AppendText(outputFile))
{
    String line;
    String lineBuffer = "";
    int numColBreaks;
    int numColBreaksBuffer = 0;

    while ((line = streamReader.ReadLine()) != null)
    {
        numColBreaks = 0;

        foreach (char c in line)
        {
            if (c == '|')
            {
                numColBreaks++;
            }
        }

        numColBreaksBuffer += numColBreaks;

        if (numColBreaksBuffer < 15) // dangling string?
        {
            lineBuffer += line + " "; // or whatever join character you want as the default??
            continue;
        }

        sw.WriteLine(lineBuffer + line);

        lineBuffer = "";
        numColBreaksBuffer = 0;
    }

    if (lineBuffer.Length > 0) // write the rest of the text
        sw.WriteLine(lineBuffer);
}
