Exclude BOM (byte order mark) character when combining files in Python

I've created a Python script to combine multiple SQL files into one. Sometimes the script inserts a BOM (byte order mark) character, which breaks the combined SQL script and has to be corrected manually. I can't do a simple string replace because the BOM is interpreted as binary, and I've tried various decoding methods on both the files I'm reading and the one I'm writing to; none of them have worked so far. Any tips on how I can solve this?

What the BOM looks like:
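
(The screenshot is not reproduced here. Roughly, a UTF-8 BOM is the three bytes EF BB BF at the very start of the file; if it isn't stripped, it shows up in the text as the characters ï»¿ or as U+FEFF. A minimal Python sketch of the effect:)

# Minimal sketch: how a BOM shows up at the byte and string level.
data = b'\xef\xbb\xbf' + b'SELECT 1;'   # UTF-8 BOM followed by some SQL text
print(repr(data.decode('utf-8')))       # '\ufeffSELECT 1;' -- BOM kept as U+FEFF
print(repr(data.decode('utf-8-sig')))   # 'SELECT 1;'       -- BOM stripped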

Script:

import os 

from FileDto import FileDto

def getFileItems(path):

    directory = os.fsencode(path)    
    items = []

    for file in os.listdir(directory):
        fileName = os.fsdecode(file)
        if os.path.isfile(path + '\\' + fileName):
            datePart = fileName[0:12]
            fileDto = FileDto(datePart,fileName)
            items.append(fileDto)
    
    items.sort(key=lambda x: x.id, reverse=False)
    return items

path = input('Enter a path where the SQL scripts that need to be combined live:')

fileItems = getFileItems(path)

counter = 1

outFileName = 'Combined Release Script.sql'
outFilePath = path + "\\" + outFileName

with open(outFilePath, 'w', encoding='utf-8') as outfile:

    for names in fileItems:

        outfile.write('--' + str(counter) + '.' + names.fileName)
        outfile.write('\n\n')

        with open(path + "\\" + names.fileName) as infile:
            for line in infile:
                outfile.write(line)

        outfile.write('\nGO\n')
        counter += 1

print("Done.%s SQL scripts have been combined into one called - %s" % (len(fileItems), outFileName))

  • You didn't say how you read the files. You should probably read with utf_8_sig, which removes the BOM. Or just discard the first character of the string if it is a BOM. – Giacomo Catenazzi Commented Jan 31 at 11:05
  • You say you've tried different decoding techniques but don't tell us what they are. Usually open(filename, 'r', encoding='utf-8-sig') skips the BOM if present stackoverflow/questions/13590749/… – Pete Kirkham Commented Jan 31 at 11:05
  • @GiacomoCatenazzi the code to read individual files and write them to one file is included in the question – Denys Wessels Commented Jan 31 at 12:08
  • @PeteKirkham I tried both utf-8 and utf-8-sig; neither works – Denys Wessels Commented Jan 31 at 12:09
  • @DenysWessels: it is the reading of the files that causes the problem, not the writing (but you can also replace the BOM with an empty string) – Giacomo Catenazzi Commented Jan 31 at 12:55

2 Answers

If you know a file is encoded in a way that is not the default for your OS, then use the encoding argument when reading the file. You appear to be using Windows, where the system encoding is a Windows code page, e.g. cp1252. If you know your text file is encoded with UTF-16 but do not know the endianness, then use the utf-16 encoding, e.g.

with open('foo.txt', 'r', encoding='utf-16') as fp:
    text = fp.read()

This will look for the BOM, use it to detect whether the file is little-endian or big-endian UTF-16, and remove the BOM from the returned text.
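
The same idea covers UTF-8 files that carry a BOM: the utf-8-sig codec (also suggested in the comments above) reads UTF-8 with or without a BOM and strips the marker when present. A minimal sketch, reusing the placeholder file name foo.txt:

with open('foo.txt', 'r', encoding='utf-8-sig') as fp:
    text = fp.read()

assert not text.startswith('\ufeff')  # any leading BOM has been removed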

If you are completely unsure of the encoding of a file, then you can use the package charset-normalizer to detect the encoding. Reading your file would then become:

import charset_normalizer

with open('foo.txt', 'rb') as fp: # NB: opened in binary mode
    result = charset_normalizer.from_fp(fp)

text = str(result.best())

BOM = '\ufeff'
assert not text.startswith(BOM)
assert result.best().encoding == 'utf_16'

Quick example that shows it detects utf-16-le and utf-16-be equally well and returns the same text:

import charset_normalizer

le_hello_world = b'\xff\xfe' b'h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
be_hello_world = b'\xfe\xff' b'\x00h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d'

def decode(bs):
    return str(charset_normalizer.from_bytes(bs).best())

assert decode(le_hello_world) == decode(be_hello_world) == 'hello world'

So it looks like the BOM appears in files that are UTF-16 encoded, and when you read those files you need to explicitly specify that the encoding is utf-16. That fixes the issue of the BOM showing up in the file being read. However, if other files in the directory use a different encoding, e.g. utf-8, and I explicitly set the encoding to utf-16, the code will crash. As a workaround, I've created a function which tries to detect the encoding of each file and then passes that encoding when opening the file for reading:

    encoding = detectEncoding(fullPath)

    if encoding != '':
        with open(path + "\\" + names.fileName, 'r', encoding=encoding) as infile:

This seems to have solved the problem. Here's the complete solution in case anyone else runs into a similar issue:

import os 

from FileDto import FileDto

def detectEncoding(file_path):

    # utf-8-sig reads UTF-8 with or without a BOM and strips the marker if present.
    # Note: latin-1 can decode any byte sequence, so it acts as a catch-all and the
    # encodings listed after it (and the final return '') are effectively unreachable.
    ENCODINGS_TO_TRY = ['utf-8-sig', 'utf-16', 'latin-1', 'iso-8859-1', 'windows-1252']
    for encoding in ENCODINGS_TO_TRY:
        try:
            with open(file_path, 'r', encoding=encoding) as file:
                file.read()
            return encoding
        except (UnicodeDecodeError, OSError):
            continue
    return ''

def getFileItems(path):

    directory = os.fsencode(path)    
    items = []

    for file in os.listdir(directory):
        fileName = os.fsdecode(file)
        if os.path.isfile(path + '\\' + fileName):
            datePart = fileName[0:12]
            fileDto = FileDto(datePart,fileName)
            items.append(fileDto)
    
    items.sort(key=lambda x: x.id, reverse=False)
    return items

path = input('Enter a path where the SQL scripts that need to be combined live:')

fileItems = getFileItems(path)

counter = 1

outFileName = 'Combined Release Script.sql'
outFilePath = path + "\\" + outFileName

with open(outFilePath, 'w', encoding='utf-8') as outfile:  # Python's utf-8 codec does not write a BOM

    for names in fileItems:

        outfile.write('--' + str(counter) + '.' + names.fileName)
        outfile.write('\n\n')

        fullPath = path + "\\" + names.fileName
        encoding = detectEncoding(fullPath)

        if encoding != '':
            with open(path + "\\" + names.fileName,'r',encoding=encoding) as infile:
                
                 for line in infile:                       
                        outfile.write(line)
                        
            outfile.write('\nGO\n')
            counter += 1

print("Done.%s SQL scripts have been combined into one called - %s" % (len(fileItems), outFileName))
