Exclude BOM (byte order mark) character when combining files in Python
I've created a Python script to combine multiple SQL files into one. Sometimes the script inserts a BOM (byte order mark) character, which breaks the SQL script and has to be corrected manually. I can't do a simple string replace because the BOM is interpreted as binary, and I've tried various decoding methods on both the files I'm reading and the one I write to; none of them have worked so far. Any tips on how I can solve this?
What the BOM looks like:
Script:
import os
from FileDto import FileDto

def getFileItems(path):
    directory = os.fsencode(path)
    items = []
    for file in os.listdir(directory):
        fileName = os.fsdecode(file)
        if os.path.isfile(path + '\\' + fileName):
            datePart = fileName[0:12]
            fileDto = FileDto(datePart, fileName)
            items.append(fileDto)
    items.sort(key=lambda x: x.id, reverse=False)
    return items

path = input('Enter a path where the SQL scripts that need to be combined live:')
fileItems = getFileItems(path)
counter = 1
outFileName = 'Combined Release Script.sql'
outFilePath = path + "\\" + outFileName
with open(outFilePath, 'w', encoding='utf-8') as outfile:
    for names in fileItems:
        outfile.write('--' + str(counter) + '.' + names.fileName)
        outfile.write('\n\n')
        with open(path + "\\" + names.fileName) as infile:
            for line in infile:
                outfile.write(line)
        outfile.write('\nGO\n')
        counter += 1
print("Done. %s SQL scripts have been combined into one called - %s" % (len(fileItems), outFileName))
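For reference, the symptom can be reproduced in isolation. This is a minimal sketch with a hypothetical input file; it assumes the offending scripts were saved by an editor that prepends a UTF-8 BOM:

```python
import codecs
import os
import tempfile

# Hypothetical input: an editor saved this SQL file with a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), 'script1.sql')
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'SELECT 1;')

# Reading with plain utf-8 does NOT strip the BOM: it survives as a
# '\ufeff' character and gets copied verbatim into the combined file.
with open(path, encoding='utf-8') as f:
    leaked = f.read()

print(repr(leaked))  # '\ufeffSELECT 1;'
```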
asked Jan 31 at 10:49 by Denys Wessels; edited Jan 31 at 16:39 by Barmar
2 Answers
If you know a file is encoded in a way that is not the default for your OS, then use the encoding argument when reading the file. You appear to be using Windows, where the system encoding is a Windows code page, e.g. cp1252. If you know your text file is encoded with UTF-16 but do not know the endianness, then use the utf-16 encoding, e.g.
with open('foo.txt', 'r', encoding='utf-16') as fp:
    text = fp.read()
This will look for the BOM and use it to detect whether the file is little-endian or big-endian UTF-16. It will also remove the BOM from the returned text.
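A quick round trip demonstrates both points: Python's utf-16 codec physically writes a BOM to the file, and the same codec consumes it on read, so the decoded text comes back clean. A sketch using a temporary file:

```python
import os
import tempfile

# Write UTF-16: Python prepends a BOM so the byte order is recorded.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with open(path, 'w', encoding='utf-16') as fp:
    fp.write('hello world')

# The BOM is physically in the file...
with open(path, 'rb') as fp:
    raw = fp.read()
assert raw[:2] in (b'\xff\xfe', b'\xfe\xff')

# ...but the utf-16 decoder uses it to pick the endianness and strips it.
with open(path, 'r', encoding='utf-16') as fp:
    text = fp.read()
assert text == 'hello world'
```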
If you are completely unsure of the encoding of a file, then you can use the package charset-normalizer to detect the encoding. Reading your file would then become:
import charset_normalizer

with open('foo.txt', 'rb') as fp:  # NB: opened in binary mode
    result = charset_normalizer.from_fp(fp)
    text = str(result.best())

BOM = '\ufeff'
assert not text.startswith(BOM)
assert result.best().encoding == 'utf_16'
Quick example that shows it detects utf-16-le and utf-16-be equally well and returns the same text:
import charset_normalizer

le_hello_world = b'\xff\xfe' b'h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
be_hello_world = b'\xfe\xff' b'\x00h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d'

def decode(bs):
    return str(charset_normalizer.from_bytes(bs).best())

assert decode(le_hello_world) == decode(be_hello_world) == 'hello world'
So it looks like the BOM appears in files that are UTF-16 encoded, and when reading those files you need to explicitly specify that the encoding is utf-16. That fixes the issue of the BOM showing up in the file being read. However, if any other files in the directory use a different encoding, e.g. utf-8, and I'm explicitly setting the encoding to utf-16, the code will crash. As a workaround, I've created a function which tries to detect the encoding of the file and then passes it dynamically when opening the file for reading:
encoding = detectEncoding(fullPath)
if encoding != '':
    with open(path + "\\" + names.fileName, 'r', encoding=encoding) as infile:
        ...
This seems to have solved the problem. Here's the complete solution, in case anyone else runs into a similar issue:
import os
from FileDto import FileDto

def detectEncoding(file_path):
    # Trial-decode with each candidate; the first one that reads the whole
    # file without error wins. Note that latin-1 can decode any byte
    # sequence, so it effectively acts as a catch-all fallback.
    ENCODINGS_TO_TRY = ['utf-8', 'utf-16', 'latin-1', 'iso-8859-1', 'windows-1252']
    for encoding in ENCODINGS_TO_TRY:
        try:
            with open(file_path, 'r', encoding=encoding) as file:
                file.read()
            return encoding
        except (UnicodeDecodeError, OSError):
            continue
    return ''

def getFileItems(path):
    directory = os.fsencode(path)
    items = []
    for file in os.listdir(directory):
        fileName = os.fsdecode(file)
        if os.path.isfile(path + '\\' + fileName):
            datePart = fileName[0:12]
            fileDto = FileDto(datePart, fileName)
            items.append(fileDto)
    items.sort(key=lambda x: x.id, reverse=False)
    return items

path = input('Enter a path where the SQL scripts that need to be combined live:')
fileItems = getFileItems(path)
counter = 1
outFileName = 'Combined Release Script.sql'
outFilePath = path + "\\" + outFileName
with open(outFilePath, 'w') as outfile:
    for names in fileItems:
        outfile.write('--' + str(counter) + '.' + names.fileName)
        outfile.write('\n\n')
        fullPath = path + "\\" + names.fileName
        encoding = detectEncoding(fullPath)
        if encoding != '':
            with open(path + "\\" + names.fileName, 'r', encoding=encoding) as infile:
                for line in infile:
                    outfile.write(line)
        outfile.write('\nGO\n')
        counter += 1
print("Done. %s SQL scripts have been combined into one called - %s" % (len(fileItems), outFileName))
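The trial-decode idea can be sanity-checked on its own. This sketch re-declares a trimmed detectEncoding (only the names and candidate list matter) and feeds it files written in two encodings:

```python
import os
import tempfile

def detectEncoding(file_path, encodings=('utf-8', 'utf-16')):
    # First encoding that decodes the whole file without error wins.
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                f.read()
            return encoding
        except (UnicodeDecodeError, OSError):
            continue
    return ''

d = tempfile.mkdtemp()

utf16_path = os.path.join(d, 'a.sql')
with open(utf16_path, 'w', encoding='utf-16') as f:
    f.write('SELECT 1;')

utf8_path = os.path.join(d, 'b.sql')
with open(utf8_path, 'w', encoding='utf-8') as f:
    f.write('SELECT 2;')

# The UTF-16 file starts with a BOM byte (0xFF or 0xFE) that is invalid
# as a UTF-8 start byte, so the utf-8 attempt fails and utf-16 is picked.
assert detectEncoding(utf16_path) == 'utf-16'
assert detectEncoding(utf8_path) == 'utf-8'
```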
Tags: Exclude BOM (byte order mark) character when combining files in Python - Stack Overflow
Comments:
- Use utf_8_sig to read, which will remove the BOM. Or just discard the first character of a string, if it is the BOM. – Giacomo Catenazzi, Jan 31 at 11:05
- open(filename, 'r', encoding='utf-8-sig') skips the BOM if present: stackoverflow/questions/13590749/… – Pete Kirkham, Jan 31 at 11:05
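The utf-8-sig tip from the comments, as a runnable sketch with throwaway temp files:

```python
import codecs
import os
import tempfile

# A file saved with a UTF-8 BOM.
bom_path = os.path.join(tempfile.mkdtemp(), 'bom.sql')
with open(bom_path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'SELECT 1;')

# utf-8-sig skips a leading BOM if present...
with open(bom_path, 'r', encoding='utf-8-sig') as f:
    with_bom = f.read()
assert with_bom == 'SELECT 1;'

# ...and is harmless on a plain UTF-8 file without one.
plain_path = os.path.join(tempfile.mkdtemp(), 'plain.sql')
with open(plain_path, 'w', encoding='utf-8') as f:
    f.write('SELECT 2;')
with open(plain_path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()
assert without_bom == 'SELECT 2;'
```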