I have a binary file with the following structure:
• Byte 0: file version number (2)
• The remaining bytes represent records, which are composed of the following information:
– 2 bytes: lane number (uint16)
– 4 bytes: tile number (uint32)
– 2 bytes: read number (uint16)
– 2 bytes: indexLength, the length in bytes of index name (uint16)
– indexLength bytes: string representing index name
The remaining records follow the same format. I have read in the first set of items without issue, but am hung up on the best way to decode the index name string given indexLength and the bytes representing it.
Here's what I have so far, with my error at the end. I can't find any information on the encoding, so I went with utf-8.
samples: list[dict] = []  # empty list to hold sample dicts
with open(path, "rb") as f:
    # get file version (first byte)
    version = struct.unpack("B", f.read(1))[0]
    logger.debug(f"IndexMetricsOut.bin file version: {version}")
    while True:
        # fixed fields chunking
        # lane (2), tile (4), read (2), indexLength(2)
        fixed_format = "HIHH"
        size = struct.calcsize(fixed_format)
        chunk = f.read(size)
        # if end of file
        if not chunk:
            logger.debug(f"End of file reached")
            break
        # assign fixed byte variables
        lane, tile, read_num, index_len = struct.unpack(fixed_format,
                                                        chunk,
                                                        )
        logger.debug(f"lane: {lane}, tile: {tile}, "
                     f"read_num: {read_num}, index_len: {index_len}")
        def unpack_helper(fmt, data):
            size = struct.calcsize(fmt)
            return struct.unpack(fmt, data[:size]), data[size:]
        # decode and assign index_name based on index_len
        index_name_bytes = f.read(index_len)
        # TO DO: troubleshoot decoding issue at index name
        index_name = index_name_bytes.decode(encoding = "utf-8")
        print(index_name)
When I print the value of index_name_bytes, it is far longer than it should be. I can see my index value as alphanumeric characters among the escaped bytes:
tests.readindexbin::DEBUG: IndexMetricsOut.bin file version: 2
tests.readindexbin::DEBUG: lane: 1, tile: 65536, read_num: 21, index_len: 16705
b'CACTGTTA-TGAGACTTGC\x1ad\x06\x00\x00\x00\x00\x00\x1d\x00MMR_YPZ_PYI-1504_50898847_180\x07\x00default\x0 ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 177: invalid start byte
I think I'm reading too many values, and there's got to be a better way to do this. Just stuck. Thanks so much for any help offered. I think I've looked at this too long.
asked Apr 1 at 21:24 by m
1 Answer
You are reading the wrong size. By default, struct formats use native alignment and insert padding bytes according to the native architecture (Ref: Byte Order, Size, and Alignment). To suppress that, specify the byte order of the structure explicitly; a prefix such as '<' selects standard sizes with no padding:
>>> import struct
>>> struct.calcsize('HIHH') # 2+4+2+2 should be 10, but I is misaligned so 2 pad bytes added.
12
>>> struct.calcsize('<HIHH') # little-endian
10
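To see how those 2 pad bytes would produce exactly the values in the question's debug output, here is a small sketch. It rests on assumptions: the index name b"AACACTGTTA-TGAGACTTGC" is inferred (the printed bytes start with CACTGTTA-TGAGACTTGC, and 16705 is 0x4141, i.e. b"AA"), the tile number 1101 is a placeholder for any value below 65536, and "<H2xIHH" only emulates what native "HIHH" does on a typical little-endian machine.

```python
import struct

# Inferred from the question's output, not confirmed: a 21-byte index name.
index_name = b"AACACTGTTA-TGAGACTTGC"

# Hypothetical header values: lane=1, tile=1101 (placeholder), read=1.
record = struct.pack("<HIHH", 1, 1101, 1, len(index_name)) + index_name

# "<H2xIHH" spells out the 2 pad bytes that native "HIHH" inserts after the
# first uint16 on common platforms, so the misread is reproducible anywhere.
PADDED = "<H2xIHH"
misread = struct.unpack(PADDED, record[:struct.calcsize(PADDED)])
correct = struct.unpack("<HIHH", record[:struct.calcsize("<HIHH")])

print(misread)  # (1, 65536, 21, 16705)
print(correct)  # (1, 1101, 1, 21)
```

In the padded read, the tile field picks up the read number, read_num picks up the real indexLength (21), and index_len is taken from the first two characters of the name. If that inference holds, the first index name is 21 bytes long rather than 16705.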
With the '<' prefix, the code works correctly on the one complete record I can see in your data:
import io
import struct
f = io.BytesIO(b'\x1ad\x06\x00\x00\x00\x00\x00\x1d\x00MMR_YPZ_PYI-1504_50898847_180')
fixed_format = '<HIHH'
size = struct.calcsize(fixed_format)
chunk = f.read(size)
lane, tile, read_num, index_len = struct.unpack(fixed_format, chunk)
print(f'lane: {lane}, tile: {tile}, read_num: {read_num}, index_len: {index_len}')
print(f.read(index_len).decode())
Output:
lane: 25626, tile: 6, read_num: 0, index_len: 29
MMR_YPZ_PYI-1504_50898847_180
Also, note that the following data after this record, b'\x07\x00default', looks like another length-encoded string and not another record starting with lane/tile/read_num, so it looks like there is more to this format than specified in the question.
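Putting the fix back into the question's loop could look roughly like the sketch below. It covers only the layout described in the question, so it would need extra fields if the real file has them. The names read_exact, read_records and make_record are made up for illustration, the test data is synthetic, and decoding as ASCII is an assumption, since the encoding isn't documented.

```python
import io
import struct

# lane (uint16), tile (uint32), read (uint16), indexLength (uint16).
# The "<" prefix means little-endian, standard sizes, no padding.
HEADER = struct.Struct("<HIHH")


def read_exact(f, n):
    """Read exactly n bytes, or raise EOFError on a short read."""
    data = f.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def read_records(f):
    """Yield (lane, tile, read_num, index_name) until the end of the file."""
    version = read_exact(f, 1)[0]  # byte 0: file version
    if version != 2:
        raise ValueError(f"unexpected file version: {version}")
    while True:
        chunk = f.read(HEADER.size)
        if not chunk:  # clean end of file
            return
        if len(chunk) != HEADER.size:
            raise EOFError("truncated record header")
        lane, tile, read_num, index_len = HEADER.unpack(chunk)
        # Assumption: index names are plain ASCII.
        index_name = read_exact(f, index_len).decode("ascii")
        yield lane, tile, read_num, index_name


def make_record(lane, tile, read_num, name):
    """Build one synthetic record in the layout from the question."""
    return HEADER.pack(lane, tile, read_num, len(name)) + name


# Synthetic file: version byte followed by two made-up records.
data = (
    bytes([2])
    + make_record(1, 1101, 1, b"AACACTGTTA-TGAGACTTGC")
    + make_record(2, 1102, 1, b"GGTT")
)
records = list(read_records(io.BytesIO(data)))
print(records)
```

Raising on a short read keeps a wrong length field from silently swallowing the rest of the file, which is what happened with index_len: 16705 in the question.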