I have a binary file with the following structure:

•   Byte 0: file version number (2)
•   The remaining bytes represent records, which are composed of the following information:

–   2 bytes: lane number (uint16)
–   4 bytes: tile number (uint32)
–   2 bytes: read number (uint16)
–   2 bytes: indexLength, the length in bytes of index name (uint16)
–   indexLength bytes: string representing index name

The rest follows this format. I have read in the first set of items without issue, but am hung up on the best way to decode the index name string given indexLength, the number of bytes representing it.

Here's what I have so far; I have also added my error at the end. I can't find any information on the encoding, so I went with UTF-8.

samples: list[dict] = [] # empty list to hold sample dicts

with open(path, "rb") as f:
    # get file version (first byte)
    version = struct.unpack("B", f.read(1))[0]
    logger.debug(f"IndexMetricsOut.bin file version: {version}")

    while True:
        # fixed fields chunking
        # lane (2), tile (4), read (2), indexLength(2)
        fixed_format = "HIHH"
        size = struct.calcsize(fixed_format)
        chunk = f.read(size)

        # if end of file
        if not chunk:
            logger.debug(f"End of file reached")
            break

        # assign fixed byte variables
        lane, tile, read_num, index_len = struct.unpack(fixed_format,
                                                        chunk,
                                                        )
        logger.debug(f"lane: {lane}, tile: {tile}, "
                     f"read_num: {read_num}, index_len: {index_len}")
        
        def unpack_helper(fmt, data):
            size = struct.calcsize(fmt)
            return struct.unpack(fmt, data[:size]), data[size:]

        # decode and assign index_name based on index_len
        
        index_name_bytes = f.read(index_len)

        # TO DO: troubleshoot decoding issue at index name

        index_name = index_name_bytes.decode(encoding = "utf-8")

        print(index_name)

When I print the value of index_name_bytes, it is far longer than it should be. I can see my index value as alphanumeric characters among the escaped bytes:

tests.readindexbin::DEBUG: IndexMetricsOut.bin file version: 2
tests.readindexbin::DEBUG: lane: 1, tile: 65536, read_num: 21, index_len: 16705
b'CACTGTTA-TGAGACTTGC\x1ad\x06\x00\x00\x00\x00\x00\x1d\x00MMR_YPZ_PYI-1504_50898847_180\x07\x00default\x0 ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 177: invalid start byte

I think I'm reading too many values, and there's got to be a better way to do this. Just stuck. Thanks so much for any help offered. I think I've looked at this too long.

Asked Apr 1 at 21:24 by m

1 Answer

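You are reading the wrong size. With no prefix character, a struct format uses native byte order, size, and alignment, so padding bytes are inserted to align each field for the native architecture (Ref: Byte Order, Size, and Alignment). To suppress the padding, start the format string with an explicit byte-order character such as '<':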

>>> import struct
>>> struct.calcsize('HIHH')  # 2+4+2+2 should be 10, but I is misaligned so 2 pad bytes added.
12
>>> struct.calcsize('<HIHH')  # little-endian
10

The '<HIHH' format works correctly on the one complete record I can see in your data:

import io
import struct

f = io.BytesIO(b'\x1ad\x06\x00\x00\x00\x00\x00\x1d\x00MMR_YPZ_PYI-1504_50898847_180')

fixed_format = '<HIHH'
size = struct.calcsize(fixed_format)
chunk = f.read(size)
lane, tile, read_num, index_len = struct.unpack(fixed_format, chunk)
print(f'lane: {lane}, tile: {tile}, read_num: {read_num}, index_len: {index_len}')
print(f.read(index_len).decode())

Output:

lane: 25626, tile: 6, read_num: 0, index_len: 29
MMR_YPZ_PYI-1504_50898847_180

Also note that the data following this record, b'\x07\x00default', looks like another length-prefixed string rather than a new record starting with lane/tile/read_num, so there appears to be more to this format than the question specifies.
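
Building on that observation, here is a minimal sketch of a reader for a whole record. Only the 10-byte header and the index name come from the question's description. The 8-byte integer and the two length-prefixed strings after the name are my guess from the bytes in your dump, and the names count, sample and project are hypothetical, so check the format's documentation before relying on this. Under this reading, b'\x1ad\x06\x00\x00\x00\x00\x00' would be an 8-byte integer belonging to the first record rather than the start of a new header. The tile value 1101 in the test data is also made up.

```python
import io
import struct

HEADER = struct.Struct("<HIHH")  # lane, tile, read, indexLength: 10 bytes, no padding


def read_exact(f, n):
    """Read exactly n bytes, raising EOFError if the file is truncated."""
    data = f.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def read_prefixed_str(f, encoding="utf-8"):
    """Read a little-endian uint16 length, then that many bytes of text."""
    (length,) = struct.unpack("<H", read_exact(f, 2))
    return read_exact(f, length).decode(encoding)


def read_records(f):
    """Yield one dict per record until a clean end of file."""
    while True:
        chunk = f.read(HEADER.size)
        if not chunk:
            return  # clean EOF on a record boundary
        if len(chunk) != HEADER.size:
            raise EOFError("truncated record header")
        lane, tile, read_num, index_len = HEADER.unpack(chunk)
        index_name = read_exact(f, index_len).decode("utf-8")
        # ASSUMPTION: the three fields below are inferred from the dump only.
        (count,) = struct.unpack("<Q", read_exact(f, 8))
        sample = read_prefixed_str(f)
        project = read_prefixed_str(f)
        yield {
            "lane": lane, "tile": tile, "read": read_num,
            "index_name": index_name, "count": count,
            "sample": sample, "project": project,
        }


# Test data: version byte, a header with a hypothetical tile of 1101,
# then the bytes visible in the question's dump.
name = b"AACACTGTTA-TGAGACTTGC"
data = (
    b"\x02"
    + struct.pack("<HIHH", 1, 1101, 1, len(name)) + name
    + b"\x1ad\x06\x00\x00\x00\x00\x00"
    + b"\x1d\x00MMR_YPZ_PYI-1504_50898847_180"
    + b"\x07\x00default"
)

f = io.BytesIO(data)
version = read_exact(f, 1)[0]
records = list(read_records(f))
print(version, records)
```

Reading the header with f.read and checking its length separately lets the loop tell a clean end of file apart from a truncated record, which the original "if not chunk" test cannot do.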

Tags: python · Reading str from binary file given length in bytes · Stack Overflow