I just noticed that whenever I use seek() on a TextIOWrapper object, the performance decreases noticeably.

The following code opens a text file (between 10 and 50 MB in size), reads one line, calls seek() with the position where that line started, and then reads the same line again.

I'd expect this code to read the file from disk only once, since the whole file fits into the buffer.

However, with a 25 MB file, this reads a total of 1.2 GB from disk. If I remove the call to seek(), the file is only read once. Why doesn't this work with seek()?

input("Press Enter to start...")
with open('file.txt', 'r', 50 * 1024 * 1024, 'utf-8', newline='\n') as file:
    while True:
        pos = file.tell()
        l1 = file.readline()
        if not l1:
            break
        
        file.seek(pos)
        l2 = file.readline()
input("Press Enter to exit...")

Asked Feb 11 at 20:19 by T3rm1; edited Feb 11 at 20:39 by Barmar.
  • Maybe in case the file has changed. – Barmar Commented Feb 11 at 20:42
  • @Barmar reading in binary mode 'rb' does not seek. Therefore that cannot be the justification – Homer512 Commented Feb 11 at 20:54
  • That detail would be useful to put in the question. – Barmar Commented Feb 11 at 20:56
  • @jsbueno please don't try to find solutions for other problems. Reading all lines is not an option. If you don't agree with me that the code should in fact only read the file once, feel free to give an answer. – T3rm1 Commented Feb 12 at 14:22
  • It seems like a bug, maybe you should submit it to the CPython developers. – Barmar Commented Feb 12 at 18:44

2 Answers

I think I have found the issue, but I don't have a satisfying immediate solution.

Minimal reproducible example

First things first: This problem only affects the io.TextIOWrapper created when opening in text mode ('r'), not the io.BufferedReader created in binary mode 'rb'. I have verified this behavior with strace. A simple self-contained test case may look like this:

import tempfile

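with tempfile.NamedTemporaryFile(mode='w') as tmp:
    tmp.write("foo\nbar\n")
    tmp.file.close()  # close the handle so the data is on disk before reopening
    with open(tmp.name, 'r') as infile:
        infile.readline()  # first read from disk
        infile.seek(0)     # seek back to a position that was just read
        infile.readline()  # in text mode this reads from disk again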

Change the mode to rb and observe the difference in strace. With text mode:

openat(AT_FDCWD, "/tmp/tmpwxotyrmt", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=8, ...}) = 0
ioctl(3, TCGETS, 0x7ffff3fb9420)        = -1 ENOTTY
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "foo\nbar\n", 8192)             = 8
lseek(3, 0, SEEK_SET)                   = 0
read(3, "foo\nbar\n", 8192)             = 8
close(3)                                = 0
unlink("/tmp/tmpwxotyrmt")              = 0

With binary mode:

openat(AT_FDCWD, "/tmp/tmp29ujuat3", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=8, ...}) = 0
ioctl(3, TCGETS, 0x7ffdaf2d3110)        = -1 ENOTTY
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "foo\nbar\n", 4096)             = 8
close(3)                                = 0
unlink("/tmp/tmp29ujuat3")              = 0

At first this discrepancy may be puzzling since the TextIOWrapper has a BufferedReader as its underlying self.buffer.
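
Without strace, the same comparison can be approximated in pure Python by counting calls to readinto() on the raw file object. The following is a minimal sketch and not part of the original test case: CountingFileIO and count_raw_reads are hypothetical helper names introduced here, and the layering (TextIOWrapper over BufferedReader over FileIO) is built by hand on the assumption that it mirrors what open() sets up.

```python
import io
import os
import tempfile


class CountingFileIO(io.FileIO):
    """Hypothetical helper: a raw file that counts readinto() calls."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.read_calls = 0

    def readinto(self, buffer):
        self.read_calls += 1
        return super().readinto(buffer)


def count_raw_reads(text_mode):
    """Run readline(), seek(0), readline() and return the raw read count."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sample.txt")
        with open(path, "w", newline="\n") as out:
            out.write("foo\nbar\n")

        raw = CountingFileIO(path, "r")
        infile = io.BufferedReader(raw)
        if text_mode:
            # Assumption: this mirrors the stack that open(path, 'r') builds.
            infile = io.TextIOWrapper(infile, encoding="utf-8")
        with infile:
            infile.readline()
            infile.seek(0)
            infile.readline()
        return raw.read_calls


if __name__ == "__main__":
    print("binary mode raw reads:", count_raw_reads(text_mode=False))
    print("text mode raw reads:  ", count_raw_reads(text_mode=True))
```

If the counts match the strace output above, binary mode should report fewer raw reads than text mode.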

Exploring the implementation

We can trace the operations caused by the TextIOWrapper with a debugger or a simple wrapper around the underlying buffer.

import io
import tempfile

class DebugWrapper:
    def __init__(self, buffer):
        self.buffer = buffer

    @property
    def closed(self):
        return self.buffer.closed

    @property
    def readable(self):
        return self.buffer.readable

    @property
    def writable(self):
        return self.buffer.writable

    @property
    def seekable(self):
        return self.buffer.seekable

    def seek(self, pos):
        print(f"seek({pos})")
        return self.buffer.seek(pos)

    def read(self, count):
        print(f"read({count})")
        return self.buffer.read(count)

    def flush(self):
        print("flush()")
        return self.buffer.flush()


with tempfile.NamedTemporaryFile(mode='w') as tmp:
    tmp.write("foo\nbar\n")
    tmp.file.close()
    with open(tmp.name, 'rb') as raw:
        wrapped = DebugWrapper(raw)
        infile = io.TextIOWrapper(wrapped)
        infile.readline()
        infile.seek(0)
        infile.readline()

This prints:

read(8192)
flush()
seek(0)
read(8192)

The C code

This brings me to how the BufferedReader decides when to fill its buffer and when to reuse it instead of reading again.

The first issue is that the BufferedReader does not even use its internal buffer for these read operations. The requested sizes are large powers of 2, so the reader instead reads directly into the output object. Normally, this is optimal because it avoids a needless copy. At least that's what I think is happening here.

The second issue would prevent reuse even if the buffer were filled. It also negatively affects regular binary IO, so I think this one is an actual bug.

The relevant lines are in the function _io__Buffered_seek_impl in the CPython repository:

#define READAHEAD(self) \
    ((self->readable && VALID_READ_BUFFER(self)) \
        ? (self->read_end - self->pos) : 0)

...

    /* SEEK_SET and SEEK_CUR are special because we could seek inside the
       buffer. Other whence values must be managed without this optimization.
       Some Operating Systems can provide additional values, like
       SEEK_HOLE/SEEK_DATA. */
    if (((whence == 0) || (whence == 1)) && self->readable) {
        Py_off_t current, avail;
        /* Check if seeking leaves us inside the current buffer,
           so as to return quickly if possible. Also, we needn't take the
           lock in this fast path.
           Don't know how to do that when whence == 2, though. */
        /* NOTE: RAW_TELL() can release the GIL but the object is in a stable
           state at this point. */
        current = RAW_TELL(self);
        avail = READAHEAD(self);
        if (avail > 0) {
            Py_off_t offset;
            if (whence == 0)
                offset = target - (current - RAW_OFFSET(self));
            else
                offset = target;
            if (offset >= -self->pos && offset <= avail) {
                self->pos += offset;

                // GH-95782
                if (current - avail + offset < 0)
                    return PyLong_FromOff_t(0);

                return PyLong_FromOff_t(current - avail + offset);
            }
        }
    }

The trouble is with the avail > 0 check. Because the TextIOWrapper reads the full 8192-byte chunk, avail will be zero: the logical file pointer of the BufferedReader sits at the end of the current buffer. The fast path only considers a target that lies earlier in the buffer if at least one more byte is buffered in the forward direction.

In the case of binary IO, after we call readline(), the file position will be 4 bytes into the file and avail will also be 4, triggering the optimized code path.
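
The condition can be restated as a small pure-Python model. This is a simplified sketch of the whence == 0 branch only, following the explanation above rather than the full C implementation; BufferState, buffer_start and seek_hits_fast_path are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class BufferState:
    """Simplified stand-in for the C struct (buffer_start is hypothetical)."""
    buffer_start: int  # absolute file offset of the first buffered byte
    pos: int           # logical position, as an index into the buffer
    read_end: int      # index one past the last valid buffered byte


def seek_hits_fast_path(state, target):
    """Return True if an absolute seek to `target` would stay in the buffer."""
    avail = state.read_end - state.pos  # corresponds to READAHEAD(self)
    if avail <= 0:
        # Nothing is buffered ahead of the logical position, so the fast path
        # is skipped even if `target` lies in buffered data behind it.
        return False
    offset = target - (state.buffer_start + state.pos)
    return -state.pos <= offset <= avail


# Binary mode after readline(): 4 bytes consumed, 4 bytes still ahead.
print(seek_hits_fast_path(BufferState(buffer_start=0, pos=4, read_end=8), 0))  # True
# Logical position at the very end of the buffered data: avail is 0.
print(seek_hits_fast_path(BufferState(buffer_start=0, pos=8, read_end=8), 0))  # False
```

In both cases the target offset 0 lies inside the buffered range; only the amount of data ahead of the logical position differs.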

Testing and discussion

I have verified that this is the issue with two changes to the code above:

  1. I made the file larger than 8192 bytes.
  2. I changed DebugWrapper.read() to read one byte fewer than requested.

Reading short like this is obviously wrong in general, but it triggers reuse of the underlying buffer.

This opens the question: Is the behavior intentional?

Honestly, I don't know. avail == 0 can mean end of buffer but also end of file. If we are at the end of a file that is currently being written to by a different process, do we always want to do a true read + seek to see if new data has arrived?

It also raises the question of why open(…, 'r') creates a TextIOWrapper around a BufferedReader in the first place. If the TextIOWrapper always reads 8192 bytes to do its own internal buffering, it never uses the buffer provided by the BufferedReader. It might as well use an unbuffered io.FileIO under the hood.

In the end I think the TextIOWrapper cannot rely on the BufferedReader to handle the seeking efficiently. The way these two interact effectively prevents the BufferedReader from buffering much, if anything. Therefore its seek optimization cannot work. Even if we fix that performance issue and let the buffer be reused when positioned at its end, that would not help the TextIOWrapper. At this point the TextIOWrapper itself needs to be smarter in handling short seeks. I have not investigated whether the object retains sufficient state to do this itself.
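
Until then, one possible workaround follows from the observation that binary mode is not affected in the same way: read lines in binary mode and decode them manually. The sketch below is only an illustration under stated assumptions (UTF-8 text with '\n' line endings, so each line can be decoded on its own); reread_lines is a hypothetical helper, not part of the io module.

```python
import io


def reread_lines(binary_file, encoding="utf-8"):
    """Read each line, seek back to its start, and read it again.

    `binary_file` must be opened in binary mode, for example with
    open(path, 'rb', buffering=...). Yields (byte_position, decoded_line).
    """
    while True:
        pos = binary_file.tell()
        first = binary_file.readline()
        if not first:
            break
        # In binary mode this seek can usually be served from the buffer.
        binary_file.seek(pos)
        second = binary_file.readline()
        yield pos, second.decode(encoding)


if __name__ == "__main__":
    # An in-memory stream stands in for a real file opened with 'rb'.
    sample = io.BufferedReader(io.BytesIO("foo\nbär\n".encode("utf-8")))
    for position, line in reread_lines(sample):
        print(position, repr(line))
```

Positions returned by tell() in binary mode are plain byte offsets. Because the fast path requires avail > 0, a seek issued when the logical position is exactly at the end of the buffered data may still cause a real read.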

This seems like deliberate behavior in the implementation of TextIOWrapper. With SEEK_SET, as in your code, the implementation always performs a real seek on the underlying buffer and discards the text it has already decoded. It might be nice if there were an optimization for seeking to a position that is already buffered, but that isn't currently implemented.

Tags: python, Why does TextIOWrapper.seek() not use the buffer, Stack Overflow