python - Why does TextIOWrapper.seek() not use the buffer? - Stack Overflow

Updated: 2025-03-111

I just noticed that whenever I use seek() on a TextIOWrapper object, the performance decreases noticeably.

The following code opens a text file (between 10 and 50 MB in size), reads one line, then calls seek() with the position recorded before that line and reads the same line again.

I'd expect this code to only read once from disk. The whole file fits into the buffer.

However, with a 25 MB file, this reads a total of 1.2 GB from disk. If I remove the call to seek(), the file is only read once. Why doesn't this work with seek()?

input("Press Enter to start...")
with open('file.txt', 'r', 50 * 1024 * 1024, 'utf-8', newline='\n') as file:
    while True:
        pos = file.tell()
        l1 = file.readline()
        if not l1:
            break
        
        file.seek(pos)
        l2 = file.readline()
input("Press Enter to exit...")

Asked Feb 11 at 20:19 by T3rm1; edited Feb 11 at 20:39 by Barmar

Maybe in case the file has changed. – Barmar Commented Feb 11 at 20:42
@Barmar reading in binary mode 'rb' does not seek. Therefore that cannot be the justification – Homer512 Commented Feb 11 at 20:54
That detail would be useful to put in the question. – Barmar Commented Feb 11 at 20:56
@jsbueno please don't try to find solutions for other problems. Reading all lines is not an option. If you don't agree with me that the code should in fact only read the file once, feel free to give an answer. – T3rm1 Commented Feb 12 at 14:22
It seems like a bug, maybe you should submit it to the CPython developers. – Barmar Commented Feb 12 at 18:44

2 Answers

I think I have found the issue, but I don't have a satisfying immediate solution.

Minimal reproducible example

First things first: This problem only affects the io.TextIOWrapper created when opening in text mode ('r'), not the io.BufferedReader created in binary mode 'rb'. I have verified this behavior with strace. A simple self-contained test case may look like this:

import tempfile

with tempfile.NamedTemporaryFile(mode='w') as tmp:
    tmp.write("foo\nbar\n")
    tmp.file.close()
    with open(tmp.name, 'r') as infile:
        infile.readline()
        infile.seek(0)
        infile.readline()

Change the mode to rb and observe the difference in strace. With text mode:

openat(AT_FDCWD, "/tmp/tmpwxotyrmt", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=8, ...}) = 0
ioctl(3, TCGETS, 0x7ffff3fb9420)        = -1 ENOTTY
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "foo\nbar\n", 8192)             = 8
lseek(3, 0, SEEK_SET)                   = 0
read(3, "foo\nbar\n", 8192)             = 8
close(3)                                = 0
unlink("/tmp/tmpwxotyrmt")              = 0

With binary mode:

openat(AT_FDCWD, "/tmp/tmp29ujuat3", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=8, ...}) = 0
ioctl(3, TCGETS, 0x7ffdaf2d3110)        = -1 ENOTTY
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "foo\nbar\n", 4096)             = 8
close(3)                                = 0
unlink("/tmp/tmp29ujuat3")              = 0
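Where strace is not available, roughly the same comparison can be made from Python by counting how often the buffered layer asks the raw file for data. The following is a minimal sketch, not part of the original test: CountingFileIO, count_raw_reads and demo are hypothetical names, and the sketch assumes that io.BufferedReader obtains its data through the raw object's readinto() method.

```python
import io
import os
import tempfile


class CountingFileIO(io.FileIO):
    """Raw file that counts how often the layer above asks it for data."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.raw_reads = 0

    def readinto(self, buffer):
        # Assumption: BufferedReader fetches its data via raw.readinto().
        self.raw_reads += 1
        return super().readinto(buffer)


def count_raw_reads(path, text_mode):
    """Read a line, seek back to 0, read it again; return the raw read count."""
    raw = CountingFileIO(path, "r")
    buffered = io.BufferedReader(raw)
    if text_mode:
        stream = io.TextIOWrapper(buffered, encoding="utf-8", newline="\n")
    else:
        stream = buffered
    with stream:
        stream.readline()
        stream.seek(0)
        stream.readline()
    return raw.raw_reads


def demo():
    """Return (raw reads in text mode, raw reads in binary mode)."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sample.txt")
        with open(path, "w", encoding="utf-8", newline="\n") as out:
            out.write("foo\nbar\n")
        return (count_raw_reads(path, text_mode=True),
                count_raw_reads(path, text_mode=False))


if __name__ == "__main__":
    text_reads, binary_reads = demo()
    print(f"text mode: {text_reads} raw read(s), binary mode: {binary_reads}")
```

If the behavior matches the strace output above, text mode should report more raw reads than binary mode for the same sequence of calls.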

At first this discrepancy may be puzzling since the TextIOWrapper has a BufferedReader as its underlying self.buffer.

Exploring the implementation

We can trace the operations caused by the TextIOWrapper with a debugger or a simple wrapper around the underlying buffer.

import io
import tempfile

class DebugWrapper:
    def __init__(self, buffer):
        self.buffer = buffer

    @property
    def closed(self):
        return self.buffer.closed

    @property
    def readable(self):
        return self.buffer.readable

    @property
    def writable(self):
        return self.buffer.writable

    @property
    def seekable(self):
        return self.buffer.seekable

    def seek(self, pos):
        print(f"seek({pos})")
        return self.buffer.seek(pos)

    def read(self, count):
        print(f"read({count})")
        return self.buffer.read(count)

    def flush(self):
        print("flush()")
        return self.buffer.flush()


with tempfile.NamedTemporaryFile(mode='w') as tmp:
    tmp.write("foo\nbar\n")
    tmp.file.close()
    with open(tmp.name, 'rb') as raw:
        wrapped = DebugWrapper(raw)
        infile = io.TextIOWrapper(wrapped)
        infile.readline()
        infile.seek(0)
        infile.readline()

This prints

read(8192)
flush()
seek(0)
read(8192)

The C-code

This brings me to how the BufferedReader decides when to serve data from its buffer instead of reading again.

The first issue is that the BufferedReader will not even use its internal buffer for these read operations. The requested sizes are large powers of two, so the reader instead reads directly into the output bytes object. Normally, this is optimal because it avoids a needless copy. At least that's what I think is happening here.

The second issue would prevent the use even if the buffer was filled. This also negatively affects regular binary IO. So I think this one is an actual bug.

The relevant lines are in the CPython repository, in the function _io__Buffered_seek_impl:

#define READAHEAD(self) \
    ((self->readable && VALID_READ_BUFFER(self)) \
        ? (self->read_end - self->pos) : 0)

...

    /* SEEK_SET and SEEK_CUR are special because we could seek inside the
       buffer. Other whence values must be managed without this optimization.
       Some Operating Systems can provide additional values, like
       SEEK_HOLE/SEEK_DATA. */
    if (((whence == 0) || (whence == 1)) && self->readable) {
        Py_off_t current, avail;
        /* Check if seeking leaves us inside the current buffer,
           so as to return quickly if possible. Also, we needn't take the
           lock in this fast path.
           Don't know how to do that when whence == 2, though. */
        /* NOTE: RAW_TELL() can release the GIL but the object is in a stable
           state at this point. */
        current = RAW_TELL(self);
        avail = READAHEAD(self);
        if (avail > 0) {
            Py_off_t offset;
            if (whence == 0)
                offset = target - (current - RAW_OFFSET(self));
            else
                offset = target;
            if (offset >= -self->pos && offset <= avail) {
                self->pos += offset;

                // GH-95782
                if (current - avail + offset < 0)
                    return PyLong_FromOff_t(0);

                return PyLong_FromOff_t(current - avail + offset);
            }
        }
    }

The trouble is with the avail > 0 check. Because the TextIOWrapper reads the full 8192-byte chunk, avail will be zero: the logical file pointer of the BufferedReader sits at the end of the current buffer. The program logic does not even consider that the target position may lie in the buffer before the current location unless at least one more byte is buffered in the forward direction.

In the case of binary IO, after we call readline(), the file position will be 4 bytes into the file and avail will also be 4, triggering the optimized code path.
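The condition can be restated as a small Python model. This is a simplified sketch, not the CPython code: seek_served_from_buffer and its parameters are hypothetical, and it assumes the buffer holds the file bytes from buf_start up to (but not including) buf_end, with the reader's logical position somewhere in that range.

```python
def seek_served_from_buffer(target, buf_start, buf_end, logical_pos):
    """Simplified model of the fast-path test in the C code above.

    The buffer is assumed to hold file bytes [buf_start, buf_end) and
    logical_pos is the reader's current position within that range.
    """
    avail = buf_end - logical_pos  # bytes buffered ahead of the position
    if avail <= 0:                 # mirrors the `avail > 0` guard
        return False
    # Mirrors `offset >= -self->pos && offset <= avail` in absolute terms.
    return buf_start <= target <= buf_end


# Binary case from the example: 8 bytes buffered, position 4, seek to 0.
print(seek_served_from_buffer(0, 0, 8, 4))        # True

# Position at the end of a full buffer: the target lies inside the
# buffered range, but the guard rejects it because avail == 0.
print(seek_served_from_buffer(0, 0, 8192, 8192))  # False
```

The second call is the interesting one: the range check alone would succeed, so it is only the avail guard that forces the slow path.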

Testing and discussion

I have verified that this is the issue with two changes to the code above:

I changed the file to be more than 8192 bytes large
I changed the DebugWrapper.read() to read one less byte than requested

This is obviously wrong but it triggers the reuse of the underlying buffer.

This opens the question: Is the behavior intentional?

Honestly, I don't know. avail == 0 can mean end of buffer but also end of file. If we are at the end of a file that is currently being written to by a different process, do we always want to do a true read + seek to see if new data has arrived?

It also raises the question why open(…, 'r') creates a TextIOWrapper around a BufferedReader in the first place. If the TextIOWrapper always reads 8192 bytes to do its own internal buffering, it never uses the buffer provided by the BufferedReader. It might as well use unbuffered io.FileIO under the hood.

In the end I think the TextIOWrapper cannot rely on the BufferedReader to handle the seeking efficiently. The way these two interact effectively prevents the BufferedReader from buffering much, if anything. Therefore its seek optimization cannot work. Even if we fix that performance issue and let the buffer be reused when positioned at its end, that would not help the TextIOWrapper. At this point the TextIOWrapper itself needs to be smarter in handling short seeks. I have not investigated whether the object retains sufficient state to do this itself.
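Since binary mode was observed above to keep short seeks inside the buffer, one possible workaround is to do the tell/readline/seek loop on the binary stream and decode each line afterwards. This is only a sketch under assumptions: reread_each_line is a hypothetical helper, lines are assumed to end in '\n' as in the question, and the encoding is assumed to be one like UTF-8 where the newline byte never occurs inside a multi-byte character. After the last line, avail is zero again, so that one seek still falls back to a real seek and read.

```python
import os
import tempfile


def reread_each_line(path, encoding="utf-8", buffer_size=50 * 1024 * 1024):
    """Yield (first_read, second_read) for every line, seeking back in between."""
    with open(path, "rb", buffering=buffer_size) as stream:
        while True:
            pos = stream.tell()
            first = stream.readline()
            if not first:
                break
            # In binary mode this seek can be satisfied from the buffer
            # as long as at least one byte is still buffered ahead.
            stream.seek(pos)
            second = stream.readline()
            yield first.decode(encoding), second.decode(encoding)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sample.txt")
        with open(path, "w", encoding="utf-8", newline="\n") as out:
            out.write("foo\nbar\n")
        for first, second in reread_each_line(path, buffer_size=8192):
            print(repr(first), repr(second))
```

Decoding per line gives up the incremental decoder state that TextIOWrapper maintains, which is why the encoding assumption matters here.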

This seems like deliberate behavior in the implementation of TextIOWrapper. The code implementing this always seeks to the beginning of the file when using SEEK_SET as you are. It might be nice if there was an optimization where it knows that you're seeking to somewhere already buffered, but that isn't currently implemented.
