I just noticed that whenever I use seek() on a TextIOWrapper object, the performance decreases noticeably.
The following code opens a text file (it should be between 10 and 50 MB in size), reads one line, and then calls seek() with the position from before that line. Then it reads another line.
I'd expect this code to read from disk only once, since the whole file fits into the buffer.
However, with a 25 MB file, this reads a total of 1.2 GB from disk. If I remove the call to seek(), the file is only read once. Why doesn't this work with seek()?
input("Press Enter to start...")

with open('file.txt', 'r', 50 * 1024 * 1024, 'utf-8', newline='\n') as file:
    while True:
        pos = file.tell()        # remember where this line starts
        l1 = file.readline()
        if not l1:
            break
        file.seek(pos)           # jump back to the start of the line
        l2 = file.readline()     # read the same line again

input("Press Enter to exit...")
asked Feb 11 at 20:19 by T3rm1, edited Feb 11 at 20:39 by Barmar
2 Answers
I think I have found the issue, but I don't have a satisfying immediate solution.
Minimal reproducible example
First things first: This problem only affects the io.TextIOWrapper created when opening in text mode ('r'), not the io.BufferedReader created in binary mode ('rb'). I have verified this behavior with strace. A simple self-contained test case may look like this:
import tempfile

with tempfile.NamedTemporaryFile(mode='w') as tmp:
    tmp.write("foo\nbar\n")
    # Close the underlying file so the data is flushed to disk
    tmp.file.close()
    with open(tmp.name, 'r') as infile:
        infile.readline()
        infile.seek(0)
        infile.readline()
Change the mode to rb and observe the difference in strace. With text mode:
openat(AT_FDCWD, "/tmp/tmpwxotyrmt", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=8, ...}) = 0
ioctl(3, TCGETS, 0x7ffff3fb9420) = -1 ENOTTY
lseek(3, 0, SEEK_CUR) = 0
read(3, "foo\nbar\n", 8192) = 8
lseek(3, 0, SEEK_SET) = 0
read(3, "foo\nbar\n", 8192) = 8
close(3) = 0
unlink("/tmp/tmpwxotyrmt") = 0
With binary mode:
openat(AT_FDCWD, "/tmp/tmp29ujuat3", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=8, ...}) = 0
ioctl(3, TCGETS, 0x7ffdaf2d3110) = -1 ENOTTY
lseek(3, 0, SEEK_CUR) = 0
read(3, "foo\nbar\n", 4096) = 8
close(3) = 0
unlink("/tmp/tmp29ujuat3") = 0
At first this discrepancy may be puzzling since the TextIOWrapper has a BufferedReader as its underlying self.buffer.
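Where strace is not available, the same comparison can be approximated in pure Python. The following is only a sketch: CountingFileIO and raw_bytes_read are hypothetical helper names, not part of the io module, and the counts assume a CPython build that behaves like the strace output above.

```python
import io
import os
import tempfile


class CountingFileIO(io.FileIO):
    """Hypothetical helper: a raw file that tallies the bytes it actually reads."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bytes_read = 0

    def readinto(self, b):
        # BufferedReader pulls data from the raw file through readinto()
        n = super().readinto(b)
        self.bytes_read += n or 0
        return n


def raw_bytes_read(path, text_mode):
    """Run readline / seek(0) / readline and return the raw byte count."""
    raw = CountingFileIO(path, "r")
    stream = io.BufferedReader(raw)
    if text_mode:
        stream = io.TextIOWrapper(stream, encoding="utf-8")
    with stream:
        stream.readline()
        stream.seek(0)
        stream.readline()
    return raw.bytes_read


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "wb") as out:
        out.write(b"foo\nbar\n")
    text_bytes = raw_bytes_read(path, text_mode=True)
    binary_bytes = raw_bytes_read(path, text_mode=False)

print("raw bytes read in text mode:  ", text_bytes)
print("raw bytes read in binary mode:", binary_bytes)
```

If the text-mode count comes out larger than the file while the binary-mode count equals the file size, that matches the two read() calls versus one seen in strace.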
Exploring the implementation
We can trace the operations caused by the TextIOWrapper with a debugger or a simple wrapper around the underlying buffer.
import io
import tempfile


class DebugWrapper:
    """Logs the calls that TextIOWrapper makes on its underlying buffer."""

    def __init__(self, buffer):
        self.buffer = buffer

    @property
    def closed(self):
        return self.buffer.closed

    # These properties return the underlying bound methods,
    # so wrapped.readable() etc. still work when called.
    @property
    def readable(self):
        return self.buffer.readable

    @property
    def writable(self):
        return self.buffer.writable

    @property
    def seekable(self):
        return self.buffer.seekable

    def seek(self, pos):
        print(f"seek({pos})")
        return self.buffer.seek(pos)

    def read(self, count):
        print(f"read({count})")
        return self.buffer.read(count)

    def flush(self):
        print("flush()")
        return self.buffer.flush()


with tempfile.NamedTemporaryFile(mode='w') as tmp:
    tmp.write("foo\nbar\n")
    tmp.file.close()
    with open(tmp.name, 'rb') as raw:
        wrapped = DebugWrapper(raw)
        infile = io.TextIOWrapper(wrapped)
        infile.readline()
        infile.seek(0)
        infile.readline()
This prints
read(8192)
flush()
seek(0)
read(8192)
The C-code
This brings me to the way the BufferedReader decides when to use its buffer and when to reuse it instead of reading again.
The first issue is that the BufferedReader will not even use its internal buffer for these read operations. The requested sizes are large powers of 2, so the reader will instead read directly into the output string. Normally, this is optimal because it avoids a needless copy. At least that's what I think is happening here.
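One way to probe this first point is to count how often the BufferedReader goes back to the raw file. This is a sketch under assumptions: CountingFileIO and raw_calls are hypothetical helpers, the 64 KiB buffer size is chosen only so that two 8192-byte requests would fit into one buffer fill, and the use of read1() reflects my understanding that TextIOWrapper prefers it when the underlying buffer provides it.

```python
import io
import os
import tempfile


class CountingFileIO(io.FileIO):
    """Hypothetical helper: a raw file that counts readinto() calls."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.read_calls = 0

    def readinto(self, b):
        self.read_calls += 1
        return super().readinto(b)


def raw_calls(path, method_name):
    """Issue two 8192-byte requests and return the number of raw reads."""
    raw = CountingFileIO(path, "r")
    with io.BufferedReader(raw, buffer_size=64 * 1024) as buffered:
        read = getattr(buffered, method_name)
        read(8192)
        read(8192)
    return raw.read_calls


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.bin")
    with open(path, "wb") as out:
        out.write(b"x" * (128 * 1024))
    read1_calls = raw_calls(path, "read1")
    read_calls = raw_calls(path, "read")

print("raw reads for two read1(8192) calls:", read1_calls)
print("raw reads for two read(8192) calls: ", read_calls)
```

If read1() needs one raw read per request while read() needs only one in total, the data requested through read1() was never kept in the internal buffer.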
The second issue would prevent the use even if the buffer were filled. This also negatively affects regular binary IO, so I think this one is an actual bug.
The relevant lines are in the CPython repository, in the function _io__Buffered_seek_impl:
#define READAHEAD(self) \
    ((self->readable && VALID_READ_BUFFER(self)) \
        ? (self->read_end - self->pos) : 0)
...
    /* SEEK_SET and SEEK_CUR are special because we could seek inside the
       buffer. Other whence values must be managed without this optimization.
       Some Operating Systems can provide additional values, like
       SEEK_HOLE/SEEK_DATA. */
    if (((whence == 0) || (whence == 1)) && self->readable) {
        Py_off_t current, avail;
        /* Check if seeking leaves us inside the current buffer,
           so as to return quickly if possible. Also, we needn't take the
           lock in this fast path.
           Don't know how to do that when whence == 2, though. */
        /* NOTE: RAW_TELL() can release the GIL but the object is in a stable
           state at this point. */
        current = RAW_TELL(self);
        avail = READAHEAD(self);
        if (avail > 0) {
            Py_off_t offset;
            if (whence == 0)
                offset = target - (current - RAW_OFFSET(self));
            else
                offset = target;
            if (offset >= -self->pos && offset <= avail) {
                self->pos += offset;
                // GH-95782
                if (current - avail + offset < 0)
                    return PyLong_FromOff_t(0);
                return PyLong_FromOff_t(current - avail + offset);
            }
        }
    }
The trouble is with the avail > 0 check. Because the TextIOWrapper reads the full 8192 buffer size, avail will be zero. The logical file pointer of the BufferedReader is placed at the end of the current buffer. The program logic does not even consider that the position may be in the buffer before the current location unless there is at least one more byte buffered in the forward direction.
In the case of binary IO, after we call readline(), the file position will be 4 bytes into the file and avail will also be 4, triggering the optimized code path.
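To make the two cases concrete, here is a simplified Python model of the SEEK_SET branch from the C code quoted above. It is only an illustration: fast_path_seek is a hypothetical name, RAW_OFFSET is passed in as a plain number because its definition is not shown here, and the input values are my reading of the state after the first readline() on the 8-byte test file.

```python
from typing import Optional


def fast_path_seek(target: int, current: int, raw_offset: int,
                   pos: int, avail: int) -> Optional[int]:
    """Model of the whence == 0 fast path.

    Returns the new logical position if the seek stays inside the buffer,
    or None if the C code would fall through to a real seek on the raw file.
    """
    if avail <= 0:
        # Nothing buffered ahead of the current position: no fast path,
        # even if the target lies in already-buffered data behind us.
        return None
    offset = target - (current - raw_offset)
    if -pos <= offset <= avail:
        # Mirrors the GH-95782 clamp to a non-negative position
        return max(current - avail + offset, 0)
    return None


# Binary mode: 4 bytes consumed, 4 bytes still buffered ahead
binary_case = fast_path_seek(target=0, current=8, raw_offset=4, pos=4, avail=4)

# Text mode: everything that was read has been handed out, avail is 0
text_case = fast_path_seek(target=0, current=8, raw_offset=0, pos=8, avail=0)

print("binary mode:", binary_case)
print("text mode:  ", text_case)
```

In this model the binary case resolves to position 0 without touching the raw file, while the text case returns None, which corresponds to the extra lseek() and read() in the strace output.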
Testing and discussion
I have verified that this is the issue with two changes to the code above:
- I changed the file to be more than 8192 bytes large
- I changed DebugWrapper.read() to read one byte less than requested
This is obviously wrong but it triggers the reuse of the underlying buffer.
This opens the question: Is the behavior intentional?
Honestly, I don't know. avail == 0 can mean end of buffer but also end of file. If we are at the end of a file that is currently being written to by a different process, do we always want to do a true read + seek to see if new data has arrived?
It also raises the question of why open(…, 'r') creates a TextIOWrapper around a BufferedReader in the first place. If the TextIOWrapper always reads 8192 bytes to do its own internal buffering, it never uses the buffer provided by the BufferedReader. It might as well use unbuffered io.FileIO under the hood.
In the end I think the TextIOWrapper cannot rely on the BufferedReader to handle the seeking efficiently. The way these two interact effectively prevents the BufferedReader from buffering much, if anything. Therefore its seek optimization cannot work. Even if we fix that performance issue and let the buffer be reused when positioned at its end, that would not help the TextIOWrapper. At this point the TextIOWrapper itself needs to be smarter in handling short seeks. I have not investigated whether the object retains sufficient state to do this itself.
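Since binary mode does take the optimized path, one possible way to sidestep the problem for the loop in the question is to run it on the binary stream and decode each line by hand. This is a sketch, not a general fix: CountingFileIO and reread_loop are hypothetical helpers, the sample file is made up, and it assumes UTF-8 text with '\n' line endings as in the question.

```python
import io
import os
import tempfile


class CountingFileIO(io.FileIO):
    """Hypothetical helper: a raw file that tallies the bytes it actually reads."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bytes_read = 0

    def readinto(self, b):
        n = super().readinto(b)
        self.bytes_read += n or 0
        return n


def reread_loop(path, text_mode):
    """Run the tell / readline / seek / readline loop; return (lines, raw bytes)."""
    raw = CountingFileIO(path, "r")
    stream = io.BufferedReader(raw)
    if text_mode:
        stream = io.TextIOWrapper(stream, encoding="utf-8", newline="\n")
    lines = []
    with stream:
        while True:
            pos = stream.tell()
            if not stream.readline():
                break
            stream.seek(pos)
            line = stream.readline()
            # In binary mode the line is bytes and must be decoded manually
            lines.append(line if text_mode else line.decode("utf-8"))
    return lines, raw.bytes_read


expected = [f"line {i}: grüße\n" for i in range(2000)]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "wb") as out:
        out.write("".join(expected).encode("utf-8"))
    file_size = os.path.getsize(path)
    text_lines, text_bytes = reread_loop(path, text_mode=True)
    binary_lines, binary_bytes = reread_loop(path, text_mode=False)

print("file size:            ", file_size)
print("raw bytes, text mode: ", text_bytes)
print("raw bytes, binary mode:", binary_bytes)
```

Both variants should yield the same lines. The binary variant can still re-read some data whenever a line ends exactly at, or crosses, the end of the current buffer, so its raw byte count may exceed the file size somewhat.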
This seems like deliberate behavior in the implementation of TextIOWrapper. The code implementing this always seeks to the beginning of the file when using SEEK_SET, as you are. It might be nice if there was an optimization where it knows that you're seeking to somewhere already buffered, but that isn't currently implemented.
'rb' does not seek. Therefore that cannot be the justification. – Homer512, commented Feb 11 at 20:54