
I am developing a secure P2P file transfer tool in C, intended to send arbitrary-size files between two machines running the program.

I have been trying to figure out what the currently best known techniques are for a versatile yet near-optimal approach for reading/writing large files.

The data flow is as follows:

read() file chunk into buffer
            |
            v
      encrypt chunk
            |
            v
      compress chunk
            |
            v
   write() to TCP socket

A couple of details I'd like to point out:

  • The encryption (and compression) are done within the application, since the protocol is built into the application; therefore something like kTLS does not apply here.
  • I've been advised to profile different I/O techniques before settling on a design. However, the project is at too early a stage for me to do any such profiling. In fact, I am only partially done with the client state machine, and a server that can understand these protocol-specific messages does not exist yet.
  • io_uring's difficult interface would add more complexity than I can manage in the project right now, but it may be a viable optimization/refactor later on.

For reads, I am currently mmap()ing in chunks of at most 48 * PAGE_SIZE for files >= 4 GB, and at most 24 * PAGE_SIZE for everything smaller (these are arbitrary numbers I picked without any measurement).

For writes, I just write() to the file from within a loop that receives data from the TCP socket.

I found a post from a 2003 mailing-list thread between a few folks and Linus Torvalds, where he says:

Quite a lot of operations could be done directly on the page cache. I'm not a huge fan of mmap() myself - the biggest advantage of mmap is when you don't know your access patterns, and you have reasonably good locality. In many other cases mmap is just a total loss, because the page table walking is often more expensive than even a memcpy().

...

memcpy() often gets a bad name. Yeah, memory is slow, but especially if you copy something you just worked on, you're actually often better off letting the CPU cache do its job, rather than walking page tables and trying to be clever.

Just as an example: copying often means that you don't need nearly as much locking and synchronization - which in turn avoids one whole big mess (yes, the memcpy() will look very hot in profiles, but then doing extra work to avoid the memcpy() will cause spread-out overhead that is a lot worse and harder to think about).

This is why a simple read()/write() loop often beats mmap approaches. And often it's actually better to not even have big buffers (ie the old "avoid system calls by aggregation" approach) because that just blows your cache away.

Right now, the fastest way to copy a file is apparently by doing lots of ~8kB read/write pairs (that data may be slightly stale, but it was true at some point). Never mind the system call overhead - just having the extra buffer stay in the L1 cache and avoiding page faults from mmap is a bigger win.

Besides being 21 years old, this thread also predates the Spectre/Meltdown attacks; syscalls have become considerably more expensive since then, so minimizing syscalls is imperative.

To summarize, with all the above context: how should I approach designing the I/O interface for my application when I don't have the means to profile at this stage? What would be my best bet? And where can I learn the nifty I/O tricks used for such applications?


asked 10 hours ago by vibhav950
  • Do the stupidest thing that could possibly work and you still have a good chance of saturating your user's network link. – Botje, 10 hours ago
  • Unrelated: don't do compression after encryption, it's pointless. – Botje, 10 hours ago
  • @Botje Perhaps 2 threads then, the write() and the rest? – chux, 10 hours ago
  • "so minimizing syscalls is imperative" -- this is exactly the kind of assumption that you should avoid. Rely instead on that profiling that was recommended to you, as directed by overall performance testing, to determine just how imperative it is to minimize syscalls. Fewer will probably tend to be better, but it may well not be the case that absolute fewest is absolute best. – John Bollinger, 10 hours ago
  • @Botje is correct - there is absolutely no point in trying to compress encrypted data. It is as close to incompressible as you are ever likely to encounter IRL. Attempts to compress it may even make the resulting block larger and will certainly consume considerable computational time. – Martin Brown, 9 hours ago

1 Answer


Performance is always best measured, not predicted. So,

how should I approach designing the I/O interface for my application, when I don't have the means to profile at this stage?

First make it work. Only then focus on making it faster, and that with the support of profiling and performance testing.

You should start with something -- anything -- that does the job and is easy to write, to validate, and to reason about. For instance, straightforward reading in 8 KiB chunks with read(). Make this modular, so that it is easy to swap it out for anything else you want to test.

It may be that when you get around to performance testing, you discover that your modular I/O interface itself is a significant bottleneck. At this point you have data to guide you as to what to do instead.

At worst, you throw away a whole first iteration, and write a new one in light of the lessons learned. Some people even develop applications with the a priori expectation that they will need to do this.

Tags: c · Most efficient way to handle I/O for large files · Stack Overflow