Consider a CPU with 64 bytes (512 bits) cache block size, 16 bytes (128 bits) data bus, and a single level of cache (let's say only L4, and no L1-L3). How does the CPU fill up the cache? Obviously the entire cache block cannot fit into the bus, so I'm guessing that the CPU would have to request reads 4 separate times (as 512/128 = 4).
More specifically, given this setup and the cache miss penalty of 100 nanoseconds, does this mean that the CPU would make a read request every 25 nanoseconds? Or does it somehow batch request the addresses and wait for them to arrive in sequence?
asked Mar 10 at 19:25 by martian17

- That's what the line-fill buffers are for. See, for example, How do the store buffer and Line Fill Buffer interact with each other? for an explanation specific to write access. Read access is similar. – Homer512, Mar 10 at 20:11
- Also, see What every programmer should know about memory, part 2 (Caches), specifically section 3.5.2 Critical Word Load. – Homer512, Mar 10 at 20:15
- What are you implying by calling it an L4 when there are no other levels? That it's shared by all cores, and made of eDRAM or something (like on some Intel laptop chips with Iris GPUs), vs. a usual shared L3 made from SRAM? Also, as others have said, a single request will initiate a DDR DRAM burst transfer, and the memory controller will send the whole line to the cache. – Peter Cordes, Mar 14 at 4:06
1 Answer
Nowadays, I doubt you could find a single processor being manufactured with an L4 cache. If you have only a single level of cache, that is your L1 cache; I don't see the point in calling it L4, unless there is another underlying question from which this odd one originates (the XY problem).
Since there is a Line Fill Buffer (LFB) associated with the L1 cache, an entry would be allocated in the LFB to track this miss, and to retrieve and assemble the cache line as the data arrives.
The memory controller forwards the request to the appropriate DRAM chip, which activates the row containing the data and then selects the proper column. This part of the procedure accounts for most of the latency you mentioned. Since the bus can transfer 16 bytes at a time in your scenario, the burst transfer of the line happens in 4 separate bus cycles (all part of one request/response transaction) from the DRAM to the processor.
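As a sanity check on the arithmetic, here is a small sketch. Note that the 80 ns / 5 ns split below is a made-up illustration of how the 100 ns might decompose, not a figure from the question or from any real part:

```python
# Hypothetical breakdown of a cache-line fill (illustrative numbers only).
CACHE_LINE_BYTES = 64
BUS_WIDTH_BYTES = 16
MISS_PENALTY_NS = 100.0

# Number of bus beats needed to move one cache line over the data bus.
beats = CACHE_LINE_BYTES // BUS_WIDTH_BYTES
print(beats)  # 4

# Assume (hypothetically) the DRAM row/column access dominates at 80 ns,
# and each bus beat takes 5 ns: 80 + 4*5 = 100 ns total miss penalty.
access_latency_ns = 80.0
beat_time_ns = 5.0
total_ns = access_latency_ns + beats * beat_time_ns
print(total_ns)  # 100.0
```

The point of the split: the per-beat cost is small compared with the fixed access latency, which is why it is not "a read request every 25 ns".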
This is not 4 different requests from the processor; rather, it is one request for a cache line, followed by 4 separate chunks of data that are assembled by the LFB and inserted into the L1 (or assembled by the memory controller and sent over wider buses inside the CPU).
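The "assemble chunks into a line" role of the LFB can be sketched as a toy model (the class and method names are my own invention, not real hardware interfaces):

```python
class LineFillBuffer:
    """Toy model of one LFB entry: collects 16-byte chunks until a
    full 64-byte cache line has been assembled."""

    def __init__(self, line_bytes=64, chunk_bytes=16):
        self.line = bytearray(line_bytes)
        self.chunk_bytes = chunk_bytes
        self.num_chunks = line_bytes // chunk_bytes
        self.received = set()

    def deliver(self, chunk_index, data):
        """Record one chunk arriving from the memory bus."""
        assert len(data) == self.chunk_bytes
        off = chunk_index * self.chunk_bytes
        self.line[off:off + self.chunk_bytes] = data
        self.received.add(chunk_index)

    def complete(self):
        """The line can be inserted into L1 once all chunks arrived."""
        return len(self.received) == self.num_chunks


# One miss -> one LFB entry -> four chunk deliveries -> line complete.
lfb = LineFillBuffer()
for i in range(4):
    lfb.deliver(i, bytes([i]) * 16)
print(lfb.complete())  # True
```

Real LFBs also track things like the miss address and critical-word-first delivery, which this sketch deliberately omits.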
These transfers occur in "burst mode": after receiving the initial address, the DRAM automatically sends sequential chunks, so the memory controller issues just a single command to retrieve the entire cache line. This design amortizes the high memory access latency (the bulk of the 100 ns in your scenario) over more than a single bus width of data, which may later prove useful thanks to spatial locality.
It's not a coincidence that DDR SDRAM's burst size is 64 bytes (or 32 for a short burst), same as the cache line width of typical CPUs. (DDR SDRAM's data width is only 64 bits, so your hypothetical system has a wider data path.)
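To make that last point concrete, the standard DDR4 numbers line up like this (a burst of 8 transfers on a 64-bit bus is the standard BL8 configuration):

```python
# DDR4 example: 64-bit (8-byte) data bus, burst length 8 (BL8).
ddr_bus_bytes = 64 // 8   # bytes moved per transfer
burst_length = 8          # transfers per burst
bytes_per_burst = ddr_bus_bytes * burst_length
print(bytes_per_burst)    # 64 -> exactly one typical cache line
```

The question's hypothetical 128-bit bus would instead need only 4 transfers per 64-byte line, as computed earlier.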