admin管理员组文章数量:1122832
in Mike Actons talks here and here (links are with timestamps) as well as in this blog post they prepare/precondition/prefetch/query needed, non-contiguous data first to then iterate them / do calculations in a cache friendly way.
Here is the code from the blog post:
// objects_store.h
struct ObjectsStore
{
vector<Vector3D> m_Positions;
vector<Quaternion> m_Orientations;
};
// Runtime solution:
struct TranslationData
{
Vector3D m_ObjPosition;
const Quaternion m_ObjOrientation;
const Vector3D m_ObjTranslation;
};
void PrepareTranslations(const ObjectsStore& inStore,
const vector<Vector3D>& inTranslations,
vector<TranslationData>& outObjectsToTranslate)
{
assert(inStore.m_Positions.size() == inStore.m_Orientations.size());
assert(inStore.m_Positions.size() == inTranslations.size());
for (size_t i = 0; i < inTranslations.size(); ++i)
{
outObjectsToTranslate.push_back
({
inStore.m_Positions[i];
inStore.m_Orientations[i];
inTranslations[i];
});
}
}
void TranslateObject(vector<TranslationData>& ioObjectsToTranslate)
{
for (TranslationData& data: ioObjectsToTranslate)
data.m_ObjPosition += data.m_ObjOrientation * data.m_ObjTranslation;
}
void ApplyTranslation(const vector<TranslationData>& inTranslationData
ObjectsStore& outStore)
{
assert(inStore.m_Positions.size() == inTranslationData.size());
for (size_t i = 0; i < inTranslationData.size(); ++i)
outStore.m_Positions[i] = inTranslationData[i].m_ObjPosition;
}
void UpdateGame(ObjectsStore& ioStore, vector<Vector3D>& inTranslations)
{
vector<TranslationData> translation_data;
PrepareTranslations(ioStore, inTranslations, translation_data);
TranslateObject(translation_data);
ApplyTranslation(translation_data, ioStore);
}
My question is: in order to prepare the data in such a way, one has to access the data anyway (and also copy it), so I wonder even if the data is more cache friendly afterwards, wouldn't it be more efficient to modify the data directly and spare the preparing step?
As far as I understand it, obviously calculations are then very efficient on the prepared data, but with the additional cost of preparing them in the first place.
Can anyone tell me what I am missing here? Does one take the extra cost to gain other benefits (like clearer code etc.) or is it actually more efficient than direct modification?
Thanks in advance
in Mike Actons talks here and here (links are with timestamps) as well as in this blog post they prepare/precondition/prefetch/query needed, non-contiguous data first to then iterate them / do calculations in a cache friendly way.
Here is the code from the blog post:
// objects_store.h
struct ObjectsStore
{
vector<Vector3D> m_Positions;
vector<Quaternion> m_Orientations;
};
// Runtime solution:
struct TranslationData
{
Vector3D m_ObjPosition;
const Quaternion m_ObjOrientation;
const Vector3D m_ObjTranslation;
};
void PrepareTranslations(const ObjectsStore& inStore,
const vector<Vector3D>& inTranslations,
vector<TranslationData>& outObjectsToTranslate)
{
assert(inStore.m_Positions.size() == inStore.m_Orientations.size());
assert(inStore.m_Positions.size() == inTranslations.size());
for (size_t i = 0; i < inTranslations.size(); ++i)
{
outObjectsToTranslate.push_back
({
inStore.m_Positions[i];
inStore.m_Orientations[i];
inTranslations[i];
});
}
}
void TranslateObject(vector<TranslationData>& ioObjectsToTranslate)
{
for (TranslationData& data: ioObjectsToTranslate)
data.m_ObjPosition += data.m_ObjOrientation * data.m_ObjTranslation;
}
void ApplyTranslation(const vector<TranslationData>& inTranslationData
ObjectsStore& outStore)
{
assert(inStore.m_Positions.size() == inTranslationData.size());
for (size_t i = 0; i < inTranslationData.size(); ++i)
outStore.m_Positions[i] = inTranslationData[i].m_ObjPosition;
}
void UpdateGame(ObjectsStore& ioStore, vector<Vector3D>& inTranslations)
{
vector<TranslationData> translation_data;
PrepareTranslations(ioStore, inTranslations, translation_data);
TranslateObject(translation_data);
ApplyTranslation(translation_data, ioStore);
}
My question is: in order to prepare the data in such a way, one has to access the data anyway (and also copy it), so I wonder even if the data is more cache friendly afterwards, wouldn't it be more efficient to modify the data directly and spare the preparing step?
As far as I understand it, obviously calculations are then very efficient on the prepared data, but with the additional cost of preparing them in the first place.
Can anyone tell me what I am missing here? Does one take the extra cost to gain other benefits (like clearer code etc.) or is it actually more efficient than direct modification?
Thanks in advance
Share Improve this question edited Nov 22, 2024 at 18:28 Igor G 2,3418 silver badges22 bronze badges asked Nov 22, 2024 at 17:11 SinixNDSinixND 835 bronze badges 2- yes, this is too conceptual it is actually wrong, if you have to prepare the data on every frame you might as well update them in-place, the gains are usually that you only prepare them once on startup then use that array instead for the rest of the scene's lifetime, also you might want to look into Andrew Kelley Practical Data Oriented Design (DoD) where the data is constructed in a cache-friendly way in the first place, it is never moved from one place to another. – Ahmed AEK Commented Nov 22, 2024 at 18:22
- this is what I thought as well, hence the question – SinixND Commented Nov 22, 2024 at 23:20
1 Answer
Reset to default 2Here it does not worth preparing data. In fact, it will certainly be slower.
First of all, PrepareTranslations
calls push_back
but no reserve
is called before so the vector will be resized O(log n)
times. Not great for memory accesses especially for large vectors.
The following points needs to be considered:
PrepareTranslations
should be memory bound and not SIMD friendly (push_back
certainly prevent any SIMD optimization).TranslateObject
should be rather compute bound (though it might not be benefit well from AVX2/AVX-512 which operate on respectively 8 items and 16 items meanwhile the Vector3D should have 3 attributes and the Quaternion should have 4 attributes).ApplyTranslation
should be memory bound.
The thing is if you merge the 3 operations in one loop, the CPU can hide the memory latency with computation, read/write data while doing computation, and move less data in memory overall. Here preparing data give almost no benefit and introduce additional costs so the result is certainly a slower execution.
Preparing data would be a good idea if you can then make the main computations more SIMD friendly for example. It can also be a good idea to reduce latency stalls of complex loops with loop-carried dependency preventing the CPU to fetch a lot of data from memory concurrently. Loop splitting can also help to avoid cache trashing in some cases (e.g. when you access many arrays aligned on a large power of two in memory in the same loop).
Here is a practical example: you have a list of N 2D positions (eg. player) you want to update based on a function requiring 8 arrays of N items and the computation requires doing accesses to a large hash-map. In this case, it is better to:
- use 2 arrays (one per coordinate) for the list of positions (see AoS vs SoA);
- fetch the hash-map values ahead of time and put the result in a temporary array;
- do the computation in a SIMD-friendly way (possibly using 2 loops regarding the target machine, and/or chunk by chunk).
Indeed, the hash-map will prevent any SIMD optimization by mainstream compilers. The same thing would be true for a code having conditional branches that cannot be vectorized.
In your case, using one array per coordinate will certainly speed things a bit assuming you do not need to convert the SoA to AoS every time (consider keeping them as SoA as much as possible). Indeed, this make the computing operation more SIMD friendly. Indeed, modern mainstream x86-64 CPUs can operate on 8 float
items at a time using a single AVX/AVX2 instruction and even 16 float
with AVX-512 (so a Vector3D
data structure does not fit well). If the arrays contain a lot of items, then AoSoA is the best layout (since AoS is not SIMD friendly and SoA tend to cause cache trashing on large arrays). However, this is a pain to write (and maintain) codes using AoSoA data structures, not to mention changing the layout is also complicated and often needed because of external libraries.
Last but not least, when you have to fetch many data structure per item of an array, the accesses will look random and mainstream CPUs cannot prefetch data easily. In this case, they will just start the memory loads as early as possible, but this is often not enough since the memory latency can be really huge (especially for in-RAM data) and there is generally not enough computation to hide it. Manual prefetching can help in this case but it is not a silver-bullet and it is brittle (dependent of the target architecture). In this case, move data into arrays (i.e. switching from AoS to SoA) can help a bit because hardware prefetchers can more efficiently prefetch data with a small constant stride.
Note on GPUs
If some of your computation can be ported to GPUs, then be aware that SoA is generally far more efficient on them than AoS (because of coalescence and also because GPUs are SIMT device). The benefit of AoSoA on GPUs is often small so it does not worth the effort on such platform.
本文标签: cData oriented designpreparing data for cache friendly iteration vs direct accessStack Overflow
版权声明:本文标题:c++ - Data oriented design - preparing data for cache friendly iteration vs direct access - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1736302067a1931402.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论