Background
I've been learning some Zig in my free time, which sent me down the Data Oriented Design rabbit hole. While I had heard the term before, I didn't have a complete understanding of what it was. That was until I stumbled onto Mike Acton's CppCon talk.
Much has been said about this talk, and I find myself agreeing with most of the praise. However, I still think that the examples given are more complex than they need to be to drive home some of the key points. So, I tried to come up with the smallest example I could.
Example
Say we have a collection of "nodes" and we want to find the average value of those that satisfy some condition. The most straightforward implementation might look like:
// one.c
#include <stdio.h>
#include <stdbool.h>

struct node {
    int value;
    bool include;
};

#define NUM_NODES 8
const struct node nodes[NUM_NODES] = {
    { 661255741, 0 },
    { 858440027, 1 },
    { 2015743814, 1 },
    { -614169128, 0 },
    { 726937019, 1 },
    { -1228229102, 1 },
    { -2135130442, 1 },
    { -348350879, 1 },
};

int main() {
    int sum = 0;
    int count = 0;
    for (int i = 0; i < NUM_NODES; i++) {
        struct node curr = nodes[i];
        if (curr.include) {
            sum += curr.value;
            count++;
        }
    }
    printf("%d nodes counted with average: %f\n",
           count, (double)sum / (double)count);
}
There are two main drawbacks to this approach, both related to the include field on the struct.
- Keeping everything in the same array and then iterating over it creates a lot of wasted work. For any node with include == 0, we waste several instructions and a loop iteration to do nothing of value.
- The bool being present in the struct leads to the struct wasting space. In this case, the int causes the struct to be aligned on 32-bit (4 byte) boundaries, but the bool only needs one byte (really only one bit of information). This results in 3 bytes of padding to reach a size of 8, which keeps each array element on a 4 byte boundary. These last 3 bytes carry no information and lead to increased cache misses, since we load 8 bytes per node and only use the first 5. The sketch after this list shows a quick way to verify the layout.
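To check the layout directly, here is a minimal sketch using sizeof and offsetof (padding.c is not from the original post; the 8 and 4 below hold on typical targets where int is 4 bytes, though exact padding is implementation-defined):

// padding.c
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
    int value;
    bool include;
};

int main() {
    // Typically prints 8 and 4: 4 bytes for the int, 1 byte for the bool,
    // then 3 bytes of padding so the next array element stays aligned.
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    printf("offsetof(include)   = %zu\n", offsetof(struct node, include));
}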
The data oriented approach would be to solve for the common case and maximize our use of the cache.
// two.c
#include <stdio.h>
#include <stdbool.h>

struct node {
    int value;
};

#define NUM_INCLUDED_NODES 6
const struct node included_nodes[NUM_INCLUDED_NODES] = {
    { 858440027 },
    { 2015743814 },
    { 726937019 },
    { -1228229102 },
    { -2135130442 },
    { -348350879 },
};

#define NUM_EXCLUDED_NODES 2
const struct node excluded_nodes[NUM_EXCLUDED_NODES] = {
    { 661255741 },
    { -614169128 },
};

int main() {
    int sum = 0;
    for (int i = 0; i < NUM_INCLUDED_NODES; i++) {
        sum += included_nodes[i].value;
    }
    printf("%d nodes counted with average: %f\n",
           NUM_INCLUDED_NODES,
           (double)sum / (double)NUM_INCLUDED_NODES);
}
Here, we remove the bool from the struct and encode that bit of information by grouping the nodes into separate arrays based on whether they are included. Our "common case" is iterating over and summing the included nodes, and this layout lets us iterate directly over them without a conditional check. Avoiding that check, combined with eliminating the struct padding, makes much better use of the cache.
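In this example the split is done by hand in the source, but the same idea applies when the data arrives mixed. A hypothetical partition step (the tagged_node type and partition_nodes helper are illustrative, not part of the original example) might look like:

// partition.c
#include <stdbool.h>
#include <stddef.h>

struct tagged_node { int value; bool include; };
struct node { int value; };

// Split src into included/excluded arrays; the caller provides storage
// large enough for n entries each. Returns the number of included nodes
// and reports the number of excluded nodes through num_excluded.
size_t partition_nodes(const struct tagged_node *src, size_t n,
                       struct node *included, struct node *excluded,
                       size_t *num_excluded) {
    size_t ni = 0, ne = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i].include)
            included[ni++].value = src[i].value;
        else
            excluded[ne++].value = src[i].value;
    }
    *num_excluded = ne;
    return ni;
}

This pays the branch once, up front, instead of on every pass over the data.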
Benchmarks
- Using 1,048,576 nodes
- Values chosen randomly from INT_MIN to INT_MAX
- 1024 runs for each perf invocation
- Verified identical output
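The benchmark harness isn't shown here, but data in this shape could be generated with something like the following sketch (gen.c and its RNG scheme are assumptions, not the original setup):

// gen.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_NODES (1 << 20) // 1,048,576

int main() {
    srand((unsigned)time(NULL));
    for (int i = 0; i < NUM_NODES; i++) {
        // rand() alone only covers [0, RAND_MAX]; combining two calls
        // spreads values across the full int range. The unsigned-to-int
        // conversion wraps on typical compilers.
        int value = (int)(((unsigned)rand() << 16) ^ (unsigned)rand());
        int include = rand() & 1;
        printf("{ %d, %d },\n", value, include);
    }
}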
Results
For unoptimized builds, the naive approach results in a ~60% cache miss rate. Combined with the additional instructions required and the increased number of loop iterations, this is the slowest of the runs. The data oriented approach reduces the cache miss rate to ~27% while executing about a third as many instructions, leading to a >2x wall time improvement.
For optimized builds, the naive approach has a slightly better cache miss rate at ~53%, and the compiler was able to help, roughly halving the number of cycles and instructions needed. The data oriented approach drops to a ~15% cache miss rate while also cutting instructions and cycles by around an order of magnitude each.
This is possible because the compiler auto-vectorizes, using paddd with xmm0 instead of addl with general-purpose registers, and unrolls the loop twice. As a result, the naive approach does one addition per loop iteration, whereas the data oriented approach does 8 additions per iteration (in only 3 instructions). This was possible due to the simplification introduced by the data oriented approach.
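As a rough illustration (an assumed equivalent, not the compiler's literal output), the vectorized loop behaves like this hand-written SSE2 version, where _mm_add_epi32 compiles to paddd:

// vec.c
#include <emmintrin.h>

struct node { int value; };

int sum_included(const struct node *nodes, int n) {
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    int i = 0;
    // Unrolled twice: two paddd per iteration, 8 int additions total.
    for (; i + 8 <= n; i += 8) {
        acc0 = _mm_add_epi32(acc0, _mm_loadu_si128((const __m128i *)&nodes[i]));
        acc1 = _mm_add_epi32(acc1, _mm_loadu_si128((const __m128i *)&nodes[i + 4]));
    }
    // Fold the two accumulators, then the four 32-bit lanes, to a scalar.
    __m128i acc = _mm_add_epi32(acc0, acc1);
    int lane[4];
    _mm_storeu_si128((__m128i *)lane, acc);
    int sum = lane[0] + lane[1] + lane[2] + lane[3];
    // Scalar tail for any leftover nodes.
    for (; i < n; i++)
        sum += nodes[i].value;
    return sum;
}

The branchy loop in one.c has no such straightforward vector form, which is consistent with the scalar addl in its generated code.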
Full perf output below
Debug (-g -O0)
Performance counter stats for 'one_debug' (1024 runs):
294,051 cache-references:u ( +- 0.02% )
173,755 cache-misses:u # 59.09% of all cache refs ( +- 0.48% )
29,445,380 cycles:u ( +- 0.02% )
14,320,751 instructions:u # 0.49 insn per cycle ( +- 0.00% )
2,135,613 branches:u ( +- 0.00% )
189 faults:u ( +- 0.02% )
0 migrations:u
0.0087163 +- 0.0000542 seconds time elapsed ( +- 0.62% )
Performance counter stats for 'two_debug' (1024 runs):
85,982 cache-references:u ( +- 0.04% )
23,817 cache-misses:u # 27.70% of all cache refs ( +- 2.25% )
3,932,744 cycles:u ( +- 0.03% )
4,883,161 instructions:u # 1.24 insn per cycle ( +- 0.00% )
562,601 branches:u ( +- 0.00% )
94 faults:u ( +- 0.04% )
0 migrations:u
0.0038912 +- 0.0000820 seconds time elapsed ( +- 2.11% )
Release (-O3)
Performance counter stats for 'one_release' (1024 runs):
299,789 cache-references:u ( +- 0.02% )
161,372 cache-misses:u # 53.83% of all cache refs ( +- 0.44% )
15,365,240 cycles:u ( +- 0.03% )
6,456,473 instructions:u # 0.42 insn per cycle ( +- 0.00% )
2,135,610 branches:u ( +- 0.00% )
190 faults:u ( +- 0.02% )
0 migrations:u
0.0064188 +- 0.0000569 seconds time elapsed ( +- 0.89% )
Performance counter stats for 'two_release' (1024 runs):
71,807 cache-references:u ( +- 0.11% )
11,414 cache-misses:u # 15.90% of all cache refs ( +- 2.59% )
534,521 cycles:u ( +- 0.56% )
492,674 instructions:u # 0.92 insn per cycle ( +- 0.00% )
103,891 branches:u ( +- 0.00% )
94 faults:u ( +- 0.04% )
0 migrations:u
0.0014308 +- 0.0000312 seconds time elapsed ( +- 2.18% )
Summary
Even for a painfully simple example, data oriented design shows how much can be gained when we are conscious of the platform we are running on and shape our data for our common case. In this basic example, instructions and cycles dropped by more than an order of magnitude, and wall time improved by over 4x in the optimized builds.