NUMA performance improvement for AMD ThreadRipper in Sandra SP2

What is NUMA?

Modern CPUs have had a built-in memory controller for many years now, starting with the K8/Opteron, in order to higher better bandwidth and lower latency. As a result in SMP systems each CPU has their own memory controller and its own system memory that it can access at high speed – while to access other memory it must send requests to the other CPUs. NUMA is a way of describing such systems and allow the operating system and applications to allocate memory on the node they are running on for best performance.

As ThreadRipper is really two (2) Ryzen dies connected internally through InfinityFabric – it is basically a 2-CPU SMP system and thus a 2-node NUMA system.

While it is possible to configure it in UMA (Uniform Memory Access mode) where all memory appears to be unified and interleaved between nodes, for best performance the NUMA mode is recommended when the operating system and applications support it.

While Sandra has always supported NUMA in the standard benchmarks – some of the new benchmarks have not been updated with NUMA support especially since multi-core systems have pretty much killed SMP systems on the desktop – with only expensive severs left to bring SMP / NUMA support.

Note that all the NUMA improvements here would apply to competitor NUMA (e.g. Intel) systems, thus it is not just for ThreadRipper – with EPYC systems likely showing a far higher improvement too.

In this article we test NUMA performance; please see our other articles on:

Native Performance

We are testing native performance using various instruction sets: AVX512, AVX2/FMA3, AVX to determine the gains the new instruction sets bring.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks NUMA 2-nodes
UMA single-node
Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s)  965 [+2.8%]  938 The ‘lightest’ workload should show some NUMA overhead but we can only manage 3% here.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s)  312 [+2.3%] 305 With a 64-bit integer workload the improvement drops to 2%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s)  10.9 [=]  10.9 Emulating int128 means far increased compute workload with NUMA overhead insignificant.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s)  997 [+1.2%]  985 Again no measured improvement here.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s)  562 [+1%]  556 Again no measured improvement here.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s)  27 [=]  26.85 In this heavy algorithm using FP64 to mantissa extend FP128 we see no improvement.
Fractals are compute intensive with few memory accesses – mainly to store results – thus we see a maximum of 3% improvement with NUMA support with the rest insignificant. However, this is a simple 2-node system – bigger 4/8-node systems would likely show bigger gains.
BenchCrypt Crypto AES256 (GB/s) 27.1 [+139%] 11.3 AES hardware accelerated is memory bandwidth bound thus NUMA support matters; even in this 2-node system we see a over 2x improvement of 139%!
BenchCrypt Crypto AES128 (GB/s) 27.4 [+142%] 11.3 Similar to above we see a massive 142% improvement by allocating memory on the right NUMA node.
BenchCrypt Crypto SHA2-256 (GB/s)  32.3 [+50%] 21.4 SHA is also hardware accelerated but operates on a single input buffer (with a small output hash value buffer) and here out improvement drops to 50%, still very much significant.
BenchCrypt Crypto SHA1 (GB/s) 34.2  [+56%]  21.8 Similar to above we see an even larger 56% improvement for supporting NUMA.
BenchCrypt Crypto SHA2-512 (GB/s)  6.36 [=]  6.35 SHA2-256 is not hardware accelerated (AVX2 used) but heavy compute bound thus our improvement drops to nothing.
Finally in streaming algorithms we see just how much NUMA support matters: even on this 2-note system we see over 2x improvement of 140% when working with 2 buffers (in/out). When using a single buffer our improvement drops to 50% but still very much significant. TR needs NUMA suppport to shine.
BenchScience SGEMM (GFLOPS) float/FP32  395 [114%]  184 As with crypto, GEMM benefits greatly from NUMA support with an incredible 114% improvement by allocating the (3) buffers on the right NUMA nodes.
BenchScience DGEMM (GFLOPS) double/FP64  183 [131%]  79 Changing to FP64 brings an even more incredible 131%.
BenchScience SFFT (GFLOPS) float/FP32  11.6 [86%]  6.25 FFT also shows big gains from NUMA support with 86% improvement just by allocating the buffers (2+1 const) on the right nodes.
BenchScience DFFT (GFLOPS) double/FP64  10.6 [112%]  5 With FP64 again increases
BenchScience SNBODY (GFLOPS) float/FP32  479 [=]  483 Strangely N-Body does not benefit much from NUMA support with no appreciable improvement.
BenchScience DNBODY (GFLOPS) double/FP64  189 [=]  191 With FP64 workload nothing much changes.
As with crypto, buffer heavy algorithms (GEMM, FFT, N-Body) greatly benefit from NUMA support with performance doubling (86-131%) by allocating on the right NUMA nodes; in effect TR needs NUMA in order to perform better than a standard Ryzen!
CPU Image Processing Blur (3×3) Filter (MPix/s)  2090 [+71%]  1220 Least compute brings highest benefit from NUMA support – here it is 71%.
CPU Image Processing Sharpen (5×5) Filter (MPix/s)  886 [=]  890 Same algorithm but more compute brings the improvement to nothing.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  494 [=]  495 Again same algorithm but even more compute again no benefit.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  720 [=]  719 Using two buffers does not seem to show any benefit either.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  116 [=]  117 Different algorithm keeps with more compute means no benefit either.
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  40.3 [=]  40.7 Using the new scatter/gather in AVX2 does not help matters even with NUMA support.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  1880 [+90%]  982 Here we have a 64-bit integer workload algorithm with many gathers not compute heavy brings 90% improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s)  397 [=]  396 Heavy compute brings down the improvement to nothing.
As with other SIMD tests,  low compute algorithms see 70-90% improvement from NUMA support; heavy compute algorithms bring the improvement down to zero. It all depends whether the overhead of accessing other nodes can be masked by compute; in effect TR seems to perform pretty well.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is clear that ThreadRipper needs NUMA support in applications – just like any other SMP system today to shine: we see over 2x improvement in bandwidth-heavy algorithms. However, in compute-heavy algorithms TR is able to mask the overhead pretty well – with NUMA bringing almost no improvement. For non NUMA supporting software the UMA mode should be employed.

Let’s remember we are only testing a 2-node system, here, a 4+ node system is likely to show higher improvements and with EPYC systems stating at 1-socket 4-node we can potentially have common 4-socket 16-node systems that absolutely need NUMA for best performance. We look forward to testing such a system as soon as possible 😉

 

Tagged , , , . Bookmark the permalink.

Comments are closed.