.Net Vectors (CLR 4.6 RyuJIT) Performance

What is RyuJIT?

“RyuJIT” is the code-name of the new 64-bit JIT compiler in the CLR of .Net 4.6, as included in Windows 10 (with updates available for Windows 8.1, 8 and 7). It brings a variety of performance optimisations as well as new features such as native vectorised/SIMD support.

Why do we need .Net Vector support?

Many algorithms benefit from vectorisation through the SIMD instruction sets found in all modern processors. While compilers and run-times (CLR/JVM) may be able to vectorise some code automatically, the most efficient way is to use constructs that tell the compiler/run-time how to vectorise the code for the hardware it is running on.
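As a minimal sketch of such a construct, assuming the System.Numerics.Vector<T> API (System.Numerics.Vectors) that RyuJIT accelerates, here is a SAXPY-style loop (y = a·x + y). The class and method names are hypothetical and this is not one of the benchmark kernels. Each iteration processes Vector<float>.Count elements, the JIT decides which SIMD registers to use, and a scalar loop handles the leftover elements:

```csharp
using System;
using System.Numerics;

// Hypothetical example class: y[i] = a * x[i] + y[i] using Vector<float>.
public static class Saxpy
{
    public static void Run(float a, float[] x, float[] y)
    {
        if (x.Length != y.Length) throw new ArgumentException("length mismatch");

        int width = Vector<float>.Count;   // e.g. 8 lanes with 256-bit vectors, 4 with 128-bit
        var va = new Vector<float>(a);     // broadcast a into every lane
        int i = 0;

        // Vectorised main loop: one Vector<float> per iteration.
        for (; i <= x.Length - width; i += width)
        {
            var vx = new Vector<float>(x, i);
            var vy = new Vector<float>(y, i);
            (va * vx + vy).CopyTo(y, i);
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < x.Length; i++)
            y[i] = a * x[i] + y[i];
    }
}
```

Because the lane count is only known at run time, the loop bound is written in terms of `width` rather than a hard-coded 4 or 8.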

We could always interop with native SIMD libraries, but these are platform- and instruction-set-dependent and add code and maintenance complexity.

What are the other pros and cons of .Net Vector support?

The new CLR is a boon for high-performance algorithms:

  • Widely deployed: installed by default on Windows 10 and through Windows Update on older Windows versions.
  • Widest possible: automatically uses the “widest” SIMD ISA (instruction set) supported by the processor, be it AVX2/FMA, AVX, SSE2, etc. (and AVX512 in a future CLR), without any code modifications.
  • ISA/platform independent: the same .Net code runs on any platform/ISA, now and in the future. There is no need to write native code for each platform and ISA (e.g. AVX-Win64, SSE2-Win32, etc.).
  • All primitive data types supported: single/double floating-point, int/long integers.
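To illustrate the “widest possible” and “all primitive data types” points, here is a minimal sketch assuming System.Numerics.Vector<T> (the class and method names are hypothetical). One generic method sums float, double, int or long arrays, with the lane count chosen by the JIT at run time. The tail is zero-padded and the horizontal sum uses Vector.Dot with a vector of ones, so no generic scalar arithmetic is needed:

```csharp
using System;
using System.Numerics;

// Hypothetical example class: one generic SIMD sum for every supported element type.
public static class GenericVectorSum
{
    public static T Sum<T>(T[] data) where T : struct
    {
        int width = Vector<T>.Count;       // lanes for this T on this CPU
        var acc = Vector<T>.Zero;
        int i = 0;

        // Accumulate full vectors lane by lane.
        for (; i <= data.Length - width; i += width)
            acc += new Vector<T>(data, i);

        // Tail: copy leftovers into a zero-filled buffer (default(T) is 0 for numeric T).
        if (i < data.Length)
        {
            var tail = new T[width];
            Array.Copy(data, i, tail, 0, data.Length - i);
            acc += new Vector<T>(tail);
        }

        // Horizontal add: the dot product with all-ones sums every lane.
        return Vector.Dot(acc, Vector<T>.One);
    }
}
```

Element types that Vector<T> does not support (e.g. decimal) should throw NotSupportedException at run time rather than fail to compile.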

Unfortunately Microsoft could not go the “whole way” and there are downsides:

  • x64 only: RyuJIT is for x64 Windows only; x86 is stuck with the old JIT, which is unlikely to be updated.
  • Very limited integer operators: without basic bit-manipulation operations such as “shift”, “mask” and “swap/permute”, integer performance is low.
  • Limited functions and operators: even floating-point only gets a limited subset of functions and operators.
  • CLR issues: the new RyuJIT does have problems with some .Net apps, which may force users to stay on the older JIT and thus lose Vector support.
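To show why missing shifts hurt, here is a minimal sketch of emulating them with the arithmetic Vector<int> does offer (the helper names are hypothetical and this is only a workaround, not a library API). A left shift becomes a multiply by 2^n. A right shift becomes a division by 2^n, which only matches an arithmetic shift for non-negative values and has no SIMD integer-divide instruction behind it on x86:

```csharp
using System;
using System.Numerics;

// Hypothetical helpers: emulate integer shifts, which this Vector<int> API lacks.
public static class IntShiftWorkarounds
{
    // Left shift by n (0..31) == multiply by 2^n; both wrap on overflow.
    public static Vector<int> ShiftLeft(Vector<int> v, int n)
    {
        if (n < 0 || n > 31) throw new ArgumentOutOfRangeException(nameof(n));
        return v * new Vector<int>(1 << n);
    }

    // Right shift by n == divide by 2^n ONLY for non-negative lanes:
    // division truncates toward zero, an arithmetic shift rounds toward -infinity.
    // x86 has no SIMD integer divide, so expect this to be slow.
    public static Vector<int> ShiftRightNonNegative(Vector<int> v, int n)
    {
        if (n < 0 || n > 30) throw new ArgumentOutOfRangeException(nameof(n));
        return v / new Vector<int>(1 << n);
    }
}
```

For example, -7 / 2 gives -3 while -7 >> 1 gives -4, so sign-dependent code needs extra correction steps on top of an already slow division.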

.Net Vectors vs. Native SIMD Performance

We are testing native and .Net multi-media (fractal generation) performance using various SIMD instruction sets (AVX2/FMA, AVX, SSE2, etc.).

Hardware: Intel i7-4650U (Haswell ULV) with AVX2/FMA, AVX, SSE2 support.

Results Interpretation: Higher values (MPix/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.
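The fractal kernel itself is not shown here. To give a rough idea of what a .Net Vectorised fractal workload looks like, here is a minimal escape-time Mandelbrot sketch using Vector<float> with one pixel per lane (the class, method and parameters are hypothetical, and this is not Sandra's benchmark code). Comparisons return Vector<int> masks (-1 for true, 0 for false), so subtracting the mask increments the iteration count only for pixels that have not yet escaped:

```csharp
using System;
using System.Numerics;

// Hypothetical example: escape-time Mandelbrot for one row, Vector<float>.Count pixels at a time.
public static class VectorMandelbrot
{
    public static int[] Row(float x0, float dx, float y, int width, int maxIter)
    {
        int lanes = Vector<float>.Count;
        var result = new int[width];

        // Per-lane x offsets: 0, dx, 2*dx, ...
        var laneOffsets = new float[lanes];
        for (int l = 0; l < lanes; l++) laneOffsets[l] = l * dx;
        var offsets = new Vector<float>(laneOffsets);
        var four = new Vector<float>(4f);
        var two = new Vector<float>(2f);
        var cy = new Vector<float>(y);

        for (int px = 0; px < width; px += lanes)
        {
            var cx = new Vector<float>(x0 + px * dx) + offsets;
            var zx = Vector<float>.Zero;
            var zy = Vector<float>.Zero;
            var count = Vector<int>.Zero;

            for (int it = 0; it < maxIter; it++)
            {
                var zx2 = zx * zx;
                var zy2 = zy * zy;
                // -1 in lanes where |z|^2 <= 4, 0 where the pixel has escaped.
                var active = Vector.LessThanOrEqual(zx2 + zy2, four);
                if (active == Vector<int>.Zero) break;   // every lane has escaped
                count -= active;                         // +1 for active lanes only
                var newZx = zx2 - zy2 + cx;
                zy = two * zx * zy + cy;
                zx = newZx;
            }

            // Store lanes that fall inside the row.
            for (int l = 0; l < lanes && px + l < width; l++)
                result[px + l] = count[l];
        }
        return result;
    }
}
```

The whole vector keeps iterating until its slowest pixel escapes, which is one reason SIMD fractal speed-ups fall short of the raw lane count.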

.Net Vectorised Performance

| Data Type | .Net Vectorised | .Net Scalar | Native AVX2/FMA | Native AVX | Native SSE2 |
|---|---|---|---|---|---|
| Single Float (MPix/s) | 54 (8 pix wide) [9.2x] | 5.89 (1 pix) | 102.3 (8 pix) [17.4x] | 89 (8 pix) | 57.8 (4 pix) |
| Double Float (MPix/s) | 30.1 (4 pix wide) [2.04x] | 14.78 (1 pix) | 62.5 (4 pix) [4.2x] | 53.4 (2 pix) | 31.9 (2 pix) |
| Integer (MPix/s) | 1.03 (8 pix wide) [0.056x] | 18.5 (1 pix) | 114.5 (16 pix) [6.2x] | 73.4 (8 pix) | 31.3 (4 pix) |
| Int64 (MPix/s) | 0.361 (4 pix wide) [0.020x] | 18 (1 pix) | 41.6 (8 pix) [2.3x] | 23.4 (4 pix) | 23 (2 pix) |

Figures in brackets are speed-ups relative to .Net Scalar; figures in parentheses are the number of pixels processed at once.

The width of the Vectors confirms that 256-bit AVX2/AVX registers are being used: float/int Vectors are 8 lanes wide and double/int64 Vectors 4 lanes wide.
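The lane counts can also be checked from code. Here is a minimal sketch (hypothetical class name) using Vector.IsHardwareAccelerated and Vector<T>.Count. Since every Vector<T> has the same size in bytes, Vector<byte>.Count * 8 gives the register width in bits. On a RyuJIT/AVX2 machine such as the test system this would be expected to report 256 bits, while under the old x86 JIT IsHardwareAccelerated should be false:

```csharp
using System;
using System.Numerics;

// Hypothetical helper: report the Vector<T> width chosen by the JIT.
public static class VectorWidth
{
    // All Vector<T> share one byte size, so bytes * 8 = register width in bits.
    public static int Bits => Vector<byte>.Count * 8;

    public static string Describe() =>
        $"HW accelerated: {Vector.IsHardwareAccelerated}, {Bits}-bit vectors: " +
        $"float x{Vector<float>.Count}, int x{Vector<int>.Count}, " +
        $"double x{Vector<double>.Count}, long x{Vector<long>.Count}";
}
```

Calling `Console.WriteLine(VectorWidth.Describe())` at start-up is a cheap way to log which SIMD width a given machine actually received.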

While the improvement over scalar code is significant (~2x to 9x), .Net Vector code reaches only about 50% of the native AVX2/FMA implementation, which is somewhat disappointing but not altogether unexpected. However, future versions of the CLR will likely improve on this, while our native code is unlikely to be optimised much further.
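The headline ratios follow from the floating-point rows of the table (.Net Vectorised divided by .Net Scalar, and by Native AVX2/FMA):

```latex
\frac{54}{5.89} \approx 9.2\times, \qquad \frac{30.1}{14.78} \approx 2.04\times \quad \text{(.Net Vectorised vs. .Net Scalar)}

\frac{54}{102.3} \approx 0.53, \qquad \frac{30.1}{62.5} \approx 0.48 \quad \text{(.Net Vectorised vs. Native AVX2/FMA)}
```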

No, the poor Vector integer performance is *not* a bug: the lack of bit-manipulation operations (“shift”, “swap/permute”, “mask”, etc.) makes complex integer Vector algorithms pretty much useless, so we only enable Vectors for floating-point workloads.

Vectors may never replace native code completely, but many algorithms can now be implemented in pure .Net code with good performance and without native libraries, making deployment to different platforms (e.g. ARM/Windows, Mono/Linux) far easier.

It is good to see Microsoft adding new features to the CLR (features we would have expected Java to release first), as both the CLR and the JVM have somewhat “stagnated” lately.
