Heyho, you are already aware of it, NNUE uses the CPU Vector Unit to boost NNs,
so here a lil biased overview of SIMD units in CPUs...

- the term SIMD and Vector Unit can be used analogous
- a SIMD unit executes n times the same instruction/operation on different data
- SIMD units differ in bit width, for example from 64 to 512 bits
- SIMD units differ in support for different instructions/operations
- SIMD units differ in support for different data types
- SIMD units may run with a lower frequency than the main CPU ALUs
- SIMD units increase power usage and TDP of the CPU under load

Simplified, older CPUs have 128-bit SSE units, newer ones 256-bit AVX2, ARM
mobile processors for example 128-bit NEON.

A 128-bit SSE unit can perform for example 4x 32-bit FP32 operations at once, a
256-bit AVX2 unit can perform 16x 16-bit INT16 operations at once. The broader
the bit width and the smaller the data-types, the more operations you can run
at once, the more throughput you get. NNs can run for example with FP16,
floating-point 16-bit, or also with INT8, integer 8-bit, inference.

Currently Intel's AVX-512 clocks significantly down under load, so there is no
speed gain by broader bit-width compared to AVX2, may change in future. Also
there is an trend to multiple Vector Units per CPU core underway.