Eta Chess

CPU Vector Unit, the new jam for NNs...

Heyho, you are already aware of it, NNUE uses the CPU Vector Unit to boost NNs,
so here a lil biased overview of SIMD units in CPUs...

- the term SIMD and Vector Unit can be used analogous
- a SIMD unit executes n times the same instruction/operation on different data
- SIMD units differ in bit width, for example from 64 to 512 bits
- SIMD units differ in support for different instructions/operations
- SIMD units differ in support for different data types
- SIMD units may run with a lower frequency than the main CPU ALUs
- SIMD units increase power usage and TDP of the CPU under load

Simplified, older CPUs have 128-bit SSE units, newer ones 256-bit AVX2, ARM
mobile processors for example 128-bit NEON.

A 128-bit SSE unit can perform for example 4x 32-bit FP32 operations at once, a
256-bit AVX2 unit can perform 16x 16-bit INT16 operations at once. The broader
the bit width and the smaller the data-types, the more operations you can run
at once, the more throughput you get. NNs can run for example with FP16,
floating-point 16-bit, or also with INT8, integer 8-bit, inference.

Currently Intel's AVX-512 clocks significantly down under load, so there is no
speed gain by broader bit-width compared to AVX2, may change in future. Also
there is an trend to multiple Vector Units per CPU core underway.

Transhuman Chess with NN and RL...

Some people argue that the art of writing a chess engine lies in the evaluation function. A programmer gets into the expert knowledge of the domain of chess and encodes this via evaluation terms in his engine. We had the division between chess advisor and chess programmer, and with speedy computers our search algorithms were able to reach super-human level chess and outperform any human. We developed automatic tuning methods for the values of our evaluation functions but now with Neural Networks and Reinforcement Learning present I wish to point that we entered another kind of level, I call it trans-human level chess.

If we look at the game of Go this seems pretty obvious, I recall one master naming the play of A0 "Go from another dimension". A super-human level engine still relies on handcrafted evaluation terms human do come up with (and then get tuned), but a Neural Network is able to encode evaluation terms humans simply do not come up with, to 'see' relations and patterns we can not see, which are beyond our scope, trans-human, and the Reinforcement Learning technique discovers lines which are yet uncommon for humans, trans-human.

As mentioned, pretty obvious for Go, less obvious for chess, but still applicable. NNs replacing the evaluation function is just one part of the game, people will come up with NN based pruning, move selection, reduction and extension. What is left is the search algorithm, and we already saw the successful mix of NNs with MCTS and classic eval with MCTS, so I am pretty sure we will see different kind of mixtures of already known (search) techniques and upcoming NN techniques. Summing above up, the switch is now from encoding the expert knowledge of chess in evaluation terms to encoding the knowledge into NNs and use them in a search algorithm, that is what the paradigm shift since A0 and Lc0 and recently NNUE is about, and that is the shift to what I call trans-human chess.

NNs are also called 'black-boxes' cos we can not decode what the layers of weights represent in an human-readable form, so I see here some room for the classic approach, can we decode the black-box and express the knowledge via handcrafted evaluation terms in our common programming languages?

Currently NNs outperform human expert-systems in many domains, this not chess or Go specific, but maybe the time for the question of reasoning will come, a time to decode the black-boxes, or maybe the black-box will decode itself, yet another level, time will tell.

Eta - v0600 - The next step for LC0?

I know, LC0's primary goal was an open source adaptation of A0, and I am not into the Discord development discussions and alike, anyway, my 2 cents on this:

  • MCTS-PUCT search was an descendant from AlphaGo, generalized to be applied on Go, Shogi and Chess, it can utilize a GPU via batches but has its known weaknesses, tactics in form of "shallow-traps" in a row, and end-game.
  • A CPU AB search will not work with NN on GPU via batches.
  • NNUE makes no sense on GPU.
  • LC0 has a GPU-cloud-cluster to play Reinforcement Learning games.
  • LC0 plays already ~2400(?) Elo with an depth 1 search alone.
  • It is estimated that the NN eval is worth 4 plies AB search.

Looking at the above points it seems pretty obvious what the next step for LC0 could be, drop the weak part, MCTS-PUCT search, ignore AB and NNUE, and focus what LC0 is good at. Increase the plies encoded in NN, increase the Elo at depth 1 eval.

To put it to an extreme, drop the search part completely, increase the CNN size 1000 fold, decrease NPS from ~50K to ~50, add multiple, increasing sized NNs to be queried stepwise for Time Control.

Just thinking loud...

LC0 vs. NNUE - some tech details...

- LC0 uses CNNs, Convolutional Neural Networks, for position evaluation
- NNUE is currently a kind of MLP, Multi-Layer-Perceptron, with incremental updates for the first layer

- A0 used originally about 50 million neural network weights
- NNUE uses currently about 10 million weights? Or more, depending on net size

- LC0 uses a MCTS-PUCT search
- NNUE uses the Alpha-Beta search of its "host" engine

- LC0 uses the Zero approach with Reinforcement Learning on a GPU-Cloud-Cluster
- NNUE uses initial RL with addition of SL, Supervised Learning, with engine-engine games

- LC0 runs the NN part well on GPU (up to hundreds of Vector-Units) via batches
- NNUE runs on the Vector-Unit of the CPU (SSE, AVX, NEO), no batches in need

Cos NNUE runs a smaller kind of NN on a CPU efficient it gains more NPS in an AB search than previous approaches like Giraffe, you can view it in a way that it can combine both worlds, the LC0 NN part and the SF AB search part, on a CPU.

Eta - v0600

Okay, let's do an timewarpjump back to the year 2008 and figure out how we could use the hardware back then for an neural network based chess engine.

Reinforcement Learning on a GPU-Cluster is probably a no go (the Titan supercomputer with 18,688 K20Xs went op in 2012) so we stick on Supervised Learning from a database of quality games or alike. A neural network as used in A0 with ~50 millions parameters queried by an MCTS-PUCT like search with ~80 knps is also not doable, we had only ~336 GFLOPS on an Nvidia 8800 GT back then, compared to ~108 TFLOPS on an RTX 2080 TI via Tensor Cores nowadays. So we have to skip the MCTS-PUCT part and rethink the search. Instead to go for NPS, we could build a really big CNN, but the memory back then on a GPU was only about 512 MB, so we stick on ~128 Mega parameters. So, we have to split the CNN, for example by piece count, let us use 30 distinct neural networks indexed by piece count, so we get accumulated ~3840 Mega parameters, that sounds already better. Maybe this would be already enough to skip the search part and do only a depth 1 search for NN eval. If not, we could split the CNN further, layer by layer, inferred via different waves on GPU, loaded layer-wise from disk to GPU memory via PCIe or alike and hence increase the total number of parameters...so what is the drawback if we could run an CNN with several billion parameters? Obviously the training of such an monster, not only the horse power needed to train, but the training data, the games. A0 used about 40 million RL games to reach top-notch computer chess level, for only ~50 million parameters, the Chess Base Mega Database contains ~8 million quality games...so we simply have not enough games to train such an CNN monster via Supervised Learning, we rely on Reinforcement Learning, and therefore on some kind of GPU-Cluster to play RL games... nowadays, and also back in 2008.

Home - Top