Eta Chess

Eta - v0501 - command-queues

I figured I could efficiently run maybe 32 gpu-workers per CPU thread: one CPU thread iterates the game tree in memory for 32 workers, which perform AB-playouts on the GPU, using one OpenCL command-queue per thread.

The project is back in my pipeline. I will work on Zeta NNUE first; if it runs, I will use the Zeta AB-NNUE framework for Eta. This will take some time.

Eta - v0502

A design with BestFirst on CPU and MiniMax-playouts with NNUE eval on GPU could utilize the AB-framework of the Zeta v099 gpu-engine. But considering just a ply 1 + quiescence search, an alternative implementation as a LIFO-stack seems reasonable. This would simplify the iterative implementation of a recursive AB search for the GPU architecture. Couple 32 gpu-threads to run on one SIMD unit of the GPU, use these 32 threads for move generation (piece-wise parallelization) and NNUE inference, store the game tree as a doubly-linked list in VRAM, and apply LIFO-stack based processing on the game tree with AB pruning, something like this.

Followup:

Oops, too much coffee... LIFO would not work with NNUE, but it would with a classic NN ;)
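The stack idea from the v0502 note can be sketched on the CPU as follows. This is only a minimal illustration of turning a recursive AB search into a loop over an explicit LIFO stack; the toy tree format (nested lists with integer leaf scores), the `Frame` record and the function name are my assumptions for the example, not the Zeta or Eta data layout, and the leaf score stands in for a plain static eval.

```python
from dataclasses import dataclass
from typing import List, Union

# Toy game tree (assumption): a leaf is an int score for the side to move,
# an inner node is a list of child trees.
Tree = Union[int, List["Tree"]]
INF = 10**9

@dataclass
class Frame:
    node: Tree
    alpha: int
    beta: int
    next_child: int = 0   # index of the next child to search
    best: int = -INF      # best negamax score found so far

def alphabeta_stack(root: Tree) -> int:
    """Negamax alpha-beta without recursion, driven by a LIFO stack."""
    stack = [Frame(root, -INF, INF)]
    result = 0
    returning = False     # True when `result` holds a finished child score
    while stack:
        f = stack[-1]
        if returning:
            # Take over the child's score, negated for the parent's view.
            score = -result
            returning = False
            f.best = max(f.best, score)
            f.alpha = max(f.alpha, score)
        if isinstance(f.node, int):
            # Leaf: the static score is the node value.
            result, returning = f.node, True
            stack.pop()
        elif f.alpha >= f.beta or f.next_child == len(f.node):
            # Beta cutoff, or all children searched: return to the parent.
            result, returning = f.best, True
            stack.pop()
        else:
            # Descend into the next child with the negated, swapped window.
            child = f.node[f.next_child]
            f.next_child += 1
            stack.append(Frame(child, -f.beta, -f.alpha))
    return result

print(alphabeta_stack([[3, 5], [2, 9], [0, 7]]))
```

On a GPU the stack would live in device memory and the 32 coupled threads would share one frame, but that part is left out here.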

Eta - v0501

Recently the new neural network technique 'NNUE' took off in CPU based chess engines like Stockfish, leveraging the vector unit of a CPU for NN inference and replacing HCE (handcrafted evaluation) with neural networks. Hence with NNUE a hybrid design with BestFirst on CPU and MiniMax-Search with NNUE eval on GPU seems possible and within reach. The CPU-host would store and expand the game tree in memory, similar to Lc0's MCTS, and the GPU would perform shallow AlphaBeta-searches (primarily quiescence-search playouts to avoid the horizon effect), similar to Lc0's MCTS-playouts.

Coupling 32 gpu-threads to one worker, and assuming 2K clocks per node for move generation and AB-framework plus maybe another 2K clocks per node for NNUE inference, results in 1.44M gpu-clocks for a 36x10 node q-search (4K clocks x 360 nodes). In such a design the host-device latency (aka kernel-launch overhead) of maybe 10 microseconds does not affect the overall performance. From entry-level GPUs with 512 cores (16 workers) to high-end GPUs with 5120 cores (160 workers), the throughput of such a parallel BestFirst on CPU and AB-playout+NNUE-eval on GPU design could range from ~11K to ~220K node-playouts/s, more than Lc0's gpu throughput, but with a switch from MCTS-PUCT to parallel BestFirstMiniMax-Search and from CNN to NNUE evaluation.
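The estimate above is plain arithmetic and can be replayed with a few lines. The clock counts, the 36x10 q-search size and the 32 threads per worker come from the text; the GPU clock rates (1 GHz and 2 GHz) are my own illustrative assumptions, since the post does not state them, and the function name is hypothetical.

```python
def playouts_per_second(gpu_cores: int, clock_hz: float,
                        threads_per_worker: int = 32,
                        clocks_per_node: int = 2000 + 2000,   # movegen/AB + NNUE
                        nodes_per_playout: int = 36 * 10) -> float:
    """Rough q-search playouts per second for a whole GPU."""
    workers = gpu_cores // threads_per_worker
    clocks_per_playout = clocks_per_node * nodes_per_playout
    return workers * clock_hz / clocks_per_playout

# Clock rates below are assumed values, not taken from the post.
for cores, clock_hz in ((512, 1.0e9), (5120, 2.0e9)):
    rate = playouts_per_second(cores, clock_hz)
    print(f"{cores} cores, {cores // 32} workers: ~{rate / 1000:.0f}K playouts/s")
```

With these assumed clocks the results land close to the ~11K and ~220K quoted above; other clock rates scale the numbers linearly.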

I am not into the details of current NNUE implementations for CPUs, therefore the estimated 2K gpu-clocks per node for NNUE inference is the biggest uncertainty.

I have no experience with running 16 to 160 parallel tasks via OpenCL on GPU, and I am not sure if 160 unique command-queues are manageable in the CPU-GPU interaction.

Transhuman Chess with NN and RL...

Some people argue that the art of writing a chess engine lies in the evaluation function. A programmer acquires expert knowledge of the domain of chess and encodes it via evaluation terms in the engine. We had the division between chess advisor and chess programmer, and with speedy computers our search algorithms were able to reach super-human level chess and outperform any human. We developed automatic tuning methods for the values of our evaluation functions, but now, with Neural Networks and Reinforcement Learning present, I wish to point out that we have entered another level; I call it trans-human level chess.

If we look at the game of Go this seems pretty obvious; I recall one master calling the play of A0 "Go from another dimension". A super-human level engine still relies on handcrafted evaluation terms that humans come up with (and which then get tuned), but a Neural Network is able to encode evaluation terms humans simply do not come up with, to 'see' relations and patterns we cannot see, which are beyond our scope, trans-human, and the Reinforcement Learning technique discovers lines which are as yet uncommon for humans, trans-human.

As mentioned, pretty obvious for Go, less obvious for chess, but still applicable. NNs replacing the evaluation function is just one part of the game; people will come up with NN based pruning, move selection, reduction and extension. What is left is the search algorithm, and we already saw the successful mix of NNs with MCTS and of classic eval with MCTS, so I am pretty sure we will see different kinds of mixtures of already known (search) techniques and upcoming NN techniques. Summing up, the switch is now from encoding the expert knowledge of chess in evaluation terms to encoding the knowledge into NNs and using them in a search algorithm. That is what the paradigm shift since A0, Lc0 and recently NNUE is about, and that is the shift to what I call trans-human chess.

NNs are also called 'black boxes' because we cannot decode what the layers of weights represent in a human-readable form, so I see some room here for the classic approach: can we decode the black box and express the knowledge via handcrafted evaluation terms in our common programming languages?

Currently NNs outperform human expert systems in many domains, this is not chess or Go specific, but maybe the time for the question of reasoning will come, a time to decode the black boxes, or maybe the black box will decode itself, yet another level. Time will tell.

LC0 vs. NNUE - some tech details...

- LC0 uses CNNs, Convolutional Neural Networks, for position evaluation
- NNUE is currently a kind of MLP, Multi-Layer-Perceptron, with incremental updates for the first layer

- A0 used originally about 50 million neural network weights
- NNUE uses currently about 10 million weights? Or more, depending on net size

- LC0 uses an MCTS-PUCT search
- NNUE uses the Alpha-Beta search of its "host" engine

- LC0 uses the Zero approach with Reinforcement Learning on a GPU-Cloud-Cluster
- NNUE uses initial RL with addition of SL, Supervised Learning, with engine-engine games

- LC0 runs the NN part well on GPU (up to hundreds of Vector-Units) via batches
- NNUE runs on the Vector-Unit of the CPU (SSE, AVX, NEON), no batches needed

Because NNUE runs a smaller kind of NN efficiently on a CPU, it gains more NPS in an AB search than previous approaches like Giraffe. You can view it as combining both worlds, the LC0 NN part and the SF AB search part, on a CPU.
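The "incremental updates for the first layer" point from the list above can be illustrated with a toy accumulator. The first-layer output is a bias plus the sum of one weight row per active input feature, so a move that changes only a few features can be applied by subtracting and adding those rows instead of recomputing everything. The feature layout here (12 piece kinds x 64 squares), the layer width and all names are simplifying assumptions for the sketch; real NNUE nets use a larger, king-relative feature set.

```python
import random

NUM_FEATURES = 12 * 64   # toy input: 12 piece kinds x 64 squares (assumption)
HIDDEN = 8               # toy first-layer width (assumption)

def feature_index(piece: int, square: int) -> int:
    """Hypothetical mapping of (piece kind, square) to an input feature."""
    return piece * 64 + square

def full_refresh(weights, bias, active):
    """Recompute the first-layer accumulator from scratch."""
    acc = list(bias)
    for f in active:
        for i in range(HIDDEN):
            acc[i] += weights[f][i]
    return acc

def incremental_update(acc, weights, removed, added):
    """Update the accumulator only for the features changed by a move."""
    acc = list(acc)
    for f in removed:
        for i in range(HIDDEN):
            acc[i] -= weights[f][i]
    for f in added:
        for i in range(HIDDEN):
            acc[i] += weights[f][i]
    return acc

rng = random.Random(0)
weights = [[rng.randint(-64, 64) for _ in range(HIDDEN)] for _ in range(NUM_FEATURES)]
bias = [rng.randint(-64, 64) for _ in range(HIDDEN)]

# A quiet move: piece kind 3 goes from square 12 to square 28.
before = {feature_index(0, 4), feature_index(3, 12), feature_index(6, 60)}
after = (before - {feature_index(3, 12)}) | {feature_index(3, 28)}

acc = full_refresh(weights, bias, before)
acc = incremental_update(acc, weights, [feature_index(3, 12)], [feature_index(3, 28)])
assert acc == full_refresh(weights, bias, after)
```

Integer weights are used so that the incremental and the full result match exactly.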

Eta - v0600

Okay, let's do a time-warp jump back to the year 2008 and figure out how we could use the hardware back then for a neural network based chess engine.

Reinforcement Learning on a GPU-Cluster is probably a no-go (the Titan supercomputer with 18,688 K20Xs went operational in 2012), so we stick to Supervised Learning from a database of quality games or alike. A neural network as used in A0 with ~50 million parameters, queried by an MCTS-PUCT like search with ~80 knps, is also not doable: we had only ~336 GFLOPS on an Nvidia 8800 GT back then, compared to ~108 TFLOPS on an RTX 2080 Ti via Tensor Cores nowadays. So we have to skip the MCTS-PUCT part and rethink the search.

Instead of going for NPS, we could build a really big CNN, but GPU memory back then was only about 512 MB, so we are limited to ~128 Mega parameters per network. So we have to split the CNN, for example by piece count: let us use 30 distinct neural networks indexed by piece count, so we get ~3840 Mega parameters in total, that sounds better already. Maybe this would already be enough to skip the search part and do only a depth 1 search for NN eval. If not, we could split the CNN further, layer by layer, inferred via different waves on the GPU, loaded layer-wise from disk to GPU memory via PCIe or alike, and hence increase the total number of parameters...

So what is the drawback if we could run a CNN with several billion parameters? Obviously the training of such a monster, not only the horsepower needed to train it, but the training data, the games. A0 used about 40 million RL games to reach top-notch computer chess level, for only ~50 million parameters, while the ChessBase Mega Database contains ~8 million quality games... So we simply do not have enough games to train such a CNN monster via Supervised Learning. We have to rely on Reinforcement Learning, and therefore on some kind of GPU-Cluster to play RL games... nowadays, and also back in 2008.
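The parameter budget above can be checked with a small calculation. Two things in it are my assumptions and not from the post: 4 bytes per parameter (fp32) to get from 512 MB to ~128 Mega parameters, and a naive linear scaling of A0's games-per-parameter ratio to guess how many games the big net would want.

```python
MEGA = 2**20

vram_bytes = 512 * MEGA          # GPU memory of a 2008 card, per the text
bytes_per_param = 4              # assumption: fp32 weights
networks = 30                    # one net per piece count, per the text

params_per_net = vram_bytes // bytes_per_param
total_params = params_per_net * networks
print(f"{params_per_net // MEGA} Mega parameters per network")
print(f"{total_params // MEGA} Mega parameters in total")

# Naive assumption: games needed scale linearly with parameter count,
# using A0's ~40 million games for ~50 million parameters as the ratio.
games_per_param = 40e6 / 50e6
games_needed = total_params * games_per_param
games_available = 8e6            # quality games in the database, per the text
print(f"~{games_needed / 1e9:.1f} billion games wanted, "
      f"{games_available / 1e6:.0f} million available")
```

Under this rough scaling the gap between wanted and available games is a factor of a few hundred, which is the point of the paragraph above.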

I see...

I see three ways which neural networks for chess may take in future...

1. With more processing power available, the network size will grow, and we will have really big nets on one side of the extreme, which drop the search algorithm part and perform only a depth 1 search for evaluation.

2. With neural network accelerators with less latency, we will see engines with multiple, smaller neural networks, which perform deeper AlphaBeta searches on the other side.

3. Something in between 1. and 2.

Eta - v0500

Another solution would be to perform a Best-First-MiniMax search on CPU and to do ANN evaluation on GPU. I could couple the nodes of a qsearch at leaf nodes to be evaluated in one batch to gain some nps... that's pretty much how A0 and LC0 work.

Eta - v0401 - nested parallelism

To run 1024 threads per worker will probably not work, due to register size limitation. With the OpenCL 2.x feature 'nested parallelism' it could be possible to run one thread for best-first, which calls another kernel with 64 threads for move generation and another kernel with 1024 threads for ANN inference. But current Nvidia and older AMD devices support only OpenCL 1.x, so this is not a real option.

Eta - v0302 - batches

LC0 uses batch sizes of 256 or 512 to utilize a gpu. I did a quick bench with 256 positions to be evaluated per run...

4096 nps on Nvidia GTX 750
16640 nps on AMD Fury X

Note that an NN cache could double these values, but this is still far less than I could achieve when doing all computations directly on the gpu device, without host-device interaction.

And waiting for 256 positions to be evaluated at once is against the serial nature of AlphaBeta search...

Eta - v0301 - host-device latencies

One reason gpus are not used as accelerators for chess is the host-device latency.

Afaik the latencies are in the range of 5 to 10s or even 100s of microseconds, so you can get at most ~200K kernel calls per second per thread (at 5 microseconds per call), even if the gpu is able to process its task much faster.

Therefore, Eta v0300, a cpu based AlphaBeta search with gpu as ANN accelerator, is flawed by design.
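The upper bound on kernel calls follows directly from the latency, as this small sketch shows. It assumes one blocking call at a time per host thread and ignores the GPU's own processing time; the function name is hypothetical.

```python
def max_calls_per_second(latency_us: float) -> float:
    """Upper bound on sequential kernel calls per second for one host thread."""
    return 1_000_000 / latency_us

# Latencies from the range mentioned above, in microseconds.
for latency_us in (5, 10, 100):
    print(f"{latency_us:>3} us -> {max_calls_per_second(latency_us):,.0f} calls/s")
```

Since an AlphaBeta search on the CPU needs one evaluation per visited node, this bound caps the nps of such a design before any GPU work is counted.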

Eta - v0301

Back to cpu based AlphaBeta search with gpu ANN evaluation.

On an Nvidia GTX 750 I achieve about 2 Knps with one single cpu thread, and up to ~20 Knps with 256 parallel cpu threads.

This sounds far too slow for an AlphaBeta search...

Eta - v0400 - benchs

Okay, some further, not so quick n dirty, benchmarks showed

~240 nps for Nvidia GTX 750 and
~120 nps for AMD Fury X

per worker.

I assume on modern gpus about 200 nps per worker.

While an NN cache might double these values, this is imo a bit too slow for the intended search algorithm: considering about 36x10 qsearch positions on average per expanded node, one worker would need about a second to get a node score.
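The "about a second" figure can be reproduced from the numbers given here. The sketch assumes the worker evaluates all 36x10 q-search positions one after another and models the NN cache as a plain speedup factor; the function and parameter names are hypothetical.

```python
def seconds_per_node_score(nps_per_worker: float,
                           qsearch_positions: int = 36 * 10,
                           cache_speedup: float = 1.0) -> float:
    """Time one worker needs to score one expanded node via q-search."""
    return qsearch_positions / (nps_per_worker * cache_speedup)

print(seconds_per_node_score(200))                     # assumed modern gpu
print(seconds_per_node_score(200, cache_speedup=2.0))  # with NN cache doubling nps
print(seconds_per_node_score(240))                     # Nvidia GTX 750 figure
print(seconds_per_node_score(120))                     # AMD Fury X figure
```

So without a cache a worker is closer to two seconds per node score, and roughly one second only if the cache really doubles the throughput.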

Back to pen n paper.

Eta - v0400 - Feature List

wip...will take some time...

* GPGPU device based
- host handles only the IO, search and ANN inference on gpu
- gpu computation will be limited by node count to about 1 second per
  repeated iteration, to avoid any system timeouts

* parallel BestFirstMiniMax-Search on gpu
- game tree in gpu memory
- best node selected via score + UCT formula (visit count based)
- AlphaBeta Q-Search performed at leafnodes to get a node score

* multiple small MLP neural networks
- about 4 million weights per network
- 30 networks in total, split by piece count

* trained via TD-leaf by pgn games
- 6/7 men EGTB could be used for training?

* 64 gpu threads are coupled to one worker
- used during move gen, move pick and ANN eval in parallel
- depending on gpu core count, from 64 workers to 2048 workers in total

Some quick and dirty benchmarks showed that with this design ~1 Knps per worker is possible.
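The node selection from the feature list ("score + UCT formula, visit count based") could look roughly like the sketch below. The exact formula is not given in the post, so the UCB1-style exploration term, the constant `c`, the `Node` layout and the convention that `score` is stored from the view of the player choosing at the parent are all assumptions for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    score: float = 0.0     # backed-up score, from the parent's point of view
    visits: int = 0
    children: List["Node"] = field(default_factory=list)

def uct_value(parent_visits: int, child: Node, c: float = 1.0) -> float:
    """Score plus a visit-count based exploration bonus (UCB1 style)."""
    if child.visits == 0:
        return math.inf    # always try unvisited children first
    return child.score + c * math.sqrt(math.log(parent_visits) / child.visits)

def select_path(root: Node, c: float = 1.0) -> List[Node]:
    """Walk from the root to a leaf, always taking the best UCT child."""
    path = [root]
    node = root
    while node.children:
        parent_visits = max(node.visits, 1)
        node = max(node.children, key=lambda ch: uct_value(parent_visits, ch, c))
        path.append(node)
    return path

a = Node(score=0.5, visits=8)
b = Node(score=0.4, visits=1)
root = Node(visits=9, children=[a, b])
leaf = select_path(root)[-1]   # the node a worker would expand next
```

In the design above the selected leaf would then be scored by an AlphaBeta Q-Search and the result backed up along the returned path.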

Eta - v0200

This was an attempt to use Zeta v099, a GPU AlphaBeta-search with hundreds of parallel workers, with ANNs. The overall nps throughput looked good, but the parallel AlphaBeta-search is not able to make efficient use of up to thousands of workers.

Eta - Changelog

Here is an overview of what happened before...

Eta (0600)

* CNN monster with billions of parameters without search relies on ~billions of RL games

Eta (0500)

* parallel BestFirstMiniMax-Search on CPU with ANN evaluation on GPU

Eta (0400)

* parallel BestFirstMiniMax-Search on GPU with ANN evaluation on GPU

Eta (0300)

* CPU based AlphaBeta search with GPU ANN eval

Eta (0200)

* fork of Zeta v099 but with neural networks

Eta (0100)

* fork of Zeta v098 but with neural networks

Eta - a neural network based chess engine

Ever since I read the paper about NeuroChess by Sebastian Thrun, I have pondered how to improve his results.

It was obvious that the compute power available in the 90s limited his approach, in training and in inference.

So he had only 120K games for training, a relatively small neural network, and could test his approach only with limited search depths.

Recent results with A0 and LC0 show how Deep Learning methods profit from GPGPU, so I think the time has come to give a GPU ANN based engine a try...

--
Srdja
