Eta project paused.
Running 896 gpu threads as one worker, with multiple workers in parallel, does not
pay off. There is a 12x slowdown when increasing from 64 to 896 threads.
Actually this seems only natural, but i thought the multiple threads would
somehow hide the memory latency better.
Another solution would be to perform a Best-First-MiniMax search on CPU and
to do ANN evaluation on GPU. I could collect the nodes of a qsearch at leaf
nodes and evaluate them in one batch to gain some nps...that's pretty much how
A0 and LC0 work.
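
A minimal sketch of that batching idea in Python. `evaluate_batch` is a hypothetical stand-in for one gpu inference call and the feature vectors are dummy data; none of these names come from Eta. The point is only that 600 leaf positions cost 3 evaluator calls at batch size 256 instead of 600 single calls.

```python
from typing import Callable, List, Sequence, Tuple

Position = Sequence[float]  # dummy feature vector, an assumption for illustration


def evaluate_batch(positions: Sequence[Position]) -> List[float]:
    """Hypothetical stand-in for ONE gpu inference call over a whole batch.
    The 'score' here is just the sum of the dummy features."""
    return [float(sum(p)) for p in positions]


def score_leaves(
    leaves: Sequence[Position],
    batch_size: int,
    evaluator: Callable[[Sequence[Position]], List[float]] = evaluate_batch,
) -> Tuple[List[float], int]:
    """Score all leaves, issuing one evaluator call per batch.
    Returns the scores in leaf order and the number of evaluator calls."""
    scores: List[float] = []
    calls = 0
    for start in range(0, len(leaves), batch_size):
        batch = leaves[start:start + batch_size]
        scores.extend(evaluator(batch))
        calls += 1
    return scores, calls


leaves = [[float(i), 1.0] for i in range(600)]
scores, calls = score_leaves(leaves, batch_size=256)
print(len(scores), calls)  # 600 3
```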
To run 1024 threads per worker will probably not work, due to register size
limitation. With the OpenCL 2.x feature 'nested parallelism' it could be
possible to run one thread for best-first, which calls another kernel with
64 threads for move generation and another kernel with 1024 threads for ann
inference. But current Nvidia and older AMD devices support only OpenCL 1.x,
so this is not a real option.
LC0 uses batch sizes of 256 or 512 to utilize a gpu,
so i did a quick benchmark with 256 positions evaluated per run...
4096 nps on Nvidia GTX 750
16640 nps on AMD Fury X
Note that an NN cache could double these values,
but this is still far less than i could achieve when doing all computations
directly on the gpu device, without host-device interaction.
And waiting for 256 positions to be evaluated at once is against the serial
nature of AlphaBeta search...
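
To make the waiting cost concrete, here is a small Python calculation of the time per batch implied by the figures above. It assumes the nps numbers measure nothing but batched evaluation at batch size 256, which is a simplification; `batch_time_ms` is a name made up for this sketch.

```python
def batch_time_ms(nps: float, batch_size: int) -> float:
    """Milliseconds one batch takes if `nps` positions are evaluated per second."""
    return 1000.0 * batch_size / nps


BATCH_SIZE = 256
for device, nps in (("Nvidia GTX 750", 4096), ("AMD Fury X", 16640)):
    print(f"{device}: {batch_time_ms(nps, BATCH_SIZE):.1f} ms per batch")
# Nvidia GTX 750: 62.5 ms per batch
# AMD Fury X: 15.4 ms per batch
```

Under that assumption a serial search would stall for tens of milliseconds before any of the 256 scores comes back.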
The overall performance of the v0400 design was satisfactory,
only the nps per worker was too low,
but maybe i can couple up to 1024 gpu threads to one worker,
to run the ANN inference faster? Benchmarks will tell...
One reason gpus are not used as accelerators for chess is the host-device latency.
Afaik the latencies range from about 5 microseconds to tens or even hundreds of microseconds,
so you get at most 200K kernel calls per second per thread,
even if the gpu is able to process its task much faster.
Therefore, Eta v0300, a cpu based AlphaBeta search with gpu as ANN accelerator,
is flawed by design.
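
The latency bound above is simple arithmetic; a short Python sketch, assuming each kernel call is a blocking round trip and one host thread issues them strictly one after another:

```python
def max_calls_per_second(latency_us: float) -> float:
    """Upper bound on sequential host-device round trips per second for one thread."""
    return 1_000_000.0 / latency_us


for latency_us in (5, 10, 100):
    print(latency_us, "us ->", int(max_calls_per_second(latency_us)), "calls/s")
# 5 us -> 200000 calls/s
# 10 us -> 100000 calls/s
# 100 us -> 10000 calls/s
```

So even at the optimistic 5 microseconds, one evaluation per call caps a single thread at 200K evaluations per second, no matter how fast the gpu itself is.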
Back to cpu based AlphaBeta search with gpu ANN evaluation.
On an Nvidia GTX 750 i get about 2 Knps with a single cpu thread,
and up to ~20 Knps with 256 parallel cpu threads.
This sounds far too slow for an AlphaBeta search...
Okay, some further, not so quick n dirty, benchmarks showed
~240 nps for Nvidia GTX 750 and
~120 nps for AMD Fury X
I assume on modern gpus about 200 nps per worker.
While an NN cache might double these values,
this is imo a bit too slow for the intended search algorithm:
with about 36x10 qsearch positions on average per expanded node,
one worker would need about a second to get a node score.
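
The estimate can be checked with a few lines of Python. The inputs are the numbers from the text (36x10 qsearch positions, about 200 nps per worker, an NN cache that might double the rate); the function name is made up for this sketch.

```python
def seconds_per_node_score(qsearch_positions: int, nps: float) -> float:
    """Time one worker needs to evaluate all qsearch positions of one expanded node."""
    return qsearch_positions / nps


QSEARCH_POSITIONS = 36 * 10   # average per expanded node, from the text
NPS_PER_WORKER = 200          # assumed rate on modern gpus, from the text

without_cache = seconds_per_node_score(QSEARCH_POSITIONS, NPS_PER_WORKER)
with_cache = seconds_per_node_score(QSEARCH_POSITIONS, 2 * NPS_PER_WORKER)
print(f"{without_cache:.1f} s without cache, {with_cache:.1f} s with cache")
# 1.8 s without cache, 0.9 s with cache
```

So "about a second" holds if the cache really doubles throughput; without it the figure is closer to two seconds.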
Back to pen n paper.
wip...will take some time...
* GPGPU device based
- host handles only the IO, search and ANN inference on gpu
- gpu computation will be limited by node count to about 1 second per
repeated iteration, to avoid any system timeouts
* parallel BestFirstMiniMax-Search on gpu
- game tree in gpu memory
- best node selected via score + UCT formula (visit count based)
- AlphaBeta Q-Search performed at leafnodes to get a node score
* multiple small MLP neural networks
- about 4 million weights per network
- 30 networks in total, split by piece count
* trained via TD-leaf on pgn games
- 6/7 men EGTB could be used for training?
* 64 gpu threads are coupled to one worker
- used during move gen, move pick and ANN eval in parallel
- depending on gpu core count, from 64 to 2048 workers in total
Some quick and dirty benchmarks showed that with this design ~1 Knps per worker is possible.
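
A minimal Python sketch of the "score + UCT formula" node selection from the design list above. The exact formula is not given in these notes, so the UCB1-style exploration term, the constant `c`, the score range and the `Node` layout are all assumptions for illustration, not Eta's actual implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    score: float                      # backed-up score from the parent's view (assumed range [-1, 1])
    visits: int = 0                   # how often this node was selected so far
    children: List["Node"] = field(default_factory=list)


def uct_value(parent_visits: int, child: Node, c: float = 1.0) -> float:
    """Score plus a visit-count based exploration bonus (UCB1-style, an assumption)."""
    if child.visits == 0:
        return math.inf               # always try unvisited children first
    bonus = c * math.sqrt(math.log(max(1, parent_visits)) / child.visits)
    return child.score + bonus


def select_child(parent: Node, c: float = 1.0) -> Node:
    """Pick the child with the highest score + exploration bonus."""
    return max(parent.children, key=lambda ch: uct_value(parent.visits, ch, c))


a = Node(score=0.5, visits=8)
b = Node(score=0.4, visits=2)
root = Node(score=0.0, visits=10, children=[a, b])

print(select_child(root, c=0.0) is a)  # True: pure best-first picks the higher score
print(select_child(root, c=1.0) is b)  # True: the rarely visited child gets the bonus
```

With `c=0` this degenerates to plain best-first selection; a larger `c` spreads parallel workers over more of the tree.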
This was an attempt to use Zeta v099, a gpu AlphaBeta-search with hundreds of
parallel workers, with anns. The overall nps throughput looked good, but the
parallel AlphaBeta-search is not able to make efficient use of up to thousands
This was an attempt to use Zeta v098, a gpu BestFirst-search with thousands of
parallel threads, with anns, but one single thread is too slow for ann inference.
Here is an overview of what happened before...
* parallel BestFirstMiniMax-Search on cpu with ann evaluation on gpu
* parallel BestFirstMiniMax-Search on gpu with ann evaluation on gpu
* cpu based alpha beta search with gpu ann eval
* fork of Zeta v099 but with neural networks
* fork of Zeta v098 but with neural networks
Ever since i read the NeuroChess paper by Sebastian Thrun, i have pondered how to improve on his results.
It was obvious that the compute power available in the 90s limited his approach, in training and in inference.
So he had only 120K games for training, a relatively small neural network, and could test his approach only with limited search depths.