Running 896 GPU threads as one worker, with multiple workers in parallel, does not pay off: there is a 12x slowdown when increasing from 64 to 896 threads per worker. Actually this seems only natural, but I thought the additional threads would hide the memory latency better.
The overall performance of the v0400 design was satisfactory, only the nps per worker was too low. But maybe I can couple up to 1024 GPU threads to one worker to infer the ANN faster? Benchmarks will tell...
Okay, some further, not so quick n dirty benchmarks showed:
~240 nps for Nvidia GTX 750 and
~120 nps for AMD Fury X
I assume about 200 nps per worker on modern GPUs.
While an NN cache could double these values, this is IMO a bit too slow for the intended search algorithm: with about 36x10 qsearch positions on average per expanded node, one worker would need more than a second to get a node score.
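A back-of-the-envelope check of that claim, using the figures above (36 moves x 10 qsearch plies is my reading of "36x10"; the per-GPU nps numbers are the benchmark results quoted earlier):

```python
# Rough per-node cost estimate from the benchmark figures above.
branching = 36        # assumed: average moves per position
qsearch_depth = 10    # assumed: average qsearch plies per expanded node
positions_per_node = branching * qsearch_depth  # ~360 qsearch positions

for nps in (240, 200, 120):  # GTX 750, assumed modern GPU, Fury X
    seconds = positions_per_node / nps
    print(f"{nps} nps -> {seconds:.1f} s per expanded node")
```

So even at the GTX 750's ~240 nps, scoring a single expanded node takes well over a second.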
Back to pen n paper.
wip...will take some time...
* GPGPU device based
- host handles only the IO; search and ANN inference run on the gpu
- gpu computation will be limited by node count to about 1 second per
repeated iteration, to avoid any system timeouts
* parallel BestFirstMiniMax-Search on gpu
- game tree in gpu memory
- best node selected via score + UCT formula (visit count based)
- AlphaBeta Q-Search performed at leaf nodes to get a node score
* multiple small MLP neural networks
- about 4 million weights per network
- 30 networks in total, split by piece count
* trained via TD-leaf on PGN games
- 6/7 men EGTB could be used for training?
* 64 gpu threads are coupled to one worker
- used during move gen, move pick and ANN eval in parallel
- depending on gpu core count, from 64 workers to 2048 workers in total
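The best-node selection above (node score plus a UCT-style term based on visit counts) could be sketched roughly like this; the exploration constant, the tie to visit counts, and the node layout are my assumptions, not part of the design notes:

```python
import math

def uct_select(children, parent_visits, c=1.4):
    """Pick the index of the child maximizing score + UCT exploration term.

    children: list of (score, visits) tuples, where score is the node's
    backed-up qsearch/minimax score from the parent's point of view.
    c: assumed exploration constant (not specified in the design notes).
    """
    best, best_val = None, -math.inf
    for i, (score, visits) in enumerate(children):
        # +1 in the denominator so unvisited children stay selectable
        explore = c * math.sqrt(math.log(parent_visits) / (visits + 1))
        val = score + explore
        if val > best_val:
            best, best_val = i, val
    return best
```

With this formula a rarely visited child can outrank a slightly better-scored but heavily visited sibling, which is what drives the best-first expansion.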
Some quick and dirty benchmarks showed that with this design ~1 Knps per worker is possible.
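Combined with the 64-to-2048-worker range above, that per-worker figure gives a rough total-throughput estimate (a sketch, assuming workers scale linearly with no contention):

```python
# Rough total-throughput estimate from the figures above:
# ~1 Knps per worker, 64 to 2048 workers depending on gpu core count.
nps_per_worker = 1_000
for workers in (64, 2048):
    total = workers * nps_per_worker
    print(f"{workers} workers -> ~{total / 1e6:.3f} Mnps total")
```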