WIP, will take some time.

* GPGPU device based
- host handles only the I/O; search and ANN inference run on the gpu
- each gpu run is limited by node count to about 1 second per
  repeated iteration, to avoid system timeouts
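
A minimal host-side sketch of the node-limited iterations. `run_iteration` is a hypothetical stand-in for the gpu kernel launch, and `node_budget` is an assumed tuning value chosen so that one launch stays near one second:

```python
from typing import Callable, List


def search_in_iterations(total_nodes: int, node_budget: int,
                         run_iteration: Callable[[int], int]) -> List[int]:
    """Split a search into repeated launches, each capped at node_budget nodes.

    run_iteration is a hypothetical stand-in for the gpu kernel launch:
    it receives the node limit and returns the nodes actually searched.
    """
    if node_budget <= 0:
        raise ValueError("node_budget must be positive")
    searched = []
    remaining = total_nodes
    while remaining > 0:
        limit = min(node_budget, remaining)
        done = run_iteration(limit)
        if done <= 0:
            break  # nothing was searched, so stop instead of looping forever
        searched.append(done)
        remaining -= done
    return searched
```

Each call returns to the host quickly, so the driver never sees one long-running kernel.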

* parallel BestFirstMiniMax-Search on gpu
- game tree in gpu memory
- best node selected via score + UCT formula (visit count based)
- AlphaBeta Q-Search performed at leaf nodes to get a node score
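
The exact selection formula is not fixed yet. As a rough sketch, assuming the textbook UCT exploration term added to the node score (the `Node` layout, the score scaling and the constant `c` are all assumptions):

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    score: float  # backed-up score seen from the parent, assumed scaled to [-1, 1]
    visits: int = 0
    children: List["Node"] = field(default_factory=list)


def select_child(parent: Node, c: float = 1.0) -> Optional[Node]:
    """Pick the child that maximises score + c * sqrt(ln(N) / n)."""
    if not parent.children:
        return None
    log_n = math.log(max(parent.visits, 1))

    def key(child: Node) -> float:
        if child.visits == 0:
            return math.inf  # unvisited children are tried first
        return child.score + c * math.sqrt(log_n / child.visits)

    return max(parent.children, key=key)
```

With `c = 0` this degrades to plain best-first selection by score; a larger `c` shifts visits toward rarely explored children.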

* multiple small MLP neural networks
- about 4 million weights per network
- 30 networks in total, split by piece count
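
How the 30 networks map to piece counts is not specified above. One possible mapping, assuming one network per piece count from 3 to 32 (bare kings need no evaluation) and a 64-bit occupancy bitboard as input:

```python
NUM_NETWORKS = 30  # from the design above


def network_index(occupied: int) -> int:
    """Map an occupancy bitboard to a network index.

    Assumption: piece counts 3..32 map to indices 0..29.
    """
    pieces = bin(occupied).count("1")  # population count of the bitboard
    if not 3 <= pieces <= 32:
        raise ValueError("expected between 3 and 32 pieces")
    return pieces - 3
```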

* trained via TD-Leaf on PGN games
- 6/7-man EGTBs could be used for training?
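
A sketch of one TD-Leaf(lambda) update over a single game. To keep it short it uses a linear evaluation instead of the MLP, so the gradient is just the feature vector; `alpha`, `lam` and the feature encoding are assumptions:

```python
from typing import List, Sequence


def evaluate(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear stand-in for the network evaluation."""
    return sum(w * f for w, f in zip(weights, features))


def td_leaf_update(weights: Sequence[float],
                   leaf_features: Sequence[Sequence[float]],
                   outcome: float,
                   alpha: float = 0.01,
                   lam: float = 0.7) -> List[float]:
    """Return updated weights after one game.

    leaf_features[t] holds the features of the principal-variation leaf
    found by the search at move t; outcome is the final game result.
    All values are taken from the same side's point of view.
    """
    values = [evaluate(weights, x) for x in leaf_features]
    values.append(outcome)  # the game result is the final target
    # temporal differences d_t = J(t + 1) - J(t)
    diffs = [values[t + 1] - values[t] for t in range(len(leaf_features))]
    new_weights = list(weights)
    for t, x in enumerate(leaf_features):
        # lambda-discounted sum of the temporal differences from t onward
        td_sum = sum(lam ** (j - t) * diffs[j] for j in range(t, len(diffs)))
        for i, f in enumerate(x):
            new_weights[i] += alpha * f * td_sum  # gradient of a linear eval is x
    return new_weights
```

For the real networks the inner update would use the backpropagated gradient of the MLP output instead of `f`.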

* 64 gpu threads are coupled to one worker
- used during move gen, move pick and ANN eval in parallel
- total worker count depends on the gpu core count, from 64 to 2048 workers
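
The coupling of threads to workers can be pictured as a simple index split. A sketch, assuming consecutive global thread ids are grouped into one worker (the function names are hypothetical):

```python
from typing import Tuple

THREADS_PER_WORKER = 64  # from the design above


def worker_and_lane(global_thread_id: int) -> Tuple[int, int]:
    """Split a global thread id into (worker id, lane within the worker)."""
    return divmod(global_thread_id, THREADS_PER_WORKER)


def total_threads(workers: int) -> int:
    """Number of gpu threads needed for the given worker count."""
    return workers * THREADS_PER_WORKER
```

With these numbers, 64 workers correspond to 4,096 threads and 2048 workers to 131,072 threads.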

Quick and dirty benchmarks showed that about 1 Knps (1,000 nodes per second) per worker is possible with this design.
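
A back-of-the-envelope estimate of what that means for the whole device, assuming throughput scales linearly with the worker count (the benchmarks above do not confirm this):

```python
NPS_PER_WORKER = 1_000  # about 1 Knps per worker, from the quick benchmarks


def estimated_nps(workers: int) -> int:
    """Total nodes per second, assuming linear scaling across workers."""
    return workers * NPS_PER_WORKER


def nodes_per_launch(workers: int, seconds: float = 1.0) -> int:
    """Node limit that keeps one gpu run near the given duration."""
    return int(estimated_nps(workers) * seconds)
```

Under that assumption the range is roughly 64 Knps with 64 workers up to about 2 Mnps with 2048 workers, which also gives a starting point for the per-iteration node limit.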