Eta Chess

Discord Channels

To backup a thread posted on TalkChess:

https://talkchess.com/forum3/viewtopic.php?f=2&t=82700

Discord Channels
by smatovic » Wed Oct 11, 2023 9:05 am
I see Discord channels are per invite, so I won't add them to CPW, maybe here a TC thread to collect computer chess related Discord channels, feel free to add.

Leela Chess Zero - Discord: https://discord.com/invite/pKujYxD
LCZero - Google Groups: https://groups.google.com/g/lczero
Stockfish - Discord: https://discord.com/invite/GWDRS3kU6R
FishCooking - Google Groups: https://groups.google.com/g/fishcooking

-- Srdja

Re: Discord Channels
by Graham Banks » Wed Oct 11, 2023 9:26 am

Engine programming: https://discord.com/invite/F6W6mMsTGN
Open Bench: https://discord.gg/756yyf8U

Re: Discord Channels
by Guenther » Wed Oct 11, 2023 9:28 am

Lichess: https://discord.gg/lichess

Re: Discord Channels
by smatovic » Wed Oct 11, 2023 1:22 pm

Chess.com - Club - Discord: https://www.chess.com/club/chess-com-discord

-- Srdja

Fruit fly races on steroids?

To backup a thread posted on TalkChess:

https://talkchess.com/forum3/viewtopic.php?f=2&t=83267

Fruit fly races on steroids?
Post by smatovic » Mon Jan 29, 2024 6:34 pm 

To open a new thread on this topic, taken from the Obsidian thread:

Frank Quisinsky wrote: Mon Jan 29, 2024 5:47 pm
...

I can imagine there are highly skilled, talented and educated engine programmers out there, and the development speed might have increased significantly for multiple reasons; nevertheless, I share John's opinion:

JohnWoe wrote: Fri Aug 13, 2021 12:43 pm
[...] How does it improve competition if everybody contributes to Stockfish? The problem space is simply too massive for an individual to come up with code that equals Stockfish. Without significant copy-paste. That's why we have the same program basically. I see no problem with Fire for example. It's as original as any other. Crafty for example around 3,000 Elo while Hyatt worked on this project like 40 years professionally. This is where super originality leads you. Of course the modern society only rewards winners and the others can go home. So here we are. [...]

...and, some people still can not distinguish between science and engineering, theory and practice, ideas and implementations. Further, one could criticize the test-driven development of engines, w/o working out the theory behind it:

From Esoteric to Transcendental Chess Programming?
https://talkchess.com/forum3/viewtopic.php?f=7&t=76286

***edit***

"Fruit fly races on steroids?"

Unfortunately, the competitive and commercial aspects of making computers play chess have taken precedence over using chess as a scientific domain. It is as if the geneticists after 1910 had organized fruit fly races and concentrated their efforts on breeding fruit flies that could win these races.

https://www.chessprogramming.org/Artificial_Intelligence#John_McCarthy

And, as already mentioned in another post, the two most important recent impacts in computer chess came from the outside. Lc0 is an open source adaptation of AlphaZero, a generalization of AlphaGo applied to Go, Shogi and Chess. The NNUE technique came to chess from the Shogi world.

-- Srdja

Re: Fruit fly races on steroids?
Post by smatovic » Mon Jan 29, 2024 7:26 pm

That being said, nothing wrong with participating in/watching fruit fly races...let them fly ;)

-- Srdja

Eta Project Paused

Eta chess engine project paused.

I think a parallel BestFirstMiniMax-Search with NNUE eval is worth a try. It would be nice to compare parallel BestFirstMiniMax with MCTS-UCT, MCTS-PUCT, MCTS-AB and MCTS-Rollouts, and to take a look into DAGs (directed acyclic graphs), but I currently lack the time to work on this. There are other chess engines out there which combine a Best-First search (MCTS derivative) with NNUE eval.
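
A minimal, single-threaded sketch of the BestFirstMiniMax idea referred to above: repeatedly follow the best child down to a leaf of the principal variation, expand it, score the new children with a playout, and back the minimax values up. The node layout, expand() and qsearch_eval() are illustrative assumptions, not Eta's actual code.

    /* BestFirstMiniMax sketch: expand the leaf of the current principal
     * variation and back minimax (negamax) scores up to the root.
     * expand() and qsearch_eval() are assumed placeholders. */
    #include <float.h>

    typedef struct Node {
        struct Node  *parent;
        struct Node **child;
        int    nchild;
        float  score;          /* negamax score from the side to move */
    } Node;

    extern void  expand(Node *leaf);          /* assumed: creates leaf->child[] */
    extern float qsearch_eval(const Node *n); /* assumed: AB q-search / NNUE eval */

    /* follow the best child (best for the side to move) down to a leaf */
    static Node *select_leaf(Node *n) {
        while (n->nchild > 0) {
            Node *best = n->child[0];
            for (int i = 1; i < n->nchild; i++)
                if (-n->child[i]->score > -best->score) best = n->child[i];
            n = best;
        }
        return n;
    }

    /* back the new leaf scores up towards the root */
    static void backup(Node *n) {
        while (n) {
            float best = -FLT_MAX;
            for (int i = 0; i < n->nchild; i++)
                if (-n->child[i]->score > best) best = -n->child[i]->score;
            n->score = best;
            n = n->parent;
        }
    }

    /* one iteration: select principal leaf, expand, evaluate children, back up;
     * terminal nodes (no legal moves) are omitted for brevity */
    void bfmm_iteration(Node *root) {
        Node *leaf = select_leaf(root);
        expand(leaf);
        for (int i = 0; i < leaf->nchild; i++)
            leaf->child[i]->score = qsearch_eval(leaf->child[i]);
        backup(leaf);
    }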

Yet Another Turing Test

Now, in the context of generative AIs and the switch from pattern recognition to pattern creation with neural networks, I would like to propose my own kind of Turing Test:

An AI which is able to code a chess engine and outperforms humans in this task.

1A) With hand-crafted eval. 1B) With neural networks.

2A) Outperforms non-programmers. 2B) Outperforms average chess-programmers. 2C) Outperforms top chess-programmers.

3A) An un-self-aware AI, the "RI", restricted intelligence. 3B) A self-aware AI, the "SI", sentient intelligence.

***update 2024-02-14***

4A) An AI based on expert-systems. 4B) An AI based on neural networks. 4C) A merger of both.

The Chinese Room Argument applied to this test would claim that no consciousness is needed to perform such a task; hence this test is not meant to measure self-awareness, consciousness or sentience, but what we call human intelligence.

https://en.wikipedia.org/wiki/Chinese_room

The first test candidate was already posted by Thomas Zipproth, Dec 08, 2022:

Provide me with a minimal working source code of a chess engine
https://talkchess.com/forum3/viewtopic.php?f=2&t=81097&start=20#p939245

The Next Big Thing in Computer Chess?

We are getting closer to the perfect chess oracle, a chess engine with perfect play and 100% draw rate.

The Centaurs have already reported that their game is dead. Centaurs participate in tournaments and use all kinds of computer assistance to choose the best move: big hardware, multiple engines, huge opening books, endgame tablebases. But meanwhile they get close to a 100% draw rate even with common hardware, and therefore unbalanced opening books were introduced, where one side has a slight advantage, but the games end in draws again.

The #1 open-source engine Stockfish has, over the past years, lowered the effective branching factor of its search from ~2 to ~1.5 to now ~1.25. This indicates that the selective search heuristics and evaluation heuristics are getting closer to the optimum, where only one move per position has to be considered.
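
As a side note, a rough way to estimate the effective branching factor is EBF ≈ nodes^(1/depth), or the ratio of node counts between two consecutive iterative-deepening iterations. A tiny sketch with made-up node counts, just to illustrate the formula:

    /* EBF estimate: nodes ~ EBF^depth, so EBF ~ nodes^(1/depth).
     * The node count and depth below are made-up illustrative numbers. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double nodes = 2.0e6;   /* nodes searched to reach the given depth */
        int    depth = 30;      /* nominal search depth */
        printf("EBF ~ %.2f\n", pow(nodes, 1.0 / depth)); /* ~1.62 here */
        return 0;
    }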

About a decade ago it was estimated that at about ~4000 Elo points we will have a 100% draw rate amongst engines on our computer rating lists. Now the best engines are in the range of ~3750 Elo (CCRL), which translates to an estimated ~3600 human FIDE Elo points (Magnus Carlsen is rated today at 2852 Elo in Blitz). Larry Kaufman (grandmaster and computer chess legend) mentioned that with the current techniques we might still have ~50 Elo to gain, and it seems everybody is waiting for the next big thing in computer chess to happen.

We replaced the HCE, the handcrafted evaluation function, of our computer chess engines with neural networks. We now train neural networks on billions of labeled chess positions, and they evaluate chess positions via pattern recognition better than what a human is able to encode by hand. The NNUE technique, neural networks used in AlphaBeta search engines, gave a boost of 100 to 200 Elo points.

What could be the next thing, the next boost?

If we assume we still have 100 to 200 Elo points to gain until perfect play (normal chess with a standard opening, ending in a draw), and if we assume an effective branching factor of ~1.25 with HCSH, hand-crafted search heuristics, and that neural networks are superior in this regard, we could imagine replacing HCSH with neural networks too and lowering the EBF further, closer to 1.

Such a technique was already proposed, NNOM++, Move Ordering Neural Networks, but until now it seems that the additional computation effort needed does not pay off.

What else?

We use neural networks in the classic way, for pattern recognition, in today's chess engines, but now the shift is to pattern creation, the so-called generative AIs. They generate text, source code, images, audio, video and 3D models. I would say the race is now on for the next level, an AI which is able to code a chess engine and outperforms humans in this task.

An AI coding a chess engine also has a philosophical implication: such an event is what the Transhumanists call the takeoff of the Technological Singularity, when the AI starts to feed its own development in a feedback loop and exceeds human understanding.

Moore's Law still has something in the pipeline, from currently 5nm to 3nm to maybe 2nm and 1+nm, so we can expect even larger and more performant neural networks for generative AIs in the future. Maybe in ~6 years there will be a kind of peak or silicon sweet spot (current transistor density/efficiency vs. needed financial investment in fab process/research), but currently there is so much money flowing into this domain that progress for the next couple of years seems assured.

Interesting times ahead.

Comparing Chess Engines over History or Architectures - Elo / (Transistorcount*Frequency)

To backup a thread posted on TalkChess:

https://talkchess.com/forum3/viewtopic.php?f=2&t=81062

Comparing Chess Engines over History or Architectures - Elo / (Transistorcount*Frequency)
Post by smatovic » Sat Nov 26, 2022 10:03 am

Heyho,

I already mentioned it in the programmers section, for comparing engines over computer chess history or different architectures I propose a metric like:

Elo / (Transistorcount*Frequency)

Can be applied backwards with electro-mechanical relays, vacuum-tubes, transistors, ICs and microchips.
Can be applied sidewards with CPU, VPU, ASIC, FPGA, GPU, TPU.
Can be applied with or w/o memory (e.g. SRAM or DRAM, no core rope memory?).
Dunno how it could be applied forward for quantum-gates, memristors and alike.

-- Srdja
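
The metric itself is a one-liner; a trivial sketch, where every number is a placeholder and not measured data:

    /* Proposed metric: Elo / (transistor count * frequency).
     * All values below are placeholders for illustration, not real data. */
    #include <stdio.h>

    int main(void) {
        double elo         = 3500.0;  /* hypothetical rating-list Elo */
        double transistors = 1.0e9;   /* hypothetical transistor count */
        double freq_hz     = 4.0e9;   /* hypothetical clock frequency in Hz */
        printf("Elo/(transistors*Hz) = %e\n", elo / (transistors * freq_hz));
        return 0;
    }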

Eta - v0501 - command-queues

I figured I could run maybe 32 gpu-workers per CPU thread efficiently: let one CPU thread iterate the game tree in memory for 32 workers which perform AB-playouts on the GPU, and use one OpenCL command-queue per thread.
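
A minimal host-side sketch of the "one OpenCL command-queue per CPU thread" idea, using the standard OpenCL C API; the thread count is an assumption, error handling and kernel setup are omitted, and this is not Eta code:

    /* One OpenCL command-queue per host thread, each thread driving a group
     * of gpu-workers. Error handling and kernel/buffer setup omitted. */
    #include <CL/cl.h>

    #define NUM_HOST_THREADS 4  /* assumed: each host thread drives 32 gpu-workers */

    int main(void) {
        cl_platform_id platform;
        cl_device_id   device;
        cl_int         err;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

        /* one in-order queue per host thread
         * (clCreateCommandQueueWithProperties on OpenCL 2.x) */
        cl_command_queue queues[NUM_HOST_THREADS];
        for (int i = 0; i < NUM_HOST_THREADS; i++)
            queues[i] = clCreateCommandQueue(ctx, device, 0, &err);

        /* ...each host thread would enqueue its workers' AB-playout kernels
           and buffer transfers on queues[i]... */

        for (int i = 0; i < NUM_HOST_THREADS; i++)
            clReleaseCommandQueue(queues[i]);
        clReleaseContext(ctx);
        return 0;
    }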

The project is in my pipeline again; I will work on Zeta NNUE first, and if it runs, I will use the Zeta AB-NNUE framework for Eta. It will take some time.

Eta - v0502

A design with BestFirst on CPU and MiniMax-playouts with NNUE eval on GPU could utilize the AB-framework of the Zeta v099 gpu-engine. But considering just a ply-1 + quiescence search, an alternative implementation as a LIFO stack seems reasonable. This would simplify the iterative implementation of a recursive AB search for a GPU architecture. Couple 32 gpu-threads to run on one SIMD unit of the GPU, use these 32 threads for move generation (piece-wise parallelization) and NNUE evaluation inference, store the game tree as a doubly-linked list in VRAM, apply LIFO-stack based processing on the game tree with AB pruning, something like this.
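
A rough sketch of what such a game-tree node in VRAM could look like; links are stored as indices into a flat node array rather than pointers, which is the usual way to build linked structures in GPU memory. All field names and sizes are assumptions, not Eta's actual layout.

    /* Sketch of a game-tree node kept in a flat VRAM array, with doubly-linked
     * sibling lists. Field names/sizes are assumptions for illustration. */
    #include <stdint.h>

    typedef struct GpuNode {
        uint32_t parent;       /* index of parent node, 0xFFFFFFFF for the root */
        uint32_t first_child;  /* index of first child, 0xFFFFFFFF if unexpanded */
        uint32_t next_sibling; /* forward link in the sibling list */
        uint32_t prev_sibling; /* backward link, making the list doubly linked */
        int32_t  score;        /* negamax score in centipawns */
        uint32_t move;         /* encoded move leading to this node */
        uint16_t visits;       /* visit counter for the best-first part */
        uint16_t flags;        /* expanded / terminal / in-flight markers */
    } GpuNode;                 /* 28 bytes -> ~38M nodes per GiB of VRAM */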

Followup:

oops, too much coffee... LIFO would not work with NNUE, but with a classic NN ;)

Eta - v0501

Recently the new neural network technique 'NNUE' took off in CPU-based chess engines like Stockfish, leveraging the vector unit of the CPU for NN inference and replacing HCE (handcrafted evaluation) with neural networks. Hence with NNUE a hybrid design with BestFirst on CPU and MiniMax search with NNUE eval on GPU seems possible and in reach. The CPU host would store and expand the game tree in memory, similar to Lc0's MCTS, and the GPU would perform shallow AlphaBeta searches (primarily quiescence-search playouts to avoid the horizon effect), similar to Lc0's MCTS playouts.

Coupling 32 gpu-threads to one worker, assuming 2K clocks per node for move generation and the AB framework, plus maybe another 2K clocks per node for NNUE inference, results in 1.44M gpu-clocks for a 36x10-node q-search. In such a design the host-device latency (aka kernel-launch overhead) of maybe 10 microseconds does not affect the overall performance. From entry-level GPUs with 512 cores (16 workers) to high-end GPUs with 5120 cores (160 workers), the throughput of such a parallel BestFirst-on-CPU and AB-playout+NNUE-eval-on-GPU design could range from ~11K to ~220K node-playouts/s, more than Lc0's gpu throughput, but with a switch from MCTS-PUCT to parallel BestFirstMiniMax search and from CNN to NNUE evaluation.
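
The back-of-the-envelope arithmetic behind that estimate, as a small sketch; the ~1.4 GHz GPU clock is an assumed value, and the result lands in the same ballpark as, not exactly at, the ~11K to ~220K figures above:

    /* Rough playout-throughput estimate; the GPU clock is an assumption. */
    #include <stdio.h>

    int main(void) {
        double clocks_per_node    = 2000.0 + 2000.0;  /* movegen+AB plus NNUE */
        double nodes_per_playout  = 36.0 * 10.0;      /* ~36x10-node q-search */
        double clocks_per_playout = clocks_per_node * nodes_per_playout; /* 1.44M */

        double gpu_hz = 1.4e9;                        /* assumed GPU clock */
        double playouts_per_worker = gpu_hz / clocks_per_playout;  /* ~970/s */

        printf("16 workers : ~%.0fK playouts/s\n",  16 * playouts_per_worker / 1e3);
        printf("160 workers: ~%.0fK playouts/s\n", 160 * playouts_per_worker / 1e3);
        return 0;
    }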

I am not into the details of current NNUE implementations for CPUs, therefore the estimated 2K gpu-clocks per node for NNUE inference is the biggest uncertainty.

I have no experience with running 16 to 160 parallel tasks via OpenCL on a GPU, and I am not sure whether 160 unique command-queues are manageable for the CPU-GPU interaction.

From Esoteric to Transcendental Chess Programming?

To backup a thread posted on TalkChess:

https://talkchess.com/forum3/viewtopic.php?f=7&t=76286

From Esoteric to Transcendental Chess Programming?
Post by smatovic » Tue Jan 12, 2021 10:09 am 

Heyho,

I do not follow SF development, but here and there I pick up a breadcrumb, for
example:

"is LVA as in MVV-LVA useless ?"

http://talkchess.com/forum3/viewtopic.php?t=70918

"...Lazy SMP feeds on chaos..."

https://talkchess.com/forum3/viewtopic.php?f=7&t=72684#p824068

So I ponder whether we have left the paradigm of esoteric chess programming, where one
has to get into the techniques, understand them, implement them, improve them, for
transcendental chess programming, "it tested better"?

If we consider that chess engines run on Turing machines, we could conclude
that everything that happens in the chess engine is traceable with pen and
paper; obviously this is not the case anymore? And I am not talking about NNs here,
just the classic approach. Hence the question: did we enter such a kind of
development, and when?

--
Srdja

Re: From Esoteric to Transcendental Chess Programming?
Post by maksimKorzh » Wed Jan 13, 2021 7:56 pm

I guess this can't be unified for all engines.
SF is a community engine with a complicated testing framework, and the way they approach it is based on these circumstances.
For engines maintained by single authors such an extended test-driven approach is not the case due to the limited resources - not everyone
would invest money into testing like Andrew Grant does.
I think what you call esoteric vs transcendental chess programming is a matter of the resources being involved.
The "new" era starts for an engine as soon as people start to invest money into its testing.
So IMO it's all a matter of development scale and goals.


Re: From Esoteric to Transcendental Chess Programming?
Post by hgm » Wed Jan 13, 2021 8:32 pm

I used to call this 'Voodoo development'. :lol:

Re: From Esoteric to Transcendental Chess Programming?
Post by Daniel Shawul » Wed Jan 13, 2021 9:32 pm

A NN is more of a black box, as one doesn't have any idea how the NN decided to evaluate one move better than the other. Classic Stockfish has shown that an effective testing framework and tuning methodology are fundamental, which btw was helpful even after Stockfish went NNUE too. Lc0 still lacks that framework and relies on testers to pick nets, for example. Some complain about recent NN/NNUE evals being “button press” solutions, but in reality this “problem” started when extensive testing was needed to verify whether an idea is +1 Elo. You basically need a group of developers to generate ideas and test them on a cluster of computers.

That a NN is a black box doesn’t matter to me as long as the methodology to train a strong net is understood.


Re: From Esoteric to Transcendental Chess Programming?
Post by smatovic » Thu Jan 14, 2021 7:18 am

Interesting, I take this as confirmation that even without NNs in chess 'we'
entered a kind of black-box level (transcendental chess programming) with our
test-driven development methods, thanks.

--
Srdja


Re: From Esoteric to Transcendental Chess Programming?
Post by JohnWoe » Sat Jan 16, 2021 5:52 pm

NNs are just massive PSQTs. The "magic mushroom era" started way earlier, testing all kinds of +1 Elo crap with massive HW power. Looking at the top programs' search functions: the same stuff, same order, even the same comments as in the SF search. That's why I'm not reading any top programs' sources. So boring. You learn nothing. Entropy is gone when products are alike.


Re: From Esoteric to Transcendental Chess Programming?
Post by smatovic » Mon Jan 25, 2021 11:02 am

Hehe...

"trascendetal chess programming"
"new era"
"Voodoo development"
"black box"
"magic mushroom era"

any other suggestions?

j.k.

It seems we are missing the right kind of terminology here for something people are well aware of, or alike?

--
Srdja

Re: From Esoteric to Transcendental Chess Programming?
Post by Henk » Mon Jan 25, 2021 11:57 am

Maybe the idiocy started with magic bitboards. Just call it perfect hashing or something like that instead of superstitious nonsense.

I also see neural networks as resignation: we can't solve the problem, so let's use neural networks as a last resort.


Re: From Esoteric to Transcendental Chess Programming?
Post by op12no2 » Mon Jan 25, 2021 12:50 pm

While a NN is a black box, I think new knowledge can emerge from it, if there is new knowledge to be had. For example, if LC0 or something like it becomes much stronger than all heuristic-based engines because it has been freed from its human heuristic evaluation roots - and it seems to have a totally original style - one could conceivably discover new knowledge by hypothesising new heuristics based on observations of LC0 game play by experts and trying them out Texel-style (for example). Or even figure out an algorithm that works through a series of structure formations and similarly try them in the same way. New knowledge could emerge.

One thing is for sure. If these new structures/heuristics exist, they are going to seem super-weird - otherwise they would have been discovered already organically.

Re: From Esoteric to Transcendental Chess Programming?
Post by smatovic » Mon Jan 25, 2021 1:15 pm

I think I agree...

https://talkchess.com/forum3/viewtopic.php?f=2&t=75606&p=875457#p869633

My intent with this post was not to judge NNs or any kind of black-box development; I am interested in this from the viewpoint of the concept of the Technological Singularity, or alike. If we assume such a thing as a TS takeoff, and we look back, when and how did this happen? Did the TS take off via NNs, or already earlier? Where was the breaking line of systems which exceed human understanding? Something like this. Hence the question about black-box development before NNs.

--
Srdja

CPU Vector Unit, the new jam for NNs...

Heyho, you are already aware of it, NNUE uses the CPU Vector Unit to boost NNs,
so here is a lil biased overview of SIMD units in CPUs...

- the terms SIMD unit and Vector Unit can be used interchangeably
- a SIMD unit executes n times the same instruction/operation on different data
- SIMD units differ in bit width, for example from 64 to 512 bits
- SIMD units differ in support for different instructions/operations
- SIMD units differ in support for different data types
- SIMD units may run with a lower frequency than the main CPU ALUs
- SIMD units increase power usage and TDP of the CPU under load

Simplified, older CPUs have 128-bit SSE units, newer ones 256-bit AVX2, ARM
mobile processors for example 128-bit NEON.

A 128-bit SSE unit can perform for example 4x 32-bit FP32 operations at once, a
256-bit AVX2 unit can perform 16x 16-bit INT16 operations at once. The broader
the bit width and the smaller the data types, the more operations you can run
at once and the more throughput you get. NNs can run inference for example with
FP16, 16-bit floating point, or also with INT8, 8-bit integer.
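
As an illustration of that kind of throughput, here is a minimal AVX2 sketch computing an INT16 dot product, the basic building block of integer NN inference; this is generic intrinsics code, not taken from any particular engine:

    /* AVX2 sketch: dot product of two int16 vectors, 16 lanes per iteration.
     * Compile with -mavx2; n is assumed to be a multiple of 16. */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int32_t dot_i16_avx2(const int16_t *a, const int16_t *b, int n) {
        __m256i acc = _mm256_setzero_si256();
        for (int i = 0; i < n; i += 16) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            /* multiply int16 pairs and add adjacent products into int32 lanes */
            acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
        }
        int32_t tmp[8], sum = 0;       /* horizontal sum of the 8 int32 lanes */
        _mm256_storeu_si256((__m256i *)tmp, acc);
        for (int i = 0; i < 8; i++) sum += tmp[i];
        return sum;
    }

    int main(void) {
        int16_t a[16], b[16];
        for (int i = 0; i < 16; i++) { a[i] = (int16_t)i; b[i] = 2; }
        printf("%d\n", dot_i16_avx2(a, b, 16));  /* 2*(0+1+...+15) = 240 */
        return 0;
    }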

Currently Intel's AVX-512 clocks down significantly under load, so there is no
speed gain from the broader bit width compared to AVX2; this may change in the
future. Also, there is a trend towards multiple Vector Units per CPU core underway.

Transhuman Chess with NN and RL...

Some people argue that the art of writing a chess engine lies in the evaluation function. A programmer gets into the expert knowledge of the domain of chess and encodes this via evaluation terms in his engine. We had the division between chess advisor and chess programmer, and with speedy computers our search algorithms were able to reach super-human level chess and outperform any human. We developed automatic tuning methods for the values of our evaluation functions, but now, with Neural Networks and Reinforcement Learning present, I wish to point out that we have entered another kind of level; I call it trans-human level chess.

If we look at the game of Go this seems pretty obvious; I recall one master calling the play of A0 "Go from another dimension". A super-human level engine still relies on handcrafted evaluation terms humans come up with (and which then get tuned), but a Neural Network is able to encode evaluation terms humans simply do not come up with, to 'see' relations and patterns we cannot see, which are beyond our scope, trans-human; and the Reinforcement Learning technique discovers lines which are as yet uncommon for humans, trans-human.

As mentioned, pretty obvious for Go, less obvious for chess, but still applicable. NNs replacing the evaluation function is just one part of the game; people will come up with NN-based pruning, move selection, reductions and extensions. What is left is the search algorithm, and we already saw the successful mix of NNs with MCTS and of classic eval with MCTS, so I am pretty sure we will see different kinds of mixtures of already known (search) techniques and upcoming NN techniques. Summing the above up, the switch is now from encoding the expert knowledge of chess in evaluation terms to encoding the knowledge into NNs and using them in a search algorithm. That is what the paradigm shift since A0 and Lc0 and recently NNUE is about, and that is the shift to what I call trans-human chess.

NNs are also called 'black boxes' because we cannot decode what the layers of weights represent in a human-readable form, so I see here some room for the classic approach: can we decode the black box and express the knowledge via handcrafted evaluation terms in our common programming languages?

Currently NNs outperform human expert systems in many domains; this is not chess- or Go-specific. But maybe the time for the question of reasoning will come, a time to decode the black boxes, or maybe the black box will decode itself, yet another level; time will tell.

Eta - v0600 - The next step for LC0?

I know, LC0's primary goal was an open-source adaptation of A0, and I am not into the Discord development discussions and alike; anyway, my 2 cents on this:

  • MCTS-PUCT search was a descendant of AlphaGo, generalized to be applied to Go, Shogi and Chess; it can utilize a GPU via batches but has its known weaknesses: tactics in the form of "shallow traps" in a row, and the endgame.
  • A CPU AB search will not work with NN on GPU via batches.
  • NNUE makes no sense on GPU.
  • LC0 has a GPU-cloud-cluster to play Reinforcement Learning games.
  • LC0 already plays at ~2400(?) Elo with a depth-1 search alone.
  • It is estimated that the NN eval is worth 4 plies of AB search.

Looking at the above points it seems pretty obvious what the next step for LC0 could be: drop the weak part, the MCTS-PUCT search, ignore AB and NNUE, and focus on what LC0 is good at. Increase the plies encoded in the NN, increase the Elo at depth-1 eval.

To take it to an extreme: drop the search part completely, increase the CNN size 1000-fold, decrease NPS from ~50K to ~50, and add multiple NNs of increasing size to be queried stepwise for time control.

Just thinking out loud...

LC0 vs. NNUE - some tech details...

- LC0 uses CNNs, Convolutional Neural Networks, for position evaluation
- NNUE is currently a kind of MLP, Multi-Layer Perceptron, with incremental updates for the first layer (see the sketch below)

- A0 originally used about 50 million neural network weights
- NNUE currently uses about 10 million weights? Or more, depending on net size

- LC0 uses an MCTS-PUCT search
- NNUE uses the Alpha-Beta search of its "host" engine

- LC0 uses the Zero approach with Reinforcement Learning on a GPU cloud cluster
- NNUE uses initial RL with the addition of SL, Supervised Learning, on engine-engine games

- LC0 runs the NN part well on GPU (up to hundreds of Vector Units) via batches
- NNUE runs on the Vector Unit of the CPU (SSE, AVX, NEON), no batches needed

Because NNUE runs a smaller kind of NN efficiently on a CPU, it gains more NPS in an AB search than previous approaches like Giraffe; you can view it in a way that it combines both worlds, the LC0 NN part and the SF AB search part, on a CPU.
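
A minimal sketch of the incremental first-layer update mentioned in the list above: the first-layer output (the "accumulator") is patched by subtracting the weight columns of features a move removes and adding those it introduces, instead of recomputing the whole layer. Feature count, accumulator width and weight layout are made-up illustrative values, not the actual Stockfish NNUE code.

    /* Sketch of NNUE-style incremental accumulator updates.
     * Sizes and feature indexing are made-up for illustration. */
    #include <stdint.h>

    #define NFEATURES 41024   /* assumed sparse input feature count */
    #define ACCSIZE   256     /* assumed first-layer (accumulator) width */

    typedef struct {
        int16_t acc[ACCSIZE];                 /* first-layer pre-activations */
    } Accumulator;

    /* first-layer weights: one column of ACCSIZE values per input feature */
    extern int16_t ft_weights[NFEATURES][ACCSIZE];

    /* a move deactivates some features and activates others; patch the
     * accumulator in place instead of recomputing the full first layer */
    void update_accumulator(Accumulator *a,
                            const int *removed, int nremoved,
                            const int *added,   int nadded) {
        for (int i = 0; i < nremoved; i++)
            for (int j = 0; j < ACCSIZE; j++)
                a->acc[j] -= ft_weights[removed[i]][j];
        for (int i = 0; i < nadded; i++)
            for (int j = 0; j < ACCSIZE; j++)
                a->acc[j] += ft_weights[added[i]][j];
    }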

Eta - v0600

Okay, let's do a time-warp jump back to the year 2008 and figure out how we could have used the hardware back then for a neural network based chess engine.

Reinforcement Learning on a GPU cluster is probably a no-go (the Titan supercomputer with 18,688 K20Xs went into operation in 2012), so we stick with Supervised Learning from a database of quality games or alike. A neural network as used in A0, with ~50 million parameters, queried by an MCTS-PUCT-like search at ~80 knps is also not doable; we had only ~336 GFLOPS on an Nvidia 8800 GT back then, compared to ~108 TFLOPS on an RTX 2080 Ti via Tensor Cores nowadays. So we have to skip the MCTS-PUCT part and rethink the search.

Instead of going for NPS, we could build a really big CNN, but the memory on a GPU back then was only about 512 MB, so we stick with ~128 mega parameters. So we have to split the CNN, for example by piece count: let us use 30 distinct neural networks indexed by piece count, so we get an accumulated ~3840 mega parameters; that sounds already better. Maybe this would already be enough to skip the search part and do only a depth-1 search with NN eval. If not, we could split the CNN further, layer by layer, inferred via different waves on the GPU, loaded layer-wise from disk to GPU memory via PCIe or alike, and hence increase the total number of parameters...

So what is the drawback if we could run a CNN with several billion parameters? Obviously the training of such a monster: not only the horsepower needed to train it, but the training data, the games. A0 used about 40 million RL games to reach top-notch computer chess level, for only ~50 million parameters; the ChessBase Mega Database contains ~8 million quality games... so we simply do not have enough games to train such a CNN monster via Supervised Learning. We would rely on Reinforcement Learning, and therefore on some kind of GPU cluster to play RL games... nowadays, and also back in 2008.
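
The memory arithmetic behind the ~128M-parameters-per-net and ~3840M-total figures, as a small sketch assuming 4-byte FP32 weights (the small offset to the numbers above is just decimal vs. binary megabytes):

    /* Parameters-vs-VRAM arithmetic, assuming 4-byte FP32 weights. */
    #include <stdio.h>

    int main(void) {
        double vram_bytes       = 512.0 * 1024 * 1024;  /* ~512 MB GPU RAM in 2008 */
        double bytes_per_weight = 4.0;                  /* FP32 */
        double params_per_net   = vram_bytes / bytes_per_weight;
        double nets             = 30.0;                 /* split by piece count */
        printf("per net : ~%.0fM parameters\n", params_per_net / 1e6);
        printf("30 nets : ~%.0fM parameters\n", nets * params_per_net / 1e6);
        return 0;
    }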

I see...

I see three ways which neural networks for chess may take in the future...

1. With more processing power available, the network size will rise, and we will have really big nets on one side of the extreme, which drop the search algorithm part and perform only a depth-1 search for evaluation.

2. With neural network accelerators with less latency, we will see engines with multiple, smaller neural networks, which perform deeper AlphaBeta searches on the other side.

3. Something in between 1. and 2.

Eta - v0500

Another solution would be to perform a Best-First-MiniMax search on the CPU and do the ANN evaluation on the GPU. I could couple the nodes of a qsearch at the leaf nodes to be evaluated in one batch to gain some nps... that's pretty much how A0 and LC0 work.
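
A minimal sketch of that batching idea: collect the q-search leaf positions into a buffer and hand them to the GPU in one evaluation call. The position encoding and both helper functions are assumed placeholders, not real Eta or LC0 code.

    /* Sketch: gather q-search leaves and evaluate them as one GPU batch. */
    #include <stddef.h>
    #include <stdint.h>

    #define BATCH_SIZE 256

    typedef struct {
        uint64_t bitboards[8];   /* assumed compact position encoding */
    } Position;

    extern size_t collect_qsearch_leaves(Position *out, size_t max);           /* assumed */
    extern void   gpu_eval_batch(const Position *pos, float *score, size_t n); /* assumed */

    void evaluate_leaves(void) {
        Position batch[BATCH_SIZE];
        float    scores[BATCH_SIZE];
        size_t n = collect_qsearch_leaves(batch, BATCH_SIZE);
        if (n > 0)
            gpu_eval_batch(batch, scores, n);   /* one host-device round trip */
        /* ...scores[i] would then be backed up into the game tree... */
    }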

Eta - v0401 - nested parallelism

Running 1024 threads per worker will probably not work, due to register size limitations. With the OpenCL 2.x feature 'nested parallelism' (device-side enqueue) it could be possible to run one thread for the best-first part, which calls another kernel with 64 threads for move generation and another kernel with 1024 threads for ANN inference. But current Nvidia and older AMD devices support only OpenCL 1.x, so this is not a real option.
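
For reference, a rough sketch of how OpenCL 2.x device-side enqueue (nested parallelism) looks in kernel code; the kernel names, work sizes and bodies are placeholders, this is only the generic OpenCL 2.0 block syntax:

    /* OpenCL C 2.x sketch: a one-work-item parent kernel enqueueing child
     * kernels with more work-items. Names, sizes and bodies are placeholders. */

    void movegen_body(__global int *tree)    { /* assumed: generate moves */ }
    void ann_infer_body(__global float *net) { /* assumed: run NN inference */ }

    __kernel void bestfirst_parent(__global int *tree, __global float *net)
    {
        queue_t q = get_default_queue();

        /* child kernel: 64 work-items for move generation */
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(64),
                       ^{ movegen_body(tree); });

        /* child kernel: 1024 work-items for ANN inference */
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(1024),
                       ^{ ann_infer_body(net); });
    }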

Eta - v0302 - batches

LC0 uses batch sizes of 256 or 512 to utilize a GPU. I did a quick bench with 256 positions to be evaluated per run...

4096 nps on Nvidia GTX 750
16640 nps on AMD Fury X

Note that an NN cache could double these values, but this is still far less than I could achieve when doing all computations directly on the GPU device, w/o host-device interaction.

And waiting for 256 positions to be evaluated at once is against the serial nature of AlphaBeta search...

Eta - v0301 - host-device latencies

One reason GPUs are not used as accelerators for chess engines is the host-device latency.

Afaik the latencies are in the range of 5 to 10, or even hundreds, of microseconds, so you can get at most ~200K kernel calls per second per thread (1 / 5 µs), even if the GPU is able to process its task much faster.

Therefore Eta v0300, a CPU-based AlphaBeta search with the GPU as ANN accelerator, is flawed by design.

Eta - v0301

Back to the CPU-based AlphaBeta search with GPU ANN evaluation.

On an Nvidia GTX 750 I achieve about 2 Knps with one single CPU thread, and up to ~20 Knps with 256 parallel CPU threads.

This sounds far too slow for an AlphaBeta search...

Eta - v0400 - benchs

Okay, some further, not so quick n dirty, benchmarks showed

~240 nps for Nvidia GTX 750 and
~120 nps for AMD Fury X

per worker.

I assume about 200 nps per worker on modern GPUs.

While an NN cache could double these values, this is imo a bit too slow for the intended search algorithm: considering about 36x10 qsearch positions on average per expanded node, one worker would need about a second to get a node score.

Back to pen n paper.

Eta - v0400 - Feature List

wip...will take some time...

* GPGPU device based
- host handles only the IO; search and ANN inference run on the gpu
- gpu computation will be limited by node count to about 1 second per
  repeated iteration, to avoid any system timeouts

* parallel BestFirstMiniMax-Search on gpu
- game tree in gpu memory
- best node selected via score + UCT formula (visit count based), see the sketch below
- AlphaBeta Q-Search performed at leaf nodes to get a node score

* multiple small MLP neural networks
- about 4 million weights per network
- 30 networks in total, split by piece count

* trained via TD-leaf from pgn games
- 6/7-men EGTBs could be used for training?

* 64 gpu threads are coupled to one worker
- used during move gen, move pick and ANN eval in parallel
- depending on gpu core count, from 64 to 2048 workers in total

Some quick and dirty benchmarks showed that with this design ~1 Knps per worker is possible.
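
A sketch of the "score + UCT formula (visit count based)" selection from the feature list above; the exploration constant and the way the two terms are mixed are assumptions for illustration:

    /* Child selection via node score plus a visit-count-based UCT term.
     * The exploration constant and score scaling are assumed values. */
    #include <math.h>

    typedef struct UctNode {
        struct UctNode **child;
        int    nchild;
        double score;     /* node score from the parent's point of view */
        int    visits;
    } UctNode;

    #define UCT_C 1.4     /* assumed exploration constant */

    /* pick the child maximizing score + C * sqrt(ln(parent visits)/child visits) */
    UctNode *select_child(const UctNode *parent) {
        UctNode *best = NULL;
        double best_val = -1e30;
        for (int i = 0; i < parent->nchild; i++) {
            UctNode *c = parent->child[i];
            double u = (c->visits == 0)
                     ? 1e30  /* always try unvisited children first */
                     : c->score + UCT_C * sqrt(log((double)parent->visits)
                                               / (double)c->visits);
            if (u > best_val) { best_val = u; best = c; }
        }
        return best;
    }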

Eta - v0200

This was an attempt to use Zeta v099, a GPU AlphaBeta-search with hundreds of parallel workers, with ANNs. The overall nps throughput looked good, but the parallel AlphaBeta-search is not able to make efficient use of up to thousands of workers.

Eta - Changelog

Here is an overview of what happened before...

Eta (0700)

* BestFirstMiniMax-Search on CPU with NNUE eval on CPU

Eta (0600)

* CNN monster with billions of parameters w/o search relies on ~billions of RL games

Eta (0500)

* parallel BestFirstMiniMax-Search on CPU with ANN evaluation on GPU

Eta (0400)

* parallel BestFirstMiniMax-Search on GPU with ANN evaluation on GPU

Eta (0300)

* CPU based AlphaBeta search with GPU ANN eval

Eta (0200)

* fork of Zeta v099 but with neural networks

Eta (0100)

* fork of Zeta v098 but with neural networks

Eta - a neural network based chess engine

Since I read the paper about NeuroChess by Sebastian Thrun, I have pondered how to improve his results.

It was obvious that the compute power available in the 90s limited his approach, in training and in inference.

So he had only 120K games for training and a relatively small neural network, and could test his approach only with limited search depths.

Recent results with A0 and LC0 show how Deep Learning methods profit from GPGPU, so I think the time has come to give a GPU ANN based engine a try...

--
Srdja
