Eta - v0302 - batches
LC0 uses batch sizes of 256 or 512 to utilize a GPU, so I did a quick bench with 256 positions evaluated per run:
4096 nps on Nvidia GTX 750
16640 nps on AMD Fury X
Note that an NN cache could double these values, but this is still far less than I could achieve when doing all computations directly on the GPU device, without host-device interaction.
And waiting for 256 positions to be evaluated at once goes against the serial nature of AlphaBeta search...
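
To put a rough number on that waiting time, the bench figures above can be converted into batches per second and time per batch. This is only a back-of-the-envelope sketch: it assumes steady throughput with one batch in flight at a time (no overlapping of batches), and the function names are made up for illustration.

```python
BATCH_SIZE = 256  # positions per run, as in the bench above

# Measured throughput in nodes per second, taken from the bench above.
MEASURED_NPS = {
    "Nvidia GTX 750": 4096,
    "AMD Fury X": 16640,
}


def batches_per_second(nps, batch_size=BATCH_SIZE):
    """Number of full batches evaluated per second at the given nps."""
    return nps / batch_size


def batch_latency_ms(nps, batch_size=BATCH_SIZE):
    """Time in milliseconds for one batch, assuming batches run
    strictly one after another (an assumption, not a measurement)."""
    return 1000.0 * batch_size / nps


if __name__ == "__main__":
    for gpu, nps in MEASURED_NPS.items():
        print(f"{gpu}: {batches_per_second(nps):.0f} batches/s, "
              f"{batch_latency_ms(nps):.1f} ms per batch")
```

Under these assumptions the GTX 750 completes 16 batches per second (62.5 ms per batch) and the Fury X 65 batches per second (about 15.4 ms per batch), so a serial search would stall for roughly that long before getting any evaluation back.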