LC0 uses batch sizes of 256 or 512 to utilize a GPU,
so I did a quick benchmark with 256 positions evaluated per run...

4096 nps on Nvidia GTX 750
16640 nps on AMD Fury X

Note that the NN cache could double these values,
but this is still far less than I could achieve when doing all computations
directly on the GPU device, without host-device interaction.

And waiting for 256 positions to be evaluated at once is against the serial
nature of AlphaBeta search...
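
To put the waiting time in numbers, here is a rough sketch that turns the nps
figures above into time per batch. It assumes the nps values are plain
throughput with full batches of 256, processed one after another with no
overlap; the function name is just for illustration.

```python
# Rough arithmetic on the benchmark figures quoted above.
# Assumption: nps is pure throughput and batches run back to back.
BATCH_SIZE = 256  # positions evaluated per run, as in the benchmark


def batch_latency_ms(nps: float, batch_size: int = BATCH_SIZE) -> float:
    """Milliseconds spent on one batch at the given throughput."""
    return batch_size / nps * 1000.0


results = {"Nvidia GTX 750": 4096, "AMD Fury X": 16640}

for gpu, nps in results.items():
    batches_per_sec = nps / BATCH_SIZE
    print(f"{gpu}: {batches_per_sec:.0f} batches/s, "
          f"{batch_latency_ms(nps):.1f} ms per batch")
```

Under those assumptions the search has to wait about 62.5 ms (GTX 750) or
about 15.4 ms (Fury X) before it sees the first evaluation of a batch.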