Okay, some further, not so quick n dirty, benchmarks showed

~240 nps for Nvidia GTX 750 and
~120 nps for AMD Fury X

per worker.

I assume on modern gpus about 200 nps per worker.

While NN cache could be able to double these values, this is imo a bit too slow for the intended search algorithm, considering about 36x10 qsearch positions on average per expanded node, one worker would need about a second to get a node score.

Back to pen n paper.