Okay, some further, not so quick-n-dirty benchmarks showed:
~240 nps for Nvidia GTX 750 and
~120 nps for AMD Fury X
I assume about 200 nps per worker on modern GPUs.
While an NN cache might double these values, this is imo still a bit too slow for the intended search algorithm. Considering about 36x10 qsearch positions on average per expanded node, one worker would need about a second to get a node score.
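
For anyone who wants to redo the arithmetic, here is a rough back-of-the-envelope sketch in Python. It assumes each qsearch position costs exactly one evaluation at the quoted nps rate and that the NN cache acts as a flat 2x multiplier. Both are simplifications, and the function and parameter names are made up for illustration.

```python
# Rough estimate: seconds one worker needs to score one expanded node.
# Assumption: every qsearch position costs one evaluation at the given nps.
# Assumption: the NN cache is modeled as a flat throughput multiplier.

def seconds_per_node(nps, positions_per_node=36 * 10, cache_speedup=1.0):
    """Time to evaluate all qsearch positions of one expanded node."""
    return positions_per_node / (nps * cache_speedup)


if __name__ == "__main__":
    # nps figures taken from the benchmarks above; 200 is the assumed value.
    for label, nps in [("GTX 750", 240), ("Fury X", 120), ("assumed modern GPU", 200)]:
        plain = seconds_per_node(nps)
        cached = seconds_per_node(nps, cache_speedup=2.0)
        print(f"{label}: {plain:.2f} s without cache, {cached:.2f} s with 2x cache")
```

At the assumed 200 nps this gives roughly 1.8 s per node without the cache and roughly 0.9 s with an optimistic 2x cache, which is where the "about a second" comes from.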
Back to pen n paper.