Another solution would be to perform a Best-First-MiniMax search on the CPU and do the ANN evaluation on the GPU. I could collect the nodes of a qsearch at the leaf nodes and evaluate them in one batch to gain some nps... that's pretty much how A0 and LC0 work.
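
A rough sketch of the batching idea, just to illustrate it. Everything here is made up for the example: `dummy_net` stands in for the ANN on the GPU, `BatchEvaluator` and `batch_size` are hypothetical names, and a position is just a list of feature values. The point is only that the CPU search hands over many leaf positions at once, so the network is called once per batch instead of once per node.

```python
from typing import Callable, List, Sequence

Position = Sequence[float]  # hypothetical: a position encoded as feature values


def dummy_net(batch: List[Position]) -> List[float]:
    """Stand-in for the ANN on the GPU: one call scores a whole batch."""
    return [float(sum(features)) for features in batch]


class BatchEvaluator:
    """Scores collected leaf positions in as few network calls as possible."""

    def __init__(self, net: Callable[[List[Position]], List[float]],
                 batch_size: int) -> None:
        self.net = net
        self.batch_size = batch_size
        self.calls = 0  # number of network calls made so far

    def evaluate(self, leaves: Sequence[Position]) -> List[float]:
        """Return one score per leaf, in the same order as the input."""
        scores: List[float] = []
        # Split the collected leaves into chunks of at most batch_size
        # and send each chunk to the network in a single call.
        for start in range(0, len(leaves), self.batch_size):
            chunk = list(leaves[start:start + self.batch_size])
            scores.extend(self.net(chunk))
            self.calls += 1
        return scores


# Leaves collected by the CPU search (e.g. the nodes of one qsearch).
leaves = [[1, 2], [3, 4], [0, 0], [1, 1], [5, 5]]
evaluator = BatchEvaluator(dummy_net, batch_size=4)
scores = evaluator.evaluate(leaves)
```

With five leaves and a batch size of 4 this makes two network calls instead of five. In a real engine the search would have to keep expanding other nodes while a batch is in flight, which this sketch does not show.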