Recently the new neural network technique 'NNUE' took off on CPU based chess engines like Stockfish leveraging the vector unit of a CPU for NN inference, replacing HCE (handcrafted evaluation) with neural-networks. Hence with NNUE a hybrid design with BestFirst on CPU and MiniMax-Search with NNUE eval on GPU seems possible and in reach. The CPU-host would store and expand the game tree in memory, similar to Lc0's MCTS, the GPU would perform shallow AlpaBeta-searches (primarily quiscence-search playouts to avoid the horizon effect), similar to Lc0's MCTS-playouts.

Coupling 32 gpu-threads to one worker, assuming 2K clocks per node for move generation and AB-framework, additionally maybe 2K clocks per node for NNUE inference, results in 1.44M gpu-clocks for an 36x10 nodes q-search. In such an design the host-device-latency (aka. kernel-launch-overhead) of maybe 10 microseconds does not affect the overall performance. From entry-level GPUs with 512 cores (16 workers) to high-end-gpus with 5120 cores (160 workers) the throughput of such an parallel BestFirst on CPU and AB-playout+NNUE-eval on GPU design could range from ~11K to ~220K node-playouts/s, more than Lc0's gpu throughput but with a switch from MCTS-PUCT to parallel BestFirstMiniMax-Search and CNN to NNUE evaluation.

I am not into the details of current NNUE implementations for CPUs, therefore the estimated 2K gpu-clocks per node for NNUE inference is the biggest uncertainty.

I have no experience with running 16 to 160 parallel tasks via OpenCL on GPU, not sure if 160 unique command-queues are handable with CPU-GPU interaction.