Running 1024 threads per worker will probably not work because of the limited register file: the more threads a work-group holds, the fewer registers are left for each thread. With the OpenCL 2.x feature 'nested parallelism' (device-side kernel enqueue) it might be possible to run a single thread for the best-first search, which launches one kernel with 64 threads for move generation and another kernel with 1024 threads for ANN inference. However, current Nvidia devices and older AMD devices support only OpenCL 1.x, so this is not a realistic option.
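
To make the register argument concrete, here is a rough back-of-the-envelope sketch. The figures are illustrative assumptions, not measurements of any particular device: a register file of 65536 32-bit registers per compute unit and a hardware cap of 255 registers per thread. The function name `registers_per_thread` is made up for this example.

```python
def registers_per_thread(register_file_size: int, threads: int, per_thread_cap: int) -> int:
    """Upper bound on registers per thread when `threads` threads are resident at once."""
    return min(register_file_size // threads, per_thread_cap)


# Assumed, illustrative figures; check the vendor documentation for a real device.
REGISTER_FILE = 65536   # 32-bit registers per compute unit (assumption)
PER_THREAD_CAP = 255    # maximum registers a single thread may use (assumption)

# Work-group sizes mentioned above: best-first, move generation, ANN inference.
for threads in (1, 64, 1024):
    budget = registers_per_thread(REGISTER_FILE, threads, PER_THREAD_CAP)
    print(f"{threads:5d} threads -> at most {budget} registers per thread")
```

Under these assumptions a 1024-thread work-group leaves each thread at most 64 registers, while a 64-thread work-group can use the full per-thread cap. A kernel that needs more registers than its budget would either spill to slower memory or fail to launch at that work-group size.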