Running 1024 threads per worker will probably not work, because the
register file is too small to give each thread enough registers. With the
OpenCL 2.x feature 'nested parallelism' (device-side enqueue) it could be
possible to run one thread for the best-first search, which launches
another kernel with 64 threads for move generation and another kernel with
1024 threads for ANN inference. But current Nvidia and older AMD devices
support only OpenCL 1.x, so this is not a real option.
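
The register limit can be illustrated with some rough arithmetic. The
sketch below assumes a register file of 65536 32-bit registers per compute
unit. That figure is an assumption for illustration, not a number from this
post, and the function names are hypothetical. The real budget depends on
the device and on how many registers the compiler assigns to the kernel.

```python
# Assumption: 65536 32-bit registers per compute unit (illustrative only).
REGISTER_FILE_SIZE = 65536


def max_registers_per_thread(register_file_size: int, threads_per_group: int) -> int:
    """Registers available to each thread if the whole work-group must be resident."""
    return register_file_size // threads_per_group


def max_threads_per_group(register_file_size: int, registers_per_thread: int) -> int:
    """Largest work-group that fits if every thread needs the given register count."""
    return register_file_size // registers_per_thread


if __name__ == "__main__":
    # Register budget per thread for the work-group sizes discussed above.
    for threads in (64, 256, 1024):
        budget = max_registers_per_thread(REGISTER_FILE_SIZE, threads)
        print(f"{threads} threads -> {budget} registers per thread")
```

Under this assumption a 1024-thread work-group leaves at most 64 registers
per thread, so a kernel that needs more than that either fails to launch at
that size or spills registers to slower memory.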