Running 896 GPU threads as one worker, with multiple workers in parallel, does not
pay off. There is a 12x slowdown when increasing from 64 to 896 threads.
Actually this seems only natural, but I had thought the extra threads would
somehow hide the memory latency better.
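For reference, a minimal timing sketch of the kind of comparison I did, written here in CUDA terms (the kernel body, names, and buffer size are made up for illustration, not my actual code): the same memory-bound kernel launched once with 64 and once with 896 threads per block, timed with CUDA events.

// Rough sketch: compare 64 vs 896 threads per block on a memory-bound dummy kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch_memory(const float* in, float* out, int n)
{
    // Grid-stride loop; each thread mostly waits on global memory reads,
    // so the runtime is dominated by memory latency rather than arithmetic.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = in[i] * 2.0f + 1.0f;
}

static float time_launch(int threads_per_block, const float* d_in, float* d_out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blocks = (n + threads_per_block - 1) / threads_per_block;

    cudaEventRecord(start);
    touch_memory<<<blocks, threads_per_block>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 24;  // 16M floats, ~64 MB per buffer
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    printf("64 threads/block:  %.3f ms\n", time_launch(64, d_in, d_out, n));
    printf("896 threads/block: %.3f ms\n", time_launch(896, d_in, d_out, n));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

In a toy benchmark like this the two block sizes usually come out close, which is exactly why the 12x slowdown in my real workload surprised me; the difference there presumably comes from how the work is split per worker, not from the launch configuration alone.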