Running 896 GPU threads as one worker, with multiple workers in parallel, does not pay off: there is a 12x slowdown when going from 64 to 896 threads. In hindsight this seems natural, but I had expected the additional threads to hide the memory latency better.
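
For reference, here is a minimal sketch (not my actual code) of how the comparison can be reproduced, assuming a "worker" corresponds to a CUDA thread block and the kernel is memory-bound; the kernel name, array size, and block count are placeholders:

```cuda
// Hypothetical micro-benchmark: time the same memory-bound kernel with
// 64 vs 896 threads per block and compare elapsed times.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float *data, int n) {
    // Each thread strides over the array (grid-stride loop).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride)
        data[i] = data[i] * 2.0f + 1.0f;
}

static float time_launch(float *d_data, int n, int threadsPerBlock, int blocks) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    touch<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 24;  // placeholder problem size: 16M floats
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Same total work, only the threads-per-block ("worker size") changes.
    printf(" 64 threads/block: %.3f ms\n", time_launch(d_data, n, 64, 128));
    printf("896 threads/block: %.3f ms\n", time_launch(d_data, n, 896, 128));

    cudaFree(d_data);
    return 0;
}
```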