fib isn't really a parallelizable algorithm, since each number depends on the two before it, and your implementation does worse than nothing. Theoretically, if setting up call stacks is slow, it could be faster to compute the lower Fibonacci numbers while the higher ones are still setting up their call stacks. I actually implemented this as a test, because I'm looking for a CL job (:waves:). On my machine at least, this parallelized version isn't any better, which is no surprise.
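To make the dependency concrete, here is a minimal sketch of an iterative Fibonacci in Common Lisp. It is not the dfib or pdfib measured in this answer; `fib-iter` is a hypothetical name used only for illustration. Every step needs the result of the step before it, so there is no independent chunk of work to hand to a second thread.

```lisp
;; Hypothetical illustration; not the dfib/pdfib from this answer.
(defun fib-iter (n)
  "Return the Nth Fibonacci number, with F(0) = 0 and F(1) = 1."
  (let ((a 0)   ; holds F(i)
        (b 1))  ; holds F(i+1)
    (dotimes (i n a)
      ;; The new pair is built from the whole previous pair, so
      ;; iteration i+1 cannot start before iteration i has finished.
      (psetf a b
             b (+ a b)))))

(fib-iter 10) ; => 55
```

Any threaded variant still has to respect that chain somewhere, which is why the extra threads mostly end up waiting on each other.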
The result: even with 40,000, the standard dfib beats pdfib. pdfib uses about 120% CPU on average. I could probably partition the work across threads better, but it's pretty much useless to try to race call stack creation against the computation.
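If you want to repeat the comparison yourself, something along these lines should do. This is only a sketch: `measure` is a hypothetical helper, and it assumes you have your own definitions of `dfib` and `pdfib` loaded. It uses only standard Common Lisp timing functions. Whether the run-time figure includes CPU time from other threads depends on the implementation, so treat the CPU number as a rough guide.

```lisp
;; Hypothetical helper; DFIB and PDFIB are assumed to be defined elsewhere.
(defun measure (fn &rest args)
  "Call FN on ARGS. Return elapsed real seconds and CPU seconds as two values."
  (let ((real-start (get-internal-real-time))
        (run-start  (get-internal-run-time)))
    (apply fn args)
    ;; Both results are rationals; wrap them in FLOAT for printing if needed.
    (values (/ (- (get-internal-real-time) real-start)
               internal-time-units-per-second)
            (/ (- (get-internal-run-time) run-start)
               internal-time-units-per-second))))

;; Usage, with N being whatever input you are testing:
;;   (measure #'dfib n)
;;   (measure #'pdfib n)
```

CPU seconds divided by real seconds gives the CPU utilization; a value above 1 means more than one core was busy, which is what a figure like 120% indicates.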