I will not waste a lot time explaining how to use Fork/Join in a program since that is already covered well at: Fork/Join tutorial
However, this is what I did to test it out.
Disclaimer: Java's time calculation is surely debatable and there are quite a few talks out there that criticize micro bench marking as well. So please keep that in mind before expecting the results to be accurate in a production system.
First the (imaginary) problem
Assumptions:
- Add operation is expensive
Problem:
- Input: the program is given an array of numbers
- Output: the program must compute the summation of the given numbers
The program first iterates over an array of SlowNumber objects and calculates the sum of it. The time needed to perform this operation is what is later in the post referred to as Serial Execution Time. 
After that, the program:
- Breaks down the array of SlowNumber objects into smaller arrays with start and end indexes
- Passes these smaller segments into separate forks
- Keeps a small segment for itself to calculate the sum
- Gets the sums calculated by the child-tasks and adds them together with the sum found in the previous step
This time needed by the whole process of calculating the sum in parallel is referred to as Parallel Execution Time later in the post.
I have created a SlowNumber class that allows us to add another SlowNumber to it. Every add operation sleeps for a certain amount of time simulating a time consuming operation.
private class SlowNumber {
private Long value = 0L;
//
public void add(final SlowNumber number) {
sleep(TIME_PER_OPERATION_IN_MILLIS);
value += number.value;
}
// rest of the class
}Now the interesting part. We have a SlowAdd task that extends RecursiveTask<slownumber>. The compute method checks if the given number of SlowNumber objects are more than a certain size. If so, it simply forks more SlowAdd tasks and divides the array of SlowNumber objects into smaller parts keeping the last set of numbers for itself.
Here is how the first part of the compute method in SlowAdd looks like:
class SlowAdd extends RecursiveTask{ 
//...
@Override
protected SlowNumber compute() {
List<ForkJoinTask<SlowNumber>> childAddTasks = new ArrayList<>();
// start from fromIndex
int computeBeginning = fromIndex;
// if more numbers than NUMBERS_PER_FORK
if (toIndex - fromIndex > NUMBERS_PER_FORK) {
// create forks with own fromIndex and toIndex to perform add operation
for (int i = fromIndex; i < toIndex; i += NUMBERS_PER_FORK) {
if (i + NUMBERS_PER_FORK >= toIndex) {
// ok, have less or equal to NUMBERS_PER_FORK number of items, so don't fork
computeBeginning = i;
break;
}
SlowAdd slowAdd = new SlowAdd(numbers, i, i + NUMBERS_PER_FORK);
childAddTasks.add(slowAdd);
slowAdd.fork();
}
}
// ...
// now perform addition on own share of numbers
SlowNumber total = new SlowNumber();
for (int i = computeBeginning; i < toIndex; i++) {
total.add(numbers[i]);
}
if (!childAddTasks.isEmpty()) {
for (ForkJoinTaskchildAddTask : childAddTasks) { 
// add results created by children
total.add(childAddTask.join());
}
}
return total;
}
// ...
}
Now to execute the SlowAdd task, all we need to do is:
SlowAdd slowAdd = new SlowAdd(slowNumbers, 0, slowNumbers.length);
ForkJoinPool addTaskPool = new ForkJoinPool(i + 1);
final SlowNumber parallelValue = addTaskPool.invoke(slowAdd);
After running the example code quite a few times on a HP EliteBook 8540w running Ubuntu 11.04, the program outputs something very similar to what follows:
Expected total: 164, took: 333 ms
Found slowAdd total: 164, number of CPU: 1, took: 348 ms
Speed up by: 95.0%
Found slowAdd total: 164, number of CPU: 2, took: 313 ms
Speed up by: 106.0%
Found slowAdd total: 164, number of CPU: 3, took: 172 ms
Speed up by: 193.0%
Found slowAdd total: 164, number of CPU: 4, took: 132 ms
Speed up by: 252.0%
Note that, the delay per add operation is set to 10ms. Reducing the delay to only 1ms results into something as follows:
Expected total: 164, took: 36 ms
Found slowAdd total: 164, number of CPU: 1, took: 42 ms
Speed up by: 85.0%
Found slowAdd total: 164, number of CPU: 2, took: 34 ms
Speed up by: 105.0%
Found slowAdd total: 164, number of CPU: 3, took: 19 ms
Speed up by: 189.0%
Found slowAdd total: 164, number of CPU: 4, took: 15 ms
Speed up by: 240.0%
Still a 240% performance improvement despite the overload of breaking the input into smaller pieces, creating new tasks and switching of tasks by the task-manager.
Here is a chart I made for a visual understanding of the difference:

Thought, this micro benchmark looks straight forward enough to motivate all of us to convert all our existing code to use Fork/Join, it should be noted that the problem I chose to demonstrate here is straight forward enough to be broken down into smaller problems which can be executed independently in parallel.
If you think about the popular Map/Reduce problem, then the mapping part can definitely be performed in parallel using fork/join approach to speed up the performance. However, if a set of tasks are interdependent and depends on each others results, then the outcome will surely be different.
In the worst case, if a set of tasks are dependent in a way that results in a linear dependency graph i.e. A -> B -> C -> D, then no matter how many CPUs are available, the tasks will always be executed serially. In such cases, Fork/Join can even decrease the performance of the application since the tasks will be executed serially but with the overhead of the task scheduler.
The first two columns in every bar-chart proves that when the parallel execution has only one CPU, it actually decreases the performance a little. The benefit of parallel execution is only visible when there are more than one CPU.
You can download the full source code here.
Feel free to share your opinion.