Microsoft has now made available a parallel library for .NET. It's still in its early stages, and its aim is to make it easier to write applications that take advantage of multiple cores. Pfx really has two layers: one offering a relatively high level of abstraction, providing Parallel.For, Parallel.Do and parallel LINQ, and a lower level layer allowing programmers to schedule individual Tasks.
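As a taste of the high level layer, the parallel LINQ side hangs off an AsParallel() extension method. A minimal sketch, assuming the CTP's PLINQ surface (the exact set of supported operators may vary):

using System;
using System.Linq;

class PlinqSketch
{
    static void Main()
    {
        int[] numbers = new int[1000000];
        for (int i = 0; i < numbers.Length; i++) numbers[i] = i;

        // AsParallel() opts the query into PLINQ; the Where/Select work
        // is then spread across the available cores.
        long[] squaresOfEvens = numbers.AsParallel()
                                       .Where(n => n % 2 == 0)
                                       .Select(n => (long)n * n)
                                       .ToArray();

        Console.WriteLine(squaresOfEvens.Length);
    }
}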
This has been a long time coming; .NET support for concurrency is pretty primitive even in .NET 3.0. Simple things like kicking off N tasks and waiting for them all to complete require a reasonable amount of code, and on the surface Pfx looks to assist here. I spent a couple of days over Christmas looking at it and understanding how it works.
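To give a feel for the boilerplate involved today, "kick off N tasks and wait for them all" in plain .NET 2.0/3.0 comes out something like the sketch below (one way of many, using the thread pool and an event; the names are mine):

using System;
using System.Threading;

class FanOutAndWait
{
    static void Main()
    {
        const int N_TASKS = 4;
        int outstanding = N_TASKS;
        using (ManualResetEvent allDone = new ManualResetEvent(false))
        {
            for (int i = 0; i < N_TASKS; i++)
            {
                int taskId = i; // capture a copy of the loop variable per task
                ThreadPool.QueueUserWorkItem(delegate
                {
                    Console.WriteLine("Task {0} running", taskId);
                    // the last task to finish releases the waiting thread
                    if (Interlocked.Decrement(ref outstanding) == 0)
                        allDone.Set();
                });
            }
            allDone.WaitOne(); // block until all N tasks have completed
        }
    }
}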
So the first task was to find some code I wanted to parallelise. I frequently use a piece of logic I stole from the book "How Long Is a Piece of String?" as a good async demo: it runs many iterations of the series pi/4 = 1 - 1/3 + 1/5 - 1/7 + ... to work out the value of PI. So I thought I would have a go at parallelising this piece of code.
private const int N_ITERATIONS = 1000000000;

private static double CalcPi()
{
    double pi = 1.0;
    double multiply = -1;
    for (int nIteration = 3; nIteration < N_ITERATIONS; nIteration += 2)
    {
        pi += multiply * (1.0 / (double)nIteration);
        multiply *= -1;
    }
    return pi * 4.0;
}
Looking at the algorithm it doesn't look that complicated: all I need to do is make each core work on a different part of the for statement, combine the results from each core, and multiply by 4.0. In fact I've done this using the basic BeginInvoke functionality before, so I know it scales (see the sketch below)...
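For reference, that pre-Pfx approach looks roughly like this: one delegate per core, each handed its own slice of the series via BeginInvoke, with EndInvoke collecting the partial sums. A sketch only; the names and slicing arithmetic are mine, and remainder handling is omitted:

using System;

class BeginInvokePi
{
    delegate double PartialPi(int start, int end);

    // Sum the series terms for odd denominators in [start, end).
    static double SumSlice(int start, int end)
    {
        double sum = 0.0;
        for (int n = start; n < end; n += 2)
        {
            // terms alternate: -1/3, +1/5, -1/7, ...
            double sign = ((n - 3) / 2) % 2 == 0 ? -1.0 : 1.0;
            sum += sign * (1.0 / n);
        }
        return sum;
    }

    static double CalcPiBeginInvoke(int iterations, int cores)
    {
        PartialPi worker = SumSlice;
        IAsyncResult[] pending = new IAsyncResult[cores];
        // keep each slice even so every slice starts on an odd denominator;
        // any remainder is ignored in this sketch
        int slice = (iterations / cores) & ~1;
        for (int i = 0; i < cores; i++)
        {
            int start = 3 + i * slice;
            pending[i] = worker.BeginInvoke(start, start + slice, null, null);
        }
        double pi = 1.0;
        for (int i = 0; i < cores; i++)
            pi += worker.EndInvoke(pending[i]); // blocks until that slice is done
        return pi * 4.0;
    }
}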
So I first opted for the high level approach offered by Pfx, Parallel.For, which is stated to spread loop iterations across multiple cores. In other words, nIteration in my case could be 3 on one core and 5 on another at the same time.
private static double CalcPiSimpleParallel()
{
    double pi = 1.0;
    object sync = new object();
    //for (int nIteration = 3; nIteration < 1000000000; nIteration += 2)
    Parallel.For<double>(0, N_ITERATIONS / 2, 1, () => { return 0; }, (nIteration, localPi) =>
    {
        double multiply = 1;
        // 3 5 7 9 11
        if (nIteration % 2 == 0)
        {
            multiply = -1;
        }
        localPi.ThreadLocalState += multiply * (1.0 / (3.0 + (double)nIteration * 2.0));
    },
    (localPi) => { lock (sync) { pi += localPi; } });
    return pi * 4.0;
}
Unfortunately this did not achieve the results I had hoped for... it in fact took twice as long... Bizarre... OK, perhaps not that surprising once you know it is using delegates. Using Reflector you can see that Parallel.For is built on the underlying Task class provided by Pfx.
A Task is a piece of work that you give to a task manager, which queues it for an available thread. The first thing to mention is that tasks do not run on thread pool threads; a task manager has its own pool of threads, which it controls based on the number of available cores and whether the currently running tasks are blocked.
Task task = Task.Create((o) =>
{
    Console.WriteLine("Task Running..");
});
task.Wait();
The code above creates a new task, which will run on one of the default Task Manager's threads; the main thread then waits for its completion.
Parallel.For uses a special variant of Task called a replicating task: when you create this task the task manager will effectively create not a single instance of the task but as many instances as there are cores. Each of these task clones is then expected to work in harmony to achieve the overall task. A typical implementation might therefore have an outer loop looking for a new piece of work to do in the overall task; when there is no more work, it simply ends the task.
In the case of a Parallel.For you could expect the outer loop to simply perform an Interlocked.Increment on the loop counter until the required value is reached, and inside the loop to invoke the required piece of work. And here lies the problem in my example: the piece of work executing inside the for loop is extremely trivial and doesn't take many cycles, so the overhead of invoking a delegate for each iteration has a large impact on the overall performance.
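Conceptually each clone would run something like the sketch below. This is my reconstruction of the idea, not the CTP's actual code, but it makes the per-iteration delegate cost obvious:

using System;
using System.Threading;

class ReplicaSketch
{
    // What each replica of a naive Parallel.For task would do:
    // atomically claim the next iteration, run the body delegate, repeat.
    static void ForReplicaBody(ref int counter, int toExclusive, Action<int> body)
    {
        while (true)
        {
            int i = Interlocked.Increment(ref counter) - 1; // claim the next iteration
            if (i >= toExclusive)
                return; // no work left, so this replica ends
            body(i);    // one delegate invocation per iteration - this is the overhead
        }
    }
}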
So a first word of warning when using Pfx: you need to make sure that the piece of work inside a given task is of a reasonable size for it to actually scale. OK, so it's not as trivial as it first looks to parallelise the calc Pi method. Not put off, I refactored the code to give the algorithm an inner loop, increasing the amount of work for each parallel iteration. To start with I simply broke it down into ten blocks, thinking each block would now be a reasonable amount of work and we would only be paying the overhead of a delegate invocation 10 times, as opposed to N_ITERATIONS/2 times.
private static double CalcPiDualLoop()
{
    double pi = 1.0;
    object sync = new object();
    int stride = N_ITERATIONS / 2 / 10;
    //for (int nIteration = 3; nIteration < 1000000000; nIteration += 2)
    Parallel.For<double>(0, N_ITERATIONS / 2, stride, () => { return 0; }, (nIteration, localPi) =>
    {
        double multiply = 1;
        // 3 5 7 9 11
        if (nIteration % 2 == 0)
        {
            multiply = -1;
        }
        double piVal = 0;
        for (int val = nIteration; val < nIteration + stride; val++)
        {
            piVal += multiply * (1.0 / (3.0 + (double)val * 2.0));
            multiply *= -1;
        }
        localPi.ThreadLocalState += piVal;
    },
    (localPi) => { lock (sync) { pi += localPi; } });
    return pi * 4.0;
}
But only partial luck: I got less than a 10% speed up... not what I was expecting. So I had another look at the implementation of Parallel.For to understand why, and digging into it you realise that each of the tasks created by Parallel.For doesn't just take the next value in the loop to work on, but takes the next eight values. What this means in my case, on a 2-core CPU with ten blocks, is that the first task to run takes the first 8 iterations of the loop, and the second task gets just 2. The second task completes its work, and since there is no more, it ends, leaving the first task to complete its remaining 6 iterations on one core, leaving my other core cold.
To make sure I wasn't seeing things I made the number of iterations of the Parallel.For a multiple of 8 * the number of cores, and hey presto it took just over half the time.
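So the workaround, for now, is to shape the loop around the chunking. Something like this (nBlocks is my own name, and the 8 simply reflects the current CTP behaviour, which may well change):

// sixteen blocks on my 2-core box: a multiple of 8 * cores, so the CTP's
// chunks of 8 divide evenly between the replicated tasks
int nBlocks = 8 * Environment.ProcessorCount;
int stride = N_ITERATIONS / 2 / nBlocks;
// ...then call Parallel.For exactly as in CalcPiDualLoop above, with this stride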
I can see that taking chunks of work at a time reduces contention for work allocation, but I'm struggling to see why this is necessary, since the contention will not be that great, especially given that to make Parallel.For work efficiently each iteration needs to be of a reasonable size anyway. I really hope they address this issue when they finally release the library, since having to know how many cores you are running on, and the magic number 8, just seems rather odd and goes against what I believe they were trying to achieve.
I also implemented the method using the low level parallel task library, and that scales beautifully too...
private static double ParallelCalcPi()
{
    object sync = new object();
    int nIteration = 3;
    double pi = 1;
    int stride = N_ITERATIONS / 4;
    Task task = Task.Create((o) =>
    {
        Console.WriteLine("Running..{0}", Thread.CurrentThread.ManagedThreadId);
        double localPi = 0.0;
        double multiply = -1;
        long end;
        while ((end = Interlocked.Add(ref nIteration, stride)) <= N_ITERATIONS + 3)
        {
            //Console.WriteLine("Next Chunk {0} from {1} to {2}", Thread.CurrentThread.ManagedThreadId,
            //    end - stride, end);
            for (int val = (int)(end - stride); val < end; val += 2)
            {
                localPi += multiply * (1.0 / (double)val);
                multiply *= -1;
            }
        }
        lock (sync) { pi += localPi; }
    }, TaskCreationOptions.SelfReplicating);
    task.Wait();
    return pi * 4.0;
}
First impressions are that the Task library seems OK, but Parallel.For seems ill thought out. I have also done some experiments with Parallel.Do, which allows me to kick off multiple tasks by supplying a series of Action delegates as parameters. Parallel.Do then blocks waiting for them to complete. Unfortunately no timeout parameter is present, which means it waits forever; so if an action goes into a rogue state and never completes, the thread that invoked Parallel.Do will hang. Not desirable... So my second request is that they add a timeout parameter.
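For completeness, this is all there is to a Parallel.Do call: you pass the actions as parameters and it only returns once every one of them has (the bodies here are just illustrative):

Parallel.Do(
    () => Console.WriteLine("First action on {0}", Thread.CurrentThread.ManagedThreadId),
    () => Console.WriteLine("Second action on {0}", Thread.CurrentThread.ManagedThreadId));
// if either action never returns, neither does Parallel.Do - there is no
// overload taking a timeout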