Last week I was teaching DevelopMentor's Effective .NET 2 course; as part of that course we spend at least a day looking at various patterns for building multi-threaded applications. With the rise of multi-core machines it is becoming increasingly important to write algorithms that can scale with the availability of new cores. One of the topics we cover is thread safety, and I typically write an application that simply performs an i++ operation on multiple threads. On a single core this does not typically expose a problem, but on a multi-core machine we rarely get the value we expect, since i++ is in fact multiple CPU instructions (load, increment, store), e.g.
One thread could load i into a register at the same time as another thread does; each then increments and stores, so i ends up incremented by 1 and not by 2. To solve this problem we have a variety of synchronization techniques at our disposal, some lighter weight than others:
Interlocked.Increment, Monitor.Enter/Exit and OS synchronization primitives like Mutex
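The lost-update race and the Interlocked fix can be sketched as follows (a minimal sketch, not the exact code used in class — the thread and iteration counts are assumptions):

```csharp
using System;
using System.Threading;

class LostUpdateDemo
{
    static int unsafeCounter;
    static int safeCounter;

    static void Main()
    {
        const int threads = 10, iterations = 100000;
        var workers = new Thread[threads];

        for (int t = 0; t < threads; t++)
        {
            workers[t] = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                {
                    unsafeCounter++;                        // load/inc/store: updates can be lost
                    Interlocked.Increment(ref safeCounter); // atomic read-modify-write
                }
            });
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();

        // safeCounter always equals threads * iterations;
        // on a multi-core machine unsafeCounter usually comes up short
        Console.WriteLine("unsafe: {0}, safe: {1}", unsafeCounter, safeCounter);
    }
}
```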
In the stress test I conducted in class I had multiple threads updating a single integer...First I demonstrated on a multi-core machine that i indeed didn't have the correct value after 10 threads had attempted to increment it 10000 times each. I then moved to using Interlocked.Increment and now the result was as expected, but it was slower than a simple i++. All well and good so far...I then moved to using Monitor.Enter and Monitor.Exit, and to my amazement that took pretty much the same time as Interlocked...so, as all engineers do, we ran it again because that result was surely just a glitch...but after numerous runs it kept coming out the same...When I first developed this demo I did so on a single-core machine, and this was its first outing on a dual core, so what went wrong? All my multi-threading life I've been told that interlocked operations are far cheaper than the heavier-weight mutex style of synchronization. I then re-ran the demo with CPU affinity set to a single CPU, and got the results I would have expected, with Interlocked being at least an order of magnitude quicker than the monitor.
During the lab break I stepped back and had a think about what could have caused this. The light eventually went on: all the threads were in fact sharing a common variable, and with dual core each core is going to be attempting to cache this value. This has a very negative effect on the cache. Why? Well, the cores try to maintain cache coherency by marking parts of the cache dirty when they update them, forcing the other core to reload its data from main memory...If this wasn't the case we would have had the wrong numbers again in our example...So, OK, possibly what you would have expected...however, the process of marking a piece of memory as dirty is less granular than marking a single word. The core will mark what's called a cache line as dirty, meaning any data on the same cache line as the value being updated is effectively marked as invalid.
So imagine two integers next to each other in memory. Thread A increments integer one and Thread B increments integer two. From a high-level programming language perspective this is perfect: each thread has its own private resource, there is no need to synchronise, and the threads therefore run freely...The perfect threaded app...however, not so fast: if both integers occupy the same cache line, we take a performance hit...
To demonstrate this fact I wrote some code that simply implements a parallel i++. There are three scenarios:
- Interlocked.Increment( ref i )
- Interlocked.Increment( ref ManyCounters[Thread.CurrentThread.ManagedThreadId ] )
- Interlocked.Increment( ref ManyCounters[Thread.CurrentThread.ManagedThreadId * 10000 ] )
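A minimal sketch of how such a harness might look (the original code isn't shown here, so the array size, iteration count and timing scaffolding are assumptions; indexing by ManagedThreadId assumes small thread ids, as in the scenarios above):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class CacheLineDemo
{
    const int Iterations = 10000000;
    static int sharedCounter;
    // Large enough that the sparse case can stride thread slots ~40KB apart
    static readonly int[] ManyCounters = new int[1 << 20];

    // Runs 'body' Iterations times on each of 'threadCount' threads and times it
    static TimeSpan Run(int threadCount, Action body)
    {
        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
            threads[t] = new Thread(() => { for (int i = 0; i < Iterations; i++) body(); });
        var sw = Stopwatch.StartNew();
        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join();
        return sw.Elapsed;
    }

    static void Main()
    {
        foreach (int n in new[] { 1, 2 })
        {
            Console.WriteLine("Shared counter, {0} threads: {1}", n,
                Run(n, () => Interlocked.Increment(ref sharedCounter)));
            // Per-thread counters, but likely adjacent on one cache line
            Console.WriteLine("Dense counters, {0} threads: {1}", n,
                Run(n, () => Interlocked.Increment(
                    ref ManyCounters[Thread.CurrentThread.ManagedThreadId])));
            // Per-thread counters spread far enough apart to land on separate lines
            Console.WriteLine("Sparse counters, {0} threads: {1}", n,
                Run(n, () => Interlocked.Increment(
                    ref ManyCounters[Thread.CurrentThread.ManagedThreadId * 10000])));
        }
    }
}
```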
Whilst the last two variants will not produce the correct total, the middle version shows that although the threads are not sharing the same variable, the fact that the counters probably all live on the same cache line is the crucial factor. Below are the results of running the code using 1 and 2 threads. In the first case, with the high level of contention, we see that a single thread would have been more efficient.
Single shared counter with 1 Threads took 00:00:00.2065516
Single shared counter with 2 Threads took 00:00:00.4303586
In the second case, with multiple counters, we still take a performance hit even though the counters are different ints and thus we are not incrementing the same location.
Multiple counters with 1 Threads took 00:00:00.2184554
Multiple counters with 2 Threads took 00:00:00.5177658
Only in this last case, where we ensure the counters are not on the same cache line, do we see both cores being used efficiently.
Multiple Sparse Counters ( ~4k apart ) with 1 Threads took 00:00:00.2180101
Multiple Sparse Counters ( ~4k apart ) with 2 Threads took 00:00:00.1603323
So what does all this mean? Well, it certainly shows that just because your code has no contention at the high-level programming language level, it doesn't mean it will have no contention when the code finally meets the hardware. This is obviously a big issue, and more so with virtualisation in .NET: how do I know the size of the cache lines, or how my data will be laid out in memory...Computing just got fun again.....Code
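On the cache-line-size question, there is no managed API for it (at least not in the .NET of this era), but on Windows it can be queried via P/Invoke. A sketch, assuming the Win32 GetLogicalProcessorInformation API; the struct definitions are abbreviated to just the fields needed here:

```csharp
using System;
using System.Runtime.InteropServices;

class CacheLineSize
{
    [StructLayout(LayoutKind.Sequential)]
    struct CACHE_DESCRIPTOR
    {
        public byte Level;
        public byte Associativity;
        public ushort LineSize;
        public uint Size;
        public int Type;
    }

    // The native struct ends in a 16-byte union; overlay the cache arm on it
    [StructLayout(LayoutKind.Explicit, Size = 16)]
    struct PROCESSOR_INFO_UNION
    {
        [FieldOffset(0)] public CACHE_DESCRIPTOR Cache;
        [FieldOffset(0)] public ulong Reserved1;
        [FieldOffset(8)] public ulong Reserved2;
    }

    [StructLayout(LayoutKind.Sequential)]
    struct SYSTEM_LOGICAL_PROCESSOR_INFORMATION
    {
        public UIntPtr ProcessorMask;
        public int Relationship;   // 2 == RelationCache (winnt.h)
        public PROCESSOR_INFO_UNION Info;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetLogicalProcessorInformation(
        [Out] SYSTEM_LOGICAL_PROCESSOR_INFORMATION[] buffer, ref uint returnLength);

    static void Main()
    {
        uint length = 0;
        GetLogicalProcessorInformation(null, ref length); // first call reports needed size
        var buffer = new SYSTEM_LOGICAL_PROCESSOR_INFORMATION[
            length / (uint)Marshal.SizeOf(typeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION))];
        GetLogicalProcessorInformation(buffer, ref length);

        foreach (var info in buffer)
            if (info.Relationship == 2 && info.Info.Cache.Level == 1)
                Console.WriteLine("L1 cache line size: {0} bytes", info.Info.Cache.LineSize);
    }
}
```

Knowing the line size (typically 64 bytes on x86/x64) lets you pad or stride per-thread data so that hot counters never share a line.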