I enjoy asking an interview question about threading and queues. Candidates usually attempt to solve the problem using one of two Windows synchronization primitives: either the Event or the Semaphore. Sadly, I see far more attempts using Events than Semaphores. Sometimes I'll ask, "which is faster on Windows, a semaphore or an event?" The answer is "it depends", but most people don't say that. Anyway, it is my opinion that Events are awful both as a programming model and from a performance standpoint. Let's find out.
To find out, I had to create a little benchmark. Of course, any benchmark is contrived, and this one is no different. Let's create a set of N producers and N consumers, each putting items onto and getting items from a shared queue. Then let's implement essentially the same queue using each of 3 primitives: a semaphore, a manual-reset event, and an auto-reset event.
I ran each of these 3 algorithms with 8 concurrent consumers and producers through 20,000 iterations (I took 5 samples, threw out the high and low, etc.):
Algorithm            | Latency | Max Queue Depth
Semaphore            | 284.6ms | ~10K
Event (auto-reset)   | 456.2ms | ~150K
Event (manual reset) | 297.2ms | ~10K
So the conclusion is that for this test, events and semaphores are pretty close. The semaphore does provide a slight (<10%) performance improvement over the event.
But it also shows that there is clearly a right and a wrong way to use the event: auto-reset is bad, and manual-reset is good. I suspect this is largely an artifact of the benchmark itself; the benchmark is greedy, with all producers adding items to the queue as quickly as possible. Since there is usually something on the queue, the manual-reset implementation rarely needs to reset the event. I weakly attempted to add a delay to mix things up, but even a modest delay (1ms) makes all of these algorithms behave nearly identically unless you crank the thread count through the roof.
Next, I decided to increase the thread count of producers and consumers to 32 instead of 8. Now we get this:
Algorithm            | Latency  | Max Queue Depth
Semaphore            | 949.6ms  | ~20K
Event (auto-reset)   | 1615.6ms | ~600K
Event (manual reset) | 621.8ms  | ~165K
Now all you Event fans will tell me to eat crow; clearly the Event is faster, right? Not quite! In this case, the manual-reset event code is cheating. Each time a single element is added to the queue, many consumers awaken because they all see the event as set; since it's a greedy benchmark, this amounts to predicting the future: an element will probably be ready by the time the consumer gets through the lock. It's coincidental to the benchmark that this works, but it is interesting to see. Also note that the max queue depth now reveals a different problem: the semaphore is reasonably "fair" compared to the Event-based implementations.
I still dislike events, however. And if you don't yet agree, take a look at the code. The Semaphore code is incredibly simple compared to the Event-based code. (Is there a simplification I could have made?) There is no doubt in my mind that most experienced programmers will botch the Event-based algorithm. I'm reasonably confident these solutions are correct, but I wouldn't be completely surprised if some smart reader posts that my implementation is flawed too.
There is one place where Events are useful: if you need synchronization whose state changes only once (e.g. from unset to set), events are pretty cool. In the driver for this test program, I used an event to start the test after warming up all the threads. Events are great for that. Just don't ever use them in loops!
Lastly, I’m sure I left out something important. If you’ve got another idea about this, please post a comment! But don’t bark about my lack of error checking or huge queue lengths. I did that on purpose.
Here is the source code to the Semaphore-based Queue:
#pragma once

#include <windows.h>
#include <assert.h>
#include <stdio.h>

class QueueUsingSemaphore {
 public:
  explicit QueueUsingSemaphore(int max_items)
      : head_(0),
        tail_(0),
        count_items_(0),
        max_items_(0),
        queue_max_(max_items) {
    items_ = new int[queue_max_];
    InitializeCriticalSection(&lock_);
    semaphore_ = CreateSemaphore(NULL, 0, queue_max_, NULL);
  }

  ~QueueUsingSemaphore() {
    delete[] items_;
    assert(head_ == tail_);
    printf("Max was %d\n", max_items_);
  }

  int Get() {
    int rv;
    WaitForSingleObject(semaphore_, INFINITE);
    EnterCriticalSection(&lock_);
    assert(head_ != tail_);
    rv = items_[head_];
    head_ = (head_ + 1) % queue_max_;
    LeaveCriticalSection(&lock_);
    return rv;
  }

  void Put(int val) {
    EnterCriticalSection(&lock_);
    items_[tail_] = val;
    tail_ = (tail_ + 1) % queue_max_;
    assert(head_ != tail_);  // Queue is overflowing.
    int count = tail_ - head_;
    if (count < 0)
      count += queue_max_;
    if (count > max_items_)
      max_items_ = count;
    LeaveCriticalSection(&lock_);
    ReleaseSemaphore(semaphore_, 1, NULL);
  }

 private:
  HANDLE semaphore_;
  CRITICAL_SECTION lock_;
  int* items_;
  int head_;
  int tail_;
  int count_items_;
  int max_items_;
  int queue_max_;
};
Here is the Event based code (manual reset):
#pragma once

#include <windows.h>
#include <assert.h>
#include <stdio.h>

class QueueUsingManualEvent {
 public:
  explicit QueueUsingManualEvent(const int max_items)
      : head_(0),
        tail_(0),
        count_items_(0),
        max_items_(0),
        queue_max_(max_items) {
    items_ = new int[queue_max_];
    InitializeCriticalSection(&lock_);
    event_ = CreateEvent(NULL, TRUE, FALSE, NULL);  // Manual-reset event.
  }

  ~QueueUsingManualEvent() {
    delete[] items_;
    assert(head_ == tail_);
    printf("Max was %d\n", max_items_);
  }

  int Get() {
    bool got_one = false;
    int rv;
    while (!got_one) {
      WaitForSingleObject(event_, INFINITE);
      // Note: it is possible to get here and have the queue be empty.
      EnterCriticalSection(&lock_);
      if (head_ == tail_) {
        LeaveCriticalSection(&lock_);
        continue;
      }
      got_one = true;
      rv = items_[head_];
      head_ = (head_ + 1) % queue_max_;
      if (head_ == tail_)
        ResetEvent(event_);  // The queue is now empty.
      LeaveCriticalSection(&lock_);
    }
    return rv;
  }

  void Put(int val) {
    EnterCriticalSection(&lock_);
    items_[tail_] = val;
    tail_ = (tail_ + 1) % queue_max_;
    assert(head_ != tail_);  // Queue is overflowing.
    int count = tail_ - head_;
    if (count < 0)
      count += queue_max_;
    if (count > max_items_)
      max_items_ = count;
    // It's tempting to move this outside the lock. However,
    // unlike the semaphore, this cannot be moved outside the
    // lock because the threads are intentionally racing
    // on the event notification.
    SetEvent(event_);
    LeaveCriticalSection(&lock_);
  }

 private:
  HANDLE event_;
  CRITICAL_SECTION lock_;
  int* items_;
  int head_;
  int tail_;
  int count_items_;
  int max_items_;
  int queue_max_;
};
Here is the source to the driver code:
#pragma once

#include <windows.h>
#include <stdio.h>

#include "queue_semaphore.h"
#include "queue_event.h"
#include "queue_event_manual.h"

class Driver {
 public:
  Driver() : queue_(kMaxItems) {
    start_event_ = CreateEvent(NULL, TRUE, FALSE, NULL);
  }

  void StartThreads() {
    for (int i = 0; i < kMaxProducers; i++)
      threads_[i] = CreateThread(NULL, 0, ProducerMain, this, 0, NULL);
    for (int i = 0; i < kMaxConsumers; i++)
      threads_[kMaxProducers + i] =
          CreateThread(NULL, 0, ConsumerMain, this, 0, NULL);
    Sleep(500);  // Let the threads get ready.
    start_ticks_ = GetTickCount();
    SetEvent(start_event_);
  }

  void WaitForFinish() {
    for (int i = 0; i < kMaxProducers + kMaxConsumers; i++)
      WaitForSingleObject(threads_[i], INFINITE);
    stop_ticks_ = GetTickCount();
  }

  int TimeTaken() { return stop_ticks_ - start_ticks_; }

  void Producer() {
    WaitForSingleObject(start_event_, INFINITE);
    for (int i = 0; i < kMaxIterations; i++)
      queue_.Put(i);
  }

  void Consumer() {
    WaitForSingleObject(start_event_, INFINITE);
    for (int i = 0; i < kMaxIterations; i++)
      int x = queue_.Get();
  }

 private:
  static DWORD WINAPI ProducerMain(void* self) {
    Driver* driver = static_cast<Driver*>(self);
    driver->Producer();
    return 0;
  }

  static DWORD WINAPI ConsumerMain(void* self) {
    Driver* driver = static_cast<Driver*>(self);
    driver->Consumer();
    return 0;
  }

  static const int kMaxProducers = 32;
  static const int kMaxConsumers = 32;
  static const int kMaxIterations = 20000;
  static const int kMaxItems = (kMaxProducers + kMaxConsumers) * kMaxIterations;

  HANDLE start_event_;
  //QueueUsingSemaphore queue_;
  //QueueUsingEvent queue_;
  QueueUsingManualEvent queue_;
  HANDLE threads_[kMaxProducers + kMaxConsumers];
  DWORD start_ticks_;
  DWORD stop_ticks_;
};
Great post. A couple of notes:
Raymond Chen points out that Events are normally the wrong approach here. If all your consumers are so busy that two items get added to the queue in rapid succession, it’s possible that one of them will be missed. This is definitely a problem with the auto-reset event. It looks like your manual implementation is covering for that.
http://blogs.msdn.com/oldnewthing/archive/2006/06/22/642849.aspx
Your comment about Sleep(1) causing all of the implementations to perform identically may be because sleeping one millisecond is almost equivalent to just giving up the rest of your timeslice. As a result, the bottleneck may be the overhead of the constant context switches.
And lastly, it’s possible–in some cases–to eliminate the CriticalSection from the semaphore approach, making it even faster. If you didn’t have to deal with queue wrap around and you had a maximum number of elements that could ever be queued, you could use Interlocked instructions to move head_ and tail_ without any critical sections. I use this technique in my ray tracer.
Thanks. I definitely agree that events are always the wrong approach here. The reason I post this at all is that I've asked this question in interviews hundreds of times, and about half of the respondents attempt to use an event!
I believe these implementations overcome the raciness of events by reposting the event (for auto-reset) or by manually resetting it (for manual-reset). Both approaches are horrible, tricky, and completely avoidable if you use a semaphore (or a condition variable, for the non-Windows folks).
The sleep problem is just that the idle time in the time slice greatly overshadows the minimal computation done in this silly benchmark. If you cranked up the number of threads, you could continue to test with a saturated CPU.
As to your last point; yes you could use the Interlocked* family to replace the critical sections with a better designed data structure. For this test, where the Events demonstrate unfairness between producing and consuming, the queue sometimes needed to be *very* large 🙂