Lecture from 13.05.2025 | Video: Video ETHZ
Summary/Recap Last Lecture
In our previous session, we covered several advanced locking concepts:
- Optimistic & Lazy List-Based Sets: We explored fine-grained locking strategies for linked lists, moving from basic hand-over-hand locking to optimistic locking (lock-free traversal, lock-then-validate) and lazy synchronization (using markers for logical deletion, enabling wait-free contains).
- Lazy Skip Lists: Introduced skip lists as a practical, probabilistic alternative to balanced trees for concurrent sets, suitable for applying lazy/optimistic locking techniques.
- Lock-Free Introduction & CAS: Defined lock-freedom and wait-freedom, contrasting them with blocking synchronization, and re-introduced Compare-and-Swap (CAS) as the key atomic primitive for building lock-free algorithms.
Learning Goals Today
Today, we will delve into:
- Lock-Free Unbounded Queue: Implementing a queue without locks, suitable for scenarios like OS schedulers, using sentinel nodes.
- Reuse and the ABA Problem: Understanding how memory reuse (common in systems without garbage collection or using object pools) can break naive lock-free algorithms relying on CAS.
- Avoiding the ABA Problem: Examining techniques like Pointer Tagging and Hazard Pointers to mitigate the ABA issue.
Lock-Free Unbounded Queue
Queues are fundamental in many systems, especially operating systems and runtime environments, for managing tasks, threads, or requests.
Motivation: A Lock-Free Operating System Kernel
Schedulers lie at the heart of operating systems, moving tasks between states (ready, running, waiting) often implemented using queues. Scheduling decisions happen frequently (thread creation/end, blocking/unblocking).
These internal kernel/runtime data structures must be protected against concurrent access from multiple threads running on different cores, and potentially even from interrupt service routines. While spinlocks are conventionally used, their blocking nature can cause problems (priority inversion, convoying, etc.).
Therefore, implementing scheduler queues (and other core structures) in a lock-free manner is highly desirable for performance, scalability, and responsiveness. This requires a lock-free unbounded queue algorithm. However, OS kernels typically cannot rely on automatic Garbage Collection (GC) and must manage memory explicitly, often reusing queue nodes. This reuse creates a significant challenge for lock-free algorithms: the ABA problem.
Basic Queue Node Structure
We start with a simple node containing an item and a next pointer.
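As a minimal sketch (the field names `item` and `next` follow the lecture's description; the class shape is an assumption):

```java
// A queue node: holds one item and a pointer to the next node.
class Node<T> {
    final T item;
    Node<T> next;   // plain pointer; the lock-free version will need an atomic one

    Node(T item) {
        this.item = item;
    }
}
```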
Blocking Unbounded Queue (Recap)
For reference, a simple blocking queue using synchronized provides mutual exclusion for enqueue and dequeue operations, handling the head and tail pointers. However, this suffers from the general drawbacks of locking.
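A blocking version can be sketched as follows (a minimal sketch, not the slide code; one intrinsic lock guards both `head` and `tail`):

```java
// Blocking unbounded queue: synchronized gives mutual exclusion for
// enqueue and dequeue, so head/tail updates never race.
class BlockingQueue<T> {
    private static class Node<T> {
        final T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }

    private Node<T> head, tail;   // both null when the queue is empty

    public synchronized void enqueue(T item) {
        Node<T> node = new Node<>(item);
        if (tail == null) {
            head = tail = node;   // first element
        } else {
            tail.next = node;     // link behind the current tail...
            tail = node;          // ...then advance tail (safe: we hold the lock)
        }
    }

    public synchronized T dequeue() {
        if (head == null) return null;   // empty
        T item = head.item;
        head = head.next;
        if (head == null) tail = null;   // queue became empty
        return item;
    }
}
```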
Challenges for Lock-Free Queues
Implementing this queue lock-free is tricky because operations involve multiple pointers:
- Enqueue: Modifies `tail.next` and `tail`.
- Dequeue: Modifies `head` and reads `head.next`.

These operations need to appear atomic, but a single CAS operates on only one memory location.
Idea: Sentinel Node
To simplify edge cases (especially handling the empty queue), a common technique is to always have a dummy “sentinel” node at the front of the queue. The head pointer always points to the sentinel. The queue is empty when head.next is null (or, more robustly, when head == tail). The actual first element is head.next.
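The sentinel setup can be sketched like this (a sketch under assumed names; the lecture's code may differ):

```java
// Queue with a sentinel: head and tail both start at a dummy node,
// so head is never null and the empty case needs no special pointers.
class SentinelQueue<T> {
    static class Node<T> {
        final T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }

    Node<T> head, tail;

    SentinelQueue() {
        Node<T> sentinel = new Node<>(null);  // dummy node, carries no item
        head = tail = sentinel;
    }

    // Empty exactly when the sentinel has no successor.
    boolean isEmpty() {
        return head.next == null;
    }
}
```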
Enqueue with Sentinel
- The new node is added after the node currently pointed to by `tail`. Requires updating `tail.next`.
- The `tail` pointer itself needs to be updated to point to the newly added node.

These two steps (updating `tail.next` and `tail`) are ideally atomic but involve two separate memory locations.
Dequeue with Sentinel
- The item to be dequeued is in the node after `head` (i.e., `head.next`). The value is read from `head.next.item`.
- The `head` pointer needs to be updated to point to the dequeued node (`head = head.next`), which becomes the new sentinel. This involves reading `head.next` and updating `head`.
Does the sentinel solve the atomicity problem? No. Enqueue still needs to update two pointers (tail.next and tail). This can lead to a transient inconsistency: a thread might update tail.next but get delayed before updating tail. During this time, tail doesn’t point to the true last node. Waiting for consistency reintroduces blocking (“locking camouflaged”). The solution involves helping: threads detect inconsistency and help complete the pending operation (like advancing tail) before proceeding.
Lock-Free Queue Node with AtomicReference
To implement the queue lock-free, we need atomic operations on the pointers. We use AtomicReference for the head, tail, and the next pointer within each Node.
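The atomic fields can be sketched as follows (class names are assumptions; the pattern of wrapping each mutable pointer in an `AtomicReference` is the point):

```java
import java.util.concurrent.atomic.AtomicReference;

// Lock-free node: next is an AtomicReference so it can be CAS'ed.
class Node<T> {
    final T item;
    final AtomicReference<Node<T>> next = new AtomicReference<>(null);
    Node(T item) { this.item = item; }
}

// Queue skeleton: head and tail are AtomicReferences, both starting
// at the same sentinel node.
class LockFreeQueueFields<T> {
    final AtomicReference<Node<T>> head, tail;

    LockFreeQueueFields() {
        Node<T> sentinel = new Node<>(null);
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
    }
}
```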
Michael & Scott Lock-Free Queue Protocol
This describes the widely used Michael & Scott lock-free queue algorithm:
Enqueuer Protocol:
- Read the current `tail` into a local variable `last`.
- Read `last.next` into `next`.
- (Consistency Check) If `tail` still points to `last`:
  a. If `next` is `null` (meaning `last` is the actual end): try to atomically link the `new_node` using `CAS(last.next, null, new_node)`.
  b. If the CAS in (a) succeeds, try to swing the `tail` pointer: `CAS(tail, last, new_node)`. This second CAS is a "helping" step; it's okay if it fails (another thread might have already done it), but attempting it helps maintain progress.
  c. If `next` is not `null` (meaning `tail` is lagging behind the actual end): help another thread by trying to swing the tail forward: `CAS(tail, last, next)`.
- (Retry) If `tail` changed between reading `last` and the consistency check, or if the primary `CAS(last.next, ...)` failed due to a race, retry the entire process.
Why might the CAS(tail, last, new_node) fail? Either another enqueuer already updated tail, or the thread performing the CAS died before completing it.
Dequeuer Protocol:
- Read `head` into `first` and `tail` into `last`.
- Read `first.next` into `next`.
- (Consistency Check) If `head` still points to `first`:
  a. If `first == last` (head and tail point to the same node, possibly the sentinel):
     i. If `next == null`, the queue is truly empty. Return null/failure.
     ii. If `next != null`, the queue is in a transient intermediate state (an enqueue finished linking but didn't update tail yet). Help by trying `CAS(tail, last, next)`. Then retry the dequeue.
  b. If `first != last` (queue appears non-empty):
     i. Read the value from the node to be dequeued: `value = next.item`.
     ii. Try to swing the `head` pointer forward: `CAS(head, first, next)`.
     iii. If the CAS succeeds, return `value`.
     iv. If the CAS fails (another dequeue happened), retry the entire process.
- (Retry) If `head` changed between reading `first` and the consistency check, retry.
Why might CAS(head, first, next) fail? Another dequeuer successfully executed its CAS first.
A specific inconsistency: Thread A enqueues the first element (node) by setting S.next = node but gets delayed before setting tail = node. Thread B then dequeues S (the sentinel) by setting head = node. Now tail (still pointing to S) points to a dequeued node. The helping mechanism (tail.compareAndSet(last, next) in both enqueue and dequeue) addresses this.
The enqueue implementation follows this protocol, including the helping step (else tail.compareAndSet(last, next)).
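The enqueue protocol above can be sketched in Java as follows (a sketch, not the slide code; class and method names are assumptions):

```java
import java.util.concurrent.atomic.AtomicReference;

// Michael & Scott enqueue: one CAS links the node, a second (helping)
// CAS swings the tail and is allowed to fail harmlessly.
class MSQueue<T> {
    static class Node<T> {
        final T item;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T item) { this.item = item; }
    }

    final AtomicReference<Node<T>> head, tail;

    MSQueue() {
        Node<T> sentinel = new Node<>(null);
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
    }

    public void enqueue(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last == tail.get()) {                        // consistency check
                if (next == null) {                          // last is the real end
                    if (last.next.compareAndSet(null, node)) {
                        tail.compareAndSet(last, node);      // helping step; may fail
                        return;
                    }
                    // CAS on last.next failed: another enqueuer won; retry
                } else {
                    tail.compareAndSet(last, next);          // help a lagging enqueuer
                }
            }
        }
    }
}
```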
The dequeue implementation follows this protocol, including the check for empty vs. intermediate state (if (first == last)) and the helping step.
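The dequeue protocol can likewise be sketched as below (enqueue is included so the example is self-contained; names are assumptions, not the lecture's exact code):

```java
import java.util.concurrent.atomic.AtomicReference;

// Michael & Scott queue with dequeue: distinguishes "truly empty" from the
// transient state where an enqueue has linked a node but not moved tail yet.
class LockFreeQueue<T> {
    static class Node<T> {
        final T item;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T item) { this.item = item; }
    }

    final AtomicReference<Node<T>> head, tail;

    LockFreeQueue() {
        Node<T> sentinel = new Node<>(null);
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
    }

    public void enqueue(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last == tail.get()) {
                if (next == null) {
                    if (last.next.compareAndSet(null, node)) {
                        tail.compareAndSet(last, node);
                        return;
                    }
                } else {
                    tail.compareAndSet(last, next);   // help finish a pending enqueue
                }
            }
        }
    }

    public T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first == head.get()) {                // consistency check
                if (first == last) {
                    if (next == null) return null;    // truly empty
                    tail.compareAndSet(last, next);   // intermediate state: help, retry
                } else {
                    T value = next.item;              // read before swinging head
                    if (head.compareAndSet(first, next)) {
                        return value;                 // next becomes the new sentinel
                    }
                    // CAS failed: another dequeuer won; retry
                }
            }
        }
    }
}
```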
This Michael & Scott algorithm provides a correct lock-free unbounded queue implementation.
Reuse and the ABA Problem
The lock-free stack and queue implementations work fine if nodes, once removed, are never reused. However, in many contexts (OS kernels, systems without GC, performance-critical code using memory pools), node reuse is necessary. This reuse exposes a critical flaw in algorithms relying solely on CAS: the ABA problem.
Back to the Lock-Free Stack
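As a reminder, the lock-free (Treiber-style) stack swings a single `top` pointer with CAS. A sketch (names assumed) with the reuse hazard noted at the critical CAS:

```java
import java.util.concurrent.atomic.AtomicReference;

// Lock-free stack: push and pop each retry a single CAS on top.
class LockFreeStack<T> {
    static class Node<T> {
        final T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>(null);

    public void push(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> t = top.get();
            node.next = t;
            if (top.compareAndSet(t, node)) return;
        }
    }

    public T pop() {
        while (true) {
            Node<T> t = top.get();
            if (t == null) return null;       // empty
            Node<T> next = t.next;
            // If nodes are reused: between reading t and this CAS, other threads
            // may pop t, do more operations, and push the *same node object* back
            // with a different next pointer. The CAS still succeeds because the
            // reference matches -- this is the ABA problem.
            if (top.compareAndSet(t, next)) return t.item;
        }
    }
}
```

This is safe under garbage collection (a node the popper still references is never recycled), which is exactly why explicit reuse breaks it.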
Continue here: 24 ABA Problem, DCAS, GC, Pointer Tagging, Hazard Pointers, Linearizability