Lecture from 07.05.2025 | Video: Video ETHZ
Today, we’ll expand on fine-grained locking techniques, exploring practical implementations and trade-offs. This leads us naturally into the realm of lock-free programming, where we aim to achieve concurrency without using traditional mutual exclusion locks.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250421174237.png)
Summary/Recap Last Lecture
In our previous lecture, we covered:
- Explicit/Interface Locks (`Lock`, `Condition`): Explored the Java `Lock` interface and associated `Condition` variables, noting their increased power and flexibility compared to intrinsic monitors (`synchronized`/`wait`/`notify`), but also their increased potential for programmer error.
- Sleeping Barber Example: Analyzed this classic problem to illustrate how to avoid excessive/useless notifications in condition variable patterns by keeping track of waiting thread counts, preventing lost signals.
- Reader/Writer Locks (`ReentrantReadWriteLock`): Discussed locks that allow multiple concurrent readers or a single exclusive writer, examining different fairness policies (reader vs. writer priority).
Learning Goals Today
Building on that foundation, our goals for today are:
- Lock Granularity in Practice: Apply fine-grained, optimistic, and lazy locking strategies to list and stack examples.
- Various Locking Modes: Understand the implementation details and trade-offs of these different locking approaches.
- Lock-Free Data Structures:
  - Implement a lock-free stack using atomic operations (Compare-and-Swap).
  - Examine lock-free queues, a common requirement in OS structures.
- The ABA Problem: Understand this subtle issue that arises with Compare-and-Swap when memory is reused.
- Avoiding the ABA Problem: Discuss techniques like pointer tagging and hazard pointers.
- (Introduction to Formal Models): Briefly introduce Linearizability and Sequential Consistency.
- (Introduction to Consensus): Define the consensus problem and its relation to the power of atomic operations.
Compare-and-Swap (CAS) Revisited
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250421174413.png)
`CAS(memory_location, expected_old_value, new_value)` - Atomically:
- Reads the current value at `memory_location`.
- Compares it with `expected_old_value`.
- If they match, writes `new_value` to `memory_location` and returns success (or the old value).
- If they don’t match, does not write, and returns failure (or the current value).
- More powerful than TAS. Can often be implemented wait-free in hardware.
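In Java, these semantics are exposed through the atomic classes; a minimal sketch using `AtomicInteger.compareAndSet` (the class name `CasDemo` is just for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    public static void main(String[] args) {
        AtomicInteger x = new AtomicInteger(5);
        // Matches the expected old value (5): the write happens.
        System.out.println(x.compareAndSet(5, 7)); // true, x is now 7
        // Stale expectation (5, but x is 7): no write happens.
        System.out.println(x.compareAndSet(5, 9)); // false, x is still 7
        System.out.println(x.get());               // 7
    }
}
```

Note that the Java API returns a boolean rather than the old value; `compareAndExchange` (Java 9+) returns the witnessed value instead.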
Non-blocking Counter using CAS
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250421174436.png)
- Mechanism: Read-modify-try-CAS. If `compareAndSet` fails, another thread modified `value` between steps (a) and (c), so we read again and retry.
- Progress: This is lock-free. If multiple threads attempt `inc()` concurrently, at least one will eventually succeed in its CAS operation, ensuring system progress. It’s not wait-free, as a thread might theoretically fail CAS indefinitely under extreme contention (though this is unlikely in practice).
- Deadlock/Starvation: Immune to deadlock. Starvation is possible but rare.
- Thread Death: If a thread dies between reading `v` and the CAS, other threads are unaffected and can continue to make progress.
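The counter described above can be sketched as follows (field and method names are my own; the slides' version is similar):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int inc() {
        int v;
        do {
            v = value.get();                      // (a) read current value
                                                  // (b) compute v + 1
        } while (!value.compareAndSet(v, v + 1)); // (c) CAS; retry on interference
        return v + 1;
    }

    public int get() { return value.get(); }
}
```

A failed `compareAndSet` only means the loop runs again; no thread ever blocks holding a lock.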
Handle CAS With Care
- A successful CAS suggests that no other thread wrote between the read (`get`) and the CAS.
- Warning: This suggestion is not always true due to the ABA problem.
- CAS (or LL/SC) is the fundamental mechanism for checking conditions and making atomic changes in most lock-free algorithms.
- Alternatives like Transactional Memory might simplify things but have their own challenges.
Lock-Free Stack
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422192353.png)
Let’s implement a lock-free stack using an `AtomicReference` for the `top` pointer.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422192416.png)
This is the baseline lock-based version for comparison.
Lock-free stack structure:
```java
public class ConcurrentStack {
    // Use AtomicReference to allow atomic updates (CAS) on the top pointer
    AtomicReference<Node> top = new AtomicReference<Node>();
    // Methods use CAS loops instead of locks
    public void push(Long item) { /* ... */ }
    public Long pop() { /* ... */ }
}
```

Lock-Free Pop
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422192550.png)
Logic: Repeatedly read the current `head` and its `next`. Try to atomically swing the `top` pointer from `head` to `next`. If `top` changed between the `get()` and the `compareAndSet()` (because another thread pushed or popped), the CAS fails, and the loop repeats.
Lock-Free Push
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422192618.png)
Logic: Create the `newi` node. Repeatedly read the current `head`, set `newi.next` to `head`, and try to atomically swing `top` from `head` to `newi`. If `top` changed concurrently, the CAS fails, and the loop retries with the updated `head`.
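Putting the two operations together, a sketch of the full stack (the `Node` layout, the class name `LockFreeStack`, and returning `null` on empty are my assumptions, not necessarily the slides' exact choices):

```java
import java.util.concurrent.atomic.AtomicReference;

public class LockFreeStack {
    private static class Node {
        final Long item;
        Node next;
        Node(Long item) { this.item = item; }
    }

    private final AtomicReference<Node> top = new AtomicReference<>();

    public void push(Long item) {
        Node newi = new Node(item);
        Node head;
        do {
            head = top.get();
            newi.next = head;                     // link new node above current head
        } while (!top.compareAndSet(head, newi)); // retry if top moved meanwhile
    }

    public Long pop() {
        Node head, next;
        do {
            head = top.get();
            if (head == null) return null;        // empty stack (a design choice)
            next = head.next;
        } while (!top.compareAndSet(head, next)); // retry if top moved meanwhile
        return head.item;
    }
}
```

A failed `compareAndSet` simply restarts the loop, so no locks are held at any point.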
Benefits and Performance
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422192737.png)
- Benefit: Lock-free algorithms are inherently deadlock-free.
- Performance:
  - The graph shows that without backoff, the lock-free stack performs worse than the simple blocking (`synchronized`) stack under contention. This is because failed CAS attempts still consume resources and potentially cause cache contention, similar to spinlocks.
  - Atomic operations (like the CAS used by `AtomicReference`) are relatively expensive compared to uncontended lock acquires/releases.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422192815.png)
Lock-free doesn’t automatically mean faster. Contention on the underlying atomic operations can still be a bottleneck.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422192826.png)
Backoff: Adding exponential backoff to the retry loops (pausing after a failed CAS) significantly improves the performance of the lock-free stack under contention, making it outperform the blocking version.
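A sketch of what such a backoff loop can look like (the delay bounds and the use of `LockSupport.parkNanos` are illustrative choices, not the lecture's exact code):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

public class BackoffStack<T> {
    private static class Node<T> {
        final T item; Node<T> next;
        Node(T item) { this.item = item; }
    }

    private static final long MIN_DELAY_NS = 1 << 4;  // illustrative bounds
    private static final long MAX_DELAY_NS = 1 << 16;

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    public void push(T item) {
        Node<T> n = new Node<>(item);
        long limit = MIN_DELAY_NS;
        while (true) {
            Node<T> head = top.get();
            n.next = head;
            if (top.compareAndSet(head, n)) return;
            // Failed CAS: pause a random time up to limit, then double the limit.
            LockSupport.parkNanos(ThreadLocalRandom.current().nextLong(limit) + 1);
            limit = Math.min(limit * 2, MAX_DELAY_NS);
        }
    }

    public T pop() {
        long limit = MIN_DELAY_NS;
        while (true) {
            Node<T> head = top.get();
            if (head == null) return null;          // empty (a design choice)
            if (top.compareAndSet(head, head.next)) return head.item;
            LockSupport.parkNanos(ThreadLocalRandom.current().nextLong(limit) + 1);
            limit = Math.min(limit * 2, MAX_DELAY_NS);
        }
    }
}
```

Randomizing the pause (rather than sleeping the full limit) spreads retries out, so that threads that collided once are unlikely to collide again immediately.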
Lock-Free List-Based Set (Introduction)
Implementing a lock-free linked list is significantly more complex than a stack, because an operation conceptually has to modify two pieces of state “atomically”: `pred.next` and the new node’s `next` for `add`, or `pred.next` and a deletion mark for `remove`. (We are not discussing skip lists here.)
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422192858.png)
Can simple CAS handle concurrent `add` and `remove`? A: `remove(c)` needs `CAS(b.next, c, d)`. B: `add(b')` needs `CAS(b.next, c, b')`. The CAS hardware arbitrates: only one succeeds, the other retries. This seems okay.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422193025.png)
What about A: `remove(c)` needing `CAS(b.next, c, d)` and B: `remove(b)` needing `CAS(a.next, b, c)`? If B succeeds first, A’s CAS on `b.next` still succeeds (`b.next` still points to `c`), but `b` has already been unlinked, so A’s update has no effect on the list: `a` now points to `c`, and `c` remains incorrectly linked from `a`. Simple CAS on single pointers is insufficient.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422193309.png)
Adding a mark bit for lazy deletion (A: `CAS(c.mark, false, true)` then `CAS(b.next, c, d)`) also doesn’t solve the problem directly. Thread B might add `c'` (`CAS(c.next, d, c')`) after A marks `c` but before A physically removes `c`: `c'` gets added to a logically deleted node.
The Problem: We need to atomically modify or validate two pieces of information: the `next` pointer and a state marker (like a delete bit), or potentially two pointers (`tail.next` and `tail` in a queue). Standard CAS only works on a single memory location.
Java Solution: AtomicMarkableReference
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422193503.png)
Java provides `AtomicMarkableReference<V>` to address this. It atomically manages a reference `V` and a single boolean mark.
- `compareAndSet(expectedRef, newRef, expectedMark, newMark)`: Atomically sets the reference to `newRef` and the mark to `newMark` if and only if the current reference is `expectedRef` AND the current mark is `expectedMark`.
- `attemptMark(expectedRef, newMark)`: Atomically sets the mark to `newMark` if the current reference is `expectedRef`.
- Other methods: `get()`, `getReference()`, `isMarked()`, `set()`.
This effectively provides a way to perform a Double CAS (DCAS) on the reference and the mark bit together. (In OpenJDK, `AtomicMarkableReference` is implemented by boxing the (reference, mark) pair into an internal immutable object and CASing a single reference to that pair.) The slide notes that one bit is often “free” in aligned pointer references, allowing the mark bit to be stored efficiently alongside the pointer in native implementations.
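A small usage sketch of this API (the class name `MarkDemo` and the string values are illustrative):

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

public class MarkDemo {
    public static void main(String[] args) {
        String a = "a", b = "b";
        AtomicMarkableReference<String> ref =
                new AtomicMarkableReference<>(a, false);

        // Succeeds only if BOTH the reference and the mark match expectations.
        System.out.println(ref.compareAndSet(a, b, false, false)); // true

        // Flip the mark without changing the reference.
        System.out.println(ref.attemptMark(b, true));              // true

        // Read reference and mark together, atomically.
        boolean[] mark = new boolean[1];
        System.out.println(ref.get(mark) + " " + mark[0]);         // b true

        // A CAS expecting an unmarked reference now fails.
        System.out.println(ref.compareAndSet(b, a, false, false)); // false
    }
}
```

The `boolean[] mark` out-parameter is the standard idiom for reading the reference and its mark in a single atomic step.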
Algorithm using AtomicMarkableReference
We can now revisit the lazy list approach using `AtomicMarkableReference` for the `next` field in each node.
- The `remove` operation becomes two distinct conceptual steps:
  - Logical Delete: Atomically set the mark bit of the target node’s `next` field (using `attemptMark`, or `compareAndSet` on the target node itself if the mark is stored there).
  - Physical Delete: Atomically update the predecessor’s `next` field to skip the marked node (using `compareAndSet`, checking both the reference and the mark of the predecessor’s `next` field).
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422194205.png)
Example: Removing `c` (between `b` and `d`).
- Mark `c`: `c.attemptMark(false, true)` (assuming the mark is stored on the node).
- Swing `b.next`: `b.next.compareAndSet(c, d, false, false)` (assuming `c` and `d` are unmarked initially). This tries to change `b`’s `next` from `(ref=c, mark=false)` to `(ref=d, mark=false)`. This step physically removes `c`.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422194312.png)
If A tries to remove `c` and B tries to remove `b` concurrently:
- A marks the `c.next` pointer (not `c` itself).
- B marks the `b.next` pointer (not `b` itself).
- A tries `b.next.compareAndSet(c, d, false, false)`. This fails because `b.next` is now marked by B. A might then help physically remove `b` before retrying on `c`.
- B tries `a.next.compareAndSet(b, c, false, false)`. This succeeds (assuming `a`, `b`, `c` were initially unmarked); `b` is physically removed.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422210218.png)
Helping: If a thread encounters a marked node during traversal, it should attempt to “finish the job” by performing the physical deletion (CASing the predecessor’s next pointer) before continuing its own operation. This “helping” mechanism is crucial for ensuring lock-free progress, as it prevents threads from getting stuck behind nodes that are logically deleted but not yet physically unlinked.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422210233.png)
The `find` method now incorporates logic to detect marked nodes (`marked[0]`) and attempt physical deletion (`pred.next.compareAndSet(curr, succ, false, false)`), restarting the search if the physical deletion fails due to interference.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422211228.png)
The `remove` method uses `find` to locate `pred` and `curr`. It first tries to logically delete `curr` by marking its successor reference (`curr.next.attemptMark(succ, true)`). If successful, it then tries to physically delete it (`pred.next.compareAndSet(curr, succ, false, false)`). If logical deletion fails (`!snip`), it continues the outer retry loop.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422211618.png)
The `add` method uses `find` to locate `pred` and `curr`. It creates the new node and attempts to insert it using `pred.next.compareAndSet(curr, node, false, false)`, retrying if the CAS fails (because `pred` or `curr` changed or were marked).
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422211650.png)
- Using `AtomicMarkableReference` effectively simulates a DCAS on the reference and the mark bit.
- The lazy marking combined with helping (physically deleting marked nodes encountered during traversal) is key to making the list operations lock-free.
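The pieces described above fit together roughly as follows. This is a sketch reconstructed in the style of Herlihy & Shavit's LockFreeList; the `Window` helper, the sentinel keys, and the use of integer elements are assumptions on my part:

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

public class LockFreeList {
    private static final class Node {
        final int key;
        final AtomicMarkableReference<Node> next;
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicMarkableReference<>(next, false);
        }
    }

    // A pred/curr pair returned by find (naming follows Herlihy & Shavit).
    private static final class Window {
        final Node pred, curr;
        Window(Node pred, Node curr) { this.pred = pred; this.curr = curr; }
    }

    private final Node head; // sentinel with minimal key; tail has maximal key

    public LockFreeList() {
        Node tail = new Node(Integer.MAX_VALUE, null);
        head = new Node(Integer.MIN_VALUE, tail);
    }

    // Locate pred/curr for key, physically unlinking marked nodes on the way
    // ("helping"); restart from the head if a CAS fails due to interference.
    private Window find(int key) {
        boolean[] marked = {false};
        retry:
        while (true) {
            Node pred = head;
            Node curr = pred.next.getReference();
            while (true) {
                Node succ = curr.next.get(marked);
                while (marked[0]) { // curr is logically deleted: help remove it
                    if (!pred.next.compareAndSet(curr, succ, false, false))
                        continue retry;
                    curr = succ;
                    succ = curr.next.get(marked);
                }
                if (curr.key >= key) return new Window(pred, curr);
                pred = curr;
                curr = succ;
            }
        }
    }

    public boolean add(int key) {
        while (true) {
            Window w = find(key);
            if (w.curr.key == key) return false;      // already present
            Node node = new Node(key, w.curr);
            if (w.pred.next.compareAndSet(w.curr, node, false, false))
                return true;                          // linked in; else retry
        }
    }

    public boolean remove(int key) {
        while (true) {
            Window w = find(key);
            if (w.curr.key != key) return false;      // not present
            Node succ = w.curr.next.getReference();
            // Logical delete: mark curr's next reference (the "snip").
            if (!w.curr.next.attemptMark(succ, true)) continue;
            // Physical delete is best-effort: find() will help if this fails.
            w.pred.next.compareAndSet(w.curr, succ, false, false);
            return true;
        }
    }

    public boolean contains(int key) {
        Node curr = head;
        while (curr.key < key) curr = curr.next.getReference();
        return curr.key == key && !curr.next.isMarked();
    }
}
```

Note how `remove` treats physical deletion as optional: once the mark is set, the node is logically gone, and any later traversal through `find` will complete the unlinking.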
Lock-Free Unbounded Queue
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422211748.png)
Motivation: Implementing core OS or runtime system components, like thread schedulers, often requires highly efficient, non-blocking concurrent data structures. Scheduler queues are a prime example.
![](/Semester-2/Parallel-Programming/Lecture-Notes/attachments/Pasted-image-20250422211804.png)
These queues need protection from concurrent access by multiple threads and potentially interrupt handlers running on different cores. Using locks can lead to the problems discussed earlier (deadlocks, convoying, priority inversion), making lock-free implementations desirable.
Continue here: 23 Lock-Free Unbounded Queue Protocol