This article presents an implementation of atomic operations optimized for low latency in multithreaded JVM environments. The implementation achieves approximately 2x lower latency under high concurrency through adaptive backoff mechanisms in compare and swap (CAS) loops.
Implementation Strategy
The standard `java.util.concurrent.atomic` package implements operations such as `getAndAdd` and `addAndGet` using the `lock addq` CPU instruction or volatile variable increments, which optimizes for single threaded contexts but degrades significantly under contention. All operations in this implementation, including the increment and decrement variants, instead use compare and swap (CAS) loops with adaptive backoff.
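As a rough sketch, assuming a wrapper around `AtomicLong` (the class name and structure below are illustrative, not the published implementation), the approach looks like this:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of getAndAdd as a CAS loop with backoff. BackoffAtomicLong and
// the use of Thread.yield() here are illustrative assumptions, not the
// library's actual code.
public class BackoffAtomicLong {
    private final AtomicLong value = new AtomicLong();

    public long getAndAdd(long delta) {
        for (;;) {
            long current = value.get();
            if (value.compareAndSet(current, current + delta)) {
                return current; // CAS succeeded: return the previous value
            }
            Thread.yield(); // CAS failed: back off instead of retrying at once
        }
    }
}
```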
The critical design decision is to insert a minimal thread suspension whenever a CAS operation fails to update the value. This prevents wasteful spinning, where threads continuously attempt updates while another thread's update is in flight, and transforms CPU bound contention into a more efficient scheduling problem. The suspension mechanism requires careful calibration: excessive delays degrade performance in low contention scenarios, while insufficient delays fail to mitigate high contention pathologies.
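One way to express this calibration, sketched here with an assumed spin limit rather than a tuned value, is to retry cheaply a bounded number of times before deferring to the scheduler:

```java
// Illustrative calibration sketch: the first few failed CAS attempts pay
// only a cheap CPU-level pause; persistent contention escalates to the
// scheduler. SPIN_LIMIT is an assumed constant, not a tuned value.
public final class Backoff {
    private static final int SPIN_LIMIT = 16;

    static void onFailure(int consecutiveFailures) {
        if (consecutiveFailures < SPIN_LIMIT) {
            Thread.onSpinWait(); // cheap pause for low contention
        } else {
            Thread.yield();      // suspend once contention persists
        }
    }

    private Backoff() {}
}
```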
Performance Characteristics
Performance analysis was conducted on an Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz running OpenJDK 9.0.1 under Debian, comparing latency distributions across varying thread counts for both the standard `java.util.concurrent.atomic` implementation and the optimized variant.[^1]
| Benchmark | Threads | p0.00 | p0.50 | p0.90 | p0.95 | p0.99 | p0.999 | p0.9999 | p1.00 |
|---|---|---|---|---|---|---|---|---|---|
| j.u.c.atomic | 1 | 8 | 8 | 8 | 8 | 9 | 18 | 138 | 479 |
| Korinsky | 1 | 8 | 8 | 8 | 8 | 9 | 18 | 135 | 560 |
| j.u.c.atomic | 2 | 8 | 74 | 108 | 113 | 123 | 206 | 234 | 1'716 |
| Korinsky | 2 | 8 | 14 | 64 | 92 | 164 | 300 | 536 | 2'068 |
| j.u.c.atomic | 4 | 9 | 188 | 579 | 721 | 973 | 1'292 | 1'476 | 20'576 |
| Korinsky | 4 | 9 | 137 | 340 | 490 | 1'036 | 2'368 | 4'136 | 10'384 |
| j.u.c.atomic | 8 | 9 | 550 | 868 | 1'294 | 2'976 | 5'472 | 8'095 | 131'840 |
| Korinsky | 8 | 9 | 402 | 919 | 1'140 | 1'612 | 2'168 | 2'596 | 130'432 |
| j.u.c.atomic | 16 | 9 | 767 | 921 | 1'224 | 4'424 | 130'816 | 220'672 | 1'161'216 |
| Korinsky | 16 | 9 | 772 | 1'170 | 1'334 | 1'706 | 130'816 | 220'672 | 1'290'240 |
| j.u.c.atomic | 32 | 9 | 773 | 924 | 1'202 | 4'824 | 330'752 | 893'952 | 3'313'664 |
| Korinsky | 32 | 9 | 782 | 1'190 | 1'360 | 1'834 | 330'240 | 840'704 | 2'838'528 |
| j.u.c.atomic | 64 | 9 | 772 | 926 | 1'146 | 4'848 | 700'416 | 1'812'480 | 7'282'688 |
| Korinsky | 64 | 9 | 785 | 1'204 | 1'384 | 1'952 | 660'480 | 1'730'560 | 6'684'672 |
Single threaded performance remains equivalent: the optimized implementation exhibits latency comparable to the standard library at all percentiles. As thread count increases, the backoff strategy provides substantial benefits; at 8 threads, median latency improves from 550ns to 402ns, with the 99th percentile improving from 2'976ns to 1'612ns. The advantage becomes more pronounced at higher thread counts: with 64 concurrent threads, the 99th percentile latency improves from 4'848ns to 1'952ns, a reduction of approximately 60%.
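Distributions of this shape can be collected with JMH's sample-time mode; the sketch below is an assumed harness (class name and thread count are illustrative), not necessarily the benchmark actually used:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.*;

// Mode.SampleTime records individual call latencies, from which JMH
// reports percentile columns like those in the table above.
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class AtomicLatencyBenchmark {
    private final AtomicLong counter = new AtomicLong();

    @Benchmark
    @Threads(8) // rerun with 1, 2, 4, 16, 32 and 64 to sweep contention
    public long getAndIncrement() {
        return counter.getAndIncrement();
    }
}
```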
JVM Sleep Mechanisms
Adaptive backoff requires suspending thread execution for minimal durations. The JVM provides several primitives for thread suspension, each with distinct performance characteristics and semantic guarantees. Platform dependent behavior, combined with a highly variable relationship between requested and actual sleep duration, necessitates an adaptive approach rather than fixed sleep periods. The selected mechanism is `Thread.yield()`, which hints to the scheduler that the thread is willing to yield the processor, effectively inserting the equivalent of a pause CPU instruction into the execution path.
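The gap between requested and actual suspension is easy to observe; a small probe along these lines (the class is illustrative) typically shows actual park times orders of magnitude above the requested nanosecond:

```java
import java.util.concurrent.locks.LockSupport;

// Illustrative probe: request a 1 ns park and measure how long the
// thread actually stays suspended on this platform.
public class ParkProbe {
    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            LockSupport.parkNanos(1);
            long observed = System.nanoTime() - start;
            System.out.println("requested 1 ns, observed " + observed + " ns");
        }
    }
}
```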
Backoff Strategy Analysis
Detailed benchmarking across multiple backoff strategies reveals that performance is sensitive both to the suspension primitive chosen and to the number of iterations before suspension. The optimal strategy balances immediate retry attempts against thread suspension, minimizing both busy waiting and scheduling overhead. Selected results for 16 concurrent threads:
- Plain CAS loop without backoff: median 921ns, p0.95 4'424ns
- Single `Thread.onSpinWait()` call: median 2'000ns, p0.95 2'772ns
- Single `Thread.yield()` call: median 2'240ns, p0.95 4'272ns
- `Thread.onSpinWait()` followed by `LockSupport.parkNanos(1)`: median 1'368ns, p0.95 2'072ns
Pure spin waiting (`Thread.onSpinWait()` only) increases median latency but reduces tail latencies by preventing extended contention periods. Combining spin waiting with minimal park durations provides the best balance, achieving lower median latencies than pure yielding while maintaining acceptable tail behavior. At 32 and 64 threads, the performance characteristics shift: higher contention increases the value of more aggressive suspension strategies. With 64 threads, approaches using `LockSupport.parkNanos()` with values between 1 and 10 nanoseconds achieve median latencies around 1'200ns, compared to 926ns for unoptimized code, but dramatically improve the 99th percentile from 4'848ns to approximately 2'000ns.
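A sketch of that combined strategy, with an assumed counter class rather than the library's actual code:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// Illustrative spin-then-park backoff: on a failed CAS, emit a spin-wait
// hint and then park for a nominal nanosecond before retrying.
public class SpinParkCounter {
    private final AtomicLong value = new AtomicLong();

    public long getAndIncrement() {
        for (;;) {
            long current = value.get();
            if (value.compareAndSet(current, current + 1)) {
                return current;
            }
            Thread.onSpinWait();      // hint that we are spinning
            LockSupport.parkNanos(1); // minimal suspension before retry
        }
    }
}
```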
Afterword
Adaptive backoff in CAS loops provides substantial latency improvements in highly concurrent environments without degrading single threaded performance; the approximately 2x reduction in high percentile latencies at 64 concurrent threads validates the approach for applications requiring low latency atomic operations under contention.[^2]
[^1]: All latency measurements are reported in nanoseconds; percentile notation follows standard convention, where p0.99 represents the 99th percentile of observed latencies.
[^2]: The implementation is available on Maven Central under the artifact coordinates `ky.korins:atomic` as a drop-in replacement for `java.util.concurrent.atomic`, requiring only that import statements be changed from `java.util.concurrent.atomic` to `ky.korins.atomic`. The full code and benchmarks are available on GitHub; additional documentation and resources can be found on the project website. This work builds upon prior research into JVM sleep mechanisms.