Table of contents
Open Table of contents
- Intro — Is Pool Exhaustion an Application Problem, or a JVM Problem?
- 1. The Operational Surface — What You See When the Pool Exhaustion Alert Fires
- 2. The Entry Points to a Thread Dump — Where and How to Capture One
- 3. Dissecting a Thread Dump — JVM at the Moment of Pool Exhaustion
- 4. HikariCP Internals — JVM-Side View of ConcurrentBag and SynchronousQueue
- 4.1 ConcurrentBag — The Core Data Structure of the Connection Pool
- 4.2 SynchronousQueue — A Zero-Capacity Hand-Off Queue
- 4.3 LockSupport.parkNanos — How the JVM Puts a Thread to Sleep
- 4.4 Mapping the transaction-with-external-call pool-exhaustion measurements to the code
- 4.5 connectionTimeout / idleTimeout / maxLifetime — JVM Side
- 5. The JVM Thread State Machine — 6 States and Transitions
- 6. The three-prescription comparison’s 9 Scenarios → How the Thread Dump Changes
- 7. Operational Monitoring — JVM Metrics + Automated Thread Dump Capture
- 8. Production Failure Scenarios (3 AM Edition)
- 9. Big-Tech Cases — Real-World Dumps / GC / Concurrency
- 10. Recap — putting this article in your own words
- 11. What I Learned
- 12. In the Next Post
- References
Intro — Is Pool Exhaustion an Application Problem, or a JVM Problem?
3 AM. You get an alert: payment API P99 jumped from 200ms to 6 seconds. The code is identical to yesterday’s. The external PG status page is green. Yet the system is melting down.
At this point, staring at logs and metrics yields nothing. error_rate is 0%, success_rate is 100% — yet users are looking at frozen screens for 6 seconds. The monitoring tells you the system is fine while it is in fact completely stuck.
The real evidence sits in one place — the Thread Dump.
"http-nio-8080-exec-3" #42 daemon prio=5 tid=0x00007f8b... nid=0x103
java.lang.Thread.State: TIMED_WAITING (parked)
at jdk.internal.misc.Unsafe.park(java.base@21/Native Method)
- parking to wait for <0x00000007a5b30c10> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(...)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(...)
at com.zaxxer.hikari.util.ConcurrentBag.borrow(...)
at com.zaxxer.hikari.pool.HikariPool.getConnection(...)
at com.zaxxer.hikari.HikariDataSource.getConnection(...)
at org.springframework...DataSourceUtils.fetchConnection(...)
...
The answer is in a single dump. Every worker thread is in TIMED_WAITING (parked) inside HikariCP’s getConnection(). What is happening inside the JVM right now — that is the truth the dump is telling you.
This post takes that dump apart line by line.
- The sister post — Spring Transactions and External API Calls — Reproducing Pool Exhaustion and Comparing Three Remedies (Simple Split, Saga, Outbox) by Measurement — covered the same incident from a business pattern angle (Saga / Outbox). This post replays the same transaction-with-external-call pool-exhaustion measurements through a JVM lens — Thread Dump / Thread State / HikariCP internals / LockSupport / GC. The two posts are paired.
- Input: the transaction-with-external-call pool-exhaustion measurement [measured — Java/Spring] (timeout 5s = 100% pass / P99 6.3s, timeout 1s = 16.7% pass / 50 timeouts) + the three-prescription comparison’s 9-scenario matrix.
- Depth: L3-L4 (JVM/Java Mastery series Part 1 — measurement + JVM mechanics + big-tech operations + recap questions).
1. The Operational Surface — What You See When the Pool Exhaustion Alert Fires
The sister post handled the same transaction-with-external-call pool-exhaustion measurement from the business angle, so I will only briefly reprise the operational surface here.
1.1 Two Faces of Pool Exhaustion [measured — Java/Spring]
Same pool exhaustion, but the connection-timeout value flips the signal the operations team receives:
| Metric | timeout 5s | timeout 1s |
|---|---|---|
| Success rate | 100% | 16.7% |
| P99 | 6,350 ms | 3,302 ms |
| Pool timeout | 0 | 50 |
| awaitingConnection peak | 20 | 50 |
| What monitoring sees | "fine" | "incident" |
The sister post moved on to Saga / Outbox / Simple Split from here. This post moves inside the JVM at the moment of dump capture.
1.2 Why the Code Alone Doesn’t Tell You Anything
@Transactional
public void confirm(PaymentRequest req) {
Payment p = repo.find(req.id());
PgResponse r = pgClient.confirm(req); // external call — usually 200ms
p.applyResult(r);
repo.save(p);
}
You can stare at this code all day. Nothing is wrong. The observation that the external PG slowed from 200ms to 3,000ms is not on the status page. Where, inside the JVM, the thread is waiting for what — that is the thread dump’s job.
2. The Entry Points to a Thread Dump — Where and How to Capture One
2.1 Four Capture Tools
| Tool | Command | Use | Overhead |
|---|---|---|---|
| jstack | jstack <pid> | One-shot thread dump | ~0 |
| kill -3 | kill -3 <pid> | SIGQUIT — dump written to the JVM's stdout | ~0 |
| jcmd | jcmd <pid> Thread.print | jstack replacement (recommended Java 8+) | ~0 |
| JFR | jcmd <pid> JFR.start duration=30s | 30-second continuous profile (incl. lock contention) | < 1% |
| async-profiler | ./profiler.sh -d 30 <pid> | Flame graph (incl. native frames) | < 1% |
The standard production runbook:
- First 5 minutes — jcmd <pid> Thread.print > dump.txt immediately, 3 times at 5-second intervals (to distinguish momentary from persistent)
- Next 30 minutes — JFR 30-second capture (lock contention / Java Monitor Wait events)
- Long term — async-profiler flame graph (native frames included — JNI / Direct Memory issues)
2.2 Spring Actuator /actuator/threaddump
If Spring Boot is already deployed, you can pull a dump without touching code:
curl -s http://localhost:8080/actuator/threaddump | jq '.'
JSON-shaped, which is convenient for automated analysis. Just be careful with production exposure — keep it behind a security group or a separate actuator port.
2.3 Three Captures, Not One
A single dump is a snapshot. If the same thread sits in the same stack frame across multiple dumps — that thread is genuinely stuck.
# Capture 3 times at 5-second intervals
for i in 1 2 3; do
jcmd <pid> Thread.print > dump-$i.txt
sleep 5
done
If the same thread sits in the same frame across all three dumps, you have evidence of being frozen for 15 seconds. Never draw a conclusion from a single dump.
3. Dissecting a Thread Dump — JVM at the Moment of Pool Exhaustion
The main course. Here is the shape of a dump captured during the pool-exhaustion measurement’s Run #2 (timeout 1s, concurrent 60, extDelay 3s), unpacked line by line.
3.1 A Healthy Thread (RUNNABLE)
First, normal. Pool not empty:
"http-nio-8080-exec-1" #41 daemon prio=5 os_prio=31 cpu=12.34ms
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(java.base@21/Native Method)
at java.net.SocketInputStream.socketRead(...)
at com.mysql.cj.protocol.a.SimpleSocketConnection.read(...)
at com.mysql.cj.protocol.a.NativePacketReader.readMessageLocal(...)
...
at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery(...)
at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(...)
at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(...)
...
at com.example.OrderService.confirm(OrderService.java:42)
Line by line:
| Line | Meaning |
|---|---|
| Thread.State: RUNNABLE | The JVM's logical classification of "runnable" — it may actually be blocked in a socket read (paradoxically, the JVM treats native I/O as RUNNABLE) |
| socketRead0 | Waiting for the DB response — scheduled out by the OS inside native code |
| executeQuery | SQL in flight |
| OrderService.confirm:42 | Business code entry |
This thread is doing real work. No issue.
3.2 A Thread at Pool Exhaustion (TIMED_WAITING parked)
"http-nio-8080-exec-3" #42 daemon prio=5 os_prio=31 cpu=0.05ms tid=0x00007f8b001
java.lang.Thread.State: TIMED_WAITING (parked)
at jdk.internal.misc.Unsafe.park(java.base@21/Native Method)
- parking to wait for <0x00000007a5b30c10> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:269)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:401)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:903)
at com.zaxxer.hikari.util.ConcurrentBag.borrow(ConcurrentBag.java:162)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:179)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:144)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:99)
at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(...)
...
at com.example.OrderService.confirm(OrderService.java:38)
This is the signature of pool exhaustion:
| Line | Meaning |
|---|---|
| Thread.State: TIMED_WAITING (parked) | A time-bounded park, entered via parkNanos(N) |
| Unsafe.park (Native Method) | OpenJDK's Unsafe.park — the OS thread is genuinely scheduled out |
| parking to wait for <0x...> (SynchronousQueue$TransferStack) | Which object it is parked on — SynchronousQueue's transfer stack |
| LockSupport.parkNanos:269 | parkNanos(blocker, nanos) — the JVM's standard park entry |
| SynchronousQueue.poll(timeout) | Hand-off queue — waiting for a producer |
| ConcurrentBag.borrow:162 | HikariCP's connection borrow — falls into the SynchronousQueue path when the pool is empty |
| OrderService.confirm:38 | Business code — the line that calls dataSource.getConnection() |
Every line carries meaning. Pool exhaustion is, mechanically, threads waiting to be unparked from LockSupport.parkNanos.
3.3 Thread State Distribution from a Single Dump — ASCII Bar
A dump from the pool-exhaustion measurement’s Run #2 (60 workers, pool=10):
At idle:
RUNNABLE ████ 4
TIMED_WAITING ██████████████ 14 (Hikari housekeeper, scheduled tasks, etc.)
WAITING █████████ 9
BLOCKED 0
─────────────────────── 27 threads
At pool exhaustion (concurrent 60 → pool 10):
RUNNABLE ██████████ 10 (active connections at work)
TIMED_WAITING ██████████████████████████████████████████████████████████ 50 ← 50 workers all in parkNanos
(HikariCP getConnection)
WAITING █████████ 9
BLOCKED 0
─────────────────────── 69 threads (60 workers + HikariCP & friends)
TIMED_WAITING (parked) spikes to 50 — exactly matching awaitingConnection=50 from the pool-exhaustion measurement’s Run #2. The dump’s thread state and Hikari’s MXBean metric are two facets of the same event.
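The distribution above does not have to be eyeballed from raw dump text. A minimal sketch (the class name ThreadStateHistogram is mine) that computes the same per-state counts from the standard ThreadMXBean API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateHistogram {
    /** Count live threads per Thread.State — the numbers behind the ASCII bars above. */
    public static Map<Thread.State, Integer> snapshot() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean()
                .dumpAllThreads(false, false); // skip monitors/synchronizers — cheap
        for (ThreadInfo info : infos) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        snapshot().forEach((state, n) -> System.out.printf("%-14s %d%n", state, n));
    }
}
```

The same loop is the seed of the automated analysis in Section 7 — alert when the TIMED_WAITING count jumps by the size of your worker pool.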
4. HikariCP Internals — JVM-Side View of ConcurrentBag and SynchronousQueue
Now I unpack Section 3’s stack trace at the code level. Why does HikariCP behave this way?
4.1 ConcurrentBag — The Core Data Structure of the Connection Pool
Simplified borrow from HikariCP ConcurrentBag.java:
// HikariCP ConcurrentBag.java — borrow() core logic (simplified)
public T borrow(long timeout, TimeUnit timeUnit) throws InterruptedException {
// Step 1: try ThreadLocal cache first
final var list = threadList.get();
for (int i = list.size() - 1; i >= 0; i--) {
final var entry = list.remove(i).get();
if (entry != null && entry.compareAndSet(STATE_NOT_IN_USE, STATE_IN_USE)) {
return entry;
}
}
// Step 2: try CAS on the shared list
final int waiting = waiters.incrementAndGet();
try {
for (T bagEntry : sharedList) {
if (bagEntry.compareAndSet(STATE_NOT_IN_USE, STATE_IN_USE)) {
if (waiting > 1) {
listener.addBagItem(waiting - 1); // signal pool growth
}
return bagEntry;
}
}
// Step 3: pool empty — wait on handoffQueue
listener.addBagItem(waiting);
timeout = timeUnit.toNanos(timeout);
do {
final long start = currentTime();
final T bagEntry = handoffQueue.poll(timeout, NANOSECONDS); // ← parkNanos here
if (bagEntry == null || bagEntry.compareAndSet(STATE_NOT_IN_USE, STATE_IN_USE)) {
return bagEntry;
}
timeout -= elapsedNanos(start);
} while (timeout > 10_000);
return null; // timeout — caller turns this into SQLTransientConnectionException
} finally {
waiters.decrementAndGet();
}
}
The threads at pool exhaustion are stuck inside Step 3, in handoffQueue.poll(timeout, NANOSECONDS). The handoffQueue is a SynchronousQueue — see Section 4.2.
4.2 SynchronousQueue — A Zero-Capacity Hand-Off Queue
From OpenJDK SynchronousQueue.java:
SynchronousQueue is a BlockingQueue with capacity 0.
put(x) waits until another thread calls take(); take() waits until another thread calls put(). The queue stores no element internally — it is a pure hand-off channel.
Why HikariCP uses a SynchronousQueue:
- Fairness — SynchronousQueue(true) enforces FIFO: the thread that started waiting first gets served first.
- Zero-copy hand-off — the connection object is passed directly, never stored. Minimal GC pressure.
- Wait-free poll — poll(0) on an empty queue returns null immediately, so the common path (pool not empty) stays fast.
// SynchronousQueue.poll(timeout) — calls TransferStack.transfer
public E poll(long timeout, TimeUnit unit) throws InterruptedException {
Object e = transferer.transfer(null, true, unit.toNanos(timeout));
if (e != null || !Thread.interrupted()) {
return (E)e;
}
throw new InterruptedException();
}
// TransferStack.transfer — the waiting core
Object transfer(Object e, boolean timed, long nanos) {
// If a matching producer exists, hand off immediately.
// Otherwise push an SNode and call awaitFulfill(s, timed, nanos) → LockSupport.parkNanos
}
The SynchronousQueue$TransferStack.transfer and LockSupport.parkNanos lines in your dump map to exactly this code.
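Both exits a HikariCP waiter can take are easy to observe directly. A small sketch (class and variable names are mine): poll timing out to null when no producer exists, and a direct hand-off when one arrives:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

public class HandoffDemo {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<String> handoff = new SynchronousQueue<>(true); // fair, as in HikariCP

        // Exit path 1: no producer within the timeout → poll parks, then returns null.
        // (HikariCP turns this null into SQLTransientConnectionException.)
        System.out.println(handoff.poll(100, TimeUnit.MILLISECONDS)); // null

        // Exit path 2: a producer arrives while we wait → direct hand-off, nothing stored.
        Thread producer = new Thread(() -> {
            try { handoff.put("connection"); } catch (InterruptedException ignored) {}
        });
        producer.start();
        System.out.println(handoff.poll(5, TimeUnit.SECONDS)); // connection
        producer.join();
    }
}
```

Take a dump while the first poll is pending and you will see this thread parked on SynchronousQueue$TransferStack — the exact signature from Section 3.2.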
ConcurrentBag.borrow as a sequence — how it gets to park when the pool is empty:
sequenceDiagram
participant W as worker thread
participant TL as ThreadLocal cache
participant SL as sharedList
participant SQ as SynchronousQueue<br/>(handoffQueue)
participant LS as LockSupport
participant OS as OS thread scheduler
W->>TL: Step 1 — check ThreadLocal cache
TL-->>W: miss (empty)
W->>SL: Step 2 — CAS borrow on sharedList
SL-->>W: no STATE_NOT_IN_USE (pool full)
W->>SQ: Step 3 — handoffQueue.poll(1s)
SQ->>LS: TransferStack.awaitFulfill
LS->>OS: Unsafe.park(false, 1_000_000_000L)
Note over OS: thread scheduled out<br/>CPU usage 0
Note over W,OS: Thread.State == TIMED_WAITING (parked)
alt connection returned (another thread releases one)
SQ->>LS: producer arrives → unpark(W)
LS-->>W: borrow success
else 1-second timeout
OS-->>LS: timeout wakeup
LS-->>SQ: poll returns null
SQ-->>W: SQLTransientConnectionException
end
In the pool-exhaustion measurement’s Run #2, 50 threads exited via the latter path (1-second timeout → SQLTransientConnectionException).
4.3 LockSupport.parkNanos — How the JVM Puts a Thread to Sleep
From OpenJDK LockSupport.java:
public static void parkNanos(Object blocker, long nanos) {
if (nanos > 0) {
Thread t = Thread.currentThread();
setBlocker(t, blocker); // ← "parking to wait for" in the dump
try {
U.park(false, nanos); // ← Unsafe.park — native call
} finally {
setBlocker(t, null);
}
}
}
What Unsafe.park(false, nanos) means:
- false = relative nanos, not an absolute deadline
- nanos = wake-up window (1s = 1,000,000,000 ns)
- The OS thread is truly scheduled out — 0 CPU
- Wake conditions: (a) nanos expires, (b) another thread calls unpark(thread), (c) interrupt, (d) spurious wakeup
For the pool-exhaustion measurement’s Run #2 (timeout 1s), the 50 workers all call parkNanos(blocker, 1_000_000_000L) and sleep up to 1 second. After 1 second, if no thread has been unparked (no connection returned), poll returns null → HikariCP raises SQLTransientConnectionException.
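The park/unpark mechanics can be reproduced in a few lines. A sketch (thread and class names are mine) that parks a thread the same way, observes TIMED_WAITING and the blocker object, then wakes it via the unpark path:

```java
import java.util.concurrent.locks.LockSupport;

public class ParkDemo {
    public static void main(String[] args) throws InterruptedException {
        Object blocker = new Object(); // the "parking to wait for <0x...>" object in a dump
        Thread worker = new Thread(
                () -> LockSupport.parkNanos(blocker, 10_000_000_000L), "parked-worker");
        worker.start();
        Thread.sleep(200);             // let it reach Unsafe.park

        System.out.println(worker.getState());                         // TIMED_WAITING
        System.out.println(LockSupport.getBlocker(worker) == blocker); // true

        LockSupport.unpark(worker);    // the "connection returned" wake-up path
        worker.join(1000);
        System.out.println(worker.getState());                         // TERMINATED
    }
}
```

LockSupport.getBlocker is exactly what the dump writer reads to produce the "parking to wait for" line.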
4.4 Mapping the transaction-with-external-call pool-exhaustion measurements to the code
| Measurement | Cause in code |
|---|---|
| timeout 5s, 100% pass (P99 6.3s) | parkNanos(5_000_000_000L) — within 5 seconds, a producer (some other worker returning a connection) always arrives → all hand-offs succeed |
| timeout 1s, 16.7% pass (50 timeouts) | parkNanos(1_000_000_000L) — no producer within 1 second. First wave of 10 borrows immediately; the other 50 return null after 1 second |
| awaitingConnection peak = 50 | The thread count after waiters.incrementAndGet() and entry into handoffQueue.poll. Exposed by the MXBean |
| P99 6.3s = 3 waves × 2s | First wave immediate / second wave 2s later (after a connection is returned) / third wave 4s later. parkNanos sleeps exactly that long |
Dump line ↔ code ↔ measurement are 1:1:1. This is why the thread dump is the real evidence.
4.5 connectionTimeout / idleTimeout / maxLifetime — JVM Side
| Parameter | Meaning | JVM behavior |
|---|---|---|
| connectionTimeout (default 30s) | getConnection() wait timeout | The nanos value in parkNanos(timeout) |
| idleTimeout (default 10min) | Idle-connection eviction threshold | HikariCP housekeeper periodically evicts |
| maxLifetime (default 30min) | Connection retirement threshold | Set shorter than the DB's wait_timeout (typically 28min) — to avoid the "MySQL closes first" race |
| validationTimeout (default 5s) | Connection alive-check timeout | Connection.isValid(timeoutSeconds) |
Operational traps:
- A connectionTimeout too long (60s+) makes pool exhaustion show up as latency and slip past monitoring (the "silent latency" pattern from Section 1.2 in the sister post).
- A connectionTimeout too short (≤1s) causes fail-fast cascades on routine external spikes.
- Practical recommendation: 30s (Hikari default) — but always paired with an awaitingConnection > 0 alert.
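Wiring those recommendations into a HikariConfig might look like the sketch below. This is a configuration sketch, not the measurement's setup: it assumes HikariCP on the classpath, and the JDBC URL and pool size are placeholders.

```java
import com.zaxxer.hikari.HikariConfig;

public class PoolConfigSketch {
    public static HikariConfig tuned() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:mysql://localhost:3306/app"); // placeholder
        cfg.setMaximumPoolSize(10);
        cfg.setConnectionTimeout(30_000);  // 30s — Hikari default; becomes parkNanos(30_000_000_000L)
        cfg.setIdleTimeout(600_000);       // 10min — housekeeper eviction threshold
        cfg.setMaxLifetime(1_680_000);     // 28min — below MySQL wait_timeout to dodge the "closes first" race
        cfg.setValidationTimeout(5_000);   // 5s — Connection.isValid(5)
        return cfg;
    }
}
```

Pair this with the hikaricp_pending_threads alert from Section 7.1; the timeout value only decides how the exhaustion surfaces, not whether it happens.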
5. The JVM Thread State Machine — 6 States and Transitions
Reading Thread.State: TIMED_WAITING (parked) in a dump requires understanding the Thread.State machine.
5.1 6 States (Oracle Thread.State javadoc)
| State | Meaning | Where you see it in dumps |
|---|---|---|
| NEW | After construction, before start() | Almost never seen (ephemeral) |
| RUNNABLE | JVM-classified as runnable (the OS may have scheduled it out anyway) | Healthy work threads / native I/O waits (the paradox) |
| BLOCKED | Waiting on a synchronized monitor | synchronized entry / native lock waits |
| WAITING | Untimed wait (Object.wait(), LockSupport.park()) | Untimed await — take() / put() |
| TIMED_WAITING | Timed wait (wait(N), parkNanos(N), sleep(N)) | Pool exhaustion — the parkNanos call |
| TERMINATED | After run() returns | Not visible (reaped) |
5.2 Transition Diagram
stateDiagram-v2
[*] --> NEW
NEW --> RUNNABLE: start()
RUNNABLE --> BLOCKED: synchronized entry contention
BLOCKED --> RUNNABLE: monitor acquired
RUNNABLE --> WAITING: wait() / park()
WAITING --> RUNNABLE: notify() / unpark()
RUNNABLE --> TIMED_WAITING: wait(N) / parkNanos(N) / sleep(N)
TIMED_WAITING --> RUNNABLE: timeout / notify() / unpark()
RUNNABLE --> TERMINATED: run() returns
BLOCKED --> TERMINATED: exception
WAITING --> TERMINATED: interrupt
TIMED_WAITING --> TERMINATED: interrupt
TERMINATED --> [*]
5.3 The Exact Place Where a Pool-Exhausted Thread Sits
Path:
RUNNABLE
→ (calls dataSource.getConnection())
→ ConcurrentBag.borrow() steps 1, 2 fail (pool empty)
→ handoffQueue.poll(1s, NANOSECONDS)
→ SynchronousQueue.TransferStack.awaitFulfill(...)
→ LockSupport.parkNanos(blocker, 1_000_000_000L)
→ Unsafe.park (native)
→ OS thread scheduled out
⇒ Thread.State == TIMED_WAITING (parked)
The thread sits in TIMED_WAITING for up to 1 second, then exits one of two ways:
- Another thread returns a connection → SynchronousQueue.put → unpark(this) ⇒ TIMED_WAITING → RUNNABLE
- 1-second timeout → auto-wake → poll() returns null ⇒ TIMED_WAITING → RUNNABLE → throw SQLTransientConnectionException
In the pool-exhaustion measurement’s Run #2, 50 threads exit via path (2) (50 pool timeouts). In Run #1 (timeout 5s), every thread takes path (1) — wave by wave, threads get unparked.
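The transitions above can be watched from plain Java. A sketch (class and thread names are mine) driving threads through NEW, WAITING, TIMED_WAITING, and TERMINATED while reading getState():

```java
public class StateTransitions {
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();

        Thread waiting = new Thread(() -> {
            synchronized (lock) {
                try { lock.wait(); }                  // untimed wait
                catch (InterruptedException ignored) {}
            }
        }, "waiting-demo");
        Thread timed = new Thread(() -> {
            try { Thread.sleep(10_000); }             // timed wait
            catch (InterruptedException ignored) {}
        }, "timed-demo");

        System.out.println(waiting.getState()); // NEW — constructed, not started

        waiting.start();
        timed.start();
        Thread.sleep(200);                      // let both reach their wait points

        System.out.println(waiting.getState()); // WAITING — Object.wait() with no timeout
        System.out.println(timed.getState());   // TIMED_WAITING — sleep(N)

        waiting.interrupt();                    // WAITING → RUNNABLE → run() returns
        timed.interrupt();
        waiting.join(1000);
        timed.join(1000);
        System.out.println(waiting.getState()); // TERMINATED
    }
}
```

The same getState() call is what ThreadMXBean reports, which is why dump-based state histograms and live metrics always agree.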
5.4 The RUNNABLE Trap — JVM’s Logical State vs the OS’s Actual State
The most confusing part of dump reading.
RUNNABLE does not mean “currently executing on a CPU”. It means the JVM has classified the thread as runnable — the OS might have scheduled it out, or it might be stuck inside native I/O (a socket read).
"http-nio-8080-exec-1"
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(java.base@21/Native Method)
This thread is RUNNABLE while OS-blocked in a socket read. From the JVM’s perspective, it’s RUNNABLE because this isn’t a JVM-managed wait — it’s native I/O. A common point of confusion in production dump analysis.
Heuristic:
| Top of stack | Real state |
|---|---|
| socketRead0 / epoll_wait / Native Method | OS-level block (DB / network I/O wait) |
| LockSupport.park* | JVM-level park (Lock / SynchronousQueue / etc.) |
| Business method | Genuinely executing |
When reading dumps, don’t look at Thread.State alone — always read the top frame of the stack too.
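The RUNNABLE trap is reproducible without a database. A sketch (names are mine) that blocks a thread in ServerSocket.accept() — native I/O — and shows the JVM still reporting RUNNABLE:

```java
import java.net.ServerSocket;

public class RunnableTrapDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0); // ephemeral port; no client ever connects
        Thread acceptor = new Thread(() -> {
            try {
                server.accept().close();            // blocks inside a native poll/accept call
            } catch (Exception ignored) {           // closing the server socket unblocks us
            }
        }, "acceptor");
        acceptor.start();
        Thread.sleep(300);                          // let it reach the native call

        // OS-blocked on I/O, yet the JVM classifies it as RUNNABLE — the trap
        System.out.println(acceptor.getState());    // RUNNABLE

        server.close();                             // accept() throws → thread exits
        acceptor.join(1000);
    }
}
```

A dump taken during the sleep would show this thread RUNNABLE with a native accept frame on top — the same shape as the socketRead0 stack in Section 3.1.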
6. The three-prescription comparison’s 9 Scenarios → How the Thread Dump Changes
How does the dump differ for Simple Split / Saga / Outbox from the sister post? A short JVM-side comparison.
6.1 Per-Pattern Dump Signatures
| Pattern | Worker thread state distribution | Auxiliary threads |
|---|---|---|
| No split (Step 1 baseline) | RUNNABLE 10 (in external call) + TIMED_WAITING 50 (parkNanos waiting on pool) | — |
| Simple split | RUNNABLE 10 (in external call) + WAITING 50 (sleeping outside the transaction) | — |
| Saga | RUNNABLE 10 + WAITING 50 + sweeper 1 (TIMED_WAITING Thread.sleep) | sweeper |
| Outbox | RUNNABLE 0 / all workers terminate after ACK + poller 1 (socketRead0 on external call) | poller |
Only No split has 50 threads parked inside HikariCP. The split patterns sleep or terminate outside Hikari during the external call.
6.2 The Threads Disappearing Effect of Outbox
No split (60 workers concurrent):
─────────────────────────────────────────────
RUNNABLE ██████████ 10
TIMED_WAITING ██████████████████████████████████████████████████ 50
─────────────────────────────────────────────
STACKS : 60 worker stacks resident (native memory, ~1MB each)
Outbox (60 workers, immediate ACK):
─────────────────────────────────────────────
RUNNABLE ██ 2 (poller 1 + housekeeper 1)
TIMED_WAITING █████ 5
─────────────────────────────────────────────
STACKS : worker stacks unwind immediately; their request-scope objects become GC-eligible
The thread count itself drops. From a JVM stack-memory standpoint, this is also significant — 60 stacks (1MB each by default) = 60MB saved.
6.3 The Saga Sweeper Thread
A separate thread loops every 5 seconds and runs an UPDATE. In the dump:
"saga-sweeper-1" #87 prio=5
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep0(java.base@21/Native Method)
at java.lang.Thread.sleep(...)
at com.example.SagaSweeper.run(SagaSweeper.java:45)
Thread.sleep is a native timed wait and behaves much like parkNanos (the OS thread is scheduled out for the duration). In the dump it is distinguished by the (sleeping) qualifier.
6.4 The Dump-Side View of the three-prescription comparison’s A/OFF awaiting=57 [measured]
In Section 3.1 of the sister post, we measured “after sleep(3,000ms) ends, 60 workers issue INSERTs simultaneously → pool of 10 saturated → awaiting=50+ spike”.
A dump captured at that instant:
RUNNABLE ██████████ 10 (running INSERT)
TIMED_WAITING █████████████████████████████████████████████████████████ 57 (parkNanos)
─────────────────────────────────────────────
Duration: ~50ms (10 INSERTs × 5ms each)
A momentary spike. After 50ms the pool drains and the dump returns to normal. Whether you catch it depends on capture timing luck — which is exactly why Section 2.3 stresses three captures.
7. Operational Monitoring — JVM Metrics + Automated Thread Dump Capture
Dumps are post-hoc analysis tools. The real-time signal must come from metrics.
7.1 JVM Metrics for Grafana (Prometheus + Micrometer)
| Metric | Threshold | Meaning |
|---|---|---|
| hikaricp_pending_threads | > 0 (sustained) | awaitingConnection — direct signal of pool exhaustion |
| hikaricp_active_connections / hikaricp_max | > 0.8 (sustained) | Utilization above 80% |
| jvm_threads_states_threads{state="timed-waiting"} | spike detection | TIMED_WAITING surge |
| jvm_threads_states_threads{state="blocked"} | > 5 | synchronized contention |
| jvm_gc_pause_seconds | P99 > 200ms | Anomalous GC pause |
| jvm_memory_used_bytes{area="heap"} | > 0.85 × max | OOM risk |
| process_cpu_usage | sustained > 0.8 | CPU saturation |
The exact definition of a “pool exhaustion alert”:
hikaricp_pending_threads > 0 for 30s
AND hikaricp_active_connections == hikaricp_max
→ Ignore momentary spikes (50ms); alert only on 30-second persistence. This threshold also naturally filters out the A/OFF spike from Section 6.4.
7.2 Automated Thread Dump Capture — Pull Dumps at Alert Trigger
A production best practice — you need the dump at the moment the alert fires. A dump pulled by a human SSH-ing in 30 minutes later is already past the incident window.
Three implementation options:
1. Prometheus Alertmanager → webhook → dump capture script

   #!/bin/bash
   # /etc/alertmanager/scripts/capture-dump.sh
   PID=$(pgrep -f "java.*app.jar")
   for i in 1 2 3; do
     jcmd $PID Thread.print > /var/log/dumps/dump-$(date +%s)-$i.txt
     sleep 5
   done
   # Upload to S3 or central log store

2. JFR continuous recording — extract a dump on alert

   # Boot-time
   java -XX:StartFlightRecording=name=cont,maxsize=200M,disk=true ...
   # On alert
   jcmd $PID JFR.dump name=cont filename=/var/log/jfr/snap.jfr

3. Datadog / NewRelic / Pinpoint APM — Continuous Profiling
   - Datadog Java Profiler: thread dump + lock contention + allocation, automatically
   - 5-minute window automatically retained at alert time
   - Pros: minimal infrastructure burden / Cons: cost + lock-in
7.3 Spring Actuator + Custom Endpoint
@RestController
public class DumpController {
@GetMapping("/admin/threaddump")
public Map<String, Object> dump() {
var bean = ManagementFactory.getThreadMXBean();
var infos = bean.dumpAllThreads(true, true);
return Map.of(
"timestamp", Instant.now(),
"threads", Arrays.stream(infos)
.map(t -> Map.of(
"name", t.getThreadName(),
"state", t.getThreadState().name(),
"stack", Arrays.stream(t.getStackTrace())
.map(StackTraceElement::toString)
.toList()
)).toList()
);
}
}
JSON-shaped, friendly to automated analysis. You can immediately compute count-by-state / threads sharing the same stack / threads holding HikariCP frames as metrics.
8. Production Failure Scenarios (3 AM Edition)
8.1 Sudden Pool Exhaustion — First 5 Minutes
Alert: hikaricp_pending_threads > 0 for 30s + P99 spike
| Min | Action | Tool |
|---|---|---|
| 0:00 | Receive alert | PagerDuty |
| 0:01 | Verify automated dump capture (did the webhook fire?) | S3 / central logs |
| 0:02 | Inspect dump’s thread state distribution — confirm TIMED_WAITING 50+ → pool exhaustion confirmed | Dump analyzer |
| 0:03 | Look at frames just above parkNanos — OrderService.confirm:38 → suspect external call | Dump |
| 0:04 | Check external PG status page / latency metrics | Grafana |
| 0:05 | Confirm external mean 200ms → 5,000ms → root cause confirmed | Metrics |
Five minutes to root cause. Without automated dump capture, this stretches to 30 minutes.
8.2 Suspected Memory Leak — JFR 30-Second Capture
Alert: heap usage approaching 95% + GC frequency rising
# Capture immediately, no need to enter the production stack
jcmd <pid> JFR.start name=leak duration=30s filename=/tmp/leak.jfr
sleep 30
# Analyze with JMC (Java Mission Control) — Old Object Sample / Allocation Profile
JFR’s Old Object Sample event tells you directly which class is not being GC’d. You can chase a leak without taking a multi-GB heap dump.
8.3 GC Pause Co-Occurrence — Enable -Xlog:gc*
Alert: P99 latency 200ms → 2,000ms + heap usage normal
A dump alone isn’t enough. During GC, all threads briefly pause (STW — Stop The World) — and you can’t take a dump in that instant either.
# At boot (if you can restart)
java -Xlog:gc*:file=/var/log/gc.log:time,uptime,level,tags ...
# At runtime (no restart needed)
jcmd <pid> VM.log decorators=time,level output=/var/log/gc.log what=gc*=info
GC log analysis:
[2026-05-03T03:14:23.456+09:00][info][gc] GC(42) Pause Young (G1 Evacuation Pause) 245ms
  ↑ already elevated — young pauses should normally stay under 100ms
[2026-05-03T03:15:01.891+09:00][info][gc] GC(43) Pause Full (G1 Compaction Pause) 2,340ms
  ↑ abnormal — a Full GC fired; suspect Old-gen pressure or fragmentation
If Full GCs occur back to back, heap is short → bump -Xmx or chase the leak (Section 8.2).
9. Big-Tech Cases — Real-World Dumps / GC / Concurrency
How the same patterns measured in this post have been handled in the wild.
9.1 Toss SLASH — A Week to the Customer (Distributed Lock + JPA OptimisticLock)
haon.blog SLASH22 — broker issue / concurrency / network latency covers a Toss case:
- External broker latency increased → JPA @Version (optimistic lock) conflict spike
- Thread dump analysis → contention inside EntityManager.flush() version comparisons
- Resolution: distributed lock + retry policy adjustment
The same structural incident as “external call latency rises → pool exhaustion” in this post. The broker plays the role of outbox publish; JPA plays the role of the confirm transaction.
9.2 KakaoPay — JPA Transactional readOnly + set_option, +58% QPS
tech.kakaopay.com — JPA Transactional readOnly — a single transaction-attribute change yields 58% QPS gain.
- The effect of @Transactional(readOnly = true): skips the MySQL set autocommit=0 round-trip
- Pool occupancy time drops → awaitingConnection stays at 0 in steady state
- Thread dump analysis surfaced the unnecessary transactions
The same mechanism as “pool occupancy = duration of external call”. readOnly is the same lever applied differently.
9.3 Netflix — Java in Flames
Netflix Tech Blog — Java in Flames — async-profiler + flame graph in production.
- async-profiler runs continuously across every JVM in production
- Four flame graphs: CPU / wall-clock / lock contention / allocation
- The key insight beyond a thread dump is the time axis — how often you sit in this frame
This is exactly the async-profiler tool from Section 2.1, generalized as a production-wide best practice.
9.4 Uber — JVM Profiler (Open Source)
github.com/uber-common/jvm-profiler — Uber’s distributed JVM profiler.
- Profiles thousands of executors simultaneously across Spark / Flink
- Unified collection of thread dumps + GC + memory + CPU
- Publishes to Kafka → centralized analysis
The Section 7.2 “automated dump capture” pattern, scaled out for distributed environments.
9.5 Datadog — Continuous Profiling for Java
docs.datadoghq.com — Profiler — JFR-based continuous profiling SaaS.
- Wall Time view: how long each method waits — directly identifies pool exhaustion
- Lock Hold Time view: lock-hold time — directly surfaces synchronized contention
- 5-minute window auto-retained at alert time — option (3) in Section 7.2
This post’s dump analysis, fully automated and continuous. A cost-vs-operational-burden trade-off.
9.6 Woowahan — DB Connection Holding Trap
techblog.woowahan.com — MySQL Distributed Lock GET_LOCK Same-Connection Trap — a distributed lock holding the same connection, exhausting the pool.
- GET_LOCK('key', timeout) is bound to the calling connection. Other work on the same connection deadlocks.
- The dump shows every thread stuck inside GET_LOCK.
- Resolution: route the distributed lock through a separate pool, or switch to Redisson.
A different take on this post’s lesson that “the DB Connection becomes a synchronization resource”.
10. Recap — putting this article in your own words
If someone who just finished this article were to summarize it through four core questions, here’s how the measurements answer them.
Q. “When the pool exhaustion alert fires, what do you capture first?”
What this article showed by measurement is — take a thread dump 3 times at 5-second intervals (jcmd <pid> Thread.print). A single dump can’t distinguish momentary from persistent. If the same threads sit in the same stack frames across all three, they are genuinely stuck. Then look at the thread state distribution — if TIMED_WAITING (parked) spikes, pool exhaustion; if BLOCKED spikes, synchronized contention; if RUNNABLE with socketRead0 on top of stack, external I/O wait. In this article’s transaction-with-external-call pool-exhaustion measurement [measured] Run #2, all 50 workers shared a LockSupport.parkNanos and ConcurrentBag.borrow stack — pool exhaustion confirmed from a single dump.
Q. “What does the PARKED state in a thread dump actually mean?”
What this article defined is — it’s a substate of TIMED_WAITING in the JVM’s 6-state model. You enter it via LockSupport.parkNanos(blocker, nanos). Internally that calls Unsafe.park(false, nanos) — the OS thread is genuinely scheduled out. CPU usage drops to 0. The wake conditions are (a) timeout, (b) another thread calls unpark(thread), (c) interrupt, (d) spurious wakeup. For HikariCP it’s (a) or (b) — if a connection is returned, (b) wakes you up to RUNNABLE; if not, (a) wakes you up only to throw SQLTransientConnectionException. The parking to wait for <0x...> line in the dump tells you which object is the blocker — for HikariCP it’s SynchronousQueue$TransferStack. Those two pieces confirm pool exhaustion.
Q. “Why does HikariCP use SynchronousQueue?”
What this article traced is — SynchronousQueue is a BlockingQueue with capacity 0. put() waits for another thread to take(), and take() waits for another thread to put() — nothing is ever stored in the queue. That’s exactly the right shape for handing off a connection. First, zero-copy hand-off — connections are never stored, so zero GC pressure. Second, with SynchronousQueue(true) you get FIFO fairness — the first thread to wait gets served first. Third, poll(0) returns null immediately on an empty queue — the common path (pool not empty) stays fast. Even in the pool-exhaustion measurement the 50 awaiting threads all parked precisely inside SynchronousQueue.poll(timeout, NANOSECONDS).
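The three properties — zero storage, the fast non-blocking `poll()`, and the direct hand-off — are easy to demonstrate with the JDK class itself. A minimal sketch (the class name and the "connection#1" string are illustrative, not HikariCP code):

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

public class HandOffDemo {
    // One borrower, one returner, nothing stored in between: a direct hand-off.
    static String handOff() throws InterruptedException {
        SynchronousQueue<String> q = new SynchronousQueue<>(true); // fair = FIFO for waiters

        // Fast path: on an empty queue, poll() returns null immediately instead of blocking.
        System.out.println(q.poll()); // null — a capacity-0 queue never stores anything

        // Hand-off path: the "returning" thread offers; the parked "borrower" receives directly.
        Thread returner = new Thread(() -> {
            try { q.offer("connection#1", 1, TimeUnit.SECONDS); }
            catch (InterruptedException ignored) { }
        });
        returner.start();
        String conn = q.poll(1, TimeUnit.SECONDS); // parks (parkNanos) until the offer arrives
        returner.join();
        return conn;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(handOff()); // connection#1
    }
}
```

The timed `poll` here is exactly the frame the 50 awaiting threads were parked in during the measurement — the element travels thread-to-thread, never through a buffer.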
Q. “What can a thread dump not tell you?”
What this article catalogued as blind spots are three things. First, GC pauses — during STW all threads pause, and you can’t take a dump in that instant either. You need -Xlog:gc* and a separate GC log. Second, memory leaks — a dump doesn’t tell you which objects are held in the heap. You need jmap -dump (heap dump) or JFR Old Object Sample. Third, the time axis — a dump is a single snapshot. How often you sit in a frame is invisible. A wall-clock profiler like async-profiler or Datadog Continuous Profiling shows the time fraction. In production you run dump + GC log + JFR + APM in concert — any single tool always leaves a blind spot.
11. What I Learned
11.1 The One-Liner
“A pool exhaustion alert isn’t an application-code bug — it’s a JVM-internal state where threads are stuck inside `LockSupport.parkNanos`.” A single dump proves this line by line. The transaction-with-external-call pool-exhaustion measurement [measured]: 50 workers sharing the `ConcurrentBag.borrow` → `SynchronousQueue.poll` → `LockSupport.parkNanos` → `Unsafe.park` stack is the evidence.
11.2 Assumptions Broken by Measurement
- “Pool exhaustion = code bug” → No — code is fine; this is a function of external dependency latency × pool size.
- “RUNNABLE = currently running” → Half-truth — it’s a logical JVM classification. It might be native I/O.
- “One dump is enough” → No — three captures at 5-second intervals are the minimum unit for distinguishing momentary from persistent.
- “Hikari uses a SynchronousQueue, that’s just a simple queue, right?” → No — capacity-0 hand-off / FIFO / zero-copy are intentional.
11.3 Where This Post Sits in the JVM Mastery Series
Part 1 (the flagship) of the JVM/Java Mastery series — a deep-dive that ties the operational graph to JVM mechanics through one incident. Continues in:
- Part 2 — Java Concurrency (`synchronized` / `Lock` / `Atomic` / `LongAdder`) — extending this post’s lesson that the DB Connection becomes a synchronization resource into explicit synchronization primitives
- Part 3 — JVM Memory Layout (Heap / Metaspace / Direct / Stack) — broadening this post’s 60MB of thread stacks into the full memory geography
- Part 4 — GC Algorithms (G1 / ZGC / Shenandoah) — the proper measurement of Section 8.3’s GC pause co-occurrence scenario
- Part 8 — CompletableFuture — async fan-out measurements for the Outbox poller in this post
- Part 10 — JFR / async-profiler — the full-blown version of Section 7~Section 8’s monitoring + automated dump capture
- Part 11 — Virtual Threads (Loom) — what happens to parkNanos under carrier thread pinning
12. In the Next Post
- The lock comparison measurement (optimistic / pessimistic / GET_LOCK / Redisson) — thread state differences between `synchronized` and `ReentrantLock` (planned in a follow-up series)
- Spring Batch 1M-row backfill — G1 vs ZGC pause distribution (follow-up measurement planned)
- Coroutines vs Virtual Thread comparison — Virtual Thread vs Coroutines for 100k I/O — how `parkNanos` behaves on a carrier thread (follow-up measurement planned)
References
Specifications and Source
- Oracle — Thread.State javadoc (Java 21) — definitions of the 6 states
- Oracle — Troubleshooting Guide for HotSpot VM (Java 21) — dump / JFR / heap dump tooling
- HikariCP `ConcurrentBag.java` — borrow / requite
- HikariCP wiki — About Pool Sizing — pool-size formula
- OpenJDK `SynchronousQueue.java` — TransferStack / TransferQueue
- OpenJDK `LockSupport.java` — park / parkNanos
- JEP 328 — Flight Recorder — JFR overhead < 1%
Big-Tech Cases
- Toss SLASH22 — Broker Issue / Concurrency / Network Latency — JPA OptimisticLock incident
- KakaoPay — JPA Transactional readOnly QPS +58% — readOnly transaction effect
- Netflix Tech Blog — Java in Flames — async-profiler in production
- Uber — JVM Profiler (open source) — distributed JVM profiler
- Datadog — Continuous Profiling for Java — JFR-based SaaS profiler
- Woowahan — MySQL Distributed Lock GET_LOCK Same-Connection Trap — connection-bound lock incident
- NAVER D2 — Understanding Commons DBCP — pool size + TPS calculation
Authors and Textbooks
- Java Concurrency in Practice — Brian Goetz (Ch. 10 Liveness / Ch. 13 Explicit Locks)
- Aleksey Shipilëv — Synchronization Revisited — `synchronized` lock states
- Doug Lea — A Java Fork/Join Framework — work-stealing
- Ron Pressler — Project Loom Slide — virtual thread park mechanics
Sister Post
- Spring Transactions and External API Calls — Reproducing Pool Exhaustion and Comparing Three Remedies (Simple Split, Saga, Outbox) by Measurement — the same transaction-with-external-call pool-exhaustion measurement from a business pattern angle
NDA Guardrails: All measurements in this post are labeled `[measured — Java/Spring]`, the external platform is abstracted as `PlatformA` (further generalized for the blog), and no internal company code paths are referenced.