Forward Engineering

RDB Mastery #1 — InnoDB Index Internals: From No-Index to Multi-Index, the Real Picture B-trees Draw



Intro — Even without an index, InnoDB already stores rows sorted inside a B-tree

Imagine you get this question in an interview: “If a table has no index, how is it stored?” Two common answers:

  1. “It piles up on disk in INSERT order.”
  2. “It just goes in unsorted, in the order of arrival.”

Both are wrong inside InnoDB. InnoDB stores rows inside a B-tree, already sorted, even when you don’t define an index. If a PK is declared, the PK becomes the tree’s key. If there is no PK, InnoDB auto-generates a 6-byte hidden ROWID and uses that as the key. A table with zero indexes does not exist inside InnoDB. There is always at least one B-tree.

Once that single fact wobbles, every conclusion built on top of it wobbles too: what a “full table scan” actually reads, why a secondary lookup costs two walks, why OFFSET collapses, and why cursors stay flat.

This post unwinds each of those claims with 10 diagrams + 10M-row [measured — Java/Spring] numbers.


1. The basic units of InnoDB — Page (16KB) / Row / Index

1.1 page = InnoDB’s physical unit

InnoDB handles all data in pages (16KB by default). Disk reads are 16KB. Buffer-pool slots are 16KB. B-tree nodes are 16KB.

MySQL — InnoDB Disk Layout states:

“InnoDB stores all data in pages. The page size is fixed at 16KB by default and is determined by the innodb_page_size variable at the time the MySQL instance is initialized.”

→ Reading a single row still pulls the entire 16KB page that contains it. This is why random I/O is expensive: the minimum unit of any read is one 16KB page, no matter how small the row.
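The arithmetic behind the “~100 rows / page” figure can be sketched in a few lines — a rough model using the header/trailer sizes from diagram 1 and a hypothetical ~150-byte average row (both the fill factor and the row width are assumptions):

```python
PAGE_SIZE = 16 * 1024                 # innodb_page_size default
PAGE_OVERHEAD = 38 + 56 + 26 + 8      # FIL header + page header + infimum/supremum + FIL trailer

def rows_per_page(avg_row_bytes, fill_factor=0.9):
    """Rough rows per 16KB leaf page; fill_factor models free space kept for inserts."""
    usable = (PAGE_SIZE - PAGE_OVERHEAD) * fill_factor
    return int(usable // avg_row_bytes)

# with a hypothetical ~150-byte average row:
print(rows_per_page(150))   # → 97, i.e. the "~100 rows / page" ballpark
```

Double the row width and the count roughly halves — which is why wide rows make every structure in this post (tree height, scan cost, buffer-pool pressure) worse.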

1.2 row = a record inside a page

A page holds many rows in PK-sorted order. Diagram 1 — ASCII layout of a single page:

┌──────────────────── page (16KB) ────────────────────┐
│ FIL Header (38B): checksum / page_no / prev / next  │
├─────────────────────────────────────────────────────┤
│ Page Header (56B): n_recs / free space / index_id   │
├─────────────────────────────────────────────────────┤
│ Infimum / Supremum (two virtual records)            │
├─────────────────────────────────────────────────────┤
│ User Records (actual rows — sorted by PK)           │
│  ┌─────────────────────────────────────┐           │
│  │ Row #1: id=1001, name='A', amount=… │           │
│  ├─────────────────────────────────────┤           │
│  │ Row #2: id=1002, name='B', amount=… │           │
│  ├─────────────────────────────────────┤           │
│  │ Row #3: id=1003, name='C', amount=… │           │
│  └─────────────────────────────────────┘           │
│  ... ~100 rows / page (depending on row size)       │
├─────────────────────────────────────────────────────┤
│ Free Space (room for future INSERTs)                │
├─────────────────────────────────────────────────────┤
│ Page Directory (slot array — for binary search)     │
├─────────────────────────────────────────────────────┤
│ FIL Trailer (8B): checksum verification             │
└─────────────────────────────────────────────────────┘

→ Reading diagram 1: user records inside the page are sorted by PK (even when no index is declared). The page directory is a sparse slot array that lets you do binary search inside the page — instead of comparing 100 rows linearly, you compare log(N) times to land on a single row.

The prev / next pointers form a doubly-linked list that connects this page to its siblings at the same B-tree level. This is the physical foundation for the reverse scan in Section 5.

MySQL — InnoDB Row Format plus Jeremy Cole — InnoDB Page Anatomy cover the byte-level breakdown of a page in the most detail.

1.3 An index = pages connected as a B+-tree

A one-line recap of what we’ve covered so far — a page is InnoDB’s 16KB physical unit, rows live inside it sorted by PK, and each page is connected to its siblings via prev / next pointers. Now: how those pages are wired into parent-child relationships to form an index.

If a page holds 100 rows, then 10M rows = 100K pages. Those 100K pages are wired into a B+-tree with root → internal → leaf levels. The height is typically 3–4: even with 10M rows, 3–4 page reads reach any single row.

This is the first principle of index design: disk seek count = tree height + leaf scan. With an index hit, disk seeks = 3–4. With a full scan, disk seeks = 100K. That gap is the source of 1,000x–10,000x latency differences.

→ Memory and logging aspects of InnoDB (buffer pool / redo log / undo log) are covered in MySQL InnoDB Architecture Deep Dive. This post sticks to index structure.


2. Every InnoDB table is a B-tree — the Clustered Index

2.0 Definition — three meanings packed into one sentence

Clustered Index = a single B-tree where the table’s data itself is stored, sorted by the key.

Three things are bundled into that one sentence:

  1. A single B-tree — there is exactly one clustered index per table.
  2. Where the table’s data itself is stored — the leaves are the rows, not pointers to them.
  3. Sorted by the key — the PK (or hidden ROWID) dictates the physical order.

→ Everything below follows from that definition.

Sections 2.1 and onward unpack each of those three meanings in order.

2.1 With a PK: PK = clustered index = the table itself

The first fundamental fact about InnoDB:

“Every InnoDB table is stored as a clustered index. If a PK is defined, the PK is the clustered index’s key. The leaf nodes of the clustered index hold PK + every column value — making it identical to the table itself.”

MySQL — Clustered and Secondary Indexes states:

“Each InnoDB table has a special index called the clustered index where the data for the rows is stored. Typically, the clustered index is synonymous with the primary key.”

Diagram 2 — shape of a clustered-index B-tree:

graph TB
    subgraph "Clustered Index (= the table)"
        Root["Root Page<br/>key range: id 1~10M"]
        I1["Internal Page<br/>id 1~5M"]
        I2["Internal Page<br/>id 5M+1~10M"]
        L1["Leaf Page<br/>id 1~100<br/>+ name, amount, ... full row"]
        L2["Leaf Page<br/>id 101~200<br/>+ full row"]
        L3["Leaf Page<br/>id 5M~5M+100<br/>+ full row"]
        L4["Leaf Page<br/>id 9.9M~10M<br/>+ full row"]
    end
    Root --> I1
    Root --> I2
    I1 --> L1
    I1 --> L2
    I2 --> L3
    I2 --> L4
    L1 <-.->|doubly linked| L2
    L2 <-.-> L3
    L3 <-.-> L4

→ Reading diagram 2: the key point is that leaf nodes contain the full row. The PK directly determines the row’s physical location. PK 1 and PK 2 sit on adjacent (or the same) leaf, while PK 1 and PK 9,999,999 sit on distant leaves. This is what “the table is physically sorted by PK” actually means.

2.2 Without a PK: hidden 6-byte ROWID is auto-generated

“What if I don’t define a PK?” — InnoDB silently generates a 6-byte hidden integer. MySQL — Clustered Index:

“If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted.”

→ The common myth “without a PK, the table is stored without an index” is wrong. A zero-index table cannot exist inside InnoDB. Without a PK there is hidden ROWID; with a UNIQUE NOT NULL column, that takes priority. Always exactly one clustered index.

The pitfall of hidden ROWID isn’t its size — at 6 bytes it’s actually smaller than BIGINT (8 bytes). The real pitfall is that it’s invisible to your application: you can’t SELECT it, can’t use it in a WHERE clause, and can’t hand it to an external system as a row identifier.

That’s why the operational standard is always an explicit PK. The reason id BIGINT PRIMARY KEY AUTO_INCREMENT became the canonical pattern boils down to three points:

  1. AUTO_INCREMENT avoids page splits. New rows always go to the last leaf of the clustered index, so the index stays sorted in INSERT order with almost no page splits. UUIDs / random PKs do the opposite — every INSERT lands at an arbitrary leaf, exploding fragmentation (which is exactly why sortable IDs like UUIDv7 / ULID emerged).
  2. BIGINT is overflow-safe. INT (4 bytes, signed max ~2.1 billion) hits the ceiling within 4–5 years in fast-growing domains (orders / event logs / tracking). BIGINT (8 bytes, ~9 quintillion) is effectively infinite. “Started with INT, then ALTER to BIGINT” is the #1 ops incident — start with BIGINT and you’re safe forever.
  3. External-system compatibility. JSON / Java long / TypeScript bigint are all 8-byte integer standards. Distributed systems / API serialization / cache keys — every boundary uses a consistent type.

→ The 6-byte storage win of hidden ROWID never outweighs those three operational benefits. That’s why the standard is BIGINT.
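Point 2’s “4–5 years” ceiling is plain arithmetic. A sketch with a hypothetical sustained insert rate (the 15/s rate is an assumption, not a measurement from this series):

```python
INT_MAX = 2_147_483_647            # signed 4-byte INT ceiling
SECONDS_PER_YEAR = 365 * 24 * 3600

def years_until_int_overflow(inserts_per_second):
    """Years until an AUTO_INCREMENT INT PK runs out at a steady insert rate."""
    return INT_MAX / (inserts_per_second * SECONDS_PER_YEAR)

# a hypothetical fast-growing orders table:
print(f"{years_until_int_overflow(15):.1f} years")   # → 4.5 years — inside the 4–5 year window
```

The same arithmetic for BIGINT’s ~9.2 quintillion ceiling yields billions of years at any realistic rate — “effectively infinite”.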

2.3 The truth about the .ibd file — the table body literally is B-tree pages

An InnoDB table is stored on disk as a single .ibd file (with innodb_file_per_table=ON by default). Open that file as binary, and what you see is a stream of B-tree pages laid out in 16KB units:

orders.ibd file layout (10M rows, ~1.6GB):
  Page 0   FSP Header (file-space metadata)
  Page 1   IBUF Bitmap
  Page 2   INODE
  Page 3   Clustered Index Root         ← B-tree entry point
  Page 4   Internal Page (id 1~5M)
  Page 5   Internal Page (id 5M+1~10M)
  Page 6   Leaf Page (id=1~100, + full row data)
  Page 7   Leaf Page (id=101~200, + full row data)
  ...
  Page N   Leaf Page (id=9.9M~10M)

  every page = exactly 16KB

→ “The table has data” is synonymous with “row bytes live inside the clustered-index leaf pages of this .ibd file”. There is no separate raw data file. mysqldump is ultimately just a tool that walks these leaf pages and serializes them row-by-row.

2.4 Implication — full table scan = clustered-index full scan

A direct consequence: in InnoDB, what people call a “full table scan” actually means walking the leaf level of the clustered index from start to end. PK-ordered walk. This is different from PostgreSQL’s heap scan (walk in physical insertion order, no sort guarantee).

type=ALL in EXPLAIN = clustered-index full scan. Revisited in Section 11.

2.5 Recap — Clustered Index = 1, Secondary Index = N

Everything Section 2 covered so far is about the clustered index. A common point of confusion for outside readers: the clustered index and secondary indexes are not the same kind of object — a table always has exactly one clustered index (the table itself) and zero or more secondary indexes beside it.

If you build 5 indexes on the same table orders, InnoDB holds 6 B-trees: 1 clustered (= the table) + 5 secondary.

The next section (Section 3) unpacks the two-lookup mechanism of secondary indexes.


3. Secondary Index — a separate B-tree that points to PK

3.1 Secondary-index leaves contain PK values

When you run CREATE INDEX idx_owner_id ON orders (owner_id), InnoDB builds a separate B-tree. Its leaf entries are:

(owner_id value, PK value)

Not the physical row location (page id + slot id). The PK value. This is the essence of the two-step lookup.

Diagram 3 — the two-stage relationship between secondary and clustered indexes:

graph TB
    subgraph step1 ["Step 1 — Secondary index walk (idx_owner_id)"]
        direction TB
        S_Root["Root<br/>owner_id 1~10K"]
        S_L1["Leaf<br/>owner_id=1234<br/>PK list: 5,000,001 / 5,000,123 / 8,234,567 …"]
        S_Root --> S_L1
    end
    subgraph step2 ["Step 2 — Clustered index walk (= the table itself)"]
        direction TB
        C_Root["Root<br/>PK 1~10M"]
        C_L1["Leaf PK=5,000,001<br/>+ full row"]
        C_L2["Leaf PK=5,000,123<br/>+ full row"]
        C_L3["Leaf PK=8,234,567<br/>+ full row"]
        C_Root --> C_L1
        C_Root --> C_L2
        C_Root --> C_L3
    end
    S_L1 -.->|PK lookup| C_Root

→ Reading diagram 3. For WHERE owner_id = 1234 AND amount > 1000:

  1. Step 1: find leaves with owner_id=1234 in the secondary index → get a list of PKs (5,000,001 / 5,000,123 / 8,234,567 …).
  2. Step 2: for each PK, look up the clustered index again → fetch the full row, evaluate amount > 1000.

That is the cost of a secondary-index lookup: one extra walk per PK — two walks total. Operationally this is the random-I/O cost. If 100 rows match owner_id=1234, that’s 100 clustered-index lookups. If those 100 rows are scattered across 100 different leaf pages, you pay 100 page seeks.
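That “100 matches → up to 100 page seeks” cost can be modeled as page-read counts. A simplified sketch — it assumes the upper tree levels stay cached and counts only leaf reads, which is not how InnoDB accounts internally but captures the scattered-vs-adjacent difference:

```python
import math

def lookup_page_reads(matches, scattered=True, rows_per_leaf=100, tree_height=4):
    """Step 1: one walk down the secondary tree (tree_height pages).
    Step 2: clustered lookups — worst case one leaf page per matching PK,
    best case the PKs are adjacent and share contiguous leaves."""
    step2 = matches if scattered else math.ceil(matches / rows_per_leaf)
    return tree_height + step2

print(lookup_page_reads(100))                   # → 104: scattered rows, 100 leaf seeks
print(lookup_page_reads(100, scattered=False))  # → 5: PK-adjacent rows share one leaf
```

This is why the physical distribution of matching PKs, not just the match count, decides whether a secondary lookup is cheap.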

3.2 Difference vs PostgreSQL — heap TID vs PK

PostgreSQL’s secondary indexes carry the heap TID (physical tuple location) in the leaf. One lookup. The leaf points straight to the heap tuple, no PK indirection.

Use The Index, Luke! — Anatomy of an Index lays out this contrast cleanly.

| | InnoDB (MySQL) | PostgreSQL |
|---|---|---|
| secondary leaf | (key, PK) | (key, TID = page+slot) |
| lookups | 2 (secondary → clustered) | 1 (secondary → heap) |
| on PK change | secondary unaffected (if PK stable) | (PG: PK is itself indexed like any secondary) |
| on row move (page split) | secondary unaffected | all secondary indexes updated (HOT optimization avoids some) |

→ A trade-off, not a winner. InnoDB pays the two-walk cost in exchange for secondary indexes being unaffected by row movement (as long as PK is stable). PostgreSQL is the opposite. Both are trade-offs.

3.3 PostgreSQL’s heap + index model — TID / page+slot / HOT, unwound

Let’s open up the internals of that difference, so the trade-off is clear to outside readers.

TID (Tuple Identifier)

A coordinate that physically points at one PG row:

TID = (block_number, item_number)
    = (the N-th 8KB page in the heap file, the K-th slot inside that page)

InnoDB has no equivalent of TID, no concept of a direct physical address. Rows are identified by their PK value.

Heap (PG) vs Clustered (InnoDB)

| | InnoDB | PostgreSQL |
|---|---|---|
| Table body | Clustered B-tree (sorted by PK) | Heap (no sort) |
| Data storage | Rows live inside clustered leaves | Rows live inside heap pages (in INSERT order) |
| Index leaf | (key, PK) | (key, TID) |
| Lookup | secondary → PK → clustered = 2 walks | secondary → TID → heap = 1 walk |

Page + slot (the inside of a PG heap page)

Inside a PG heap page (8KB):

Page Header (24B)
Item Pointer Array (slots)  ← grows downward from the top
  Slot 0: → tuple X
  Slot 1: → tuple Y
  Slot 2: → tuple Z
  ...
Free Space
Tuple Z (variable length)   ← grows upward from the bottom
Tuple Y
Tuple X

→ The slot is an indirection layer. If a tuple shifts within the same page, only the slot needs an update — the TID (page, slot) stays valid.
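The slot indirection can be shown with a toy model — a sketch, not PG’s actual page code; `HeapPage`, `insert`, and `move_within_page` are hypothetical names:

```python
class HeapPage:
    """Toy model of a PG heap page: the slot array indirects to tuple offsets."""
    def __init__(self):
        self.tuples = {}   # byte offset → tuple payload (toy representation)
        self.slots = []    # slot index → byte offset

    def insert(self, data, offset):
        self.tuples[offset] = data
        self.slots.append(offset)
        return len(self.slots) - 1          # the item_number half of the TID

    def move_within_page(self, slot, new_offset):
        """Compaction moves the tuple bytes; only the slot entry is rewritten."""
        old = self.slots[slot]
        self.tuples[new_offset] = self.tuples.pop(old)
        self.slots[slot] = new_offset

    def fetch(self, slot):
        return self.tuples[self.slots[slot]]

page = HeapPage()
item = page.insert(b"row-A", offset=8000)
page.move_within_page(item, new_offset=7600)  # tuple moved, TID (page, slot) unchanged
print(page.fetch(item))   # → b'row-A'
```

Every index entry pointing at `(this_page, item)` stays valid after the move — exactly the property the slot layer buys.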

Why PG must update every secondary index when a row moves

PG UPDATE is not in-place (MVCC):

  1. Mark the existing tuple as “deleted”.
  2. INSERT the new tuple at a different location — a new TID.
  3. Every index’s leaf entry must be updated to the new TID.

HOT (Heap-Only Tuple) optimization: same page + index columns unchanged → skip the index update.

InnoDB pays zero of this cost — since secondaries point only at the PK, row movement is irrelevant.

→ Net trade-off: PG wins on single-walk lookups but pays every-index update on UPDATE. InnoDB pays two-walk lookups but enjoys lighter UPDATEs.

3.4 [Measured — Java/Spring] — Q5 composite-index lookup cost

Q5 (WHERE owner_id=? AND state=? ORDER BY created_at DESC LIMIT 20):

| Step | actual time |
|---|---|
| Before (no index, full scan) | 1,497 ms |
| After (idx_owner_state_created composite) | 2.59 ms (577x faster) |

The composite (owner_id, state, created_at) leaf carries (owner_id, state, created_at, PK). With owner_id=1234 + state=‘CONFIRMED’, only 699 candidate rows survive, then created_at reverse scan + LIMIT 20 picks 20. Only those 20 trigger PK lookups.

That’s why “a composite index can feel almost as fast as a covering index”. Since PK rides along in every secondary leaf, ordering keys like (created_at, id) becomes covering for free.

3.5 Query → which path → what is read

Mapping the secondary-index lookup mechanisms of Section 3 onto concrete query examples:

| Query | Which index / path | What is read | Lookups |
|---|---|---|---|
| SELECT * FROM orders WHERE id = 100 | clustered (PK lookup) | leaf = full row | 1 (auto-covering) |
| SELECT * FROM orders | clustered full scan | every leaf | leaf count |
| SELECT id FROM orders WHERE owner_id = 1234 | secondary idx_owner_id only | leaf = (owner_id, PK) | 1 (covering) |
| SELECT amount FROM orders WHERE owner_id = 1234 | secondary → PK → clustered | leaf has no amount → back to clustered | 2 |
| SELECT id, owner_id FROM orders WHERE owner_id = 1234 | secondary idx_owner_id only | leaf has it all | 1 (covering) |

Using index in the EXPLAIN Extra column is the signal for rows 3 and 5 of this table (row 1 is auto-covering by definition, so EXPLAIN doesn’t flag it). If only Using where shows up, you’re looking at row 2 or 4.

Keep this mapping in your head as we move into Section 4 — the precise definition of a covering index.


4. Covering Index — an index where the answer lives in the leaf

4.1 Definition — not an index type, a state of the index × query combination

Covering Index = an index where every column the query requires is already inside the leaf.

The key nuance — “Covering” is not a type of index, it’s a state relative to a query. The same index can be covering for one query and non-covering for another:

Index idx_owner (owner_id) — leaf automatically holds (owner_id, PK)

Query A: SELECT id FROM orders WHERE owner_id = ?
  → required: {id=PK, owner_id} / leaf: {owner_id, PK}  → Covering ✅

Query B: SELECT amount FROM orders WHERE owner_id = ?
  → required: {amount, owner_id} / leaf: {owner_id, PK}  → amount missing ❌

The index itself doesn’t know whether it’s covering. The query decides at execution time.

Composite-index covering

idx_owner_state_amount (owner_id, state, amount)
leaf: (owner_id, state, amount, PK)

SELECT id, amount WHERE owner_id=? AND state=? → everything is in the leaf → covering ✅

→ The technique: analyze the queries you run most often → bundle every column they require into a composite index to make it covering.
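The covering check above is pure set containment: required columns ⊆ (index key columns + PK). A sketch of that rule — `is_covering` is an illustrative helper, not a MySQL API:

```python
def leaf_columns(index_cols, pk=("id",)):
    """InnoDB secondary leaves always carry the key columns plus the PK."""
    return set(index_cols) | set(pk)

def is_covering(query_cols, index_cols, pk=("id",)):
    """Covering is a property of the (index, query) pair, not of the index."""
    return set(query_cols) <= leaf_columns(index_cols, pk)

idx_owner = ("owner_id",)
print(is_covering({"id", "owner_id"}, idx_owner))       # → True  — Query A
print(is_covering({"amount", "owner_id"}, idx_owner))   # → False — Query B (amount missing)
```

Run your hottest queries through this rule, and the composite index that makes them covering falls out mechanically.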

PK-index auto-covering

The clustered (PK) leaf = the full row. Therefore a PK lookup is always automatically covering (for any query). However, the term “covering” usually refers specifically to the effect on a secondary index.

EXPLAIN signals

| Extra | meaning |
|---|---|
| Using index | Covering succeeded — secondary leaf alone answers the query |
| Using where (alone) | secondary → PK → clustered (two-lookup walk) |

In MySQL — EXPLAIN Output, the Extra column showing Using index is the covering signal.

Since InnoDB’s secondary indexes always include the PK in the leaf, a (created_at, id) index is automatically covering for SELECT id, created_at FROM .... id is the PK (already there), created_at is the index key (already there).

4.2 Diagram 4 — covering vs non-covering walk

sequenceDiagram
    participant Q as Query
    participant S as Secondary B-tree
    participant C as Clustered B-tree

    Note over Q,C: NOT Covering — two walks
    Q->>S: SELECT amount, name<br/>WHERE owner_id=1234
    S->>S: owner_id=1234 → list of PKs (100)
    S-->>Q: PK list
    loop 100 times
        Q->>C: lookup per PK
        C-->>Q: fetch amount, name
    end

    Note over Q,C: Covering — one walk
    Q->>S: SELECT id, created_at<br/>ORDER BY created_at DESC LIMIT 20
    S->>S: idx_created_at_id leaf reverse walk 20
    S-->>Q: 20 rows (clustered untouched)

→ Reading diagram 4. Non-covering = the answer is not in the secondary index, so you fetch the PK list and visit the clustered index again. With 100 rows, up to 100 page seeks. Covering = the secondary leaf contains the answer; clustered is untouched, page seeks drop sharply.

4.3 [Measured — Java/Spring] Q3 — the clearest case for covering

Q3 (SELECT id, created_at FROM orders_w2 ORDER BY created_at DESC LIMIT 20):

| Stage | actual time | rows processed |
|---|---|---|
| Before (no index → filesort) | 1,609 ms | 9,708,696 |
| After (idx_created_at_id, covering reverse scan) | 0.65 ms | 20 |

2,476x difference. This is the cleanest illustration of the covering effect.

The key EXPLAIN line:

type: index
key: idx_created_at_id
Extra: Using index; Backward index scan

Using index = covering. Backward index scan = walk the leaf linked list backward. Together, “sort 9.7M rows” collapses into “reverse-walk 20 rows”.

4.4 Combined with the leftmost-prefix rule

A composite index (a, b, c) has leaves (a, b, c, PK). Hence:

  1. WHERE a = ? → an index seek works (leftmost prefix satisfied).
  2. WHERE a = ? AND b = ? → a seek works and narrows further.
  3. WHERE b = ? alone → no seek possible; the tree is sorted by a first, so finding b means scanning the whole index.
  4. SELECT a, b, c, id is covering in all of these cases — all four columns live in the leaf.

→ “Is it covering?” and “Does WHERE narrow it efficiently?” are two different questions. Even if covering, a WHERE that violates the leftmost prefix forces a full index scan.

The companion post’s Section 3.2 (single-key cursor at 0.27ms) and Section 3.3 (OR-split cursor at 0.30ms) both run on top of this covering index. Same index, same 10M rows, different SQL shapes.


5. Physical walk on a B-tree — leaf doubly-linked list

5.1 Leaf nodes connected via prev / next pointers

The prev page id / next page id fields inside the page header (Section 1) do this work. The leaves of the B+-tree form a doubly-linked list. Moving to the immediate sibling at the same level is direct.

Diagram 5 — leaf doubly-linked list:

                           Root
                          /   \
                    Internal   Internal
                    /    \      /    \
              Leaf 1 ↔ Leaf 2 ↔ Leaf 3 ↔ Leaf 4 ↔ Leaf 5
              (id=1~100) (101~200) (201~300) (301~400) (401~500)

              ──────── forward (ASC) walk ────────→
              ←──────── backward (DESC) walk ────────

→ Reading diagram 5. Leaf 1’s next = leaf 2’s page id; leaf 2’s prev = leaf 1’s page id. ASC sort → walk left-to-right via next. DESC sort → walk right-to-left via prev.

5.2 Reverse-scan cost ≈ forward-scan cost

MySQL — Descending Indexes (8.0+):

“InnoDB supports descending index scans. With ascending index scans, the server scans index entries from low to high. With descending index scans, the server scans index entries from high to low. Performance is comparable for both directions.”

→ Forward and reverse are essentially equal cost. The only edge: read-ahead (predictive prefetch) is tuned for the forward direction, so reverse can be slightly (≤10%) slower. The measured difference is tiny.

5.3 The companion post’s (b) single-key cursor 0.27ms exercises this

The companion Section 3.2’s WHERE created_at < ? ORDER BY created_at DESC LIMIT 20 produced this EXPLAIN ANALYZE:

-> Limit: 20 row(s)
   -> Covering index range scan on orders_w2 using idx_created_at_id
      over (created_at < '2024-...') (reverse)
      (rows=20)

(reverse) is the signal that the leaf linked list is being walked backward. rows=20 = 20 next/prev hops and done. The reason 10M rows are irrelevant.


6. The four walk patterns side by side

Four kinds of walks can happen on a B-tree.

6.1 Diagram 6 — the four walks

┌──────────────────────────────────────────────────────────────┐
│ 1. Index Seek (point lookup)                                 │
│    Root → Internal → Leaf                                    │
│    cost: O(log N) = 3~4 page seeks                           │
│    result: 1 row                                             │
├──────────────────────────────────────────────────────────────┤
│ 2. Index Range Scan                                          │
│    Root → Internal → Leaf start                              │
│    → walk leaf linked list for N items                       │
│    cost: O(log N + N) pages                                  │
│    result: matching rows                                     │
├──────────────────────────────────────────────────────────────┤
│ 3. Full Index Scan                                           │
│    First leaf → end of linked list                           │
│    cost: O(every leaf)                                       │
│    result: all index rows                                    │
├──────────────────────────────────────────────────────────────┤
│ 4. Full Table Scan (= Clustered Index Full Scan)             │
│    Clustered first leaf → end                                │
│    cost: O(every row)                                        │
│    result: every column of every row  (most expensive)       │
└──────────────────────────────────────────────────────────────┘

→ Reading diagram 6:

| walk | entry | exit | cost |
|---|---|---|---|
| Index Seek (point lookup) | binary-search root → leaf | 1 row | O(log N) ≈ 3-4 pages |
| Index Range Scan | binary-search to start leaf | walk linked list M times | O(log N + M) pages |
| Full Index Scan | first leaf (linked-list head) | until last leaf | O(leaf count) |
| Full Table Scan = Clustered Full Scan | first clustered leaf | end + all columns | most expensive |
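The four cost rows reduce to one small function — a leaf-centric sketch reusing the assumed ~100 rows per leaf and height 3 from Section 1 (full table and full index scans get the same page count here; in reality the clustered tree has far wider rows, hence fewer rows per leaf and more pages):

```python
import math

def walk_pages(kind, total_rows, match_rows=0, rows_per_leaf=100, height=3):
    """Rough page-access counts for the four B-tree walks."""
    leaves = math.ceil(total_rows / rows_per_leaf)
    if kind == "seek":          # root → internal → leaf, one row
        return height
    if kind == "range":         # seek to the start leaf, then follow next pointers
        return height + math.ceil(match_rows / rows_per_leaf) - 1
    if kind in ("full_index", "full_table"):   # every leaf, start to end
        return leaves
    raise ValueError(kind)

print(walk_pages("seek", 10_000_000))           # → 3
print(walk_pages("range", 10_000_000, 2_000))   # → 22
print(walk_pages("full_index", 10_000_000))     # → 100000
```

The 3 → 22 → 100,000 progression is the whole story of why walk type, not table size, dominates latency.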

6.2 [Measured — Java/Spring] mapping the 5 queries to the 4 walks

The five queries in this series:

| Q | walk type | actual time |
|---|---|---|
| Q1 (WHERE id = 5000000) | Index Seek (PK point lookup) | 0.042 ms |
| Q2 (WHERE created_at BETWEEN ... LIMIT 20) | Index Range Scan (after) | 13.5 ms |
| Q3 (ORDER BY created_at DESC LIMIT 20) | Index Range Scan + reverse + covering | 0.65 ms |
| Q4 (GROUP BY region_code) | Full Index Scan (after — covering, smaller) | 1,271 ms |
| Q5 (WHERE owner_id=? AND state=? ...) | Index Seek (composite) + reverse range scan | 2.59 ms |

→ Q4’s GROUP BY is still 1,271ms even as a full index scan. region_code has only 4 distinct values, so all 9.7M leaf entries must be walked and aggregated into groups. Faster than the full table scan (Before: 2,249ms), but the absolute time stays large — a full index scan still touches every leaf.


7. The OFFSET ceiling — why everything is read and discarded

7.1 B-trees do not maintain a row counter

OFFSET 1,000,000 = “skip the first 1,000,000 rows, return the next 20”. Intuitively: “just jump to row 1,000,001, right?” InnoDB doesn’t know that location.

Why. B-tree internal nodes only store key ranges: “id 1–5M is the left child / 5M+1–10M is the right child” — pure key-routing information. They do not store “how many rows live in this subtree”.

7.2 Why no counter — cost > benefit

“Couldn’t a counter make OFFSET fast?” — it could, but no engine adds one. Here’s why:

Every INSERT would have to bump counters on every internal node along the root path, and every DELETE would decrement them the same way. With tree height 4, one INSERT = 4 counter updates. Worse, those counters would be a contention point for every concurrent transaction — 100 simultaneous INSERTs would turn the root’s counter into a hot spot, collapsing throughput under lock contention.

The win on OFFSET is dwarfed by the loss on INSERT/DELETE concurrency. So the general B-tree index (as implemented in MySQL / PostgreSQL / Oracle / SQL Server etc.) does not maintain ordinal-position metadata. A fundamental B-tree trade-off.

Use The Index, Luke! — No Offset covers this OFFSET ceiling — rooted in the absence of ordinal-position metadata in B-tree indexes — in detail.

7.3 Diagram 7 — OFFSET sequential walk + discard

OFFSET 1,000,000 LIMIT 20 walk:

Leaf 1 → Leaf 2 → ... → Leaf 9,999 → Leaf 10,000 → Leaf 10,001
   |                                                     |
   ↓                                                     ↓
[ read 1,000,000 rows, throw away ]              [ read 20 rows → return ]

total page accesses ≈ 10,001 pages (16KB × 10,001 ≈ 160MB I/O or buffer pool hits)
total rows processed = 1,000,020
returned rows = 20

→ Reading diagram 7. Even on a covering index, you must read 10,001 leaf pages in order. The first 10,000 leaves are read and thrown away. InnoDB cannot skip.
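Diagram 7’s page count is reproducible from the same ~100-rows-per-leaf assumption, and contrasting it with a key-predicate entry (Section 8’s cursor) shows what “cannot skip” costs — a sketch, not InnoDB’s internal accounting:

```python
import math

def offset_pages(offset, limit=20, rows_per_leaf=100, height=3):
    """OFFSET walks (and discards) every leaf before the target window."""
    return (height - 1) + math.ceil((offset + limit) / rows_per_leaf)

def cursor_pages(limit=20, rows_per_leaf=100, height=3):
    """WHERE key < ? binary-searches straight to the start leaf."""
    return (height - 1) + math.ceil(limit / rows_per_leaf)

print(offset_pages(1_000_000))   # → 10003 page accesses for OFFSET 1M
print(cursor_pages())            # → 3 page accesses for the cursor
```

OFFSET’s page count grows linearly with the offset; the cursor’s is constant — the exact shape of the measurement table in 7.4.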

7.4 [Measured — Java/Spring] OFFSET 1M = 171ms

OFFSET-position vs latency, on top of a covering index:

| OFFSET | actual time | rows scanned |
|---|---|---|
| 1,000 | 0.443 ms | 1,020 |
| 100,000 | 23.4 ms | 100,020 |
| 1,000,000 | 171 ms | 1,000,020 |
| 5,000,000 | 765 ms | 5,000,020 |

OFFSET cost is exactly proportional to the number of rows read and discarded. Even with a covering index, you cannot skip. The companion Section 2 unwinds the page-level numbers; this post stays with the why at the B-tree mechanism layer.


8. Cursor — key comparison, the walk B-trees are built for

8.1 WHERE created_at < ? = the B-tree’s true primitive

OFFSET’s collapse and cursor’s speed are two sides of the same coin.

OFFSET asks for a positional index (“the N-th row”). B-trees don’t carry that — sequential walk is forced.

WHERE created_at < ? asks for a key comparison (“rows whose key is less than this”). That is what B-trees do best. Binary search root → internal → leaf, jump straight to the start.

Diagram 8 — cursor’s binary-search jump:

sequenceDiagram
    participant Q as Query
    participant R as Root Page
    participant I as Internal Page
    participant L as Leaf Page

    Q->>R: WHERE created_at < '2024-03-15 00:00:00'
    R->>R: binary search: which child range?
    R-->>I: Internal Page #234
    I->>I: binary search: which leaf range?
    I-->>L: Leaf Page #1,099,234
    L->>L: page-directory binary search inside the leaf<br/>→ row just before '2024-03-15 00:00:00'
    L->>L: walk prev backward from there<br/>read 20 rows and stop
    L-->>Q: return 20 rows

    Note over Q,L: page accesses = 4 (root + internal + leaf + 1~2 linked-list hops)
    Note over Q,L: rows processed = 20

→ Reading diagram 8. 4 page accesses / 20 rows processed. The reason 10M rows are irrelevant: cursor maps directly onto the B-tree’s natural primitive.

8.2 [Measured — Java/Spring] cursor 0.30ms

OR-split cursor measurement from companion Section 3.3:

| approach | actual time | rows scanned |
|---|---|---|
| OFFSET 1,000,000 | 171 ms | 1,000,020 |
| OR-split cursor | 0.30 ms | 20 |

About 570x. The two diagrams in Section 7 + Section 8 explain the gap. OFFSET = ~10,000 sequential leaf-page reads. Cursor = 4 page seeks (binary search).

The companion’s three-shape comparison ((a) row constructor 154ms / (b) single-key cursor 0.27ms / (c) OR-split 0.30ms) is about whether the optimizer pushes the predicate down. The B-tree mechanism itself delivers binary-search primitive whenever push-down works. This post focuses on the latter.

8.3 Page-level mechanism — disk → buffer pool → record

The real reason cursor is fast, expressed in page-access counts.

Cursor WHERE created_at < ? LIMIT 20

  1. Root → Internal → start leaf (binary search, 3~4 page accesses).
  2. The leaf page (16KB) holds ~100 records.
  3. Decode just the first 20 records inside that page → done.

→ Total page accesses: 3~4, rows scanned = 20.

OFFSET LIMIT 20 OFFSET 1000000

  1. Root → start leaf (3~4 pages).
  2. Leaf 1: 100 records → count and throw all away.
  3. Leaf 1.next → Leaf 2 (read another 16KB).
  4. … (repeat ~10,000 times)
  5. Reach Leaf 10000 → 1M counted.
  6. Decode the next 20 → return.

→ Total page accesses: ~10,003, rows scanned = 1,000,020.

Mapping to measurements

  - cursor 0.30ms ≈ ~4 page accesses (warm buffer-pool hits) + decoding 20 records
  - OFFSET 1M 171ms ≈ ~10,003 page accesses + decoding 1M records (where most of the time goes)

Buffer pool / disk layer

InnoDB never reads .ibd pages from disk directly. Every read goes through the buffer pool (in memory): a requested page that is already cached is served from memory; a miss reads the 16KB page from disk into the pool first, then serves it from there.

→ With the buffer pool warm in the measurement environment, all 4 pages of the cursor query are buffer-pool hits → that’s how 0.30ms materializes.


9. Multi-index — N B-trees on the same table

9.1 5 indexes = 6 B-trees (clustered + 5)

The 5 indexes built in this series:

CREATE INDEX idx_created_at_id        ON orders_w2 (created_at, id);
CREATE INDEX idx_region_code          ON orders_w2 (region_code);
CREATE INDEX idx_owner_state_created  ON orders_w2 (owner_id, state, created_at);
CREATE INDEX idx_state_created        ON orders_w2 (state, created_at);
CREATE INDEX idx_owner_id             ON orders_w2 (owner_id);

On the single table orders_w2, InnoDB now holds 6 B-trees simultaneously: 1 clustered (= the table) + 5 secondary.

Diagram 9 — the same row sits in all 6 B-trees:

┌──────────────────────────────────────────────────────────────────┐
│ Row (1)                                                           │
│   id=5,000,001 / owner=1234 / state='CONFIRMED' /                │
│   region='KR' / created_at='2024-...'                            │
└────────────────────────────┬─────────────────────────────────────┘
                             │ same row exists in 6 B-trees at once
                             │ (one leaf entry per tree)

┌──────────────────────────────────────────────────────────────────┐
│  B-tree 1   Clustered (PK)            leaf: PK + full row        │ ← the table itself
├──────────────────────────────────────────────────────────────────┤
│  B-tree 2   idx_created_at_id         leaf: (created_at, PK)     │
├──────────────────────────────────────────────────────────────────┤
│  B-tree 3   idx_region_code           leaf: (region_code, PK)    │   (low cardinality)
├──────────────────────────────────────────────────────────────────┤
│  B-tree 4   idx_owner_state_created   leaf: (owner_id, state,    │   composite
│             (composite)                     created_at, PK)      │
├──────────────────────────────────────────────────────────────────┤
│  B-tree 5   idx_state_created         leaf: (state, created_at,  │
│                                              PK)                 │
├──────────────────────────────────────────────────────────────────┤
│  B-tree 6   idx_owner_id              leaf: (owner_id, PK)       │
└──────────────────────────────────────────────────────────────────┘

  → 1 row = 6 leaf entries  (one per B-tree)
  → 1 INSERT = updates to all 6 B-trees   (write amplification)
  → 1 DELETE = the same — all 6 updated
  → UPDATE: only the trees whose key columns changed (PK unchanged → secondaries unaffected)

→ Reading diagram 9. One row = 6 leaf entries (one in each B-tree). One INSERT = all 6 B-trees updated. One DELETE the same.
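The UPDATE rule in diagram 9 (only the trees whose key columns changed need maintenance) can be sketched as a tiny model. The index definitions mirror the five CREATE INDEX statements above; the function is an illustration of the rule, not InnoDB's actual maintenance code, and `amount` below is a hypothetical non-key column.

```python
# Toy model: which of the 6 B-trees need maintenance for a given UPDATE.
INDEXES = {
    "PRIMARY":                 ("id",),                            # clustered
    "idx_created_at_id":       ("created_at", "id"),
    "idx_region_code":         ("region_code",),
    "idx_owner_state_created": ("owner_id", "state", "created_at"),
    "idx_state_created":       ("state", "created_at"),
    "idx_owner_id":            ("owner_id",),
}

def btrees_touched(changed_columns):
    """Return the B-trees whose key columns overlap the changed columns.

    The clustered index is always touched: the full row lives in its leaf."""
    changed = set(changed_columns)
    touched = {"PRIMARY"}  # row body is always rewritten in the clustered leaf
    for name, key_cols in INDEXES.items():
        if changed & set(key_cols):
            touched.add(name)
    return touched

# An INSERT/DELETE touches all 6 trees; a narrow UPDATE touches far fewer:
print(sorted(btrees_touched({"state"})))   # state is a key in only 2 secondaries
print(sorted(btrees_touched({"amount"})))  # non-key column → clustered leaf only
```

An UPDATE of `state` maintains only `idx_owner_state_created` and `idx_state_created` (plus the clustered leaf), which is exactly why the PK-unchanged case in the diagram leaves most secondaries alone.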

9.2 Cardinality of each index — five-index measurement [measured]

| index | cardinality | meaning |
| --- | --- | --- |
| PRIMARY | 9,708,696 | id is nearly unique |
| idx_created_at_id | 9,708,696 | covering, nearly unique |
| idx_region_code | 4 | regions evenly distributed → ~2.4M rows per region (very low selectivity) |
| idx_owner_state_created (owner) | 21,711 | 10K owners |
| idx_owner_state_created (state) | 43,422 | 4 states |
| idx_state_created (state) | 969 | 4 states + time bucketing |
| idx_owner_id | 12,585 | 10K owners |

→ A cardinality-4 index (idx_region_code) is almost useless. A lookup on region_code='KR' returns ~2.4M rows on average, so the optimizer will likely fall back to a full table scan rather than use it. An index does not always speed things up.
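The "almost useless" verdict is plain arithmetic: selectivity = cardinality / row count, and the expected rows per equality lookup is its inverse, assuming a roughly even value distribution. A minimal sketch with the measured numbers from the table above:

```python
ROWS = 9_708_696  # measured row count

def rows_per_lookup(cardinality):
    """Expected rows returned by an equality predicate on this key,
    assuming values are roughly evenly distributed."""
    return ROWS / cardinality

print(round(rows_per_lookup(4)))       # region_code: ~2.4M rows per region
print(round(rows_per_lookup(12_585)))  # owner_id: a few hundred rows per owner
```

At ~2.4M rows per key, the index-then-clustered double walk costs far more random I/O than one sequential clustered scan, which is why the optimizer ignores the index.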

9.3 Write-latency cost — write amplification

Cost of one INSERT:

| index count | B-trees updated per INSERT | relative cost |
| --- | --- | --- |
| 0 (PK only) | 1 (clustered) | 1x (baseline) |
| 1 secondary | 2 | 2x |
| 5 secondary | 6 | 6x |

The bulk-load measurement loaded 10M rows in 53.5s = 187K rows/s with no indexes. Re-running the same load with 5 indexes attached would be expected to run 5–6x slower. That's why the operational pattern is disable indexes for bulk load → enable after, as recommended by MySQL — Bulk Data Loading for InnoDB Tables.
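Under the naive assumption that load time scales linearly with the number of B-trees maintained per row, the expected cost of the 5-index re-run works out as follows (a back-of-the-envelope sketch, not a measurement):

```python
ROWS = 10_000_000
BASELINE_S = 53.5                    # measured: no-index load time
baseline_rate = ROWS / BASELINE_S    # ≈ 187K rows/s

def estimated_load_seconds(secondary_count):
    """Naive linear model: 1 clustered + N secondary B-trees per row.
    Real slowdown is usually somewhat less than linear (change buffering,
    append-only PK inserts), so treat this as an upper-bound sketch."""
    return BASELINE_S * (1 + secondary_count)

print(f"{baseline_rate:,.0f} rows/s baseline")
print(f"~{estimated_load_seconds(5) / 60:.1f} min estimated with 5 secondaries")
```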

9.4 Storage cost

A single index ≈ (key column size + PK size) × row count × 1.5 (B-tree fill factor + page overhead).

Rough math for this series:

A 10GB table picks up 1.3GB of indexes. Indexes also occupy buffer-pool slots → reducing the buffer-pool hit rate on the clustered index itself. This is the topic of the index diet (series #6).
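A hypothetical application of the formula above — the per-column byte sizes here are assumptions for illustration (BIGINT = 8B, DATETIME = 5B, CHAR(2) = 2B, state as a 1-byte enum), not measured values:

```python
ROWS = 9_708_696
OVERHEAD = 1.5   # B-tree fill factor + page overhead, per the formula above
PK = 8           # BIGINT id

# Key-column byte sizes below are assumptions for illustration.
indexes = {
    "idx_created_at_id":       5 + 8,      # DATETIME + id (id is already the PK)
    "idx_region_code":         2,          # CHAR(2)
    "idx_owner_state_created": 8 + 1 + 5,  # BIGINT + 1-byte enum + DATETIME
    "idx_state_created":       1 + 5,
    "idx_owner_id":            8,
}

total = 0
for name, key_bytes in indexes.items():
    # idx_created_at_id already ends in the PK, so don't append it twice
    entry = key_bytes if name == "idx_created_at_id" else key_bytes + PK
    size = entry * ROWS * OVERHEAD
    total += size
    print(f"{name:25s} ~{size / 2**30:.2f} GiB")
print(f"{'total':25s} ~{total / 2**30:.2f} GiB")
```

The total lands in the same ballpark as the ~1.3GB figure quoted above; real sizes also depend on row format, change history, and actual page fill.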


10. Logical vs physical

10.1 Two views of the same table

The logical view a SQL user sees and the physical view InnoDB sees are different.

Diagram 10 — logical ↔ physical mapping:

graph LR
    subgraph "Logical (SQL user)"
        L["Table orders<br/>┌────┬───────┬───────┐<br/>│ id │ owner │ state │<br/>├────┼───────┼───────┤<br/>│ 1  │ 1234  │ CONF  │<br/>│ 2  │ 5678  │ CONF  │<br/>│ ...│  ...  │  ...  │<br/>└────┴───────┴───────┘<br/>= one grid"]
    end
    subgraph "Physical (InnoDB)"
        P1["Clustered B-tree<br/>(table body)"]
        P2["Secondary B-tree #1<br/>idx_created_at_id"]
        P3["Secondary B-tree #2<br/>idx_owner_id"]
        P4["..."]
        P5["Secondary B-tree #5<br/>idx_state_created"]
    end
    L -.->|SQL ↔ B-tree| P1
    L -.->|same row| P2
    L -.->|in N trees| P3
    L -.->|simultaneously| P4
    L -.-> P5

→ Reading diagram 10:

| aspect | logical | physical |
| --- | --- | --- |
| table | 1 grid (rows × cols) | N B-trees (1 clustered + N−1 secondary) |
| row | 1 record | N leaf entries (one per B-tree) |
| column | a column of the grid | a field in the clustered leaf / a key in a secondary index |
| sort | ORDER BY at query time | clustered = PK order / secondary = index-key order |

10.2 Implications

When this mental mapping is missing, EXPLAIN output, optimizer decisions, and index design all turn fuzzy.


11. Re-defining “Full Table Scan”

11.1 In InnoDB, full table scan = clustered-index full scan

Restating the implication from Section 2: EXPLAIN's type=ALL is colloquially called a "full table scan", but what physically happens is a full leaf scan of the clustered index — a PK-ordered walk. That is different from PostgreSQL's heap full scan, which reads the file in storage order with no sort.

11.2 EXPLAIN type-column mapping

MySQL — EXPLAIN Output (type):

| type | meaning | walk in this post |
| --- | --- | --- |
| const / eq_ref | PK / unique, 1 row | Index Seek |
| ref | secondary equality | narrow Index Range Scan |
| range | secondary BETWEEN / < | Index Range Scan |
| index | full index scan (covering: leaf only) | Full Index Scan |
| ALL | clustered full leaf scan | Full Table Scan = Clustered Full Scan |

→ Seeing type=ALL in EXPLAIN means a clustered-index full leaf scan is happening. Not the random-walk-through-heap that PG users imagine — it’s a PK-ordered walk.

11.3 Big Tech case — LINE catches type=ALL with VISUAL EXPLAIN

LINE Engineering — Catching index behaviour with MySQL Workbench VISUAL EXPLAIN describes an operational tool that visualizes type=ALL. A full-table-scan slip that is easy to miss in the text EXPLAIN output becomes obvious in the graph.


12. Big Tech cases + interview answers

12.1 Big Tech sources (URL verified ≥ 6)

| source | post | which § does it support |
| --- | --- | --- |
| Toss SLASH24 | Next core banking — Oracle→MySQL + InnoDB MVCC | Section 2 clustered index, Section 3 secondary lookup |
| LINE Engineering | MySQL Workbench VISUAL EXPLAIN | Section 11 type=ALL detection |
| Kakao Pay | JPA Transactional readOnly + set_option | Section 3 secondary index + read tuning |
| Use The Index, Luke! | Anatomy of an Index | Section 3 InnoDB vs PG indirection |
| Use The Index, Luke! | No Offset | Section 7 OFFSET ceiling |
| Vlad Mihalcea | How does MVCC work | Section 1 InnoDB foundation |
| Vlad Mihalcea | Index Selectivity | Section 9.2 cardinality |
| Percona | InnoDB Buffer Pool / B-tree | Section 1.3 page seek |
| Discord | Storing Billions of Messages | Section 9 multi-index → distributed limits |
| Jeremy Cole | The physical structure of InnoDB index pages | Section 1.2 page byte-level |
| MySQL official | Clustered and Secondary Indexes | Section 2 / Section 3 |
| MySQL official | Descending Indexes | Section 5.2 reverse scan |

12.2 Recap — putting this article in your own words

If someone who just finished this article were to summarize it through five core questions, here’s how the measurements answer them.

Q. “If a table has no index, how does InnoDB store rows?”

InnoDB already stores rows sorted inside a B-tree, even with no index declared. If a PK is defined, the PK becomes the clustered-index key, and the clustered-index leaf carries the full row — making it the table itself. Without a PK, InnoDB auto-generates a 6-byte hidden ROWID as the clustered key. So a zero-index table doesn’t exist inside InnoDB — there is always at least one clustered index. The common “rows just pile up in INSERT order” is wrong — they are physically sorted by PK.

Q. “Why can’t a secondary-index lookup finish in one step?”

InnoDB’s secondary-index leaves carry the PK value, not the physical row location (page+slot). So WHERE owner_id=1234 goes (1) find leaves for owner_id=1234 in the secondary index → get a PK list, then (2) look up the clustered index again with each PK — two walks. PostgreSQL puts the heap TID (physical location) in the leaf and finishes in one walk — clear contrast. The trade-off: InnoDB’s secondary indexes are unaffected when a row moves due to page split (as long as PK stays put); PG must update every secondary index (HOT optimization avoids some). Neither side is universally superior — it’s a structural trade-off.
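The two-walk path can be sketched with dicts standing in for the two B-trees — the secondary maps key → PK list, the clustered maps PK → full row. This is a toy model of the lookup order, not InnoDB's page format, and the sample rows are invented:

```python
# Toy model of InnoDB's two-walk secondary lookup.
clustered = {                       # PK -> full row (the table itself)
    1: {"id": 1, "owner_id": 1234, "state": "CONFIRMED"},
    2: {"id": 2, "owner_id": 5678, "state": "CONFIRMED"},
    3: {"id": 3, "owner_id": 1234, "state": "CANCELLED"},
}
idx_owner_id = {                    # secondary leaf: owner_id -> PK values
    1234: [1, 3],
    5678: [2],
}

def lookup_by_owner(owner_id):
    pks = idx_owner_id.get(owner_id, [])   # walk 1: secondary B-tree → PK list
    return [clustered[pk] for pk in pks]   # walk 2: clustered B-tree, per PK

rows = lookup_by_owner(1234)
print([r["id"] for r in rows])   # → [1, 3]
```

Note that walk 2 runs once per matched PK — with low-selectivity keys that multiplies into the random-I/O cost discussed in Section 9.2.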

Q. “Why is a covering index fast? What’s the precise definition?”

A covering index isn’t a type of index — it’s a state defined by the index × query combination. When every column the SELECT needs lives in the secondary-index leaf, the clustered index is never touched — one lookup. InnoDB’s secondary indexes always carry the PK in the leaf, so an index like (created_at, id) is automatically covering for queries like SELECT id, created_at. EXPLAIN’s Using index is the covering signal. In this article’s measurements, ORDER BY created_at DESC LIMIT 20 went from 1,609ms (filesort, no index) → 0.65ms (covering reverse scan) — 2,476x ([measured — Java/Spring]).
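The "covering = index × query combination" definition reduces to a set check: every selected column must already live in the secondary leaf, i.e. the key columns plus the implicit PK. A minimal sketch of that check:

```python
PK = {"id"}  # InnoDB secondary leaves always carry the PK

def is_covering(index_cols, select_cols):
    """True if the query never needs the clustered walk: every selected
    column lives in the secondary leaf (key columns + implicit PK)."""
    return set(select_cols) <= set(index_cols) | PK

# (created_at, id) covers SELECT id, created_at ...
print(is_covering(("created_at", "id"), ("id", "created_at")))   # True
# ... but not SELECT id, state — state forces the clustered walk
print(is_covering(("created_at", "id"), ("id", "state")))        # False
```

When the check passes, EXPLAIN shows Using index; when it fails, adding one uncovered column to the SELECT list silently reintroduces the second walk.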

Q. “Why does OFFSET collapse on deep pages — the real reason?”

B-tree internal nodes only store key ranges; they don’t carry a row counter. So “jump to the N-th row” is impossible — the engine must read N rows sequentially from the first leaf and discard them. Why no counter? Because every INSERT/DELETE would have to bump counters on every node along the root path, turning the root into a lock hot spot that collapses concurrent throughput. Cost > benefit. The general B-tree index (MySQL/PG/Oracle/SQL Server) makes the same call. In this article’s measurements, OFFSET 1M = 171ms with rows scanned = 1,000,020 — literally read 1M and threw them away ([measured — Java/Spring]).
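The read-and-discard behaviour can be counted in a toy model — the leaf level as a sorted list, OFFSET as a linear walk, and the cursor's WHERE id > ? as a binary search into the ordered keys (pagination by id is assumed here):

```python
import bisect

leaves = list(range(1, 2_000_001))   # toy B-tree leaf level: sorted PK entries

def offset_page(offset, limit):
    """OFFSET walks and discards `offset` entries before returning `limit`."""
    scanned = offset + limit
    return leaves[offset:offset + limit], scanned

def keyset_page(last_id, limit):
    """WHERE id > last_id seeks straight to the leaf (bisect stands in for
    the B-tree descent), then reads only `limit` entries."""
    start = bisect.bisect_right(leaves, last_id)
    return leaves[start:start + limit], limit

_, scanned_offset = offset_page(1_000_000, 20)
_, scanned_keyset = keyset_page(1_000_000, 20)
print(scanned_offset, scanned_keyset)   # 1000020 vs 20
```

The 1,000,020 matches the measured rows scanned figure exactly: the counts differ by the structure of the walk, not by tuning.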

Q. “5 indexes vs 0 — what’s the exact write cost difference?”

5 indexes on a single table = clustered 1 + secondary 5 = 6 B-trees simultaneously. One INSERT = updates to leaves on all 6. One DELETE same. UPDATE only touches the trees whose key columns changed (with PK unchanged, secondaries are unaffected). In this article’s environment, no-index load was 187K rows/s (53.5s for 10M rows); the same load with 5 indexes attached is theoretically 5–6x slower. That’s why disable indexes during bulk load, enable after is the operational standard. Storage adds 1.3GB → buffer-pool pressure also degrades clustered hit-rate. Hence the operational discipline of an index diet (sys.schema_unused_indexes / invisible indexes) ([measured — Java/Spring]).


13. What we learned

13.1 Assumptions broken by measurement

13.2 The single-line takeaway

Every InnoDB table is already a B-tree. PK = clustered index = the table itself. Secondary indexes are separate B-trees that point to PK → two-walk lookup. Covering indexes carry the answer in the leaf → one lookup. Reverse scan walks the leaf doubly-linked list backward. OFFSET collapses because B-trees don’t carry a row counter and force sequential walk. Cursor is fast because WHERE triggers the B-tree’s binary-search primitive. Multi-index = N B-trees on one table. At 10M rows, [measured] Q3 covering 2,476x / Q5 composite 577x / OFFSET 1M = 171ms / cursor = 0.30ms — every number falls out of one mechanism.


14. Up next — a series

This is post #1 of the RDB Mastery series — the internal index structure angle. Next:

Companion posts:


References

Official docs

Big Tech / operations

Textbook level

Known limitation

Raw data from this measurement is kept inside the portfolio repo (10M-row environment / cardinalities of 5 indexes / Q1~Q5 Before/After / OFFSET vs cursor 4-point measurements).

