RDB Mastery #2 — MySQL Index Types: B-tree / Hash / Covering / Composite / Multi-valued / Functional, and When to Pick What

Open Table of contents

Intro — A table with just a PK already has one index. So when, and what kind, do you add next?
1. Six Index Types in MySQL
2. Variants of the B-tree Index — Clustered / Secondary / Covering / Composite
3. Composite Index and the Leftmost-Prefix Rule
4. Cardinality (Selectivity) and Index Effectiveness
5. Covering Index — An Index Where the Answer Lives Without Touching the PK
6. Multi-valued Index (8.0+) — For JSON Arrays
7. Functional Index (8.0.13+) — Expression Indexes
8. Hash / Spatial / Full-text — InnoDB’s Specialised Indexes
9. Index Cost — Indexes Are Not Free
10. The Q2 Paradox — Adding an Index Can Make a Query Slower
11. Operational Guidelines — When to Pick What
12. Big Tech References + Recap Questions
- 12.1 Big Tech references (URLs verified ≥ 6)
- 12.2 Recap — putting this article in your own words
13. What I Learned
- 13.1 Assumptions broken by measurement
- 13.2 The single line
14. Next Post — The Series Continues
References

Intro — A table with just a PK already has one index. So when, and what kind, do you add next?

Post #1 — RDB Mastery #1 — InnoDB Index Internals laid down a single hard fact. Every InnoDB table is already stored sorted inside a B-tree by PK. The PK is the clustered index = the table itself. Without a PK, InnoDB auto-generates a hidden ROWID, so there is always at least one B-tree.

Then comes the next question.

“On top of that, when you add another index — what kind, and when?”

Two common answers.

“Just add an index on the columns you query often.”
“B-tree is the default, so make a B-tree.”

Both are insufficient in production. MySQL has six index types. And inside the B-tree family alone, five decision axes — clustered / secondary / covering / composite / cardinality — fire at the same time.

“Can I create a Hash index on InnoDB?” — No. Memory engine only. InnoDB simulates parts of it via the adaptive hash index.
“Spatial index lives on GEOMETRY columns — an R-tree.”
“Full-text index is for natural-language search — an inverted-index structure.”
“Multi-valued index (8.0+) makes each element of a JSON array a leaf entry.”
“Functional index (8.0.13+) makes the result of an expression like LOWER(email) the key.”
“Composite index (a, b, c) works for WHERE a=? or WHERE a=? AND b=?. WHERE b=? alone does not (leftmost prefix).”
“Cardinality of 4 (4 distinct regions) on a standalone index = nearly useless.”
“Covering when possible — lookup count drops from 1 to 0.”

This post walks every one of those decision axes by building real indexes on a 10M-row table and measuring — concluding when to pick what.

Inputs: cardinality across 5 indexes + Q1~Q5 Before/After + Q2 paradox + index storage 1.3GB + write amplification (structurally N+1 B-tree updates — actual latency is not strictly linear due to buffer pool/redo/change buffer).
Companion post MySQL No-Offset Cursor Pagination — its cursor is one application of the Section 3 composite + leftmost-prefix rule from this post.
Depth: L2-L3 (RDB Mastery series, post #2 — trade-offs by index type + measurements + Big Tech operations + recap questions).

1. Six Index Types in MySQL

1.1 The six types

MySQL — CREATE INDEX defines six index types.

Type	Internal structure	Use case	Storage engine
B-tree	B+-tree	Equality / range / sort (default)	InnoDB / MyISAM
Hash	Hash table	Equality only	Memory engine only
Spatial	R-tree	GEOMETRY coordinates	InnoDB / MyISAM
Full-text	Inverted index	Natural-language search (`MATCH ... AGAINST`)	InnoDB / MyISAM
Multi-valued (8.0+)	B-tree variant	Each element of a JSON array	InnoDB
Functional (8.0.13+)	B-tree (expression result is key)	Computed columns like `LOWER(col)`	InnoDB

1.2 Diagram 1 — internal structure of all six types

┌─────────────────────────────────────────────────────────────┐
│ 1. B-tree (B+-tree)                                          │
│    Root → Internal → Leaf (linked list)                      │
│    Range + sort + equality, all of the above                 │
│                                                              │
│ 2. Hash (Memory engine)                                      │
│    bucket[h(key) % N] → key, value                          │
│    Equality O(1) / range X / sort X                          │
│                                                              │
│ 3. Spatial (R-tree)                                          │
│    A tree of bounding boxes                                  │
│    "Find N items near this coordinate"                       │
│                                                              │
│ 4. Full-text (Inverted Index)                                │
│    word → [doc1, doc3, doc7, ...]                           │
│    "Find documents that contain a query term"                │
│                                                              │
│ 5. Multi-valued (8.0+, B-tree variant)                       │
│    JSON [tag1, tag2, tag3] → 3 leaf entries                 │
│    "JSON_CONTAINS(tags, ?)"                                  │
│                                                              │
│ 6. Functional (8.0.13+, B-tree)                              │
│    LOWER('Foo') → 'foo' becomes the leaf key                │
│    "WHERE LOWER(col) = ?" pushes down                        │
└─────────────────────────────────────────────────────────────┘

→ Reading Diagram 1. Of the six, InnoDB can build five (B-tree / Spatial / Full-text / Multi-valued / Functional). Hash is Memory-engine-only. Spatial / Full-text apply only to specialised columns (GEOMETRY / TEXT) — so in everyday operations the indexes you meet are essentially B-tree and its variants (Multi-valued / Functional).

1.3 Focus of this post — InnoDB B-tree and its variants

This post focuses on InnoDB B-tree and its variants (Multi-valued / Functional) because they dominate operational decision-making. Hash / Spatial / Full-text get one paragraph each in Section 8. Within the B-tree family, four shapes — clustered / secondary / covering / composite — and one effect — cardinality — form the decision axes.

2. Variants of the B-tree Index — Clustered / Secondary / Covering / Composite

2.1 Quick recap from Post #1

A one-paragraph recap of the structure covered in Post #1.

Every InnoDB table is stored as a clustered index. The PK is the clustered key. The clustered leaf carries the entire row, so the clustered index is the table. A secondary index is a separate B-tree whose leaf carries (index key, PK) — so finding a row through a secondary index takes two lookups (one secondary + one clustered). A covering index holds every column the SELECT needs in the secondary leaf, so the clustered lookup is skipped → one lookup. A composite index (a, b, c) simply bundles multiple columns into one index.

How those four shapes differ inside the same B-tree family — Diagram 2 spells it out.

2.2 Diagram 2 — leaf shapes of clustered / secondary / composite

graph TB
    subgraph "Clustered Index (= the table itself)"
        C1["Leaf<br/>PK=5,000,001<br/>+ owner_id, state, region, created_at, amount, ... full row"]
    end
    subgraph "Secondary Index (single column)"
        S1["Leaf<br/>(owner_id=1234, PK=5,000,001)<br/>(owner_id=1234, PK=5,000,123)<br/>(owner_id=1234, PK=8,234,567)"]
    end
    subgraph "Composite Index (multiple columns)"
        M1["Leaf<br/>(owner=1234, state='C', created='2024-...', PK=5M+1)<br/>(owner=1234, state='C', created='2024-...', PK=5M+123)<br/>..."]
    end
    subgraph "Covering Index (covering = all SELECT columns live in the leaf)"
        V1["Leaf<br/>(created_at='2024-...', PK=5,000,001)<br/>For SELECT id, created_at, no clustered lookup"]
    end

→ Reading Diagram 2. Clustered leaf = full row. Secondary leaf = (key, PK). Composite leaf = (key1, key2, key3, PK) — multiple keys bundled into one leaf. Covering is when the SELECT columns all live inside the leaf, skipping the clustered lookup.

The key observation: secondary / composite / covering are all shapes of the same secondary-index B-tree, not separate index kinds. Whether you bundle one column or three at CREATE INDEX time decides the shape.

2.3 The four operational decision axes

Axis	Question	Impact
Clustered vs Secondary	How is the PK defined?	Cost of every secondary lookup (PK size + distribution)
Covering or not	Are all SELECT columns inside the index leaf?	One lookup vs two (tens to hundreds of x latency difference)
Composite column order	(a, b, c) vs (b, a, c) — which goes first?	Leftmost-prefix rule (next, Section 3)
Cardinality / Selectivity	Distinct values / total rows	Whether the index is effective at all (Section 4)

These four axes form the spine of Section 3~Section 5 below.

3. Composite Index and the Leftmost-Prefix Rule

3.1 The rule, stated

A composite index (a, b, c) makes the following WHERE shapes index-usable.

WHERE	Index used	Why
`WHERE a = ?`	✅	leftmost = a
`WHERE a = ? AND b = ?`	✅	a + b leftmost prefix
`WHERE a = ? AND b = ? AND c = ?`	✅	full prefix
`WHERE a = ? AND c = ?`	△ (only up to a)	skipping b means c cannot narrow further on the index
`WHERE b = ?`	❌	a missing, leftmost broken
`WHERE c = ?`	❌	a, b missing
`WHERE b = ? AND c = ?`	❌	a missing

This is the leftmost-prefix rule: matching has to start from the left and proceed in order.

3.2 Why “leftmost” — the B-tree’s sort order

The leaf of (a, b, c) is sorted as a 4-tuple (a, b, c, PK), with priority a → b → c.

Diagram 3 — the sort order of a composite leaf:

Leaf of idx_owner_state_created (owner_id, state, created_at):

(owner=1234, state='C', created='2024-01-01', PK=5M+1)
(owner=1234, state='C', created='2024-01-15', PK=5M+5)
(owner=1234, state='C', created='2024-03-22', PK=5M+99)
(owner=1234, state='P', created='2024-01-02', PK=5M+8)
(owner=1234, state='P', created='2024-02-10', PK=5M+45)
(owner=1235, state='C', created='2024-01-01', PK=5M+200)
(owner=1235, state='C', created='2024-04-19', PK=5M+311)
(owner=1235, state='X', created='2024-02-08', PK=5M+800)
...

  ↑ owner_id is the primary sort key, state secondary, created_at tertiary

→ Reading Diagram 3. WHERE owner_id = 1234 — binary search jumps to the leaf region where owner=1234, then walks. 100% index utilisation. WHERE owner_id = 1234 AND state = 'C' — additional binary search inside owner=1234 finds the state=‘C’ slice. 100% index utilisation. But WHERE state = 'C' alone — the leaf is primarily sorted by owner_id, so state=‘C’ rows are scattered. Binary search cannot help → full leaf walk required → the index cannot be used.

The phone-book analogy: in a directory sorted by surname-then-given-name, “surname Kim” is fast (front of the sort matches), but “given name John” forces you to scan the entire book because given names are scattered. Same principle.

3.3 [Measured — Java/Spring] Q5 — composite lookup + reverse scan

Q5 from this series, after creating the (owner_id, state, created_at, id) composite:

SELECT id, owner_id, state, created_at, amount
FROM orders_w2
WHERE owner_id = 1234 AND state = 'CONFIRMED'
ORDER BY created_at DESC
LIMIT 20;

Stage	Actual time	Rows processed	Notes
Before (no index, full scan + filesort + filter)	1,497 ms	9,708,696	Full 9.7M scan + sort
After (composite + reverse + lookup)	2.59 ms	699	577x faster

Key EXPLAIN line:

type: ref
key: idx_owner_state_created
key_len: 1043 (owner 8B + state varchar + created 5B + PK 8B)
rows: 699
Extra: Backward index scan

→ With the prefix owner_id=1234 AND state='CONFIRMED', the row set narrows to 699, then created_at reverse scan + LIMIT 20 reads only 20. The index narrowing 9.7M to 699, then the reverse-scan + LIMIT 20 reading just 20 — that is the substance of 1,497ms → 2.59ms.

The companion post MySQL No-Offset Cursor Pagination uses the same mechanism — WHERE created_at < ? (leftmost) + ORDER BY created_at DESC LIMIT 20 on top of (created_at, id). A direct application of this leftmost-prefix rule.

3.4 Picking the column order

The standard rule for composite column order:

Equality (=) columns first, range (<, >) columns later: equality narrows the leaf to a point, range narrows to an interval, so equality-then-range works best.
Higher cardinality first (within the equality group): if equal in operator, the more selective column goes first.
ORDER BY column at the very end (so the B-tree’s natural sort order eliminates filesort).

This series’ (owner_id, state, created_at) follows exactly this rule. owner_id (10K owners) + state (4 values, equality) → created_at (for sort). Equality → sort.

4. Cardinality (Selectivity) and Index Effectiveness

4.1 Definitions

Cardinality = number of distinct values in the column
Selectivity = unique values / total rows = cardinality / row count
Selectivity 1.0 = every row unique (PK)
Selectivity near 0 = many duplicates → indexing is barely useful

Vlad Mihalcea — Index Selectivity puts it plainly:

“The higher the selectivity of a column, the more efficient an index on it can be. A column with low selectivity (e.g., a boolean) often does not benefit from a standalone index — the optimizer would prefer a full table scan.”

4.2 [Measured — Java/Spring] Cardinality across the five indexes from this series

Index / column	Cardinality	Selectivity	Effect
PRIMARY (id)	9,708,696	≈ 1.0	Nearly unique — point lookup ideal
idx_created_at_id (created_at)	9,708,696	≈ 1.0	Covering, range scan ideal
idx_owner_id (owner_id)	12,585	0.0013	10,000 owners → ~970 rows per owner
idx_state_created (state)	969	0.0001	4 states + time
idx_region_code (region_code)	4	4 × 10⁻⁷	5 regions evenly distributed → standalone index nearly useless

(state has 4 values: CONFIRMED / PENDING / CANCELLED / REFUNDED + 1 edge case)

4.3 Diagram 4 — cardinality vs latency effect

Cardinality vs index effect (based on Q3 ORDER BY DESC LIMIT 20):

cardinality
   9.7M  │ ████████████████████████████ Q3 → 0.65ms (covering, 2,476x faster)
   9.7M  │ ████████████████████████████ Q5 → 2.59ms (composite, 577x faster)
   969   │ ███░░░░░░░░░░░░░░░░░░░░░░░░ idx_state_created (low impact when used alone)
   4     │ █░░░░░░░░░░░░░░░░░░░░░░░░░░ idx_region_code (alone = does not avoid full scan)
   1.0×  │
        └─────────────────────────────────────────
              index effect →

   Reading:
   - cardinality 9.7M (nearly unique) → maximum index effect
   - cardinality 4 → optimizer may **fall back to full table scan**

→ Reading Diagram 4. With cardinality 9.7M, one key fetches one row — maximum index effect. With cardinality 4, one key returns ~2.4M rows on average — full table scan may actually be faster, and the optimizer’s cost-based decision often falls back to full scan.

4.4 Finding — no standalone index on low-cardinality columns

A one-line rule for production.

Do not build a standalone index on a low-cardinality column (selectivity < 0.01). It only earns its keep as a non-leading column inside a composite.

This series’ idx_region_code (cardinality 4) is nearly useless on its own. The optimizer almost never picks it. But as a non-leading column in something like (region_code, owner_id, created_at), it narrows by 1/4 first and then owner_id refines further.

This is the principle behind “do not build standalone indexes on boolean / state / region enum columns”.

5. Covering Index — An Index Where the Answer Lives Without Touching the PK

5.1 Recap

A one-paragraph recap of the covering index from Post #1, Section 4.

If every column the SELECT requires lives in the secondary index leaf, the clustered index can be skipped — one lookup is enough. Because InnoDB’s secondary index always carries the PK in the leaf, an index like (created_at, id) automatically covers SELECT id, created_at. Using index in EXPLAIN is the covering signal.

This post tackles the decision angle — when can you make it covering?

5.2 Shapes that can be covered

SELECT shape	Index that can cover
`SELECT id WHERE owner_id = ?`	`(owner_id)` — secondary already includes PK in leaf
`SELECT id, created_at ORDER BY created_at DESC`	`(created_at, id)` — covering reverse scan
`SELECT owner_id, state ORDER BY ...`	`(..., owner_id, state, ...)` — owner_id and state both inside the leaf
`SELECT *`	Hardly ever coverable — putting every column in the index makes the index almost as big as the table

Key observation. SELECT * is hard to cover. Putting every column into an index essentially turns the secondary index into a copy of the clustered index — index storage explodes to roughly the table size.

The operational pattern: narrow the SELECT list explicitly and design covering indexes around that narrow column set. The reason you replace ORM SELECT * defaults with DTO projections.

5.3 [Measured — Java/Spring] Q3 — the cleanest covering case

Q3 (SELECT id, created_at FROM orders_w2 ORDER BY created_at DESC LIMIT 20):

Stage	Actual time	Rows processed
Before (no index → full table scan + filesort)	1,609 ms	9,708,696
After (idx_created_at_id, covering reverse scan)	0.65 ms	20

→ 2,476x faster. The clearest covering effect.

Key EXPLAIN line:

type: index
key: idx_created_at_id
Extra: Using index; Backward index scan

Using index = covering signal. Clustered index never touched, zero PK lookups. Sorting 9.7M rows collapses to a 20-row reverse walk.

5.4 Diagram 5 — the covering effect

sequenceDiagram
    participant Q as Query
    participant S as Secondary Index Leaf
    participant C as Clustered Index Leaf

    Note over Q,C: Plain secondary (non-covering)
    Q->>S: SELECT amount, name<br/>WHERE owner_id=1234
    S->>S: owner_id=1234 → 100 PKs
    S-->>Q: PK list
    loop 100 PK lookups
        Q->>C: lookup per PK
        C-->>Q: amount, name
    end
    Note over Q,C: Total lookups = 1 + 100 = 101

    Note over Q,C: Covering
    Q->>S: SELECT id, created_at<br/>ORDER BY created_at DESC LIMIT 20
    S->>S: idx_created_at_id leaf reverse-walk 20
    S-->>Q: 20 rows (clustered untouched)
    Note over Q,C: Total lookups = 1

→ Reading Diagram 5. Non-covering: take the PK list back to the clustered index → total lookups proportional to the row count. Covering: secondary leaf alone yields the answer → total lookups = 1 (one binary-search start point + leaf walk).

6. Multi-valued Index (8.0+) — For JSON Arrays

6.1 Definition

Multi-valued Index, introduced in MySQL 8.0:

“A multi-valued index is a secondary index defined on a column that stores an array of values. Normal indexes have one index record for each data record (1:1). A multi-valued index can have multiple index records for a single data record (N:1).”

Each element of the JSON array goes into the index leaf. Where a normal B-tree is row : leaf = 1:1, multi-valued is one row → multiple leaves.

6.2 Example

CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    tags JSON,   -- ["urgent", "vip", "international"]
    INDEX idx_tags ((CAST(tags AS UNSIGNED ARRAY)))
);

-- Supported WHERE shapes (exactly three)
SELECT * FROM orders WHERE JSON_CONTAINS(tags, '"urgent"');
SELECT * FROM orders WHERE 'urgent' MEMBER OF (tags);
SELECT * FROM orders WHERE JSON_OVERLAPS(tags, '["urgent", "vip"]');

6.3 Diagram 6 — Plain B-tree vs Multi-valued

┌─ Plain B-tree (1:1) ──────────────────┐
│                                        │
│ Row id=1, tags="urgent"                │
│   → 1 leaf entry                       │
│                                        │
│ Leaf:                                  │
│   ('urgent', PK=1)                     │
│   ('vip', PK=2)                        │
│                                        │
└────────────────────────────────────────┘

┌─ Multi-valued (N:1) ──────────────────┐
│                                        │
│ Row id=1, tags=["urgent","vip","intl"] │
│   → 3 leaf entries                     │
│                                        │
│ Leaf:                                  │
│   ('urgent', PK=1)                     │
│   ('vip', PK=1)        ← same PK!      │
│   ('intl', PK=1)       ← same PK!      │
│   ('urgent', PK=2)                     │
│                                        │
└────────────────────────────────────────┘

→ Reading Diagram 6. Multi-valued: one row → many leaves. Element-level lookups like JSON_CONTAINS / MEMBER OF / JSON_OVERLAPS run through the index.

6.4 Limits and use cases

Item	Constraint
WHERE shape	`JSON_CONTAINS` / `MEMBER OF` / `JSON_OVERLAPS` only
Column type	JSON only (a JSON-shaped TEXT does not qualify)
ORDER BY	Multi-valued indexes cannot drive sort order
Write cost	Each row updates N leaves (proportional to array size)
Storage	Rows with large arrays inflate the index

The operational use case is tags / labels / categories — an N:M relationship modeled as JSON instead of a join table. Suitable when the average array size is small (a few to a few dozen elements) and INSERT/UPDATE frequency is low. Beyond ~100 elements per array or with frequent updates, a separate join table tends to outperform.

7. Functional Index (8.0.13+) — Expression Indexes

7.1 Definition

Functional Key Parts, introduced in MySQL 8.0.13:

CREATE INDEX idx_lower_email ON users ((LOWER(email)));
CREATE INDEX idx_yyyymm ON orders ((DATE_FORMAT(created_at, '%Y%m')));

The expression result becomes the index key. WHERE LOWER(email) = ? pushes down as an index range scan. Pre-8.0.13 the standard workaround was a generated column plus an index on it — now it is one line.

7.2 Use cases

Use case	Expression	WHERE
Case-insensitive search	`LOWER(email)`	`WHERE LOWER(email) = 'foo@bar.com'`
Date-part match	`DATE_FORMAT(created_at, '%Y%m')`	`WHERE DATE_FORMAT(created_at, '%Y%m') = '202401'`
Casts / extraction	`CAST(JSON_EXTRACT(meta, '$.user_id') AS UNSIGNED)`	`WHERE CAST(...) = 1234`
Length-based match	`CHAR_LENGTH(name)`	`WHERE CHAR_LENGTH(name) > 10`

7.3 Diagram 7 — leaves of a functional index

Plain index (idx_email):
  Leaf:
    ('Alice@Foo.com', PK=1)
    ('alice@foo.com', PK=2)        ← treated as a different key
    ('Bob@Bar.com', PK=3)

Functional index (idx_lower_email = LOWER(email)):
  Leaf:
    ('alice@foo.com', PK=1)        ← LOWER applied, that becomes the key
    ('alice@foo.com', PK=2)        ← sorted under the same key
    ('bob@bar.com', PK=3)

WHERE LOWER(email) = 'alice@foo.com'
  → binary search jumps to 'alice@foo.com' in idx_lower_email
  → range scan returns PK=1 and PK=2 simultaneously

→ Reading Diagram 7. The post-expression value (after LOWER) is the leaf sort key. WHERE LOWER(email) = ? matches the binary-search primitive directly. Wrapping LOWER over a plain index would not work — the index keys are like ‘Alice@Foo.com’, so the comparison-after-function destroys the index (full scan).

7.4 Constraints

Item	Constraint
WHERE expression	Must match the CREATE INDEX expression exactly to push down
Determinism	Expression must be deterministic (no NOW() / RAND())
Storage	Long expression results (e.g., 64-char SHA256 hex) inflate leaf size
Extra runtime cost	Expression evaluated on every INSERT/UPDATE

Vlad Mihalcea — MySQL 8 Functional Indexes covers 8.0.13+ usage and pitfalls in detail.

Our measurements do not include functional indexes, so this section relies on official docs + Vlad Mihalcea. In production, EXPLAIN ANALYZE confirmation that the push-down really happens is mandatory — even a single-character mismatch silences the index.

8. Hash / Spatial / Full-text — InnoDB’s Specialised Indexes

This post centres on InnoDB B-tree, so the three specialised indexes get one paragraph each.

8.1 Hash index — Memory-engine-only + InnoDB’s adaptive hash

MySQL — Hash Index:

“InnoDB does not support direct creation of hash indexes. However, InnoDB monitors index searches, and if it observes that queries could benefit from building a hash index, it builds one automatically for index pages frequently accessed. This feature is called the adaptive hash index.”

→ InnoDB users do not create hash indexes directly. Instead, InnoDB caches frequently-accessed B-tree nodes as a hash index internally — the adaptive hash. Toggleable via innodb_adaptive_hash_index, default ON, and that default is the operational standard.

Memory-engine hash indexes are for fast ephemeral lookups: equality (=) O(1), no range, no sort. Suitable only for niche use cases like session-scoped caches.

8.2 Spatial index — R-tree

For spatial data on GEOMETRY / POINT / LINESTRING. R-tree is a tree of bounding boxes — “find N items near this coordinate” lookups become index-driven.

CREATE TABLE places (
    id BIGINT PRIMARY KEY,
    location POINT NOT NULL SRID 4326,
    SPATIAL INDEX idx_location (location)
);

SELECT * FROM places
WHERE ST_Distance_Sphere(location, ST_GeomFromText('POINT(127.0 37.5)', 4326)) < 1000;

Operational use cases are location-based services (restaurant finder / store search / nearby owners). Rare in plain commerce backends, but standard wherever map-coordinate search matters.

8.3 Full-text index — inverted index

For natural-language search (MATCH ... AGAINST). The structure is an inverted index: word → list of documents that contain it.

CREATE TABLE articles (
    id BIGINT PRIMARY KEY,
    title VARCHAR(200),
    body TEXT,
    FULLTEXT INDEX idx_title_body (title, body)
);

SELECT * FROM articles
WHERE MATCH(title, body) AGAINST('search query' IN NATURAL LANGUAGE MODE);

Operational limits:

Weak Korean morphological analysis (use the ngram parser: WITH PARSER ngram)
Index maintenance cost grows quickly past ~1M rows
For real-time search as a core capability, dedicated engines like Elasticsearch / Meilisearch are the standard

InnoDB full-text tops out at small forums / simple lookups. Real search products belong to a separate engine.

9. Index Cost — Indexes Are Not Free

9.1 [Measured — Java/Spring] The price of the five-index design

The bill that landed alongside the read-side speedups after adding the five indexes:

Cost item	Measured	Notes
Storage added	1.3GB	All five indexes combined (+13% on a 10GB table)
Write latency	structurally 6 B-tree updates per INSERT (actual latency varies)	Each INSERT updates 6 B-trees (1 clustered + 5 secondary)
Buffer pool occupancy	1.3GB / pool size	Index leaves take a share of the pool → impacts clustered hit ratio

Breakdown:

idx_created_at_id (8B + 8B) × 10M × 1.5 ≈ 240MB
idx_owner_state_created (8 + 1 + 8 + 8) × 10M × 1.5 ≈ 375MB
idx_state_created (1 + 8 + 8) × 10M × 1.5 ≈ 255MB
idx_owner_id (8 + 8) × 10M × 1.5 ≈ 240MB
idx_region_code (1 + 8) × 10M × 1.5 ≈ 135MB
Total ≈ 1.3GB

9.2 Diagram 8 — index count vs cost

Cost vs index count (this series' measurement reference):

#indexes    INSERT cost (× baseline)    Storage (GB)
   0          █  1×                        0
   1          ██  2×                        ~0.25
   3          ████  4×                       ~0.75
   5          ██████  6×                       ~1.3
   10         ███████████  11×                 ~2.6 (theoretical)
              ↑                                 ↑
       Every INSERT touches             10GB table with 26% indexes →
       all B-trees                      buffer pool pressure + clustered hit ratio

  W2 [Measured]: load at 187K rows/s with zero indexes (Phase 2)
   → reloading with five indexes is theoretically structurally 6 B-tree updates per INSERT (actual latency varies) (6 B-tree updates per INSERT).
   That is why the bulk-load pattern is **disable indexes during load → enable after load**.

→ Reading Diagram 8. INSERT cost scales with #indexes + 1 (1 clustered + N secondary). Five secondaries = 6x INSERT cost. Storage scales with row count × key size × index count.

9.3 Finding — the index that speeds up reads slows down writes

The core trade-off.

Indexes accelerate reads at the cost of slower writes. Read/write ratio + write frequency + storage budget all go into the decision.

A one-line rule:

OLAP table with 99%+ reads: add indexes liberally (write cost is negligible).
OLTP table with 50%+ writes: keep indexes minimal — only the ones you genuinely need.
Batch load then read: disable indexes during load, enable afterward (the load-then-add-indexes pattern in this series).

10. The Q2 Paradox — Adding an Index Can Make a Query Slower

10.1 [Measured — Java/Spring] Q2’s numbers

Q2 from this series:

SELECT id, owner_id, state, created_at
FROM orders_w2
WHERE state = 'CONFIRMED'
ORDER BY created_at DESC
LIMIT 5;

Stage	Actual time	Notes
Before (no index)	0.658 ms	Full scan + LIMIT 5 early termination
After (idx_state_created added)	13.5 ms	⚠️ Slower after adding the index

→ A case where adding an index made reads slower. Counter-intuitive numbers. When an interviewer asks “have you ever added an index that made the query slower?” — this is the answer.

10.2 Why slower — the trap of cost-based optimisation

EXPLAIN diff:

	Before	After
type	ALL (full scan)	range (idx_state_created)
rows	9.7M (estimate)	336K (estimated state=‘CONFIRMED’ distribution)
Extra	filesort + filter	Backward index scan

Before: full scan estimates 9.7M rows but in reality state = 'CONFIRMED' is ~25% (4 values evenly distributed) → after starting the scan, 5 matches are found early → only ~20 rows actually read → 0.658ms.

After: idx_state_created kicks in → reverse scan over the state=‘CONFIRMED’ leaf region + LIMIT 5. But the leaf is sorted by created_at, so locating the most recent 5 with state=‘CONFIRMED’ requires secondary-index → table-row lookups. Random I/O cost from the secondary lookups outweighs the early-termination savings of the full scan → 13.5ms.

The optimizer is statistics-driven. When statistics deviate slightly from reality, the wrong plan can be selected. Q2 is exactly that case.

10.3 Diagram 9 — Q2’s wrong choice by the optimizer

graph TB
    Q[Q2: WHERE state='CONFIRMED' ORDER BY created_at DESC LIMIT 5]
    Q --> Decision{Optimizer<br/>cost-based decision}
    Decision -->|Before: no index| FS[Full Table Scan + LIMIT 5 early termination<br/>= 0.658 ms]
    Decision -->|After: idx_state_created| IS[Index Range Scan + Reverse + Lookup<br/>= 13.5 ms]
    FS -.->|actually faster| Faster[Before is the<br/>faster path in reality]
    IS -.->|optimizer picked wrong| Slower[Optimizer believed<br/>the index path is faster]

→ Reading Diagram 9. The optimizer is mostly right, but for small LIMIT + evenly distributed predicates, full scan + early termination can win. The optimizer, looking only at statistics, saw “state=‘CONFIRMED’ is 25% of rows, so the index path is efficient” — but in reality the lookup overhead dominated.

10.4 Workarounds — index hints or refreshed statistics

Three workarounds:

Index hint (USE INDEX(PRIMARY) / IGNORE INDEX(idx_state_created)) — Q2 returns to ~0.65ms
Refresh statistics (ANALYZE TABLE orders_w2) — accurate stats may let the optimizer fall back automatically
Drop the index — if Q2 is the only query touching state, idx_state_created is unnecessary (covered in Series Post 6 — index dieting)

The standard production answer: verify with EXPLAIN ANALYZE. The optimizer is mostly right, never infallible. If in doubt, measure.

LINE Engineering — MySQL Workbench VISUAL EXPLAIN carries the same message — visualisation surfaces the optimizer’s wrong picks.

11. Operational Guidelines — When to Pick What

11.1 By WHERE shape

A single table that compresses every decision in this post.

WHERE shape	Right index	Why
`=` (point lookup)	B-tree	Matches the binary-search primitive
`<` / `>` / `BETWEEN` (range)	B-tree	Leaf linked-list walk
`LIKE 'foo%'` (prefix)	B-tree	Front-anchored match = range scan
`LIKE '%foo'` (suffix)	(index unused)	Suffix has no relation to leaf sort order
`JSON_CONTAINS` / `MEMBER OF`	Multi-valued	JSON array element match
`LOWER()` / expression	Functional	Expression result is the key
`MATCH ... AGAINST`	Full-text	Natural-language search
Coordinate-based	Spatial (R-tree)	Bounding-box tree

11.2 Cardinality and column-order rules

Rule	Application
Standalone-index cardinality threshold	Selectivity ≥ 0.01 (= 1% or less per value); below that, no standalone index
Composite column order	Equality (=) first → range (<, >) next → ORDER BY column last
Composite leading column	Highest cardinality (within the equality group)
Low-cardinality columns	Only as trailing columns inside a composite (region / state / boolean)

11.3 Cover when you can, write-amplify only when you must

Rule	Application
Make it covering when possible	Narrow the SELECT list explicitly so all columns fit in the leaf
High-write tables	Indexes minimal — only what is genuinely needed
Bulk loads	Disable indexes during load → enable after (the load-then-add-indexes pattern in this series)
Before adding any index	Measure Before/After with EXPLAIN ANALYZE + write-latency impact + record the decision in an ADR

11.4 The hard rule (production standard)

Every new index must (1) have measured Before/After EXPLAIN ANALYZE, (2) have measured write-latency impact, and (3) be recorded in an ADR or design doc with the rationale — all three before merge.

This compresses every measurement and finding in this post into a process. A PR that adds an index with only “should be faster” as the rationale gets rejected. A PR that says “Q3 1,609ms → 0.65ms / write latency 1.2x / Q2 paradox confirmed” gets merged.

12. Big Tech References + Recap Questions

12.1 Big Tech references (URLs verified ≥ 6)

Source	Article	Linked section in this post
LINE Engineering	Validating index behaviour with MySQL Workbench VISUAL EXPLAIN	Section 10 detecting Q2 paradox + Section 11 EXPLAIN in operations
Kakao Pay	JPA Transactional readOnly + set_option, QPS down 58%	Section 5 covering + read-side optimisation
Toss SLASH24	Next core banking — Oracle→MySQL MVCC + index behaviour	Section 2 / Section 10 optimizer
Vlad Mihalcea	Index Selectivity / Cardinality	Section 4 cardinality
Vlad Mihalcea	MySQL 8 Functional Indexes	Section 7 functional
PlanetScale	How indexes work in MySQL	Section 1 / Section 11
Pinterest Engineering	Sharding Pinterest’s growing data	Section 11 production (indexes + sharding)
Discord	Storing Billions of Messages	Section 9 index cost → distributed migration
MySQL official	CREATE INDEX	Section 1 the six types
MySQL official	Multi-valued Indexes	Section 6
MySQL official	Hash Index — adaptive hash	Section 8.1
Use The Index, Luke!	The Where Clause	Section 3 leftmost prefix

12.2 Recap — putting this article in your own words

If someone who just finished this article were to summarize it through five core questions, here’s how the measurements answer them.

Q. “What index types does MySQL have?”

What this article catalogued is six types. B-tree (default; equality / range / sort), Hash (Memory-engine-only — InnoDB simulates parts via the adaptive hash index), Spatial (R-tree, GEOMETRY columns), Full-text (inverted index, natural-language search), Multi-valued (8.0+, each JSON-array element is a leaf), Functional (8.0.13+, expression result like LOWER() becomes the key). InnoDB can build five of them (no Hash). The decisions you make most often in production are around B-tree and its variants (composite / covering / multi-valued / functional).

Q. “What is the leftmost-prefix rule for composite indexes?”

What this article showed by measurement is — a composite (a, b, c) requires matching from the left. WHERE a=? works, WHERE a=? AND b=? works, full prefix works. But WHERE b=? alone fails to use the index, WHERE b=? AND c=? also fails. The reason: the leaf is sorted as a → b → c, and without a, b is scattered across the leaf — binary search no longer applies. The phone-book analogy: in a directory sorted by surname-then-given-name, finding “given name John” alone forces a full scan. Same principle. Operationally: composite column order is equality (=) first, range (<, >) next, ORDER BY last. In this article’s measurements, (owner_id, state, created_at, id) made Q5 go from 1,497ms to 2.59ms (577x) — leftmost-prefix matching + reverse scan combined ([Measured — Java/Spring]).

Q. “Have you ever added an index and the query got slower?”

What this article surfaced as the canonical case is Q2 (WHERE state='CONFIRMED' ORDER BY created_at DESC LIMIT 5) — 0.658ms without an index, 13.5ms after adding idx_state_created. Adding the index made it slower. Reason: the full scan with LIMIT 5 terminated early after reading ~20 rows, while the index path required secondary-index → table-row random-I/O lookups that exceeded the early-termination savings. The optimizer is statistics-driven cost-based — when stats drift from reality, it picks the wrong plan. Workarounds: (1) index hint (USE INDEX(PRIMARY)) to force, (2) ANALYZE TABLE to refresh stats, (3) drop the index if no other query needs it. Lesson: verify with EXPLAIN ANALYZE. The optimizer is mostly right, never infallible ([Measured — Java/Spring]).

Q. “When do you reach for Multi-valued or Functional indexes?”

What this article catalogued as use cases: Multi-valued (8.0+) sits on a JSON-array column — WHERE JSON_CONTAINS(tags, ?) / MEMBER OF / JSON_OVERLAPS becomes index-driven element-level lookups. One row → N leaves (size of the array). Use case: tags / labels / categories — modeling N:M as JSON instead of a join table. Only suitable when array size is small and INSERT frequency is low. Functional (8.0.13+) makes an expression like LOWER(email) the index key — WHERE LOWER(email) = ? pushes down as an index range scan. Use cases: case-insensitive search, date-part match, JSON field extraction. Caveat: the WHERE expression must match the CREATE INDEX expression exactly to push down — a single-character difference silences the index. EXPLAIN ANALYZE confirmation that the push-down really fires is mandatory.

Q. “Adding N indexes to a table — is the write cost N× higher?”

[Structural fact] One clustered index (the table itself) + N secondary = N+1 B-trees coexisting. Each INSERT updates every B-tree’s leaf — the number of update targets is N+1. [Hypothesis] However, the actual write latency depends on buffer pool / redo log / batch size / change buffer effects, so it does not scale linearly — this article confirms the structural update target count in a five-index environment, not a linear latency multiplier. [Operational recommendation] That’s why the standard pattern for bulk loads is disable indexes during load → enable after (Bulk Data Loading recommendation). [Measured] Storage impact — five indexes on 10M rows ≈ 1.3GB extra (+13% on a 10GB table) → buffer-pool occupancy → clustered-index hit ratio degrades. So the index that speeds up reads costs writes, storage, and memory — read/write ratio + write frequency + storage budget all factor in. Index dieting (sys.schema_unused_indexes / invisible indexes) is the operational standard.

13. What I Learned

13.1 Assumptions broken by measurement

“Every MySQL index is a B-tree” → Six types — Hash / Spatial / Full-text / Multi-valued / Functional all exist
“Adding an index always helps” → The Q2 paradox (0.66ms → 13.5ms) — the optimizer can pick wrong
“Indexes are also useful for boolean / state columns” → A standalone index on cardinality 4 is nearly worthless. Trailing-column-in-composite only
“An index is one-time work” → Storage 1.3GB + structural write amplification (N+1 B-trees) + buffer pool pressure — ongoing cost
“JSON columns can’t be indexed” → Multi-valued (8.0+) and Functional (8.0.13+) make it possible
“The optimizer is always right” → Statistics-driven cost-based — drift from reality means the wrong plan

13.2 The single line

MySQL indexing has six types + four shapes inside the B-tree (clustered/secondary/covering/composite) + cardinality as the decision axes. On a 10M-row environment, [measured] Q3 covering 2,476× / Q5 composite 577× / Q2 paradox (0.66ms → 13.5ms) / structural write amplification (N+1 B-trees) / storage 1.3GB. Indexes are not free — the index that speeds up reads slows down writes. Every new index must clear three gates: measured Before/After EXPLAIN ANALYZE + measured write-latency impact + ADR record — all three, before merge.

14. Next Post — The Series Continues

This is Post #2 of the RDB Mastery series — the trade-offs of index types. Coming up:

Post #3 — Mastering EXPLAIN ANALYZE: builds on Section 10’s Q2 paradox and the row-constructor push-down trap — how the optimizer actually picks an index
Post #4 — Safe operational ALTER patterns (Online DDL / pt-osc / gh-ost): how the N+1 B-trees wobble during ALTER
Post #5 — Limits of 1:N joins (N+1 / EntityGraph / cursor + 1:N): how secondary lookups blow up on top of an ORM
Post #6 — Index dieting: how to recover the N+1 B-tree cost in production

Companion posts:

RDB Mastery #1 — InnoDB Index Internals (the structural view this post recaps in Section 2 / Section 3 / Section 5)
MySQL No-Offset Cursor Pagination (a cursor application of Section 3’s leftmost prefix)
Understanding MySQL InnoDB Architecture (this post on index types / that post on buffer pool / log / undo)

References

Official documentation

MySQL — CREATE INDEX — the six types
MySQL — Multi-valued Indexes — 8.0+
MySQL — Functional Key Parts — 8.0.13+
MySQL — Hash Index / Adaptive Hash
MySQL — Spatial Indexes
MySQL — Full-Text Search
MySQL — Bulk Data Loading for InnoDB — disable indexes during load
MySQL — EXPLAIN Output (type / Extra)

Big Tech / production

LINE Engineering — VISUAL EXPLAIN — detecting optimizer mistakes
Kakao Pay — JPA Transactional readOnly — read-side optimisation
Toss SLASH24 — Next core banking — Oracle → MySQL InnoDB MVCC
PlanetScale — How indexes work in MySQL — operational view
Pinterest — Sharding Pinterest’s growing data — index strategy when sharding
Discord — Storing Billions of Messages — RDB limits → distributed migration

Textbook-grade

The raw data behind these measurements lives in a separate learning note (inside the portfolio repo). 10M-row environment / cardinality of five indexes / Q1~~Q5 Before/After / Q2 paradox / write latency 5~~6× / storage 1.3GB.