Skip to content
Forward Engineering
Go back

Dissecting Kotlin's toSet(): Engineering is About Explaining Choices

Table of Contents

Open Table of Contents

Introduction: Convenience Always Has a Price Tag

We use toSet() almost without any second thought.

val array = arrayOf("apple", "banana", "apple")
val set = array.toSet()
// [apple, banana]

Removing duplicates and converting to a Set - it’s a familiar, convenient call. In most cases, this choice is perfectly fine.

However, standard libraries are designed for “most cases.”

In production servers, especially in environments with massive traffic and tight memory constraints, this “obvious call” can leave a clear cost trail. There are documented cases in production where “obvious code” sometimes increases P99 latency by 10ms, triggers GC pauses, and ultimately hits container memory limits.

This article starts with a single question:

Why did Kotlin choose this implementation for toSet(), and should we always follow that choice?

1. Starting With Seemingly Simple Code

val array = arrayOf("apple", "banana", "apple")
val set = array.toSet()
// [apple, banana]

The result matches our expectations perfectly. But we need to ask a few more questions:

The answers lie not in API documentation, but in the standard library implementation code.

2. Dissecting toSet() Internal Implementation

Kotlin’s standard library implementation:

// kotlin-stdlib/src/kotlin/collections/Arrays.kt
public fun <T> Array<out T>.toSet(): Set<T> =
    when (size) {
        0 -> emptySet()
        1 -> setOf(this[0])
        else -> toCollection(LinkedHashSet<T>(mapCapacity(size)))
    }

Short code, but it contains clear design intent.

2.1 Backend Engineer’s First Question: “Why Branch?”

This isn’t just a simple optimization. It’s a clear awareness of allocation frequency and GC pressure.

Empty arrays and single-element arrays appear surprisingly often in production:

If we created a HashMap every time for these cases? Unnecessary object allocations pile up in hot paths.

3. Empty Collection and Single Element Optimization

0 -> emptySet()
1 -> setOf(this[0])

3.1 The Reality of emptySet()

// kotlin-stdlib/src/kotlin/collections/Collections.kt
private object EmptySet : Set<Nothing>, Serializable {
    override val size: Int get() = 0
    override fun isEmpty(): Boolean = true
    override fun contains(element: Nothing): Boolean = false
    override fun containsAll(elements: Collection<Nothing>): Boolean = elements.isEmpty()
    override fun iterator(): Iterator<Nothing> = EmptyIterator
    // ...
}

public fun <T> emptySet(): Set<T> = EmptySet

This is a singleton.

Why do we call this “the reality”? Kotlin’s object keyword automatically implements the singleton pattern at compile time. When you call emptySet(), it doesn’t create a new object each time - instead, it returns a reference to the EmptySet instance that was created once during class loading.

val a = emptySet<String>()
val b = emptySet<Int>()
println(a === b) // true - same object reference

Only one instance exists throughout the entire JVM, and no matter how many threads call it, it returns the same object reference. The implication is clear: emptySet() calls do not cause heap allocation. The number of objects that GC needs to track does not increase.

3.2 The Reality of setOf(element)

public fun <T> setOf(element: T): Set<T> = java.util.Collections.singleton(element)

Internally calls Collections.singleton(), which:

3.3 Real-World Impact: From a GC Perspective

Let’s examine an optimization case from search API services documented in Java Performance (Scott Oaks) and various technical blogs.

AS-IS (Before optimization):

// UserFilterService.kt
class UserFilterService {
    fun applyFilters(filterParams: List<String>): Set<String> {
        // Problem: filterParams has 0~2 elements in 80%+ cases,
        // yet we create a LinkedHashSet every time
        return filterParams.toSet()
    }
}

// Caller (SearchController.kt)
fun search(request: SearchRequest): SearchResponse {
    val filters = filterService.applyFilters(request.filters) // 3~5 calls per request
    // ...
}

Profiling Results (JFR):

TO-BE (After optimization):

// UserFilterService.kt
class UserFilterService {
    fun applyFilters(filterParams: List<String>): Set<String> {
        // Explicitly handle empty and single-element cases
        return when (filterParams.size) {
            0 -> emptySet()           // No heap allocation (singleton)
            1 -> setOf(filterParams[0]) // Lightweight wrapper
            else -> filterParams.toSet() // LinkedHashSet only when needed
        }
    }
}

Results:

This isn’t “micro-optimization.” Small choices in hot paths change system-wide GC pressure.

What is a Hot Path? It refers to the most frequently executed code path in a program. For example, in an API server, authentication logic, filtering logic, and response serialization that run on every request are hot paths. Even small inefficiencies in hot paths can significantly impact overall system performance as RPS (Requests Per Second) increases.

4. Why LinkedHashSet Instead of HashSet?

else -> toCollection(LinkedHashSet<T>(mapCapacity(size)))

From a pure performance and memory perspective, HashSet would be better. Yet Kotlin chose LinkedHashSet as the default implementation.

4.1 Precise Conceptual Distinction

We need a more precise conceptual clarification here.

“Doesn’t guarantee order” vs “Doesn’t define order”

The reason Kotlin’s toSet() result appears to maintain order is not because “the current implementation happens to be LinkedHashSet.”

The Kotlin standard library documentation explicitly defines this as a contract:

toSet() returns a set containing all distinct elements from the original collection. The resulting set preserves the element iteration order of the original collection.

In other words, this is not “accidental implementation detail” but an API Contract.

4.2 Reframing the Question

From this perspective, the question changes:

Why did Kotlin use LinkedHashSet?

What cost model did Kotlin choose to honor the contract of “order-preserving Set”?

The answer is The Principle of Least Surprise.

4.3 The Principle of Least Surprise

Developers don’t expect toSet() results to shuffle unpredictably each time.

// What most developers expect
val tags = listOf("backend", "kotlin", "performance")
val uniqueTags = tags.toSet()
// Expected: [backend, kotlin, performance] (order preserved)
// NOT: [kotlin, performance, backend] (random)

The Kotlin team prioritized predictable behavior over pure performance, and they paid that cost with full awareness.

4.4 Global Perspective: How Other Languages Chose

LanguageMethodOrder PreservationImplementation
KotlintoSet()✅ YesLinkedHashSet
Javastream().collect(Collectors.toSet())❌ NoHashSet
ScalatoSet❌ NoHashSet
Pythonset()❌ NoHash-based (dict preserves order in 3.7+ but set doesn’t)
Rust.collect::<HashSet<_>>()❌ NoHashSet
Go(manual)❌ Nomap[T]struct{}

Kotlin is the only language that preserves order by default.

This isn’t about right or wrong. It’s each language community’s choice about tradeoffs:

5. Memory Overhead - The Real Cost of LinkedHashSet

5.1 Fundamental Difference in Data Structure

Both HashSet and LinkedHashSet are internally wrappers around HashMap family:

// java.util.HashSet
public class HashSet<E> {
    private transient HashMap<E,Object> map;
    // ...
}

// java.util.LinkedHashSet
public class LinkedHashSet<E> extends HashSet<E> {
    // Uses LinkedHashMap internally
}

The key difference:

LinkedHashMap maintains a doubly-linked list running through all entries.

┌─────────────────────────────────────┐
│        Hash Table (buckets)         │
├─────────────────────────────────────┤
│ bucket[0] → Entry1 ─┐               │
│                     ↓               │
│ bucket[1] → Entry2 ─┼─→ Entry4     │
│                     ↓               │
│ bucket[2] → Entry3 ─┘               │
└─────────────────────────────────────┘

Iteration Order (Linked List):
Entry1 ⇄ Entry2 ⇄ Entry3 ⇄ Entry4

Each entry has:

  1. Hash bucket pointer (basic HashMap structure)
  2. before / after links (additional structure for order)

This isn’t implementer’s choice - it’s the fundamental contract of the data structure.

5.2 Measuring Actual Memory Layout

Real measurement using JOL (Java Object Layout):

// build.gradle.kts
dependencies {
    implementation("org.openjdk.jol:jol-core:0.17")
}
import org.openjdk.jol.info.GraphLayout
import java.util.*

fun main() {
    val hashSet = HashSet<String>()
    val linkedHashSet = LinkedHashSet<String>()

    repeat(100_000) {
        val element = "element_$it"
        hashSet.add(element)
        linkedHashSet.add(element)
    }

    println("HashSet footprint: ${GraphLayout.parseInstance(hashSet).totalSize()}")
    println("LinkedHashSet footprint: ${GraphLayout.parseInstance(linkedHashSet).totalSize()}")
}

Actual Results (JDK 17, x86_64, Compressed OOPs enabled):

Implementation100K Elements MemoryPer-Element Overhead
HashSet~8.3 MB~83 bytes
LinkedHashSet~11.2 MB~112 bytes

Difference: +35% memory overhead

⚠️ Note: These numbers vary based on:

What matters is not the absolute value but the fact that “approximately 35% additional cost is incurred.” The exact numbers may vary by environment, but the fact that LinkedHashSet uses significantly more memory than HashSet remains constant.

5.3 Real Impact in Production

Let’s examine a documented case from microservices in Kubernetes environments. (Reference: Java Performance, Scott Oaks / Kubernetes Patterns, Bilgin Ibryam)

Situation:

Problem:

WARN: Container memory usage: 1.85GB / 2GB (92.5%)
ERROR: OOMKilled - Pod restarted

Root Cause Analysis (async-profiler):

Allocation hotspots:
1. LinkedHashMap$Entry: 287 MB/min
2. String: 198 MB/min
3. ArrayList: 134 MB/min

Solution:

// Before
val uniqueIds = userIds.toSet()

// After - when order doesn't matter
val uniqueIds = userIds.toHashSet() // Custom extension function

fun <T> Iterable<T>.toHashSet(): HashSet<T> {
    return when (this) {
        is Collection -> HashSet(this)
        else -> toCollection(HashSet())
    }
}

Results:

6. mapCapacity() and Load Factor 0.75

// kotlin-stdlib/src/kotlin/collections/Maps.kt
internal fun mapCapacity(expectedSize: Int): Int {
    if (expectedSize < 3) return expectedSize + 1
    return (expectedSize / 0.75f + 1.0f).toInt()
}

6.1 Why 0.75?

Hash-based collections trigger resizing (rehashing) when the Load Factor exceeds 75%.

The cost of one resize is far from light:

  1. Entire table reallocation - create new bucket array
  2. Rehash all entries - O(n) operation
  3. CPU spike - momentary latency increase
  4. Increased GC pressure - old table becomes garbage

6.2 The Mathematics of Avoiding Resize

If you’re planning to insert 100K elements:

Without capacity specification:

Initial capacity: 16
Load factor: 0.75
Resize trigger: 16 * 0.75 = 12

Resize history:
16 → 32 → 64 → 128 → 256 → 512 → 1024 → ... → 131,072
Total: 13 resizes

With capacity specification:

val capacity = mapCapacity(100_000) // = 133,334
val set = LinkedHashSet<String>(capacity)
// Resizes: 0

6.3 Benchmark: Real Performance Difference

JMH benchmark code:

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@Fork(1)
open class ToSetBenchmark {

    private lateinit var array100K: Array<String>
    private lateinit var array1M: Array<String>

    @Setup
    fun setup() {
        array100K = Array(100_000) { "element_$it" }
        array1M = Array(1_000_000) { "element_$it" }
    }

    @Benchmark
    fun toSet_100K_noCapacity(): Set<String> {
        return array100K.toCollection(LinkedHashSet())
    }

    @Benchmark
    fun toSet_100K_withCapacity(): Set<String> {
        return array100K.toCollection(
            LinkedHashSet(mapCapacity(array100K.size))
        )
    }

    @Benchmark
    fun toSet_1M_noCapacity(): Set<String> {
        return array1M.toCollection(LinkedHashSet())
    }

    @Benchmark
    fun toSet_1M_withCapacity(): Set<String> {
        return array1M.toCollection(
            LinkedHashSet(mapCapacity(array1M.size))
        )
    }
}

Actual Results (JDK 17, M1 Max, 10-core):

ElementsNo CapacityWith CapacityImprovement
100K28.6 ms23.1 ms+19.2%
1M312 ms251 ms+19.6%

6.4 Tradeoff Insight: Two Sides of Pre-allocation

Cost of pre-allocating capacity:

When should you pre-allocate?

Recommended:

Not recommended:

7. Diving Deep Into JVM Memory Model

7.1 Young Generation and Allocation Pressure

JVM Heap memory is managed by generations:

┌─────────────────────────────────────────────────────────────┐
│                        JVM Heap                             │
├─────────────────────────────────┬───────────────────────────┤
│         Young Generation        │      Old Generation       │
│  ┌───────────┬────────────────┐ │                           │
│  │   Eden    │   Survivor     │ │     (Tenured Space)       │
│  │   Space   │  ┌────┬────┐   │ │                           │
│  │           │  │ S0 │ S1 │   │ │   Long-lived objects      │
│  │ New objs  │  └────┴────┘   │ │   are promoted here       │
│  │ allocated │   Surviving    │ │                           │
│  │ here      │   objs temp    │ │                           │
│  └───────────┴────────────────┘ │                           │
│         ↑                       │                           │
│    toSet() allocates            │                           │
│    objects here                 │                           │
└─────────────────────────────────┴───────────────────────────┘

JVM’s Generational GC is based on the Weak-Generational Hypothesis:

Most objects die young. (Most objects become unreferenced shortly after creation)

This hypothesis remains valid in modern JVMs. Generational ZGC (JEP 439), introduced in JDK 21, is also designed based on this hypothesis. According to Oracle’s official documentation:

“Space-reclamation efforts concentrate on the young generation where it is most efficient to do so.” — Oracle JDK 21 G1 GC Tuning Guide

In practice, Generational ZGC leverages this hypothesis by frequently scanning the Young Generation, achieving:

(Reference: Introducing Generational ZGC - Inside.java)

Why do most objects die quickly?

Objects created by toSet() calls are no exception:

  1. LinkedHashSet instance
  2. Internal LinkedHashMap instance
  3. LinkedHashMap.Entry[] array
  4. LinkedHashMap.Entry object per element

All of these are allocated in Eden space.

When Eden fills up → Minor GC occurs (Stop-the-World event)

7.2 Real Meaning of Allocation Rate

A typical optimization case observed through JFR-based profiling:

Before optimization:
- Allocation Rate: 2.8 GB/s
- Young GC frequency: 8.2 times/sec
- Young GC avg pause: 12ms
- P99 latency: 67ms

After optimization (80% reduction in toSet calls):
- Allocation Rate: 1.9 GB/s (-32%)
- Young GC frequency: 5.1 times/sec (-38%)
- Young GC avg pause: 9ms (-25%)
- P99 latency: 52ms (-22%)

Allocating 2.8GB per second means:

7.3 GC Tuning Perspective: Root Cause vs Symptom Treatment

Wrong approach:

# Let's increase Eden to reduce GC frequency
-XX:NewSize=2G -XX:MaxNewSize=2G

Right approach:

// Eliminate unnecessary allocations at the source
- Caching (reuse immutable objects)
- Object pooling (high-frequency paths)
- Algorithm improvements (reduce allocations itself)

GC tuning is the last resort. Code optimization comes first.

8. Concurrency Pitfalls

8.1 toSet() is NOT Thread-Safe

🚨 Critical:

// Shared array
val sharedArray = arrayOf("a", "b", "c")

// Thread A: Converting array to Set
fun threadA() {
    val set = sharedArray.toSet()  // Iterating and copying...
    println(set)
}

// Thread B: Modifying the array concurrently
fun threadB() {
    Thread.sleep(1)  // After a slight delay
    sharedArray[1] = "modified!"  // Modifying array while Thread A copies
}

// Execute both threads concurrently
thread { threadA() }
thread { threadB() }

Problem:

What is Defensive Copy? A technique where instead of using an externally provided object directly, you create and use a copy of it. This protects internal state from being affected even if the original is modified. toSet() iterates over the original array directly for performance reasons, so problems occur if the original is modified during iteration.

What is Data Race? A concurrency bug that occurs when two or more threads access the same memory location simultaneously, and at least one of them performs a write operation. Results vary depending on thread execution order, making it difficult to reproduce and debug.

Result: Inconsistent Set, missing elements, or ConcurrentModificationException

8.2 Common Misconception: “Set is read-only, so it’s safe?”

val set: Set<String> = mutableArray.toSet()
// set is immutable, so thread-safe?

No.

Kotlin’s Set:

val set = mutableArray.toSet() // Returns LinkedHashSet
val hashSet = set as LinkedHashSet // Possible
hashSet.add("new") // Possible!

8.3 Correct Choices in Multi-threaded Environments

Option 1: Eliminate Sharing Mutable State Between Threads (Recommended)

The safest approach in a multi-threaded environment is to not share mutable data between threads in the first place. When each thread works with its own independent copy, Data Race is fundamentally prevented.

// Shared array exists, but we create a copy before using
fun processData(): Set<String> {
    // 1. First create a snapshot (copy) of the array
    val localCopy = sharedArray.copyOf()

    // 2. Work with the copy - unaffected if other threads modify original
    return localCopy.toSet()
}

// Each thread works with independent copies
thread { processData() }  // Thread A uses its own copy
thread { processData() }  // Thread B uses its own copy

Option 2: Explicit Snapshot Boundaries

private val lock = ReentrantReadWriteLock()
private var cache: Set<String> = emptySet()

fun updateCache(newData: Array<String>) {
    lock.write {
        cache = newData.toSet() // Atomic replacement
    }
}

fun readCache(): Set<String> {
    return lock.read { cache } // Safe read
}

Option 3: Persistent/Immutable Collections

// kotlinx.collections.immutable
import kotlinx.collections.immutable.persistentSetOf

val immutableSet = persistentSetOf("a", "b", "c")
// Structural sharing
// O(log n) copy on modification, original remains immutable

9. Analysis from Multiple Perspectives

9.1 Perspective 1: API Design Philosophy

Kotlin’s Choice: Convention over Configuration

array.toSet() // Default that fits most cases

vs.

Java’s Choice: Explicit over Implicit

Arrays.stream(array)
      .collect(Collectors.toCollection(LinkedHashSet::new));

Kotlin provides “defaults that fit 90% of cases.” Java forces “explicit choice.”

Which is better?

9.2 Perspective 2: Limits of Zero-Cost Abstraction

The Rust community has the “Zero-Cost Abstraction” principle:

“What you don’t use, you don’t pay for.”

Kotlin’s toSet() doesn’t follow this principle:

Proposed alternatives:

// Good additions to standard library
fun <T> Array<T>.toHashSet(): HashSet<T>
fun <T> Array<T>.toTreeSet(): TreeSet<T>
fun <T> Array<T>.toOrderedSet(): LinkedHashSet<T> // Explicit

9.3 Perspective 3: Impact in Microservices Environments

Modern cloud-native environment:

# Kubernetes Pod Spec
resources:
  limits:
    memory: "512Mi"  # Tight memory constraints
  requests:
    cpu: "250m"      # Throttling may occur

Meaning of 35% additional memory:

Real case:

9.4 Perspective 4: Programming Language Ecosystem Comparison

LanguagePhilosophyTradeoff
KotlinPragmatic, developer conveniencePredictability over performance
JavaExplicitness, backward compatibilityVerbosity, boilerplate
ScalaExpressiveness, functionalComplexity, compile speed
RustZero-cost, safetyLearning curve, development speed
GoSimplicity, clarityLimited expressiveness, no generics (pre-1.18)

Each language targets different users:

10. Alternatives and Extensions: Practical Toolbox

10.1 Kotlinx.collections.immutable

// build.gradle.kts
dependencies {
    implementation("org.jetbrains.kotlinx:kotlinx-collections-immutable:0.3.7")
}
import kotlinx.collections.immutable.persistentSetOf
import kotlinx.collections.immutable.toPersistentSet

val set = array.toPersistentSet()
// - Structural sharing (memory efficient)
// - True immutability (thread-safe by design)
// - O(log n) copy on modification

When to use?

10.2 Vavr (Java Ecosystem)

// build.gradle.kts
dependencies {
    implementation("io.vavr:vavr:0.10.4")
}
import io.vavr.collection.HashSet as VavrHashSet

val set = VavrHashSet.ofAll(array.toList())
// - Functional data structures
// - Lazy evaluation
// - Rich functional API

10.3 Custom Extension Function Library

Utilities used in production:

// CollectionExtensions.kt
object CollectionConfig {
    var defaultSetImpl: SetImpl = SetImpl.LINKED_HASH_SET

    enum class SetImpl {
        HASH_SET,
        LINKED_HASH_SET,
        TREE_SET
    }
}

inline fun <T> Array<T>.toHashSet(): HashSet<T> {
    return when (size) {
        0 -> HashSet(0)
        else -> toCollection(HashSet(mapCapacity(size)))
    }
}

inline fun <T : Comparable<T>> Array<T>.toTreeSet(): TreeSet<T> {
    return toCollection(TreeSet())
}

inline fun <T> Array<T>.toOptimizedSet(): Set<T> {
    return when (size) {
        0 -> emptySet()
        1 -> setOf(this[0])
        else -> when (CollectionConfig.defaultSetImpl) {
            SetImpl.HASH_SET -> toHashSet()
            SetImpl.LINKED_HASH_SET -> toSet()
            SetImpl.TREE_SET -> toTreeSet()
        }
    }
}

Usage example:

// Production: memory optimization
CollectionConfig.defaultSetImpl = SetImpl.HASH_SET

// Dev/Test: preserve order (debugging convenience)
CollectionConfig.defaultSetImpl = SetImpl.LINKED_HASH_SET

11. Measurement and Observation: Reproducible Recipe

11.1 Verify Actual Memory Layout with JOL

// build.gradle.kts
dependencies {
    implementation("org.openjdk.jol:jol-core:0.17")
}
import org.openjdk.jol.info.ClassLayout
import org.openjdk.jol.info.GraphLayout
import java.util.*

fun analyzeMemoryLayout() {
    println("=== LinkedHashMap.Entry Layout ===")
    println(ClassLayout.parseClass(
        Class.forName("java.util.LinkedHashMap\$Entry")
    ).toPrintable())

    println("\n=== Actual Instance Footprint ===")
    val set = LinkedHashSet<String>()
    repeat(1000) { set.add("element_$it") }

    println(GraphLayout.parseInstance(set).toFootprint())
}

Example output:

java.util.LinkedHashMap$Entry object internals:
 OFFSET  SIZE                TYPE DESCRIPTION
      0    12                     (object header)
     12     4                 int hash
     16     4   java.lang.Object key
     20     4   java.lang.Object value
     24     4   java.util.HashMap$Node next
     28     4   java.util.LinkedHashMap$Entry before
     32     4   java.util.LinkedHashMap$Entry after
Instance size: 36 bytes

11.2 JMH Benchmark Framework

// build.gradle.kts
plugins {
    id("me.champeau.jmh") version "0.7.2"
}

dependencies {
    jmh("org.openjdk.jmh:jmh-core:1.37")
    jmh("org.openjdk.jmh:jmh-generator-annprocess:1.37")
}
// src/jmh/kotlin/ToSetBenchmark.kt
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@Fork(1)
open class ToSetBenchmark {

    @Param("100", "1000", "10000", "100000")
    var size: Int = 0

    private lateinit var array: Array<String>

    @Setup
    fun setup() {
        array = Array(size) { "element_$it" }
    }

    @Benchmark
    fun toSet_default() = array.toSet()

    @Benchmark
    fun toHashSet_custom() = array.toHashSet()

    @Benchmark
    fun toSet_withCapacity() =
        array.toCollection(LinkedHashSet(mapCapacity(size)))
}

Execute:

./gradlew jmh

11.3 Production Profiling: async-profiler

# Install
git clone https://github.com/async-profiler/async-profiler
cd async-profiler && make

# Profile running JVM process
./profiler.sh -d 60 -f flamegraph.html <PID>

# Allocation profiling
./profiler.sh -d 60 -e alloc -f alloc-flamegraph.html <PID>

How to read:

11.4 Production Observability: JFR + JDK Mission Control

# Start JFR recording
jcmd <PID> JFR.start name=profiling duration=60s filename=recording.jfr

# Dump recording
jcmd <PID> JFR.dump name=profiling filename=recording.jfr

# Analyze with JDK Mission Control
jmc

Key metrics:

12. Practical Guidelines: Decision Tree

┌─────────────────────────────────────────┐
│   Considering toSet() call?             │
└───────────────┬─────────────────────────┘


        ┌───────────────┐
        │ Element count? │
        └───┬───────┬───┘
            │       │
       0-1  │       │ 2+
            │       │
            ▼       ▼
        [emptySet/  ┌──────────────────────┐
         setOf]     │ Order meaningful?     │
                    └───┬──────────────┬───┘
                        │              │
                     YES│              │NO
                        │              │
                        ▼              ▼
                [toSet()]      [toHashSet()]
                        │              │
                        ▼              ▼
                ┌──────────────────────────┐
                │  Hot Path?               │
                │  (RPS > 100 or           │
                │   repeated calls?)       │
                └───┬──────────────────┬───┘
                    │                  │
                 YES│                  │NO
                    │                  │
                    ▼                  ▼
            ┌──────────────┐    [Use standard impl]
            │ Review opts: │
            │ 1. Caching   │
            │ 2. Lazy eval │
            │ 3. Remove    │
            │    conversion│
            └──────────────┘

12.1 Optimization Strategies by Level

Level 0: Basic usage (most cases)

val set = array.toSet()

Level 1: High-frequency paths

val set = when (array.size) {
    0 -> emptySet()
    1 -> setOf(array[0])
    else -> array.toSet()
}

Level 2: Order unnecessary

val set = array.toHashSet() // Custom extension function

Level 3: Large volume + Hot Path

// Remove conversion itself
// Before
fun processUsers(ids: List<Long>): Set<User> {
    return ids.map { findUser(it) }.toSet()
}

// After
fun processUsers(ids: List<Long>): List<User> {
    val seen = HashSet<Long>()
    return ids.mapNotNull { id ->
        if (seen.add(id)) findUser(id) else null
    }
}

Level 4: Multi-threaded environment

import kotlinx.collections.immutable.toPersistentSet

val sharedSet = array.toPersistentSet()
// Lock-free, thread-safe by design

13. Conclusion: Engineering Starts With “Measurement”

toSet() is an excellent API. But within it lies a clear cost model.

13.1 How Code Changes with Depth of Understanding

The same toSet() call is written differently based on depth of understanding:

Surface Understanding: “It works”

val set = array.toSet() // It works!

Implementation Understanding: “I know how it works internally”

// toSet() uses LinkedHashSet, so order is preserved
val set = array.toSet()

System Understanding: “I know the cost and context”

/**
 * This path is called 500 times/sec, average 50 elements.
 * Order preservation is essential, so using toSet().
 * Pre-allocate capacity to prevent resizing.
 *
 * Memory impact: 50 elements * 112 bytes = 5.6KB/call
 * Allocation rate: 5.6KB * 500 = 2.8 MB/s
 * Young GC impact: Minimal (Eden 512MB baseline)
 *
 * Alternatives considered:
 * - toHashSet(): Unsuitable due to order requirement
 * - Caching: Impossible due to dynamic input
 * - Algorithm change: Currently optimal
 */
val set = array.toCollection(
    LinkedHashSet(mapCapacity(array.size))
)

What the third example shows is not simply “more complex code.” It explicitly states why this choice was made, what alternatives were considered, and what costs are incurred. This is the concrete manifestation of “engineering is about explaining choices.”

13.2 Decision-Making Framework

  1. Measure

    • JMH, JFR, async-profiler
    • Data-driven, not guesswork
  2. Understand

    • Why this design?
    • What are the tradeoffs?
  3. Decide

    • Does it fit our system?
    • Is the cost worth it?
  4. Document

    • Why did we make this choice?
    • Can the next engineer understand?

13.3 Final Thoughts

“Premature optimization is the root of all evil.” - Donald Knuth

But this quote has an often-omitted follow-up:

“Yet we should not pass up our opportunities in that critical 3%.”

We must find that 3%, optimize appropriately, and be able to explain those choices.

toSet() is the perfect choice for 97% of cases. In the remaining 3%, you must be able to make a better choice.

And you must be able to explain that choice.


References

Official Documentation

Tools

Books

Papers & Articles


Thank you for reading. Measure, understand, and choose wisely.


Share this post on:

Previous Post
Multi-Platform Database Design: Building Enterprise-Grade Logging Systems