IC Study Guide

Incident Commander Training

Changelog

2026-03-31

IC Skills tab added. Six topic areas: Framing the Incident, Ownership Assignment, Timeline Tracking, Parallel Work, Decisive Action, and Structured Communication. Each section includes notes, an SVG diagram, a flashcard, and practice quiz questions (8 total). Six new clusters added.

2026-03-31

Databases — Seeded Reports. New note section under a new Observability cluster, covering how generic pre-built reports can mask real incident impact. Includes SVG diagram, one flashcard, and two quiz questions.

2026-03-30

Oracle Stack tab added. Ten topic areas covering identity and authentication: IDCS auth failure, token expiry, SSO misconfiguration, LDAP latency, user provisioning, MFA failure, OAuth/OIDC, certificate expiry, rate limiting, and identity dependency failure. Includes 10 notes with SVG diagrams, 10 flashcards, and 20 quiz questions across 4 clusters.

2026-03-30

Databases — Oracle cluster added. Three new note sections: Undo & Read Consistency, Memory Architecture (SGA/PGA), and Undo + Memory Interaction in a RAC environment. Includes 3 flashcards and 3 quiz questions.

2026-03-30

Networking tab added. Five topic areas: DNS Record Types, TTL & Propagation, TCP vs UDP, TCP Handshake, and Retransmissions & Congestion. Includes notes with SVG diagrams, 5 flashcards, and 10 quiz questions across DNS and TCP/IP clusters.

2026-03-30

Cluster filter bar. Filter buttons added to Flashcards and Practice tabs across all topics. Flashcard count and quiz question count update live when a filter is selected.

2026-03-30

Databases — content expansion. Five new note sections added: Deadlocks, Replication Lag, Long-Running Transactions, Buffer Pool, and Query Timeout vs Connection Timeout. Matching flashcards and quiz questions added. Notes reorganised into logical clusters with labelled nav groups.

2026-03-27

Initial build. Single-file study guide with Databases tab. Traffic model content across 10 note sections with SVG diagrams. 3D flip flashcards with progress tracking. Symptom → Diagnosis practice quiz with shuffle, scoring, and colour-coded end screen. Now live at edgeset.dev/ic.

Query Optimizer GPS choosing the route

What it does: Chooses how queries are executed. Decides indexes, join order, and access paths.

Problem in incident: Picks an inefficient execution plan. Ignores indexes or misjudges data distribution.

Effect (what you see): Gradual slowdown, queries pile up, CPU increases.

Technical effect:

  • Full table scans instead of index lookups
  • More rows processed than needed
  • Increased CPU / disk I/O
  • Connections held longer

What it means: System doing too much work per query. Inefficiency spreading across system. Can lead to saturation or connection exhaustion.

Analogy: GPS sends cars through small roads instead of highways.

Incident signals:

  • Slow query logs increasing
  • db file scattered read waits (full-scan reads)
  • Rising latency

Key insight: The optimizer makes its decision automatically based on statistics. If stats are stale or data distribution has shifted, it can pick the wrong plan even when a good index exists — causing a sudden slowdown with no code change.

IC Questions: "Any slow queries?" / "What changed?" / "Are indexes being used?" / "Are statistics up to date?"

[Diagram: good plan (index lookup, 3 rows, ~5ms) vs bad plan (full table scan, millions of rows read, ~2000ms)]
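The key insight above can be sketched as a toy cost model. The cost constants and the `choose_plan` helper are made up for illustration; real optimizers use far richer statistics, but the failure mode is the same: stale inputs flip the plan with no code change.

```python
# Toy cost model (illustrative numbers, not any real optimizer's costing).
# Stale statistics feed the same formula wrong inputs, flipping the plan choice.

def choose_plan(table_rows, est_matching_rows):
    """Pick the cheaper access path from row-count statistics."""
    index_cost = est_matching_rows * 3   # a few I/Os per matched row via the index
    scan_cost = table_rows               # one pass over every row
    return "index" if index_cost < scan_cost else "full_scan"

# Accurate stats: 3 matching rows out of 1,000,000 -> index lookup wins
print(choose_plan(1_000_000, 3))         # index
# Stale stats wildly overestimate matches -> full scan chosen, no code change involved
print(choose_plan(1_000_000, 500_000))   # full_scan
```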

When Does an Index Lose Its Effectiveness? Library catalog

Core understanding: An index isn't "broken" — it becomes less useful when the optimizer decides it's no longer efficient. This happens due to fragmentation, poor selectivity, or outdated statistics.

What it does: Helps the database find data quickly.

Problem in incident: Index exists but queries are slow.

Effect (what you see): Slow queries, full table scans.

Technical effect:

  • Fragmentation from frequent inserts/updates/deletes
  • Statistics out of date
  • Optimizer ignores index

What it means: Navigation system exists but is unreliable.

Analogy: Library catalog that's messy or outdated.

Incident signals: Full table scan, high read I/O.

IC Questions: "Has data changed recently?" / "Are indexes still used?"
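One way the optimizer "decides the index is no longer efficient" is selectivity. A minimal sketch, where `index_still_useful` and the ~5% threshold are illustrative rules of thumb rather than any database's actual cutoff:

```python
def selectivity(distinct_values):
    """Average fraction of the table an equality lookup returns (lower = more selective)."""
    return 1 / distinct_values if distinct_values else 1.0

def index_still_useful(distinct_values, threshold=0.05):
    """Rule-of-thumb sketch: an index helps when a lookup touches under ~5% of rows."""
    return selectivity(distinct_values) < threshold

print(index_still_useful(100_000))  # True: highly selective column (e.g. order ID)
print(index_still_useful(2))        # False: e.g. a two-value status flag
```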

Slow Queries & Indexing Road choice and quality

What it does: Determines how fast data is accessed.

Problem in incident: Missing indexes or inefficient queries.

Effect (what you see): Gradual slowdown, high CPU.

Technical effect:

  • Full scans
  • High CPU / I/O
  • Increased query duration

What it means: System inefficiency under load. Can cascade into bigger issues.

Analogy: Cars using small roads instead of highways.

Incident signals:

  • Slow query logs
  • High CPU
  • db file scattered read waits (full-scan reads)

IC Questions: "Any slow queries?" / "Indexes being used?" / "Recent changes?"

[Diagram: with index ~5ms vs no index ~2000ms, i.e. 400× slower without an index]

Buffer Pool / Cache Hit Ratio City warehouse vs distant storage depot

What it does: The buffer pool (or buffer cache) holds frequently accessed data pages in memory so the DB can serve reads from RAM instead of disk.

Problem in incident: If the buffer pool is too small or gets evicted under memory pressure, the DB must read from disk more often — causing high read I/O and latency even when queries are efficient.

Effect (what you see): High disk read I/O, slow reads, elevated "physical reads" metric. Looks similar to a missing index but queries may have good plans.

Technical effect:

  • Low cache hit ratio → frequent physical reads from disk
  • Memory pressure → pages evicted before they can be reused
  • Working set larger than available buffer pool

Key distinction from disk I/O bottleneck: Disk I/O bottleneck = disk can't keep up with demand. Buffer pool problem = too many requests hitting disk that could be served from memory.

Analogy: Warehouse runs out of stock — every request requires a trip to a distant depot instead of grabbing from the shelf.

Incident signals: Low cache hit ratio alert, high physical reads, memory utilisation high on DB host.

IC Questions: "What is the cache hit ratio?" / "Has memory pressure increased?" / "Has the working data set grown recently?"
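The hit-ratio mechanics can be sketched with a tiny LRU cache. `BufferPool` is a hypothetical model, not a real database's buffer manager; it just shows how a working set slightly larger than the pool can collapse the hit ratio:

```python
from collections import OrderedDict

class BufferPool:
    """Tiny LRU buffer-pool sketch: tracks the cache hit ratio as pages are read."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.hits = self.reads = 0

    def read(self, page_id):
        self.reads += 1
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)      # mark as recently used
        else:                                    # physical read from "disk"
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)   # evict least recently used page
            self.pages[page_id] = True

    @property
    def hit_ratio(self):
        return self.hits / self.reads if self.reads else 0.0

# Working set (10 pages) fits the pool (capacity 10): near-perfect hit ratio
fits = BufferPool(10)
for _ in range(100):
    for p in range(10):
        fits.read(p)
print(fits.hit_ratio)     # 0.99

# Working set (20 pages) exceeds capacity: cyclic access thrashes LRU, every read misses
thrash = BufferPool(10)
for _ in range(10):
    for p in range(20):
        thrash.read(p)
print(thrash.hit_ratio)   # 0.0
```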

Row Lock One lane blocked

What it does: Locks specific rows during updates.

Problem in incident: Long transactions hold locks.

Effect (what you see): Queries waiting, localised slowdown.

Technical effect:

  • Other queries blocked on same rows
  • Increased wait times
  • Queue formation

What it means: One piece of work is blocking others. Can escalate if widespread.

Analogy: One lane closed due to accident.

Incident signals:

  • enq: TX - row lock contention
  • TX enqueue (mode 6)
  • Queries waiting

Key insight: Write always blocks write. Whether a write blocks a read depends on isolation level — in some databases reads are never blocked (MVCC); in others they wait. Important distinction for diagnosing who is actually stuck.

IC Questions: "What's blocking?" / "Any long transactions?" / "Can we clear it?" / "Is this write-write or write-read contention?"

[Diagram: T1 active, holds lock on Row X; T2, T3, T4 all waiting, blocked]

Deadlocks Two cars blocking each other at a junction

What it does: Two transactions each hold a lock the other needs, causing a circular wait that neither can resolve.

Problem in incident: Transactions freeze waiting on each other — the database must detect and kill one to break the cycle.

Effect (what you see): One transaction is rolled back with a deadlock error. Throughput drops if deadlocks are frequent.

Technical effect:

  • T1 holds lock on Row A, wants Row B
  • T2 holds lock on Row B, wants Row A
  • DB deadlock detector kills one (the "victim") and rolls it back

Key distinction from row lock: Row lock contention is one-directional (one waits). A deadlock is circular (both wait on each other). The DB resolves it automatically but the rolled-back transaction may retry and repeat.

Analogy: Two cars at a narrow junction, each waiting for the other to reverse — neither can move until one backs down.

Incident signals: Deadlock errors in logs, rolled-back transactions, retry storms.

IC Questions: "Are deadlock errors in the logs?" / "Is the same pair of transactions involved?" / "Are retries making it worse?"
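The circular-wait structure is exactly what deadlock detectors look for. A minimal sketch of cycle detection in a wait-for graph (the `find_deadlock` helper is illustrative; real detectors also choose a victim by rollback cost):

```python
def find_deadlock(wait_for):
    """Detect a cycle in a wait-for graph {txn: txn_it_waits_on}; return the cycle or None."""
    for start in wait_for:
        seen, node = [], start
        while node in wait_for:
            if node in seen:
                return seen[seen.index(node):]   # circular wait found
            seen.append(node)
            node = wait_for[node]
    return None

# T1 waits on T2, T2 waits on T1 -> circular wait: a deadlock, one must be killed
print(find_deadlock({"T1": "T2", "T2": "T1"}))   # ['T1', 'T2']
# T2 and T3 both wait on T1, which waits on nobody -> plain blocking chain, no deadlock
print(find_deadlock({"T2": "T1", "T3": "T1"}))   # None
```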

Metadata Lock Entire road closed

What it does: Locks entire table structure.

Problem in incident: Schema change blocks all access.

Effect (what you see): Sudden freeze — queries pile up instantly.

Technical effect:

  • All queries blocked waiting on metadata
  • No progress despite low CPU

What it means: System is blocked, not overloaded. One operation is halting everything.

Analogy: Entire road shut down.

Incident signals:

  • Queries stuck "waiting"
  • Low CPU but high latency

IC Questions: "Any schema changes?" / "What's blocking?" / "Can we stop it?"

[Diagram: ALTER TABLE in progress holds the metadata lock on the table; Queries A, B, C, D all blocked]

Locks & Contention Blocked roads and junctions

What it does: Controls access to shared data.

Problem in incident: Too many locks or long transactions.

Effect (what you see): Queries waiting — system appears stuck.

Technical effect:

  • Blocking chains
  • Increased wait times
  • Throughput drops

What it means: Work is queued behind blockers. System not overloaded — just blocked.

Analogy: Traffic jam behind blocked road.

Incident signals:

  • Lock wait alerts
  • Waiting queries

IC Questions: "What's blocking?" / "How long?" / "Can we remove it?"

[Diagram: blocking chain: T1 active holds the lock; T2, T3, T4+ queued behind it, the chain growing. System not overloaded, just blocked; killing T1 unblocks the chain]

Long-Running Transactions A lorry blocking a side road for hours

What it does: A transaction that stays open much longer than normal, holding locks and resources throughout.

Problem in incident: Long transactions are a root cause that triggers several other issues — they hold row locks (blocking others), prevent log truncation (causing log growth), and inflate undo/rollback segments.

Effect (what you see): Depends on what the transaction is doing — could appear as row lock contention, log growth, or disk pressure rather than the transaction itself.

Technical effect:

  • Holds row locks for extended period → blocks other transactions
  • Prevents transaction log from being truncated → log grows
  • Holds undo/rollback space → undo segment pressure

Key insight: Often invisible as a direct alert — you see the symptoms (lock waits, log growth) but must look for long-running transactions as the underlying cause.

Analogy: A lorry parked across a side road for hours — blocking everything behind it and preventing road crews from clearing the area.

Incident signals: Long transaction time in monitoring, lock waits, log growth, undo pressure.

IC Questions: "Any transactions open for an unusual length of time?" / "Is this causing lock waits or log growth?" / "Can it be safely rolled back?"
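Because long-running transactions rarely alert directly, ICs often have to ask for them explicitly. A hypothetical monitoring sketch, assuming you can list open transactions with their start times (the 300-second threshold is an illustrative choice):

```python
def flag_long_transactions(open_txns, now, threshold_s=300):
    """Surface transactions open longer than a threshold (hypothetical monitoring shape).

    open_txns: {txn_id: start_timestamp_seconds}
    """
    return [txn_id for txn_id, started in open_txns.items()
            if now - started > threshold_s]

# t1 opened at t=0 and is still open at t=1000: flagged as a likely root cause
print(flag_long_transactions({"t1": 0, "t2": 900}, now=1000))   # ['t1']
```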

Redo Log / Transaction Log Traffic control recording every car movement

What it does: Records all changes for durability and recovery.

Problem in incident: Heavy write activity overwhelms logging. Logs become a bottleneck.

Effect (what you see): System slows under write load. Even simple operations delayed.

Technical effect:

  • Increased disk writes
  • Log flush contention
  • Transactions slowed waiting for log writes

What it means: Write throughput is limiting performance. System can't commit changes fast enough. Risk of cascading slowdown.

Analogy: Cars must stop at a checkpoint before continuing.

Incident signals:

  • High write latency
  • Disk pressure
  • Slow commits

IC Questions: "Is write volume high?" / "Any long transactions?" / "Is disk under pressure?"

[Diagram: app writes (heavy) → redo log (flush contention, the bottleneck) → slow disk flush → commit acknowledged]

Bottleneck in Transaction Log Single toll booth

Core understanding: All write operations must be recorded in the transaction log first. If the log can't keep up (slow disk or high write volume), everything slows down.

What it does: Ensures durability of writes.

Problem: Log becomes a bottleneck.

Effect (what you see): Slow transactions, connection buildup.

Technical effect:

  • Log write delays
  • Commit latency rises

What it means: Central write system is congested.

Analogy: Single toll booth causing traffic backup.

Incident signals: Log write waits, rising active sessions.

IC Questions: "Is disk slow?" / "Too many writes?"

Are Items Removed from Transaction Log? Black box recorder

Core understanding: Completed transactions are not immediately removed. The log keeps them until it is safe to reuse the space — after checkpoints and/or log backups, depending on the system.

What it does: Stores transaction history for recovery.

Problem: Log keeps growing.

Effect (what you see): Disk pressure.

Technical effect:

  • Entries retained until safe for recovery
  • Space reused later (not deleted immediately)

What it means: Log space is reused under controlled conditions, not deleted outright.

Analogy: Black box recorder that overwrites old data later.

Incident signals: Log growth alerts.

IC Questions: "Are log backups running?" / "Any long transactions?"

Checkpoint vs Log Backup Unloading truck vs clearing warehouse

Core understanding: Checkpoint writes data pages to disk for recovery. Log backup allows the transaction log to reuse space. They solve different problems — using the wrong one won't fix the issue.

What it does:

  • Checkpoint → flushes data pages to disk
  • Log backup → frees log space for reuse

Problem: Log growing unexpectedly.

Effect (what you see): Disk issues despite checkpoints running.

Technical effect:

  • Checkpoint does not truncate the log
  • Log backup is required to free space

What it means: Wrong tool applied to the problem.

Analogy: Unloading a truck (checkpoint) vs clearing the whole warehouse (log backup).

Incident signals: Log growth despite checkpoints running.

IC Questions: "Are log backups configured?" / "What recovery mode is set?"

Database Connections / Connection Pooling Cars entering the city

What it does: Limits number of active DB connections.

Problem in incident: Too many connections or leaks.

Effect (what you see): Requests waiting or timing out.

Technical effect:

  • Connection pool exhausted
  • Requests queued before DB
  • Threads blocked waiting

What it means: System can't accept more work. Often caused by slow queries or leaks.

Analogy: Cars queued at city entrance.

Incident signals:

  • "Too many connections"
  • Timeouts
  • Sometimes low DB utilisation (requests queue before ever reaching the DB)

IC Questions: "Are we at max connections?" / "Are connections released?" / "What's holding them?"

[Diagram: connection pool (max 10): 8 in use, 2 free; requests 11, 12, 13 queued waiting for a slot and timing out]
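Pool exhaustion and the resulting queueing can be sketched as a fixed set of slots behind a queue. `ConnectionPool` is a minimal illustration, not a production pool (no health checks, no leak detection):

```python
import queue

class ConnectionPool:
    """Minimal pool sketch: a fixed number of connection slots behind a queue."""
    def __init__(self, max_size):
        self._slots = queue.Queue()
        for i in range(max_size):
            self._slots.put(f"conn-{i}")

    def acquire(self, timeout=0.1):
        try:
            return self._slots.get(timeout=timeout)   # block until a slot frees up
        except queue.Empty:
            # Pool exhausted: the request never reaches the database at all
            raise TimeoutError("could not acquire connection")

    def release(self, conn):
        self._slots.put(conn)

pool = ConnectionPool(max_size=2)
a = pool.acquire()
b = pool.acquire()
try:
    pool.acquire()        # third request: pool exhausted, times out at the door
except TimeoutError as e:
    print(e)              # could not acquire connection
pool.release(a)           # a slot frees up...
c = pool.acquire()        # ...and the next waiting request succeeds
```

Note how slow queries make this worse: each one holds its slot longer, so the pool empties even though the database itself may look idle.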

Connection Pathway + Redo Log Club capacity + slow bar

Core understanding: A client must connect before running queries. Write operations are logged first (redo/transaction log). If the system is slow, connections stay open longer and can hit limits.

What it does: Handles access and write durability.

Problem: Too many connections / slow commits.

Effect (what you see): Connection errors, requests rejected.

Technical effect:

  • Flow: Client → Connect → Limit check → Query → Execute → Log
  • Slow log → slow commits → connections pile up → limit hit

What it means: System saturated at entry or commit stage.

Analogy: Club at capacity with slow bar service — people can't get in or get stuck inside.

Incident signals: "Too many connections" error, rising active sessions.

IC Questions: "Are connections being released?" / "Where is the bottleneck?"

Query Timeout vs Connection Timeout Order taking too long vs never getting a table

What it does: Two different timeout types that produce similar-looking errors but have different causes and fixes.

Problem in incident: Teams often conflate them — treating a connection timeout like a slow query problem, or vice versa. Diagnosing the wrong one wastes time.

Technical effect:

  • Query timeout: Connection was made, query started, but it ran too long — DB or app killed it. Cause: slow query, missing index, lock wait.
  • Connection timeout: App could not get a connection within the time limit — never reached a query. Cause: pool exhausted, DB overloaded, network issue.

Key distinction:

  • Query timeout → you got in, but service was too slow
  • Connection timeout → you never got a table

Analogy: Query timeout = seated at a restaurant but your order never arrives. Connection timeout = no tables available, turned away at the door.

Incident signals: Error message wording — "query exceeded timeout" vs "connection timed out" / "could not acquire connection".

IC Questions: "What does the exact error say?" / "Did the connection succeed?" / "Is the pool full or are queries just slow?"
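A rough triage sketch for the "what does the exact error say?" question. The phrases matched here are assumptions; real drivers word their errors differently, so adapt the patterns to your stack:

```python
def classify_timeout(error_message):
    """Route the exact error text to the right line of investigation (illustrative)."""
    msg = error_message.lower()
    if "could not acquire connection" in msg or "connection timed out" in msg:
        return "connection timeout: check pool usage, DB load, network"
    if "timeout" in msg and "quer" in msg:
        return "query timeout: check slow queries, indexes, lock waits"
    return "unclear: read the exact error text before acting"

print(classify_timeout("ERROR: query exceeded timeout (30s)"))
print(classify_timeout("could not acquire connection from pool"))
```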

Temp Index Rebuild Road maintenance during rush hour

What it does: Rebuilds or reorganises indexes.

Problem in incident: Happens during peak load. Competes for resources.

Effect (what you see): Sudden slowdown, increased I/O and CPU.

Technical effect:

  • Heavy disk usage
  • Temporary space consumption
  • Increased contention with live queries

What it means: Background work is stealing capacity from production traffic. Can trigger wider performance issues.

Analogy: Roadworks reducing available lanes.

Incident signals:

  • Maintenance job running
  • "tablespace is full" (possible)
  • Disk spikes

Key insight: Rebuilding creates a new index alongside the old one before swapping — temporarily doubling the storage needed. Disk full alerts during maintenance are often this, not a general storage leak.

IC Questions: "Any maintenance running?" / "Can we pause it?" / "Is disk space OK?" / "Was disk headroom checked before the job started?"

[Diagram: three lanes; two carry live query traffic, one is closed for an index rebuild consuming disk I/O and CPU in competition with live queries]

Resource Saturation (CPU / Disk / Memory) City at full capacity

What it does: Provides compute and storage resources.

Problem in incident: System exceeds capacity.

Effect (what you see): Everything slows — no single clear cause.

Technical effect:

  • CPU maxed → slow processing
  • Disk maxed → slow reads/writes
  • Memory pressure → less caching

What it means: System overloaded. Needs load reduction or scaling.

Analogy: Entire city overwhelmed with traffic.

Incident signals:

  • High CPU / disk
  • System-wide latency

IC Questions: "Which resource is maxed?" / "Load spike or inefficiency?" / "Can we reduce load?"

[Diagram: CPU 95%, disk I/O 88%, memory 82%, all near or past the 80% danger threshold]

Replication Lag Branch office receiving yesterday's updates

What it does: Changes written to the primary database are replicated to read replicas, usually with a small delay.

Problem in incident: Lag grows — reads from replicas return stale data. Users see outdated results or inconsistencies.

Effect (what you see): Data appears to "go backwards" or users see different data depending on which replica they hit. May look like a bug rather than an infrastructure issue.

Technical effect:

  • Primary processes writes faster than replica can apply them
  • Replica falls behind — lag measured in seconds or minutes
  • Reads routed to replica return old data

Common causes: Heavy write load on primary, slow replica disk, long-running queries on replica blocking apply, network issues.

Analogy: Head office sends updates daily — branch office is working from yesterday's data.

Incident signals: Replication lag metric rising, user reports of stale data, replica behind primary by N seconds.

IC Questions: "What is current replica lag?" / "Are reads being routed to replicas?" / "Is write load on primary spiking?" / "Can we route reads to primary temporarily?"
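Lag measurement and read routing can be sketched from commit timestamps. Both helpers and the 5-second staleness budget are illustrative assumptions, not any replication system's real API:

```python
def replica_lag_seconds(primary_commit_ts, replica_applied_ts):
    """Lag = how far the replica's last applied commit trails the primary's newest commit."""
    return max(0.0, primary_commit_ts - replica_applied_ts)

def pick_read_target(lag_s, max_staleness_s=5.0):
    """Illustrative routing policy: fall back to the primary when the replica is too stale."""
    return "replica" if lag_s <= max_staleness_s else "primary"

lag = replica_lag_seconds(primary_commit_ts=1000.0, replica_applied_ts=970.0)
print(lag)                     # 30.0 seconds behind
print(pick_read_target(lag))   # primary: temporarily route reads away from the replica
```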

Incident Chain How it all connects

  1. Bad query plan: inefficient routing (full scans instead of index lookups)
  2. Queries slow down: they stay longer in the system (connections held, queue grows)
  3. Redo log pressure increases: write throughput constrained (commits begin to slow)
  4. Index rebuild kicks in: background maintenance steals capacity (disk I/O spikes)
  5. Locks appear: row and metadata locks block traffic (wait queues form)
  6. System gridlock: nothing moves (full saturation or connection exhaustion)

Severity progression: performance degradation → capacity reduction → critical / blocking.

Undo & Read Consistency (RAC) Old maps for drivers

Core understanding: Oracle lets readers see a consistent past version of data using undo, even while writes are happening. In RAC, this consistency must work across multiple nodes, which adds coordination overhead.

What it does:

  • Stores before-images of data (undo)
  • Lets queries read a stable snapshot
  • Prevents read/write blocking

Problem in incident: Undo too small or overwritten; long queries need old data that no longer exists; RAC adds delay due to cross-node access.

Effect (what you see): "Snapshot too old" query failures; sudden query slowdowns; intermittent errors on long-running reports.

Technical effect: Required undo data no longer available, or slow retrieval across RAC nodes.

What it means: Capacity issue (undo too small) or workload mismatch (long queries vs high churn). In RAC, could also be inter-node latency.

Analogy: Cars (queries) need a map of the road from 5 minutes ago. Old maps (undo) keep getting thrown away. If the map is gone, the driver gets lost — query fails.

Incident signals: "snapshot too old" errors; long-running queries failing; spikes in undo usage; RAC: interconnect latency warnings.

IC Questions: "Are queries long-running?" / "Has the data change rate increased?" / "Any recent batch jobs?" / "Is this happening across all RAC nodes or just one?"

Memory Architecture (SGA/PGA, RAC) Kitchens with shared fridges

Core understanding: Oracle uses memory to cache data and speed up queries. In RAC, each node has its own memory but must share data via interconnect — the "pinging" problem.

What it does:

  • SGA = shared memory (data cache, SQL cache)
  • PGA = per-session memory
  • Reduces disk I/O by caching hot data

Problem in incident: Memory pressure (too many queries); cache inefficiency; RAC blocks constantly moving between nodes.

Effect (what you see): High latency; high CPU; slow queries across cluster; sudden performance degradation.

Technical effect: Cache misses lead to more disk reads; RAC block transfer overhead between nodes ("gc" waits).

What it means: Resource contention (memory/CPU) or bad workload distribution across RAC. Often: too many queries, poor query patterns, or hot blocks bouncing between nodes.

Analogy: Each RAC node is a separate kitchen with its own fridge. If a chef needs something from another kitchen, they must run across the street. Too much running = everything slows down.

Incident signals: High CPU; high memory usage; RAC interconnect traffic spikes; "buffer busy waits" / "gc" waits.

IC Questions: "Is load evenly distributed across nodes?" / "Any spike in query volume?" / "Are specific queries dominating?" / "Is one node worse than others?"

Undo + Memory Interaction (RAC) Bridge congestion + roadworks

Core understanding: Undo and memory work together to serve consistent reads quickly. In RAC, this may involve remote memory access between nodes — heavy writes and long reads colliding causes compounding pressure.

What it does:

  • Memory serves cached data quickly
  • Undo reconstructs older versions for consistency
  • RAC shares both mechanisms across nodes

Problem in incident: Heavy writes + long reads + RAC traffic causes simultaneous contention and latency.

Effect (what you see): Cluster-wide slowdown; queries inconsistent in performance; timeouts; mixed symptoms (CPU + latency + errors).

Technical effect: Undo reconstruction + memory contention happening at the same time; inter-node block transfers compound both.

What it means: System under stress — multiple subsystems interacting badly. Often triggered by batch jobs or reporting running alongside heavy writes.

Analogy: Cars need old maps (undo). Roads are busy (writes). Cities are connected by bridges (RAC). Too many cars crossing bridges + changing roads = gridlock.

Incident signals: Mixed symptoms (CPU + latency + errors); RAC interconnect spikes; query variability; undo errors alongside memory pressure.

IC Questions: "What changed? (batch job, release)" / "Is this cluster-wide?" / "Are reads and writes colliding at the same time?"

Seeded Reports City-wide traffic map

Core understanding: A seeded report is a pre-built, default report that ships with a system. Designed for common use cases — not tailored to your specific environment or incident needs.

What it does: Provides standard visibility into data (performance, usage, sales) without requiring a custom build.

Problem in incident: Seeded reports often lack the detail, speed, or focus needed during an active incident.

Effect (what you see):

  • Missing key data you need right now
  • Reports too slow to load
  • Data feels generic — "nothing looks wrong"
  • Teams say "the report looks fine" but users are impacted

Technical effect: Queries are broad and inefficient; not optimised for real-time debugging; may miss critical filters or dimensions (specific customer, query, endpoint).

What it means (IC interpretation): Observability gap. You're relying on generic tooling instead of targeted insight — this slows decision-making and prolongs the incident.

Analogy: A city-wide traffic map. It shows "traffic looks normal overall" — but your incident is a single blocked lane on one street. You need a zoomed-in camera, not a general map.

Incident signals:

  • "Dashboard shows normal but users report slowness"
  • "Report takes too long to generate"
  • "No visibility into specific query / user / service"
  • Conflicting statements between teams

IC Questions: "Do we have a more granular or real-time view?" / "Can we filter to affected users or endpoints?" / "Is this report cached or delayed?" / "Who can run a targeted query or log search instead?"

[Diagram: seeded report (city traffic map, "overall normal", blockage hidden) vs targeted view (zoomed camera: endpoint /api/load blocked for the affected customer, root cause visible)]

Symptom → Diagnosis

Read the incident symptom and identify the most likely cause.
27 questions · shuffled each round · score tracked.

DNS Record Types Contact list with routing rules

Core understanding: DNS isn't just "name → IP." It stores different record types that control where traffic goes and how services are discovered.

What it is: A distributed directory with multiple record types, each serving a different routing purpose.

Key records:

  • A → domain → IPv4 (most common)
  • AAAA → domain → IPv6
  • CNAME → alias (domain points to another domain)
  • MX → mail routing
  • TXT → verification / policies (SPF, DKIM)
  • NS → which DNS servers are authoritative

Problem in incident: Wrong IP in A record · broken CNAME chain · missing or incorrect records

Effect (what you see): Users routed to wrong server · partial outages · some services work, others fail

Technical effect: DNS resolves — but to the wrong destination

What it means: Misconfiguration, not outage — traffic is flowing, but incorrectly

Analogy: Contact list with wrong phone numbers or forwarding rules

Incident signals:

  • Traffic hitting wrong servers
  • Sudden shift in traffic patterns
  • "It works for some domains but not others"

IC questions: "What record changed?" / "Are we resolving to the expected IP?" / "Is there a CNAME chain involved?"

Pattern: Traffic going somewhere wrong → think DNS misconfiguration

DNS record types, what they map, what they're used for, and the incident risk:

  • A: domain → IPv4 · website/API traffic · risk: wrong IP → wrong server
  • AAAA: domain → IPv6 · IPv6 traffic · risk: IPv6-only users broken
  • CNAME: domain → domain · aliases/CDN/subdomains · risk: broken chain → NXDOMAIN
  • MX: domain → mail server · email routing · risk: email fails, site still up
  • TXT: domain → text string · SPF, DKIM, verification · risk: emails marked spam
  • NS: domain → nameserver · authoritative server lookup · risk: all DNS resolution fails

TTL & Propagation Old maps still in circulation

Core understanding: DNS changes are not instant — TTL (Time To Live) controls how long old answers stay cached by resolvers across the internet.

What it does: TTL determines how long a resolver caches a DNS answer before it re-queries the authoritative server.

Problem in incident: Old records still cached · some users see new config, others see old

Effect (what you see): "Works for me but not others" · gradual recovery · region-dependent behaviour

Technical effect: Different resolvers return different answers — inconsistent global state

What it means: Not a failure — the change is still propagating. Expected behaviour after a DNS update.

Analogy: Old maps still being used while new maps are being distributed

Incident signals:

  • Mixed behaviour across regions or users
  • Gradual improvement over time after a DNS change
  • "Some users fixed, others still broken"

IC questions: "What is the TTL?" / "When was the change made?" / "Are caches cleared?"

Pattern: Inconsistent behaviour after a DNS change → think TTL propagation delay

[Diagram: resolver A, stale cache (TTL not expired) serves the old answer 203.0.113.5, user still broken; resolver B, TTL expired, re-queries the authoritative server and gets 203.0.113.9, user fixed]
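Resolver caching is easy to model: an answer is served until its TTL expires, regardless of what the authoritative server now says. `TTLCache` is a toy sketch (times are plain numbers rather than wall-clock reads):

```python
class TTLCache:
    """Resolver-style cache sketch: answers are served until their TTL expires."""
    def __init__(self):
        self._entries = {}

    def put(self, name, ip, ttl, now):
        self._entries[name] = (ip, now + ttl)   # remember the answer and its expiry

    def get(self, name, now):
        entry = self._entries.get(name)
        if entry and now < entry[1]:
            return entry[0]    # cache hit: possibly a stale answer
        return None            # expired: must re-query the authoritative server

cache = TTLCache()
cache.put("example.com", "203.0.113.5", ttl=300, now=0)   # old record, 5-minute TTL
# The record changes upstream at t=60, but this resolver keeps serving the old IP:
print(cache.get("example.com", now=120))    # 203.0.113.5 (stale but "valid")
print(cache.get("example.com", now=301))    # None -> re-query picks up the new IP
```

This is why lowering TTL before a planned DNS change shortens the window of mixed behaviour.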

TCP vs UDP Registered mail vs postcards

Core understanding: TCP and UDP are two transport protocols — reliable vs fast. Knowing which one your traffic uses changes how you diagnose failures.

TCP (Transmission Control Protocol): Reliable, ordered, connection-based · used by HTTP/S, MySQL · retries automatically · guaranteed delivery

UDP (User Datagram Protocol): Fast, no guarantees, connectionless · used by DNS, streaming, VoIP · sends and forgets — no retry built in

Problem in incident:

  • TCP: congestion, connection limits, slow under load
  • UDP: silent drops, hard-to-detect failures, no error trail

Effect (what you see): TCP issues → timeouts, slow apps · UDP issues → intermittent failures, missing responses

What it means: TCP problems = congestion or capacity · UDP problems = loss or instability

Analogy: TCP = registered mail (guaranteed delivery) · UDP = postcards (fast but may get lost)

Incident signals:

  • TCP: high latency, connection timeouts
  • UDP: missing responses, intermittent failures, no error logs

IC questions: "Is this TCP or UDP traffic?" / "Do we see retries or silent drops?" / "Is reliability or speed more critical?"

Pattern: Silent failures with no error logs → think UDP packet loss

[Diagram: TCP (HTTP/S, MySQL): every packet ACKed, retransmitted if no ACK (registered mail) vs UDP (DNS, streaming, VoIP): a dropped packet is never resent and no error is logged (postcard)]

TCP Handshake & Connection Lifecycle Knocking on a door that won't answer

Core understanding: Before any data flows, TCP must establish a connection via a 3-step handshake. If this fails, no requests can be processed at all.

The handshake: SYN → SYN-ACK → ACK

Problem in incident: Handshake fails or is delayed · SYN queue fills up · server cannot accept new connections

Effect (what you see): Connection timeouts · users can't connect · errors appear before any request is sent

Technical effect: Entry point is saturated — the problem is at the door, not inside the application

What it means: Often load-related or an attack — not an application bug

Analogy: Knocking on a door but no one answers — the house is overwhelmed before anyone can get inside

Incident signals:

  • SYN backlog warnings
  • High connection attempt counts
  • Timeouts before any request data is exchanged

IC questions: "Are connections failing before requests?" / "Is the SYN queue full?" / "Is this a traffic spike or an attack?"

Pattern: Fails before any request is processed → think TCP handshake saturation

[Diagram] Normal handshake: client SYN ("I want to connect") → server SYN-ACK ("OK, I'm ready") → client ACK ("confirmed, send data") · connection established, data transfer begins. Saturated: SYN queue full, new SYNs dropped, the client sees a connection timeout.
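From the client side, the handshake outcome itself is diagnostic: a refused connection means the server answered with a reset, while a timeout means the SYN got no reply at all (dropped by a full SYN queue, a firewall, or an unreachable host). A small probe, as a sketch:

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP handshake attempt the way an IC frames it:
    'open'    -> SYN / SYN-ACK / ACK completed
    'refused' -> server sent RST (port closed, service down)
    'timeout' -> no SYN-ACK at all (SYN dropped: queue full,
                 firewall, or unreachable host)"""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    finally:
        s.close()
```

A saturated SYN queue typically reports as "timeout" here, which matches the pattern: failure before any request data is exchanged.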

Retransmissions & Congestion Traffic jam where cars keep re-entering

Core understanding: When TCP packets are lost, they are automatically retransmitted. Under high load, this creates a congestion feedback loop — more retransmits = more traffic = worse congestion.

What it does: TCP guarantees delivery by resending lost packets — but each resend adds to overall traffic load.

Problem in incident: High retransmission rate · congestion builds · performance degrades progressively under sustained load

Effect (what you see): Slow responses · latency climbing · throughput dropping under load

Technical effect: More traffic → more loss → more retransmits → worse performance (self-reinforcing loop)

What it means: Network degradation spiral — not a full outage, but worsening performance under load

Analogy: Traffic jam where cars keep re-entering — clearing gets harder the more vehicles try to pass

Incident signals:

  • Retransmission rate climbing
  • Latency increasing over time
  • Throughput dropping under load

IC questions: "Are retransmissions increasing?" / "Is packet loss present?" / "Where is the congested link?"

Pattern: Progressive slowdown under load + rising retries → think TCP congestion loop

[Diagram] Retransmission congestion loop over time: normal → load builds → packet loss → retransmits → spiral (latency up, throughput down, retransmits looping higher).
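The self-reinforcing loop can be made concrete with a toy model. The numbers below (100 new packets per tick against a link that can carry 90) are invented purely for illustration:

```python
def simulate(new_per_tick: int = 100, capacity: int = 90, ticks: int = 10):
    """Toy model of the retransmission feedback loop: anything over
    capacity is dropped and comes back next tick as a retransmit,
    inflating the very load that caused the loss."""
    retransmits = 0
    history = []
    for _ in range(ticks):
        offered = new_per_tick + retransmits   # new traffic + resends
        lost = max(0, offered - capacity)      # overflow is dropped
        retransmits = lost                     # drops become retransmits
        history.append((offered, lost))
    return history

# Offered load climbs every tick even though new demand is constant:
for offered, lost in simulate():
    print(f"offered={offered:4d}  lost={lost:3d}")
```

Demand never changes, yet offered load and loss both grow each tick, which is the progressive slowdown pattern an IC sees in the metrics.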

Symptom → Diagnosis

Read the incident symptom and identify the most likely cause.
10 questions · shuffled each round · score tracked.

IDCS Global Authentication Failure Highway entrance closed

Core understanding: IDCS is a centralised cloud identity provider. It acts as the first gate users must pass through before reaching any system. If it becomes unavailable, users cannot authenticate anywhere — even though the underlying apps may still be healthy.

What it is: A shared login authority used across multiple systems.

What it does: Authenticates users and issues access tokens.

Problem in incident: IDCS outage or service disruption.

Effect (what you see):

  • All apps inaccessible after login attempt
  • 401/403 spike across every service simultaneously

Technical effect: No tokens issued — authentication cannot begin.

IC interpretation: Central dependency failure — the authentication hub is down.

Analogy: Highway entrance closed — all routes blocked even though the roads beyond are clear.

Incident signals: Login failures across all apps at once · drop in successful auth metrics.

IC questions: "Are all apps affected?" / "Is IDCS reachable?" / "When did auth success rate drop?"

Pattern recognition: All apps fail login simultaneously → suspect IDCS.

[Diagram] IDCS global auth failure: hub-and-spoke with IDCS down · every app (A–D) returns a 401/403 spike and all logins fail, even though the apps themselves are healthy.

Token Expiry / Validation Issues Expired train ticket during journey

Core understanding: After login, users don't continuously re-authenticate — they use tokens as proof of identity. These tokens have rules like expiration time and validation checks. If those rules are misconfigured or systems disagree on time, valid users can suddenly appear invalid.

What it does: Maintains authenticated sessions across systems.

Problem in incident: Expired or misvalidated tokens.

Effect (what you see):

  • Random mid-session logouts
  • Intermittent 401 errors for users already logged in

Technical effect: Token rejected by applications.

IC interpretation: Misconfiguration or time sync issue — not an outage.

Analogy: Expired train ticket during the journey — you bought it, you're on the train, but the gate says it's invalid.

Incident signals: Token validation errors in logs · session drops without user action.

IC questions: "Are tokens expiring earlier than expected?" / "Is system time consistent across services?"

Pattern recognition: Random auth failures for already-logged-in users → token issue.

[Diagram] Token lifecycle: a misconfigured TTL or clock drift ends the valid window before the expected expiry, so the user gets a 401 mid-session. Fix: check token TTL config → sync system clocks → roll back if recently changed.

Federation / SSO Misconfiguration Two border checkpoints refusing each other

Core understanding: Federation allows one identity system to trust another (e.g., corporate login into cloud apps). This relies on precise configuration and certificates. If that trust breaks, users get stuck in login flows or cannot authenticate at all.

What it does: Enables login via external identity providers.

Problem in incident: Broken trust configuration or certificate mismatch.

Effect (what you see):

  • Redirect loops — browser bounces between app and login page
  • Login fails after being redirected to SSO

Technical effect: Authentication handshake fails between identity providers.

IC interpretation: Integration misconfiguration — the two systems no longer agree on trust.

Analogy: Two border checkpoints refusing to accept each other's stamps.

Incident signals: Repeated redirect errors · SSO-specific error codes · only SSO users affected.

IC questions: "Are only SSO users affected (local accounts still work)?" / "Any cert or config changes recently?"

Pattern recognition: Redirect loop → SSO / federation issue.

[Diagram] Federation/SSO broken trust: user login bounces between the corporate IdP (e.g. AD FS), IDCS, and the app in a redirect loop. Check: certificate validity · SAML/OIDC metadata · recent cert or config changes.

LDAP Latency (IDM) Traffic jam at ID checkpoint

Core understanding: LDAP is the directory service that stores user identities in IDM environments. During login, systems query LDAP to verify users. If LDAP is slow, every authentication request slows down — even if nothing is technically broken.

What it does: Provides user data for authentication queries.

Problem in incident: Slow directory responses.

Effect (what you see):

  • Login takes much longer than normal (15–20s instead of 1–2s)
  • Occasional timeouts for some users

Technical effect: Queued or delayed auth requests — high LDAP response times.

IC interpretation: Performance bottleneck — slowness, not failure.

Analogy: Traffic jam at the ID checkpoint — everyone gets through eventually, but very slowly.

Incident signals: High auth latency · complaints about slow login, not login failure.

IC questions: "Is login slow or actually failing?" / "What are LDAP query response times?" / "Any load increase recently?"

Pattern recognition: Login eventually works but is very slow → LDAP latency.

[Diagram] LDAP latency: auth requests queue (R1–R6) while LDAP drains them slowly · response time ~20 ms normally vs 8,000 ms+ under load. Check: LDAP query times · index health · connection pool exhaustion · server load.
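The key IC question, "slow or actually failing?", can be answered mechanically by wrapping the directory call and classifying the outcome. A sketch: `lookup` stands in for whatever client function performs the LDAP query, and the threshold is illustrative:

```python
import time

def timed_lookup(lookup, user: str, slow_threshold: float = 2.0):
    """Wrap a directory lookup and classify it the way an IC frames it:
    'slow' is a latency incident (LDAP bottleneck), 'failed' is an
    outage, 'ok' is neither. Returns (status, elapsed_seconds)."""
    start = time.monotonic()
    try:
        lookup(user)
    except Exception:
        return ("failed", time.monotonic() - start)
    elapsed = time.monotonic() - start
    return ("slow" if elapsed > slow_threshold else "ok", elapsed)
```

Aggregating these statuses across a sample of logins separates "everyone eventually gets through, slowly" (LDAP latency) from genuine auth failure.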

User Provisioning / Sync Issues Different checkpoints, different passenger lists

Core understanding: Users and permissions are synchronised across systems. If this process fails, different systems may have different views of who a user is or what they can access — creating inconsistent, hard-to-diagnose failures.

What it does: Keeps user identities and roles consistent across all systems.

Problem in incident: Sync delays or failures.

Effect (what you see):

  • Some users fail while others succeed
  • Permissions missing or incorrect for affected users

Technical effect: Data inconsistency across systems.

IC interpretation: State mismatch — not an outage, but a divergence between systems.

Analogy: Different checkpoints using different passenger lists.

Incident signals: Only specific users or groups affected · new users, recently changed roles, or recently onboarded teams impacted.

IC questions: "Who exactly is affected?" / "Any recent provisioning changes or new user onboarding?"

Pattern recognition: Partial user failures (not everyone) → sync or provisioning issue.

[Diagram] Provisioning sync failure creates a state mismatch: the identity source lists Alice, Bob, Carol (new) · System A is synced and knows all three · System B's sync failed, so it is stale and rejects Carol.
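Scoping "who exactly is affected" often reduces to diffing the identity source of truth against each system's view. A minimal sketch, with invented user names:

```python
def provisioning_drift(source: set[str], replica: set[str]) -> dict[str, list[str]]:
    """Compare the identity source of truth against one system's view.
    Users in `missing` explain 'only some users fail' incidents;
    users in `stale` should have been deprovisioned."""
    return {
        "missing": sorted(set(source) - set(replica)),  # not yet synced
        "stale": sorted(set(replica) - set(source)),    # removed upstream
    }

drift = provisioning_drift({"alice", "bob", "carol"}, {"alice", "bob", "dave"})
print(drift)  # carol can't log in to this system; dave shouldn't be able to
```

If the `missing` list correlates with recently onboarded users, sync failure moves to the top of the diagnosis list.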

MFA Failure Second checkpoint blocked

Core understanding: MFA adds a second verification step after password authentication. This step often depends on external systems (SMS providers, authenticator apps). If it fails, users are authenticated on password but cannot complete login.

What it does: Provides additional identity verification beyond password.

Problem in incident: MFA system or provider failure.

Effect (what you see):

  • Users stuck after entering their password
  • MFA prompts that never arrive or fail to validate

Technical effect: Second authentication step cannot complete.

IC interpretation: Partial authentication failure — first step worked, second step blocked.

Analogy: Getting through the first checkpoint but being blocked at the second.

Incident signals: MFA error messages in logs · push notifications or SMS not arriving.

IC questions: "Where exactly does login stop — before or after MFA prompt?" / "Is this an external MFA provider?"

Pattern recognition: Login stalls after password entry → MFA failure.

[Diagram] MFA failure: step 1 (password) succeeds, step 2 (MFA) fails because the external provider (SMS/push) is unreachable. Check: MFA provider status · SMS gateway · push service · consider a temporary bypass for recovery.

OAuth / OIDC Misconfiguration Wrong key for one door

Core understanding: Applications must be correctly configured to trust IDCS tokens. This includes client IDs, secrets, and redirect URLs. A small mismatch can break authentication for a single app while others work fine.

What it does: Connects individual applications to the identity provider.

Problem in incident: Incorrect client configuration in one app.

Effect (what you see):

  • One specific app fails login
  • All other apps still work fine

Technical effect: Token rejected by the misconfigured application.

IC interpretation: App-specific misconfiguration — scope is narrow, not a platform issue.

Analogy: Wrong key for one door — master key still works on all others.

Incident signals: Single app impacted · OAuth error codes (invalid_client, redirect_uri_mismatch).

IC questions: "Is this only one app or multiple?" / "Any config deployment to this app recently?"

Pattern recognition: One app broken while others work → OAuth / OIDC misconfiguration.

[Diagram] OAuth/OIDC misconfig: IDCS issues tokens · Apps A and B (correct config) work · App C rejects tokens due to a wrong client_id. Check: client_id · client_secret · redirect_uri · scopes · recent app deployment.
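The checks an authorization server runs against a client registration can be sketched directly. The error names mirror standard OAuth error codes, but the helper itself and its field layout are illustrative:

```python
def validate_client(request: dict, registered: dict) -> list[str]:
    """Sketch of the per-app checks in an OAuth/OIDC flow. Any single
    mismatch breaks login for that one app while every correctly
    configured app keeps working. Returns a list of errors ([] = ok)."""
    errors = []
    if request["client_id"] != registered["client_id"]:
        errors.append("invalid_client")
    if request["redirect_uri"] not in registered["redirect_uris"]:
        errors.append("redirect_uri_mismatch")
    if not set(request["scopes"]) <= set(registered["scopes"]):
        errors.append("invalid_scope")
    return errors
```

This is why the blast radius is so narrow: the registration being checked belongs to one client, so one wrong field fails one app.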

Certificate Expiry Expired passport

Core understanding: Certificates establish trust between systems in authentication flows. They have expiration dates. When they expire, systems stop trusting each other — causing sudden, complete failures with no degraded middle period.

What it does: Secures and validates identity communication between systems.

Problem in incident: Expired certificate.

Effect (what you see):

  • Sudden, complete login failure — was working, now completely broken
  • SSO stops working

Technical effect: Trust validation fails — systems refuse to communicate.

IC interpretation: Preventable config failure — a known expiry date was missed.

Analogy: Expired passport — valid until midnight on the expiry date, then refused everywhere instantly.

Incident signals: Certificate error messages in logs · sudden complete outage with no deployment.

IC questions: "Did any certificate expire recently?" / "Was there a cert change or renewal attempt?"

Pattern recognition: Sudden auth break with no deployment → check certificate expiry first.

[Diagram] Certificate expiry: auth works normally for the certificate's whole valid window, then all auth fails instantly at the expiry date. Fix: renew cert → deploy → verify trust chain. Prevent: monitor cert expiry dates proactively.

Rate Limiting / Throttling Road closed due to too much traffic

Core understanding: Identity systems protect themselves by limiting how many requests they accept per time window. During traffic spikes, legitimate users can be blocked if limits are hit — even when the identity system itself is completely healthy.

What it does: Prevents overload or abuse by capping request rates.

Problem in incident: Too many requests trigger the limit.

Effect (what you see):

  • Login failures during peak usage times
  • 429 (Too Many Requests) responses

Technical effect: Requests rejected or delayed by the rate limiter.

IC interpretation: Capacity or protection issue — the limit may be correct or may need tuning.

Analogy: Road closed due to too much traffic — the road is fine, volume exceeded what's allowed.

Incident signals: Traffic spike correlates exactly with login failure onset · 429 errors in logs.

IC questions: "Is there a traffic spike right now?" / "Are 429 errors visible?" / "What are the configured rate limit thresholds?"

Pattern recognition: Peak usage + login failures + 429 errors → throttling.

[Diagram] Rate limiting: request rate climbs from normal to a peak that breaches the limit ceiling · 429s until traffic recovers. Check: 429 errors · request rate vs configured limits · genuine spike vs client retry storm.
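Most identity-tier throttles are a variant of the token bucket: a burst budget that refills at a fixed rate, with everything over budget answered 429. A minimal sketch (the class and its parameters are illustrative, not any specific product's limiter):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: allows bursts up to
    `capacity` requests, refilled at `rate` tokens per second.
    A rejected request is what a client sees as a 429."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)   # start with a full burst budget
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True    # request served
        return False       # request rejected -> 429
```

The IC-relevant property: the service behind the limiter is perfectly healthy; only the arrival rate exceeded the budget.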

Identity Dependency Failure Checkpoint staff can't access records

Core understanding: Identity systems rely on underlying services like databases, network, and storage. If those fail, identity services degrade or stop working — even if the identity system's own processes are healthy.

What it does: Depends on backend infrastructure to function.

Problem in incident: Database, network, or storage failure beneath IDCS.

Effect (what you see):

  • Slow or failed login
  • Auth errors combined with infrastructure alerts

Technical effect: Backend dependency unavailable — IDCS cannot complete auth lookups.

IC interpretation: Downstream dependency issue — the visible failure is auth, but the root cause is infrastructure.

Analogy: Checkpoint staff can't access the records database — they're present but unable to do their job.

Incident signals: Infra alerts fire alongside auth failures · auth latency spike coincides with DB / network alerts.

IC questions: "Are there DB or network alerts at the same time?" / "Is this auth-only or a wider infrastructure issue?"

Pattern recognition: Auth failures + infra alerts simultaneously → dependency failure.

[Diagram] Identity dependency failure: user auth fails · the IDCS process is up but degraded because its database query fails (DB down/slow) · infra alerts fire alongside. Key insight: IDCS may look healthy while the root cause sits in the infrastructure layer below it.

Symptom → Diagnosis

Read the incident symptom and identify the most likely cause.
20 questions · shuffled each round · score tracked.

Framing the Incident (Impact First) Side street vs motorway

Core understanding: Framing means quickly defining what is broken and how bad it is. Without it, teams focus on the wrong things or move too slowly.

What it does: Aligns everyone on what matters most and how urgent the situation is.

Problem in incident: Engineers jump into debugging without confirming impact. Low-priority issues get the same attention as critical ones. No urgency → slow decisions.

Effect (what you see): People asking different questions, no shared sense of severity, delayed mitigation.

What it means (IC interpretation): This is a priority alignment problem. The system isn't just failing — the response is unfocused.

Analogy: An accident happens but no one knows if it's on a side street or a major motorway. If it's the motorway (checkout), you need immediate response and all resources focused.

Incident signals: "Is this actually impacting users?" / "How bad is this?" / "Are we sure this is critical?" / Multiple threads of investigation.

IC questions: "What is the user impact right now?" / "Which functionality is affected?" / "Is this revenue-critical (checkout/login)?" / "How many users are impacted?" / "When did this start?"

Then state clearly: "Checkout is failing → high priority → focus on mitigation."

[Diagram] Impact framing, which road is blocked: side street blocked (low traffic, one lane, low urgency) vs motorway blocked (checkout, all users, urgent now).

Ownership Assignment Uncontrolled junction

Core understanding: Every critical task needs a clearly named person or team responsible. Without this, work is assumed, duplicated, or not done at all.

What it does: Ensures work happens without delay and everyone knows who is doing what.

Problem in incident: Tasks are suggested but not assigned. People assume "someone else is doing it." Gaps or duplication in work.

Effect (what you see): "I thought that was already happening." Silence after actions are suggested. Same task done twice or not at all.

What it means (IC interpretation): This is a responsibility gap. The system is slow because no one owns execution.

Analogy: Traffic lights exist but no one is assigned to operate them. Cars hesitate, collide, or stop moving entirely.

Incident signals: "Who is doing that?" / "Is that being worked on?" / Long pauses after instructions.

IC questions: "Who owns the app right now?" / "Who is handling DB investigation?" / "Who is managing infra/network?"

Then assign clearly: "App team → initiate rollback now. DBA → investigate queries. Network → prepare to drain nodes."

[Diagram] No owner assigned: app team, DBA, network, infra all hesitate ("?"). Owner assigned: IC directs App → rollback, DBA → queries, Network → traffic.

Timeline Tracking Sequence before the crash

Core understanding: Timeline tracking means keeping a clear sequence of events during the incident. This helps connect cause and effect quickly.

What it does: Identifies what changed before the failure. Prevents confusion during the incident.

Problem in incident: Events get mixed up. Teams argue about what happened first. Root cause becomes harder to identify.

Effect (what you see): "Wait, did that happen before or after the deploy?" Repeated questions. Confusion about sequence.

Technical effect: Slower diagnosis. Missed correlations (e.g., deploy → failure).

What it means (IC interpretation): This is a visibility problem over time. You can't solve what you can't sequence.

Analogy: Trying to understand a crash without knowing which car entered the junction first or when the collision happened.

Incident signals: Confusion about timing / "When did that happen?" repeated / Misaligned understanding across teams.

IC questions: "When did alerts start?" / "When was the last deploy?" / "When did user impact begin?"

Then state: "09:05 deploy → 09:12 alerts → likely related."

[Diagram] Incident timeline: 09:05 deploy → 09:12 alerts fire → 09:15 users report impact → 09:18 IC engaged · the 7-minute gap suggests deploy and alerts are related.

Parallel Work (Avoid Serial Investigation) Multi-lane road

Core understanding: Parallel work means multiple teams investigate different areas at the same time. Serial work (one after another) slows everything down.

What it does: Speeds up diagnosis and mitigation simultaneously.

Problem in incident: Teams wait for each other. Only one path investigated at a time. Bottlenecks form.

Effect (what you see): "Let's wait for DB before doing anything." Idle teams. Slow progress.

What it means (IC interpretation): This is a throughput problem. Not enough work happening simultaneously.

Analogy: Only opening one lane when multiple lanes are available — traffic builds up unnecessarily.

Incident signals: Teams waiting / Sequential updates / Slow momentum.

IC questions: "What can each team investigate right now?" / "Are we blocked or just waiting?" / "Can we run these in parallel?"

Then assign: App → deploy/rollback. DBA → queries. Network → traffic. All simultaneously.

[Diagram] Serial (slow): App works while DBA and Network wait · total time A + B + C, outage extended unnecessarily. Parallel (fast): App → rollback, DBA → queries, Network → traffic, all done together.

Decisive Action (Mitigation First) Clear the road before the inquest

Core understanding: Incident command requires making fast, reasonable decisions to reduce impact — even without full information.

What it does: Stops user impact quickly. Buys time for deeper investigation.

Problem in incident: Over-analysis. Fear of making the wrong decision. Delayed action.

Effect (what you see): Endless discussion. No clear plan. Metrics not improving.

What it means (IC interpretation): This is a decision paralysis problem. The system isn't recovering because no action is taken.

Analogy: Seeing a blocked road but debating the causes instead of clearing it first.

Incident signals: "We're still investigating…" with no action taken / No improvement in metrics / Repeated theories.

IC questions: "What is the fastest way to reduce impact?" / "Can we roll back?" / "What is the safest immediate mitigation?"

Then decide: "We are rolling back — execute now."

[Diagram] Paralysis: road stays blocked while the team debates "why did this happen?" and impact grows. Decisive action: "Roll back — execute now" clears the road first, then the team investigates why.

Structured Communication (Who / What / Priority) Clear junction signs

Core understanding: Communication must be clear, direct, and structured so actions happen immediately.

What it does: Removes ambiguity. Speeds up execution.

Problem in incident: Vague instructions. Long explanations. Misunderstandings.

Effect (what you see): "Sorry, what was I doing?" Delayed responses. Confusion.

What it means (IC interpretation): This is a clarity problem. Work slows because instructions are unclear.

Analogy: Giving unclear directions at a busy junction — cars hesitate or go the wrong way.

Incident signals: Repeated clarifications / Tasks misunderstood / Slow execution after instruction.

Structure: Every instruction = Who is doing this + What exactly + Priority (now / next).

Example: "App team → roll back all nodes → priority now." (not "let's look into rollback")

[Diagram] Vague: "Let's look into rollback…" produces no action. Structured (WHO / WHAT / PRIORITY): "App team → rollback all nodes → priority NOW" starts action immediately.

Symptom → Diagnosis

Read the incident symptom and identify the most likely cause.
N questions · shuffled each round · score tracked.