Query Optimizer GPS choosing the route

What it does: Chooses how queries are executed. Decides indexes, join order, and access paths.

Problem in incident: Picks inefficient execution plan. Ignores indexes or misjudges data.

Effect (what you see): Gradual slowdown, queries pile up, CPU increases.

Technical effect:

  • Full table scans instead of index lookups
  • More rows processed than needed
  • Increased CPU / disk I/O
  • Connections held longer

What it means: System doing too much work per query. Inefficiency spreading across system. Can lead to saturation or connection exhaustion.

Analogy: GPS sends cars through small roads instead of highways.

Incident signals:

  • Slow query logs increasing
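  • db file scattered read (Oracle multi-block reads, typical of full scans; db file sequential read indicates single-block, index-style reads)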
  • Rising latency

Key insight: The optimizer makes its decision automatically based on statistics. If stats are stale or data distribution has shifted, it can pick the wrong plan even when a good index exists — causing a sudden slowdown with no code change.

IC Questions: "Any slow queries?" / "What changed?" / "Are indexes being used?" / "Are statistics up to date?"

Figure: good plan with index (index lookup, 3 rows, ~5ms) vs bad plan with full table scan (all rows scanned, millions of rows read, ~2000ms).
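To make the good-plan / bad-plan contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for the production database, so the plan wording differs from MySQL or Oracle; the orders table, its columns, and the index name are assumptions made up for the demo.

```python
import sqlite3


def plan_for(conn: sqlite3.Connection, sql: str) -> str:
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each row is the human-readable access path.
    return " | ".join(row[3] for row in rows)


def demo_plans() -> tuple[str, str]:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 1000, float(i)) for i in range(10_000)],
    )
    query = "SELECT * FROM orders WHERE customer_id = 42"
    before = plan_for(conn, query)   # no index yet: every row is scanned
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    conn.execute("ANALYZE")          # refresh the optimizer's statistics
    after = plan_for(conn, query)    # index lookup on customer_id
    conn.close()
    return before, after


before, after = demo_plans()
print("before:", before)
print("after: ", after)
```

The first plan reports a SCAN of the table and the second a SEARCH using the index. On a production engine the equivalent check is that engine's own EXPLAIN output.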

When Does an Index Lose Its Effectiveness? Library catalog

Core understanding: An index isn't "broken" — it becomes less useful when the optimizer decides it's no longer efficient. This happens due to fragmentation, poor selectivity, or outdated statistics.

What it does: Helps the database find data quickly.

Problem in incident: Index exists but queries are slow.

Effect (what you see): Slow queries, full table scans.

Technical effect:

  • Fragmentation from frequent inserts/updates/deletes
  • Statistics out of date
  • Optimizer ignores index

What it means: Navigation system exists but is unreliable.

Analogy: Library catalog that's messy or outdated.

Incident signals: Full table scan, high read I/O.

IC Questions: "Has data changed recently?" / "Are indexes still used?"

Slow Queries & Indexing Road choice and quality

What it does: Determines how fast data is accessed.

Problem in incident: Missing indexes or inefficient queries.

Effect (what you see): Gradual slowdown, high CPU.

Technical effect:

  • Full scans
  • High CPU / I/O
  • Increased query duration

What it means: System inefficiency under load. Can cascade into bigger issues.

Analogy: Cars using small roads instead of highways.

Incident signals:

  • Slow query logs
  • High CPU
  • db file scattered read (Oracle multi-block reads, typical of full scans)

IC Questions: "Any slow queries?" / "Indexes being used?" / "Recent changes?"

Figure: query time with index ~5ms vs no index ~2000ms (400× slower without an index).

Buffer Pool / Cache Hit Ratio City warehouse vs distant storage depot

What it does: The buffer pool (or buffer cache) holds frequently accessed data pages in memory so the DB can serve reads from RAM instead of disk.

Problem in incident: If the buffer pool is too small, or its pages are evicted under memory pressure, the DB must read from disk more often. The result is high read I/O and latency even when queries are efficient.

Effect (what you see): High disk read I/O, slow reads, elevated "physical reads" metric. Looks similar to a missing index but queries may have good plans.

Technical effect:

  • Low cache hit ratio → frequent physical reads from disk
  • Memory pressure → pages evicted before they can be reused
  • Working set larger than available buffer pool

Key distinction from disk I/O bottleneck: Disk I/O bottleneck = disk can't keep up with demand. Buffer pool problem = too many requests hitting disk that could be served from memory.

Analogy: Warehouse runs out of stock — every request requires a trip to a distant depot instead of grabbing from the shelf.

Incident signals: Low cache hit ratio alert, high physical reads, memory utilisation high on DB host.

IC Questions: "What is the cache hit ratio?" / "Has memory pressure increased?" / "Has the working data set grown recently?"
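The "working set larger than available buffer pool" case can be illustrated with a small simulation. This is a sketch under a simplifying assumption: it models the pool as a plain LRU cache, whereas real engines use tuned variants, and the page counts are invented.

```python
from collections import OrderedDict


def simulate_hit_ratio(pool_pages: int, accesses: list[int]) -> float:
    """Replay page accesses against a plain LRU pool and return the hit ratio."""
    pool = OrderedDict()
    hits = 0
    for page in accesses:
        if page in pool:
            hits += 1
            pool.move_to_end(page)        # mark as most recently used
        else:
            if len(pool) >= pool_pages:
                pool.popitem(last=False)  # evict the least recently used page
            pool[page] = None             # a "physical read" loads the page
    return hits / len(accesses)


working_set = list(range(100))            # 100 distinct pages, read in a loop
accesses = working_set * 10

print(simulate_hit_ratio(100, accesses))  # pool fits the working set
print(simulate_hit_ratio(90, accesses))   # pool is 10 pages too small
```

With a pool that fits the working set, only the first pass misses. With a pool just 10% too small and a looping access pattern, every page is evicted shortly before it is needed again, so the hit ratio collapses rather than dropping by 10%. This is why a modest growth in the working set can look like a sudden incident.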

Row Lock One lane blocked

What it does: Locks specific rows during updates.

Problem in incident: Long transactions hold locks.

Effect (what you see): Queries waiting, localised slowdown.

Technical effect:

  • Other queries blocked on same rows
  • Increased wait times
  • Queue formation

What it means: One piece of work is blocking others. Can escalate if widespread.

Analogy: One lane closed due to accident.

Incident signals:

  • enq: TX - row lock contention
  • TX enqueue (mode 6)
  • Queries waiting

Key insight: A write always blocks another write to the same row. Whether a write blocks a read depends on the database and isolation level: with MVCC, readers see an older row version and are not blocked; with lock-based reads they wait. This distinction matters when diagnosing who is actually stuck.

IC Questions: "What's blocking?" / "Any long transactions?" / "Can we clear it?" / "Is this write-write or write-read contention?"

Figure: T1 (active) holds the lock on Row X; T2, T3, and T4 are waiting, blocked behind it.

Deadlocks Two cars blocking each other at a junction

What it does: Two transactions each hold a lock the other needs, causing a circular wait that neither can resolve.

Problem in incident: Transactions freeze waiting on each other — the database must detect and kill one to break the cycle.

Effect (what you see): One transaction is rolled back with a deadlock error. Throughput drops if deadlocks are frequent.

Technical effect:

  • T1 holds lock on Row A, wants Row B
  • T2 holds lock on Row B, wants Row A
  • DB deadlock detector kills one (the "victim") and rolls it back

Key distinction from row lock: Row lock contention is one-directional (one waits). A deadlock is circular (both wait on each other). The DB resolves it automatically but the rolled-back transaction may retry and repeat.

Analogy: Two cars at a narrow junction, each waiting for the other to reverse — neither can move until one backs down.

Incident signals: Deadlock errors in logs, rolled-back transactions, retry storms.

IC Questions: "Are deadlock errors in the logs?" / "Is the same pair of transactions involved?" / "Are retries making it worse?"
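The circular wait can be expressed as a cycle in a "waits-for" map. The sketch below is not how any particular database implements its deadlock detector; it assumes each waiting transaction waits on exactly one holder, and the transaction names are placeholders.

```python
def find_deadlock(waits_for: dict[str, str]) -> list[str]:
    """Return the transactions forming a wait cycle, or [] if there is none.

    waits_for maps each waiting transaction to the one holding the lock it needs.
    """
    for start in waits_for:
        path: list[str] = []
        seen: set[str] = set()
        current = start
        while current in waits_for and current not in seen:
            seen.add(current)
            path.append(current)
            current = waits_for[current]
        if current in seen:                      # walked back into the path: a cycle
            return path[path.index(current):]
    return []


# T1 holds Row A and wants Row B; T2 holds Row B and wants Row A.
print(find_deadlock({"T1": "T2", "T2": "T1"}))
# Plain contention: T2 and T3 queue behind T1, which is not waiting on anyone.
print(find_deadlock({"T2": "T1", "T3": "T2"}))
```

The first call reports both transactions as the cycle; the second returns an empty list, which is the one-directional row lock case.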

Metadata Lock Entire road closed

What it does: Locks entire table structure.

Problem in incident: Schema change blocks all access.

Effect (what you see): Sudden freeze — queries pile up instantly.

Technical effect:

  • All queries blocked waiting on metadata
  • No progress despite low CPU

What it means: System is blocked, not overloaded. One operation is halting everything.

Analogy: Entire road shut down.

Incident signals:

  • Queries stuck "waiting"
  • Low CPU but high latency

IC Questions: "Any schema changes?" / "What's blocking?" / "Can we stop it?"

Figure: queries A, B, C, and D are all blocked from the table by a metadata lock held by an in-progress ALTER TABLE schema change.

Locks & Contention Blocked roads and junctions

What it does: Controls access to shared data.

Problem in incident: Too many locks or long transactions.

Effect (what you see): Queries waiting — system appears stuck.

Technical effect:

  • Blocking chains
  • Increased wait times
  • Throughput drops

What it means: Work is queued behind blockers. System not overloaded — just blocked.

Analogy: Traffic jam behind blocked road.

Incident signals:

  • Lock wait alerts
  • Waiting queries

IC Questions: "What's blocking?" / "How long?" / "Can we remove it?"
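Answering "What's blocking?" usually means walking a blocking chain back to its head. A minimal sketch, assuming you have already extracted waiter-to-blocker pairs from your database's lock views (the session names here are hypothetical):

```python
def head_blockers(waits_for: dict[str, str]) -> dict[str, list[str]]:
    """Group waiting sessions under the session at the head of their chain."""
    chains: dict[str, list[str]] = {}
    for waiter in waits_for:
        current = waiter
        hops = 0
        # Follow waiter -> blocker links until reaching a session that is not waiting.
        while current in waits_for and hops <= len(waits_for):
            current = waits_for[current]
            hops += 1
        if current in waits_for:
            continue  # never reached a running session: a cycle, i.e. a deadlock
        chains.setdefault(current, []).append(waiter)
    return chains


# T2 waits on T1, T3 on T2, T4 on T3: one chain with T1 at its head.
print(head_blockers({"T2": "T1", "T3": "T2", "T4": "T3"}))
```

The session that appears as a key is the one to investigate first, since everything listed under it is queued behind it.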

Figure: T1 (active, holds lock) blocks T2 (waiting, queued), which blocks T3, which blocks T4+; the chain grows. The system is not overloaded, just blocked. Kill T1 to unblock the chain.

Long-Running Transactions A lorry blocking a side road for hours

What it does: A transaction that stays open much longer than normal, holding locks and resources throughout.

Problem in incident: Long transactions are a root cause that triggers several other issues — they hold row locks (blocking others), prevent log truncation (causing log growth), and inflate undo/rollback segments.

Effect (what you see): Depends on what the transaction is doing — could appear as row lock contention, log growth, or disk pressure rather than the transaction itself.

Technical effect:

  • Holds row locks for extended period → blocks other transactions
  • Prevents transaction log from being truncated → log grows
  • Holds undo/rollback space → undo segment pressure

Key insight: Often invisible as a direct alert — you see the symptoms (lock waits, log growth) but must look for long-running transactions as the underlying cause.

Analogy: A lorry parked across a side road for hours — blocking everything behind it and preventing road crews from clearing the area.

Incident signals: Long transaction time in monitoring, lock waits, log growth, undo pressure.

IC Questions: "Any transactions open for an unusual length of time?" / "Is this causing lock waits or log growth?" / "Can it be safely rolled back?"
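Because long transactions rarely alert directly, a simple age check over open transactions can surface them. This sketch assumes you can obtain each open transaction's start time from your database's session views; the ids, timestamps, and the five-minute threshold are illustrative only.

```python
from datetime import datetime, timedelta


def long_running(
    open_since: dict[str, datetime], now: datetime, threshold: timedelta
) -> list[tuple[str, timedelta]]:
    """Return (transaction id, age) for transactions older than threshold, oldest first."""
    aged = [(txn, now - started) for txn, started in open_since.items()]
    over = [item for item in aged if item[1] > threshold]
    return sorted(over, key=lambda item: item[1], reverse=True)


now = datetime(2024, 5, 2, 12, 0, 0)
open_since = {
    "txn-a": datetime(2024, 5, 2, 11, 59, 30),  # 30 seconds old
    "txn-b": datetime(2024, 5, 2, 9, 0, 0),     # 3 hours old
    "txn-c": datetime(2024, 5, 2, 11, 40, 0),   # 20 minutes old
}
for txn, age in long_running(open_since, now, timedelta(minutes=5)):
    print(txn, age)
```

Sorting oldest first puts the most likely root cause at the top of the list.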

Redo Log / Transaction Log Traffic control recording every car movement

What it does: Records all changes for durability and recovery.

Problem in incident: Heavy write activity overwhelms logging. Logs become a bottleneck.

Effect (what you see): System slows under write load. Even simple operations delayed.

Technical effect:

  • Increased disk writes
  • Log flush contention
  • Transactions slowed waiting for log writes

What it means: Write throughput is limiting performance. System can't commit changes fast enough. Risk of cascading slowdown.

Analogy: Cars must stop at a checkpoint before continuing.

Incident signals:

  • High write latency
  • Disk pressure
  • Slow commits

IC Questions: "Is write volume high?" / "Any long transactions?" / "Is disk under pressure?"

Figure: app writes (heavy) → redo log (flush contention, the bottleneck) → slow disk flush → commit acknowledged.

Bottleneck in Transaction Log Single toll booth

Core understanding: All write operations must be recorded in the transaction log first. If the log can't keep up (slow disk or high write volume), everything slows down.

What it does: Ensures durability of writes.

Problem: Log becomes a bottleneck.

Effect (what you see): Slow transactions, connection buildup.

Technical effect:

  • Log write delays
  • Commit latency rises

What it means: Central write system is congested.

Analogy: Single toll booth causing traffic backup.

Incident signals: Log write waits, rising active sessions.

IC Questions: "Is disk slow?" / "Too many writes?"

Are Items Removed from Transaction Log? Black box recorder

Core understanding: Completed transactions are not immediately removed. The log keeps them until it is safe to reuse the space — after checkpoints and/or log backups, depending on system.

What it does: Stores transaction history for recovery.

Problem: Log keeps growing.

Effect (what you see): Disk pressure.

Technical effect:

  • Entries retained until safe for recovery
  • Space reused later (not deleted immediately)

What it means: Log is controlled reuse, not deletion.

Analogy: Black box recorder that overwrites old data later.

Incident signals: Log growth alerts.

IC Questions: "Are log backups running?" / "Any long transactions?"

Checkpoint vs Log Backup Unloading truck vs clearing warehouse

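Core understanding: A checkpoint writes dirty data pages to disk so that crash recovery has less to replay. A log backup lets the transaction log reuse space. They solve different problems, so using the wrong one won't fix the issue. This split applies where log truncation depends on log backups, for example SQL Server's full recovery model.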

What it does:

  • Checkpoint → flushes data pages to disk
  • Log backup → frees log space for reuse

Problem: Log growing unexpectedly.

Effect (what you see): Disk issues despite checkpoints running.

Technical effect:

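  • When truncation depends on log backups (e.g. SQL Server's full recovery model), a checkpoint alone does not truncate the log
  • A log backup is required before that log space can be reused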

What it means: Wrong tool applied to the problem.

Analogy: Unloading a truck (checkpoint) vs clearing the whole warehouse (log backup).

Incident signals: Log growth despite checkpoints running.

IC Questions: "Are log backups configured?" / "What recovery model is set?"

Database Connections / Connection Pooling Cars entering the city

What it does: Limits number of active DB connections.

Problem in incident: Too many connections or leaks.

Effect (what you see): Requests waiting or timing out.

Technical effect:

  • Connection pool exhausted
  • Requests queued before DB
  • Threads blocked waiting

What it means: System can't accept more work. Often caused by slow queries or leaks.

Analogy: Cars queued at city entrance.

Incident signals:

  • "Too many connections"
  • Timeouts
  • Low DB utilisation sometimes

IC Questions: "Are we at max connections?" / "Are connections released?" / "What's holding them?"
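Pool exhaustion and leaks are easier to reason about with a toy pool. The sketch below is not a real driver's pool; the class names, the string stand-ins for connections, and the timeout values are all assumptions for illustration.

```python
import queue


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the time limit."""


class ConnectionPool:
    def __init__(self, size: int) -> None:
        self._free: queue.Queue = queue.Queue()
        for slot in range(size):
            self._free.put(f"conn-{slot}")   # stand-ins for real DB connections

    def acquire(self, timeout: float) -> str:
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            # The caller never reached the database: a connection timeout.
            raise PoolTimeout("could not acquire connection") from None

    def release(self, conn: str) -> None:
        self._free.put(conn)


pool = ConnectionPool(size=2)
first = pool.acquire(timeout=0.05)
second = pool.acquire(timeout=0.05)   # pool is now exhausted
try:
    pool.acquire(timeout=0.05)
except PoolTimeout as exc:
    print("third request:", exc)
pool.release(first)                   # skipping this call is what a leak looks like
print("after release:", pool.acquire(timeout=0.05))
```

If an error path skips release, the slot never returns and the pool drains even though the database itself is idle, which matches the "low DB utilisation sometimes" signal above.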

Figure: connection pool (max: 10 connections) showing 8 in use and 2 free; once every slot is taken, further requests (Req 11, Req 12, Req 13) queue for a slot and time out.

Connection Pathway + Redo Log Club capacity + slow bar

Core understanding: A client must connect before running queries. Write operations are logged first (redo/transaction log). If the system is slow, connections stay open longer and can hit limits.

What it does: Handles access and write durability.

Problem: Too many connections / slow commits.

Effect (what you see): Connection errors, requests rejected.

Technical effect:

  • Flow: Client → Connect → Limit check → Query → Execute → Log
  • Slow log → slow commits → connections pile up → limit hit

What it means: System saturated at entry or commit stage.

Analogy: Club at capacity with slow bar service — people can't get in or get stuck inside.

Incident signals: "Too many connections" error, rising active sessions.

IC Questions: "Are connections being released?" / "Where is the bottleneck?"

Query Timeout vs Connection Timeout Order taking too long vs never getting a table

What it does: Two different timeout types that produce similar-looking errors but have different causes and fixes.

Problem in incident: Teams often conflate them — treating a connection timeout like a slow query problem, or vice versa. Diagnosing the wrong one wastes time.

Technical effect:

  • Query timeout: Connection was made, query started, but it ran too long — DB or app killed it. Cause: slow query, missing index, lock wait.
  • Connection timeout: App could not get a connection within the time limit — never reached a query. Cause: pool exhausted, DB overloaded, network issue.

Key distinction:

  • Query timeout → you got in, but service was too slow
  • Connection timeout → you never got a table

Analogy: Query timeout = seated at a restaurant but your order never arrives. Connection timeout = no tables available, turned away at the door.

Incident signals: Error message wording — "query exceeded timeout" vs "connection timed out" / "could not acquire connection".

IC Questions: "What does the exact error say?" / "Did the connection succeed?" / "Is the pool full or are queries just slow?"

Temp Index Rebuild Road maintenance during rush hour

What it does: Rebuilds or reorganises indexes.

Problem in incident: Happens during peak load. Competes for resources.

Effect (what you see): Sudden slowdown, increased I/O and CPU.

Technical effect:

  • Heavy disk usage
  • Temporary space consumption
  • Increased contention with live queries

What it means: Background work is stealing capacity from production traffic. Can trigger wider performance issues.

Analogy: Roadworks reducing available lanes.

Incident signals:

  • Maintenance job running
  • "tablespace is full" (possible)
  • Disk spikes

Key insight: Rebuilding creates a new index alongside the old one before swapping — temporarily doubling the storage needed. Disk full alerts during maintenance are often this, not a general storage leak.

IC Questions: "Any maintenance running?" / "Can we pause it?" / "Is disk space OK?" / "Was disk headroom checked before the job started?"
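The "was disk headroom checked" question is simple arithmetic. A rough sketch, assuming the rebuild writes one full extra copy of the index; the 20% safety margin and the sizes are invented for illustration, and real engines may need additional sort or temp space.

```python
GIB = 2**30


def rebuild_headroom(
    index_bytes: int, free_bytes: int, margin_pct: int = 20
) -> tuple[bool, int]:
    """Return (fits, shortfall_bytes) for a rebuild that writes a second full copy."""
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    required = index_bytes * (100 + margin_pct) // 100
    shortfall = max(0, required - free_bytes)
    return shortfall == 0, shortfall


fits, shortfall = rebuild_headroom(index_bytes=50 * GIB, free_bytes=40 * GIB)
print(fits, shortfall // GIB, "GiB short")
```

Running a check like this before the maintenance window is cheaper than discovering a full tablespace halfway through the rebuild.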

Figure: three lanes of live query traffic, with an index rebuild occupying capacity (consuming disk I/O and CPU, competing with live queries).

Resource Saturation (CPU / Disk / Memory) City at full capacity

What it does: Provides compute and storage resources.

Problem in incident: System exceeds capacity.

Effect (what you see): Everything slows — no single clear cause.

Technical effect:

  • CPU maxed → slow processing
  • Disk maxed → slow reads/writes
  • Memory pressure → less caching

What it means: System overloaded. Needs load reduction or scaling.

Analogy: Entire city overwhelmed with traffic.

Incident signals:

  • High CPU / disk
  • System-wide latency

IC Questions: "Which resource is maxed?" / "Load spike or inefficiency?" / "Can we reduce load?"

Figure: CPU 95%, disk I/O 88%, memory 82%, all above the 80% danger threshold.

Replication Lag Branch office receiving yesterday's updates

What it does: Changes written to the primary database are replicated to read replicas, usually with a small delay.

Problem in incident: Lag grows — reads from replicas return stale data. Users see outdated results or inconsistencies.

Effect (what you see): Data appears to "go backwards" or users see different data depending on which replica they hit. May look like a bug rather than an infrastructure issue.

Technical effect:

  • Primary processes writes faster than replica can apply them
  • Replica falls behind — lag measured in seconds or minutes
  • Reads routed to replica return old data

Common causes: Heavy write load on primary, slow replica disk, long-running queries on replica blocking apply, network issues.

Analogy: Head office sends updates daily — branch office is working from yesterday's data.

Incident signals: Replication lag metric rising, user reports of stale data, replica behind primary by N seconds.

IC Questions: "What is current replica lag?" / "Are reads being routed to replicas?" / "Is write load on primary spiking?" / "Can we route reads to primary temporarily?"
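The last IC question, routing reads to the primary temporarily, can be sketched as a lag-aware choice. The replica names and the two-second tolerance are assumptions; how lag is measured depends on the database.

```python
def choose_read_target(
    replica_lag_seconds: dict[str, float], max_lag_seconds: float
) -> str:
    """Pick the least-lagged replica within tolerance, else fall back to the primary."""
    fresh = {
        name: lag
        for name, lag in replica_lag_seconds.items()
        if lag <= max_lag_seconds
    }
    if not fresh:
        return "primary"
    return min(fresh, key=fresh.get)


print(choose_read_target({"replica-1": 0.4, "replica-2": 12.0}, max_lag_seconds=2.0))
print(choose_read_target({"replica-1": 45.0, "replica-2": 120.0}, max_lag_seconds=2.0))
```

Falling back to the primary trades stale reads for extra primary load, so it is worth confirming the primary is not already the bottleneck before switching.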

Database Wallet Secure key locker

What it does: A database wallet is a secure store for credentials, certificates, and encryption keys. Applications and databases retrieve passwords and keys from the wallet instead of having them exposed in plain-text config files or code.

Problem in incident: Wallet missing, corrupted, or inaccessible; wrong file permissions; expired certificates; config pointing to the wrong wallet path.

Symptoms:

  • Apps suddenly can't connect to the database
  • Authentication failures spike — often immediately after a deploy
  • Services fail on startup or restart

Technical effect: The system can't retrieve credentials or encryption material, so DB connections fail, TLS/SSL handshakes may fail, and authentication breaks even if the underlying credentials are correct.

What it means (IC interpretation): Likely a misconfiguration or dependency failure — not load-related. Often triggered by deployments, certificate rotation, or permission changes. The credentials themselves may be fine; it's access to them that has broken.

Analogy: A secure key locker for delivery drivers. Drivers (apps) don't carry keys themselves — they go to the locker to pick them up before each delivery. If the locker is locked, broken, or empty, no deliveries happen regardless of whether the drivers are available.

Incident signals: "Authentication failed" · "Cannot load wallet" · "Permission denied" · "SSL handshake failed" · Spike in connection errors immediately after a deploy

IC questions: Did anything change recently (deploy, config, cert rotation)? Is the wallet file path accessible from the service? Are file permissions correct? Has anything expired (certs/keys)? Is this affecting all services or just one?
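The path and permission questions can be checked mechanically from the affected host. A minimal sketch, assuming the wallet is a directory of files; the file names are placeholders, and the check does not cover certificate expiry or wallet corruption.

```python
import os
import tempfile


def wallet_problems(wallet_dir: str, required_files: list[str]) -> list[str]:
    """List reasons the current process could not load files from wallet_dir."""
    if not os.path.isdir(wallet_dir):
        return [f"wallet directory not found: {wallet_dir}"]
    problems = []
    for name in required_files:
        path = os.path.join(wallet_dir, name)
        if not os.path.isfile(path):
            problems.append(f"missing file: {name}")
        elif not os.access(path, os.R_OK):
            problems.append(f"not readable by this user: {name}")
    return problems


# Usage with a throwaway directory; the file names are placeholders.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "wallet.bin"), "w").close()
    print(wallet_problems(demo_dir, ["wallet.bin", "cert.pem"]))
```

Run it as the same OS user as the failing service, since a wallet that is readable by an admin account can still be unreadable to the application.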

Figure: database wallet as a secure key locker. Normal flow: app service → wallet (keys / certs) → database connected; keys retrieved, DB connected (driver picks up key, makes delivery). Wallet inaccessible: auth fails, SSL error, no connection (locker locked, no key, no delivery).

Incident Chain How it all connects

1 Bad Query Plan: inefficient routing, full scans instead of index lookups 🗺️
2 Queries Slow Down: stay longer in system, connections held, queue grows 🚗
3 Redo Log Pressure Increases: write throughput constrained, commits begin to slow 📋
4 Index Rebuild Kicks In: background maintenance steals capacity, disk I/O spikes 🛠️
5 Locks Appear: row and metadata locks block traffic, wait queues form 🔒
6 System Gridlock: nothing moves, full saturation or connection exhaustion 🚨
Legend: performance degradation · capacity reduction · critical / blocking

Undo & Read Consistency (RAC) Old maps for drivers

Core understanding: Oracle lets readers see a consistent past version of data using undo, even while writes are happening. In RAC, this consistency must work across multiple nodes, which adds coordination overhead.

What it does:

  • Stores before-images of data (undo)
  • Lets queries read a stable snapshot
  • Prevents read/write blocking

Problem in incident: Undo too small or overwritten; long queries need old data that no longer exists; RAC adds delay due to cross-node access.

Effect (what you see): "Snapshot too old" query failures; sudden query slowdowns; intermittent errors on long-running reports.

Technical effect: Required undo data no longer available, or slow retrieval across RAC nodes.

What it means: Capacity issue (undo too small) or workload mismatch (long queries vs high churn). In RAC, could also be inter-node latency.

Analogy: Cars (queries) need a map of the road from 5 minutes ago. Old maps (undo) keep getting thrown away. If the map is gone, the driver gets lost — query fails.

Incident signals: "snapshot too old" errors; long-running queries failing; spikes in undo usage; RAC: interconnect latency warnings.

IC Questions: Are queries long-running? Has data change rate increased? Any recent batch jobs? Is this happening across all RAC nodes or one?
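The "undo too small" versus "query too long" distinction can be framed as two quick checks. This is a rough sizing sketch, not an official formula: it estimates undo space as retention time × undo blocks generated per second × block size, and every number below is hypothetical.

```python
def undo_bytes_needed(
    retention_seconds: int, undo_blocks_per_second: float, block_size_bytes: int
) -> int:
    """Rough undo space needed to keep retention_seconds of before-images."""
    return int(retention_seconds * undo_blocks_per_second * block_size_bytes)


def query_outlives_retention(longest_query_seconds: int, retention_seconds: int) -> bool:
    """True when a query may need undo older than the retention window."""
    return longest_query_seconds > retention_seconds


needed = undo_bytes_needed(
    retention_seconds=900, undo_blocks_per_second=100, block_size_bytes=8192
)
print(needed // 2**20, "MiB of undo for a 15-minute window")
print(query_outlives_retention(longest_query_seconds=3600, retention_seconds=900))
```

If the change rate doubles during a batch job, the space needed for the same window doubles too, which is why "has data change rate increased?" is one of the first questions.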

Memory Architecture (SGA/PGA, RAC) Kitchens with shared fridges

Core understanding: Oracle uses memory to cache data and speed up queries. In RAC, each node has its own memory but must share data via interconnect — the "pinging" problem.

What it does:

  • SGA = shared memory (data cache, SQL cache)
  • PGA = per-session memory
  • Reduces disk I/O by caching hot data

Problem in incident: Memory pressure (too many queries); cache inefficiency; RAC blocks constantly moving between nodes.

Effect (what you see): High latency; high CPU; slow queries across cluster; sudden performance degradation.

Technical effect: Cache misses lead to more disk reads; RAC block transfer overhead between nodes ("gc" waits).

What it means: Resource contention (memory/CPU) or bad workload distribution across RAC. Often: too many queries, poor query patterns, or hot blocks bouncing between nodes.

Analogy: Each RAC node is a separate kitchen with its own fridge. If a chef needs something from another kitchen, they must run across the street. Too much running = everything slows down.

Incident signals: High CPU; high memory usage; RAC interconnect traffic spikes; "buffer busy waits" / "gc" waits.

IC Questions: Is load evenly distributed across nodes? Any spike in query volume? Are specific queries dominating? Is one node worse than others?

Undo + Memory Interaction (RAC) Bridge congestion + roadworks

Core understanding: Undo and memory work together to serve consistent reads quickly. In RAC, this may involve remote memory access between nodes — heavy writes and long reads colliding causes compounding pressure.

What it does:

  • Memory serves cached data quickly
  • Undo reconstructs older versions for consistency
  • RAC shares both mechanisms across nodes

Problem in incident: Heavy writes + long reads + RAC traffic causes simultaneous contention and latency.

Effect (what you see): Cluster-wide slowdown; queries inconsistent in performance; timeouts; mixed symptoms (CPU + latency + errors).

Technical effect: Undo reconstruction + memory contention happening at the same time; inter-node block transfers compound both.

What it means: System under stress — multiple subsystems interacting badly. Often triggered by batch jobs or reporting running alongside heavy writes.

Analogy: Cars need old maps (undo). Roads are busy (writes). Cities are connected by bridges (RAC). Too many cars crossing bridges + changing roads = gridlock.

Incident signals: Mixed symptoms (CPU + latency + errors); RAC interconnect spikes; query variability; undo errors alongside memory pressure.

IC Questions: What changed? (batch job, release) Is this cluster-wide? Are reads and writes colliding at the same time?

Seeded Reports City-wide traffic map

Core understanding: A seeded report is a pre-built, default report that ships with a system. Designed for common use cases — not tailored to your specific environment or incident needs.

What it does: Provides standard visibility into data (performance, usage, sales) without requiring a custom build.

Problem in incident: Seeded reports often lack the detail, speed, or focus needed during an active incident.

Effect (what you see):

  • Missing key data you need right now
  • Reports too slow to load
  • Data feels generic — "nothing looks wrong"
  • Teams say "the report looks fine" but users are impacted

Technical effect: Queries are broad and inefficient; not optimised for real-time debugging; may miss critical filters or dimensions (specific customer, query, endpoint).

What it means (IC interpretation): Observability gap. You're relying on generic tooling instead of targeted insight — this slows decision-making and prolongs the incident.

Analogy: A city-wide traffic map. It shows "traffic looks normal overall" — but your incident is a single blocked lane on one street. You need a zoomed-in camera, not a general map.

Incident signals:

  • "Dashboard shows normal but users report slowness"
  • "Report takes too long to generate"
  • "No visibility into specific query / user / service"
  • Conflicting statements between teams

IC Questions: "Do we have a more granular or real-time view?" / "Can we filter to affected users or endpoints?" / "Is this report cached or delayed?" / "Who can run a targeted query or log search instead?"

Real-world example — Top Customers Report: A classic seeded report you'll find pre-installed in many systems:

SELECT
    customer_id,
    SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;

This query shows your top 10 customers by spending — a common business report that ships by default. It's useful day-to-day, but during an incident it tells you almost nothing: it doesn't filter by time window, affected region, or error type. You'd need a targeted query scoped to the problem instead.
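As a sketch of what "scoped to the problem" could look like, the example below runs the seeded query next to a targeted one on an in-memory SQLite database. The created_at, region, and status columns, the sample rows, and the time window are assumptions; the original report's schema is not shown.

```python
import sqlite3

SEEDED = """
SELECT customer_id, SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10
"""

# Scoped to an incident window, one region, and failed orders only.
TARGETED = """
SELECT customer_id, COUNT(*) AS failed_orders
FROM orders
WHERE created_at >= ? AND created_at < ?
  AND region = ?
  AND status = 'failed'
GROUP BY customer_id
ORDER BY failed_orders DESC
LIMIT 10
"""


def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer_id INTEGER, total_amount REAL,"
        " created_at TEXT, region TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
        [
            (1, 900.0, "2024-05-01 09:00:00", "eu", "ok"),
            (2, 50.0, "2024-05-02 10:05:00", "eu", "failed"),
            (2, 40.0, "2024-05-02 10:07:00", "eu", "failed"),
            (3, 70.0, "2024-05-02 10:06:00", "us", "ok"),
        ],
    )
    return conn


conn = build_demo_db()
print(conn.execute(SEEDED).fetchall())
window = ("2024-05-02 10:00:00", "2024-05-02 11:00:00", "eu")
print(conn.execute(TARGETED, window).fetchall())
```

The seeded query ranks customer 1 first and gives no hint of trouble; the targeted query returns only customer 2 with two failed orders inside the window.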

Where seeded reports appear:

  • ERP systems (Oracle, SAP) — pre-built operational reports
  • CRM tools — customer activity and pipeline summaries
  • Internal dashboards — aggregate health views used by on-call
  • BI tools (connected to MySQL / Postgres) — standard metric views

Figure: seeded report vs targeted view. The seeded report (city traffic map) shows "Overall: NORMAL" and hides the blockage; the targeted view (zoomed camera), filtered to the affected customer and the /api/load endpoint, makes the root cause visible.

1 · How a Query Travels Through the Database
01 TCP Connect: client opens socket to DB port 3306
02 Authenticate: credentials checked · wallet / config used
03 Session Created: slot taken from connection pool
04 max_connections?: if full → rejected · "Too many connections"
05 Parse & Optimize: optimizer picks best execution plan
06 Execute: index lookup or full table scan
07 Write → Redo Log: writes logged first, then committed
08 Return & Close: results sent back · connection released
⚡ Where things go wrong at each stage
TCP connect: Network issue, firewall, DB down
Authentication: Wrong creds, wallet inaccessible, cert expired
Session / pool: Pool exhausted → connection timeout
max_connections: Too many open sessions → rejected requests
Optimize: Stale stats → bad plan → full table scan
Execute: Lock wait, missing index, slow query
Redo log: Disk bottleneck → slow commits → sessions pile up
Close / release: Connection leak → pool never freed
🔍 First questions as IC
  • Connection issue? Check pool exhaustion, "Too many connections" error
  • Auth issue? Recent deploy? Wallet path / certs / permissions changed?
  • Slow query? Slow query log on? Indexes being used? EXPLAIN output?
  • Blocked? Long transaction holding locks? Schema change running?
  • Write lag? Disk I/O high? Redo log flush contention?
  • Resource? CPU / Disk / Memory — which one is maxed?
Key principle: Distinguish "blocked" (low CPU, queries waiting) from "overloaded" (high CPU, everything slow). They look similar but have different fixes.
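The blocked-versus-overloaded distinction can be written down as a first-pass triage rule. The cut-offs here (CPU below 30% with most sessions waiting, CPU above 80%) are illustrative assumptions, not values from any monitoring standard, and should be tuned to your own baselines.

```python
def classify_state(cpu_pct: float, waiting_sessions: int, active_sessions: int) -> str:
    """First-pass triage: is the database blocked, overloaded, or neither?"""
    if active_sessions == 0:
        return "idle"
    waiting_share = waiting_sessions / active_sessions
    if cpu_pct < 30 and waiting_share > 0.5:
        return "blocked"      # little work happening, most sessions queued on something
    if cpu_pct > 80:
        return "overloaded"   # the host itself is out of capacity
    return "unclear"          # needs a closer look at waits and plans


print(classify_state(cpu_pct=12, waiting_sessions=40, active_sessions=45))
print(classify_state(cpu_pct=95, waiting_sessions=5, active_sessions=60))
```

A "blocked" result points toward finding and clearing the blocker; an "overloaded" result points toward reducing load or fixing inefficient queries.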
2 · How Memory Works (Buffer Pool & Cache)
Figure: an application READ request goes to the buffer pool in RAM (default 128 MB · target 70–80% of RAM). HIT → served from RAM (~1ms). MISS → physical read from disk (~10–100ms); the page is loaded into the pool, so the next read is a cache hit.
✅ Cache hit (good)

Data already in RAM. Served instantly — no disk involved. Cache hit ratio >99% is healthy; below 95% is a warning sign.

Grabbing from shelf
⚠️ Cache miss (costly)

Data not in RAM — must read from disk. 10–100× slower. Looks like slow queries even with good plans.

Cause: Working set larger than buffer pool, or memory pressure evicting pages.

Trip to distant warehouse
🔧 IC checks
  • Cache hit ratio dropping?
  • Memory utilisation high on DB host?
  • Has working data set grown?
  • Buffer pool size recently reduced?
Distinguish from disk bottleneck: pool problem = too many reads that should have been served from RAM.
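The hit ratio itself is a one-line calculation from two counters. A minimal sketch, assuming your database exposes logical (total) and physical (disk) read counts; the thresholds are the rule-of-thumb figures quoted above, and the "watch" label for the band in between is an added assumption.

```python
def cache_hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Share of read requests served from memory rather than disk."""
    if logical_reads <= 0:
        raise ValueError("logical_reads must be positive")
    return 1.0 - physical_reads / logical_reads


def classify_hit_ratio(ratio: float) -> str:
    if ratio > 0.99:
        return "healthy"
    if ratio < 0.95:
        return "warning"
    return "watch"        # between the two quoted thresholds


ratio = cache_hit_ratio(logical_reads=1_000_000, physical_reads=5_000)
print(f"{ratio:.3f}", classify_hit_ratio(ratio))
```

Compute it over a recent interval rather than since startup, because lifetime counters can hide a drop that started an hour ago.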
3 · How a Slow Query Happens
Figure: good plan (query → index lookup, 3 rows scanned, ~5ms) vs bad plan with no index or stale stats (query → full table scan, 5M rows scanned, ~2000ms); 400× slower without the index.
🗂️ Why an index stops working
  • Fragmentation — inserts/updates/deletes scatter pages
  • Stale statistics — optimizer misjudges row count, picks wrong plan
  • Poor selectivity — column has few unique values (e.g. status Y/N)
  • Function on column: WHERE YEAR(date)= bypasses index
Library catalog: exists but outdated
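The "function on column" case is easy to reproduce. The sketch below uses SQLite through Python's sqlite3 module, so it uses strftime in place of MySQL's YEAR(); the events table and index name are made up for the demo, and other engines word their plans differently.

```python
import sqlite3


def access_paths() -> dict[str, str]:
    """Compare plans for a function-wrapped filter and an equivalent range filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created TEXT, payload TEXT)")
    conn.execute("CREATE INDEX idx_events_created ON events (created)")
    queries = {
        # Wrapping the column in a function hides it from the index.
        "function_on_column": "SELECT * FROM events WHERE strftime('%Y', created) = '2024'",
        # The same filter written as a range on the bare column.
        "range_on_column": "SELECT * FROM events"
        " WHERE created >= '2024-01-01' AND created < '2025-01-01'",
    }
    plans = {}
    for name, sql in queries.items():
        rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
        plans[name] = " | ".join(row[3] for row in rows)
    conn.close()
    return plans


for name, plan in access_paths().items():
    print(f"{name}: {plan}")
```

The function-wrapped filter produces a table scan, while the range form searches the index. Rewriting the predicate is often a quicker fix than adding a new index.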
📊 Slow query signals
slow query log spiking · high CPU on DB · EXPLAIN type:ALL · db file scattered read
EXPLAIN the query. type:ALL = full scan. type:ref/range = index used.
🔧 IC actions
  • Check slow query log for offenders
  • Run EXPLAIN — identify full scans
  • Are statistics up to date?
  • Has data volume grown recently?
  • Any code deploy or query change?
  • Is an index rebuild running (competing I/O)?
4 · How Locking Works (Row Lock → Deadlock → Metadata Lock)
🔒 Row Lock

Locks specific rows during an update. Others needing the same rows must wait. Write always blocks write. MVCC prevents read blocks in most DBs.

enq: TX - row lock contention
One lane closed due to an accident
🔄 Deadlock

T1 holds Row A, wants Row B. T2 holds Row B, wants Row A. Circular — neither moves. DB kills one (the "victim"). May trigger retry storm.

deadlock errors in logs · rolled-back transactions
Two cars blocking each other at a junction
🚫 Metadata Lock

DDL (ALTER TABLE) locks the entire table structure. All queries queue instantly. CPU stays low — blocked, not overloaded.

low CPU, high wait · queries in "Waiting for table metadata lock" (MDL)
Entire road shut down

BLOCKING CHAIN — how one transaction freezes the system

T1 ACTIVE — holds lock → blocks → T2 WAITING → blocks → T3 WAITING → blocks → T4+ WAITING — chain grows…
Fix: Kill T1 to unblock the entire chain. System is blocked — not overloaded. Killing the head releases all waiting transactions immediately.
5 · How Writes Are Committed (Redo / Transaction Log)
Figure: app writes (heavy) → redo log (flush contention, the bottleneck) → slow disk flush → commit acknowledged. Slow disk → slow commits → sessions held longer → pool fills up.
📋 Redo log key facts
What it records: Every write before commit
Why it exists: Durability (recover after crash)
Bottleneck sign: High write latency, slow commits
Cascade effect: Slow log → slow commits → pool fills
Long transactions: Hold log space and prevent truncation → log grows
⚖️ Checkpoint vs Log Backup
Checkpoint: Flushes data pages to disk for crash recovery
Log backup: Frees log space for reuse
Common mistake: Running checkpoint when log grows (won't help)
Fix for log growth: Run log backup, kill long transactions
Checkpoint = unload truck · Log backup = clear warehouse
6 · How Connection Pool Exhaustion Happens
CONNECTION POOL (max 10): 8 in use · 2 free
~1MB RAM per connection (MySQL) · default max_connections: 151
OVERFLOW QUEUE: Req 11 · Req 12 · Req 13 · Req 14+ timing out → "Too many connections"
conn timeout ≠ query timeout: with a connection timeout the client never reached the DB
query timeout = got in, query too slow · connection timeout = never got a slot
⛔ Pool exhausted

All slots taken. New requests queue then time out. DB may not be overloaded — just at its connection limit.

🕳️ Connection leak

Connections opened but never closed. Pool slowly fills. Triggered by app restarts or error paths that skip close().
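A sketch of how an error path can be kept from leaking a slot. TinyPool is a hypothetical stand-in whose "connections" are plain integers; real drivers and pools have their own APIs, but the try/finally shape is the point.

```python
import queue
from contextlib import contextmanager

class TinyPool:
    """Hypothetical fixed-size pool; slots are integers, not real connections."""

    def __init__(self, size):
        self._free = queue.Queue()
        for slot in range(size):
            self._free.put(slot)

    def free_slots(self):
        return self._free.qsize()

    @contextmanager
    def connection(self, timeout=0.1):
        # Raises queue.Empty if no slot frees up in time (a connection timeout).
        slot = self._free.get(timeout=timeout)
        try:
            yield slot
        finally:
            # Runs on success and on exceptions, so the slot always returns.
            self._free.put(slot)

pool = TinyPool(size=2)
try:
    with pool.connection() as conn:
        raise RuntimeError("query failed")  # simulated error path
except RuntimeError:
    pass
print(pool.free_slots())  # 2: the failed request did not leak its slot
```

Code that takes a slot without the finally block would leave one fewer free slot after every failure, which is the slow fill described above.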

⏱️ Timeout types
Query timeout: Got in, query too slow, killed
Conn timeout: Never got a slot, rejected
Seated but slow vs turned away at door
🔧 IC questions
  • At max_connections?
  • Slow queries holding slots?
  • Connection leak suspected?
  • Can app layer restart to release?
7 · How an Incident Cascades

A single root cause often triggers a cascade. Recognising the chain tells you where to intervene.

Stale statistics / missing index: Optimizer picks full table scan
Queries run 400× slower: DB threads held open for much longer
Connection pool fills up: New requests can't get a connection
"Too many connections" error: Application layer throws 500s
Users see full outage: Root cause: one missing/broken index
IC insight: Don't just fix the symptom (restart app / increase max_connections). Trace back to root cause — otherwise it recurs. Common chain: index issue → slow queries → connection exhaustion → 500s.
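The arithmetic behind this chain can be sketched with Little's Law (average connections busy = request rate × time each query holds a connection). The ~5 ms and ~2000 ms figures are the good-plan and bad-plan timings quoted earlier in this guide; the request rate of 100 per second is an assumed number for illustration.

```python
def connections_needed(requests_per_second, seconds_per_query):
    """Average connections busy at once (Little's Law: L = lambda * W)."""
    return requests_per_second * seconds_per_query

RATE = 100             # requests per second; assumed for illustration
MAX_CONNECTIONS = 151  # MySQL default max_connections

fast = connections_needed(RATE, 0.005)  # index lookup, ~5 ms
slow = connections_needed(RATE, 2.0)    # full table scan, ~2000 ms

print(fast, slow, slow > MAX_CONNECTIONS)
```

Under these assumptions the same traffic goes from needing less than one connection on average to needing more than the default limit allows, with no change in request volume.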
🔗 Other common cascades
  • Long transaction → row locks → blocking chain → throughput drops
  • Disk I/O saturation → redo log slow → commits slow → pool fills
  • Schema change (MDL) → instant table lock → all queries queue
  • Index rebuild at peak → doubles disk I/O → slow queries → cascade above
📡 Replication lag

Heavy primary writes outpace replica's apply speed. Reads return stale data — looks like a bug, not infrastructure.

replica behind by N seconds · stale data reports
Quick fix: Route reads to primary. Root fix: reduce write load or increase replica resources.
🗄️ Resource saturation

CPU, disk, and memory all hitting limits simultaneously — everything degrades with no single clear cause.

CPU >90%: Query processing starved
Disk I/O >85%: All reads/writes slow
Memory >85%: Buffer pool evicted → more disk reads
Entire city overwhelmed with traffic
8 · Quick Reference — Symptom → Likely Cause
What you see → Likely cause
"Too many connections" → Pool exhausted (slow queries / leak)
Gradual slowdown, high CPU → Full table scan / missing index
Sudden freeze, low CPU → Metadata lock (schema change)
Localised queries waiting → Row lock contention
Deadlock errors in logs → Circular lock dependency
High disk I/O, slow commits → Redo log bottleneck
Log growing despite checkpoints → No log backup / long transaction
Auth failures after deploy → Wallet inaccessible / cert expired
Stale / inconsistent data → Replication lag
Disk spike during maintenance → Index rebuild (temp double storage)
🚦 Universal IC triage order
  1. Identify scope — all users or subset? One service?
  2. Check what changed — deploy, migration, job, config?
  3. Blocked vs overloaded? — low CPU + waits = blocked; high CPU = overloaded
  4. Find the head of the chain — what is T1 / the root blocker?
  5. Kill or pause — remove the blocker; monitor for recovery
  6. Root cause, not symptom — so it doesn't immediately recur
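Step 3 can be written down as a small rule. This is a sketch only: the function name and both thresholds are assumptions for illustration, not values from any monitoring product.

```python
def classify(cpu_percent, waiting_sessions, cpu_high=80.0, waiting_high=10):
    """Rough blocked-vs-overloaded call from two numbers.

    cpu_high and waiting_high are assumed thresholds; tune them per system.
    """
    if cpu_percent >= cpu_high:
        return "overloaded"   # high CPU: the DB is doing too much work
    if waiting_sessions >= waiting_high:
        return "blocked"      # low CPU + many waits: something holds a lock
    return "unclear"

print(classify(cpu_percent=12.0, waiting_sessions=85))  # blocked
print(classify(cpu_percent=96.0, waiting_sessions=3))   # overloaded
```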
💬 Useful MySQL commands
Active sessions: SHOW PROCESSLIST
InnoDB locks: SHOW ENGINE INNODB STATUS
Query plan: EXPLAIN SELECT ...
Kill session: KILL [process_id]
Replication lag: SHOW REPLICA STATUS
Slow query log: SHOW VARIABLES LIKE 'slow%'

DNS Record Types Contact list with routing rules

Core understanding: DNS isn't just "name → IP." It stores different record types that control where traffic goes and how services are discovered.

What it is: A distributed directory with multiple record types, each serving a different routing purpose.

Key records:

  • A → domain → IPv4 (most common)
  • AAAA → domain → IPv6
  • CNAME → alias (domain points to another domain)
  • MX → mail routing
  • TXT → verification / policies (SPF, DKIM)
  • NS → which DNS servers are authoritative

Problem in incident: Wrong IP in A record · broken CNAME chain · missing or incorrect records

Effect (what you see): Users routed to wrong server · partial outages · some services work, others fail

Technical effect: DNS resolves — but to the wrong destination

What it means: Misconfiguration, not outage — traffic is flowing, but incorrectly

Analogy: Contact list with wrong phone numbers or forwarding rules

Incident signals:

  • Traffic hitting wrong servers
  • Sudden shift in traffic patterns
  • "It works for some domains but not others"

IC questions: "What record changed?" / "Are we resolving to the expected IP?" / "Is there a CNAME chain involved?"

Pattern: Traffic going somewhere wrong → think DNS misconfiguration

DNS RECORD TYPES (record: maps · used for · incident risk)
A: domain → IPv4 · Website / API traffic · Wrong IP → wrong server
AAAA: domain → IPv6 · IPv6 traffic · IPv6-only users broken
CNAME: domain → domain · Aliases / CDN / subdomains · Broken chain → NXDOMAIN
MX: domain → mail server · Email routing · Email fails, site still up
TXT: domain → text string · SPF, DKIM, verification · Emails marked spam
NS: domain → nameserver · Authoritative server lookup · All DNS resolution fails
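To make "is there a CNAME chain involved?" concrete, here is a sketch that follows aliases through an in-memory record table. The records dict, the hostnames, and the max_hops limit are all hypothetical; a real check would query a resolver instead.

```python
def resolve(name, records, max_hops=8):
    """Follow CNAMEs in a {name: (type, value)} dict until an A record.

    Returns (ip, chain). Raises LookupError on a missing record
    (the NXDOMAIN case) or when the chain exceeds max_hops (likely a loop).
    """
    chain = [name]
    for _ in range(max_hops):
        if name not in records:
            raise LookupError("broken chain at " + name)
        rtype, value = records[name]
        if rtype == "A":
            return value, chain
        name = value          # CNAME: keep following the alias
        chain.append(name)
    raise LookupError("chain too long, possible loop: " + " -> ".join(chain))

# Hypothetical zone data for illustration.
records = {
    "www.example.com": ("CNAME", "cdn.example.net"),
    "cdn.example.net": ("A", "203.0.113.9"),
    "api.example.com": ("CNAME", "old-lb.example.net"),  # target was deleted
}
print(resolve("www.example.com", records))
```

Calling resolve("api.example.com", records) raises LookupError, which is the broken-chain case from the table above.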

TTL & Propagation Old maps still in circulation

Core understanding: DNS changes are not instant — TTL (Time To Live) controls how long old answers stay cached by resolvers across the internet.

What it does: TTL determines how long a resolver caches a DNS answer before it re-queries the authoritative server.

Problem in incident: Old records still cached · some users see new config, others see old

Effect (what you see): "Works for me but not others" · gradual recovery · region-dependent behaviour

Technical effect: Different resolvers return different answers — inconsistent global state

What it means: Not a failure — the change is still propagating. Expected behaviour after a DNS update.

Analogy: Old maps still being used while new maps are being distributed

Incident signals:

  • Mixed behaviour across regions or users
  • Gradual improvement over time after a DNS change
  • "Some users fixed, others still broken"

IC questions: "What is the TTL?" / "When was the change made?" / "Are caches cleared?"

Pattern: Inconsistent behaviour after a DNS change → think TTL propagation delay

RESOLVER A: STALE CACHE (TTL not expired · cached old answer)
DNS query: example.com → Cache hit → 203.0.113.5 (OLD) → OLD SERVER · user still broken
VS
RESOLVER B: FRESH CACHE (TTL expired · re-queried authoritative)
DNS query: example.com → Live query → 203.0.113.9 (NEW) → NEW SERVER · user is fixed
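The stale-versus-fresh behaviour can be sketched with a toy resolver cache. ResolverCache and its lookup signature are hypothetical, and time is passed in as a plain number so the sequence is easy to follow; the TTL of 300 seconds is an assumed value.

```python
class ResolverCache:
    """Toy resolver cache; 'now' is passed in instead of reading a clock."""

    def __init__(self):
        self._entries = {}  # name -> (answer, expires_at)

    def lookup(self, name, authoritative, ttl, now):
        cached = self._entries.get(name)
        if cached and now < cached[1]:
            return cached[0]              # TTL not expired: serve cached answer
        answer = authoritative[name]      # first query or TTL expired: re-query
        self._entries[name] = (answer, now + ttl)
        return answer

authoritative = {"example.com": "203.0.113.5"}
cache = ResolverCache()
first = cache.lookup("example.com", authoritative, ttl=300, now=0)

authoritative["example.com"] = "203.0.113.9"   # record changed after caching
stale = cache.lookup("example.com", authoritative, ttl=300, now=60)
fresh = cache.lookup("example.com", authoritative, ttl=300, now=301)
print(first, stale, fresh)  # 203.0.113.5 203.0.113.5 203.0.113.9
```

A resolver that cached just before the change keeps the old answer for up to one full TTL, which is why "when was the change made?" and "what is the TTL?" are asked together.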

TCP vs UDP Registered mail vs postcards

Core understanding: TCP and UDP are two transport protocols — reliable vs fast. Knowing which one your traffic uses changes how you diagnose failures.

TCP (Transmission Control Protocol): Reliable, ordered, connection-based · used by HTTP/S, MySQL · retries automatically · guaranteed delivery

UDP (User Datagram Protocol): Fast, no guarantees, connectionless · used by DNS, streaming, VoIP · sends and forgets — no retry built in

Problem in incident:

  • TCP: congestion, connection limits, slow under load
  • UDP: silent drops, hard-to-detect failures, no error trail

Effect (what you see): TCP issues → timeouts, slow apps · UDP issues → intermittent failures, missing responses

What it means: TCP problems = congestion or capacity · UDP problems = loss or instability

Analogy: TCP = registered mail (guaranteed delivery) · UDP = postcards (fast but may get lost)

Incident signals:

  • TCP: high latency, connection timeouts
  • UDP: missing responses, intermittent failures, no error logs

IC questions: "Is this TCP or UDP traffic?" / "Do we see retries or silent drops?" / "Is reliability or speed more critical?"

Pattern: Silent failures with no error logs → think UDP packet loss

TCP: RELIABLE (HTTP/S · MySQL · guaranteed delivery)
SENDER → P1 → RECEIVER (ACK) · SENDER → P2 → RECEIVER (ACK)
Every packet confirmed ✓ · Retransmits if no ACK · Analogy: registered mail
VS
UDP: FAST / NO GUARANTEE (DNS · streaming · VoIP · fire and forget)
SENDER → P1 · P2? · P3 → RECEIVER
P2 dropped, no retry ✗ · No error logged · silent failure · Analogy: postcard

TCP Handshake & Connection Lifecycle Knocking on a door that won't answer

Core understanding: Before any data flows, TCP must establish a connection via a 3-step handshake. If this fails, no requests can be processed at all.

The handshake: SYN → SYN-ACK → ACK

Problem in incident: Handshake fails or is delayed · SYN queue fills up · server cannot accept new connections

Effect (what you see): Connection timeouts · users can't connect · errors appear before any request is sent

Technical effect: Entry point is saturated — the problem is at the door, not inside the application

What it means: Often load-related or an attack — not an application bug

Analogy: Knocking on a door but no one answers — the house is overwhelmed before anyone can get inside

Incident signals:

  • SYN backlog warnings
  • High connection attempt counts
  • Timeouts before any request data is exchanged

IC questions: "Are connections failing before requests?" / "Is the SYN queue full?" / "Is this a traffic spike or an attack?"

Pattern: Fails before any request is processed → think TCP handshake saturation

NORMAL HANDSHAKE (CLIENT ↔ SERVER)
  1. SYN: "I want to connect"
  2. SYN-ACK: "OK, I'm ready"
  3. ACK: "Confirmed, send data"
Connection established ✓ data transfer begins
VS
SYN QUEUE SATURATED
SERVER SYN queue FULL: C1 SYN · C2 SYN · C3 SYN · C4 SYN ✗ DROPPED
C4 gets: connection timeout
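"Are connections failing before requests?" can be checked from a client with the standard socket module. A minimal sketch: probe is a hypothetical helper, and the demo uses a throwaway listener on the loopback address so it does not depend on any real service.

```python
import socket

def probe(host, port, timeout=1.0):
    """Classify a TCP connection attempt by how the handshake ends."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "established"   # SYN, SYN-ACK, ACK all completed
    except ConnectionRefusedError:
        return "refused"           # host answered with RST: nothing listening
    except socket.timeout:
        return "timeout"           # no SYN-ACK: dropped SYN, full queue, or firewall
    except OSError as exc:
        return "error: " + str(exc)

# Local demo: a listener on an OS-assigned loopback port.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))    # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
print(probe("127.0.0.1", port))    # listener is up
listener.close()
print(probe("127.0.0.1", port))    # listener is gone
```

The "timeout" branch is the one that matches a saturated SYN queue or a silent firewall drop; it cannot be reproduced reliably on loopback, so the demo only exercises the other two outcomes.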

Retransmissions & Congestion Traffic jam where cars keep re-entering

Core understanding: When TCP packets are lost, they are automatically retransmitted. Under high load, this creates a congestion feedback loop — more retransmits = more traffic = worse congestion.

What it does: TCP guarantees delivery by resending lost packets — but each resend adds to overall traffic load.

Problem in incident: High retransmission rate · congestion builds · performance degrades progressively under sustained load

Effect (what you see): Slow responses · latency climbing · throughput dropping under load

Technical effect: More traffic → more loss → more retransmits → worse performance (self-reinforcing loop)

What it means: Network degradation spiral — not a full outage, but worsening performance under load

Analogy: Traffic jam where cars keep re-entering — clearing gets harder the more vehicles try to pass

Incident signals:

  • Retransmission rate climbing
  • Latency increasing over time
  • Throughput dropping under load

IC questions: "Are retransmissions increasing?" / "Is packet loss present?" / "Where is the congested link?"

Pattern: Progressive slowdown under load + rising retries → think TCP congestion loop

RETRANSMISSION CONGESTION LOOP (over time): Normal → Load builds → Packet loss → Retransmits → Spiral
Curves plotted: Latency · Throughput · Retransmits
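A back-of-envelope view of why loss multiplies traffic, assuming each send is lost independently with the same probability. That is a simplifying assumption: real TCP also slows down through congestion control, so treat this as intuition rather than a model of any specific stack.

```python
def expected_transmissions(loss_rate):
    """Average sends per delivered packet when each send is lost
    independently with probability loss_rate (geometric: 1 / (1 - p))."""
    if not 0 <= loss_rate < 1:
        raise ValueError("loss_rate must be in [0, 1)")
    return 1.0 / (1.0 - loss_rate)

for loss in (0.0, 0.1, 0.5):
    print(loss, round(expected_transmissions(loss), 2))
```

At 50% loss every delivered packet costs two sends on average, so the link carries double the traffic for the same useful throughput.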

Kafka Model Multi-lane highway

Core understanding: Kafka is a distributed message bus. Producers write to topics, which are split into partitions for parallelism. Consumer groups read partitions independently — each partition is owned by one consumer in the group at a time.

Key concepts:

  • Producer — publishes messages to a topic
  • Topic — a named stream, split into partitions for throughput
  • Partition — ordered log; one consumer per group handles each partition
  • Consumer Group — consumers sharing the work; each partition assigned to one member
  • Offset — the consumer's position in the log; tracks how far behind it is
  • Broker — server holding partitions; one broker per partition acts as leader

Analogy: Multi-lane highway — messages are cars, partitions are lanes, consumer groups are independent fleets. A blocked lane affects only the consumers using it.

IC relevance: Kafka sits between services. Problems here cause downstream processing to stop silently — no application errors until the queue backs up visibly. Always check lag metrics before assuming the consuming app is healthy.

KAFKA MODEL: PRODUCER → TOPIC / PARTITIONS → CONSUMER GROUP
PRODUCERS: Service A · Service B · Service C
TOPIC orders: Partition 0 (offset 1042) · Partition 1 (offset 876) · Partition 2 (⚠ LAG: 4,200)
CONSUMER GROUP: Consumer 1 → P0 · Consumer 2 → P1 · Consumer 3 → P2
⚠ Consumer 3 falling behind, lag growing

Consumer Group Lag Falling behind on the highway

What it is: The gap between the latest message written to a partition and where the consumer has read to. Lag = unconsumed messages accumulating.

Signals:

  • Lag metric rising continuously
  • Consumers appear healthy but processing is slow
  • Downstream services receive events late or in bursts
  • Alerts on consumer_group_lag or records_lag

Common causes: Slow consumer processing logic · insufficient consumer instances · a stuck or crashed consumer holding a partition · sudden producer spike

IC actions:

  • Check lag metrics per consumer group and per partition — is it one partition or all?
  • Identify stuck or slow consumers — is one consumer responsible?
  • Scale out consumers (more instances = more partitions processed in parallel)
  • Determine trend: lag growing, stable, or recovering?

Pattern: Lag growing + consumers healthy → slow processing logic or stuck consumer. Lag spike + producer spike → transient burst, may self-recover. Lag on one partition only → single consumer issue.
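The "one partition or all?" check can be sketched as below. The offsets are illustrative numbers and the threshold is an assumption; in practice both offset maps would come from your Kafka tooling or metrics.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition = latest written offset minus the group's committed offset."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}

def lag_scope(lags, threshold):
    """Say whether lag above threshold affects no, one, some, or all partitions."""
    lagging = sorted(p for p, lag in lags.items() if lag > threshold)
    if not lagging:
        return "healthy", lagging
    if len(lagging) == len(lags):
        return "all partitions", lagging
    if len(lagging) == 1:
        return "single partition", lagging
    return "some partitions", lagging

# Illustrative offsets: partition 2 is far behind, the others are nearly caught up.
end = {0: 1050, 1: 880, 2: 5076}
committed = {0: 1042, 1: 876, 2: 876}
lags = partition_lag(end, committed)
print(lags)                             # {0: 8, 1: 4, 2: 4200}
print(lag_scope(lags, threshold=1000))  # ('single partition', [2])
```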

Broker & Partition Failure Lane closure

What it is: Each partition has a leader broker. If that broker fails, partition leadership must be re-elected before producers and consumers can resume on those partitions.

Signals:

  • Producer errors: LEADER_NOT_AVAILABLE or NOT_LEADER_FOR_PARTITION
  • Consumers stop receiving messages on affected partitions
  • Alert on under-replicated partitions (should always be 0 in steady state)
  • Broker removed from cluster health view

Common causes: Broker disk full · broker OOM or crash · network partition isolating a broker · replication factor too low (no replica to elect)

IC actions:

  • Check broker health across all nodes in the cluster
  • Check under-replicated partition count — non-zero means data risk
  • Allow Kafka to auto-elect a new partition leader (usually seconds)
  • Investigate root cause on the failed broker before bringing it back

Pattern: Partial message loss or processing gap → broker failure. Under-replicated partitions → replication issue or broker degraded. Full topic unavailability → majority of brokers for that partition lost.

RabbitMQ Model Postal sorting office

Core understanding: RabbitMQ is a message broker using a push model. Producers publish to an exchange, which routes messages to queues based on binding rules. The broker then pushes messages from each queue to the consumers subscribed to it. Unlike Kafka, messages are deleted once acknowledged, so there is no persistent log.

Key concepts:

  • Producer — publishes messages to an exchange with a routing key
  • Exchange — routes messages to queues based on type and binding key
  • Queue — holds messages until a consumer processes and acknowledges them
  • Consumer — connects to a queue, processes messages, sends ACK to remove them
  • Dead-Letter Queue (DLQ) — receives messages that fail, expire, or are rejected
  • Prefetch — how many unacknowledged messages a consumer can hold at once

Exchange types: Direct — exact key match · Fanout — broadcast to all bound queues · Topic — wildcard pattern match · Headers — match on message attributes
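The Topic exchange's wildcard rule can be sketched in a few lines. In AMQP topic matching, keys are dot-separated words, `*` stands for exactly one word and `#` for zero or more words. The function name and the routing keys are hypothetical; this mirrors the matching rule, not RabbitMQ's internal implementation.

```python
def topic_matches(pattern, routing_key):
    """AMQP-style topic match: '*' = exactly one word, '#' = zero or more words."""
    def match(p, k):
        if not p:
            return not k                     # pattern used up: key must be too
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb anywhere from 0 to all remaining words.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))

print(topic_matches("orders.*.created", "orders.eu.created"))  # True
print(topic_matches("orders.*", "orders.eu.created"))          # False
print(topic_matches("orders.#", "orders.eu.created"))          # True
```

A binding that uses `*` where `#` was intended (or the reverse) is one way messages end up silently routed to the wrong queue, or to none.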

Analogy: Postal sorting office — producer drops a parcel (message) with an address label (routing key). The sorting machine (exchange) reads the label and drops it in the right bin (queue). The delivery driver (consumer) collects from the bin and signs for it (ACK). Failed deliveries go to the returns pile (DLQ).

IC relevance: Problems show as queue depth growing, DLQ filling, or consumer connections dropping. The exchange layer is invisible to most monitoring — routing misconfigurations silently send messages to the wrong queue.

RABBITMQ: PRODUCER → EXCHANGE → QUEUE → CONSUMER
Producer → Exchange (routing key) → Queue A → Consumer A
Producer → Exchange (routing key) → Queue B → Consumer B
rejected / expired / failed → DLQ ✗
Exchange types: Direct (exact key) · Fanout (broadcast all queues) · Topic (wildcard pattern) · Headers (message attributes)

Dead-Letter Queue Saturation Returns pile overflowing

What it is: A Dead-Letter Queue (DLQ) receives messages that cannot be processed — due to repeated failures, TTL expiry, or explicit rejection. When the root cause isn't fixed, the DLQ grows without bound.

Signals:

  • DLQ depth metric climbing continuously
  • Consumer error rate elevated — NACKs or exceptions in logs
  • Upstream queue may appear healthy but messages are being lost silently to the DLQ
  • Memory pressure on the broker if DLQ is unbounded and large

Common causes: Application bug in consumer processing logic · schema mismatch (consumer can't parse message format) · downstream dependency the consumer calls is unavailable · message TTL set too low

IC actions:

  • Check DLQ depth and rate of growth — is it accelerating?
  • Read a sample message from the DLQ and inspect its content
  • Check consumer logs for the error being thrown on each failure
  • Fix the root cause first — clearing the DLQ without fixing the cause just refills it
  • Once fixed, replay DLQ messages in a controlled way (don't flood the queue)

Pattern: DLQ growing + consumer errors → processing bug or schema mismatch. DLQ growing + consumer healthy → TTL expiry or routing misconfiguration. DLQ suddenly growing + recent deploy → code change broke the consumer.

Consumer Connection Storm Revolving door jammed open

What it is: A large number of consumers repeatedly disconnect and reconnect in rapid succession, overwhelming the broker with connection state management. The broker spends more time handling connect/disconnect churn than delivering messages.

Signals:

  • Broker connection count spiking and thrashing (rapid up-down pattern)
  • High CPU on the broker despite low message throughput
  • Consumer application logs showing repeated connection errors and retries
  • Queue processing stalled even though consumers appear to be running

Common causes: Consumer crash loop (pod restarting repeatedly) · incorrect prefetch setting (consumer takes too many messages, times out, gets disconnected) · aggressive health-check misconfiguration forcing disconnections · network instability between consumer hosts and broker

IC actions:

  • Check broker connection count over time — is there a churn pattern?
  • Identify which consumer group or host is responsible for the churn
  • Check for crash loops: look at restart counts in kubectl get pods output, or in your process monitor
  • Check prefetch setting — a value too high causes slow ack, triggering disconnect
  • Isolate and restart affected consumer group; monitor stabilisation

Pattern: Connection churn + consumer crash loop → fix the crash cause (bad code, OOM, bad config). Connection churn + consumer healthy → prefetch misconfiguration or network instability. Broker CPU high with low message rate → connection management overhead, not processing load.

OSI Model 7-floor building

Core understanding: The OSI model gives you a shared language to pinpoint where a problem lives. Different layers are owned by different teams — knowing the layer tells you who to call.

Analogy: A 7-floor building. A fire on floor 3 is a different team's problem than a broken window on floor 7. You need to know which floor is burning before you radio anyone.

OSI MODEL: HARDWARE, PROTOCOLS & IC RELEVANCE

L7 Application (what the user sees)
Hardware / devices: WAF · L7 Load Balancer · Proxy · CDN · API Gateway
Protocols: HTTP/S · DNS · SMTP · FTP
IC symptom: HTTP errors · 4xx/5xx · auth failures → App team / Security team

L6 Presentation (encryption / encoding)
Hardware / devices: SSL accelerator / offloader · HSM (key storage)
Protocols: TLS/SSL · JPEG · MPEG
IC symptom: TLS handshake fail · cert errors → Platform / Security team

L5 Session (open / manage / close sessions)
Hardware / devices: Mostly software (OS / app)
Protocols: NetBIOS · RPC · SQL sessions
IC symptom: Session drops · mid-session logouts → App team

L4 Transport (end-to-end delivery)
Hardware / devices: L4 Firewall · L4 Load Balancer · Stateful firewall
Protocols: TCP · UDP · port numbers
IC symptom: Connection timeouts · port blocked → Network team

L3 Network (routing between networks)
Hardware / devices: Router · L3 Switch
Protocols: IP · ICMP · BGP · OSPF
IC symptom: Routing failure · wrong CIDR → Network team

L2 Data Link (node-to-node delivery)
Hardware / devices: Switch · Bridge · NIC (MAC)
Protocols: Ethernet · Wi-Fi · VLANs
IC symptom: VLAN misconfiguration · switch loop → DC / Infra team

L1 Physical (bits over physical medium)
Hardware / devices: Cable · Hub · NIC · Wi-Fi AP
Protocols: Electrical / optical signals
IC symptom: Cable unplugged · NIC failure → DC / Infra team

IC use: "Which layer is failing?" is the first isolation question. Failing before connection (L1–L4) is a network/infra problem. Failing after connection (L5–L7) is an app or security problem. Different layers mean different on-call groups.

Example — browser connects to company login page:

  • L7: Browser sends HTTPS GET. WAF inspects the request. App processes it.
  • L6: TLS encrypts/decrypts the payload between browser and server.
  • L5: Session is established and maintained between client and server.
  • L4: TCP connection on port 443. Firewall checks source IP and port.
  • L3: IP routing selects the path to the destination IP across the internet.
  • L2: Ethernet frames hop between switches. MACs used within each segment.
  • L1: Electrical or optical signal travels down the cable or Wi-Fi.

Key distinction — Hub vs Switch: A Hub (L1) blindly repeats signals to all ports — it doesn't understand addresses. A Switch (L2) reads MAC addresses and forwards frames only to the correct port. If a switch fails, specific segments lose connectivity. If a hub fails, everything on that segment drops.

IC question: "Does the problem affect all hosts or just hosts in a specific segment?" — L1 vs L2 distinction. "Is routing broken?" — L3. "Is a port blocked?" — L4.

WAF vs Firewall Customs vs border fence

Core understanding: Both are security controls that block traffic — but they operate at entirely different layers, filter different things, and are owned by different teams. Knowing which one is blocking traffic determines who you call.

FIREWALL vs WAF: WHERE THEY SIT & WHAT THEY BLOCK

🛡 Firewall
Layer: L3 / L4 (Network & Transport)
Filters by: IP address · port · protocol
Blocks: IP ranges · ports · CIDR rules
Sits at: Network perimeter (DMZ / zone edge)
IC signal: Whole IP or port unreachable
Owns: Network team
Analogy: Border fence (block by country of origin)

🔍 WAF
Layer: L7 (Application)
Filters by: HTTP headers · URL · request body
Blocks: SQL injection · XSS · bad payloads
Sits at: In front of web / API services
IC signal: Specific requests 403'd, others fine
Owns: Security / App team
Analogy: Customs officer (inspects contents, not just origin)

Key distinction: A Firewall says "I don't care what's in the parcel — I only care where it came from and which door it's heading to." A WAF opens the parcel and reads it — if it contains malicious content, it blocks the specific request, not the sender's entire address.

IC triage:

  • Whole IP/CIDR unreachable? → Check firewall rules (network team)
  • Specific HTTP requests returning 403, others fine? → Check WAF rules (security team)
  • All traffic through a port suddenly blocked? → Firewall rule change (network team)
  • New deploy causing request failures with no code error? → WAF may be matching a new payload pattern (security team)
  • Legitimate user traffic blocked after load spike? → WAF rate-limiting rule triggered (security team)

Common IC mistake: Assuming a 403 error is an application permission problem. It may be a WAF block — the app never even received the request. Check WAF logs before escalating to the app team.

Pattern: All requests blocked to an IP range → firewall. Only specific URL paths or payload patterns blocked → WAF. Sudden 403 spike after a deployment → WAF rule matched something in the new request format.

Why WAF comes before the firewall in modern cloud

The OSI comparison might suggest firewall (L4) sits in front of WAF (L7) because lower layers precede higher ones. In practice the order is the opposite — and for good reason.

  • WAF lives at the edge — it is typically part of the CDN or reverse proxy layer, closest to the internet. Application attacks (SQL injection, XSS, credential stuffing) are blocked there, before traffic ever enters the cloud network.
  • Early blocking saves compute — stopping a malicious request at the edge means the load balancer, firewall, and app tier never see it. Fewer resources consumed, lower blast radius.
  • Firewall/NSGs protect internal resources — once traffic passes the WAF and load balancer it enters a VCN (virtual cloud network). Firewalls and security groups here enforce zone-to-zone rules: which tier can talk to which, on which ports. They are not designed to inspect HTTP payloads.
  • Cloud providers separate edge security from network security — WAF/CDN is one product (e.g. OCI WAF, AWS WAF, Azure Front Door), firewalls/NSGs are another (e.g. OCI Security Lists, AWS Security Groups, Azure NSG). Different teams own each, different change-management processes apply.

What actually happens in modern cloud (OCI / AWS / Azure style):

MODERN CLOUD TRAFFIC FLOW
  1. Internet: User's browser, mobile app, API client (untrusted source)
  2. WAF (edge layer, CDN / reverse proxy): Inspects HTTP content · blocks SQL injection, XSS, rate-limit abuse · OCI WAF / AWS WAF / Azure Front Door
  3. Load Balancer: Distributes traffic across app instances · TLS termination · health checks
  4. Firewall / Security Lists / NSGs: Enforces zone-to-zone rules by IP / port / protocol · OCI Security Lists & NSGs / AWS SGs / Azure NSGs
  5. App Tier: Application servers, containers, functions (only now does app logic run)

IC implication of this ordering: When a user reports they can't reach a service, the triage path follows this stack top-down. A block at the WAF produces a 403 and never reaches the load balancer. A firewall/NSG block causes a TCP timeout — no HTTP response at all. An app error produces a 5xx after a full connection is established. The failure signature tells you which layer to investigate first.
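The failure signatures above can be written as a lookup. This is a deliberately coarse sketch of the triage rule as stated in this guide; the function name is hypothetical and real incidents will have exceptions (for example, an application can also return its own 403).

```python
def first_layer_to_check(status_code=None, timed_out=False):
    """Map a client-side failure signature to the layer to investigate first."""
    if timed_out:
        return "firewall / NSG"      # TCP timeout, no HTTP response at all
    if status_code == 403:
        return "WAF"                 # blocked at the edge, check WAF logs
    if status_code is not None and 500 <= status_code <= 599:
        return "app tier"            # full connection made, app returned an error
    return "no clear signature"

print(first_layer_to_check(timed_out=True))   # firewall / NSG
print(first_layer_to_check(status_code=403))  # WAF
print(first_layer_to_check(status_code=500))  # app tier
```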

Why this matters for escalation: The WAF is owned by a different team than the NSGs, which are owned by a different team than the app. Calling the wrong team wastes critical incident minutes. Match the symptom to the layer, then call the right team once.

Physical Infrastructure Hardware Fundamentals

Every server, packet, and connection ultimately runs on physical hardware. When a networking problem can't be explained by software, config, or DNS, the answer may be at the physical layer — and physical failures are typically total, sudden, and clean-cut in monitoring.

Physical Server

A computer in a data centre. It has CPU, RAM, storage (disk/SSD), and one or more NICs. Physical problems — hardware failure, power loss, overheating — cause total server failure with no useful application-level error messages.

NIC — Network Interface Card

The hardware component connecting a server to the network. Operates at L1 (Physical) and L2 (Data Link) — handles electrical signals, MAC addresses, and frame transmission. A failed or misconfigured NIC means 100% packet loss for that server. NICs come in 1G, 10G, 25G, and 100G speeds; a speed mismatch with the switch port causes connectivity or performance problems.

Switch (Top-of-Rack / TOR)

Connects multiple servers in the same network segment. Operates at L2 — reads MAC addresses and forwards frames to the correct port. One TOR switch typically serves an entire rack. A switch failure takes down all servers in that rack simultaneously.

Fiber Optic Cable

Carries data as pulses of light. Used within data centres and between DCs. Much faster and longer-range than copper.

  • Multi-mode: Shorter distances (within a DC, up to ~300m). Wider core, multiple light paths.
  • Single-mode: Long distances (DC-to-DC, km scale). Narrower core, one light path. Used for backbone links.

A dirty fiber connector or bad end-face causes intermittent packet loss and CRC errors — frustrating to diagnose remotely because the link stays up but degrades unpredictably.

SFP — Small Form-factor Pluggable

A transceiver module plugged into a NIC or switch port to convert electrical signals to light for fiber connections. A failed SFP causes complete link loss on that port — from software, it looks exactly like the cable is unplugged.

IC Relevance — Scoping a Physical Fault

  • One server unreachable: NIC, its patch cable, the SFP, or the switch port it connects to
  • Whole rack unreachable: TOR switch failure or its uplink fiber
  • Multiple racks / a zone: Aggregation switch or inter-DC uplink fiber
  • Intermittent drops + CRC errors: Dirty fiber connector, failing SFP, or marginal cable — the link is up but unreliable

Key question for the DC team: "Has anyone done any cabling work, port moves, or hardware changes in that rack recently?"

PHYSICAL PATH: SERVER TO NETWORK
SERVER (CPU · RAM · Disk · OS · Application) → NIC (L1/L2 · MAC addr · 10 / 25 / 100G) → SFP → fiber → TOR SWITCH (L2 · MAC table · one per rack) → uplink → NETWORK (L3 / Internet)
▲ NIC fails → 1 server offline
▲ Switch fails → whole rack offline
SFP = transceiver (converts electrical signal to light)

Proxy vs Reverse Proxy Forward vs Reverse

A proxy is a server that sits between two parties in a network connection — either on behalf of the client (forward proxy) or on behalf of the server (reverse proxy). The direction determines what it protects and what it hides.

Forward Proxy — represents the client

A forward proxy sits in front of the client. Client traffic passes through it on the way out to the internet.

  • What it hides: the client's identity from the destination server
  • Use cases: corporate content filtering, outbound traffic control, caching for groups of users, anonymity
  • IC scenario: all users in an office can't reach external sites → suspect forward proxy misconfiguration or outage. Check proxy logs. The app isn't the problem — the outbound path is.
  • Examples: Squid, corporate web proxy, VPN exit node

Reverse Proxy — represents the server

A reverse proxy sits in front of the server. External traffic reaches the reverse proxy first, which then routes it to the right backend.

  • What it hides: the backend server's identity and internal topology from the client
  • Use cases: TLS termination, load balancing across app servers, rate limiting, caching static content, WAF integration
  • IC scenarios:
    • 502 Bad Gateway — reverse proxy can't reach the upstream app (app crashed or connection refused)
    • 504 Gateway Timeout — upstream app is alive but not responding fast enough
    • 499 — client gave up waiting before the reverse proxy responded
  • Examples: Nginx (see Cloud Infra tab), HAProxy, Caddy, AWS ALB, Cloudflare

The one-line difference: A forward proxy knows who you are and fetches the internet for you. A reverse proxy knows the internet is calling and routes it to the right server for you.

FORWARD PROXY vs REVERSE PROXY
FORWARD PROXY: Client (user) → Forward Proxy → Internet
Hides: client identity · Content filtering · outbound control · Squid · corporate web proxy
REVERSE PROXY: Internet (clients) → Reverse Proxy → App Server
Hides: server topology · TLS termination · load balancing · Nginx · HAProxy · AWS ALB
Reverse Proxy Error Codes (IC signals): 502 upstream down · 504 upstream too slow · 499 client gave up

1 · How DNS Works
📋 DNS Record Types
  • A: Hostname → IPv4 address
  • AAAA: Hostname → IPv6 address
  • CNAME: Alias → another hostname (chain)
  • MX: Mail routing for domain
  • TXT: Verification, SPF, DKIM records
  • NS: Which nameserver is authoritative
IC insight: Wrong record type = traffic routes correctly at DNS level but hits the wrong place. DNS can be "working" and still be wrong.
⏱️ TTL & Propagation

TTL (Time To Live) controls how long DNS answers are cached. After a change, old answers persist across the internet until every cache expires.

  • Low TTL (60s): Changes propagate fast
  • High TTL (3600s): Changes take up to 1 hour to spread
  • "Works for me": Your cache has the new record; others still have the old one
IC questions: What is the TTL? When was the change made? Are caches cleared?
Analogy: Old maps still in circulation
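
The TTL arithmetic behind "When was the change made?" can be sketched as follows. It assumes resolvers honour the TTL exactly; the function names are illustrative, not part of any DNS library.

```python
from datetime import datetime, timedelta, timezone


def latest_stale_answer(change_time: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a resolver that cached the old record just before the
    change could still serve it (assumes resolvers honour the TTL)."""
    return change_time + timedelta(seconds=ttl_seconds)


def may_still_be_stale(change_time: datetime, ttl_seconds: int, now: datetime) -> bool:
    """True while some caches may legitimately hold the old answer."""
    return now < latest_stale_answer(change_time, ttl_seconds)


if __name__ == "__main__":
    changed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # example value
    print("old answers possible until:", latest_stale_answer(changed, 3600))
```

If the incident clock is still inside that window, "works for me but not for others" is expected behaviour rather than a new fault.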
2 · TCP vs UDP
TCP — Reliable & Ordered
  • ✓ Guaranteed delivery · ordered packets · connection-based
  • ✓ Automatic retransmission on loss
  • Uses: HTTP/S, MySQL, SSH, SMTP, FTP
  • Issues appear as: timeouts, congestion, retransmits
  • Cost: higher overhead, slower setup (handshake required)
UDP — Fast & Connectionless
  • ✓ Send and forget · no connection setup · low overhead
  • ✗ No guaranteed delivery · no ordering
  • Uses: DNS, VoIP, video streaming, gaming
  • Issues appear as: silent drops, choppy audio/video
  • Cost: no recovery on loss — app must handle it
3 · TCP Connection Lifecycle & What Can Go Wrong
3-way handshake (Client ↔ Server):
  1. Client → Server: SYN
  2. Server → Client: SYN-ACK
  3. Client → Server: ACK (connection established)
Failure points:
  • SYN sent but no SYN-ACK → server unreachable / firewall drops
  • SYN queue full → server can't accept new connections (overload/SYN flood)
  • Connection reset mid-session → RST packet, timeout, or crash
  • TIME_WAIT accumulation → ephemeral port exhaustion under high load
⚠️ SYN queue full

Server can't accept new connections. Cause: traffic spike or SYN flood attack. Connections fail before the app is even involved.

IC: Is this load or attack? Check connection rate vs normal baseline.
🔄 Retransmissions & congestion

Lost packets trigger retransmit. Under load, retransmits add more traffic → more loss → feedback loop. Progressive slowdown that worsens without intervention.

Analogy: Cars re-entering a traffic jam
🔧 IC questions
  • Failing before or after connection established?
  • SYN queue depth — is it filling?
  • Retransmit rate increasing?
  • Is packet loss present on the link?
  • Traffic spike or sustained high load?
  • Is this an attack (SYN flood)?
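
The first question above ("failing before or after connection established?") can be checked with a plain TCP handshake attempt. This is a minimal sketch using Python's standard `socket` module; the function name `probe_tcp` and the returned labels are illustrative choices.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Attempt a TCP handshake and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"   # handshake completed: look at the app next
    except socket.timeout:
        return "timeout"         # no SYN-ACK: unreachable host or silent drop
    except ConnectionRefusedError:
        return "refused"         # RST came back: nothing listening, or rejected
    except OSError as exc:
        return f"error: {exc}"   # DNS failure, no route, and so on
```

A "timeout" or "refused" result points at the network path or the listener; "connected" means the handshake works and the investigation moves up to the application.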
4 · Quick Reference — Symptom → Likely Cause
What you see → Likely cause
  • "Works for me" but not others → TTL — stale cache on some resolvers
  • Traffic routing to wrong server → Wrong DNS record (A/CNAME pointing to old IP)
  • Connections failing before any data → TCP handshake failing — firewall / SYN queue
  • Progressive slowdown under load → TCP congestion / retransmission loop
  • Silent drops, choppy audio/video → UDP packet loss — no retransmit
  • Service recovers after DNS TTL expires → Stale DNS cache — needed to propagate
🚦 Networking IC triage
  1. Layer first — DNS (name resolution) or TCP (connection) or app?
  2. Who sees it? — all users or a subset? A subset points to DNS propagation
  3. What changed? — DNS record, IP, certificate, firewall rule?
  4. Failing before or after handshake? — pre-handshake = network; post = app
  5. TCP or UDP? — determines whether loss shows up as retransmits (TCP) or silent drops (UDP)
5 · RabbitMQ
📬 Exchange → Queue → Consumer

Producers publish to an exchange with a routing key. The exchange routes to queues based on its type. Consumers receive messages from queues (pushed by the broker or pulled on request) and ACK each message to remove it.

  • Direct: Exact routing key match
  • Fanout: Broadcast to all bound queues
  • Topic: Wildcard pattern match on key
IC key: Unlike Kafka, messages are deleted on ACK. Silent routing bugs send messages to the wrong queue — they don't error, they just disappear.
Analogy: Postal sorting office
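
To see how a topic binding can silently miss a message, here is a simplified re-implementation of topic-exchange wildcard matching, where `*` stands for exactly one dot-separated word and `#` for zero or more words. It is an illustration written for this guide, not RabbitMQ's own code, and the routing keys in the demo are made up.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if a topic binding pattern matches a routing key.

    '*' matches exactly one word, '#' matches zero or more words.
    """
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)          # pattern used up: key must be too
        if p[i] == "#":
            # '#' consumes nothing, or consumes one word and stays active
            return match(i + 1, j) or (j < len(k) and match(i, j + 1))
        if j < len(k) and (p[i] == "*" or p[i] == k[j]):
            return match(i + 1, j + 1)  # one word consumed on both sides
        return False

    return match(0, 0)


if __name__ == "__main__":
    # A binding of 'orders.*' does not catch a three-word key.
    print(topic_matches("orders.*", "orders.created.eu"))
```

A publisher that starts adding a region suffix to its routing key would stop matching an `orders.*` binding without any error being raised, which is the "messages just disappear" case described above.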
☠️ Dead-Letter Queue (DLQ)

Failed, rejected, or TTL-expired messages are routed to the DLQ. A growing DLQ means the consumer is failing to process messages — without fixing the root cause, clearing the DLQ just refills it.

  • DLQ growing fast → Consumer bug or schema mismatch
  • Recent deploy + DLQ spike → Code change broke the consumer
  • DLQ growing, consumer OK → TTL too low or routing error
IC: Read a sample DLQ message, check consumer error logs, fix root cause before replaying.
🔄 Connection Storm

Consumers rapidly disconnect and reconnect, overwhelming the broker with state management. Broker CPU spikes with low message throughput — it's handling churn, not messages.

Cause: Consumer crash loop · prefetch too high → timeout → disconnect · network instability
⚙️ Prefetch setting

Controls how many unACKed messages a consumer holds at once. Too high → slow ACK → broker disconnects the consumer. Too low → consumer starved, slow throughput.

IC: Prefetch misconfiguration is a common hidden cause of connection churn and slow queues.
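
The "too high → slow ACK" chain is simple arithmetic, sketched below. It assumes the consumer works through its prefetched buffer one message at a time and that the broker enforces some acknowledgement timeout; the timeout value is a parameter you would read from your own broker configuration, and both function names are hypothetical.

```python
def worst_case_ack_delay(prefetch: int, seconds_per_message: float) -> float:
    """Seconds until the last prefetched message is ACKed, assuming the
    consumer processes its buffer sequentially."""
    return prefetch * seconds_per_message


def prefetch_is_risky(prefetch: int, seconds_per_message: float,
                      ack_timeout_seconds: float) -> bool:
    """True if the last buffered message would be ACKed after the timeout."""
    return worst_case_ack_delay(prefetch, seconds_per_message) > ack_timeout_seconds


if __name__ == "__main__":
    # Example numbers only: 500 buffered messages at 4 s each vs a 1800 s timeout.
    print(worst_case_ack_delay(500, 4.0), prefetch_is_risky(500, 4.0, 1800.0))
```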
🚦 RabbitMQ IC triage
  • Queue depth growing? → consumer keeping up?
  • DLQ filling? → consumer errors, check logs
  • Broker CPU high, low throughput? → connection churn
  • Messages missing? → routing / exchange config
  • Recent deploy? → schema or code change
6 · OSI Model — Layer Quick Reference
Layer | Hardware | Protocols | IC symptom → who to call
7 Application | WAF · L7 LB · API Gateway · CDN | HTTP/S · DNS · SMTP | HTTP errors · specific 403s → App / Security team
6 Presentation | SSL accelerator · HSM | TLS/SSL · JPEG | TLS handshake failure · cert errors → Platform / Security
5 Session | Mostly software (OS / app) | NetBIOS · RPC · SQL | Mid-session drops · session timeouts → App team
4 Transport | L4 Firewall · L4 Load Balancer | TCP · UDP · ports | Connection timeouts · port blocked → Network team
3 Network | Router · L3 Switch | IP · ICMP · BGP · OSPF | Routing failure · wrong subnet → Network team
2 Data Link | Switch · Bridge · NIC | Ethernet · Wi-Fi · VLAN | VLAN issue · switch loop → DC / Infra team
1 Physical | Cable · Hub · NIC · Wi-Fi AP | Electrical / optical | No link light · cable unplugged → DC / Infra team
🏢 7-Floor Building Analogy

Each floor handles a different job. A fire on floor 3 (Network) doesn't mean the top floors (App) are broken — but they can't work if floors below are burning.

IC question: "Which floor is failing?" — determines who to call before you start escalating.
🔑 Hub vs Switch

Hub (L1): Repeats signal to all ports — no address awareness. Everything on the segment goes down together.

Switch (L2): Reads MAC addresses, forwards only to correct port. One port failure isolates one host.

IC: "Is it all hosts on the segment or just one?" separates L1 from L2.
🗺️ IC Layer Triage
  • Pre-connection failure → L1–L4 (network/infra)
  • Post-connection failure → L5–L7 (app/security)
  • All hosts in range → L3 routing or L4 firewall
  • Specific requests 403'd → L7 WAF
  • TLS errors → L6 cert issue
7 · WAF vs Firewall
🛡 Firewall — L3/L4
Filters by: IP address · port · protocol
Blocks: IP ranges · CIDR rules · ports
Sits at: Inside VCN — zone-to-zone rules
IC signal: TCP timeout — no HTTP response at all
Owned by: Network team
Analogy: Border fence — blocks by country of origin
🔍 WAF — L7
Filters by: HTTP headers · URL · request body
Blocks: SQL injection · XSS · bad payloads
Sits at: Edge — CDN / reverse proxy (before LB)
IC signal: HTTP 403 — specific requests blocked
Owned by: Security / App team
Analogy: Customs inspector — reads parcel contents
MODERN CLOUD TRAFFIC FLOW
Internet: Untrusted — all traffic starts here
↓ WAF (CDN / edge): Blocks app attacks early · HTTP 403 on match
↓ Load Balancer: Distributes · TLS termination
↓ Firewall / NSGs: Zone rules by IP/port · TCP drop on block
↓ App Tier: App logic — only reached after all layers pass
Why WAF is first: Blocking application attacks at the edge means the load balancer, firewall, and app tier never see them. Early kill = lower resource cost + smaller blast radius.
⚠️ Common IC Mistake

Assuming a 403 is an app permission error. If the app logs show nothing, the request never reached the app — WAF blocked it at the edge. Check WAF logs before escalating to the app team.

📋 Failure Signature
  • HTTP 403, specific paths → WAF
  • TCP timeout, no response → Firewall / NSG
  • HTTP 5xx after connect → App tier
  • Connection refused → Port blocked / NSG
🚦 Who to call
  • HTTP 403 → Security team (WAF)
  • TCP timeout → Network team (NSG/FW)
  • 5xx after connect → App team
  • Nothing logged anywhere → start at edge (WAF)
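
The failure signatures and the "who to call" list above can be read as one small decision function. This is a sketch only: the function name, parameters, and branch order are assumptions made for illustration, and the team names are taken from the list.

```python
from typing import Optional


def who_to_call(tcp_connected: bool, http_status: Optional[int],
                app_logged_request: bool) -> str:
    """Map an observed failure signature to the first team to contact."""
    if not tcp_connected:
        return "Network team (NSG / firewall)"      # no HTTP response at all
    if http_status == 403 and not app_logged_request:
        return "Security team (WAF)"                # blocked before the app
    if http_status == 403:
        return "App team (permission check in the app)"
    if http_status is not None and 500 <= http_status <= 599:
        return "App team"                           # 5xx after connect
    return "unclear: start at the edge (WAF logs) and work inward"


if __name__ == "__main__":
    print(who_to_call(True, 403, app_logged_request=False))
```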
8 · Physical Infrastructure
Scope → Suspect Component
  • 1 server unreachable: NIC, patch cable, SFP, or switch port
  • Whole rack down: TOR (top-of-rack) switch or its uplink
  • Multiple racks / zone: Aggregation switch or inter-DC fiber
  • Intermittent drops + CRC errors: Dirty SFP, bad fiber connector
Key IC Questions
  • "Has anyone done cabling work or hardware changes in that rack?"
  • "Is it exactly one rack, or partial?" (scope the switch)
  • "Are there CRC errors on the NIC?" (physical layer signal)
  • "Can you try re-seating the SFP?" (quick physical fix)
9 · Proxy vs Reverse Proxy
Forward Proxy — represents the client
  • Sits in front of the client — traffic goes Client → Proxy → Internet
  • Hides the client's identity from the destination
  • Used for outbound content filtering, corporate traffic control, anonymity
  • IC signal: all users behind a network can't reach external sites → check forward proxy health and config
  • Examples: Squid, corporate web proxy
Reverse Proxy — represents the server
  • Sits in front of the server — traffic goes Internet → Proxy → App
  • Hides backend topology; handles TLS, load balancing, rate limiting
  • 502 = upstream app is down · 504 = upstream too slow · 499 = client gave up
  • IC signal: Nginx 502/504 → the problem is behind Nginx, not Nginx itself
  • Examples: Nginx (Cloud Infra tab), HAProxy, AWS ALB

IDCS Global Authentication Failure Highway entrance closed

Core understanding: IDCS is a centralised cloud identity provider. It acts as the first gate users must pass through before reaching any system. If it becomes unavailable, users cannot authenticate anywhere — even though the underlying apps may still be healthy.

What it is: A shared login authority used across multiple systems.

What it does: Authenticates users and issues access tokens.

Problem in incident: IDCS outage or service disruption.

Effect (what you see):

  • All apps inaccessible after login attempt
  • 401/403 spike across every service simultaneously

Technical effect: No tokens issued — authentication cannot begin.

IC interpretation: Central dependency failure — the authentication hub is down.

Analogy: Highway entrance closed — all routes blocked even though the roads beyond are clear.

Incident signals: Login failures across all apps at once · drop in successful auth metrics.

IC questions: "Are all apps affected?" / "Is IDCS reachable?" / "When did auth success rate drop?"

Pattern recognition: All apps fail login simultaneously → suspect IDCS.

[Figure: IDCS global auth failure, hub-and-spoke with all connections broken. IDCS (service down) at the centre; App A, App B, App C, App D around it; user logins fail. All apps: 401 / 403 spike; apps healthy but unreachable.]

Token Expiry / Validation Issues Expired train ticket during journey

Core understanding: After login, users don't continuously re-authenticate — they use tokens as proof of identity. These tokens have rules like expiration time and validation checks. If those rules are misconfigured or systems disagree on time, valid users can suddenly appear invalid.

What it does: Maintains authenticated sessions across systems.

Problem in incident: Expired or misvalidated tokens.

Effect (what you see):

  • Random mid-session logouts
  • Intermittent 401 errors for users already logged in

Technical effect: Token rejected by applications.

IC interpretation: Misconfiguration or time sync issue — not an outage.

Analogy: Expired train ticket during the journey — you bought it, you're on the train, but the gate says it's invalid.

Incident signals: Token validation errors in logs · session drops without user action.

IC questions: "Are tokens expiring earlier than expected?" / "Is system time consistent across services?"

Pattern recognition: Random auth failures for already-logged-in users → token issue.

[Figure: Token lifecycle, misconfigured expiry vs expected time. Timeline from Issued across the expected valid window to the expected expiry, with the misconfigured / clock drift case marked where the user gets a 401 (mid-session logout).]
Fix: check token TTL config → sync system clocks → rollback if recently changed
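
The clock-sync failure can be shown with a few lines of arithmetic. The sketch below assumes a JWT-style `exp` claim (seconds since the epoch) and an optional leeway that the validating service allows for clock skew; the function name and all numbers are illustrative.

```python
def token_is_valid(exp: int, validator_now: int, leeway_seconds: int = 0) -> bool:
    """Accept the token only while the validator's own clock is before
    the expiry time plus any configured leeway."""
    return validator_now < exp + leeway_seconds


if __name__ == "__main__":
    exp = 1_003_600                 # token expires at this timestamp
    true_now = 1_003_000            # real time: 10 minutes before expiry
    drifted_now = true_now + 900    # validator clock runs 15 minutes fast
    print(token_is_valid(exp, true_now))      # accepted on a correct clock
    print(token_is_valid(exp, drifted_now))   # rejected on the drifted clock
```

The token itself is fine in both calls; only the validator's idea of "now" differs, which is why these incidents look random from the user's side.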

Federation / SSO Misconfiguration Two border checkpoints refusing each other

Core understanding: Federation allows one identity system to trust another (e.g., corporate login into cloud apps). This relies on precise configuration and certificates. If that trust breaks, users get stuck in login flows or cannot authenticate at all.

What it does: Enables login via external identity providers.

Problem in incident: Broken trust configuration or certificate mismatch.

Effect (what you see):

  • Redirect loops — browser bounces between app and login page
  • Login fails after being redirected to SSO

Technical effect: Authentication handshake fails between identity providers.

IC interpretation: Integration misconfiguration — the two systems no longer agree on trust.

Analogy: Two border checkpoints refusing to accept each other's stamps.

Incident signals: Repeated redirect errors · SSO-specific error codes · only SSO users affected.

IC questions: "Are only SSO users affected (local accounts still work)?" / "Any cert or config changes recently?"

Pattern recognition: Redirect loop → SSO / federation issue.

[Figure: Federation / SSO, broken trust → redirect loop. User tries login → Corp IdP (e.g. AD FS) → IDCS (trust broken) → App, with a redirect loop between them.]
Check: certificate validity · SAML/OIDC metadata · recent cert or config changes

LDAP Latency (IDM) Traffic jam at ID checkpoint

Core understanding: LDAP is the directory service that stores user identities in IDM environments. During login, systems query LDAP to verify users. If LDAP is slow, every authentication request slows down — even if nothing is technically broken.

What it does: Provides user data for authentication queries.

Problem in incident: Slow directory responses.

Effect (what you see):

  • Login takes much longer than normal (15–20s instead of 1–2s)
  • Occasional timeouts for some users

Technical effect: Queued or delayed auth requests — high LDAP response times.

IC interpretation: Performance bottleneck — slowness, not failure.

Analogy: Traffic jam at the ID checkpoint — everyone gets through eventually, but very slowly.

Incident signals: High auth latency · complaints about slow login, not login failure.

IC questions: "Is login slow or actually failing?" / "What are LDAP query response times?" / "Any load increase recently?"

Pattern recognition: Login eventually works but is very slow → LDAP latency.

[Figure: LDAP latency, auth request queue building up. Auth requests R1 to R6 queue and drain slowly into LDAP (high latency). Response time: normal ~20ms; under load 8 000ms+.]
Check: LDAP query times · index health · connection pool exhaustion · server load
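
"Is login slow or actually failing?" can be answered from a sample of recent login attempts. The sketch below is one possible way to do it; the thresholds (5 seconds, 95% success) and the function name are assumptions, not values from any product.

```python
from typing import List, Tuple


def classify_login_health(samples: List[Tuple[float, bool]],
                          slow_seconds: float = 5.0,
                          min_success: float = 0.95) -> str:
    """Classify logins as 'failing', 'slow', or 'healthy'.

    samples: (duration_seconds, succeeded) for recent login attempts.
    """
    if not samples:
        return "no data"
    success_rate = sum(1 for _, ok in samples if ok) / len(samples)
    durations = sorted(d for d, _ in samples)
    # Simple nearest-rank style 95th percentile.
    p95 = durations[min(len(durations) - 1, int(0.95 * len(durations)))]
    if success_rate < min_success:
        return "failing"
    if p95 >= slow_seconds:
        return "slow"      # logins succeed but take far too long
    return "healthy"


if __name__ == "__main__":
    print(classify_login_health([(18.0, True)] * 19 + [(20.0, True)]))
```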

User Provisioning / Sync Issues Different checkpoints, different passenger lists

Core understanding: Users and permissions are synchronised across systems. If this process fails, different systems may have different views of who a user is or what they can access — creating inconsistent, hard-to-diagnose failures.

What it does: Keeps user identities and roles consistent across all systems.

Problem in incident: Sync delays or failures.

Effect (what you see):

  • Some users fail while others succeed
  • Permissions missing or incorrect for affected users

Technical effect: Data inconsistency across systems.

IC interpretation: State mismatch — not an outage, but a divergence between systems.

Analogy: Different checkpoints using different passenger lists.

Incident signals: Only specific users or groups affected · new users, recently changed roles, or recently onboarded teams impacted.

IC questions: "Who exactly is affected?" / "Any recent provisioning changes or new user onboarding?"

Pattern recognition: Partial user failures (not everyone) → sync or provisioning issue.

[Figure: User provisioning, sync failure creates state mismatch. Identity Source: Alice · Bob · Carol (new). System A (synced): Alice ✓, Bob ✓, Carol ✓. System B (stale): Alice ✓, Bob ✓, Carol ✗ (sync failed).]
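
A state mismatch like the one in the figure is a set difference between the identity source and a downstream system. The sketch below assumes you can export a list of usernames from each side; `sync_drift` and its result keys are hypothetical names.

```python
from typing import Dict, Set


def sync_drift(source_users: Set[str], system_users: Set[str]) -> Dict[str, Set[str]]:
    """Compare the identity source with one downstream system."""
    return {
        "missing_in_system": source_users - system_users,   # not yet synced
        "unknown_to_source": system_users - source_users,   # stale or orphaned
    }


if __name__ == "__main__":
    source = {"Alice", "Bob", "Carol"}
    system_b = {"Alice", "Bob"}          # Carol's provisioning never arrived
    print(sync_drift(source, system_b))
```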

MFA Failure Second checkpoint blocked

Core understanding: MFA adds a second verification step after password authentication. This step often depends on external systems (SMS providers, authenticator apps). If it fails, users are authenticated on password but cannot complete login.

What it does: Provides additional identity verification beyond password.

Problem in incident: MFA system or provider failure.

Effect (what you see):

  • Users stuck after entering their password
  • MFA prompts that never arrive or fail to validate

Technical effect: Second authentication step cannot complete.

IC interpretation: Partial authentication failure — first step worked, second step blocked.

Analogy: Getting through the first checkpoint but being blocked at the second.

Incident signals: MFA error messages in logs · push notifications or SMS not arriving.

IC questions: "Where exactly does login stop — before or after MFA prompt?" / "Is this an external MFA provider?"

Pattern recognition: Login stalls after password entry → MFA failure.

[Figure: MFA failure, stuck at second checkpoint. User → Step 1 Password ✓ → Step 2 MFA ✗ (failed), App not reached; external provider (SMS / push) unreachable.]
Check: MFA provider status · SMS gateway · push service · consider temp bypass for recovery

OAuth / OIDC Misconfiguration Wrong key for one door

Core understanding: Applications must be correctly configured to trust IDCS tokens. This includes client IDs, secrets, and redirect URLs. A small mismatch can break authentication for a single app while others work fine.

What it does: Connects individual applications to the identity provider.

Problem in incident: Incorrect client configuration in one app.

Effect (what you see):

  • One specific app fails login
  • All other apps still work fine

Technical effect: Token rejected by the misconfigured application.

IC interpretation: App-specific misconfiguration — scope is narrow, not a platform issue.

Analogy: Wrong key for one door — master key still works on all others.

Incident signals: Single app impacted · OAuth error codes (invalid_client, redirect_uri_mismatch).

IC questions: "Is this only one app or multiple?" / "Any config deployment to this app recently?"

Pattern recognition: One app broken while others work → OAuth / OIDC misconfiguration.

[Figure: OAuth / OIDC misconfig, one app broken, others healthy. IDCS issuing tokens ✓; App A ✓ correct config; App B ✓ correct config; App C ✗ wrong client_id, token rejected.]
Check: client_id · client_secret · redirect_uri · scopes · recent app deployment
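
The "Check" list above amounts to diffing what the identity provider has registered against what the app was deployed with. This sketch assumes redirect URIs are compared as exact strings (a common provider behaviour, but an assumption here) and uses plain dictionaries with made-up values; it is not an IDCS API.

```python
from typing import Dict, List


def oauth_config_mismatches(registered: Dict, deployed: Dict) -> List[str]:
    """Return the names of settings where the app disagrees with the provider."""
    problems = [field for field in ("client_id", "redirect_uri")
                if registered.get(field) != deployed.get(field)]
    # The app may only request scopes the provider has granted to this client.
    extra_scopes = set(deployed.get("scopes", [])) - set(registered.get("scopes", []))
    if extra_scopes:
        problems.append("scopes")
    return problems


if __name__ == "__main__":
    registered = {"client_id": "app-c",
                  "redirect_uri": "https://app-c.example.com/callback",
                  "scopes": ["openid", "profile"]}
    deployed = dict(registered, redirect_uri="https://app-c.example.com/callback/")
    print(oauth_config_mismatches(registered, deployed))
```

A single trailing slash is enough to produce a mismatch, which fits the "small mismatch breaks one app" pattern.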

Certificate Expiry Expired passport

Core understanding: Certificates establish trust between systems in authentication flows. They have expiration dates. When they expire, systems stop trusting each other — causing sudden, complete failures with no degraded middle period.

What it does: Secures and validates identity communication between systems.

Problem in incident: Expired certificate.

Effect (what you see):

  • Sudden, complete login failure — was working, now completely broken
  • SSO stops working

Technical effect: Trust validation fails — systems refuse to communicate.

IC interpretation: Preventable config failure — a known expiry date was missed.

Analogy: Expired passport — valid until midnight on the expiry date, then refused everywhere instantly.

Incident signals: Certificate error messages in logs · sudden complete outage with no deployment.

IC questions: "Did any certificate expire recently?" / "Was there a cert change or renewal attempt?"

Pattern recognition: Sudden auth break with no deployment → check certificate expiry first.

[Figure: Certificate expiry, sudden trust failure at expiry date. Timeline: Issued → certificate valid, auth working normally ✓ → expired, trust fails, all auth fails instantly ✗.]
Fix: renew cert → deploy → verify trust chain · Prevent: monitor cert expiry dates proactively
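
Proactive expiry monitoring boils down to computing the days left from a certificate's `notAfter` field. The sketch below uses `ssl.cert_time_to_seconds` from Python's standard library, which parses the `notAfter` string format returned by `getpeercert()`; the function name and the example date are illustrative.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days remaining before a certificate's notAfter time (negative if expired).

    not_after uses the format found in getpeercert(), e.g.
    'Jan  1 00:00:00 2030 GMT'.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


if __name__ == "__main__":
    print(days_until_expiry("Jan  1 00:00:00 2030 GMT", datetime.now(timezone.utc)))
```

Alerting when this value drops below a chosen threshold turns a sudden outage into a routine renewal ticket.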

Rate Limiting / Throttling Road closed due to too much traffic

Core understanding: Identity systems protect themselves by limiting how many requests they accept per time window. During traffic spikes, legitimate users can be blocked if limits are hit — even when the identity system itself is completely healthy.

What it does: Prevents overload or abuse by capping request rates.

Problem in incident: Too many requests trigger the limit.

Effect (what you see):

  • Login failures during peak usage times
  • 429 (Too Many Requests) responses

Technical effect: Requests rejected or delayed by the rate limiter.

IC interpretation: Capacity or protection issue — the limit may be correct or may need tuning.

Analogy: Road closed due to too much traffic — the road is fine, volume exceeded what's allowed.

Incident signals: Traffic spike correlates exactly with login failure onset · 429 errors in logs.

IC questions: "Is there a traffic spike right now?" / "Are 429 errors visible?" / "What are the configured rate limit thresholds?"

Pattern recognition: Peak usage + login failures + 429 errors → throttling.

[Figure: Rate limiting, traffic spike breaches limit ceiling. Req/s over time: normal → peak above the limit (throttled, 429) → recovery.]
Check: 429 errors · request rate vs configured limits · genuine spike vs client retry storm
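
The source does not say which algorithm the identity system uses to throttle. As an illustration, the sketch below uses a token bucket, one common choice, to show how a burst turns into 429 responses while the service itself stays healthy. Class and function names, capacity, and refill rate are all assumptions.

```python
class TokenBucket:
    """Minimal token bucket: each request spends one token; tokens refill over time."""

    def __init__(self, capacity: int, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def status_for(bucket: TokenBucket, now: float) -> int:
    """HTTP status a request arriving at time `now` would receive."""
    return 200 if bucket.allow(now) else 429


if __name__ == "__main__":
    bucket = TokenBucket(capacity=3, refill_per_second=1.0)
    print([status_for(bucket, 0.0) for _ in range(4)])  # burst of four requests
```

A client that retries immediately on 429 keeps the bucket empty, which is the "retry storm" case named in the Check line.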

Identity Dependency Failure Checkpoint staff can't access records

Core understanding: Identity systems rely on underlying services like databases, network, and storage. If those fail, identity services degrade or stop working — even if the identity system's own processes are healthy.

What it does: Depends on backend infrastructure to function.

Problem in incident: Database, network, or storage failure beneath IDCS.

Effect (what you see):

  • Slow or failed login
  • Auth errors combined with infrastructure alerts

Technical effect: Backend dependency unavailable — IDCS cannot complete auth lookups.

IC interpretation: Downstream dependency issue — the visible failure is auth, but the root cause is infrastructure.

Analogy: Checkpoint staff can't access the records database — they're present but unable to do their job.

Incident signals: Infra alerts fire alongside auth failures · auth latency spike coincides with DB / network alerts.

IC questions: "Are there DB or network alerts at the same time?" / "Is this auth-only or a wider infrastructure issue?"

Pattern recognition: Auth failures + infra alerts simultaneously → dependency failure.

[Figure: Identity dependency failure, root cause is below IDCS. User auth fails → IDCS (process up, degraded) → query fails → Database (down / slow); infra alert firing.]
Key insight: IDCS may look healthy; the root cause is in the infrastructure layer below it.

Oracle RAC — Real Application Clusters Multiple highways, one shared tunnel

Core understanding: Oracle RAC is multiple servers running the same database at the same time, all connected to shared storage. It exists to improve availability and handle more load — but coordination between nodes introduces complexity and specific failure points.

What it does: Allows multiple servers to access the same database simultaneously, share workload across nodes, and continue operating if one server fails.

Problem in incident: Things go wrong when nodes stop syncing properly, one node becomes slow or fails, or the shared storage or interconnect network becomes a bottleneck.

Effect (what you see):

  • Intermittent slowness — not a full outage
  • Some requests fast, others very slow or timing out
  • Random errors under load
  • Latency spikes, especially during high traffic

Technical effect: Nodes are competing over shared data access. Delays in synchronisation between nodes. Traffic imbalance (some nodes overloaded). Possible node eviction from the cluster.

IC interpretation: Usually a contention problem (nodes competing), a coordination failure (cluster not in sync), or an infrastructure bottleneck (network or storage). Rarely a simple "server down" — more often partial degradation, not total failure.

Analogy: Multiple highways merging into one shared tunnel. Highways = servers, tunnel = shared database storage, traffic = queries. Too many cars → congestion. Poor coordination at the merge → traffic jams. One highway blocked → the others become overloaded.

Incident signals:

  • "High DB latency" or "cluster node evicted"
  • "Global cache wait" events in Oracle monitoring
  • Connection timeouts under load
  • Uneven CPU across nodes
  • Spike in lock or enqueue waits

IC questions: "Is this affecting all users or intermittent?" / "Are all nodes healthy or is one degraded?" / "Is load evenly distributed?" / "Any recent scaling or config changes?" / "Is storage or the interconnect showing latency?"

[Figure: Oracle RAC architecture, multiple nodes, one shared storage. Node 1 (server + DB instance, active), Node 2 (active) and Node 3 (overloaded) are linked by the private cluster interconnect (node coordination). All nodes read and write the same physical data on shared disk storage (ASM / SAN / NFS): a single source of truth, and a bottleneck if saturated. If Node 3 is evicted, Nodes 1 & 2 continue serving and storage stays online.]
[Figure: RAC incident pattern, uneven load + block transfer congestion. Load distribution: Node A CPU 94% vs Node B CPU 12%; queries pile up on one node while another sits idle → IC: check load balancer / service routing. Block transfer (pinging): Node X and Node Y constantly send data blocks to each other over the interconnect → saturates network · raises latency.]

Pattern recognition: Partial slowness (not full outage) + uneven CPU across nodes + intermittent timeouts → think RAC imbalance or coordination issue.
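
The "uneven CPU across nodes" signal can be checked mechanically once per-node CPU figures are available. This is a minimal sketch; the function name, the node labels, and the 40-point spread threshold are assumptions chosen for illustration.

```python
from typing import Dict, Tuple


def cpu_imbalance(cpu_by_node: Dict[str, float],
                  max_spread: float = 40.0) -> Tuple[bool, str, str]:
    """Return (is_imbalanced, busiest_node, idlest_node) for a cluster."""
    busiest = max(cpu_by_node, key=cpu_by_node.get)
    idlest = min(cpu_by_node, key=cpu_by_node.get)
    spread = cpu_by_node[busiest] - cpu_by_node[idlest]
    return spread > max_spread, busiest, idlest


if __name__ == "__main__":
    # Figures from the incident pattern above: 94% vs 12%.
    print(cpu_imbalance({"node_a": 94.0, "node_b": 12.0}))
```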


1 · The Authentication Chain
Authentication chain: User / App (login request) → IDCS (cloud identity provider; 1st gate, all apps depend on this) → Token Issued (JWT / OAuth token) → MFA Check (2nd factor via external provider) → Session / App (user reaches application)
🚪 IDCS failure

IDCS is the first gate — all apps depend on it. If IDCS is down, all apps are unreachable even if they're perfectly healthy.

Signals: all apps affected simultaneously
IC: Is IDCS reachable? When did auth success rate drop?
Analogy: Highway entrance closed
🎟️ Token expiry

Users authenticate successfully but get kicked out mid-session. Token has expired or systems disagree on expiry rules. Not an outage — a misconfiguration or time sync issue.

Signals: mid-session logouts · valid users appear invalid
IC: Are tokens expiring early? Is clock sync consistent?
Analogy: Expired train ticket mid-journey
🔒 MFA failure

User passes password check but can't complete the second factor. Often an external MFA provider issue — not the identity system itself. Partial auth failure.

Signals: login stops at MFA screen
IC: Where exactly does login stop? Is the MFA provider external?
Analogy: Second checkpoint blocked
2 · Federation, SSO & OAuth
🔗 Federation / SSO misconfiguration

SSO relies on exact config and certificate trust between identity systems. Small mismatch = login loops or redirect failures. Only SSO users affected.

Signals: redirect loops · SSO users only
IC: Are only SSO users affected? Any cert or config change?
Analogy: Two border checkpoints refusing each other
🔑 OAuth / OIDC misconfiguration

One app has wrong client ID, secret, or redirect URL. That app's auth breaks while all others work fine. App-specific, not platform-wide.

Signals: one app broken, others fine
IC: Is this one app or multiple? Recent config deploy?
Analogy: Wrong key for one door
📜 Certificate expiry

Certs have hard expiry dates. When they expire, systems instantly stop trusting each other — no degraded period. Complete, sudden failure. Entirely preventable.

Signals: SSL handshake failed · sudden auth breakage
IC: Did any cert expire? Any renewal attempt recently?
Analogy: Expired passport
3 · Directory, Sync & Infrastructure
📂 LDAP latency & provisioning issues
  • LDAP slow: Every auth request slows — not broken, just sluggish. Eventually works.
  • Provisioning lag: New user exists in one system, not another. Inconsistent access per system.
  • Sync failure: Different systems have different user states — specific users/groups only.
IC: Who exactly is affected? Is login slow or failing? Any recent provisioning changes?
Analogy: Different checkpoints, different passenger lists
🚦 Rate limiting & dependency failures
  • Rate limit: 429 errors during traffic spikes. Identity system is healthy — it's protecting itself.
  • Dependency failure: Identity DB or network fails. Auth service processes are up but can't function. Root cause is infra, not identity.
IC: Are 429 errors visible? Any DB or network alerts at the same time? Is this auth-only or wider infra?
4 · Oracle RAC
[Figure: Oracle RAC. Node 1 and Node 2 (each a DB instance with SGA / PGA) joined by a link labelled IC (interconnect), above shared storage (ASM / SAN / NFS); both nodes read/write the same data.]
RAC incident patterns:
  • Node imbalance — one node taking all load
  • Interconnect latency — block transfer too slow
  • Storage bottleneck — shared disk contention
🏗️ What RAC means for incidents

Multiple servers run the same DB simultaneously using shared storage. Adds availability but adds coordination complexity.

  • Intermittent failure — one node degraded, not all
  • Load imbalance — sessions not evenly spread across nodes
  • Interconnect slowness — block transfers between nodes cause latency
Analogy: Multiple highways sharing one tunnel
🔧 IC questions
  • Affecting all users or intermittent?
  • Are all RAC nodes healthy?
  • Is load evenly distributed across nodes?
  • Any recent scaling or config changes?
  • Is storage or the interconnect showing latency?
5 · Quick Reference — Symptom → Likely Cause
What you see → Likely cause
  • All apps failing auth simultaneously → IDCS down — central dependency
  • Only SSO users can't log in → Federation / SSO misconfiguration
  • One specific app broken → OAuth/OIDC config on that app
  • Login stops at MFA screen → MFA provider issue
  • Users logged out mid-session → Token expiry / clock sync
  • Sudden auth breakage (no deploy) → Certificate expired
  • Slow login, eventually works → LDAP latency
  • Specific users/groups affected → Provisioning / sync failure
  • 429 errors during traffic spike → Rate limiting — identity self-protecting
  • Intermittent DB issues on RAC → Node imbalance or interconnect lag
🚦 Oracle Stack IC triage
  1. All apps or one? — all = IDCS; one = app OAuth config
  2. All users or subset? — all = platform; subset = provisioning/sync
  3. Where does login stop? — password/MFA/redirect = different layer
  4. What changed? — cert, config, deploy, rotation
  5. Slow or failing? — slow = LDAP; failing = IDCS/cert/config

Framing the Incident (Impact First) Side street vs motorway

Core understanding: Framing means quickly defining what is broken and how bad it is. Without it, teams focus on the wrong things or move too slowly.

What it does: Aligns everyone on what matters most and how urgent the situation is.

Problem in incident: Engineers jump into debugging without confirming impact. Low-priority issues get the same attention as critical ones. No urgency → slow decisions.

Effect (what you see): People asking different questions, no shared sense of severity, delayed mitigation.

What it means (IC interpretation): This is a priority alignment problem. The system isn't just failing — the response is unfocused.

Analogy: An accident happens but no one knows if it's on a side street or a major motorway. If it's the motorway (checkout), you need immediate response and all resources focused.

Incident signals: "Is this actually impacting users?" / "How bad is this?" / "Are we sure this is critical?" / Multiple threads of investigation.

IC questions: "What is the user impact right now?" / "Which functionality is affected?" / "Is this revenue-critical (checkout/login)?" / "How many users are impacted?" / "When did this start?"

Then state clearly: "Checkout is failing → high priority → focus on mitigation."

[Figure: Impact framing, which road is blocked? Side street (low priority): low traffic, 1 lane, low urgency. Motorway (high priority): checkout, all users, urgent now.]

Ownership Assignment Uncontrolled junction

Core understanding: Every critical task needs a clearly named person or team responsible. Without this, work is assumed, duplicated, or not done at all.

What it does: Ensures work happens without delay and everyone knows who is doing what.

Problem in incident: Tasks are suggested but not assigned. People assume "someone else is doing it." Gaps or duplication in work.

Effect (what you see): "I thought that was already happening." Silence after actions are suggested. Same task done twice or not at all.

What it means (IC interpretation): This is a responsibility gap. The system is slow because no one owns execution.

Analogy: Traffic lights exist but no one is assigned to operate them. Cars hesitate, collide, or stop moving entirely.

Incident signals: "Who is doing that?" / "Is that being worked on?" / Long pauses after instructions.

IC questions: "Who owns the app right now?" / "Who is handling DB investigation?" / "Who is managing infra/network?"

Then assign clearly: "App team → initiate rollback now. DBA → investigate queries. Network → prepare to drain nodes."

[Figure: No owner vs clear owner. No owner assigned: App Team, DBA, Network, and Infra with no one directing. Owner assigned: IC directs App → rollback, DBA → queries, Net → traffic.]

Timeline Tracking Sequence before the crash

Core understanding: Timeline tracking means keeping a clear sequence of events during the incident. This helps connect cause and effect quickly.

What it does: Identifies what changed before the failure. Prevents confusion during the incident.

Problem in incident: Events get mixed up. Teams argue about what happened first. Root cause becomes harder to identify.

Effect (what you see): "Wait, did that happen before or after the deploy?" Repeated questions. Confusion about sequence.

Technical effect: Slower diagnosis. Missed correlations (e.g., deploy → failure).

What it means (IC interpretation): This is a visibility problem over time. You can't solve what you can't sequence.

Analogy: Trying to understand a crash without knowing which car entered the junction first or when the collision happened.

Incident signals: Confusion about timing / "When did that happen?" repeated / Misaligned understanding across teams.

IC questions: "When did alerts start?" / "When was the last deploy?" / "When did user impact begin?"

Then state: "09:05 deploy → 09:12 alerts → likely related."
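
A minimal sketch of the same idea in Python, assuming events are jotted down as HH:MM strings in whatever order people report them. The event names and helper functions are hypothetical; the times mirror the example above.

```python
from datetime import datetime

# Hypothetical incident notes, recorded out of order as reports came in.
events = [
    ("09:15", "Users report impact"),
    ("09:05", "Deploy"),
    ("09:18", "IC engaged"),
    ("09:12", "Alerts fire"),
]

def build_timeline(raw_events):
    """Parse HH:MM stamps and return (time, event) pairs sorted by time."""
    parsed = [(datetime.strptime(stamp, "%H:%M"), what) for stamp, what in raw_events]
    return sorted(parsed)

def minutes_between(timeline, first, second):
    """Minutes from the event named `first` to the event named `second`."""
    times = {what: when for when, what in timeline}
    return int((times[second] - times[first]).total_seconds() // 60)

timeline = build_timeline(events)
for when, what in timeline:
    print(when.strftime("%H:%M"), what)

gap = minutes_between(timeline, "Deploy", "Alerts fire")
print(f"Deploy to alerts: {gap} min")
```

Sorting first and measuring the gap second is what turns scattered notes into a statement the IC can say out loud. A short gap suggests a relationship; it does not prove one.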

[Figure: Incident timeline. 09:05 deploy → 09:12 alerts fire → 09:15 users report impact → 09:18 IC engaged. The 7-minute gap between deploy and alerts suggests they are likely related.]

Parallel Work (Avoid Serial Investigation) Multi-lane road

Core understanding: Parallel work means multiple teams investigate different areas at the same time. Serial work (one after another) slows everything down.

What it does: Speeds up diagnosis and mitigation simultaneously.

Problem in incident: Teams wait for each other. Only one path investigated at a time. Bottlenecks form.

Effect (what you see): "Let's wait for DB before doing anything." Idle teams. Slow progress.

What it means (IC interpretation): This is a throughput problem. Not enough work happening simultaneously.

Analogy: Only opening one lane when multiple lanes are available — traffic builds up unnecessarily.

Incident signals: Teams waiting / Sequential updates / Slow momentum.

IC questions: "What can each team investigate right now?" / "Are we blocked or just waiting?" / "Can we run these in parallel?"

Then assign: App → deploy/rollback. DBA → queries. Network → traffic. All simultaneously.

[Figure: Serial vs parallel investigation. Serial (slow): App, then DBA (waiting), then Net (waiting); total time A + B + C, outage extended unnecessarily. Parallel (fast): App → rollback, DBA → queries, Net → traffic, all done together.]

Decisive Action (Mitigation First) Clear the road before the inquest

Core understanding: Incident command requires making fast, reasonable decisions to reduce impact — even without full information.

What it does: Stops user impact quickly. Buys time for deeper investigation.

Problem in incident: Over-analysis. Fear of making the wrong decision. Delayed action.

Effect (what you see): Endless discussion. No clear plan. Metrics not improving.

What it means (IC interpretation): This is a decision paralysis problem. The system isn't recovering because no action is taken.

Analogy: Seeing a blocked road but debating the causes instead of clearing it first.

Incident signals: "We're still investigating…" with no action taken / No improvement in metrics / Repeated theories.

IC questions: "What is the fastest way to reduce impact?" / "Can we roll back?" / "What is the safest immediate mitigation?"

Then decide: "We are rolling back — execute now."

[Figure: Decision paralysis vs decisive action. Paralysis: "Why did this happen?" The road stays blocked and impact grows. Decisive action: "Roll back, execute now." The road clears, then investigate why.]

Structured Communication (Who / What / Priority) Clear junction signs

Core understanding: Communication must be clear, direct, and structured so actions happen immediately.

What it does: Removes ambiguity. Speeds up execution.

Problem in incident: Vague instructions. Long explanations. Misunderstandings.

Effect (what you see): "Sorry, what was I doing?" Delayed responses. Confusion.

What it means (IC interpretation): This is a clarity problem. Work slows because instructions are unclear.

Analogy: Giving unclear directions at a busy junction — cars hesitate or go the wrong way.

Incident signals: Repeated clarifications / Tasks misunderstood / Slow execution after instruction.

Structure: Every instruction = Who is doing this + What exactly + Priority (now / next).

Example: "App team → roll back all nodes → priority now." (not "let's look into rollback")

[Figure: Vague vs structured communication. Vague: "Let's look into rollback…" leads to no action. Structured (who, what, priority): "App team → rollback all nodes → priority NOW" and action starts immediately.]

1 · The IC Mindset — What Makes a Good Incident Commander
  • Framing: impact first
  • Ownership: named person per task
  • Timeline: sequence of events
  • Parallel Work: multiple teams, same time
  • Decisive Action: mitigate first, investigate after
  • Communication: Who + What + Priority
2 · Framing & Ownership
🎯 Framing the Incident (Impact First)

The first thing an IC does is define what is broken and how bad it is. Without framing, teams focus on the wrong things or move too slowly.

User impact | What exactly can't users do right now?
Scope | All users or a subset? One service or many?
Severity | Is this revenue-critical (checkout / login)?
Start time | When did this begin?
Bad framing: "Something seems wrong with the DB."
Good framing: "Checkout is broken for all users since 14:32 — zero orders completing."
Side street vs motorway — know which road is blocked
👤 Ownership Assignment

Every critical task needs a clearly named person responsible. Without this, work is assumed, duplicated, or falls through the gaps.

  • App team → owns app investigation
  • DB team → owns DB investigation
  • Infra → owns network / server checks
IC question every few minutes: "Who owns [task]? Are they actively working it?"
Uncontrolled junction — someone must direct traffic
3 · Timeline & Parallel Work
📅 Timeline Tracking

A clear sequence of events connects cause and effect. Without it, the team wastes time re-discovering what happened.

When did alerts start? | First signal
Last deploy? | Common cause, always check
When did user impact begin? | May differ from first alert
What changed just before? | Config, data migration, traffic spike
Even rough notes in a shared doc are better than nothing. You'll need the timeline for the post-incident review.
Sequence of events before a crash
🏎️ Parallel Work

Multiple teams investigate different areas simultaneously. Serial investigation (one after another) is the most common time-waster in incidents.

SERIAL (slow): App team, then DB team, then Infra team
PARALLEL (fast): App team, DB team, and Infra team at the same time
IC question: "What can each team investigate right now, simultaneously?"
4 · Decisive Action & Communication
⚡ Decisive Action (Mitigation First)

IC must make fast, reasonable decisions to reduce impact — even without full information. Over-analysis during an active incident costs users time.

Can we roll back? | Usually the fastest mitigation after a deploy
Can we redirect traffic? | Bypass the broken component immediately
Can we disable a feature? | Reduce blast radius, keep the rest working
Can we scale up? | Buy time if it's capacity-related
Principle: Clear the road first — the inquest (root cause) comes after users are no longer impacted.
📢 Structured Communication

Every IC instruction = Who + What + Priority. Ambiguous instructions don't get actioned immediately.

BAD: "Let's maybe look into whether we should consider a rollback?"
GOOD: "App team → roll back all nodes → priority NOW"
Format: [Team/Person] → [Specific action] → [Priority / time]
Clear junction signs — no ambiguity about which way to go
5 · IC Checklist — What to Do in the First 10 Minutes
✅ First 10 minutes
  1. Frame it — state impact, scope, and severity clearly to the room
  2. Assign owners — App / DB / Infra / Comms — named, not assumed
  3. Check the timeline — when did it start? What changed just before?
  4. Launch parallel investigation — don't wait for one team to finish
  5. First mitigation action — rollback? redirect? disable? Do it fast
  6. Communicate out — status to stakeholders, even if "investigating"
⚠️ Common IC failure modes
  • Vague framing — "something's broken" → nobody knows urgency
  • No named owner — "someone look into the DB" → nobody does
  • Serial investigation — waiting for each team before the next starts
  • Analysis paralysis — waiting for certainty before acting
  • Unclear instructions — "maybe try rolling back?" → treated as optional
  • No comms out — stakeholders escalate, creating noise

Docker, Kubernetes & Terraform — How They Fit Together The Full Picture

Docker packages an application and everything it needs into a container — so it runs the same everywhere.

Kubernetes runs and manages those containers at scale — scheduling, healing, and load-balancing them across machines.

Terraform builds the underlying infrastructure — servers, networks, and storage — using code.

Together they:

  • Define — Terraform provisions the environment
  • Run — Docker packages and isolates the app
  • Manage — Kubernetes keeps it running at scale

The Port Analogy:

  • Terraform → the company that builds the port (designs and provisions the docks, cranes, and warehouses)
  • Kubernetes → the port authority running daily operations (decides which ship takes which container, reschedules when a ship is overloaded, and reroutes when one goes down)
  • Docker → the standardised shipping container (sealed, identical, and portable — contents are the same no matter where it lands)

Inside a Docker container:

  • Application code (e.g. Node.js, Python app)
  • Runtime (Node, Python, Java, etc.)
  • Dependencies (libraries, packages)
  • Config needed to run

IC relevance: When an incident spans multiple layers, knowing which tool owns which layer helps you ask the right question first. Container crashing = Docker layer. Pod scheduling failing = Kubernetes layer. Servers missing = Terraform layer.

[Figure: How the three tools fit together. Docker packages apps into containers (consistent, isolated, portable; runs the same everywhere). Kubernetes runs containers at scale (schedules, scales, heals; pods, nodes, services). Terraform builds infrastructure as code (define, apply, manage; servers, networks, storage). Together: define, run, and manage cloud systems.]

Docker Container packaging

What it does: Packages apps into containers. Ensures consistency across environments. Runs isolated processes on a host machine.

Problem in incident: Container crashes or restarts, resource limits hit (CPU/memory), misconfigured image or environment variables.

Symptoms:

  • App randomly restarting
  • Slow or failing requests
  • "Service unavailable" errors

Technical effect:

  • Container process dies or is killed by the OS
  • Resource starvation — CPU throttled or memory limit hit
  • Image or config mismatch between environments

What it means (IC interpretation): Usually resource exhaustion, a bad deploy or config issue, or the isolation hiding the root cause from standard monitoring.

Analogy: A standardised shipping container at a port. Every container is sealed with the app code, runtime, dependencies, and config inside — identical no matter which ship (host machine) carries it. If the contents are wrong, it fails the same way everywhere.

Incident signals: "Container restarted" · "OOMKilled" · High CPU / memory · CrashLoopBackOff (the status Kubernetes shows when it keeps restarting a crashing container)

IC questions: Are containers restarting? Is resource usage high? Was there a recent deploy? Is this one container or all of them?
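
The first two questions can be answered from `docker inspect` output, which reports `State.OOMKilled`, `State.ExitCode`, and `RestartCount` per container. The sketch below classifies already-parsed entries; the container names, the sample values, and the restart threshold of 5 are assumptions for illustration, not data from a real host.

```python
def classify_container(entry, restart_threshold=5):
    """Return a short triage label for one parsed `docker inspect` entry."""
    state = entry.get("State", {})
    restarts = entry.get("RestartCount", 0)
    if state.get("OOMKilled"):
        return "OOMKilled: memory limit exceeded"
    if restarts >= restart_threshold:  # threshold is an arbitrary assumption
        return f"restart loop: {restarts} restarts"
    if state.get("Running"):
        return "running"
    return f"exited with code {state.get('ExitCode')}"

# Hypothetical sample data shaped like `docker inspect` output.
containers = {
    "container-a": {"RestartCount": 0,
                    "State": {"Running": True, "OOMKilled": False, "ExitCode": 0}},
    "container-b": {"RestartCount": 1,
                    "State": {"Running": False, "OOMKilled": True, "ExitCode": 137}},
    "container-c": {"RestartCount": 14,
                    "State": {"Running": False, "OOMKilled": False, "ExitCode": 1}},
}

report = {name: classify_container(entry) for name, entry in containers.items()}
for name, label in report.items():
    print(name, "->", label)
```

Checking OOMKilled before the restart count keeps the more specific cause on top: a container that was killed for memory and then restarted is still a memory problem first.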

[Figure: Docker containers on a host machine. Container A (app server): running, CPU 45%, memory 40%, normal. Container B (app server): OOMKilled, CPU 12%, memory 100%, killed after exceeding its memory limit. Container C (worker service): CrashLoopBackOff, 14 restarts, crash → restart → crash, caused by bad config or a missing dependency.]

Kubernetes (K8s) Container orchestration

What it does: Runs containers at scale across multiple machines. Balances load, restarts failed workloads, and manages traffic routing between services.

Problem in incident: Pods not starting, traffic not reaching services, scaling or scheduling failures.

Symptoms:

  • Intermittent outages — some requests succeed, others fail
  • Services unreachable
  • High latency across the cluster

Technical effect:

  • Pods failing or stuck in Pending/CrashLoop state
  • Networking or service routing issues
  • Cluster imbalance — one node overloaded, others idle

What it means (IC interpretation): Usually a coordination failure, resource contention between pods, or a networking issue at the service mesh layer.

Analogy: The port authority running daily operations. Kubernetes decides which ship (node) takes which container (pod), manages the schedule, reroutes when a ship is overloaded, and replaces containers that fall into the sea (crash).

Incident signals: "Pod CrashLoopBackOff" · "Pending pods" · "Service unavailable" · Uneven latency

IC questions: Are pods running or pending? Is traffic reaching services? Any node overloaded? Any recent deploy?
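
As a rough illustration of "running or pending?", the sketch below scans the JSON that `kubectl get pods -o json` returns and lists pods that are Pending or waiting in CrashLoopBackOff. The pod names and sample data are hypothetical; only the field names (`status.phase`, `containerStatuses`, `state.waiting.reason`, `restartCount`) come from the Kubernetes pod status structure.

```python
def unhealthy_pods(pod_list):
    """Return (pod name, problem) pairs from parsed `kubectl get pods -o json` output."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        if status.get("phase") == "Pending":
            problems.append((name, "Pending"))
            continue
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                restarts = container.get("restartCount", 0)
                problems.append((name, f"CrashLoopBackOff, {restarts} restarts"))
    return problems

# Hypothetical sample data shaped like the kubectl JSON output.
pods = {"items": [
    {"metadata": {"name": "pod-a"},
     "status": {"phase": "Running",
                "containerStatuses": [{"restartCount": 0, "state": {"running": {}}}]}},
    {"metadata": {"name": "pod-c"},
     "status": {"phase": "Running",
                "containerStatuses": [{"restartCount": 8,
                                       "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"name": "pod-e"}, "status": {"phase": "Pending"}},
]}

problems = unhealthy_pods(pods)
for name, problem in problems:
    print(name, "->", problem)
```

A pod in CrashLoopBackOff can still report phase Running, which is why the container statuses are checked separately from the phase.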

[Figure: Kubernetes orchestration across nodes. The orchestrator schedules, scales, heals, and routes traffic. Node 1: Pod A (api-server) running, CPU 32%; Pod B (api-server) running, CPU 29%; Pod C (api-server) in CrashLoop, restart #8, CPU 91%. Node 2: Pod D (worker) running, CPU 44%; Pod E (worker) Pending, no node available.]

Terraform Infrastructure as Code

What it does: Defines infrastructure using code (.tf files) and ensures the real system matches that definition. Creates and manages servers, networks, and storage automatically.

Problem in incident: Wrong infrastructure deployed, accidental deletion or change, drift between the expected and real state.

Symptoms:

  • Sudden outages immediately after a deployment pipeline runs
  • Missing resources — servers or services that should exist don't
  • Wrong environment behaviour despite identical app code

Technical effect:

  • Infrastructure changed or destroyed by a bad apply
  • State mismatch — Terraform's state file diverges from reality
  • Resources recreated with different config (different size, region, network)
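
One way to catch a destructive apply before it runs is to inspect the plan in JSON form (`terraform show -json` on a saved plan), where each entry in `resource_changes` carries a `change.actions` list. The sketch below flags anything that would be deleted or replaced. The resource addresses and the sample plan are hypothetical.

```python
def destructive_changes(plan):
    """List (address, actions) for resources a plan would delete or replace."""
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        if "delete" in actions:  # a replacement shows up as delete plus create
            flagged.append((resource["address"], actions))
    return flagged

# Hypothetical sample shaped like the JSON plan representation.
plan = {"resource_changes": [
    {"address": "aws_instance.app[0]", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
    {"address": "aws_s3_bucket.assets", "change": {"actions": ["delete"]}},
    {"address": "aws_instance.app[1]", "change": {"actions": ["update"]}},
]}

flagged = destructive_changes(plan)
for address, actions in flagged:
    print(address, actions)
```

During an incident, the same check run against the plan that was applied helps answer "what changed?" without reading the full diff.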

What it means (IC interpretation): Usually a misconfiguration, a bad change rollout, or an automation error where Terraform enforced an incorrect "desired state".

Analogy: The company that builds the port itself — the docks, cranes, and warehouses. Terraform defines and provisions the physical infrastructure before any containers arrive. If the blueprint is wrong, the port doesn't exist or is misbuilt, and the port authority (Kubernetes) has nothing to work with.

Incident signals: "Resource deleted" · "Apply completed" · Sudden infra change · Missing instances

IC questions: Was Terraform run recently? What changed in the config? Was this intentional? Can we rollback or restore state?

[Figure: Terraform, config to infrastructure. A simplified main.tf declares resource "aws_instance" (type = "t3.medium", count = 3) and resource "aws_vpc" (cidr = "10.0.0.0/16"); terraform apply enforces the desired state, producing servers (x3, t3.medium, running), a network (VPC 10.0.0.0/16, attached), and storage (S3 bucket provisioned).]

Nginx Reverse proxy / Web server

What it is: Nginx is a high-performance web server and reverse proxy. In most production setups it sits in front of your application, handling incoming HTTP/HTTPS requests and forwarding them to the app server (e.g. Gunicorn).

Key roles:

  • Reverse proxy — receives client requests and forwards them to the correct backend
  • TLS termination — handles HTTPS so the app server only sees plain HTTP internally
  • Static file serving — serves CSS, JS, images directly without touching the app
  • Load balancing — distributes requests across multiple app instances
  • Rate limiting / access control — rejects abusive clients before they reach the app

Analogy: The hotel front desk. Every guest walks in, the front desk decides where to route them — regular check-in, concierge, restaurant — without each department needing to handle its own door.

Common incident signals:

  • 502 Bad Gateway — Nginx can't reach the upstream app server (app is down or restarting)
  • 504 Gateway Timeout — app server is responding too slowly; Nginx gave up
  • Connection refused — nothing is listening on the upstream socket/port
  • High 499 rate — clients are closing connections before Nginx responds (slow backend)
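
A quick way to see which of these signals dominates is to count status codes in the access log. The sketch below assumes Nginx's default combined log format, where the status code follows the quoted request line; the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# In the combined format the status is the 3-digit field right after the quoted request.
STATUS_RE = re.compile(r'"\s(\d{3})\s')

def count_statuses(lines, watch=("499", "502", "504")):
    """Count occurrences of the watched status codes in access log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match and match.group(1) in watch:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines in combined log format.
sample = [
    '10.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /api HTTP/1.1" 502 157 "-" "curl/8.0"',
    '10.0.0.2 - - [10/Oct/2024:13:55:37 +0000] "GET /api HTTP/1.1" 200 512 "-" "curl/8.0"',
    '10.0.0.3 - - [10/Oct/2024:13:55:38 +0000] "POST /pay HTTP/1.1" 504 160 "-" "curl/8.0"',
    '10.0.0.4 - - [10/Oct/2024:13:55:39 +0000] "GET /api HTTP/1.1" 502 157 "-" "curl/8.0"',
    '10.0.0.5 - - [10/Oct/2024:13:55:40 +0000] "GET /slow HTTP/1.1" 499 0 "-" "curl/8.0"',
]

counts = count_statuses(sample)
print(dict(counts))
```

Mostly 502s points at an app server that is down or restarting; mostly 504s and 499s points at one that is up but slow.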

IC questions: Is Nginx running? What do the Nginx error logs say? Is the upstream app server reachable on its port? Did a recent config change get reloaded?

[Figure: Nginx reverse proxy request flow. Client (browser / API) → HTTPS → Nginx (TLS, routing, rate limit, static files) → HTTP → app server (Gunicorn, Node, uWSGI, etc.) → DB / upstream API. Failure codes: 502 = app server down, 504 = app too slow, connection refused = port closed.]

Gunicorn Python WSGI app server

What it is: Gunicorn (Green Unicorn) is a Python WSGI HTTP server. It runs Python web applications (Django, Flask) by spawning multiple worker processes to handle concurrent requests. It typically sits behind Nginx in production.

What is WSGI? WSGI (Web Server Gateway Interface) is the standard protocol that defines how Python web frameworks communicate with a server. Think of it as the shape of the power socket: Flask and Django are appliances that plug into the WSGI socket; Gunicorn is the socket provider. Because they both speak WSGI, you can swap one framework for another without changing the server, or swap Gunicorn for uWSGI without changing your app. Without WSGI, every framework would need its own server.
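
To make the "socket shape" concrete, here is a minimal sketch of a WSGI application using only the Python standard library. The `call_app` helper is a hypothetical stand-in for what a server such as Gunicorn does on each request: build an `environ` dict, call the app, and collect the response.

```python
from wsgiref.util import setup_testing_defaults

def application(environ, start_response):
    """A minimal WSGI app: a callable taking (environ, start_response)."""
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    headers = [("Content-Type", "text/plain; charset=utf-8"),
               ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]  # an iterable of byte strings

def call_app(app, path):
    """Play the server's role for one request (illustrative helper, not a real server)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body

status, body = call_app(application, "/health")
print(status, body.decode())
```

Flask and Django apps expose a callable with this same signature, which is why the server in front of them can be swapped without touching application code.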

Key concepts:

  • Worker processes — each sync worker handles one request at a time; more workers = more concurrency
  • Worker types — sync (default), async (gevent/eventlet), or thread-based — chosen based on workload
  • Master process — manages workers, restarts crashed ones, handles signals (reload, shutdown)
  • Binding — listens on a TCP port (e.g. 8000) or Unix socket; Nginx connects to this
  • Timeout — workers that don't respond within the timeout (default 30s) are killed and restarted
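
These settings usually live in a Gunicorn config file, which is plain Python. Below is a hedged sketch of a hypothetical `gunicorn.conf.py`; the setting names are Gunicorn's, while the bind address and the `max_requests` values are assumptions chosen for illustration.

```python
import multiprocessing

# Where Gunicorn listens; Nginx proxies to this address (assumed value).
bind = "127.0.0.1:8000"

# Common starting point: (2 x CPU cores) + 1 worker processes.
workers = multiprocessing.cpu_count() * 2 + 1

# "sync" is the default worker type: one request at a time per worker.
worker_class = "sync"

# A worker silent for longer than this many seconds is killed and restarted.
timeout = 30

# Recycle each worker after roughly this many requests to contain slow memory leaks;
# the jitter spreads restarts out so workers do not all recycle at once (assumed values).
max_requests = 1000
max_requests_jitter = 50
```

It would be loaded with something like `gunicorn -c gunicorn.conf.py project.wsgi:application`. Raising `timeout` hides slow requests rather than fixing them, so during an incident it is usually better to find out what the workers are waiting on.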

Analogy: The kitchen behind the hotel front desk. Nginx (front desk) routes the request; Gunicorn (kitchen) processes it using multiple chefs (workers). If the kitchen is too slow or understaffed, orders back up and the front desk starts returning "sorry, we're busy" errors.

Common incident signals:

  • [CRITICAL] WORKER TIMEOUT — a worker didn't finish its request in time; was killed and restarted
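  • 502 seen by clients — Gunicorn is down, refusing connections, or a worker was killed mid-request (for example by WORKER TIMEOUT); when workers are merely all busy, requests queue and Nginx more often returns 504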
  • High process memory — worker leak; workers grow until they're killed by OOM or max_requests
  • Gunicorn not responding after deploy — new code failing to import; workers crash on start

IC questions: How many workers are configured vs request rate? Are workers timing out (slow DB call? external API?)? Is Gunicorn actually running? Did a recent code deploy cause worker crashes?

[Figure: Gunicorn master + worker model. The master process spawns, monitors, and restarts workers. Workers 1, 2, and 4 are handling requests, worker 3 is idle, and worker 5 hit TIMEOUT and was killed; the master restarts it. When every worker is busy, new requests queue and Nginx eventually returns 504, or 502 if the connection is refused or dropped. Rule of thumb: workers = (2 × CPU cores) + 1; for I/O-bound requests, add workers or switch to threaded or async workers.]

Node.js JavaScript runtime

What it is: Node.js is a JavaScript runtime built on Chrome's V8 engine. It runs server-side JavaScript using a single-threaded, non-blocking event loop — meaning it can handle many concurrent connections without spawning a thread per request. Commonly used for APIs, real-time apps, and microservices.

Key concepts:

  • Event loop — a single loop processes callbacks; I/O operations are handed off asynchronously so the loop stays free for other work
  • Non-blocking I/O — DB queries, file reads, and network calls don't block the loop; they return via callbacks, Promises, or async/await
  • Single thread — CPU-intensive work blocks the event loop for everyone; offload to worker threads or a separate service
  • npm — the package ecosystem; a missing or mismatched package version can cause startup failure
  • Cluster mode / PM2 — spawns one process per CPU core to use multiple cores; PM2 also handles restarts and logging
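
The blocking effect is easy to demonstrate. The sketch below is not Node.js: it uses Python's asyncio, which also runs callbacks on a single-threaded event loop, as a stand-in. A heartbeat task measures how late its ticks fire while a second task either awaits (like non-blocking I/O) or blocks the thread (like CPU-heavy code). All names and timings are illustrative.

```python
import asyncio
import time

async def heartbeat(lags, interval=0.01, beats=30):
    """Record how late each tick fires; a healthy loop keeps the lag near zero."""
    for _ in range(beats):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(time.perf_counter() - start - interval)

async def async_io():
    await asyncio.sleep(0.2)  # yields to the loop, like a non-blocking DB call

async def cpu_block():
    time.sleep(0.2)           # holds the only thread, like CPU-heavy code

async def max_lag(task_factory):
    """Run the heartbeat next to one task and return the worst tick delay in seconds."""
    lags = []
    await asyncio.gather(heartbeat(lags), task_factory())
    return max(lags)

good = asyncio.run(max_lag(async_io))
bad = asyncio.run(max_lag(cpu_block))
print(f"max lag with async I/O:     {good * 1000:.0f} ms")
print(f"max lag with blocking call: {bad * 1000:.0f} ms")
```

The blocking run delays the heartbeat by roughly the full 0.2 seconds, which is the "all requests slow at once" pattern described above. Event loop lag measured this way is also a useful production metric.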

Analogy: A single barista handling many orders at once — they pass each order to the coffee machine (async I/O) and move on. They can juggle 50 orders. But if one order requires them to stand and stir manually for 10 minutes (CPU block), every other customer waits.

Common incident signals:

  • Event loop lag / high latency — CPU-intensive code blocking the loop; all requests slow down simultaneously
  • Process exits with uncaught exception — unhandled Promise rejection or thrown error; app crashes until PM2/systemd restarts it
  • Memory growth / OOM kill — listener leak or unbounded cache; process grows until killed
  • EADDRINUSE on startup — port already in use; previous process didn't exit cleanly

IC questions: Is the event loop blocked (all requests slow at once)? Did a deploy introduce CPU-heavy code? Is the process actually running? Is memory growing per restart? Are there unhandled Promise rejections in logs?

[Figure: Node.js event loop model. A single-threaded event loop processes callbacks; incoming requests are queued in libuv. Async I/O (DB, file, network) frees the loop, while CPU-heavy work blocks it, so every request waits until that work finishes.]

Flask Python microframework

What it is: Flask is a lightweight Python web framework. It provides routing, request handling, and templating but has no built-in ORM, admin panel, or authentication — you add only what you need. Flask apps are WSGI applications, typically served by Gunicorn in production behind Nginx.

Key concepts:

  • WSGI — Web Server Gateway Interface; the standard for Python web apps to communicate with a server like Gunicorn
  • Routes — URL patterns mapped to Python functions using @app.route('/path')
  • Application factory — a pattern where the Flask app is created inside a function, making config and testing cleaner
  • Blueprints — modular groupings of routes; large Flask apps split into blueprints for each feature area
  • Context — Flask uses a request context (per-request data) and app context (app-level data like DB connections)

Analogy: A pop-up food stall versus a full restaurant (Django). Flask gives you a table, a gas burner, and a knife — you bring the rest. Fast to set up, easy to keep simple, but you wire up every component yourself.

Common incident signals:

  • 500 Internal Server Error — unhandled exception in a route; check Gunicorn/app logs for the traceback
  • App fails to start after deploy — import error, missing env var, or broken dependency in requirements.txt
  • Slow responses on specific routes — synchronous DB call, missing index, or external API call blocking a Gunicorn worker
  • Working directory / config not found — Flask looks for files relative to the app root; a path mismatch breaks startup

IC questions: Is the app actually running (Gunicorn workers up)? Which route is failing — is it all routes or one? Did a deploy change requirements.txt or env vars? Is there a slow DB call on the failing route?

[Figure: Flask request path. Client → Nginx → Gunicorn worker → Flask app (route function → response) → DB / API. A 500 means an unhandled exception in a Flask route; a startup failure means a bad import or missing env var.]

Django Python batteries-included framework

What it is: Django is a full-featured Python web framework. Unlike Flask, it includes an ORM, admin panel, authentication, form handling, and migrations out of the box. Also a WSGI app — served by Gunicorn behind Nginx in production. Its philosophy is "don't repeat yourself" — conventions reduce the amount of code needed.

Key concepts:

  • ORM — Django's built-in Object-Relational Mapper translates Python model classes to SQL; powerful but can generate inefficient queries if used carelessly
  • Migrations — schema changes are tracked as migration files; running manage.py migrate applies them to the database
  • Settings — all configuration lives in settings.py; DEBUG, database credentials, allowed hosts, installed apps
  • Admin panel — auto-generated at /admin; very useful for manual data inspection during incidents
  • WSGI entry point — Gunicorn points at project.wsgi:application; if this import fails, no workers start

Analogy: A fully equipped commercial kitchen (vs Flask's pop-up stall). The oven, the walk-in fridge, the dishwasher — all included. More opinionated about layout, but you get to cooking faster. The trade-off: more moving parts that can break.

Common incident signals:

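  • App fails to start after deploy — missing settings or env vars, or a broken import in models/apps; unapplied migrations usually let the app start but cause database errors (missing table or column) at request time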
  • Slow queries / high DB CPU — N+1 query problem (one query per object in a loop); use select_related / prefetch_related
  • DEBUG=True in production — shows full stack traces and settings details to users; Django also keeps a record of every SQL query it runs, so memory grows over time — major security and performance issue
  • 500 on a specific URL — unhandled exception in a view; check Gunicorn logs for the traceback
  • Migration conflicts after merge — two branches added migrations to the same app; resolve with manage.py makemigrations --merge, or re-number one branch's migrations
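
The N+1 pattern is easier to recognise once the query count is visible. The sketch below does not use Django: it reproduces the pattern with the standard library's sqlite3 module and a hypothetical author/book schema, counting queries for a per-row lookup versus a single JOIN (roughly what select_related generates for a foreign key).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT,
                       author_id INTEGER REFERENCES author(id));
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO book VALUES (1, 'A1', 1), (2, 'B1', 2), (3, 'C1', 3), (4, 'A2', 1);
""")

executed = []  # every SQL statement sent to the database

def run(sql, params=()):
    """Execute a query and record it so the round trips can be counted."""
    executed.append(sql)
    return conn.execute(sql, params).fetchall()

def titles_n_plus_one():
    """One query for the books, then one more per book for its author."""
    books = run("SELECT title, author_id FROM book ORDER BY id")
    return [(title, run("SELECT name FROM author WHERE id = ?", (author_id,))[0][0])
            for title, author_id in books]

def titles_joined():
    """The same result from a single JOIN."""
    return run("SELECT b.title, a.name FROM book b "
               "JOIN author a ON a.id = b.author_id ORDER BY b.id")

executed.clear()
slow_result = titles_n_plus_one()
n_plus_one_queries = len(executed)

executed.clear()
fast_result = titles_joined()
joined_queries = len(executed)

print(f"N+1: {n_plus_one_queries} queries, JOIN: {joined_queries} query")
```

With four books the difference is 5 queries versus 1; with ten thousand rows in a list view it is the difference between a fast page and a saturated database.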

IC questions: Were migrations applied after the deploy? Is DEBUG True in production? Which view is causing 500s? Are there N+1 query patterns in the slow endpoint? Is the WSGI entry point importable?

[Figure: Django component overview. Inside the Django app: URL router (urls.py) → view (views.py) → ORM (models.py) → database (SQL via ORM). Also shown: admin panel (/admin, auto-generated), migrations (manage.py migrate, schema sync), and a warning that DEBUG=True exposes stack traces in production.]

OCI Physical Hierarchy OCI Infrastructure

Oracle Cloud Infrastructure organises resources in a three-level hierarchy: Region → Availability Domain → Fault Domain. Understanding which level a failure is at determines the blast radius and recovery options.

Region

A geographic area (e.g. uk-london-1, us-ashburn-1). Completely isolated from other regions — an outage in one region does not affect others. OCI has 40+ regions globally.

IC relevance: If users in only one country are affected, ask: "Which region do they connect to?" Regional failures are rare and escalated immediately to Oracle.

Availability Domain (AD)

Within a region there are 1–3 ADs. Each AD is a physically separate data centre with its own power, cooling, and networking. Failure in one AD does not cascade to others in the same region.

IC relevance: If some users are affected and others are not within the same region, ask: "Are the affected services deployed in only one AD? Is there cross-AD load balancing?"

Fault Domain (FD)

Each AD contains 3 FDs. A FD groups physical hardware — servers and top-of-rack switches — sharing a power circuit. A hardware failure (power circuit, rack switch) affects only the instances in that FD.

IC relevance: If some VMs within an AD are down but others are fine, ask: "Are all the affected instances in the same FD?" Spreading instances across all 3 FDs gives hardware-level redundancy inside an AD.

The Analogy

Region = the city. AD = a separate building in the city, with its own power supply and entrance — a fire in building A doesn't affect building B. FD = a floor within that building — a tripped circuit on floor 3 doesn't affect floors 1 and 2.

IC First Questions

  • "Which region are the affected resources in?" — rules in/out a regional event
  • "Are affected services in the same AD, or spread across ADs?" — narrows to AD-level failure
  • "Which FD are the affected instances in?" — points to hardware-level fault
  • "Are any other resources in the same FD also affected?" — confirms blast radius
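
The questions above amount to finding the narrowest level of the hierarchy that contains every affected instance. A minimal sketch, assuming each affected instance is described by hypothetical `region`, `ad`, and `fd` fields taken from an inventory:

```python
def narrowest_common_level(instances):
    """Return the smallest level (FD, AD, or region) shared by all affected instances."""
    if len({i["region"] for i in instances}) > 1:
        return "multiple regions"
    if len({i["ad"] for i in instances}) > 1:
        return "region"
    if len({i["fd"] for i in instances}) > 1:
        return "availability domain"
    return "fault domain"

# Hypothetical affected instances; the names and placement are illustrative.
same_fd = [
    {"name": "vm-1", "region": "uk-london-1", "ad": "AD-1", "fd": "FD-3"},
    {"name": "vm-2", "region": "uk-london-1", "ad": "AD-1", "fd": "FD-3"},
]
across_fds = same_fd + [
    {"name": "vm-3", "region": "uk-london-1", "ad": "AD-1", "fd": "FD-1"},
]
across_ads = across_fds + [
    {"name": "vm-4", "region": "uk-london-1", "ad": "AD-2", "fd": "FD-2"},
]

print(narrowest_common_level(same_fd))
print(narrowest_common_level(across_fds))
print(narrowest_common_level(across_ads))
```

The result is only a hint about where to look. Confirming it still needs the last question: whether healthy instances exist in the same FD or AD, which would point away from a placement-level fault.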
[Figure: OCI hierarchy, Region → Availability Domain → Fault Domain. Region uk-london-1: isolated geography with its own DNS and routing. Availability Domain 1 (separate data centre, own power and cooling): FD 1 healthy, FD 2 healthy, FD 3 has a power circuit fault; FD 1 and FD 2 are unaffected. Availability Domain 2 (a completely separate building): FD 1, FD 2, and FD 3 healthy. A failure in AD-1 FD-3 does not impact AD-2.]

Java Garbage Collection Java GC

Java automatically reclaims heap memory that is no longer in use — this is garbage collection. The IC-relevant symptom is the stop-the-world (STW) pause: a brief period where the JVM halts every application thread to run GC. Under load, these pauses appear as periodic latency spikes (typically 200ms–2s) with no CPU, disk, or network cause visible in infrastructure monitoring. The JVM resumes normally after each pause. If heap is consistently near-full, GC runs more frequently and pauses grow longer, eventually causing a java.lang.OutOfMemoryError. Modern collectors (G1GC, ZGC) reduce pause duration, but insufficient heap or a memory leak will overwhelm any collector.

Container Runtimes Beyond Docker

What is a container runtime? The low-level software that actually runs containers — it creates the isolated process, sets up namespaces and cgroups, and manages the container lifecycle. Docker is the most recognised but not the only option.

Why it matters as IC: Knowing which runtime is in use helps you read logs correctly and point to the right team. "docker ps" doesn't work if the environment uses containerd or CRI-O directly.

  • Podman — Near drop-in replacement for Docker. Daemonless (no background service required), supports rootless containers (runs without root), same CLI syntax. Used where Docker daemon is a security concern. Key difference: no daemon means no single point of failure; each container is a direct child process of the user.
  • containerd — Lightweight runtime originally extracted from Docker — Docker uses containerd under the hood. Kubernetes removed dockershim in v1.24, so clusters now talk to containerd directly. Minimal API; it ships only a low-level debugging CLI (ctr), so on Kubernetes nodes crictl is the usual way to list containers. IC signal: in K8s environments from 1.24 onward, container state is in containerd, not Docker.
  • CRI-O — Built specifically for Kubernetes. Implements the Container Runtime Interface (CRI) so K8s can talk to it directly. Even more minimal than containerd. Common in OpenShift environments. IC signal: if the cluster uses OpenShift, the runtime is almost certainly CRI-O.
  • LXC / LXD — More like lightweight virtual machines than pure application containers. Each LXC container runs a full Linux userspace with init, systemd, and multiple processes — not just one application. Used for OS-level isolation rather than microservice packaging. Key difference: LXC feels like a VM; Docker feels like a process.
  • rkt (CoreOS Rocket) — Security-focused runtime. Now deprecated: Red Hat acquired CoreOS in 2018, and rkt was archived by the CNCF in 2019. Mentioned here for historical context; you may see it in older documentation.
  • Kubernetes + pluggable runtimes — K8s itself is not a container runtime; it is an orchestrator. It manages containers via the Container Runtime Interface (CRI), which lets you swap the underlying runtime (containerd, CRI-O, etc.) without changing how K8s works.

Quick decision rule for ICs:

  • Bare VM running a single app → likely Docker or Podman
  • Kubernetes cluster → containerd or CRI-O (dockershim was removed in K8s v1.24, so Docker is no longer the default path)
  • OpenShift cluster → CRI-O
  • OS-level multi-process isolation → LXC/LXD

CONTAINER RUNTIMES — AT A GLANCE

  • Docker: most common, has a daemon, great tooling. ✓ Best DX
  • Podman: drop-in for Docker, no daemon, rootless mode. ✓ Secure default
  • containerd: powers Docker and K8s, lightweight API, no user CLI. ✓ K8s default
  • CRI-O: K8s-native only, OpenShift default, minimal footprint. ✓ OCP/K8s only
  • LXC / LXD: OS-level isolation, like a lightweight VM, full userspace. ≠ app containers


1 · What is a Container?
📦 The one-line definition

A container is just a process with its own mini-filesystem and dependencies — an isolated app + everything it needs to run, packaged together.

Core value: portability + consistency. The app behaves the same on any machine, any environment.

🥡 Lunchbox analogy
The food = your app
The ingredients = dependencies
The box = isolation

You can take it anywhere, and it's the same meal every time.

VM vs CONTAINER — WHAT EACH ISOLATES

  • Virtual machines: VM 1 and VM 2 each hold an app plus their own OS and kernel. Each VM carries its own full OS, so it is slow to start and resource-heavy.
  • Containers: App A, App B, and App C each carry only their dependencies and run on a shared OS kernel. Containers share the host OS, so they are fast, lightweight, and portable.

2 · How the Three Tools Fit Together
  • Terraform: builds the infrastructure (servers · networks · storage · DNS). Port builder: constructs the docks. Incident: resources missing / misconfigured.
  • Kubernetes: runs and manages containers at scale (scheduling · healing · load balancing). Port authority: directs where containers go. Incident: pods not running / traffic not routing.
  • Docker: packages the application into a container (app code · runtime · dependencies · config). Shipping container: same everywhere. Incident: container crashing / OOMKilled.
IC layer rule: Container crashing = Docker layer. Pod scheduling failing = Kubernetes layer. Servers/network missing = Terraform layer. Knowing which layer owns the problem points you to the right team immediately.
3 · What is Nginx?
📡 One-line definition

Nginx is a reverse proxy and web server that sits in front of your app — it receives every incoming HTTP/HTTPS request and decides where to send it.

🏨 Analogy

The hotel front desk. Every guest walks in; the desk decides who handles them — restaurant, concierge, housekeeping. No department needs its own front door.

NGINX — SITS IN FRONT, ROUTES EVERYTHING

Client (browser / API) → HTTPS → Nginx (TLS · routing · static files · rate limit · load balance) → HTTP → App Server (Gunicorn / Node.js)

  • 502 = app server down
  • 504 = app too slow

4 · What is Gunicorn?
🍳 One-line definition

Gunicorn is a Python WSGI app server: it takes requests from Nginx and runs your Flask or Django app using a pool of worker processes (with the default sync worker class, one request per worker at a time).

WSGI (Web Server Gateway Interface) is the standard protocol that lets Python web frameworks (Flask, Django) communicate with a server like Gunicorn. Think of it as the power socket shape — the framework plugs in, the server provides the socket, and they speak a common language regardless of which framework is used.
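To make the "socket shape" concrete, here is a minimal sketch of both sides of WSGI using only the standard library. The `app` callable is what a framework ultimately exposes; `call_like_a_server` is a hypothetical stand-in for what Gunicorn does on each request, simplified for illustration.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A complete WSGI application: one callable taking environ and start_response."""
    path = environ.get("PATH_INFO", "/")
    body = f"hello from {path}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]  # the response body is an iterable of bytes


def call_like_a_server(path):
    """Play the server's role: build environ, call the app, collect the response."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    chunks = app(environ, start_response)
    return captured["status"], b"".join(chunks)


print(call_like_a_server("/health"))
```

Flask and Django applications expose a callable with this same signature, which is why one server can run either framework without knowing which it is.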

👨‍🍳 Analogy

The kitchen behind the hotel front desk. Nginx routes the order; Gunicorn processes it using N chefs (workers). If the kitchen is full or a chef takes too long — new orders back up and the front desk starts returning errors.
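The number of chefs and how long one may take are ordinary Gunicorn settings. The fragment below is a sketch of a `gunicorn.conf.py`; the setting names are Gunicorn's own, but every value (including the bind address) is an illustrative assumption rather than a recommendation for any particular service, and `max_in_flight_requests` is a hypothetical helper added only to state the capacity rule.

```python
# gunicorn.conf.py (illustrative values only)
import multiprocessing

bind = "127.0.0.1:8000"   # assumed address that Nginx proxies to
workers = multiprocessing.cpu_count() * 2 + 1  # the "chefs"; common starting point from the Gunicorn docs
worker_class = "sync"     # default class: each worker handles one request at a time
timeout = 30              # a worker silent for this many seconds is killed and restarted
backlog = 2048            # connections allowed to wait while every worker is busy


def max_in_flight_requests():
    """With sync workers, concurrent requests cannot exceed the worker count."""
    return workers
```

During an incident, the useful questions follow directly from these values: how many workers are configured, what the timeout is, and whether request duration has grown enough to keep every worker busy.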

GUNICORN — ONE REQUEST PER WORKER

Nginx → Master (spawns workers, restarts on crash) → Worker 1 (busy) · Worker 2 (idle) · Worker 3 (TIMEOUT)

  • All busy → 502
  • Timeout → worker killed

5 · What is Node.js?
⚡ One-line definition

Node.js is a JavaScript runtime that handles many concurrent connections using a single-threaded event loop — async I/O keeps it free for other requests, but CPU-heavy code blocks every user at once.

☕ Analogy

A single barista juggling many orders — they hand each order to the machine (async I/O) and move on. But if they have to stand and manually grind beans for 10 minutes (CPU work), every other customer waits.
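The barista effect can be measured. The sketch below uses Python's asyncio as a stand-in for the Node.js event loop (an assumption made to keep the example in one language; the single-threaded scheduling behaviour is analogous, not identical). A heartbeat task records how long the loop goes between turns while one request runs beside it. All function names and durations are illustrative.

```python
import asyncio
import time


async def heartbeat(gaps, stop):
    """Record the time between consecutive turns of the event loop."""
    last = time.perf_counter()
    while not stop.is_set():
        await asyncio.sleep(0.01)
        now = time.perf_counter()
        gaps.append(now - last)
        last = now


async def async_io_request():
    await asyncio.sleep(0.3)  # stands in for a DB or network call; the loop stays free


async def cpu_heavy_request():
    time.sleep(0.3)           # stands in for CPU-bound work; nothing else can run


async def worst_gap(request):
    """Run one request next to the heartbeat and return the longest stall seen."""
    gaps, stop = [], asyncio.Event()
    beat = asyncio.create_task(heartbeat(gaps, stop))
    await asyncio.sleep(0.05)  # let the heartbeat start ticking
    await request()
    await asyncio.sleep(0.05)
    stop.set()
    await beat
    return max(gaps)


if __name__ == "__main__":
    print(f"async I/O : worst stall {asyncio.run(worst_gap(async_io_request)):.3f}s")
    print(f"CPU-heavy : worst stall {asyncio.run(worst_gap(cpu_heavy_request)):.3f}s")
```

The async request leaves the heartbeat ticking roughly every 10 ms, while the CPU-heavy one stalls it for the full duration of the work. That stall is what every other user experiences as latency.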

NODE.JS — SINGLE EVENT LOOP

Event loop: single thread, processes callbacks.

  • Async I/O (DB, network): loop stays free ✓
  • CPU-heavy code: BLOCKS loop, all users wait ✗

6 · What is Flask?
🏕️ One-line definition

Flask is a lightweight Python WSGI microframework — it gives you URL routing and request handling only. No ORM, no admin panel, no auth built in. You add exactly what you need.

🥘 Analogy

A pop-up food stall. You get a table, a gas burner, and a knife — bring the rest yourself. Fast to set up, easy to keep simple, but you wire every component.

FLASK — MINIMAL: JUST ROUTING + HANDLERS

Nginx → Gunicorn (WSGI server) → WSGI → Flask app (routing + view functions; your code + any libs you choose)

  • 500 = unhandled exception in route

7 · What is Django?
🏭 One-line definition

Django is a batteries-included Python WSGI framework — ORM, admin panel, auth, and migrations come built in. More moving parts than Flask but faster to build standard features.

🍽️ Analogy

A commercial kitchen fully equipped — everything is there when you arrive. Faster to cook a full meal, but more equipment means more things that can break.

DJANGO — BATTERIES INCLUDED

Nginx → Gunicorn → Django (ORM · Admin · Auth · Migrations · Views · Templates) → Database

8 · How the Python Web Stack Fits Together
PYTHON WEB STACK — END TO END

Client (browser) → HTTPS → Nginx (TLS · routing · static; hotel front desk) → HTTP → Gunicorn (WSGI server · worker pool; kitchen with N chefs) → WSGI → Flask / Django (routes · views · ORM; pop-up stall / commercial kitchen) → Database (PostgreSQL · MySQL · Oracle)

IC layer rule: 502/504 errors → check Nginx logs first. 500 errors on specific routes → check Gunicorn/app logs for traceback. Slow but not crashing → check for blocked event loop (Node.js) or slow DB query (Flask/Django). App won't start → check imports and env vars.
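The rule above reads naturally as a small lookup. This is only a sketch of the rule as stated; the function name, parameters, and fallback message are hypothetical, and a real triage would weigh several symptoms together.

```python
def first_place_to_look(status_code=None, slow=False, fails_to_start=False):
    """Map a symptom to the first thing to check, following the IC layer rule."""
    if fails_to_start:
        return "app startup: imports and environment variables"
    if status_code in (502, 504):
        return "Nginx logs"
    if status_code == 500:
        return "Gunicorn/app logs (traceback)"
    if slow:
        return "blocked event loop (Node.js) or slow DB query (Flask/Django)"
    return "no rule matched: gather more symptoms"


print(first_place_to_look(status_code=502))
print(first_place_to_look(status_code=500))
print(first_place_to_look(slow=True))
```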
9 · Oracle Cloud Infrastructure Hierarchy
Region

Geographic area (e.g. uk-london-1). Fully isolated from other regions. A regional failure affects all ADs and FDs within it.

Ask: "Is this one geography or global?"

Availability Domain (AD)

Separate data centre within a region (1–3 per region). Own power and cooling. AD failure does not affect other ADs.

Ask: "Are affected services in the same AD?"

Fault Domain (FD)

Hardware grouping within an AD (3 per AD). Shared power circuit + top-of-rack switch. Failure affects only instances in that FD.

Ask: "Are all downed VMs in the same FD?"

Analogy & IC Lens

Region = city · AD = separate building in the city · FD = floor within the building.
A tripped circuit on one floor doesn't affect other floors or other buildings.

Scope first: Region → AD → FD. The level of the failure determines who you call and what options you have for recovery.
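Scoping can be done mechanically once you know where each affected instance lives. The sketch below finds the narrowest level that contains every affected instance; the `Placement` type, the `failure_scope` helper, and the AD/FD labels are assumptions for illustration. Note that a single affected instance will always report "fault domain", which may simply mean one broken VM.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    region: str  # e.g. "uk-london-1"
    ad: str      # availability domain label (hypothetical, e.g. "AD-1")
    fd: str      # fault domain label (hypothetical, e.g. "FD-2")


def failure_scope(affected):
    """Return the narrowest level that contains every affected instance."""
    if not affected:
        raise ValueError("need at least one affected instance")
    regions = {p.region for p in affected}
    ads = {(p.region, p.ad) for p in affected}
    fds = {(p.region, p.ad, p.fd) for p in affected}
    if len(fds) == 1:
        return "fault domain"
    if len(ads) == 1:
        return "availability domain"
    if len(regions) == 1:
        return "region"
    return "multi-region"


down = [
    Placement("uk-london-1", "AD-1", "FD-2"),
    Placement("uk-london-1", "AD-1", "FD-3"),
]
print(failure_scope(down))
```

Here the two instances share an AD but not an FD, so the answer to "Are all downed VMs in the same FD?" is no and the next question moves up a level to the AD.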

10 · Java GC
Stop-the-World Pause

The JVM briefly halts all threads to reclaim heap memory. Symptom: periodic latency spikes (200ms–2s), no CPU/disk/network cause, clean recovery after each spike.

IC signal: intermittent spikes with no infrastructure alert → ask if it's a Java service → suspect GC.

11 · Container Runtimes
What each runtime is used for
  • Docker — general-purpose app containers, best developer tooling
  • Podman — drop-in Docker replacement, daemonless, rootless mode — preferred where security posture matters
  • containerd — lightweight runtime used by Docker and by Kubernetes since v1.24 (replaced dockershim)
  • CRI-O — Kubernetes-native only, OpenShift default, minimal footprint
  • LXC / LXD — OS-level isolation, more like a lightweight VM than an app container
  • rkt — deprecated (Red Hat acquired CoreOS in 2018; rkt was archived by the CNCF in 2019)
IC decision rule
  • Bare VM / single app → Docker or Podman — use docker ps (or podman ps)
  • Kubernetes cluster (v1.24+) → containerd — use crictl ps
  • OpenShift cluster → CRI-O — use crictl ps
  • Multi-process OS isolation → LXC/LXD — use lxc list

Key difference from Docker: Podman has no daemon, so there is no central service whose failure takes every container down; containers run under the invoking user instead.
