IC Study Guide

Query Optimizer · GPS choosing the route

What it does: Chooses how queries are executed. Decides indexes, join order, and access paths.

Problem in incident: Picks inefficient execution plan. Ignores indexes or misjudges data.

Effect (what you see): Gradual slowdown, queries pile up, CPU increases.

Technical effect:

Full table scans instead of index lookups
More rows processed than needed
Increased CPU / disk I/O
Connections held longer

What it means: System doing too much work per query. Inefficiency spreading across system. Can lead to saturation or connection exhaustion.

Analogy: GPS sends cars through small roads instead of highways.

Incident signals:

Slow query logs increasing
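db file scattered read (Oracle wait event for multiblock reads, typical of full scans; db file sequential read is the single-block wait usually seen with index access)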
Rising latency

Key insight: The optimizer makes its decision automatically based on statistics. If stats are stale or data distribution has shifted, it can pick the wrong plan even when a good index exists — causing a sudden slowdown with no code change.
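To make the stale-statistics effect concrete, here is a minimal sketch of a cost-based choice. It is a toy model, not how any real optimizer is implemented: the function `choose_plan` and both cost constants are assumptions made up for illustration.

```python
# Toy cost model (illustrative only; real optimizers are far more detailed).
FULL_SCAN_COST_PER_ROW = 1.0   # assumed: cheap sequential read per row
INDEX_COST_PER_ROW = 20.0      # assumed: random I/O per matching row


def choose_plan(table_rows: int, estimated_selectivity: float) -> str:
    """Pick the cheaper plan using the *estimated* fraction of matching rows."""
    full_scan_cost = table_rows * FULL_SCAN_COST_PER_ROW
    index_cost = table_rows * estimated_selectivity * INDEX_COST_PER_ROW
    return "index lookup" if index_cost < full_scan_cost else "full table scan"


table_rows = 1_000_000
stale_estimate = 0.50     # statistics gathered when half the rows matched
actual_fraction = 0.001   # today only 0.1% of rows match

print("stale stats ->", choose_plan(table_rows, stale_estimate))
print("fresh stats ->", choose_plan(table_rows, actual_fraction))
```

With the stale estimate the model prefers the full scan even though the index would be far cheaper for the data as it is today. Refreshing statistics changes the input to the decision, not the query or the index.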

IC Questions: "Any slow queries?" / "What changed?" / "Are indexes being used?" / "Are statistics up to date?"

When Does an Index Lose Its Effectiveness? · Library catalog

Core understanding: An index isn't "broken" — it becomes less useful when the optimizer decides it's no longer efficient. This happens due to fragmentation, poor selectivity, or outdated statistics.

What it does: Helps the database find data quickly.

Problem in incident: Index exists but queries are slow.

Effect (what you see): Slow queries, full table scans.

Technical effect:

Fragmentation from frequent inserts/updates/deletes
Statistics out of date
Optimizer ignores index

What it means: Navigation system exists but is unreliable.

Analogy: Library catalog that's messy or outdated.

Incident signals: Full table scan, high read I/O.

IC Questions: "Has data changed recently?" / "Are indexes still used?"

Slow Queries & Indexing · Road choice and quality

What it does: Determines how fast data is accessed.

Problem in incident: Missing indexes or inefficient queries.

Effect (what you see): Gradual slowdown, high CPU.

Technical effect:

Full scans
High CPU / I/O
Increased query duration

What it means: System inefficiency under load. Can cascade into bigger issues.

Analogy: Cars using small roads instead of highways.

Incident signals:

Slow query logs
High CPU
db file scattered read (Oracle multiblock read wait, typical of full scans)

IC Questions: "Any slow queries?" / "Indexes being used?" / "Recent changes?"

Buffer Pool / Cache Hit Ratio · City warehouse vs distant storage depot

What it does: The buffer pool (or buffer cache) holds frequently accessed data pages in memory so the DB can serve reads from RAM instead of disk.

Problem in incident: If the buffer pool is too small, or its pages are evicted under memory pressure, the DB must read from disk more often. This causes high read I/O and latency even when queries are efficient.

Effect (what you see): High disk read I/O, slow reads, elevated "physical reads" metric. Looks similar to a missing index but queries may have good plans.

Technical effect:

Low cache hit ratio → frequent physical reads from disk
Memory pressure → pages evicted before they can be reused
Working set larger than available buffer pool
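The effect of a working set that outgrows the buffer pool can be sketched with a small LRU simulation. This is an assumption-laden toy: `simulate_hit_ratio` is a hypothetical helper, the access pattern is a simple repeated scan, and real buffer managers use more refined eviction policies than plain LRU.

```python
from collections import OrderedDict


def simulate_hit_ratio(pool_pages: int, working_set_pages: int, passes: int) -> float:
    """Replay a repeated scan over the working set through an LRU cache."""
    pool = OrderedDict()
    hits = reads = 0
    for _ in range(passes):
        for page in range(working_set_pages):
            reads += 1
            if page in pool:
                hits += 1
                pool.move_to_end(page)        # mark as most recently used
            else:
                if len(pool) >= pool_pages:
                    pool.popitem(last=False)  # evict the least recently used page
                pool[page] = None             # counts as a physical read from disk
    return hits / reads


print("working set fits:   ", simulate_hit_ratio(100, 80, 10))
print("working set too big:", simulate_hit_ratio(100, 120, 10))
```

In this model a working set only 20% larger than the pool drops the hit ratio from 0.9 to zero, because each page is evicted just before it is needed again. The queries did not change; only the fit between data and memory did.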

Key distinction from disk I/O bottleneck: Disk I/O bottleneck = disk can't keep up with demand. Buffer pool problem = too many requests hitting disk that could be served from memory.

Analogy: Warehouse runs out of stock — every request requires a trip to a distant depot instead of grabbing from the shelf.

Incident signals: Low cache hit ratio alert, high physical reads, memory utilisation high on DB host.

IC Questions: "What is the cache hit ratio?" / "Has memory pressure increased?" / "Has the working data set grown recently?"

Row Lock · One lane blocked

What it does: Locks specific rows during updates.

Problem in incident: Long transactions hold locks.

Effect (what you see): Queries waiting, localised slowdown.

Technical effect:

Other queries blocked on same rows
Increased wait times
Queue formation

What it means: One piece of work is blocking others. Can escalate if widespread.

Analogy: One lane closed due to accident.

Incident signals:

enq: TX - row lock contention
TX enqueue (mode 6)
Queries waiting

Key insight: A write always blocks another write to the same row. Whether a write blocks a read depends on the database engine and isolation level: under MVCC readers see an older row version and are not blocked, while lock-based reads wait. This distinction matters when diagnosing who is actually stuck.

IC Questions: "What's blocking?" / "Any long transactions?" / "Can we clear it?" / "Is this write-write or write-read contention?"

Deadlocks · Two cars blocking each other at a junction

What it does: Two transactions each hold a lock the other needs, causing a circular wait that neither can resolve.

Problem in incident: Transactions freeze waiting on each other — the database must detect and kill one to break the cycle.

Effect (what you see): One transaction is rolled back with a deadlock error. Throughput drops if deadlocks are frequent.

Technical effect:

T1 holds lock on Row A, wants Row B
T2 holds lock on Row B, wants Row A
DB deadlock detector kills one (the "victim") and rolls it back
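The circular wait above can be expressed as a cycle in a "waits for" graph. The following is a minimal sketch of that idea, assuming each transaction waits on at most one other; `find_deadlock` is a hypothetical helper, not any database's actual detector.

```python
from typing import Dict, List, Optional


def find_deadlock(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the transactions forming a cycle, or None if there is none.

    waits_for maps a blocked transaction to the transaction that holds
    the lock it needs.
    """
    for start in waits_for:
        path = [start]
        current = start
        while current in waits_for:
            current = waits_for[current]
            if current == start:
                return path      # walked back to the start: circular wait
            if current in path:
                break            # reached a cycle that does not include start
            path.append(current)
    return None


print(find_deadlock({"T1": "T2", "T2": "T1"}))   # T1 and T2 wait on each other
print(find_deadlock({"T2": "T1", "T3": "T2"}))   # plain blocking chain, no cycle
```

The second call shows why ordinary row lock contention is not a deadlock: the chain ends at T1, which is not waiting on anyone.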

Key distinction from row lock: Row lock contention is one-directional (one waits). A deadlock is circular (both wait on each other). The DB resolves it automatically but the rolled-back transaction may retry and repeat.

Analogy: Two cars at a narrow junction, each waiting for the other to reverse — neither can move until one backs down.

Incident signals: Deadlock errors in logs, rolled-back transactions, retry storms.

IC Questions: "Are deadlock errors in the logs?" / "Is the same pair of transactions involved?" / "Are retries making it worse?"

Metadata Lock · Entire road closed

What it does: Locks entire table structure.

Problem in incident: Schema change blocks all access.

Effect (what you see): Sudden freeze — queries pile up instantly.

Technical effect:

All queries blocked waiting on metadata
No progress despite low CPU

What it means: System is blocked, not overloaded. One operation is halting everything.

Analogy: Entire road shut down.

Incident signals:

Queries stuck "waiting"
Low CPU but high latency

IC Questions: "Any schema changes?" / "What's blocking?" / "Can we stop it?"

Locks & Contention · Blocked roads and junctions

What it does: Controls access to shared data.

Problem in incident: Too many locks or long transactions.

Effect (what you see): Queries waiting — system appears stuck.

Technical effect:

Blocking chains
Increased wait times
Throughput drops

What it means: Work is queued behind blockers. System not overloaded — just blocked.

Analogy: Traffic jam behind blocked road.

Incident signals:

Lock wait alerts
Waiting queries

IC Questions: "What's blocking?" / "How long?" / "Can we remove it?"

Long-Running Transactions · A lorry blocking a side road for hours

What it does: A transaction that stays open much longer than normal, holding locks and resources throughout.

Problem in incident: Long transactions are a root cause that triggers several other issues — they hold row locks (blocking others), prevent log truncation (causing log growth), and inflate undo/rollback segments.

Effect (what you see): Depends on what the transaction is doing — could appear as row lock contention, log growth, or disk pressure rather than the transaction itself.

Technical effect:

Holds row locks for extended period → blocks other transactions
Prevents transaction log from being truncated → log grows
Holds undo/rollback space → undo segment pressure

Key insight: Often invisible as a direct alert — you see the symptoms (lock waits, log growth) but must look for long-running transactions as the underlying cause.

Analogy: A lorry parked across a side road for hours — blocking everything behind it and preventing road crews from clearing the area.

Incident signals: Long transaction time in monitoring, lock waits, log growth, undo pressure.

IC Questions: "Any transactions open for an unusual length of time?" / "Is this causing lock waits or log growth?" / "Can it be safely rolled back?"

Redo Log / Transaction Log · Traffic control recording every car movement

What it does: Records all changes for durability and recovery.

Problem in incident: Heavy write activity overwhelms logging. Logs become a bottleneck.

Effect (what you see): System slows under write load. Even simple operations delayed.

Technical effect:

Increased disk writes
Log flush contention
Transactions slowed waiting for log writes

What it means: Write throughput is limiting performance. System can't commit changes fast enough. Risk of cascading slowdown.

Analogy: Cars must stop at a checkpoint before continuing.

Incident signals:

High write latency
Disk pressure
Slow commits

IC Questions: "Is write volume high?" / "Any long transactions?" / "Is disk under pressure?"

Bottleneck in Transaction Log · Single toll booth

Core understanding: All write operations must be recorded in the transaction log first. If the log can't keep up (slow disk or high write volume), everything slows down.

What it does: Ensures durability of writes.

Problem: Log becomes a bottleneck.

Effect (what you see): Slow transactions, connection buildup.

Technical effect:

Log write delays
Commit latency rises

What it means: Central write system is congested.

Analogy: Single toll booth causing traffic backup.

Incident signals: Log write waits, rising active sessions.

IC Questions: "Is disk slow?" / "Too many writes?"

Are Items Removed from Transaction Log? · Black box recorder

Core understanding: Completed transactions are not immediately removed. The log keeps them until it is safe to reuse the space — after checkpoints and/or log backups, depending on system.

What it does: Stores transaction history for recovery.

Problem: Log keeps growing.

Effect (what you see): Disk pressure.

Technical effect:

Entries retained until safe for recovery
Space reused later (not deleted immediately)

What it means: Log space is reused in a controlled way once it is safe, rather than deleted as each transaction completes.

Analogy: Black box recorder that overwrites old data later.

Incident signals: Log growth alerts.

IC Questions: "Are log backups running?" / "Any long transactions?"

Checkpoint vs Log Backup · Unloading truck vs clearing warehouse

Core understanding: Checkpoint writes data pages to disk for recovery. Log backup allows the transaction log to reuse space. They solve different problems — using the wrong one won't fix the issue.

What it does:

Checkpoint → flushes data pages to disk
Log backup → frees log space for reuse

Problem: Log growing unexpectedly.

Effect (what you see): Disk issues despite checkpoints running.

Technical effect:

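Under SQL Server's full or bulk-logged recovery model, a checkpoint does not truncate the log
A log backup is required before that log space can be reused (under the simple recovery model, a checkpoint is enough)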
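The difference between the two operations can be sketched with a toy log. Everything here is an assumption for illustration: `ToyTransactionLog` models a system where log space is only reusable after a log backup, and it ignores active transactions, which in a real system also pin log space.

```python
class ToyTransactionLog:
    """Toy model of a log whose space is only freed by log backups."""

    def __init__(self):
        self.records = []       # log records not yet reusable
        self.dirty_pages = 0    # changed data pages that exist only in memory

    def write(self, n: int) -> None:
        self.records.extend(["change"] * n)
        self.dirty_pages += n

    def checkpoint(self) -> None:
        self.dirty_pages = 0    # data pages flushed; log records are kept

    def log_backup(self) -> None:
        self.records.clear()    # backed-up records can now be reused


log = ToyTransactionLog()
log.write(1000)
log.checkpoint()
print("after checkpoint:", len(log.records), "log records,", log.dirty_pages, "dirty pages")
log.log_backup()
print("after log backup:", len(log.records), "log records")
```

The checkpoint empties the dirty page count but leaves the log at its full size; only the log backup shrinks the amount of log that must be retained.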

What it means: Wrong tool applied to the problem.

Analogy: Unloading a truck (checkpoint) vs clearing the whole warehouse (log backup).

Incident signals: Log growth despite checkpoints running.

IC Questions: "Are log backups configured?" / "What recovery model is set?"

Database Connections / Connection Pooling · Cars entering the city

What it does: Reuses a bounded set of DB connections and caps how many can be open at once.

Problem in incident: Too many connections or leaks.

Effect (what you see): Requests waiting or timing out.

Technical effect:

Connection pool exhausted
Requests queued before DB
Threads blocked waiting

What it means: System can't accept more work. Often caused by slow queries or leaks.

Analogy: Cars queued at city entrance.

Incident signals:

"Too many connections"
Timeouts
DB utilisation sometimes low (the limit being hit is the pool, not the database)

IC Questions: "Are we at max connections?" / "Are connections released?" / "What's holding them?"

Connection Pathway + Redo Log · Club capacity + slow bar

Core understanding: A client must connect before running queries. Write operations are logged first (redo/transaction log). If the system is slow, connections stay open longer and can hit limits.

What it does: Handles access and write durability.

Problem: Too many connections / slow commits.

Effect (what you see): Connection errors, requests rejected.

Technical effect:

Flow: Client → Connect → Limit check → Query → Execute → Log
Slow log → slow commits → connections pile up → limit hit

What it means: System saturated at entry or commit stage.

Analogy: Club at capacity with slow bar service — people can't get in or get stuck inside.

Incident signals: "Too many connections" error, rising active sessions.

IC Questions: "Are connections being released?" / "Where is the bottleneck?"

Query Timeout vs Connection Timeout · Order taking too long vs never getting a table

What it does: Two different timeout types that produce similar-looking errors but have different causes and fixes.

Problem in incident: Teams often conflate them — treating a connection timeout like a slow query problem, or vice versa. Diagnosing the wrong one wastes time.

Technical effect:

Query timeout: Connection was made, query started, but it ran too long — DB or app killed it. Cause: slow query, missing index, lock wait.
Connection timeout: App could not get a connection within the time limit — never reached a query. Cause: pool exhausted, DB overloaded, network issue.

Key distinction:

Query timeout → you got in, but service was too slow
Connection timeout → you never got a table
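The two timeouts can be told apart in code by where they are raised. The sketch below is an illustration under assumptions: a one-slot pool built on `queue.Queue` stands in for a real connection pool, SQLite stands in for the database, and `run_query` is a hypothetical helper. The query timeout is implemented with SQLite's progress handler, which aborts a statement when the handler returns a non-zero value.

```python
import queue
import sqlite3
import time

# A one-slot "pool" built on queue.Queue; a real pool library would differ.
pool = queue.Queue(maxsize=1)
pool.put(sqlite3.connect(":memory:"))


def run_query(sql: str, acquire_timeout: float, query_timeout: float) -> str:
    try:
        conn = pool.get(timeout=acquire_timeout)   # wait for a free slot
    except queue.Empty:
        return "connection timeout"                # never reached the database
    deadline = time.monotonic() + query_timeout
    # Called every 1000 SQLite VM instructions; non-zero aborts the statement.
    conn.set_progress_handler(lambda: int(time.monotonic() > deadline), 1000)
    try:
        conn.execute(sql).fetchall()
        return "ok"
    except sqlite3.OperationalError:               # here: "interrupted"
        return "query timeout"                     # connected, but ran too long
    finally:
        conn.set_progress_handler(None, 1000)      # remove the handler
        pool.put(conn)                             # always release the slot


SLOW_SQL = """
WITH RECURSIVE counter(n) AS (
    SELECT 1 UNION ALL SELECT n + 1 FROM counter WHERE n < 50000000
)
SELECT count(*) FROM counter
"""

print(run_query("SELECT 1", acquire_timeout=0.1, query_timeout=1.0))
print(run_query(SLOW_SQL, acquire_timeout=0.1, query_timeout=0.2))

held = pool.get()   # simulate another request holding the only slot
print(run_query("SELECT 1", acquire_timeout=0.1, query_timeout=1.0))
pool.put(held)
```

The third call fails even though the query itself is trivial, which is the signature of a connection timeout: the fix lies in whatever is holding the slots, not in the query that was rejected.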

Analogy: Query timeout = seated at a restaurant but your order never arrives. Connection timeout = no tables available, turned away at the door.

Incident signals: Error message wording — "query exceeded timeout" vs "connection timed out" / "could not acquire connection".

IC Questions: "What does the exact error say?" / "Did the connection succeed?" / "Is the pool full or are queries just slow?"

Temp Index Rebuild · Road maintenance during rush hour

What it does: Rebuilds or reorganises indexes.

Problem in incident: Happens during peak load. Competes for resources.

Effect (what you see): Sudden slowdown, increased I/O and CPU.

Technical effect:

Heavy disk usage
Temporary space consumption
Increased contention with live queries

What it means: Background work is stealing capacity from production traffic. Can trigger wider performance issues.

Analogy: Roadworks reducing available lanes.

Incident signals:

Maintenance job running
"tablespace is full" (possible)
Disk spikes

Key insight: Rebuilding creates a new index alongside the old one before swapping — temporarily doubling the storage needed. Disk full alerts during maintenance are often this, not a general storage leak.

IC Questions: "Any maintenance running?" / "Can we pause it?" / "Is disk space OK?" / "Was disk headroom checked before the job started?"

Resource Saturation (CPU / Disk / Memory) · City at full capacity

What it does: Provides compute and storage resources.

Problem in incident: System exceeds capacity.

Effect (what you see): Everything slows — no single clear cause.

Technical effect:

CPU maxed → slow processing
Disk maxed → slow reads/writes
Memory pressure → less caching

What it means: System overloaded. Needs load reduction or scaling.

Analogy: Entire city overwhelmed with traffic.

Incident signals:

High CPU / disk
System-wide latency

IC Questions: "Which resource is maxed?" / "Load spike or inefficiency?" / "Can we reduce load?"

Replication Lag · Branch office receiving yesterday's updates

What it does: Changes written to the primary database are replicated to read replicas, usually with a small delay.

Problem in incident: Lag grows — reads from replicas return stale data. Users see outdated results or inconsistencies.

Effect (what you see): Data appears to "go backwards" or users see different data depending on which replica they hit. May look like a bug rather than an infrastructure issue.

Technical effect:

Primary processes writes faster than replica can apply them
Replica falls behind — lag measured in seconds or minutes
Reads routed to replica return old data
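How quickly lag builds up follows from simple arithmetic on the two rates. This is a rough sketch assuming constant rates and a replica that started fully caught up; `seconds_behind` and `choose_read_target` are hypothetical helpers, and the numbers are invented.

```python
def seconds_behind(write_rate: float, apply_rate: float, duration_s: float) -> float:
    """Estimate replica lag after `duration_s` seconds of sustained load.

    write_rate: changes per second committed on the primary
    apply_rate: changes per second the replica can apply
    """
    if write_rate <= apply_rate:
        return 0.0                                    # replica keeps up
    backlog = (write_rate - apply_rate) * duration_s  # changes not yet applied
    return backlog / write_rate                       # age of the oldest unapplied change


def choose_read_target(lag_s: float, max_staleness_s: float) -> str:
    """Route reads to the primary when the replica is too far behind."""
    return "primary" if lag_s > max_staleness_s else "replica"


lag = seconds_behind(write_rate=1200, apply_rate=1000, duration_s=600)
print(f"lag after 10 minutes: {lag:.0f} s")
print("read from:", choose_read_target(lag, max_staleness_s=5))
```

A replica that is only about 17% slower than the primary's write rate ends up 100 seconds behind after ten minutes in this model, and the gap keeps growing until the write rate drops or the replica gets faster.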

Common causes: Heavy write load on primary, slow replica disk, long-running queries on replica blocking apply, network issues.

Analogy: Head office sends updates daily — branch office is working from yesterday's data.

Incident signals: Replication lag metric rising, user reports of stale data, replica behind primary by N seconds.

IC Questions: "What is current replica lag?" / "Are reads being routed to replicas?" / "Is write load on primary spiking?" / "Can we route reads to primary temporarily?"

Database Wallet · Secure key locker

What it does: A database wallet is a secure store for credentials, certificates, and encryption keys. Applications and databases retrieve passwords and keys from the wallet instead of having them exposed in plain-text config files or code.

Problem in incident: Wallet missing, corrupted, or inaccessible; wrong file permissions; expired certificates; config pointing to the wrong wallet path.

Symptoms:

Apps suddenly can't connect to the database
Authentication failures spike — often immediately after a deploy
Services fail on startup or restart

Technical effect: The system can't retrieve credentials or encryption material, so DB connections fail, TLS/SSL handshakes may fail, and authentication breaks even if the underlying credentials are correct.

What it means (IC interpretation): Likely a misconfiguration or dependency failure — not load-related. Often triggered by deployments, certificate rotation, or permission changes. The credentials themselves may be fine; it's access to them that has broken.

Analogy: A secure key locker for delivery drivers. Drivers (apps) don't carry keys themselves — they go to the locker to pick them up before each delivery. If the locker is locked, broken, or empty, no deliveries happen regardless of whether the drivers are available.

Incident signals: "Authentication failed" · "Cannot load wallet" · "Permission denied" · "SSL handshake failed" · Spike in connection errors immediately after a deploy

IC questions: Did anything change recently (deploy, config, cert rotation)? Is the wallet file path accessible from the service? Are file permissions correct? Has anything expired (certs/keys)? Is this affecting all services or just one?

Undo & Read Consistency (RAC) · Old maps for drivers

Core understanding: Oracle lets readers see a consistent past version of data using undo, even while writes are happening. In RAC, this consistency must work across multiple nodes, which adds coordination overhead.

What it does:

Stores before-images of data (undo)
Lets queries read a stable snapshot
Prevents read/write blocking

Problem in incident: Undo too small or overwritten; long queries need old data that no longer exists; RAC adds delay due to cross-node access.

Effect (what you see): "Snapshot too old" query failures; sudden query slowdowns; intermittent errors on long-running reports.

Technical effect: Required undo data no longer available, or slow retrieval across RAC nodes.

What it means: Capacity issue (undo too small) or workload mismatch (long queries vs high churn). In RAC, could also be inter-node latency.

Analogy: Cars (queries) need a map of the road from 5 minutes ago. Old maps (undo) keep getting thrown away. If the map is gone, the driver gets lost — query fails.

Incident signals: "snapshot too old" (ORA-01555) errors; long-running queries failing; spikes in undo usage; RAC: interconnect latency warnings.

IC Questions: Are queries long-running? Has data change rate increased? Any recent batch jobs? Is this happening across all RAC nodes or one?

Memory Architecture (SGA/PGA, RAC) · Kitchens with shared fridges

Core understanding: Oracle uses memory to cache data and speed up queries. In RAC, each node has its own memory but must share data via interconnect — the "pinging" problem.

What it does:

SGA = shared memory (data cache, SQL cache)
PGA = private per-process memory (sorts, hash joins, session state)
Reduces disk I/O by caching hot data

Problem in incident: Memory pressure (too many queries); cache inefficiency; RAC blocks constantly moving between nodes.

Effect (what you see): High latency; high CPU; slow queries across cluster; sudden performance degradation.

Technical effect: Cache misses lead to more disk reads; RAC block transfer overhead between nodes ("gc" waits).

What it means: Resource contention (memory/CPU) or bad workload distribution across RAC. Often: too many queries, poor query patterns, or hot blocks bouncing between nodes.

Analogy: Each RAC node is a separate kitchen with its own fridge. If a chef needs something from another kitchen, they must run across the street. Too much running = everything slows down.

Incident signals: High CPU; high memory usage; RAC interconnect traffic spikes; "buffer busy waits" / "gc" waits.

IC Questions: Is load evenly distributed across nodes? Any spike in query volume? Are specific queries dominating? Is one node worse than others?

Undo + Memory Interaction (RAC) · Bridge congestion + roadworks

Core understanding: Undo and memory work together to serve consistent reads quickly. In RAC, this may involve remote memory access between nodes — heavy writes and long reads colliding causes compounding pressure.

What it does:

Memory serves cached data quickly
Undo reconstructs older versions for consistency
RAC shares both mechanisms across nodes

Problem in incident: Heavy writes + long reads + RAC traffic causes simultaneous contention and latency.

Effect (what you see): Cluster-wide slowdown; queries inconsistent in performance; timeouts; mixed symptoms (CPU + latency + errors).

Technical effect: Undo reconstruction + memory contention happening at the same time; inter-node block transfers compound both.

What it means: System under stress — multiple subsystems interacting badly. Often triggered by batch jobs or reporting running alongside heavy writes.

Analogy: Cars need old maps (undo). Roads are busy (writes). Cities are connected by bridges (RAC). Too many cars crossing bridges + changing roads = gridlock.

Incident signals: Mixed symptoms (CPU + latency + errors); RAC interconnect spikes; query variability; undo errors alongside memory pressure.

IC Questions: What changed? (batch job, release) Is this cluster-wide? Are reads and writes colliding at the same time?

Seeded Reports · City-wide traffic map

Core understanding: A seeded report is a pre-built, default report that ships with a system. Designed for common use cases — not tailored to your specific environment or incident needs.

What it does: Provides standard visibility into data (performance, usage, sales) without requiring a custom build.

Problem in incident: Seeded reports often lack the detail, speed, or focus needed during an active incident.

Effect (what you see):

Missing key data you need right now
Reports too slow to load
Data feels generic — "nothing looks wrong"
Teams say "the report looks fine" but users are impacted

Technical effect: Queries are broad and inefficient; not optimised for real-time debugging; may miss critical filters or dimensions (specific customer, query, endpoint).

What it means (IC interpretation): Observability gap. You're relying on generic tooling instead of targeted insight — this slows decision-making and prolongs the incident.

Analogy: A city-wide traffic map. It shows "traffic looks normal overall" — but your incident is a single blocked lane on one street. You need a zoomed-in camera, not a general map.

Incident signals:

"Dashboard shows normal but users report slowness"
"Report takes too long to generate"
"No visibility into specific query / user / service"
Conflicting statements between teams

IC Questions: "Do we have a more granular or real-time view?" / "Can we filter to affected users or endpoints?" / "Is this report cached or delayed?" / "Who can run a targeted query or log search instead?"

Real-world example — Top Customers Report: A classic seeded report you'll find pre-installed in many systems:

SELECT
    customer_id,
    SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;

This query shows your top 10 customers by spending — a common business report that ships by default. It's useful day-to-day, but during an incident it tells you almost nothing: it doesn't filter by time window, affected region, or error type. You'd need a targeted query scoped to the problem instead.

Where seeded reports appear:

ERP systems (Oracle, SAP) — pre-built operational reports
CRM tools — customer activity and pipeline summaries
Internal dashboards — aggregate health views used by on-call
BI tools (connected to MySQL / Postgres) — standard metric views

1 · How a Query Travels Through the Database

⚡ Where things go wrong at each stage

TCP connect	Network issue, firewall, DB down
Authentication	Wrong creds, wallet inaccessible, cert expired
Session / pool	Pool exhausted → connection timeout
max_connections	Too many open sessions → rejected requests
Optimize	Stale stats → bad plan → full table scan
Execute	Lock wait, missing index, slow query
Redo log	Disk bottleneck → slow commits → sessions pile up
Close / release	Connection leak → pool never freed

🔍 First questions as IC

Connection issue? Check pool exhaustion, "Too many connections" error
Auth issue? Recent deploy? Wallet path / certs / permissions changed?
Slow query? Slow query log on? Indexes being used? EXPLAIN output?
Blocked? Long transaction holding locks? Schema change running?
Write lag? Disk I/O high? Redo log flush contention?
Resource? CPU / Disk / Memory — which one is maxed?

Key principle: Distinguish "blocked" (low CPU, queries waiting) from "overloaded" (high CPU, everything slow). They look similar but have different fixes.

2 · How Memory Works (Buffer Pool & Cache)

✅ Cache hit (good)

Data already in RAM. Served without touching disk. As a rule of thumb for OLTP workloads, a cache hit ratio above 99% is healthy and below 95% is a warning sign.

Grabbing from shelf

⚠️ Cache miss (costly)

Data not in RAM, so it must be read from disk, which is orders of magnitude slower than a memory read. Looks like slow queries even with good plans.

Cause: Working set larger than buffer pool, or memory pressure evicting pages.

Trip to distant warehouse

🔧 IC checks

Cache hit ratio dropping?
Memory utilisation high on DB host?
Has working data set grown?
Buffer pool size recently reduced?

Distinguish from disk bottleneck: pool problem = too many reads that should have been served from RAM.

3 · How a Slow Query Happens

🗂️ Why an index stops working

Fragmentation: inserts/updates/deletes scatter pages
Stale statistics: optimizer misjudges row count, picks wrong plan
Poor selectivity: column has few unique values (e.g. status Y/N)
Function on column: WHERE YEAR(date_col) = 2024 cannot use a plain index on date_col; a range such as date_col >= '2024-01-01' AND date_col < '2025-01-01' can

Library catalog: exists but outdated
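The "function on column" case can be reproduced in a few lines. This sketch uses SQLite through Python's standard library as a stand-in, since it needs no server; the plan wording differs from MySQL's EXPLAIN, `strftime` stands in for `YEAR()`, and the `orders` table and `idx_orders_created` index are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_orders_created ON orders (created)")


def plan(sql: str) -> str:
    """Return SQLite's plan description for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[3] for row in rows)   # 4th column holds the detail text


# Wrapping the indexed column in a function hides it from the index.
wrapped = plan("SELECT * FROM orders WHERE strftime('%Y', created) = '2024'")

# The same filter written as a range on the bare column can use the index.
ranged = plan(
    "SELECT * FROM orders WHERE created >= '2024-01-01' AND created < '2025-01-01'"
)

print("function on column:", wrapped)
print("range on column:   ", ranged)
```

The first plan reports a scan of the whole table and the second a search using the index, although both queries ask for the same rows. The exact text of the plan lines varies between SQLite versions.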

📊 Slow query signals

slow query log spiking · high CPU on DB · EXPLAIN type:ALL · db file scattered read

EXPLAIN the query. type:ALL = full scan. type:ref/range = index used.

🔧 IC actions

Check slow query log for offenders
Run EXPLAIN — identify full scans
Are statistics up to date?
Has data volume grown recently?
Any code deploy or query change?
Is an index rebuild running (competing I/O)?

4 · How Locking Works (Row Lock → Deadlock → Metadata Lock)

🔒 Row Lock

Locks specific rows during an update. Others needing the same rows must wait. A write always blocks another write to the same row. MVCC prevents read blocks in most DBs.

enq: TX - row lock contention

One lane closed due to an accident

🔄 Deadlock

T1 holds Row A, wants Row B. T2 holds Row B, wants Row A. Circular — neither moves. DB kills one (the "victim"). May trigger retry storm.

deadlock errors in logs · rolled-back transactions

Two cars blocking each other at a junction

🚫 Metadata Lock

DDL (ALTER TABLE) locks the entire table structure. All queries queue instantly. CPU stays low — blocked, not overloaded.

low CPU, high wait · queries in "Waiting for table metadata lock"

Entire road shut down

BLOCKING CHAIN — how one transaction freezes the system

T1 ACTIVE — holds lock → blocks → T2 WAITING → blocks → T3 WAITING → blocks → T4+ WAITING — chain grows…

Fix: Kill T1 to unblock the entire chain. System is blocked, not overloaded. Once T1's rollback completes, its locks are released and the waiting transactions can proceed.
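Finding the head of a chain is a matter of following "blocked by" links until they stop. A minimal sketch, assuming the blocking relationships have already been collected into a dictionary; `head_blocker` and the session names are hypothetical.

```python
from typing import Dict


def head_blocker(session: str, blocked_by: Dict[str, str]) -> str:
    """Follow the blocked_by links from a waiting session to the chain's head."""
    seen = {session}
    while session in blocked_by:
        session = blocked_by[session]
        if session in seen:
            raise ValueError("cycle found: this is a deadlock, not a chain")
        seen.add(session)
    return session


blocked_by = {"T2": "T1", "T3": "T2", "T4": "T3"}
print("T4 is ultimately blocked by", head_blocker("T4", blocked_by))
```

Killing an intermediate session such as T3 would only free T4, which would then queue behind T2. The useful target is the session that is blocking others without waiting on anything itself.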

5 · How Writes Are Committed (Redo / Transaction Log)

📋 Redo log key facts

What it records	Every write before commit
Why it exists	Durability — recover after crash
Bottleneck sign	High write latency, slow commits
Cascade effect	Slow log → slow commits → pool fills
Long transactions	Hold log space — prevent truncation → log grows

⚖️ Checkpoint vs Log Backup

Checkpoint	Flushes data pages to disk for crash recovery
Log backup	Frees log space for reuse
Common mistake	Running checkpoint when log grows — won't help
Fix for log growth	Run log backup, kill long transactions

Checkpoint = unload truck · Log backup = clear warehouse

6 · How Connection Pool Exhaustion Happens

⛔ Pool exhausted

All slots taken. New requests queue then time out. DB may not be overloaded — just at its connection limit.

🕳️ Connection leak

Connections opened but never closed. Pool slowly fills. Typically caused by error paths that skip close(), or by ungraceful app restarts that leave sessions open on the server.
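An error path that skips the release is easy to write by accident. The sketch below contrasts a leaky request with one that releases in a `finally` block. It assumes a toy pool built on `queue.Queue` with SQLite connections; `leaky_request`, `safe_request`, and `borrowed_connection` are hypothetical names.

```python
import queue
import sqlite3
from contextlib import contextmanager

POOL_SIZE = 2
pool = queue.Queue(maxsize=POOL_SIZE)
for _ in range(POOL_SIZE):
    pool.put(sqlite3.connect(":memory:"))


def leaky_request() -> None:
    conn = pool.get(timeout=0.1)
    conn.execute("SELECT * FROM missing_table")  # raises before the line below
    pool.put(conn)                               # skipped on error: slot is leaked


@contextmanager
def borrowed_connection():
    conn = pool.get(timeout=0.1)
    try:
        yield conn
    finally:
        pool.put(conn)   # runs on success and on error


def safe_request() -> None:
    with borrowed_connection() as conn:
        conn.execute("SELECT * FROM missing_table")


for request in (safe_request, leaky_request):
    try:
        request()
    except sqlite3.OperationalError:
        pass             # the query fails in both cases; only cleanup differs
    print(request.__name__, "-> free slots:", pool.qsize())
```

Both requests hit the same error, but only the leaky one leaves the pool a slot short. Repeat that often enough and the pool is exhausted while the database itself is idle.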

⏱️ Timeout types

Query timeout	Got in, query too slow, killed
Conn timeout	Never got a slot, rejected

Seated but slow vs turned away at door

🔧 IC questions

At max_connections?
Slow queries holding slots?
Connection leak suspected?
Can app layer restart to release?

7 · How an Incident Cascades

A single root cause often triggers a cascade. Recognising the chain tells you where to intervene.

Stale statistics / missing index → Optimizer picks full table scan

Queries run 400× slower → DB threads held open for much longer

Connection pool fills up → New requests can't get a connection

"Too many connections" error → Application layer throws 500s

Users see full outage → Root cause: one missing/broken index

IC insight: Don't just fix the symptom (restart app / increase max_connections). Trace back to root cause — otherwise it recurs. Common chain: index issue → slow queries → connection exhaustion → 500s.
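The step from "slow queries" to "pool fills" follows from Little's law: average connections in use equals request rate multiplied by how long each request holds a connection. A quick worked sketch, where the pool size, request rate, and baseline query time are all assumed numbers and only the 400× factor comes from the chain above:

```python
def connections_in_use(requests_per_s: float, query_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return requests_per_s * query_seconds


POOL_SIZE = 100          # assumed pool limit
REQUESTS_PER_S = 200.0   # assumed steady request rate
FAST_QUERY_S = 0.005     # assumed 5 ms per query with a working index
SLOWDOWN = 400           # slowdown factor from the chain above

for label, duration in (("fast", FAST_QUERY_S), ("slow", FAST_QUERY_S * SLOWDOWN)):
    needed = connections_in_use(REQUESTS_PER_S, duration)
    status = "ok" if needed <= POOL_SIZE else "pool exhausted"
    print(f"{label}: ~{needed:.0f} connections needed -> {status}")
```

Traffic did not change between the two cases. The same request rate needs about one connection when queries are fast and about 400 when they are slow, which is why raising max_connections only buys time.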

🔗 Other common cascades

Long transaction → row locks → blocking chain → throughput drops
Disk I/O saturation → redo log slow → commits slow → pool fills
Schema change (MDL) → instant table lock → all queries queue
Index rebuild at peak → heavy extra disk I/O and temporary space use → slow queries → cascade above

📡 Replication lag

Heavy primary writes outpace replica's apply speed. Reads return stale data — looks like a bug, not infrastructure.

replica behind by N seconds · stale data reports

Quick fix: Route reads to primary. Root fix: reduce write load or increase replica resources.

🗄️ Resource saturation

One or more of CPU, disk, and memory hit their limits. When several saturate together, everything degrades with no single clear cause.

CPU >90%	Query processing starved
Disk I/O >85%	All reads/writes slow
Memory >85%	Buffer pool evicted → more disk reads

Entire city overwhelmed with traffic

8 · Quick Reference — Symptom → Likely Cause

What you see	Likely cause
"Too many connections"	Pool exhausted (slow queries / leak)
Gradual slowdown, high CPU	Full table scan / missing index
Sudden freeze, low CPU	Metadata lock (schema change)
Localised queries waiting	Row lock contention
Deadlock errors in logs	Circular lock dependency
High disk I/O, slow commits	Redo log bottleneck
Log growing despite checkpoints	No log backup / long transaction
Auth failures after deploy	Wallet inaccessible / cert expired
Stale / inconsistent data	Replication lag
Disk spike during maintenance	Index rebuild (temp double storage)

🚦 Universal IC triage order

Identify scope — all users or subset? One service?
Check what changed — deploy, migration, job, config?
Blocked vs overloaded? — low CPU + waits = blocked; high CPU = overloaded
Find the head of the chain — what is T1 / the root blocker?
Kill or pause — remove the blocker; monitor for recovery
Root cause, not symptom — so it doesn't immediately recur

💬 Useful MySQL commands

Active sessions	`SHOW PROCESSLIST`
InnoDB locks	`SHOW ENGINE INNODB STATUS`
Query plan	`EXPLAIN SELECT ...`
Kill session	`KILL [process_id]`
Replication lag	`SHOW REPLICA STATUS`
Slow query log	`SHOW VARIABLES LIKE 'slow%'`

DNS Record Types Contact list with routing rules

Core understanding: DNS isn't just "name → IP." It stores different record types that control where traffic goes and how services are discovered.

What it is: A distributed directory with multiple record types, each serving a different routing purpose.

Key records:

A → domain → IPv4 (most common)
AAAA → domain → IPv6
CNAME → alias (domain points to another domain)
MX → mail routing
TXT → verification / policies (SPF, DKIM)
NS → which DNS servers are authoritative

Problem in incident: Wrong IP in A record · broken CNAME chain · missing or incorrect records

Effect (what you see): Users routed to wrong server · partial outages · some services work, others fail

Technical effect: DNS resolves — but to the wrong destination

What it means: Misconfiguration, not outage — traffic is flowing, but incorrectly

Analogy: Contact list with wrong phone numbers or forwarding rules

Incident signals:

Traffic hitting wrong servers
Sudden shift in traffic patterns
"It works for some domains but not others"

IC questions: "What record changed?" / "Are we resolving to the expected IP?" / "Is there a CNAME chain involved?"

Pattern: Traffic going somewhere wrong → think DNS misconfiguration
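
To make "broken CNAME chain" concrete, here is a small sketch that follows aliases through an in-memory record table. The zone data, hostnames, and function name are hypothetical, and only A and CNAME records are modelled; a real check would query a resolver instead.

```python
from typing import Dict, List, Tuple

# Hypothetical zone data: name -> (record type, value).
RECORDS: Dict[str, Tuple[str, str]] = {
    "www.example.com": ("CNAME", "app.example.com"),
    "app.example.com": ("CNAME", "lb.example.net"),
    "lb.example.net": ("A", "203.0.113.10"),
    "old.example.com": ("CNAME", "gone.example.net"),  # target has no record
}


def follow_chain(name: str, records: Dict[str, Tuple[str, str]],
                 max_hops: int = 8) -> Tuple[str, List[str]]:
    """Follow CNAMEs starting at name; return (result, path).

    result is the final IPv4 address, "BROKEN" if a hop has no record,
    or "LOOP" if the chain revisits a name or exceeds max_hops.
    """
    path = [name]
    seen = {name}
    current = name
    for _ in range(max_hops):
        record = records.get(current)
        if record is None:
            return "BROKEN", path
        rtype, value = record
        if rtype == "A":
            return value, path
        # Anything else is treated as a CNAME in this sketch.
        if value in seen:
            return "LOOP", path + [value]
        seen.add(value)
        path.append(value)
        current = value
    return "LOOP", path


if __name__ == "__main__":
    print(follow_chain("www.example.com", RECORDS))
    print(follow_chain("old.example.com", RECORDS))
```

Comparing the final address against the expected IP answers the IC question "Are we resolving to the expected IP?".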

TTL & Propagation Old maps still in circulation

Core understanding: DNS changes are not instant — TTL (Time To Live) controls how long old answers stay cached by resolvers across the internet.

What it does: TTL determines how long a resolver caches a DNS answer before it re-queries the authoritative server.

Problem in incident: Old records still cached · some users see new config, others see old

Effect (what you see): "Works for me but not others" · gradual recovery · region-dependent behaviour

Technical effect: Different resolvers return different answers — inconsistent global state

What it means: Not a failure — the change is still propagating. Expected behaviour after a DNS update.

Analogy: Old maps still being used while new maps are being distributed

Incident signals:

Mixed behaviour across regions or users
Gradual improvement over time after a DNS change
"Some users fixed, others still broken"

IC questions: "What is the TTL?" / "When was the change made?" / "Are caches cleared?"

Pattern: Inconsistent behaviour after a DNS change → think TTL propagation delay
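
The questions "What is the TTL?" and "When was the change made?" combine into a simple worst-case estimate. The sketch below assumes resolvers honour the TTL that was on the old record at the time of the change; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def latest_stale_answer(change_time: datetime, old_ttl_seconds: int) -> datetime:
    """Latest moment a TTL-honouring resolver could still serve the old record.

    A resolver that cached the old answer just before the change keeps it
    for the full TTL, so the worst case is change_time + TTL.
    """
    return change_time + timedelta(seconds=old_ttl_seconds)


def still_propagating(now: datetime, change_time: datetime,
                      old_ttl_seconds: int) -> bool:
    """True while some caches may legitimately still hold the old answer."""
    return now < latest_stale_answer(change_time, old_ttl_seconds)


if __name__ == "__main__":
    changed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(latest_stale_answer(changed, 3600))
```

If users are still broken well after that time, the cause is probably not TTL alone.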

TCP vs UDP Registered mail vs postcards

Core understanding: TCP and UDP are two transport protocols — reliable vs fast. Knowing which one your traffic uses changes how you diagnose failures.

TCP (Transmission Control Protocol): Reliable, ordered, connection-based · used by HTTP/S, MySQL · retries automatically · guaranteed delivery

UDP (User Datagram Protocol): Fast, no guarantees, connectionless · used by DNS, streaming, VoIP · sends and forgets — no retry built in

Problem in incident:

TCP: congestion, connection limits, slow under load
UDP: silent drops, hard-to-detect failures, no error trail

Effect (what you see): TCP issues → timeouts, slow apps · UDP issues → intermittent failures, missing responses

What it means: TCP problems = congestion or capacity · UDP problems = loss or instability

Analogy: TCP = registered mail (guaranteed delivery) · UDP = postcards (fast but may get lost)

Incident signals:

TCP: high latency, connection timeouts
UDP: missing responses, intermittent failures, no error logs

IC questions: "Is this TCP or UDP traffic?" / "Do we see retries or silent drops?" / "Is reliability or speed more critical?"

Pattern: Silent failures with no error logs → think UDP packet loss

TCP Handshake & Connection Lifecycle Knocking on a door that won't answer

Core understanding: Before any data flows, TCP must establish a connection via a 3-step handshake. If this fails, no requests can be processed at all.

The handshake: SYN → SYN-ACK → ACK

Problem in incident: Handshake fails or is delayed · SYN queue fills up · server cannot accept new connections

Effect (what you see): Connection timeouts · users can't connect · errors appear before any request is sent

Technical effect: Entry point is saturated — the problem is at the door, not inside the application

What it means: Often load-related or an attack — not an application bug

Analogy: Knocking on a door but no one answers — the house is overwhelmed before anyone can get inside

Incident signals:

SYN backlog warnings
High connection attempt counts
Timeouts before any request data is exchanged

IC questions: "Are connections failing before requests?" / "Is the SYN queue full?" / "Is this a traffic spike or an attack?"

Pattern: Fails before any request is processed → think TCP handshake saturation
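
One way to test "are connections failing before requests?" is to attempt the handshake alone and send no application data. This is a minimal sketch using Python's standard `socket` module; the function name and result labels are assumptions for illustration.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Attempt a TCP handshake only; no application data is sent.

    Returns one of: "handshake_ok", "refused", "timeout", "other_error".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "handshake_ok"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on that port.
        return "refused"
    except socket.timeout:
        # No answer in time: consistent with dropped SYNs or a full backlog.
        return "timeout"
    except OSError:
        return "other_error"


if __name__ == "__main__":
    # 127.0.0.1:443 is only an example target.
    print(probe_tcp("127.0.0.1", 443, timeout=1.0))
```

A "timeout" here, while the application itself is idle, fits the pattern above; "handshake_ok" followed by slow responses points past the door and into the application.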

Retransmissions & Congestion Traffic jam where cars keep re-entering

Core understanding: When TCP packets are lost, they are automatically retransmitted. Under high load, this creates a congestion feedback loop — more retransmits = more traffic = worse congestion.

What it does: TCP guarantees delivery by resending lost packets — but each resend adds to overall traffic load.

Problem in incident: High retransmission rate · congestion builds · performance degrades progressively under sustained load

Effect (what you see): Slow responses · latency climbing · throughput dropping under load

Technical effect: More traffic → more loss → more retransmits → worse performance (self-reinforcing loop)

What it means: Network degradation spiral — not a full outage, but worsening performance under load

Analogy: Traffic jam where cars keep re-entering — clearing gets harder the more vehicles try to pass

Incident signals:

Retransmission rate climbing
Latency increasing over time
Throughput dropping under load

IC questions: "Are retransmissions increasing?" / "Is packet loss present?" / "Where is the congested link?"

Pattern: Progressive slowdown under load + rising retries → think TCP congestion loop
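
"Are retransmissions increasing?" can be answered from two counter snapshots. The sketch below assumes the counters are the `OutSegs` and `RetransSegs` values that Linux exposes in the Tcp section of `/proc/net/snmp`; how the snapshots are collected is left out, and the function names are illustrative.

```python
from typing import Dict, Sequence


def retransmit_ratio(prev: Dict[str, int], curr: Dict[str, int]) -> float:
    """Ratio of retransmitted segments to segments sent between two snapshots."""
    sent = curr["OutSegs"] - prev["OutSegs"]
    retrans = curr["RetransSegs"] - prev["RetransSegs"]
    if sent <= 0:
        return 0.0  # nothing sent in the interval (or a counter reset)
    return retrans / sent


def is_climbing(ratios: Sequence[float]) -> bool:
    """True if every sample is higher than the one before it."""
    if len(ratios) < 2:
        return False
    return all(later > earlier for earlier, later in zip(ratios, ratios[1:]))


if __name__ == "__main__":
    before = {"OutSegs": 1000, "RetransSegs": 10}
    after = {"OutSegs": 2000, "RetransSegs": 60}
    print(retransmit_ratio(before, after))
```

A ratio that keeps climbing across intervals while throughput drops matches the self-reinforcing loop described above.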

Kafka Model Multi-lane highway

Core understanding: Kafka is a distributed message bus. Producers write to topics, which are split into partitions for parallelism. Consumer groups read partitions independently — each partition is owned by one consumer in the group at a time.

Key concepts:

Producer — publishes messages to a topic
Topic — a named stream, split into partitions for throughput
Partition — ordered log; one consumer per group handles each partition
Consumer Group — consumers sharing the work; each partition assigned to one member
Offset — the consumer's position in the log; tracks how far behind it is
Broker — server holding partitions; one broker per partition acts as leader

Analogy: Multi-lane highway — messages are cars, partitions are lanes, consumer groups are independent fleets. A blocked lane affects only the consumers using it.

IC relevance: Kafka sits between services. Problems here cause downstream processing to stop silently — no application errors until the queue backs up visibly. Always check lag metrics before assuming the consuming app is healthy.

Consumer Group Lag Falling behind on the highway

What it is: The gap between the latest message written to a partition and where the consumer has read to. Lag = unconsumed messages accumulating.

Signals:

Lag metric rising continuously
Consumers appear healthy but processing is slow
Downstream services receive events late or in bursts
Alerts on consumer_group_lag or records_lag

Common causes: Slow consumer processing logic · insufficient consumer instances · a stuck or crashed consumer holding a partition · sudden producer spike

IC actions:

Check lag metrics per consumer group and per partition — is it one partition or all?
Identify stuck or slow consumers — is one consumer responsible?
Scale out consumers (more instances = more partitions processed in parallel, up to one consumer per partition; extra consumers sit idle)
Determine trend: lag growing, stable, or recovering?

Pattern: Lag growing + consumers healthy → slow processing logic or stuck consumer. Lag spike + producer spike → transient burst, may self-recover. Lag on one partition only → single consumer issue.
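
Lag per partition is the latest offset written minus the offset the group has committed. The sketch below works on plain dictionaries of offsets; how those offsets are fetched, the function names, and the 1000-message threshold are all assumptions for illustration.

```python
from typing import Dict


def partition_lag(end_offsets: Dict[int, int],
                  committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log end offset - committed offset.

    A partition with no committed offset is treated as committed at 0,
    which is a simplifying assumption of this sketch.
    """
    return {p: max(end - committed.get(p, 0), 0)
            for p, end in end_offsets.items()}


def lag_scope(lag: Dict[int, int], threshold: int = 1000) -> str:
    """Answer "is it one partition or all?" for a hypothetical threshold."""
    lagging = [p for p, n in lag.items() if n > threshold]
    if not lagging:
        return "healthy"
    if len(lagging) == len(lag):
        return "all-partitions"
    if len(lagging) == 1:
        return "single-partition"
    return "some-partitions"


if __name__ == "__main__":
    lag = partition_lag({0: 5000, 1: 5200, 2: 9000},
                        {0: 4990, 1: 5195, 2: 3000})
    print(lag, lag_scope(lag))
```

"single-partition" suggests a stuck consumer on that partition; "all-partitions" suggests slow processing logic or a producer spike.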

Broker & Partition Failure Lane closure

What it is: Each partition has a leader broker. If that broker fails, partition leadership must be re-elected before producers and consumers can resume on those partitions.

Signals:

Producer errors: LEADER_NOT_AVAILABLE or NOT_LEADER_FOR_PARTITION
Consumers stop receiving messages on affected partitions
Alert on under-replicated partitions (should always be 0 in steady state)
Broker removed from cluster health view

Common causes: Broker disk full · broker OOM or crash · network partition isolating a broker · replication factor too low (no replica to elect)

IC actions:

Check broker health across all nodes in the cluster
Check under-replicated partition count — non-zero means data risk
Allow Kafka to auto-elect a new partition leader (usually seconds)
Investigate root cause on the failed broker before bringing it back

Pattern: Partial message loss or processing gap → broker failure. Under-replicated partitions → replication issue or broker degraded. Partition fully unavailable → every in-sync replica for that partition lost.
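
"Under-replicated" means a partition has fewer in-sync replicas than assigned replicas. The sketch below applies that definition to a hand-built snapshot; the data structure, topic names, and function names are hypothetical, and a real check would read this state from cluster metadata.

```python
from typing import Dict, List, NamedTuple


class PartitionState(NamedTuple):
    replicas: List[int]  # broker ids assigned to hold this partition
    isr: List[int]       # broker ids currently in sync with the leader


def under_replicated(partitions: Dict[str, PartitionState]) -> List[str]:
    """Partitions with fewer in-sync replicas than assigned replicas."""
    return sorted(name for name, state in partitions.items()
                  if len(state.isr) < len(state.replicas))


def offline(partitions: Dict[str, PartitionState]) -> List[str]:
    """Partitions with no in-sync replica left, so no leader can be elected."""
    return sorted(name for name, state in partitions.items() if not state.isr)


if __name__ == "__main__":
    snapshot = {
        "orders-0": PartitionState(replicas=[1, 2, 3], isr=[1, 2, 3]),
        "orders-1": PartitionState(replicas=[1, 2, 3], isr=[2]),
        "orders-2": PartitionState(replicas=[3], isr=[]),
    }
    print(under_replicated(snapshot), offline(snapshot))
```

The third partition in the example shows the "replication factor too low" cause: with a single replica, losing one broker leaves nothing to elect.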

RabbitMQ Model Postal sorting office

Core understanding: RabbitMQ is a message broker using a push model. Producers publish to an exchange, which routes messages to queues based on binding rules. Consumers subscribe to queues and the broker pushes messages to them. Unlike Kafka, messages are deleted once acknowledged; there is no persistent log.

Key concepts:

Producer — publishes messages to an exchange with a routing key
Exchange — routes messages to queues based on type and binding key
Queue — holds messages until a consumer processes and acknowledges them
Consumer — connects to a queue, processes messages, sends ACK to remove them
Dead-Letter Queue (DLQ) — receives messages that fail, expire, or are rejected
Prefetch — how many unacknowledged messages a consumer can hold at once

Exchange types: Direct — exact key match · Fanout — broadcast to all bound queues · Topic — wildcard pattern match · Headers — match on message attributes

Analogy: Postal sorting office — producer drops a parcel (message) with an address label (routing key). The sorting machine (exchange) reads the label and drops it in the right bin (queue). The delivery driver (consumer) collects from the bin and signs for it (ACK). Failed deliveries go to the returns pile (DLQ).

IC relevance: Problems show as queue depth growing, DLQ filling, or consumer connections dropping. The exchange layer is invisible to most monitoring — routing misconfigurations silently send messages to the wrong queue.
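
The routing rules for direct, fanout, and topic exchanges can be simulated in a few lines, which helps when reasoning about where a message should have gone. This is a simplified sketch, not broker code: queue names and bindings are hypothetical, headers exchanges are left out, and topic matching follows the AMQP convention that `*` matches exactly one word and `#` matches zero or more words.

```python
from typing import List, Tuple

Binding = Tuple[str, str]  # (queue name, binding key)


def topic_matches(pattern: str, key: str) -> bool:
    """Topic match on dot-separated words: '*' = one word, '#' = zero or more."""
    def match(p: List[str], k: List[str]) -> bool:
        if not p:
            return not k
        if p[0] == "#":
            # '#' absorbs zero words, or one word while staying active.
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        if p[0] == "*" or p[0] == k[0]:
            return match(p[1:], k[1:])
        return False
    return match(pattern.split("."), key.split("."))


def route(exchange_type: str, bindings: List[Binding],
          routing_key: str) -> List[str]:
    """Return the queues that would receive a message with routing_key."""
    if exchange_type == "fanout":
        return [queue for queue, _ in bindings]
    if exchange_type == "direct":
        return [queue for queue, key in bindings if key == routing_key]
    if exchange_type == "topic":
        return [queue for queue, key in bindings
                if topic_matches(key, routing_key)]
    raise ValueError(f"unsupported exchange type: {exchange_type}")


if __name__ == "__main__":
    bindings = [("billing", "orders.*.created"), ("audit", "orders.#")]
    print(route("topic", bindings, "orders.eu.created"))
    print(route("topic", bindings, "payments.failed"))
```

An empty result is the silent failure mode: a message that matches no binding raises no error for the producer unless the publisher or exchange is configured to handle unroutable messages.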

Dead-Letter Queue Saturation Returns pile overflowing

What it is: A Dead-Letter Queue (DLQ) receives messages that cannot be processed — due to repeated failures, TTL expiry, or explicit rejection. When the root cause isn't fixed, the DLQ grows without bound.

Signals:

DLQ depth metric climbing continuously
Consumer error rate elevated — NACKs or exceptions in logs
Upstream queue may appear healthy but messages are being lost silently to the DLQ
Memory pressure on the broker if DLQ is unbounded and large

Common causes: Application bug in consumer processing logic · schema mismatch (consumer can't parse message format) · downstream dependency the consumer calls is unavailable · message TTL set too low

IC actions:

Check DLQ depth and rate of growth — is it accelerating?
Read a sample message from the DLQ and inspect its content
Check consumer logs for the error being thrown on each failure
Fix the root cause first — clearing the DLQ without fixing the cause just refills it
Once fixed, replay DLQ messages in a controlled way (don't flood the queue)

Pattern: DLQ growing + consumer errors → processing bug or schema mismatch. DLQ growing + consumer healthy → TTL expiry or routing misconfiguration. DLQ suddenly growing + recent deploy → code change broke the consumer.
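
"Replay in a controlled way" usually means small batches with a pause between them. The sketch below shows that shape only; `publish` stands in for whatever client call republishes a message, and the batch size and pause are assumed values to tune against the consumer's real capacity.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def replay(messages: Iterable[T], publish: Callable[[T], None],
           batch_size: int = 100, pause_seconds: float = 1.0,
           sleep: Callable[[float], None] = time.sleep) -> int:
    """Republish messages batch by batch, pausing after each batch.

    Returns the number of messages published. sleep is injectable so the
    pacing can be tested without waiting.
    """
    sent = 0
    for batch in batches(messages, batch_size):
        for message in batch:
            publish(message)
            sent += 1
        sleep(pause_seconds)
    return sent


if __name__ == "__main__":
    republished: List[str] = []
    count = replay(["m1", "m2", "m3"], republished.append,
                   batch_size=2, pause_seconds=0.1)
    print(count, republished)
```

Watching queue depth and consumer error rate between batches gives an early stop signal if the root cause was not actually fixed.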

Consumer Connection Storm Revolving door jammed open

What it is: A large number of consumers repeatedly disconnect and reconnect in rapid succession, overwhelming the broker with connection state management. The broker spends more time handling connect/disconnect churn than delivering messages.

Signals:

Broker connection count spiking and thrashing (rapid up-down pattern)
High CPU on the broker despite low message throughput
Consumer application logs showing repeated connection errors and retries
Queue processing stalled even though consumers appear to be running

Common causes: Consumer crash loop (pod restarting repeatedly) · incorrect prefetch setting (consumer takes too many messages, times out, gets disconnected) · aggressive health-check misconfiguration forcing disconnections · network instability between consumer hosts and broker

IC actions:

Check broker connection count over time — is there a churn pattern?
Identify which consumer group or host is responsible for the churn
Check for crash loops: the RESTARTS column of `kubectl get pods`, or your process monitor
Check prefetch setting — a value too high causes slow ack, triggering disconnect
Isolate and restart affected consumer group; monitor stabilisation

Pattern: Connection churn + consumer crash loop → fix the crash cause (bad code, OOM, bad config). Connection churn + consumer healthy → prefetch misconfiguration or network instability. Broker CPU high with low message rate → connection management overhead, not processing load.

OSI Model 7-floor building

Core understanding: The OSI model gives you a shared language to pinpoint where a problem lives. Different layers are owned by different teams — knowing the layer tells you who to call.

Analogy: A 7-floor building. A fire on floor 3 is a different team's problem than a broken window on floor 7. You need to know which floor is burning before you radio anyone.

IC use: "Which layer is failing?" is the first isolation question. Failing before connection (L1–L4) is a network/infra problem. Failing after connection (L5–L7) is an app or security problem. Different layers mean different on-call groups.

Example — browser connects to company login page:

L7: Browser sends HTTPS GET. WAF inspects the request. App processes it.
L6: TLS encrypts/decrypts the payload between browser and server.
L5: Session is established and maintained between client and server.
L4: TCP connection on port 443. Firewall checks source IP and port.
L3: IP routing selects the path to the destination IP across the internet.
L2: Ethernet frames hop between switches. MACs used within each segment.
L1: Electrical, optical, or radio signal travels over the cable or Wi-Fi link.

Key distinction — Hub vs Switch: A Hub (L1) blindly repeats signals to all ports — it doesn't understand addresses. A Switch (L2) reads MAC addresses and forwards frames only to the correct port. If a switch fails, specific segments lose connectivity. If a hub fails, everything on that segment drops.

IC question: "Does the problem affect all hosts or just hosts in a specific segment?" — L1 vs L2 distinction. "Is routing broken?" — L3. "Is a port blocked?" — L4.

WAF vs Firewall Customs vs border fence

Core understanding: Both are security controls that block traffic — but they operate at entirely different layers, filter different things, and are owned by different teams. Knowing which one is blocking traffic determines who you call.

Key distinction: A Firewall says "I don't care what's in the parcel — I only care where it came from and which door it's heading to." A WAF opens the parcel and reads it — if it contains malicious content, it blocks the specific request, not the sender's entire address.

IC triage:

Whole IP/CIDR unreachable? → Check firewall rules (network team)
Specific HTTP requests returning 403, others fine? → Check WAF rules (security team)
All traffic through a port suddenly blocked? → Firewall rule change (network team)
New deploy causing request failures with no code error? → WAF may be matching a new payload pattern (security team)
Legitimate user traffic blocked after load spike? → WAF rate-limiting rule triggered (security team)

Common IC mistake: Assuming a 403 error is an application permission problem. It may be a WAF block — the app never even received the request. Check WAF logs before escalating to the app team.

Pattern: All requests blocked to an IP range → firewall. Only specific URL paths or payload patterns blocked → WAF. Sudden 403 spike after a deployment → WAF rule matched something in the new request format.

Why WAF comes before the firewall in modern cloud

The OSI comparison might suggest firewall (L4) sits in front of WAF (L7) because lower layers precede higher ones. In practice the order is the opposite — and for good reason.

WAF lives at the edge — it is typically part of the CDN or reverse proxy layer, closest to the internet. Application attacks (SQL injection, XSS, credential stuffing) are blocked there, before traffic ever enters the cloud network.
Early blocking saves compute — stopping a malicious request at the edge means the load balancer, firewall, and app tier never see it. Fewer resources consumed, lower blast radius.
Firewall/NSGs protect internal resources — once traffic passes the WAF and load balancer it enters a VCN (virtual cloud network). Firewalls and security groups here enforce zone-to-zone rules: which tier can talk to which, on which ports. They are not designed to inspect HTTP payloads.
Cloud providers separate edge security from network security — WAF/CDN is one product (e.g. OCI WAF, AWS WAF, Azure Front Door), firewalls/NSGs are another (e.g. OCI Security Lists, AWS Security Groups, Azure NSG). Different teams own each, different change-management processes apply.

What actually happens in modern cloud (OCI / AWS / Azure style): Internet → WAF (CDN / edge) → Load Balancer → Firewall / NSGs → App Tier

IC implication of this ordering: When a user reports they can't reach a service, the triage path follows this stack top-down. A block at the WAF produces a 403 and never reaches the load balancer. A firewall/NSG block causes a TCP timeout — no HTTP response at all. An app error produces a 5xx after a full connection is established. The failure signature tells you which layer to investigate first.

Why this matters for escalation: The WAF, the NSGs, and the app are each owned by a different team. Calling the wrong team wastes critical incident minutes. Match the symptom to the layer, then call the right team once.
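
The failure signatures described in this section can be encoded as a first-pass lookup. This sketch only restates the guide's own mapping; the function name and return strings are illustrative, and real incidents will have cases it does not cover.

```python
from typing import Optional


def first_layer_to_check(tcp_connected: bool,
                         http_status: Optional[int]) -> str:
    """Map an observed failure signature to the layer to investigate first.

    tcp_connected: whether a TCP connection was established at all.
    http_status:   the HTTP status received, or None if there was no response.
    """
    if not tcp_connected:
        # TCP timeout with no HTTP response at all.
        return "firewall/NSG (network team)"
    if http_status == 403:
        return "WAF (security team)"
    if http_status is not None and 500 <= http_status <= 599:
        return "app tier (app team)"
    return "no match: gather more signals"


if __name__ == "__main__":
    print(first_layer_to_check(tcp_connected=False, http_status=None))
    print(first_layer_to_check(tcp_connected=True, http_status=403))
```

The value is in forcing the two questions to be answered ("did TCP connect?" and "what status came back?") before anyone is paged.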

Physical Infrastructure Hardware Fundamentals

Every server, packet, and connection ultimately runs on physical hardware. When a networking problem can't be explained by software, config, or DNS, the answer may be at the physical layer — and physical failures are typically total, sudden, and clean-cut in monitoring.

Physical Server

A computer in a data centre. It has CPU, RAM, storage (disk/SSD), and one or more NICs. Physical problems — hardware failure, power loss, overheating — cause total server failure with no useful application-level error messages.

NIC — Network Interface Card

The hardware component connecting a server to the network. Operates at L1 (Physical) and L2 (Data Link) — handles electrical signals, MAC addresses, and frame transmission. A failed or misconfigured NIC means 100% packet loss for that server. NICs come in 1G, 10G, 25G, and 100G speeds; a speed mismatch with the switch port causes connectivity or performance problems.

Switch (Top-of-Rack / TOR)

Connects multiple servers in the same network segment. Operates at L2 — reads MAC addresses and forwards frames to the correct port. One TOR switch typically serves an entire rack. A switch failure takes down all servers in that rack simultaneously.

Fiber Optic Cable

Carries data as pulses of light. Used within data centres and between DCs. Much faster and longer-range than copper.

Multi-mode: Shorter distances (within a DC, up to ~300m). Wider core, multiple light paths.
Single-mode: Long distances (DC-to-DC, km scale). Narrower core, one light path. Used for backbone links.

A dirty fiber connector or bad end-face causes intermittent packet loss and CRC errors — frustrating to diagnose remotely because the link stays up but degrades unpredictably.

SFP — Small Form-factor Pluggable

A transceiver module plugged into a NIC or switch port to convert electrical signals to light for fiber connections. A failed SFP causes complete link loss on that port — from software, it looks exactly like the cable is unplugged.

IC Relevance — Scoping a Physical Fault

One server unreachable: NIC, its patch cable, the SFP, or the switch port it connects to
Whole rack unreachable: TOR switch failure or its uplink fiber
Multiple racks / a zone: Aggregation switch or inter-DC uplink fiber
Intermittent drops + CRC errors: Dirty fiber connector, failing SFP, or marginal cable — the link is up but unreliable

Key question for the DC team: "Has anyone done any cabling work, port moves, or hardware changes in that rack recently?"
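
Scoping a physical fault is mostly a grouping exercise: which racks do the unreachable hosts sit in, and are those racks entirely down? The sketch below assumes a host-to-rack inventory is available as a dictionary that contains every host; the names and labels are hypothetical.

```python
from typing import Dict, Iterable


def suspect_scope(unreachable: Iterable[str], rack_of: Dict[str, str]) -> str:
    """Classify a fault as server, rack, multi-rack, or partial.

    rack_of maps every known host to its rack (assumed complete).
    """
    down = set(unreachable)
    if not down:
        return "none"
    if len(down) == 1:
        return "server"  # NIC, patch cable, SFP, or switch port
    racks_hit = {rack_of[host] for host in down}
    fully_down = {
        rack for rack in racks_hit
        if all(host in down for host, r in rack_of.items() if r == rack)
    }
    if fully_down == racks_hit:
        if len(racks_hit) == 1:
            return "rack"        # TOR switch or its uplink
        return "multi-rack"      # aggregation switch or inter-DC uplink
    return "partial"             # scattered hosts: check each port or link


if __name__ == "__main__":
    inventory = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
    print(suspect_scope(["a1", "a2"], inventory))
```

A "partial" result is a hint to ask the DC team about recent cabling work rather than to suspect a single switch.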

Proxy vs Reverse Proxy Forward vs Reverse

A proxy is a server that sits between two parties in a network connection — either on behalf of the client (forward proxy) or on behalf of the server (reverse proxy). The direction determines what it protects and what it hides.

Forward Proxy — represents the client

A forward proxy sits in front of the client. Client traffic passes through it on the way out to the internet.

What it hides: the client's identity from the destination server
Use cases: corporate content filtering, outbound traffic control, caching for groups of users, anonymity
IC scenario: all users in an office can't reach external sites → suspect forward proxy misconfiguration or outage. Check proxy logs. The app isn't the problem — the outbound path is.
Examples: Squid, corporate web proxy, VPN exit node

Reverse Proxy — represents the server

A reverse proxy sits in front of the server. External traffic reaches the reverse proxy first, which then routes it to the right backend.

What it hides: the backend server's identity and internal topology from the client
Use cases: TLS termination, load balancing across app servers, rate limiting, caching static content, WAF integration
IC scenarios:
- 502 Bad Gateway — reverse proxy can't reach the upstream app (app crashed or connection refused)
- 504 Gateway Timeout — upstream app is alive but not responding fast enough
- 499 — Nginx-specific, non-standard code: client closed the connection before the reverse proxy responded
Examples: Nginx (see Cloud Infra tab), HAProxy, Caddy, AWS ALB, Cloudflare

The one-line difference: A forward proxy knows who you are and fetches the internet for you. A reverse proxy knows the internet is calling and routes it to the right server for you.

1 · How DNS Works

📋 DNS Record Types

A	Hostname → IPv4 address
AAAA	Hostname → IPv6 address
CNAME	Alias → another hostname (chain)
MX	Mail routing for domain
TXT	Verification, SPF, DKIM records
NS	Which nameserver is authoritative

IC insight: Wrong record value = DNS resolves successfully but traffic hits the wrong place. DNS can be "working" and still be wrong.

⏱️ TTL & Propagation

TTL (Time To Live) controls how long DNS answers are cached. After a change, old answers persist across the internet until every cache expires.

Low TTL (60s)	Changes propagate fast
High TTL (3600s)	Changes take up to 1 hour to spread
"Works for me"	Your cache has new record; others still have old

IC questions: What is the TTL? When was the change made? Are caches cleared?

Old maps still in circulation

2 · TCP vs UDP

3 · TCP Connection Lifecycle & What Can Go Wrong

⚠️ SYN queue full

Server can't accept new connections. Cause: traffic spike or SYN flood attack. Connections fail before the app is even involved.

IC: Is this load or attack? Check connection rate vs normal baseline.

🔄 Retransmissions & congestion

Lost packets trigger retransmit. Under load, retransmits add more traffic → more loss → feedback loop. Progressive slowdown that worsens without intervention.

Cars re-entering a traffic jam

🔧 IC questions

Failing before or after connection established?
SYN queue depth — is it filling?
Retransmit rate increasing?
Is packet loss present on the link?
Traffic spike or sustained high load?
Is this an attack (SYN flood)?

4 · Quick Reference — Symptom → Likely Cause

What you see	Likely cause
"Works for me" but not others	TTL — stale cache on some resolvers
Traffic routing to wrong server	Wrong DNS record (A/CNAME pointing old IP)
Connections failing before any data	TCP handshake failing — firewall / SYN queue
Progressive slowdown under load	TCP congestion / retransmission loop
Silent drops, choppy audio/video	UDP packet loss — no retransmit
Service recovers after DNS TTL expires	Stale DNS cache — needed to propagate

🚦 Networking IC triage

Layer first — DNS (name resolution) or TCP (connection) or app?
Who sees it? — all users or subset? A subset points to DNS propagation
What changed? — DNS record, IP, certificate, firewall rule?
Failing before or after handshake? — pre-handshake = network; post = app
TCP or UDP? — determines whether to look for retransmits or silent drops

5 · RabbitMQ

📬 Exchange → Queue → Consumer

Producers publish to an exchange with a routing key. The exchange routes to queues based on its type. Consumers subscribe to queues, receive messages pushed by the broker, and ACK each message to remove it.

Direct	Exact routing key match
Fanout	Broadcast to all bound queues
Topic	Wildcard pattern match on key

IC key: Unlike Kafka, messages are deleted on ACK. Silent routing bugs send messages to the wrong queue — they don't error, they just disappear.

Postal sorting office

☠️ Dead-Letter Queue (DLQ)

Failed, rejected, or TTL-expired messages are routed to the DLQ. A growing DLQ means the consumer is failing to process messages — without fixing the root cause, clearing the DLQ just refills it.

DLQ growing fast	Consumer bug or schema mismatch
Recent deploy + DLQ spike	Code change broke the consumer
DLQ growing, consumer OK	TTL too low or routing error

IC: Read a sample DLQ message, check consumer error logs, fix root cause before replaying.

🔄 Connection Storm

Consumers rapidly disconnect and reconnect, overwhelming the broker with state management. Broker CPU spikes with low message throughput — it's handling churn, not messages.

Cause: Consumer crash loop · prefetch too high → timeout → disconnect · network instability

⚙️ Prefetch setting

Controls how many unACKed messages a consumer holds at once. Too high → slow ACK → broker disconnects the consumer. Too low → consumer starved, slow throughput.

IC: Prefetch misconfiguration is a common hidden cause of connection churn and slow queues.

🚦 RabbitMQ IC triage

Queue depth growing? → consumer keeping up?
DLQ filling? → consumer errors, check logs
Broker CPU high, low throughput? → connection churn
Messages missing? → routing / exchange config
Recent deploy? → schema or code change

6 · OSI Model — Layer Quick Reference

🏢 7-Floor Building Analogy

Each floor handles a different job. A fire on floor 3 (Network) doesn't mean the top floors (App) are broken — but they can't work if floors below are burning.

IC question: "Which floor is failing?" — determines who to call before you start escalating.

🔑 Hub vs Switch

Hub (L1): Repeats signal to all ports — no address awareness. Everything on the segment goes down together.

Switch (L2): Reads MAC addresses, forwards only to correct port. One port failure isolates one host.

IC: "Is it all hosts on the segment or just one?" separates L1 from L2.

🗺️ IC Layer Triage

Pre-connection failure → L1–L4 (network/infra)
Post-connection failure → L5–L7 (app/security)
All hosts in range → L3 routing or L4 firewall
Specific requests 403'd → L7 WAF
TLS errors → L6 cert issue

7 · WAF vs Firewall

🛡 Firewall — L3/L4

Filters by	IP address · port · protocol
Blocks	IP ranges · CIDR rules · ports
Sits at	Inside VCN — zone-to-zone rules
IC signal	TCP timeout — no HTTP response at all
Owned by	Network team

Border fence — blocks by country of origin

🔍 WAF — L7

Filters by	HTTP headers · URL · request body
Blocks	SQL injection · XSS · bad payloads
Sits at	Edge — CDN / reverse proxy (before LB)
IC signal	HTTP 403 — specific requests blocked
Owned by	Security / App team

Customs inspector — reads parcel contents

MODERN CLOUD TRAFFIC FLOW

Internet	Untrusted — all traffic starts here
↓ WAF (CDN / edge)	Blocks app attacks early · HTTP 403 on match
↓ Load Balancer	Distributes · TLS termination
↓ Firewall / NSGs	Zone rules by IP/port · TCP drop on block
↓ App Tier	App logic — only reached after all layers pass

Why WAF is first: Blocking application attacks at the edge means the load balancer, firewall, and app tier never see them. Early kill = lower resource cost + smaller blast radius.

⚠️ Common IC Mistake

Assuming a 403 is an app permission error. If the app logs show nothing, the request never reached the app — WAF blocked it at the edge. Check WAF logs before escalating to the app team.

📋 Failure Signature

HTTP 403, specific paths	WAF
TCP timeout, no response	Firewall / NSG
HTTP 5xx after connect	App tier
Connection refused	No listener on the port / rule actively rejecting

🚦 Who to call

HTTP 403 → Security team (WAF)
TCP timeout → Network team (NSG/FW)
5xx after connect → App team
Nothing logged anywhere → start at edge (WAF)

8 · Physical Infrastructure

Scope → Suspect Component

1 server unreachable: NIC, patch cable, SFP, or switch port
Whole rack down: TOR (top-of-rack) switch or its uplink
Multiple racks / zone: Aggregation switch or inter-DC fiber
Intermittent drops + CRC errors: Dirty SFP, bad fiber connector

Key IC Questions

"Has anyone done cabling work or hardware changes in that rack?"
"Is it exactly one rack, or partial?" (scope the switch)
"Are there CRC errors on the NIC?" (physical layer signal)
"Can you try re-seating the SFP?" (quick physical fix)

9 · Proxy vs Reverse Proxy

Forward Proxy — represents the client

Sits in front of the client — traffic goes Client → Proxy → Internet
Hides the client's identity from the destination
Used for outbound content filtering, corporate traffic control, anonymity
IC signal: all users behind a network can't reach external sites → check forward proxy health and config
Examples: Squid, corporate web proxy

Reverse Proxy — represents the server

Sits in front of the server — traffic goes Internet → Proxy → App
Hides backend topology; handles TLS, load balancing, rate limiting
502 = upstream app is down · 504 = upstream too slow · 499 = client gave up
IC signal: Nginx 502/504 → the problem is behind Nginx, not Nginx itself
Examples: Nginx (Cloud Infra tab), HAProxy, AWS ALB

IDCS Global Authentication Failure Highway entrance closed

Core understanding: IDCS is a centralised cloud identity provider. It acts as the first gate users must pass through before reaching any system. If it becomes unavailable, users cannot authenticate anywhere — even though the underlying apps may still be healthy.

What it is: A shared login authority used across multiple systems.

What it does: Authenticates users and issues access tokens.

Problem in incident: IDCS outage or service disruption.

Effect (what you see):

All apps inaccessible after login attempt
401/403 spike across every service simultaneously

Technical effect: No tokens issued — authentication cannot begin.

IC interpretation: Central dependency failure — the authentication hub is down.

Analogy: Highway entrance closed — all routes blocked even though the roads beyond are clear.

Incident signals: Login failures across all apps at once · drop in successful auth metrics.

IC questions: "Are all apps affected?" / "Is IDCS reachable?" / "When did auth success rate drop?"

Pattern recognition: All apps fail login simultaneously → suspect IDCS.

Token Expiry / Validation Issues Expired train ticket during journey

Core understanding: After login, users don't continuously re-authenticate — they use tokens as proof of identity. These tokens have rules like expiration time and validation checks. If those rules are misconfigured or systems disagree on time, valid users can suddenly appear invalid.

What it does: Maintains authenticated sessions across systems.

Problem in incident: Expired or misvalidated tokens.

Effect (what you see):

Random mid-session logouts
Intermittent 401 errors for users already logged in

Technical effect: Token rejected by applications.

IC interpretation: Misconfiguration or time sync issue — not an outage.

Analogy: Expired train ticket during the journey — you bought it, you're on the train, but the gate says it's invalid.

Incident signals: Token validation errors in logs · session drops without user action.

IC questions: "Are tokens expiring earlier than expected?" / "Is system time consistent across services?"

Pattern recognition: Random auth failures for already-logged-in users → token issue.
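
A minimal sketch of how clock disagreement makes a valid token look expired. The function name `token_is_valid`, the timestamps, and the `leeway` parameter are assumptions made for illustration; real token validators have their own options for tolerating clock skew.

```python
def token_is_valid(expires_at: int, validator_clock: int, leeway: int = 0) -> bool:
    """Accept the token while the validator's clock has not passed expiry plus leeway."""
    return validator_clock <= expires_at + leeway


issued_at = 1_000_000          # issuer's clock when the token was created (made-up value)
expires_at = issued_at + 3600  # one-hour lifetime
true_now = issued_at + 3500    # 100 seconds of lifetime really remain

in_sync = token_is_valid(expires_at, true_now)
# Validator clock runs 5 minutes fast: the same token is now rejected.
fast_clock = token_is_valid(expires_at, true_now + 300)
# A 5-minute leeway absorbs the skew.
with_leeway = token_is_valid(expires_at, true_now + 300, leeway=300)

print(in_sync, fast_clock, with_leeway)  # True False True
```

This is why "Is system time consistent across services?" is worth asking early: nothing about the user or the token changed, only the clock doing the checking.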

Federation / SSO Misconfiguration Two border checkpoints refusing each other

Core understanding: Federation allows one identity system to trust another (e.g., corporate login into cloud apps). This relies on precise configuration and certificates. If that trust breaks, users get stuck in login flows or cannot authenticate at all.

What it does: Enables login via external identity providers.

Problem in incident: Broken trust configuration or certificate mismatch.

Effect (what you see):

Redirect loops — browser bounces between app and login page
Login fails after being redirected to SSO

Technical effect: Authentication handshake fails between identity providers.

IC interpretation: Integration misconfiguration — the two systems no longer agree on trust.

Analogy: Two border checkpoints refusing to accept each other's stamps.

Incident signals: Repeated redirect errors · SSO-specific error codes · only SSO users affected.

IC questions: "Are only SSO users affected (local accounts still work)?" / "Any cert or config changes recently?"

Pattern recognition: Redirect loop → SSO / federation issue.

LDAP Latency (IDM) Traffic jam at ID checkpoint

Core understanding: LDAP is the directory service that stores user identities in IDM environments. During login, systems query LDAP to verify users. If LDAP is slow, every authentication request slows down — even if nothing is technically broken.

What it does: Provides user data for authentication queries.

Problem in incident: Slow directory responses.

Effect (what you see):

Login takes much longer than normal (15–20s instead of 1–2s)
Occasional timeouts for some users

Technical effect: Queued or delayed auth requests — high LDAP response times.

IC interpretation: Performance bottleneck — slowness, not failure.

Analogy: Traffic jam at the ID checkpoint — everyone gets through eventually, but very slowly.

Incident signals: High auth latency · complaints about slow login, not login failure.

IC questions: "Is login slow or actually failing?" / "What are LDAP query response times?" / "Any load increase recently?"

Pattern recognition: Login eventually works but is very slow → LDAP latency.
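
The "slow or actually failing?" question can be sketched as a small classifier over recent login attempts. The function name, the thresholds (`slow_after`, `failing_above`), and the sample data are all assumptions for this example, not values taken from any real monitoring setup.

```python
from statistics import median


def classify_logins(attempts, slow_after=5.0, failing_above=0.5):
    """attempts: list of (duration_seconds, succeeded) tuples."""
    if not attempts:
        return "no data"
    failure_rate = sum(1 for _, ok in attempts if not ok) / len(attempts)
    if failure_rate > failing_above:
        return "failing"
    successes = [duration for duration, ok in attempts if ok]
    # Most logins succeed, so judge speed on the ones that got through.
    if successes and median(successes) > slow_after:
        return "slow"
    return "normal"


# Made-up sample: logins mostly succeed but take 15-20 seconds.
sample = [(16.0, True), (18.5, True), (20.0, False), (15.2, True)]
print(classify_logins(sample))  # slow
```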

User Provisioning / Sync Issues Different checkpoints, different passenger lists

Core understanding: Users and permissions are synchronised across systems. If this process fails, different systems may have different views of who a user is or what they can access — creating inconsistent, hard-to-diagnose failures.

What it does: Keeps user identities and roles consistent across all systems.

Problem in incident: Sync delays or failures.

Effect (what you see):

Some users fail while others succeed
Permissions missing or incorrect for affected users

Technical effect: Data inconsistency across systems.

IC interpretation: State mismatch — not an outage, but a divergence between systems.

Analogy: Different checkpoints using different passenger lists.

Incident signals: Only specific users or groups affected · new users, recently changed roles, or recently onboarded teams impacted.

IC questions: "Who exactly is affected?" / "Any recent provisioning changes or new user onboarding?"

Pattern recognition: Partial user failures (not everyone) → sync or provisioning issue.
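
A state mismatch of this kind can be illustrated by comparing two user lists. This is a sketch under assumptions: the function name `diff_user_roles`, the dictionary shape (user name mapped to a list of roles), and the sample users are invented for the example.

```python
def diff_user_roles(source: dict, target: dict) -> dict:
    """Compare two {user: [roles]} views and report where they diverge."""
    source_users, target_users = set(source), set(target)
    return {
        # Users the source system knows about that never reached the target.
        "missing_in_target": sorted(source_users - target_users),
        # Users that exist only in the target (for example, never deprovisioned).
        "extra_in_target": sorted(target_users - source_users),
        # Users present in both systems but with different roles.
        "role_mismatch": sorted(
            user for user in source_users & target_users
            if set(source[user]) != set(target[user])
        ),
    }


identity_provider = {"alice": ["admin"], "bob": ["viewer"], "carol": ["editor"]}
application = {"alice": ["admin"], "bob": ["editor"]}

result = diff_user_roles(identity_provider, application)
print(result)
```

Here `carol` would fail to log in to the application at all, while `bob` would log in but see the wrong permissions, which matches the "some users fail while others succeed" symptom.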

MFA Failure Second checkpoint blocked

Core understanding: MFA adds a second verification step after password authentication. This step often depends on external systems (SMS providers, authenticator apps). If it fails, users are authenticated on password but cannot complete login.

What it does: Provides additional identity verification beyond password.

Problem in incident: MFA system or provider failure.

Effect (what you see):

Users stuck after entering their password
MFA prompts that never arrive or fail to validate

Technical effect: Second authentication step cannot complete.

IC interpretation: Partial authentication failure — first step worked, second step blocked.

Analogy: Getting through the first checkpoint but being blocked at the second.

Incident signals: MFA error messages in logs · push notifications or SMS not arriving.

IC questions: "Where exactly does login stop — before or after MFA prompt?" / "Is this an external MFA provider?"

Pattern recognition: Login stalls after password entry → MFA failure.

OAuth / OIDC Misconfiguration Wrong key for one door

Core understanding: Applications must be correctly configured to trust IDCS tokens. This includes client IDs, secrets, and redirect URLs. A small mismatch can break authentication for a single app while others work fine.

What it does: Connects individual applications to the identity provider.

Problem in incident: Incorrect client configuration in one app.

Effect (what you see):

One specific app fails login
All other apps still work fine

Technical effect: Token rejected by the misconfigured application.

IC interpretation: App-specific misconfiguration — scope is narrow, not a platform issue.

Analogy: Wrong key for one door — master key still works on all others.

Incident signals: Single app impacted · OAuth error codes (invalid_client, redirect_uri_mismatch).

IC questions: "Is this only one app or multiple?" / "Any config deployment to this app recently?"

Pattern recognition: One app broken while others work → OAuth / OIDC misconfiguration.

Certificate Expiry Expired passport

Core understanding: Certificates establish trust between systems in authentication flows. They have expiration dates. When they expire, systems stop trusting each other — causing sudden, complete failures with no degraded middle period.

What it does: Secures and validates identity communication between systems.

Problem in incident: Expired certificate.

Effect (what you see):

Sudden, complete login failure — was working, now completely broken
SSO stops working

Technical effect: Trust validation fails — systems refuse to communicate.

IC interpretation: Preventable config failure — a known expiry date was missed.

Analogy: Expired passport — valid until midnight on the expiry date, then refused everywhere instantly.

Incident signals: Certificate error messages in logs · sudden complete outage with no deployment.

IC questions: "Did any certificate expire recently?" / "Was there a cert change or renewal attempt?"

Pattern recognition: Sudden auth break with no deployment → check certificate expiry first.
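
Because the expiry date is known in advance, the check is simple arithmetic. The sketch below assumes the certificate's "not after" timestamp has already been read into a `datetime`; the function names, the 30-day warning window, and the dates are assumptions for illustration.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Days remaining before the certificate stops being trusted."""
    return (not_after - now).total_seconds() / 86400


def expiry_status(not_after: datetime, now: datetime, warn_days: int = 30) -> str:
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days:
        return "expiring soon"
    return "ok"


not_after = datetime(2025, 6, 1, tzinfo=timezone.utc)  # made-up expiry date

print(expiry_status(not_after, datetime(2025, 5, 20, tzinfo=timezone.utc)))       # expiring soon
print(expiry_status(not_after, datetime(2025, 6, 1, 0, 0, 1, tzinfo=timezone.utc)))  # expired
```

The second call shows the "no degraded middle period" behaviour: one second past the timestamp, the status flips straight to expired.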

Rate Limiting / Throttling Road closed due to too much traffic

Core understanding: Identity systems protect themselves by limiting how many requests they accept per time window. During traffic spikes, legitimate users can be blocked if limits are hit — even when the identity system itself is completely healthy.

What it does: Prevents overload or abuse by capping request rates.

Problem in incident: Too many requests trigger the limit.

Effect (what you see):

Login failures during peak usage times
429 (Too Many Requests) responses

Technical effect: Requests rejected or delayed by the rate limiter.

IC interpretation: Capacity or protection issue — the limit may be correct or may need tuning.

Analogy: Road closed due to too much traffic — the road is fine, volume exceeded what's allowed.

Incident signals: Traffic spike correlates exactly with login failure onset · 429 errors in logs.

IC questions: "Is there a traffic spike right now?" / "Are 429 errors visible?" / "What are the configured rate limit thresholds?"

Pattern recognition: Peak usage + login failures + 429 errors → throttling.
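
A rough sketch of "requests per time window" using a fixed-window counter. This is one simple technique chosen for illustration; a real identity service may use a sliding window or token bucket instead. The class name, limit, and window size are assumptions.

```python
class FixedWindowLimiter:
    """Allow at most `limit` requests in each `window_seconds` window."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = None
        self.count = 0

    def allow(self, now: float) -> int:
        """Return 200 if the request is accepted, 429 if the window is full."""
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            # A new window begins: reset the counter.
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return 429
        self.count += 1
        return 200


limiter = FixedWindowLimiter(limit=3, window_seconds=60)
burst = [limiter.allow(now=t) for t in (0, 1, 2, 3, 4)]
after_reset = limiter.allow(now=61)

print(burst, after_reset)  # [200, 200, 200, 429, 429] 200
```

The service itself never becomes unhealthy here: the fourth and fifth requests are rejected purely because of volume, and requests succeed again once the window rolls over.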

Identity Dependency Failure Checkpoint staff can't access records

Core understanding: Identity systems rely on underlying services like databases, network, and storage. If those fail, identity services degrade or stop working — even if the identity system's own processes are healthy.

What it does: Depends on backend infrastructure to function.

Problem in incident: Database, network, or storage failure beneath IDCS.

Effect (what you see):

Slow or failed login
Auth errors combined with infrastructure alerts

Technical effect: Backend dependency unavailable — IDCS cannot complete auth lookups.

IC interpretation: Underlying dependency issue. The visible failure is auth, but the root cause is infrastructure.

Analogy: Checkpoint staff can't access the records database — they're present but unable to do their job.

Incident signals: Infra alerts fire alongside auth failures · auth latency spike coincides with DB / network alerts.

IC questions: "Are there DB or network alerts at the same time?" / "Is this auth-only or a wider infrastructure issue?"

Pattern recognition: Auth failures + infra alerts simultaneously → dependency failure.

Oracle RAC — Real Application Clusters Multiple highways, one shared tunnel

Core understanding: Oracle RAC runs a single database across multiple servers at the same time, all connected to shared storage. It exists to improve availability and handle more load, but coordination between nodes introduces complexity and specific failure points.

What it does: Allows multiple servers to access the same database simultaneously, share workload across nodes, and continue operating if one server fails.

Problem in incident: Things go wrong when nodes stop syncing properly, one node becomes slow or fails, or the shared storage or interconnect network becomes a bottleneck.

Effect (what you see):

Intermittent slowness — not a full outage
Some requests fast, others very slow or timing out
Random errors under load
Latency spikes, especially during high traffic

Technical effect: Nodes are competing over shared data access. Delays in synchronisation between nodes. Traffic imbalance (some nodes overloaded). Possible node eviction from the cluster.

IC interpretation: Usually a contention problem (nodes competing), a coordination failure (cluster not in sync), or an infrastructure bottleneck (network or storage). Rarely a simple "server down" — more often partial degradation, not total failure.

Analogy: Multiple highways merging into one shared tunnel. Highways = servers, tunnel = shared database storage, traffic = queries. Too many cars → congestion. Poor coordination at the merge → traffic jams. One highway blocked → the others become overloaded.

Incident signals:

"High DB latency" or "cluster node evicted"
"Global cache wait" events in Oracle monitoring
Connection timeouts under load
Uneven CPU across nodes
Spike in lock or enqueue waits

IC questions: "Is this affecting all users or intermittent?" / "Are all nodes healthy or is one degraded?" / "Is load evenly distributed?" / "Any recent scaling or config changes?" / "Is storage or the interconnect showing latency?"

Pattern recognition: Partial slowness (not full outage) + uneven CPU across nodes + intermittent timeouts → think RAC imbalance or coordination issue.
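
The "uneven CPU across nodes" signal can be turned into a single number by comparing the busiest node with the average. This is a sketch only: the function name, the node names, and the CPU figures are invented, and what ratio counts as "imbalanced" is a judgment call for the team, not an Oracle-defined threshold.

```python
def cpu_imbalance(cpu_by_node: dict) -> float:
    """Ratio of the busiest node's CPU to the cluster average (1.0 = perfectly even)."""
    values = list(cpu_by_node.values())
    mean = sum(values) / len(values)
    return max(values) / mean if mean else 0.0


# Made-up CPU percentages for a three-node cluster.
cluster = {"node1": 92.0, "node2": 35.0, "node3": 38.0}

hottest = max(cluster, key=cluster.get)
ratio = cpu_imbalance(cluster)
print(hottest, round(ratio, 2))  # node1 1.67
```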

1 · The Authentication Chain

🚪 IDCS failure

IDCS is the first gate — all apps depend on it. If IDCS is down, all apps are unreachable even if they're perfectly healthy.

all apps affected simultaneously

IC: Is IDCS reachable? When did auth success rate drop?

Highway entrance closed

🎟️ Token expiry

Users authenticate successfully but get kicked out mid-session. Token has expired or systems disagree on expiry rules. Not an outage — a misconfiguration or time sync issue.

mid-session logouts · valid users appear invalid

IC: Are tokens expiring early? Is clock sync consistent?

Expired train ticket mid-journey

🔒 MFA failure

User passes password check but can't complete the second factor. Often an external MFA provider issue — not the identity system itself. Partial auth failure.

IC: Where exactly does login stop? Is the MFA provider external?

Second checkpoint blocked

2 · Federation, SSO & OAuth

🔗 Federation / SSO misconfiguration

SSO relies on exact config and certificate trust between identity systems. Small mismatch = login loops or redirect failures. Only SSO users affected.

redirect loops · SSO users only

IC: Are only SSO users affected? Any cert or config change?

Two border checkpoints refusing each other

🔑 OAuth / OIDC misconfiguration

One app has wrong client ID, secret, or redirect URL. That app's auth breaks while all others work fine. App-specific, not platform-wide.

one app broken, others fine

IC: Is this one app or multiple? Recent config deploy?

Wrong key for one door

📜 Certificate expiry

Certs have hard expiry dates. When they expire, systems instantly stop trusting each other — no degraded period. Complete, sudden failure. Entirely preventable.

SSL handshake failed · sudden auth breakage

IC: Did any cert expire? Any renewal attempt recently?

Expired passport

3 · Directory, Sync & Infrastructure

📂 LDAP latency & provisioning issues

LDAP slow	Every auth request slows — not broken, just sluggish. Eventually works.
Provisioning lag	New user exists in one system, not another. Inconsistent access per system.
Sync failure	Different systems have different user states — specific users/groups only.

IC: Who exactly is affected? Is login slow or failing? Any recent provisioning changes?

Different checkpoints, different passenger lists

🚦 Rate limiting & dependency failures

Rate limit	429 errors during traffic spikes. Identity system is healthy — it's protecting itself.
Dependency failure	Identity DB or network fails. Auth service processes are up but can't function. Root cause is infra, not identity.

IC: Are 429 errors visible? Any DB or network alerts at the same time? Is this auth-only or a wider infra issue?

4 · Oracle RAC

🏗️ What RAC means for incidents

Multiple servers run the same DB simultaneously using shared storage. Adds availability but adds coordination complexity.

Intermittent failure — one node degraded, not all
Load imbalance — sessions not evenly spread across nodes
Interconnect slowness — block transfers between nodes cause latency

Multiple highways sharing one tunnel

🔧 IC questions

Affecting all users or intermittent?
Are all RAC nodes healthy?
Is load evenly distributed across nodes?
Any recent scaling or config changes?
Is storage or the interconnect showing latency?

5 · Quick Reference — Symptom → Likely Cause

What you see	Likely cause
All apps failing auth simultaneously	IDCS down — central dependency
Only SSO users can't log in	Federation / SSO misconfiguration
One specific app broken	OAuth/OIDC config on that app
Login stops at MFA screen	MFA provider issue
Users logged out mid-session	Token expiry / clock sync
Sudden auth breakage (no deploy)	Certificate expired
Slow login, eventually works	LDAP latency
Specific users/groups affected	Provisioning / sync failure
429 errors during traffic spike	Rate limiting — identity self-protecting
Intermittent DB issues on RAC	Node imbalance or interconnect lag

🚦 Oracle Stack IC triage

All apps or one? — all = IDCS; one = app OAuth config
All users or subset? — all = platform; subset = provisioning/sync
Where does login stop? — password/MFA/redirect = different layer
What changed? — cert, config, deploy, rotation
Slow or failing? — slow = LDAP; failing = IDCS/cert/config
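
The triage questions above can be sketched as a small decision function. The function name, the boolean parameters, and especially the order in which the checks run are assumptions made for this example; in a real incident the answers are rarely this clean and several causes can overlap.

```python
def triage(all_apps: bool, all_users: bool, slow_not_failing: bool) -> str:
    """Map three triage answers to the first area worth checking."""
    if slow_not_failing:
        # Login works eventually: a performance problem, not an outage.
        return "LDAP latency"
    if not all_apps:
        # One app broken while others work points at that app's config.
        return "app OAuth/OIDC config"
    if not all_users:
        # Every app affected, but only for some users or groups.
        return "provisioning / sync"
    # Everything and everyone failing at once: look at the central dependency.
    return "IDCS / certificate / platform config"


print(triage(all_apps=True, all_users=True, slow_not_failing=False))
print(triage(all_apps=False, all_users=True, slow_not_failing=False))
```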

Framing the Incident (Impact First) Side street vs motorway

Core understanding: Framing means quickly defining what is broken and how bad it is. Without it, teams focus on the wrong things or move too slowly.

What it does: Aligns everyone on what matters most and how urgent the situation is.

Problem in incident: Engineers jump into debugging without confirming impact. Low-priority issues get the same attention as critical ones. No urgency → slow decisions.

Effect (what you see): People asking different questions, no shared sense of severity, delayed mitigation.

What it means (IC interpretation): This is a priority alignment problem. The system isn't just failing — the response is unfocused.

Analogy: An accident happens but no one knows if it's on a side street or a major motorway. If it's the motorway (checkout), you need immediate response and all resources focused.

Incident signals: "Is this actually impacting users?" / "How bad is this?" / "Are we sure this is critical?" / Multiple threads of investigation.

IC questions: "What is the user impact right now?" / "Which functionality is affected?" / "Is this revenue-critical (checkout/login)?" / "How many users are impacted?" / "When did this start?"

Then state clearly: "Checkout is failing → high priority → focus on mitigation."

Ownership Assignment Uncontrolled junction

Core understanding: Every critical task needs a clearly named person or team responsible. Without this, work is assumed, duplicated, or not done at all.

What it does: Ensures work happens without delay and everyone knows who is doing what.

Problem in incident: Tasks are suggested but not assigned. People assume "someone else is doing it." Gaps or duplication in work.

Effect (what you see): "I thought that was already happening." Silence after actions are suggested. Same task done twice or not at all.

What it means (IC interpretation): This is a responsibility gap. The system is slow because no one owns execution.

Analogy: Traffic lights exist but no one is assigned to operate them. Cars hesitate, collide, or stop moving entirely.

Incident signals: "Who is doing that?" / "Is that being worked on?" / Long pauses after instructions.

IC questions: "Who owns the app right now?" / "Who is handling DB investigation?" / "Who is managing infra/network?"

Then assign clearly: "App team → initiate rollback now. DBA → investigate queries. Network → prepare to drain nodes."

Timeline Tracking Sequence before the crash

Core understanding: Timeline tracking means keeping a clear sequence of events during the incident. This helps connect cause and effect quickly.

What it does: Identifies what changed before the failure. Prevents confusion during the incident.

Problem in incident: Events get mixed up. Teams argue about what happened first. Root cause becomes harder to identify.

Effect (what you see): "Wait, did that happen before or after the deploy?" Repeated questions. Confusion about sequence.

Technical effect: Slower diagnosis. Missed correlations (e.g., deploy → failure).

What it means (IC interpretation): This is a visibility problem over time. You can't solve what you can't sequence.

Analogy: Trying to understand a crash without knowing which car entered the junction first or when the collision happened.

Incident signals: Confusion about timing / "When did that happen?" repeated / Misaligned understanding across teams.

IC questions: "When did alerts start?" / "When was the last deploy?" / "When did user impact begin?"

Then state: "09:05 deploy → 09:12 alerts → likely related."
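
The "deploy → alerts → likely related" reasoning is a sort plus a time difference. The sketch below assumes all events happen on the same day and are noted as `HH:MM`; the function name, the 15-minute window, and the sample events are invented for illustration.

```python
from datetime import datetime, timedelta


def changes_before_first_alert(events, window_minutes=15):
    """events: list of (HH:MM, kind, description), kind is 'change' or 'alert'."""
    parsed = sorted(
        (datetime.strptime(time, "%H:%M"), kind, desc) for time, kind, desc in events
    )
    first_alert = next((t for t, kind, _ in parsed if kind == "alert"), None)
    if first_alert is None:
        return []
    window = timedelta(minutes=window_minutes)
    # Keep only changes that landed shortly before the first alert.
    return [
        (t.strftime("%H:%M"), desc)
        for t, kind, desc in parsed
        if kind == "change" and timedelta(0) <= first_alert - t <= window
    ]


events = [
    ("09:12", "alert", "error rate alert"),
    ("09:05", "change", "deploy"),
    ("08:10", "change", "config push"),
    ("09:20", "alert", "user reports"),
]

suspects = changes_before_first_alert(events)
print(suspects)  # [('09:05', 'deploy')]
```

A change close to the first alert is only a lead, not proof, but it tells the IC which rollback to consider first.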

Parallel Work (Avoid Serial Investigation) Multi-lane road

Core understanding: Parallel work means multiple teams investigate different areas at the same time. Serial work (one after another) slows everything down.

What it does: Speeds up diagnosis and mitigation simultaneously.

Problem in incident: Teams wait for each other. Only one path investigated at a time. Bottlenecks form.

Effect (what you see): "Let's wait for DB before doing anything." Idle teams. Slow progress.

What it means (IC interpretation): This is a throughput problem. Not enough work happening simultaneously.

Analogy: Only opening one lane when multiple lanes are available — traffic builds up unnecessarily.

Incident signals: Teams waiting / Sequential updates / Slow momentum.

IC questions: "What can each team investigate right now?" / "Are we blocked or just waiting?" / "Can we run these in parallel?"

Then assign: App → deploy/rollback. DBA → queries. Network → traffic. All simultaneously.

Decisive Action (Mitigation First) Clear the road before the inquest

Core understanding: Incident command requires making fast, reasonable decisions to reduce impact — even without full information.

What it does: Stops user impact quickly. Buys time for deeper investigation.

Problem in incident: Over-analysis. Fear of making the wrong decision. Delayed action.

Effect (what you see): Endless discussion. No clear plan. Metrics not improving.

What it means (IC interpretation): This is a decision paralysis problem. The system isn't recovering because no action is taken.

Analogy: Seeing a blocked road but debating the causes instead of clearing it first.

Incident signals: "We're still investigating…" with no action taken / No improvement in metrics / Repeated theories.

IC questions: "What is the fastest way to reduce impact?" / "Can we roll back?" / "What is the safest immediate mitigation?"

Then decide: "We are rolling back — execute now."

Structured Communication (Who / What / Priority) Clear junction signs

Core understanding: Communication must be clear, direct, and structured so actions happen immediately.

What it does: Removes ambiguity. Speeds up execution.

Problem in incident: Vague instructions. Long explanations. Misunderstandings.

Effect (what you see): "Sorry, what was I doing?" Delayed responses. Confusion.

What it means (IC interpretation): This is a clarity problem. Work slows because instructions are unclear.

Analogy: Giving unclear directions at a busy junction — cars hesitate or go the wrong way.

Incident signals: Repeated clarifications / Tasks misunderstood / Slow execution after instruction.

Structure: Every instruction = Who is doing this + What exactly + Priority (now / next).

Example: "App team → roll back all nodes → priority now." (not "let's look into rollback")

1 · The IC Mindset — What Makes a Good Incident Commander

2 · Framing & Ownership

🎯 Framing the Incident (Impact First)

The first thing an IC does is define what is broken and how bad it is. Without framing, teams focus on the wrong things or move too slowly.

User impact	What exactly can't users do right now?
Scope	All users or a subset? One service or many?
Severity	Is this revenue-critical (checkout / login)?
Start time	When did this begin?

Bad framing: "Something seems wrong with the DB."
Good framing: "Checkout is broken for all users since 14:32 — zero orders completing."

Side street vs motorway — know which road is blocked

👤 Ownership Assignment

Every critical task needs a clearly named person responsible. Without this, work is assumed, duplicated, or falls through the gaps.

App team → owns app investigation
DB team → owns DB investigation
Infra → owns network / server checks

IC question every few minutes: "Who owns [task]? Are they actively working it?"

Uncontrolled junction — someone must direct traffic

3 · Timeline & Parallel Work

📅 Timeline Tracking

A clear sequence of events connects cause and effect. Without it, the team wastes time re-discovering what happened.

When did alerts start?	First signal
Last deploy?	Common cause — always check
When did user impact begin?	May differ from first alert
What changed just before?	Config, data migration, traffic spike

Even rough notes in a shared doc are better than nothing. You'll need the timeline for the post-incident review.

Sequence of events before a crash

🏎️ Parallel Work

Multiple teams investigate different areas simultaneously. Serial investigation (one after another) is the most common time-waster in incidents.

IC question: "What can each team investigate right now, simultaneously?"

4 · Decisive Action & Communication

⚡ Decisive Action (Mitigation First)

IC must make fast, reasonable decisions to reduce impact — even without full information. Over-analysis during an active incident costs users time.

Can we roll back?	Usually fastest mitigation after a deploy
Can we redirect traffic?	Bypass broken component immediately
Can we disable a feature?	Reduce blast radius, keep rest working
Can we scale up?	Buy time if it's capacity-related

Principle: Clear the road first — the inquest (root cause) comes after users are no longer impacted.

📢 Structured Communication

Every IC instruction = Who + What + Priority. Ambiguous instructions don't get actioned immediately.

Format: [Team/Person] → [Specific action] → [Priority / time]

Clear junction signs — no ambiguity about which way to go

5 · IC Checklist — What to Do in the First 10 Minutes

✅ First 10 minutes

Frame it — state impact, scope, and severity clearly to the room
Assign owners — App / DB / Infra / Comms — named, not assumed
Check the timeline — when did it start? What changed just before?
Launch parallel investigation — don't wait for one team to finish
First mitigation action — rollback? redirect? disable? Do it fast
Communicate out — status to stakeholders, even if "investigating"

⚠️ Common IC failure modes

Vague framing — "something's broken" → nobody knows urgency
No named owner — "someone look into the DB" → nobody does
Serial investigation — waiting for each team before the next starts
Analysis paralysis — waiting for certainty before acting
Unclear instructions — "maybe try rolling back?" → treated as optional
No comms out — stakeholders escalate, creating noise

Docker, Kubernetes & Terraform — How They Fit Together The Full Picture

Docker packages an application and everything it needs into a container — so it runs the same everywhere.

Kubernetes runs and manages those containers at scale — scheduling, healing, and load-balancing them across machines.

Terraform builds the underlying infrastructure — servers, networks, and storage — using code.

Together they:

Define — Terraform provisions the environment
Run — Docker packages and isolates the app
Manage — Kubernetes keeps it running at scale

The Port Analogy:

Terraform → the company that builds the port (designs and provisions the docks, cranes, and warehouses)
Kubernetes → the port authority running daily operations (decides which ship takes which container, reschedules when a ship is overloaded, and reroutes when one goes down)
Docker → the standardised shipping container (sealed, identical, and portable — contents are the same no matter where it lands)

Inside a Docker container:

Application code (e.g. Node.js, Python app)
Runtime (Node, Python, Java, etc.)
Dependencies (libraries, packages)
Config needed to run

IC relevance: When an incident spans multiple layers, knowing which tool owns which layer helps you ask the right question first. Container crashing = Docker layer. Pod scheduling failing = Kubernetes layer. Servers missing = Terraform layer.

Docker Container packaging

What it does: Packages apps into containers. Ensures consistency across environments. Runs isolated processes on a host machine.

Problem in incident: Container crashes or restarts, resource limits hit (CPU/memory), misconfigured image or environment variables.

Symptoms:

App randomly restarting
Slow or failing requests
"Service unavailable" errors

Technical effect:

Container process dies or is killed by the OS
Resource starvation — CPU throttled or memory limit hit
Image or config mismatch between environments

What it means (IC interpretation): Usually resource exhaustion, a bad deploy or config issue, or the isolation hiding the root cause from standard monitoring.

Analogy: A standardised shipping container at a port. Every container is sealed with the app code, runtime, dependencies, and config inside — identical no matter which ship (host machine) carries it. If the contents are wrong, it fails the same way everywhere.

Incident signals: "Container restarted" · "OOMKilled" · High CPU / memory · CrashLoopBackOff (reported by Kubernetes when a container keeps crashing)

IC questions: Are containers restarting? Is resource usage high? Was there a recent deploy? Is this one container or all of them?
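
A sketch of reading container state to answer "are containers restarting, and why?". It assumes you already have one entry of `docker inspect` output parsed into a dictionary; the field names used (`State.OOMKilled`, `State.ExitCode`, `State.Running`, `RestartCount`) are as I understand that output, and the function name and sample values are made up.

```python
def explain_container_state(inspect_entry: dict) -> str:
    """Turn a parsed `docker inspect` entry into a one-line IC summary."""
    state = inspect_entry.get("State", {})
    restarts = inspect_entry.get("RestartCount", 0)
    if state.get("OOMKilled"):
        return f"killed for exceeding its memory limit (restarts: {restarts})"
    if not state.get("Running") and state.get("ExitCode", 0) != 0:
        return f"exited with code {state['ExitCode']} (restarts: {restarts})"
    if restarts > 0:
        return f"running, but restarted {restarts} times"
    return "running normally"


# Hypothetical sample, not captured from a real host.
sample_entry = {
    "RestartCount": 4,
    "State": {"Running": False, "OOMKilled": True, "ExitCode": 137},
}

print(explain_container_state(sample_entry))
```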

Kubernetes (K8s) Container orchestration

What it does: Runs containers at scale across multiple machines. Balances load, restarts failed workloads, and manages traffic routing between services.

Problem in incident: Pods not starting, traffic not reaching services, scaling or scheduling failures.

Symptoms:

Intermittent outages — some requests succeed, others fail
Services unreachable
High latency across the cluster

Technical effect:

Pods failing or stuck in Pending/CrashLoop state
Networking or service routing issues
Cluster imbalance — one node overloaded, others idle

What it means (IC interpretation): Usually a coordination failure, resource contention between pods, or a networking issue at the service routing layer.

Analogy: The port authority running daily operations. Kubernetes decides which ship (node) takes which container (pod), manages the schedule, reroutes when a ship is overloaded, and replaces containers that fall into the sea (crash).

Incident signals: "Pod CrashLoopBackOff" · "Pending pods" · "Service unavailable" · Uneven latency

IC questions: Are pods running or pending? Is traffic reaching services? Any node overloaded? Any recent deploy?
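
"Are pods running or pending?" can be answered by counting pod states. The sketch below assumes the output of `kubectl get pods -o json` has been parsed into a dictionary; the function name and the sample pods are invented, and only the fields shown (`status.phase` and a waiting `reason` under `containerStatuses`) are read.

```python
from collections import Counter


def summarise_pods(pod_list: dict) -> Counter:
    """Count pods by phase, preferring a container's waiting reason when present."""
    summary = Counter()
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        label = status.get("phase", "Unknown")
        for container in status.get("containerStatuses", []):
            reason = container.get("state", {}).get("waiting", {}).get("reason")
            if reason:
                # e.g. CrashLoopBackOff is more informative than the bare phase.
                label = reason
        summary[label] += 1
    return summary


# Hypothetical sample data with three pods.
sample_pods = {
    "items": [
        {"status": {"phase": "Running", "containerStatuses": [{"state": {"running": {}}}]}},
        {"status": {"phase": "Pending"}},
        {"status": {"phase": "Running", "containerStatuses": [
            {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}
        ]}},
    ]
}

pod_summary = summarise_pods(sample_pods)
print(dict(pod_summary))
```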

Terraform Infrastructure as Code

What it does: Defines infrastructure using code (.tf files) and ensures the real system matches that definition. Creates and manages servers, networks, and storage automatically.

Problem in incident: Wrong infrastructure deployed, accidental deletion or change, drift between the expected and real state.

Symptoms:

Sudden outages immediately after a deployment pipeline runs
Missing resources — servers or services that should exist don't
Wrong environment behaviour despite identical app code

Technical effect:

Infrastructure changed or destroyed by a bad apply
State mismatch — Terraform's state file diverges from reality
Resources recreated with different config (different size, region, network)

What it means (IC interpretation): Usually a misconfiguration, a bad change rollout, or an automation error where Terraform enforced an incorrect "desired state".

Analogy: The company that builds the port itself — the docks, cranes, and warehouses. Terraform defines and provisions the physical infrastructure before any containers arrive. If the blueprint is wrong, the port doesn't exist or is misbuilt, and the port authority (Kubernetes) has nothing to work with.

Incident signals: "Resource deleted" · "Apply completed" · Sudden infra change · Missing instances

IC questions: Was Terraform run recently? What changed in the config? Was this intentional? Can we roll back or restore state?
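
To answer "what changed?", a plan can be scanned for destructive actions. This sketch assumes a plan exported as JSON (for example via `terraform show -json` on a saved plan file) and parsed into a dictionary, reading only `resource_changes[].change.actions`. The function name and the resource addresses in the sample are hypothetical.

```python
def destructive_changes(plan: dict) -> list:
    """Return (address, actions) for every planned change that deletes a resource."""
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        # A replacement appears as delete plus create, so it is flagged too.
        if "delete" in actions:
            flagged.append((resource.get("address", "<unknown>"), actions))
    return flagged


# Hypothetical plan fragment for illustration only.
sample_plan = {
    "resource_changes": [
        {"address": "example_instance.web", "change": {"actions": ["delete", "create"]}},
        {"address": "example_network.main", "change": {"actions": ["update"]}},
        {"address": "example_volume.data", "change": {"actions": ["delete"]}},
    ]
}

flagged = destructive_changes(sample_plan)
for address, actions in flagged:
    print(address, actions)
```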

Nginx Reverse proxy / Web server

What it is: Nginx is a high-performance web server and reverse proxy. In most production setups it sits in front of your application, handling incoming HTTP/HTTPS requests and forwarding them to the app server (e.g. Gunicorn).

Key roles:

Reverse proxy — receives client requests and forwards them to the correct backend
TLS termination — handles HTTPS so the app server only sees plain HTTP internally
Static file serving — serves CSS, JS, images directly without touching the app
Load balancing — distributes requests across multiple app instances
Rate limiting / access control — rejects abusive clients before they reach the app

Analogy: The hotel front desk. Every guest walks in, the front desk decides where to route them — regular check-in, concierge, restaurant — without each department needing to handle its own door.

Common incident signals:

502 Bad Gateway — Nginx can't reach the upstream app server (app is down or restarting)
504 Gateway Timeout — app server is responding too slowly; Nginx gave up
Connection refused — nothing is listening on the upstream socket/port
High 499 rate — clients are closing connections before Nginx responds (slow backend)
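
The signals above can be turned into a quick tally when skimming access logs. This is a minimal sketch, assuming the status codes have already been extracted into a list; the `NGINX_SIGNALS` table and the `summarize` helper are hypothetical names, and the interpretations simply restate the list above.

```python
from collections import Counter

# Interpretations restated from the incident-signal list
NGINX_SIGNALS = {
    502: "Nginx could not reach the upstream app server (down or restarting)",
    504: "Upstream app server responded too slowly; Nginx gave up",
    499: "Client closed the connection before Nginx responded (slow backend)",
}


def summarize(status_codes):
    """Count only the codes that carry an incident signal, most frequent first."""
    counts = Counter(code for code in status_codes if code in NGINX_SIGNALS)
    return [(code, n, NGINX_SIGNALS[code]) for code, n in counts.most_common()]


# Hypothetical sample of status codes pulled from an access log
sample = [200, 502, 200, 504, 502, 502, 499, 200, 504]
for code, n, meaning in summarize(sample):
    print(f"{code} x{n}: {meaning}")
```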

IC questions: Is Nginx running? What do the Nginx error logs say? Is the upstream app server reachable on its port? Did a recent config change get reloaded?

Gunicorn Python WSGI app server

What it is: Gunicorn (Green Unicorn) is a Python WSGI HTTP server. It runs Python web applications (Django, Flask) by spawning multiple worker processes to handle concurrent requests. It typically sits behind Nginx in production.

What is WSGI? WSGI (Web Server Gateway Interface) is the standard protocol that defines how Python web frameworks communicate with a server. Think of it as the shape of the power socket: Flask and Django are appliances that plug into the WSGI socket; Gunicorn is the socket provider. Because they both speak WSGI, you can swap one framework for another without changing the server, or swap Gunicorn for uWSGI without changing your app. Without WSGI, every framework would need its own server.
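
To make the "socket shape" concrete, here is a minimal sketch of a WSGI application using only the standard library. The `application` callable follows the WSGI convention (a function taking `environ` and `start_response`); the `call_app` helper is a hypothetical stand-in for the server side, written only so the example can run without Gunicorn.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A WSGI app: a callable taking the request environ and a start_response callback."""
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]  # WSGI apps return an iterable of bytes


def call_app(app, path="/"):
    """Play the server's role: build an environ, call the app, collect the response."""
    environ = {}
    setup_testing_defaults(environ)   # fills in the keys a real server would provide
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = response_headers

    chunks = app(environ, start_response)
    return captured["status"], b"".join(chunks)


status, body = call_app(application, "/health")
print(status, body)
```

Flask and Django expose an object with this same calling convention, which is why any WSGI server can run either of them.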

Key concepts:

Worker processes — each sync worker handles one request at a time; more workers = more concurrency
Worker types — sync (default), async (gevent/eventlet), or thread-based — chosen based on workload
Master process — manages workers, restarts crashed ones, handles signals (reload, shutdown)
Binding — listens on a TCP port (e.g. 8000) or Unix socket; Nginx connects to this
Timeout — workers that don't respond within the timeout (default 30s) are killed and restarted
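
Gunicorn can read these settings from a Python config file. The sketch below shows how the concepts above map onto setting names; the values are illustrative assumptions, not a recommendation for any particular service.

```python
# gunicorn.conf.py (illustrative values only)
import multiprocessing

bind = "127.0.0.1:8000"       # TCP port that Nginx proxies to; a Unix socket path also works
workers = multiprocessing.cpu_count() * 2 + 1   # starting point suggested in the Gunicorn docs
worker_class = "sync"         # default worker type: one request at a time per worker
timeout = 30                  # seconds before a silent worker is killed and restarted
max_requests = 1000           # recycle a worker after this many requests (limits slow leaks)
max_requests_jitter = 50      # randomise recycling so workers do not all restart together
```

During an incident, comparing the deployed values of `workers` and `timeout` against the observed request rate and latency is usually the quickest sanity check.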

Analogy: The kitchen behind the hotel front desk. Nginx (front desk) routes the request; Gunicorn (kitchen) processes it using multiple chefs (workers). If the kitchen is too slow or understaffed, orders back up and the front desk starts returning "sorry, we're busy" errors.

Common incident signals:

[CRITICAL] WORKER TIMEOUT — a worker didn't finish its request in time; was killed and restarted
502 seen by clients — a worker was killed or crashed mid-request, or every worker was busy and the listen backlog overflowed; Nginx gets no valid response
High process memory — worker leak; workers grow until they're killed by OOM or max_requests
Gunicorn not responding after deploy — new code failing to import; workers crash on start

IC questions: How many workers are configured vs request rate? Are workers timing out (slow DB call? external API?)? Is Gunicorn actually running? Did a recent code deploy cause worker crashes?
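
The first question ("workers vs request rate") can be answered with back-of-the-envelope arithmetic: with sync workers, the number of requests in flight is roughly request rate times average request duration. The sketch below assumes sync workers and steady traffic; `workers_needed` and `is_saturated` are hypothetical helper names.

```python
import math


def workers_needed(requests_per_second: float, avg_seconds_per_request: float) -> int:
    """Rough estimate of busy sync workers: arrival rate x time each request holds a worker."""
    return math.ceil(requests_per_second * avg_seconds_per_request)


def is_saturated(configured_workers: int, requests_per_second: float,
                 avg_seconds_per_request: float) -> bool:
    """True when the estimated demand exceeds the configured worker pool."""
    return workers_needed(requests_per_second, avg_seconds_per_request) > configured_workers


# Hypothetical numbers: 40 req/s against 12 workers
normal = workers_needed(40, 0.25)   # healthy DB: 250 ms per request
degraded = workers_needed(40, 1.0)  # slow DB call: 1 s per request
print(normal, degraded, is_saturated(12, 40, 1.0))
```

The same traffic that fits comfortably at 250 ms per request needs four times the workers once a dependency slows to 1 s, which is why a slow database often shows up first as Gunicorn saturation.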

Node.js JavaScript runtime

What it is: Node.js is a JavaScript runtime built on Chrome's V8 engine. It runs server-side JavaScript using a single-threaded, non-blocking event loop — meaning it can handle many concurrent connections without spawning a thread per request. Commonly used for APIs, real-time apps, and microservices.

Key concepts:

Event loop — a single loop processes callbacks; I/O operations are handed off asynchronously so the loop stays free for other work
Non-blocking I/O — DB queries, file reads, and network calls don't block the loop; they return via callbacks, Promises, or async/await
Single thread — CPU-intensive work blocks the event loop for everyone; offload to worker threads or a separate service
npm — the package ecosystem; a missing or mismatched package version can cause startup failure
Cluster mode / PM2 — spawns one process per CPU core to use multiple cores; PM2 also handles restarts and logging
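
Node.js itself runs JavaScript; the sketch below uses Python's asyncio, which also has a single-threaded event loop, purely to illustrate the "one blocking task stalls everyone" behaviour described above. The function names and durations are assumptions chosen for the demonstration.

```python
import asyncio
import time


async def io_bound_request():
    await asyncio.sleep(0.05)   # non-blocking wait: the loop can serve other requests


async def cpu_bound_request():
    time.sleep(0.3)             # blocking call: nothing else runs on the loop meanwhile


async def serve(handlers):
    """Run all handlers concurrently on one event loop and return the wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(handler() for handler in handlers))
    return time.perf_counter() - start


free_loop = asyncio.run(serve([io_bound_request] * 20))
blocked_loop = asyncio.run(serve([cpu_bound_request] + [io_bound_request] * 20))
print(f"20 async requests: {free_loop:.2f}s; plus one blocking call: {blocked_loop:.2f}s")
```

Twenty overlapping async waits finish in about the time of one, while a single blocking call delays every request queued behind it. That is the "all requests slow at once" signature.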

Analogy: A single barista handling many orders at once — they pass each order to the coffee machine (async I/O) and move on. They can juggle 50 orders. But if one order requires them to stand and stir manually for 10 minutes (CPU block), every other customer waits.

Common incident signals:

Event loop lag / high latency — CPU-intensive code blocking the loop; all requests slow down simultaneously
Process exits with uncaught exception — unhandled Promise rejection or thrown error; app crashes until PM2/systemd restarts it
Memory growth / OOM kill — listener leak or unbounded cache; process grows until killed
EADDRINUSE on startup — port already in use; previous process didn't exit cleanly

IC questions: Is the event loop blocked (all requests slow at once)? Did a deploy introduce CPU-heavy code? Is the process actually running? Is memory growing steadily between restarts? Are there unhandled Promise rejections in logs?

Flask Python microframework

What it is: Flask is a lightweight Python web framework. It provides routing, request handling, and templating but has no built-in ORM, admin panel, or authentication — you add only what you need. Flask apps are WSGI applications, typically served by Gunicorn in production behind Nginx.

Key concepts:

WSGI — Web Server Gateway Interface; the standard for Python web apps to communicate with a server like Gunicorn
Routes — URL patterns mapped to Python functions using @app.route('/path')
Application factory — a pattern where the Flask app is created inside a function, making config and testing cleaner
Blueprints — modular groupings of routes; large Flask apps split into blueprints for each feature area
Context — Flask uses a request context (per-request data) and app context (app-level data like DB connections)
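
The route and application-factory ideas can be sketched without installing Flask. The `TinyApp` class below is a toy stand-in written for this guide, not Flask's implementation; it only shows how a decorator can map URL paths to functions and why a failure in one route does not have to affect the others.

```python
class TinyApp:
    """Toy stand-in for a framework's route table (not Flask's actual implementation)."""

    def __init__(self):
        self.routes = {}

    def route(self, path):
        def register(view):
            self.routes[path] = view   # map the URL path to the view function
            return view
        return register

    def dispatch(self, path):
        view = self.routes.get(path)
        if view is None:
            return 404, "Not Found"
        try:
            return 200, view()
        except Exception:
            # An unhandled exception in a view becomes a 500 for that route only
            return 500, "Internal Server Error"


def create_app():
    """Application factory: build and configure the app inside a function."""
    app = TinyApp()

    @app.route("/health")
    def health():
        return "ok"

    @app.route("/orders")
    def orders():
        raise RuntimeError("database unreachable")   # hypothetical failing dependency

    return app


app = create_app()
print(app.dispatch("/health"), app.dispatch("/orders"), app.dispatch("/missing"))
```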

Analogy: A pop-up food stall versus a full restaurant (Django). Flask gives you a table, a gas burner, and a knife — you bring the rest. Fast to set up, easy to keep simple, but you wire up every component yourself.

Common incident signals:

500 Internal Server Error — unhandled exception in a route; check Gunicorn/app logs for the traceback
App fails to start after deploy — import error, missing env var, or broken dependency in requirements.txt
Slow responses on specific routes — synchronous DB call, missing index, or external API call blocking a Gunicorn worker
Working directory / config not found — Flask looks for files relative to the app root; a path mismatch breaks startup

IC questions: Is the app actually running (Gunicorn workers up)? Which route is failing — is it all routes or one? Did a deploy change requirements.txt or env vars? Is there a slow DB call on the failing route?

Django Python batteries-included framework

What it is: Django is a full-featured Python web framework. Unlike Flask, it includes an ORM, admin panel, authentication, form handling, and migrations out of the box. Like Flask, it is a WSGI app, served by Gunicorn behind Nginx in production. Its philosophy is "don't repeat yourself": conventions reduce the amount of code needed.

Key concepts:

ORM — Django's built-in Object-Relational Mapper translates Python model classes to SQL; powerful but can generate inefficient queries if used carelessly
Migrations — schema changes are tracked as migration files; running manage.py migrate applies them to the database
Settings — all configuration lives in settings.py; DEBUG, database credentials, allowed hosts, installed apps
Admin panel — auto-generated at /admin; very useful for manual data inspection during incidents
WSGI entry point — Gunicorn points at project.wsgi:application; if this import fails, no workers start

Analogy: A fully equipped commercial kitchen (vs Flask's pop-up stall). The oven, the walk-in fridge, the dishwasher — all included. More opinionated about layout, but you get to cooking faster. The trade-off: more moving parts that can break.

Common incident signals:

App fails to start after deploy — missing settings or a broken import in models/apps; unapplied migrations usually surface slightly later as database errors (missing table or column)
Slow queries / high DB CPU — N+1 query problem (one query per object in a loop); use select_related / prefetch_related
DEBUG=True in production — shows full stack traces to users, and makes Django keep every executed SQL query in memory, so memory grows over time; a major security and performance issue
500 on a specific URL — unhandled exception in a view; check Gunicorn logs for the traceback
Migration conflicts after merge — two branches added migrations to the same app; resolve with manage.py makemigrations --merge, or re-number one branch's migrations
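
The N+1 pattern can be reproduced without Django. The sketch below uses the standard-library sqlite3 module with a hypothetical author/book schema and a small `run` helper that counts queries. The JOIN version corresponds to what select_related does for a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    INSERT INTO author VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO book VALUES (1, 'A', 1), (2, 'B', 2), (3, 'C', 1);
""")

query_count = 0


def run(sql, params=()):
    """Execute a query and count it, so the two approaches can be compared."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def n_plus_one():
    # 1 query for the books, then 1 more query per book for its author
    books = run("SELECT title, author_id FROM book ORDER BY id")
    return [(title, run("SELECT name FROM author WHERE id = ?", (author_id,))[0][0])
            for title, author_id in books]


def joined():
    # A single query that fetches each book together with its author
    return run("SELECT book.title, author.name FROM book "
               "JOIN author ON author.id = book.author_id ORDER BY book.id")


slow_rows = n_plus_one()
slow_count = query_count
query_count = 0
fast_rows = joined()
fast_count = query_count
print(slow_count, fast_count)
```

With three books the difference is 4 queries versus 1; with ten thousand rows in a list view it is the difference between a fast page and high DB CPU.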

IC questions: Were migrations applied after the deploy? Is DEBUG True in production? Which view is causing 500s? Are there N+1 query patterns in the slow endpoint? Is the WSGI entry point importable?

OCI Physical Hierarchy OCI Infrastructure

Oracle Cloud Infrastructure organises resources in a three-level hierarchy: Region → Availability Domain → Fault Domain. Understanding which level a failure is at determines the blast radius and recovery options.

Region

A geographic area (e.g. uk-london-1, us-ashburn-1). Completely isolated from other regions — an outage in one region does not affect others. OCI has 40+ regions globally.

IC relevance: If users in only one country are affected, ask: "Which region do they connect to?" Regional failures are rare and are escalated immediately to Oracle.

Availability Domain (AD)

Within a region there are 1–3 ADs. Each AD is a physically separate data centre with its own power, cooling, and networking. Failure in one AD does not cascade to others in the same region.

IC relevance: If some users are affected and others are not within the same region, ask: "Are the affected services deployed in only one AD? Is there cross-AD load balancing?"

Fault Domain (FD)

Each AD contains 3 FDs. A FD groups physical hardware — servers and top-of-rack switches — sharing a power circuit. A hardware failure (power circuit, rack switch) affects only the instances in that FD.

IC relevance: If some VMs within an AD are down but others are fine, ask: "Are all the affected instances in the same FD?" Spreading instances across all 3 FDs gives hardware-level redundancy inside an AD.

The Analogy

Region = the city. AD = a separate building in the city, with its own power supply and entrance — a fire in building A doesn't affect building B. FD = a floor within that building — a tripped circuit on floor 3 doesn't affect floors 1 and 2.

IC First Questions

"Which region are the affected resources in?" — rules in/out a regional event
"Are affected services in the same AD, or spread across ADs?" — narrows to AD-level failure
"Which FD are the affected instances in?" — points to hardware-level fault
"Are any other resources in the same FD also affected?" — confirms blast radius
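
The scoping questions above amount to finding the narrowest level of the hierarchy that contains every affected instance. A minimal sketch, assuming each affected instance is described by a (region, AD, FD) tuple; the `failure_scope` helper and the AD labels are hypothetical, and the result is a starting hypothesis rather than a diagnosis.

```python
def failure_scope(affected):
    """Return the narrowest hierarchy level that contains all affected instances.

    affected: list of (region, availability_domain, fault_domain) tuples.
    """
    if not affected:
        return "unknown"
    regions = {region for region, _, _ in affected}
    ads = {(region, ad) for region, ad, _ in affected}
    fds = set(affected)
    if len(regions) > 1:
        return "multi-region"          # unlikely to be one infrastructure fault
    if len(ads) > 1:
        return "region"                # spans several ADs in one region
    if len(fds) > 1:
        return "availability domain"   # spans several FDs in one AD
    return "fault domain"              # everything shares one FD


# Hypothetical incident: three VMs down, all in the same FD
down = [
    ("uk-london-1", "AD-1", "FAULT-DOMAIN-2"),
    ("uk-london-1", "AD-1", "FAULT-DOMAIN-2"),
    ("uk-london-1", "AD-1", "FAULT-DOMAIN-2"),
]
print(failure_scope(down))
```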

Java Garbage Collection Java GC

Java automatically reclaims heap memory that is no longer in use — this is garbage collection. The IC-relevant symptom is the stop-the-world (STW) pause: a brief period where the JVM halts every application thread to run GC. Under load, these pauses appear as periodic latency spikes (typically 200ms–2s) with no CPU, disk, or network cause visible in infrastructure monitoring. The JVM resumes normally after each pause. If heap is consistently near-full, GC runs more frequently and pauses grow longer, eventually causing a java.lang.OutOfMemoryError. Modern collectors (G1GC, ZGC) reduce pause duration, but insufficient heap or a memory leak will overwhelm any collector.
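
The "periodic spikes with no infrastructure cause" pattern can be checked roughly from a latency series. The sketch below is written in Python rather than Java and is only a heuristic: the 200ms to 2s window comes from the paragraph above, while the function names, the baseline, and the sample data are assumptions. GC logs remain the way to confirm.

```python
def gc_like_spikes(latencies_ms, baseline_ms, infra_alert_indexes=frozenset()):
    """Indexes of samples that sit 200ms to 2s above baseline with no infra alert."""
    return [
        i for i, ms in enumerate(latencies_ms)
        if 200 <= ms - baseline_ms <= 2000 and i not in infra_alert_indexes
    ]


def spike_intervals(spike_indexes):
    """Gaps between consecutive spikes; near-constant gaps suggest a periodic cause."""
    return [later - earlier for earlier, later in zip(spike_indexes, spike_indexes[1:])]


# Hypothetical per-second latency samples in milliseconds
samples = [40, 45, 900, 42, 41, 44, 850, 43, 40, 46, 1200, 41]
spikes = gc_like_spikes(samples, baseline_ms=40)
print(spikes, spike_intervals(spikes))
```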

Container Runtimes Beyond Docker

What is a container runtime? The low-level software that actually runs containers — it creates the isolated process, sets up namespaces and cgroups, and manages the container lifecycle. Docker is the most recognised but not the only option.

Why it matters as IC: Knowing which runtime is in use helps you read logs correctly and point to the right team. "docker ps" doesn't work if the environment uses containerd or CRI-O directly.

Podman — Near drop-in replacement for Docker. Daemonless (no background service required), supports rootless containers (runs without root), same CLI syntax. Used where Docker daemon is a security concern. Key difference: no daemon means no single point of failure; each container is a direct child process of the user.
containerd — Lightweight runtime originally extracted from Docker — Docker uses containerd under the hood. Kubernetes switched from dockershim to containerd directly in K8s v1.24. Minimal API, no CLI for end users. IC signal: in K8s environments post-1.24, container state is in containerd not Docker.
CRI-O — Built specifically for Kubernetes. Implements the Container Runtime Interface (CRI) so K8s can talk to it directly. Even more minimal than containerd. Common in OpenShift environments. IC signal: if the cluster uses OpenShift, the runtime is almost certainly CRI-O.
LXC / LXD — More like lightweight virtual machines than pure application containers. Each LXC container runs a full Linux userspace with init, systemd, and multiple processes — not just one application. Used for OS-level isolation rather than microservice packaging. Key difference: LXC feels like a VM; Docker feels like a process.
rkt (CoreOS Rocket) — Security-focused runtime. Now deprecated: CoreOS was acquired by Red Hat in 2018 and rkt was archived in 2019. Mentioned here for historical context; you may see it in older documentation.
Kubernetes + pluggable runtimes — K8s itself is not a container runtime; it is an orchestrator. It manages containers via the Container Runtime Interface (CRI), which lets you swap the underlying runtime (containerd, CRI-O, etc.) without changing how K8s works.

Quick decision rule for ICs:

Bare VM running a single app → likely Docker or Podman
Kubernetes cluster → containerd or CRI-O (not Docker since K8s v1.24)
OpenShift cluster → CRI-O
OS-level multi-process isolation → LXC/LXD

1 · What is a Container?

📦 The one-line definition

A container is just a process with its own mini-filesystem and dependencies — an isolated app + everything it needs to run, packaged together.

Core value: portability + consistency. The app behaves the same on any machine, any environment.

🥡 Lunchbox analogy

The food = your app
The ingredients = dependencies
The box = isolation

You can take it anywhere, and it's the same meal every time.

2 · How the Three Tools Fit Together

IC layer rule: Container crashing = Docker layer. Pod scheduling failing = Kubernetes layer. Servers/network missing = Terraform layer. Knowing which layer owns the problem points you to the right team immediately.

3 · What is Nginx?

📡 One-line definition

Nginx is a reverse proxy and web server that sits in front of your app — it receives every incoming HTTP/HTTPS request and decides where to send it.

🏨 Analogy

The hotel front desk. Every guest walks in; the desk decides who handles them — restaurant, concierge, housekeeping. No department needs its own front door.

4 · What is Gunicorn?

🍳 One-line definition

Gunicorn is a Python WSGI app server — it takes requests from Nginx and runs your Flask or Django app using a pool of worker processes (one request per worker at a time).

WSGI (Web Server Gateway Interface) is the standard protocol that lets Python web frameworks (Flask, Django) communicate with a server like Gunicorn. Think of it as the power socket shape — the framework plugs in, the server provides the socket, and they speak a common language regardless of which framework is used.

👨‍🍳 Analogy

The kitchen behind the hotel front desk. Nginx routes the order; Gunicorn processes it using N chefs (workers). If the kitchen is full or a chef takes too long — new orders back up and the front desk starts returning errors.

5 · What is Node.js?

⚡ One-line definition

Node.js is a JavaScript runtime that handles many concurrent connections using a single-threaded event loop — async I/O keeps it free for other requests, but CPU-heavy code blocks every user at once.

☕ Analogy

A single barista juggling many orders — they hand each order to the machine (async I/O) and move on. But if they have to stand and manually grind beans for 10 minutes (CPU work), every other customer waits.

6 · What is Flask?

🏕️ One-line definition

Flask is a lightweight Python WSGI microframework — it gives you URL routing and request handling only. No ORM, no admin panel, no auth built in. You add exactly what you need.

🥘 Analogy

A pop-up food stall. You get a table, a gas burner, and a knife — bring the rest yourself. Fast to set up, easy to keep simple, but you wire every component.

7 · What is Django?

🏭 One-line definition

Django is a batteries-included Python WSGI framework — ORM, admin panel, auth, and migrations come built in. More moving parts than Flask but faster to build standard features.

🍽️ Analogy

A commercial kitchen fully equipped — everything is there when you arrive. Faster to cook a full meal, but more equipment means more things that can break.

8 · How the Python Web Stack Fits Together

IC layer rule: 502/504 errors → check Nginx logs first. 500 errors on specific routes → check Gunicorn/app logs for traceback. Slow but not crashing → check for blocked event loop (Node.js) or slow DB query (Flask/Django). App won't start → check imports and env vars.

9 · Oracle Cloud Infrastructure Hierarchy

Region

Geographic area (e.g. uk-london-1). Fully isolated from other regions. A regional failure affects all ADs and FDs within it.

Ask: "Is this one geography or global?"

Availability Domain (AD)

Separate data centre within a region (1–3 per region). Own power and cooling. AD failure does not affect other ADs.

Ask: "Are affected services in the same AD?"

Fault Domain (FD)

Hardware grouping within an AD (3 per AD). Shared power circuit + top-of-rack switch. Failure affects only instances in that FD.

Ask: "Are all downed VMs in the same FD?"

Analogy & IC Lens

Region = city · AD = separate building in the city · FD = floor within the building.
A tripped circuit on one floor doesn't affect other floors or other buildings.

Scope first: Region → AD → FD. The level of the failure determines who you call and what options you have for recovery.

10 · Java GC

Stop-the-World Pause

The JVM briefly halts all threads to reclaim heap memory. Symptom: periodic latency spikes (200ms–2s), no CPU/disk/network cause, clean recovery after each spike.

IC signal: intermittent spikes with no infrastructure alert → ask if it's a Java service → suspect GC.

11 · Container Runtimes

What each runtime is used for

Docker — general-purpose app containers, best developer tooling
Podman — drop-in Docker replacement, daemonless, rootless mode — preferred where security posture matters
containerd — lightweight runtime used by Docker and by Kubernetes since v1.24 (replaced dockershim)
CRI-O — Kubernetes-native only, OpenShift default, minimal footprint
LXC / LXD — OS-level isolation, more like a lightweight VM than an app container
rkt — deprecated (CoreOS was acquired by Red Hat in 2018; rkt was archived in 2019)

IC decision rule

Bare VM / single app → Docker or Podman — use docker ps (or podman ps)
Kubernetes cluster (v1.24+) → containerd — use crictl ps
OpenShift cluster → CRI-O — use crictl ps
Multi-process OS isolation → LXC/LXD — use lxc list

Key difference from Docker: Podman has no daemon — each container is a direct child process of the user, so there is no central point of failure.
