Query Optimizer GPS choosing the route
What it does: Chooses how queries are executed. Decides indexes, join order, and access paths.
Problem in incident: Picks inefficient execution plan. Ignores indexes or misjudges data.
Effect (what you see): Gradual slowdown, queries pile up, CPU increases.
Technical effect:
- Full table scans instead of index lookups
- More rows processed than needed
- Increased CPU / disk I/O
- Connections held longer
What it means: System doing too much work per query. Inefficiency spreading across system. Can lead to saturation or connection exhaustion.
Analogy: GPS sends cars through small roads instead of highways.
Incident signals:
- Slow query logs increasing
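- db file sequential read waits rising (Oracle single-block reads); full table scans usually surface as db file scattered read
- Rising latency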
Key insight: The optimizer makes its decision automatically based on statistics. If stats are stale or data distribution has shifted, it can pick the wrong plan even when a good index exists — causing a sudden slowdown with no code change.
IC Questions: "Any slow queries?" / "What changed?" / "Are indexes being used?" / "Are statistics up to date?"
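A minimal sketch of the "Are indexes being used?" check, using SQLite from Python's standard library as a stand-in for the production database. The orders table, its columns, and the index name idx_orders_customer are hypothetical, and the plan wording differs between engines and versions.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the plan detail strings SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # column 3 holds the readable detail text

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total_amount) VALUES (?, ?)",
    [(i % 500, float(i)) for i in range(5000)],
)

sql = "SELECT * FROM orders WHERE customer_id = 42"
plan_without_index = query_plan(conn, sql)   # planner can only scan the table

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.execute("ANALYZE")                      # refresh the statistics the planner reads
plan_with_index = query_plan(conn, sql)      # planner can now search via the index

print(plan_without_index)
print(plan_with_index)
```

Comparing the two plans for the same query is the quickest way to confirm whether the optimizer is choosing the index or falling back to a scan.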
When Does an Index Lose Its Effectiveness? Library catalog
Core understanding: An index isn't "broken" — it becomes less useful when the optimizer decides it's no longer efficient. This happens due to fragmentation, poor selectivity, or outdated statistics.
What it does: Helps the database find data quickly.
Problem in incident: Index exists but queries are slow.
Effect (what you see): Slow queries, full table scans.
Technical effect:
- Fragmentation from frequent inserts/updates/deletes
- Statistics out of date
- Optimizer ignores index
What it means: Navigation system exists but is unreliable.
Analogy: Library catalog that's messy or outdated.
Incident signals: Full table scan, high read I/O.
IC Questions: "Has data changed recently?" / "Are indexes still used?"
Slow Queries & Indexing Road choice and quality
What it does: Determines how fast data is accessed.
Problem in incident: Missing indexes or inefficient queries.
Effect (what you see): Gradual slowdown, high CPU.
Technical effect:
- Full scans
- High CPU / I/O
- Increased query duration
What it means: System inefficiency under load. Can cascade into bigger issues.
Analogy: Cars using small roads instead of highways.
Incident signals:
- Slow query logs
- High CPU
- db file sequential read (Oracle wait event)
IC Questions: "Any slow queries?" / "Indexes being used?" / "Recent changes?"
Buffer Pool / Cache Hit Ratio City warehouse vs distant storage depot
What it does: The buffer pool (or buffer cache) holds frequently accessed data pages in memory so the DB can serve reads from RAM instead of disk.
Problem in incident: If the buffer pool is too small, or its pages are evicted under memory pressure, the DB must read from disk more often. This causes high read I/O and latency even when queries are efficient.
Effect (what you see): High disk read I/O, slow reads, elevated "physical reads" metric. Looks similar to a missing index but queries may have good plans.
Technical effect:
- Low cache hit ratio → frequent physical reads from disk
- Memory pressure → pages evicted before they can be reused
- Working set larger than available buffer pool
Key distinction from disk I/O bottleneck: Disk I/O bottleneck = disk can't keep up with demand. Buffer pool problem = too many requests hitting disk that could be served from memory.
Analogy: Warehouse runs out of stock — every request requires a trip to a distant depot instead of grabbing from the shelf.
Incident signals: Low cache hit ratio alert, high physical reads, memory utilisation high on DB host.
IC Questions: "What is the cache hit ratio?" / "Has memory pressure increased?" / "Has the working data set grown recently?"
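A toy simulation of why the hit ratio collapses once the working set outgrows the buffer pool. It assumes a plain LRU eviction policy and a repeated scan over the same pages; real engines use more refined policies, and the page counts are made up.

```python
from collections import OrderedDict

def lru_hit_ratio(pool_pages, page_accesses):
    """Replay page accesses through an LRU cache and return hits / total."""
    pool = OrderedDict()
    hits = 0
    total = 0
    for page in page_accesses:
        total += 1
        if page in pool:
            hits += 1
            pool.move_to_end(page)        # mark as most recently used
        else:
            pool[page] = True             # miss: counts as a physical read
            if len(pool) > pool_pages:
                pool.popitem(last=False)  # evict the least recently used page
    return hits / total if total else 0.0

def cyclic_scan(working_set_pages, passes):
    """Access pages 0..n-1 in order, repeated for several passes."""
    return [p for _ in range(passes) for p in range(working_set_pages)]

fits = lru_hit_ratio(100, cyclic_scan(80, 10))      # working set fits in the pool
too_big = lru_hit_ratio(100, cyclic_scan(120, 10))  # working set exceeds the pool
print(fits, too_big)
```

With the pool at 100 pages, the 80-page working set misses only on the first pass, while the 120-page one evicts each page just before it is needed again.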
Row Lock One lane blocked
What it does: Locks specific rows during updates.
Problem in incident: Long transactions hold locks.
Effect (what you see): Queries waiting, localised slowdown.
Technical effect:
- Other queries blocked on same rows
- Increased wait times
- Queue formation
What it means: One piece of work is blocking others. Can escalate if widespread.
Analogy: One lane closed due to accident.
Incident signals:
- enq: TX - row lock contention
- TX enqueue (mode 6)
- Queries waiting
Key insight: A write always blocks another write to the same row. Whether a write blocks a read depends on the database and isolation level: with MVCC, reads are typically not blocked; in lock-based engines they may wait. This distinction matters when diagnosing who is actually stuck.
IC Questions: "What's blocking?" / "Any long transactions?" / "Can we clear it?" / "Is this write-write or write-read contention?"
Deadlocks Two cars blocking each other at a junction
What it does: Two transactions each hold a lock the other needs, causing a circular wait that neither can resolve.
Problem in incident: Transactions freeze waiting on each other — the database must detect and kill one to break the cycle.
Effect (what you see): One transaction is rolled back with a deadlock error. Throughput drops if deadlocks are frequent.
Technical effect:
- T1 holds lock on Row A, wants Row B
- T2 holds lock on Row B, wants Row A
- DB deadlock detector kills one (the "victim") and rolls it back
Key distinction from row lock: Row lock contention is one-directional (one transaction waits for another). A deadlock is circular (both wait on each other). The DB resolves it automatically, but the rolled-back transaction may retry and hit the same deadlock again.
Analogy: Two cars at a narrow junction, each waiting for the other to reverse — neither can move until one backs down.
Incident signals: Deadlock errors in logs, rolled-back transactions, retry storms.
IC Questions: "Are deadlock errors in the logs?" / "Is the same pair of transactions involved?" / "Are retries making it worse?"
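A simplified sketch of how a deadlock can be spotted as a cycle in a "who waits for whom" graph. It assumes each transaction waits on at most one other; the function name and the T1/T2/T3 labels are illustrative, and real deadlock detectors are engine-specific.

```python
def find_deadlock(waits_for):
    """Return the transactions forming a wait cycle, or None if there is none.

    waits_for maps a transaction to the single transaction it is waiting on.
    """
    for start in waits_for:
        seen = []
        current = start
        # Walk the chain of waits until it ends or revisits a transaction.
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            return seen[seen.index(current):]  # the circular part of the chain
    return None

circular = find_deadlock({"T1": "T2", "T2": "T1"})  # each holds what the other wants
one_way = find_deadlock({"T3": "T1"})               # plain row lock wait, no cycle
print(circular, one_way)
```

The second call shows the distinction from ordinary row lock contention: a one-directional wait has no cycle, so nothing needs to be killed.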
Metadata Lock Entire road closed
What it does: Locks entire table structure.
Problem in incident: Schema change blocks all access.
Effect (what you see): Sudden freeze — queries pile up instantly.
Technical effect:
- All queries blocked waiting on metadata
- No progress despite low CPU
What it means: System is blocked, not overloaded. One operation is halting everything.
Analogy: Entire road shut down.
Incident signals:
- Queries stuck "waiting"
- Low CPU but high latency
IC Questions: "Any schema changes?" / "What's blocking?" / "Can we stop it?"
Locks & Contention Blocked roads and junctions
What it does: Controls access to shared data.
Problem in incident: Too many locks or long transactions.
Effect (what you see): Queries waiting — system appears stuck.
Technical effect:
- Blocking chains
- Increased wait times
- Throughput drops
What it means: Work is queued behind blockers. System not overloaded — just blocked.
Analogy: Traffic jam behind blocked road.
Incident signals:
- Lock wait alerts
- Waiting queries
IC Questions: "What's blocking?" / "How long?" / "Can we remove it?"
Long-Running Transactions A lorry blocking a side road for hours
What it does: A transaction that stays open much longer than normal, holding locks and resources throughout.
Problem in incident: Long transactions are a root cause that triggers several other issues — they hold row locks (blocking others), prevent log truncation (causing log growth), and inflate undo/rollback segments.
Effect (what you see): Depends on what the transaction is doing — could appear as row lock contention, log growth, or disk pressure rather than the transaction itself.
Technical effect:
- Holds row locks for extended period → blocks other transactions
- Prevents transaction log from being truncated → log grows
- Holds undo/rollback space → undo segment pressure
Key insight: Often invisible as a direct alert — you see the symptoms (lock waits, log growth) but must look for long-running transactions as the underlying cause.
Analogy: A lorry parked across a side road for hours — blocking everything behind it and preventing road crews from clearing the area.
Incident signals: Long transaction time in monitoring, lock waits, log growth, undo pressure.
IC Questions: "Any transactions open for an unusual length of time?" / "Is this causing lock waits or log growth?" / "Can it be safely rolled back?"
Redo Log / Transaction Log Traffic control recording every car movement
What it does: Records all changes for durability and recovery.
Problem in incident: Heavy write activity overwhelms logging. Logs become a bottleneck.
Effect (what you see): System slows under write load. Even simple operations delayed.
Technical effect:
- Increased disk writes
- Log flush contention
- Transactions slowed waiting for log writes
What it means: Write throughput is limiting performance. System can't commit changes fast enough. Risk of cascading slowdown.
Analogy: Cars must stop at a checkpoint before continuing.
Incident signals:
- High write latency
- Disk pressure
- Slow commits
IC Questions: "Is write volume high?" / "Any long transactions?" / "Is disk under pressure?"
Bottleneck in Transaction Log Single toll booth
Core understanding: All write operations must be recorded in the transaction log first. If the log can't keep up (slow disk or high write volume), everything slows down.
What it does: Ensures durability of writes.
Problem: Log becomes a bottleneck.
Effect (what you see): Slow transactions, connection buildup.
Technical effect:
- Log write delays
- Commit latency rises
What it means: Central write system is congested.
Analogy: Single toll booth causing traffic backup.
Incident signals: Log write waits, rising active sessions.
IC Questions: "Is disk slow?" / "Too many writes?"
Are Items Removed from Transaction Log? Black box recorder
Core understanding: Completed transactions are not immediately removed. The log keeps them until it is safe to reuse the space — after checkpoints and/or log backups, depending on system.
What it does: Stores transaction history for recovery.
Problem: Log keeps growing.
Effect (what you see): Disk pressure.
Technical effect:
- Entries retained until safe for recovery
- Space reused later (not deleted immediately)
What it means: Log space is reused in a controlled way rather than deleted as soon as a transaction completes.
Analogy: Black box recorder that overwrites old data later.
Incident signals: Log growth alerts.
IC Questions: "Are log backups running?" / "Any long transactions?"
Checkpoint vs Log Backup Unloading truck vs clearing warehouse
Core understanding: Checkpoint writes data pages to disk for recovery. Log backup allows the transaction log to reuse space. They solve different problems — using the wrong one won't fix the issue.
What it does:
- Checkpoint → flushes data pages to disk
- Log backup → frees log space for reuse
Problem: Log growing unexpectedly.
Effect (what you see): Disk issues despite checkpoints running.
Technical effect:
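- Where log backups are part of the recovery setup (e.g. SQL Server's full recovery model), a checkpoint alone does not free log space
- In that setup a log backup is required before the space can be reused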
What it means: Wrong tool applied to the problem.
Analogy: Unloading a truck (checkpoint) vs clearing the whole warehouse (log backup).
Incident signals: Log growth despite checkpoints running.
IC Questions: "Are log backups configured?" / "What recovery model (or equivalent logging mode) is set?"
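A deliberately simplified model of why a checkpoint does not shrink a growing log when log backups are what the system is waiting for. The position numbers (LSNs), the function name, and the "lowest of three" rule are illustrative assumptions, not any engine's exact algorithm.

```python
def log_reusable_up_to(last_checkpoint_lsn, last_log_backup_lsn, oldest_active_txn_lsn):
    """Toy model: log records below the returned position can be reused.

    The log can only be reused up to the slowest of the three markers.
    """
    return min(last_checkpoint_lsn, last_log_backup_lsn, oldest_active_txn_lsn)

# Log backups have stalled at position 200.
before = log_reusable_up_to(900, 200, 950)
# Running another checkpoint moves the checkpoint marker but frees nothing.
after_checkpoint = log_reusable_up_to(990, 200, 950)
# A log backup moves the real holdout; the oldest open transaction is now the limit.
after_log_backup = log_reusable_up_to(990, 980, 950)

print(before, after_checkpoint, after_log_backup)
```

The last value also shows why "Any long transactions?" belongs next to "Are log backups running?": once backups catch up, the oldest open transaction becomes the holdout.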
Database Connections / Connection Pooling Cars entering the city
What it does: Limits number of active DB connections.
Problem in incident: Too many connections or leaks.
Effect (what you see): Requests waiting or timing out.
Technical effect:
- Connection pool exhausted
- Requests queued before DB
- Threads blocked waiting
What it means: System can't accept more work. Often caused by slow queries or leaks.
Analogy: Cars queued at city entrance.
Incident signals:
- "Too many connections"
- Timeouts
- Low DB utilisation sometimes
IC Questions: "Are we at max connections?" / "Are connections released?" / "What's holding them?"
Connection Pathway + Redo Log Club capacity + slow bar
Core understanding: A client must connect before running queries. Write operations are logged first (redo/transaction log). If the system is slow, connections stay open longer and can hit limits.
What it does: Handles access and write durability.
Problem: Too many connections / slow commits.
Effect (what you see): Connection errors, requests rejected.
Technical effect:
- Flow: Client → Connect → Limit check → Query → Execute → Log
- Slow log → slow commits → connections pile up → limit hit
What it means: System saturated at entry or commit stage.
Analogy: Club at capacity with slow bar service — people can't get in or get stuck inside.
Incident signals: "Too many connections" error, rising active sessions.
IC Questions: "Are connections being released?" / "Where is the bottleneck?"
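A back-of-envelope sketch of the "slow log → connections pile up" chain using Little's Law (average connections in use ≈ request rate × time each request holds a connection). The request rate, hold times, and pool limit are made-up numbers for illustration.

```python
def connections_in_use(requests_per_second, seconds_held_per_request):
    """Little's Law estimate of the average number of busy connections."""
    return requests_per_second * seconds_held_per_request

POOL_LIMIT = 100  # hypothetical connection limit

normal = connections_in_use(200, 0.05)    # fast commits: about 10 connections busy
slow_log = connections_in_use(200, 0.6)   # slow commits: about 120 needed

print(normal, normal <= POOL_LIMIT)
print(slow_log, slow_log <= POOL_LIMIT)
```

Traffic is identical in both cases; only the time each request holds its connection changed, which is why the limit can be hit without any load spike.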
Query Timeout vs Connection Timeout Order taking too long vs never getting a table
What it does: Two different timeout types that produce similar-looking errors but have different causes and fixes.
Problem in incident: Teams often conflate them — treating a connection timeout like a slow query problem, or vice versa. Diagnosing the wrong one wastes time.
Technical effect:
- Query timeout: Connection was made, query started, but it ran too long — DB or app killed it. Cause: slow query, missing index, lock wait.
- Connection timeout: App could not get a connection within the time limit — never reached a query. Cause: pool exhausted, DB overloaded, network issue.
Key distinction:
- Query timeout → you got in, but service was too slow
- Connection timeout → you never got a table
Analogy: Query timeout = seated at a restaurant but your order never arrives. Connection timeout = no tables available, turned away at the door.
Incident signals: Error message wording — "query exceeded timeout" vs "connection timed out" / "could not acquire connection".
IC Questions: "What does the exact error say?" / "Did the connection succeed?" / "Is the pool full or are queries just slow?"
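A small sketch of triaging by error wording. The matched phrases are the examples quoted in the signals above; actual messages vary by driver and pool library, so the phrase list is an assumption to adapt.

```python
def classify_timeout(error_message):
    """Guess the timeout type from the wording of an error message."""
    text = error_message.lower()
    if "could not acquire connection" in text or "connection timed out" in text:
        return "connection timeout"   # never got a connection: look at pool / DB / network
    if "query exceeded timeout" in text:
        return "query timeout"        # connected, but the query ran too long
    return "unknown"                  # read the raw error before deciding

print(classify_timeout("ERROR: query exceeded timeout of 30s"))
print(classify_timeout("Could not acquire connection from pool within 5000 ms"))
```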
Temp Index Rebuild Road maintenance during rush hour
What it does: Rebuilds or reorganises indexes.
Problem in incident: Happens during peak load. Competes for resources.
Effect (what you see): Sudden slowdown, increased I/O and CPU.
Technical effect:
- Heavy disk usage
- Temporary space consumption
- Increased contention with live queries
What it means: Background work is stealing capacity from production traffic. Can trigger wider performance issues.
Analogy: Roadworks reducing available lanes.
Incident signals:
- Maintenance job running
- "tablespace is full" (possible)
- Disk spikes
Key insight: Rebuilding creates a new index alongside the old one before swapping — temporarily doubling the storage needed. Disk full alerts during maintenance are often this, not a general storage leak.
IC Questions: "Any maintenance running?" / "Can we pause it?" / "Is disk space OK?" / "Was disk headroom checked before the job started?"
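A rough pre-flight check for the headroom question, assuming the rebuild needs space for a second full copy of the index. The function name and the 1.2 safety margin are hypothetical choices, not a vendor rule.

```python
def rebuild_fits(index_size_gb, free_disk_gb, safety_margin=1.2):
    """True if free space covers a second copy of the index plus a margin."""
    return free_disk_gb >= index_size_gb * safety_margin

print(rebuild_fits(index_size_gb=50, free_disk_gb=100))  # enough headroom
print(rebuild_fits(index_size_gb=50, free_disk_gb=55))   # copy fits, margin does not
```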
Resource Saturation (CPU / Disk / Memory) City at full capacity
What it does: Provides compute and storage resources.
Problem in incident: System exceeds capacity.
Effect (what you see): Everything slows — no single clear cause.
Technical effect:
- CPU maxed → slow processing
- Disk maxed → slow reads/writes
- Memory pressure → less caching
What it means: System overloaded. Needs load reduction or scaling.
Analogy: Entire city overwhelmed with traffic.
Incident signals:
- High CPU / disk
- System-wide latency
IC Questions: "Which resource is maxed?" / "Load spike or inefficiency?" / "Can we reduce load?"
Replication Lag Branch office receiving yesterday's updates
What it does: Changes written to the primary database are replicated to read replicas, usually with a small delay.
Problem in incident: Lag grows — reads from replicas return stale data. Users see outdated results or inconsistencies.
Effect (what you see): Data appears to "go backwards" or users see different data depending on which replica they hit. May look like a bug rather than an infrastructure issue.
Technical effect:
- Primary processes writes faster than replica can apply them
- Replica falls behind — lag measured in seconds or minutes
- Reads routed to replica return old data
Common causes: Heavy write load on primary, slow replica disk, long-running queries on replica blocking apply, network issues.
Analogy: Head office sends updates daily — branch office is working from yesterday's data.
Incident signals: Replication lag metric rising, user reports of stale data, replica behind primary by N seconds.
IC Questions: "What is current replica lag?" / "Are reads being routed to replicas?" / "Is write load on primary spiking?" / "Can we route reads to primary temporarily?"
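A minimal sketch of the "route reads to primary temporarily" decision. The function name and the 5-second threshold are assumptions; the acceptable lag depends on how stale the application can tolerate its reads being.

```python
def choose_read_target(replica_lag_seconds, max_acceptable_lag_seconds=5.0):
    """Pick where to send reads based on the current replica lag."""
    if replica_lag_seconds is None:
        return "primary"              # lag unknown: treat the replica as unsafe
    if replica_lag_seconds <= max_acceptable_lag_seconds:
        return "replica"
    return "primary"                  # replica too far behind, reads would be stale

print(choose_read_target(0.4))
print(choose_read_target(120))
print(choose_read_target(None))
```

Sending reads to the primary adds load there, so this is a temporary mitigation while the cause of the lag is addressed.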
Database Wallet Secure key locker
What it does: A database wallet is a secure store for credentials, certificates, and encryption keys. Applications and databases retrieve passwords and keys from the wallet instead of having them exposed in plain-text config files or code.
Problem in incident: Wallet missing, corrupted, or inaccessible; wrong file permissions; expired certificates; config pointing to the wrong wallet path.
Symptoms:
- Apps suddenly can't connect to the database
- Authentication failures spike — often immediately after a deploy
- Services fail on startup or restart
Technical effect: The system can't retrieve credentials or encryption material, so DB connections fail, TLS/SSL handshakes may fail, and authentication breaks even if the underlying credentials are correct.
What it means (IC interpretation): Likely a misconfiguration or dependency failure — not load-related. Often triggered by deployments, certificate rotation, or permission changes. The credentials themselves may be fine; it's access to them that has broken.
Analogy: A secure key locker for delivery drivers. Drivers (apps) don't carry keys themselves — they go to the locker to pick them up before each delivery. If the locker is locked, broken, or empty, no deliveries happen regardless of whether the drivers are available.
Incident signals: "Authentication failed" · "Cannot load wallet" · "Permission denied" · "SSL handshake failed" · Spike in connection errors immediately after a deploy
IC questions: Did anything change recently (deploy, config, cert rotation)? Is the wallet file path accessible from the service? Are file permissions correct? Has anything expired (certs/keys)? Is this affecting all services or just one?
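A small sketch of the first two wallet checks (does the path exist, can this process read it). It only inspects the filesystem and says nothing about certificate expiry or wallet contents; the file names are hypothetical and a temporary directory is used so the example runs anywhere.

```python
import os
import tempfile

def wallet_path_problems(wallet_path):
    """Return a list of basic access problems for a wallet path (empty if none)."""
    problems = []
    if not os.path.exists(wallet_path):
        problems.append("path does not exist")
    elif not os.access(wallet_path, os.R_OK):
        problems.append("path is not readable by this process")
    return problems

with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "wallet_file")        # hypothetical wallet file
    open(present, "w").close()
    ok_result = wallet_path_problems(present)
    missing_result = wallet_path_problems(os.path.join(tmp, "no_such_wallet"))

print(ok_result, missing_result)
```

Run the check as the same user the service runs as, since permissions are evaluated for the calling process.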
Incident Chain How it all connects
Undo & Read Consistency (RAC) Old maps for drivers
Core understanding: Oracle lets readers see a consistent past version of data using undo, even while writes are happening. In RAC, this consistency must work across multiple nodes, which adds coordination overhead.
What it does:
- Stores before-images of data (undo)
- Lets queries read a stable snapshot
- Prevents read/write blocking
Problem in incident: Undo too small or overwritten; long queries need old data that no longer exists; RAC adds delay due to cross-node access.
Effect (what you see): "Snapshot too old" query failures; sudden query slowdowns; intermittent errors on long-running reports.
Technical effect: Required undo data no longer available, or slow retrieval across RAC nodes.
What it means: Capacity issue (undo too small) or workload mismatch (long queries vs high churn). In RAC, could also be inter-node latency.
Analogy: Cars (queries) need a map of the road from 5 minutes ago. Old maps (undo) keep getting thrown away. If the map is gone, the driver gets lost — query fails.
Incident signals: "snapshot too old" errors; long-running queries failing; spikes in undo usage; RAC: interconnect latency warnings.
IC Questions: Are queries long-running? Has data change rate increased? Any recent batch jobs? Is this happening across all RAC nodes or one?
Memory Architecture (SGA/PGA, RAC) Kitchens with shared fridges
Core understanding: Oracle uses memory to cache data and speed up queries. In RAC, each node has its own memory but must share data via interconnect — the "pinging" problem.
What it does:
- SGA = shared memory (data cache, SQL cache)
- PGA = per-session memory
- Reduces disk I/O by caching hot data
Problem in incident: Memory pressure (too many queries); cache inefficiency; RAC blocks constantly moving between nodes.
Effect (what you see): High latency; high CPU; slow queries across cluster; sudden performance degradation.
Technical effect: Cache misses lead to more disk reads; RAC block transfer overhead between nodes ("gc" waits).
What it means: Resource contention (memory/CPU) or bad workload distribution across RAC. Often: too many queries, poor query patterns, or hot blocks bouncing between nodes.
Analogy: Each RAC node is a separate kitchen with its own fridge. If a chef needs something from another kitchen, they must run across the street. Too much running = everything slows down.
Incident signals: High CPU; high memory usage; RAC interconnect traffic spikes; "buffer busy waits" / "gc" waits.
IC Questions: Is load evenly distributed across nodes? Any spike in query volume? Are specific queries dominating? Is one node worse than others?
Undo + Memory Interaction (RAC) Bridge congestion + roadworks
Core understanding: Undo and memory work together to serve consistent reads quickly. In RAC, this may involve remote memory access between nodes — heavy writes and long reads colliding causes compounding pressure.
What it does:
- Memory serves cached data quickly
- Undo reconstructs older versions for consistency
- RAC shares both mechanisms across nodes
Problem in incident: Heavy writes + long reads + RAC traffic causes simultaneous contention and latency.
Effect (what you see): Cluster-wide slowdown; queries inconsistent in performance; timeouts; mixed symptoms (CPU + latency + errors).
Technical effect: Undo reconstruction + memory contention happening at the same time; inter-node block transfers compound both.
What it means: System under stress — multiple subsystems interacting badly. Often triggered by batch jobs or reporting running alongside heavy writes.
Analogy: Cars need old maps (undo). Roads are busy (writes). Cities are connected by bridges (RAC). Too many cars crossing bridges + changing roads = gridlock.
Incident signals: Mixed symptoms (CPU + latency + errors); RAC interconnect spikes; query variability; undo errors alongside memory pressure.
IC Questions: What changed? (batch job, release) Is this cluster-wide? Are reads and writes colliding at the same time?
Seeded Reports City-wide traffic map
Core understanding: A seeded report is a pre-built, default report that ships with a system. Designed for common use cases — not tailored to your specific environment or incident needs.
What it does: Provides standard visibility into data (performance, usage, sales) without requiring a custom build.
Problem in incident: Seeded reports often lack the detail, speed, or focus needed during an active incident.
Effect (what you see):
- Missing key data you need right now
- Reports too slow to load
- Data feels generic — "nothing looks wrong"
- Teams say "the report looks fine" but users are impacted
Technical effect: Queries are broad and inefficient; not optimised for real-time debugging; may miss critical filters or dimensions (specific customer, query, endpoint).
What it means (IC interpretation): Observability gap. You're relying on generic tooling instead of targeted insight — this slows decision-making and prolongs the incident.
Analogy: A city-wide traffic map. It shows "traffic looks normal overall" — but your incident is a single blocked lane on one street. You need a zoomed-in camera, not a general map.
Incident signals:
- "Dashboard shows normal but users report slowness"
- "Report takes too long to generate"
- "No visibility into specific query / user / service"
- Conflicting statements between teams
IC Questions: "Do we have a more granular or real-time view?" / "Can we filter to affected users or endpoints?" / "Is this report cached or delayed?" / "Who can run a targeted query or log search instead?"
Real-world example — Top Customers Report: A classic seeded report you'll find pre-installed in many systems:
SELECT
    customer_id,
    SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;
This query shows your top 10 customers by spending — a common business report that ships by default. It's useful day-to-day, but during an incident it tells you almost nothing: it doesn't filter by time window, affected region, or error type. You'd need a targeted query scoped to the problem instead.
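For contrast, a sketch of what a targeted query might look like, run against an in-memory SQLite table from Python. The region, status, and created_at columns, the sample rows, and the incident window are all invented for illustration; the real schema will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, total_amount REAL, "
    "region TEXT, status TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 20.0, "eu-west", "ok",     "2024-05-01 09:58:00"),
        (2, 35.0, "eu-west", "failed", "2024-05-01 10:02:00"),
        (2, 35.0, "eu-west", "failed", "2024-05-01 10:03:00"),
        (3, 15.0, "us-east", "failed", "2024-05-01 10:04:00"),
        (4, 50.0, "eu-west", "ok",     "2024-05-01 10:05:00"),
    ],
)

targeted_sql = """
SELECT customer_id, COUNT(*) AS failed_orders
FROM orders
WHERE created_at >= ?      -- start of the incident window
  AND region = ?           -- affected region only
  AND status = 'failed'    -- the error type under investigation
GROUP BY customer_id
ORDER BY failed_orders DESC
"""
rows = conn.execute(targeted_sql, ("2024-05-01 10:00:00", "eu-west")).fetchall()
print(rows)
```

The three WHERE conditions are exactly the filters the seeded report lacks: time window, affected region, and error type.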
Where seeded reports appear:
- ERP systems (Oracle, SAP) — pre-built operational reports
- CRM tools — customer activity and pipeline summaries
- Internal dashboards — aggregate health views used by on-call
- BI tools (connected to MySQL / Postgres) — standard metric views
Symptom → Diagnosis
| Stage | What can go wrong |
| TCP connect | Network issue, firewall, DB down |
| Authentication | Wrong creds, wallet inaccessible, cert expired |
| Session / pool | Pool exhausted → connection timeout |
| max_connections | Too many open sessions → rejected requests |
| Optimize | Stale stats → bad plan → full table scan |
| Execute | Lock wait, missing index, slow query |
| Redo log | Disk bottleneck → slow commits → sessions pile up |
| Close / release | Connection leak → pool never freed |
- Connection issue? Check pool exhaustion, "Too many connections" error
- Auth issue? Recent deploy? Wallet path / certs / permissions changed?
- Slow query? Slow query log on? Indexes being used? EXPLAIN output?
- Blocked? Long transaction holding locks? Schema change running?
- Write lag? Disk I/O high? Redo log flush contention?
- Resource? CPU / Disk / Memory — which one is maxed?
Cache hit (grabbing from shelf): Data already in RAM. Served instantly, no disk involved. Cache hit ratio >99% is healthy; below 95% is a warning sign.
Cache miss (trip to distant warehouse): Data not in RAM, so it must be read from disk. 10–100× slower. Looks like slow queries even with good plans.
Cause: Working set larger than buffer pool, or memory pressure evicting pages.
- Cache hit ratio dropping?
- Memory utilisation high on DB host?
- Has working data set grown?
- Buffer pool size recently reduced?
- Fragmentation — inserts/updates/deletes scatter pages
- Stale statistics — optimizer misjudges row count, picks wrong plan
- Poor selectivity — column has few unique values (e.g. status Y/N)
- Function on column: WHERE YEAR(date)= bypasses index
In EXPLAIN output, type: ALL = full scan; type: ref/range = index used.
- Check slow query log for offenders
- Run EXPLAIN — identify full scans
- Are statistics up to date?
- Has data volume grown recently?
- Any code deploy or query change?
- Is an index rebuild running (competing I/O)?
Row lock: Locks specific rows during an update. Others needing the same rows must wait. A write always blocks another write to the same row. MVCC prevents read blocks in most DBs.
Deadlock: T1 holds Row A, wants Row B. T2 holds Row B, wants Row A. Circular, so neither moves. DB kills one (the "victim"). May trigger retry storm.
Metadata lock: DDL (ALTER TABLE) locks the entire table structure. All queries queue instantly. CPU stays low: blocked, not overloaded.
BLOCKING CHAIN — how one transaction freezes the system
| What it records | Every write before commit |
| Why it exists | Durability — recover after crash |
| Bottleneck sign | High write latency, slow commits |
| Cascade effect | Slow log → slow commits → pool fills |
| Long transactions | Hold log space — prevent truncation → log grows |
| Checkpoint | Flushes data pages to disk for crash recovery |
| Log backup | Frees log space for reuse |
| Common mistake | Running checkpoint when log grows — won't help |
| Fix for log growth | Run log backup, kill long transactions |
Pool exhaustion: All slots taken. New requests queue then time out. DB may not be overloaded, just at its connection limit.
Connection leak: Connections opened but never closed. Pool slowly fills. Triggered by app restarts or error paths that skip close().
| Query timeout | Got in, query too slow, killed |
| Conn timeout | Never got a slot, rejected |
- At max_connections?
- Slow queries holding slots?
- Connection leak suspected?
- Can app layer restart to release?
A single root cause often triggers a cascade. Recognising the chain tells you where to intervene.
- Long transaction → row locks → blocking chain → throughput drops
- Disk I/O saturation → redo log slow → commits slow → pool fills
- Schema change (MDL) → instant table lock → all queries queue
- Index rebuild at peak → doubles disk I/O → slow queries → cascade above
Replication lag: Heavy primary writes outpace the replica's apply speed. Reads return stale data, which looks like a bug, not infrastructure.
Resource saturation: CPU, disk, and memory all hitting limits simultaneously, so everything degrades with no single clear cause.
| CPU >90% | Query processing starved |
| Disk I/O >85% | All reads/writes slow |
| Memory >85% | Buffer pool evicted → more disk reads |
- Identify scope — all users or subset? One service?
- Check what changed — deploy, migration, job, config?
- Blocked vs overloaded? — low CPU + waits = blocked; high CPU = overloaded
- Find the head of the chain — what is T1 / the root blocker?
- Kill or pause — remove the blocker; monitor for recovery
- Root cause, not symptom — so it doesn't immediately recur
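The "blocked vs overloaded" step can be sketched as a simple rule. The 90% CPU threshold echoes the saturation table above; the waiting-session threshold of 20 and the function name are hypothetical and would need tuning per system.

```python
def classify_state(cpu_percent, waiting_sessions, cpu_high=90, waits_high=20):
    """Rough first-pass triage: is the database overloaded or just blocked?"""
    if cpu_percent >= cpu_high:
        return "overloaded"   # high CPU: reduce load or scale
    if waiting_sessions >= waits_high:
        return "blocked"      # low CPU + many waiters: find the head blocker
    return "unclear"          # neither signal is strong, keep investigating

print(classify_state(cpu_percent=15, waiting_sessions=140))
print(classify_state(cpu_percent=97, waiting_sessions=30))
```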
| What to check | MySQL command |
| Active sessions | SHOW PROCESSLIST |
| InnoDB locks | SHOW ENGINE INNODB STATUS |
| Query plan | EXPLAIN SELECT ... |
| Kill session | KILL [process_id] |
| Replication lag | SHOW REPLICA STATUS |
| Slow query log | SHOW VARIABLES LIKE 'slow%' |
DNS Record Types Contact list with routing rules
Core understanding: DNS isn't just "name → IP." It stores different record types that control where traffic goes and how services are discovered.
What it is: A distributed directory with multiple record types, each serving a different routing purpose.
Key records:
- A → domain → IPv4 (most common)
- AAAA → domain → IPv6
- CNAME → alias (domain points to another domain)
- MX → mail routing
- TXT → verification / policies (SPF, DKIM)
- NS → which DNS servers are authoritative
Problem in incident: Wrong IP in A record · broken CNAME chain · missing or incorrect records
Effect (what you see): Users routed to wrong server · partial outages · some services work, others fail
Technical effect: DNS resolves — but to the wrong destination
What it means: Misconfiguration, not outage — traffic is flowing, but incorrectly
Analogy: Contact list with wrong phone numbers or forwarding rules
Incident signals:
- Traffic hitting wrong servers
- Sudden shift in traffic patterns
- "It works for some domains but not others"
IC questions: "What record changed?" / "Are we resolving to the expected IP?" / "Is there a CNAME chain involved?"
Pattern: Traffic going somewhere wrong → think DNS misconfiguration
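A toy resolver that follows a CNAME chain through an in-memory record table, to show how a broken chain or a wrong A record surfaces. The zone contents are invented (example.com names and a documentation-range IP); a real check would query DNS rather than a dictionary.

```python
def resolve(name, records, max_hops=8):
    """Follow CNAME records until an A record is found; return its IPv4 address."""
    seen = []
    current = name
    while len(seen) <= max_hops:
        if current in seen:
            raise LookupError("CNAME loop: " + " -> ".join(seen + [current]))
        seen.append(current)
        if current not in records:
            raise LookupError("broken chain, no record for " + current)
        record_type, value = records[current]
        if record_type == "A":
            return value
        current = value  # CNAME: keep following the alias
    raise LookupError("too many CNAME hops")

zone = {
    "www.example.com": ("CNAME", "app.example.com"),
    "app.example.com": ("A", "192.0.2.10"),
    "old.example.com": ("CNAME", "gone.example.com"),  # target has no record
}

expected_ip = "192.0.2.10"
print(resolve("www.example.com", zone) == expected_ip)
```

Comparing the resolved address with the expected one answers "Are we resolving to the expected IP?", while the LookupError cases correspond to a broken or looping CNAME chain.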
TTL & Propagation Old maps still in circulation
Core understanding: DNS changes are not instant — TTL (Time To Live) controls how long old answers stay cached by resolvers across the internet.
What it does: TTL determines how long a resolver caches a DNS answer before it re-queries the authoritative server.
Problem in incident: Old records still cached · some users see new config, others see old
Effect (what you see): "Works for me but not others" · gradual recovery · region-dependent behaviour
Technical effect: Different resolvers return different answers — inconsistent global state
What it means: Not a failure — the change is still propagating. Expected behaviour after a DNS update.
Analogy: Old maps still being used while new maps are being distributed
Incident signals:
- Mixed behaviour across regions or users
- Gradual improvement over time after a DNS change
- "Some users fixed, others still broken"
IC questions: "What is the TTL?" / "When was the change made?" / "Are caches cleared?"
Pattern: Inconsistent behaviour after a DNS change → think TTL propagation delay
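A quick calculation for "when should everyone see the new record?", assuming resolvers honour the TTL that was on the old record at the time of the change. The timestamps and the one-hour TTL are example values.

```python
from datetime import datetime, timedelta

def caches_expired_by(change_time, old_ttl_seconds):
    """Latest time a TTL-respecting resolver could still serve the old answer."""
    return change_time + timedelta(seconds=old_ttl_seconds)

changed = datetime(2024, 5, 1, 10, 0, 0)
deadline = caches_expired_by(changed, old_ttl_seconds=3600)
print(deadline)
```

If users are still hitting the old destination well past this time, the cause is likely something other than ordinary TTL caching.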
TCP vs UDP Registered mail vs postcards
Core understanding: TCP and UDP are two transport protocols — reliable vs fast. Knowing which one your traffic uses changes how you diagnose failures.
TCP (Transmission Control Protocol): Reliable, ordered, connection-based · used by HTTP/S, MySQL · retries automatically · guaranteed delivery
UDP (User Datagram Protocol): Fast, no guarantees, connectionless · used by DNS, streaming, VoIP · sends and forgets — no retry built in
Problem in incident:
- TCP: congestion, connection limits, slow under load
- UDP: silent drops, hard-to-detect failures, no error trail
Effect (what you see): TCP issues → timeouts, slow apps · UDP issues → intermittent failures, missing responses
What it means: TCP problems = congestion or capacity · UDP problems = loss or instability
Analogy: TCP = registered mail (guaranteed delivery) · UDP = postcards (fast but may get lost)
Incident signals:
- TCP: high latency, connection timeouts
- UDP: missing responses, intermittent failures, no error logs
IC questions: "Is this TCP or UDP traffic?" / "Do we see retries or silent drops?" / "Is reliability or speed more critical?"
Pattern: Silent failures with no error logs → think UDP packet loss
TCP Handshake & Connection Lifecycle Knocking on a door that won't answer
Core understanding: Before any data flows, TCP must establish a connection via a 3-step handshake. If this fails, no requests can be processed at all.
The handshake: SYN → SYN-ACK → ACK
Problem in incident: Handshake fails or is delayed · SYN queue fills up · server cannot accept new connections
Effect (what you see): Connection timeouts · users can't connect · errors appear before any request is sent
Technical effect: Entry point is saturated — the problem is at the door, not inside the application
What it means: Often load-related or an attack — not an application bug
Analogy: Knocking on a door but no one answers — the house is overwhelmed before anyone can get inside
Incident signals:
- SYN backlog warnings
- High connection attempt counts
- Timeouts before any request data is exchanged
IC questions: "Are connections failing before requests?" / "Is the SYN queue full?" / "Is this a traffic spike or an attack?"
Pattern: Fails before any request is processed → think TCP handshake saturation
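One way to answer "Are connections failing before requests?" is to attempt only the handshake and see how it ends. The sketch below uses Python's standard socket module. The function name and the three labels are illustrative assumptions, and the mapping from outcome to cause is a first guess, not proof.

```python
import socket

def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP handshake and report how it ended.

    "connected" -> handshake completed; look further up the stack
    "refused"   -> the host answered with a reset; nothing accepted the connection
    "timeout"   -> no answer at all; consistent with a full SYN queue or a silent drop
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError as exc:
        # DNS failure, unreachable network, and similar errors land here.
        return f"error: {exc}"
```

No request data is sent, so a "timeout" result here points at the door rather than the application behind it.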
Retransmissions & Congestion Traffic jam where cars keep re-entering
Core understanding: When TCP packets are lost, they are automatically retransmitted. Under high load, this creates a congestion feedback loop — more retransmits = more traffic = worse congestion.
What it does: TCP guarantees delivery by resending lost packets — but each resend adds to overall traffic load.
Problem in incident: High retransmission rate · congestion builds · performance degrades progressively under sustained load
Effect (what you see): Slow responses · latency climbing · throughput dropping under load
Technical effect: More traffic → more loss → more retransmits → worse performance (self-reinforcing loop)
What it means: Network degradation spiral — not a full outage, but worsening performance under load
Analogy: Traffic jam where cars keep re-entering — clearing gets harder the more vehicles try to pass
Incident signals:
- Retransmission rate climbing
- Latency increasing over time
- Throughput dropping under load
IC questions: "Are retransmissions increasing?" / "Is packet loss present?" / "Where is the congested link?"
Pattern: Progressive slowdown under load + rising retries → think TCP congestion loop
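"Are retransmissions increasing?" can be answered by sampling two cumulative counters twice and comparing the deltas. The sketch below assumes counters like the segments-sent and segments-retransmitted totals that `netstat -s` reports on Linux. The function name and sample numbers are hypothetical.

```python
def retransmit_rate(sent_before: int, retrans_before: int,
                    sent_after: int, retrans_after: int) -> float:
    """Fraction of segments retransmitted between two counter samples."""
    sent = sent_after - sent_before
    retrans = retrans_after - retrans_before
    if sent <= 0:
        # No traffic in the window (or the counters were reset).
        return 0.0
    return retrans / sent

# Hypothetical samples taken one minute apart.
print(retransmit_rate(1_000_000, 2_000, 1_100_000, 7_000))  # 0.05
```

A single reading says little. The signal is the trend: a rate that keeps climbing across successive windows matches the feedback loop described above.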
Kafka Model Multi-lane highway
Core understanding: Kafka is a distributed message bus. Producers write to topics, which are split into partitions for parallelism. Consumer groups read partitions independently — each partition is owned by one consumer in the group at a time.
Key concepts:
- Producer — publishes messages to a topic
- Topic — a named stream, split into partitions for throughput
- Partition — ordered log; one consumer per group handles each partition
- Consumer Group — consumers sharing the work; each partition assigned to one member
- Offset — the consumer's position in the log; tracks how far behind it is
- Broker — server holding partitions; one broker per partition acts as leader
Analogy: Multi-lane highway — messages are cars, partitions are lanes, consumer groups are independent fleets. A blocked lane affects only the consumers using it.
IC relevance: Kafka sits between services. Problems here cause downstream processing to stop silently — no application errors until the queue backs up visibly. Always check lag metrics before assuming the consuming app is healthy.
Consumer Group Lag Falling behind on the highway
What it is: The gap between the latest message written to a partition and where the consumer has read to. Lag = unconsumed messages accumulating.
Signals:
- Lag metric rising continuously
- Consumers appear healthy but processing is slow
- Downstream services receive events late or in bursts
- Alerts on consumer_group_lag or records_lag
Common causes: Slow consumer processing logic · insufficient consumer instances · a stuck or crashed consumer holding a partition · sudden producer spike
IC actions:
- Check lag metrics per consumer group and per partition — is it one partition or all?
- Identify stuck or slow consumers — is one consumer responsible?
- Scale out consumers (more instances = more partitions processed in parallel, up to one consumer per partition)
- Determine trend: lag growing, stable, or recovering?
Pattern: Lag growing + consumers healthy → slow processing logic or stuck consumer. Lag spike + producer spike → transient burst, may self-recover. Lag on one partition only → single consumer issue.
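The lag definition and the "one partition or all?" check can be sketched as follows. This assumes you already have the latest offset and the group's committed offset for each partition, for example from your monitoring system. The function names, threshold, and sample offsets are hypothetical.

```python
from typing import Dict

def partition_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = latest written offset minus the consumer's committed offset.

    Assumption: a partition with no committed offset is treated as fully unread.
    """
    return {p: max(end - committed.get(p, 0), 0) for p, end in end_offsets.items()}

def lag_scope(lag: Dict[int, int], threshold: int) -> str:
    """Answer the first IC question: is it one partition or all?"""
    lagging = [p for p, n in lag.items() if n > threshold]
    if not lagging:
        return "healthy"
    if len(lagging) == len(lag):
        return "all partitions"
    if len(lagging) == 1:
        return "single partition"
    return "some partitions"

lag = partition_lag({0: 100, 1: 250, 2: 90}, {0: 100, 1: 50, 2: 88})
print(lag)                 # {0: 0, 1: 200, 2: 2}
print(lag_scope(lag, 10))  # single partition
```

A "single partition" result points at the one consumer that owns it. An "all partitions" result points at processing speed or producer volume.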
Broker & Partition Failure Lane closure
What it is: Each partition has a leader broker. If that broker fails, partition leadership must be re-elected before producers and consumers can resume on those partitions.
Signals:
- Producer errors: LEADER_NOT_AVAILABLE or NOT_LEADER_FOR_PARTITION
- Consumers stop receiving messages on affected partitions
- Alert on under-replicated partitions (should always be 0 in steady state)
- Broker removed from cluster health view
Common causes: Broker disk full · broker OOM or crash · network partition isolating a broker · replication factor too low (no replica to elect)
IC actions:
- Check broker health across all nodes in the cluster
- Check under-replicated partition count — non-zero means data risk
- Allow Kafka to auto-elect a new partition leader (usually seconds)
- Investigate root cause on the failed broker before bringing it back
Pattern: Partial message loss or processing gap → broker failure. Under-replicated partitions → replication issue or broker degraded. Full topic unavailability → every in-sync replica for its partitions has been lost.
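The "under-replicated" and "no leader" checks reduce to comparing three fields per partition. The sketch below uses a hypothetical list of dictionaries, loosely modelled on the partition, leader, replicas, and ISR fields that Kafka's topic description reports. The data shape and function names are assumptions for illustration.

```python
from typing import Dict, List

def under_replicated(partitions: List[Dict]) -> List[int]:
    """Partitions whose in-sync replica set is smaller than the assigned replica set."""
    return [p["partition"] for p in partitions if len(p["isr"]) < len(p["replicas"])]

def leaderless(partitions: List[Dict]) -> List[int]:
    """Partitions with no current leader; producers and consumers are blocked on these."""
    return [p["partition"] for p in partitions if p["leader"] is None]

# Hypothetical topic state after broker 2 has failed.
state = [
    {"partition": 0, "leader": 1, "replicas": [1, 2, 3], "isr": [1, 3]},
    {"partition": 1, "leader": 3, "replicas": [3, 1, 2], "isr": [3, 1]},
    {"partition": 2, "leader": None, "replicas": [2], "isr": []},
]
print(under_replicated(state))  # [0, 1, 2]
print(leaderless(state))        # [2]
```

Partition 2 in this made-up state shows the "replication factor too low" cause: its only replica lived on the failed broker, so there is nothing to elect.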
RabbitMQ Model Postal sorting office
Core understanding: RabbitMQ is a message broker using a push model. Producers publish to an exchange, which routes messages to queues based on binding rules. The broker then pushes messages from each queue to its subscribed consumers. Unlike Kafka, messages are deleted once acknowledged, so there is no persistent log.
Key concepts:
- Producer — publishes messages to an exchange with a routing key
- Exchange — routes messages to queues based on type and binding key
- Queue — holds messages until a consumer processes and acknowledges them
- Consumer — connects to a queue, processes messages, sends ACK to remove them
- Dead-Letter Queue (DLQ) — receives messages that fail, expire, or are rejected
- Prefetch — how many unacknowledged messages a consumer can hold at once
Exchange types: Direct — exact key match · Fanout — broadcast to all bound queues · Topic — wildcard pattern match · Headers — match on message attributes
Analogy: Postal sorting office — producer drops a parcel (message) with an address label (routing key). The sorting machine (exchange) reads the label and drops it in the right bin (queue). The delivery driver (consumer) collects from the bin and signs for it (ACK). Failed deliveries go to the returns pile (DLQ).
IC relevance: Problems show as queue depth growing, DLQ filling, or consumer connections dropping. The exchange layer is invisible to most monitoring — routing misconfigurations silently send messages to the wrong queue.
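Because the exchange layer is hard to observe, it can help to reason about routing offline. The sketch below imitates direct, fanout, and topic matching in plain Python. It assumes the standard topic wildcards, where `*` stands for exactly one dot-separated word and `#` for zero or more. It is an illustration only, not RabbitMQ's implementation; headers exchanges are left out, and the binding and queue names are hypothetical.

```python
from typing import List, Tuple

def topic_matches(binding_key: str, routing_key: str) -> bool:
    """Topic-exchange style match: '*' = exactly one word, '#' = zero or more words."""
    pattern = binding_key.split(".")
    words = routing_key.split(".") if routing_key else []

    def match(pi: int, wi: int) -> bool:
        if pi == len(pattern):
            return wi == len(words)
        if pattern[pi] == "#":
            # '#' may absorb any number of the remaining words, including none.
            return any(match(pi + 1, k) for k in range(wi, len(words) + 1))
        if wi == len(words):
            return False
        if pattern[pi] == "*" or pattern[pi] == words[wi]:
            return match(pi + 1, wi + 1)
        return False

    return match(0, 0)

def route(exchange_type: str, bindings: List[Tuple[str, str]], routing_key: str) -> List[str]:
    """Return the queues a message would reach; bindings are (binding_key, queue) pairs."""
    if exchange_type == "fanout":
        return [queue for _, queue in bindings]
    if exchange_type == "direct":
        return [queue for key, queue in bindings if key == routing_key]
    if exchange_type == "topic":
        return [queue for key, queue in bindings if topic_matches(key, routing_key)]
    raise ValueError(f"unsupported exchange type: {exchange_type}")

bindings = [("orders.*", "billing"), ("orders.#", "audit"), ("payments.created", "ledger")]
print(route("topic", bindings, "orders.eu.created"))  # ['audit']
print(route("direct", bindings, "orders.created"))    # []
```

An empty result means no queue would receive the message, which is the kind of silent routing misconfiguration that queue-level monitoring does not show.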
Dead-Letter Queue Saturation Returns pile overflowing
What it is: A Dead-Letter Queue (DLQ) receives messages that cannot be processed — due to repeated failures, TTL expiry, or explicit rejection. When the root cause isn't fixed, the DLQ grows without bound.
Signals:
- DLQ depth metric climbing continuously
- Consumer error rate elevated — NACKs or exceptions in logs
- Upstream queue may appear healthy but messages are being lost silently to the DLQ
- Memory pressure on the broker if DLQ is unbounded and large
Common causes: Application bug in consumer processing logic · schema mismatch (consumer can't parse message format) · downstream dependency the consumer calls is unavailable · message TTL set too low
IC actions:
- Check DLQ depth and rate of growth — is it accelerating?
- Read a sample message from the DLQ and inspect its content
- Check consumer logs for the error being thrown on each failure
- Fix the root cause first — clearing the DLQ without fixing the cause just refills it
- Once fixed, replay DLQ messages in a controlled way (don't flood the queue)
Pattern: DLQ growing + consumer errors → processing bug or schema mismatch. DLQ growing + consumer healthy → TTL expiry or routing misconfiguration. DLQ suddenly growing + recent deploy → code change broke the consumer.
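"Is it accelerating?" can be checked from a handful of depth readings. The sketch below computes the growth rate between consecutive samples and compares the newest rate with the oldest. The sample format (timestamp in seconds, queue depth) and function names are assumptions; real readings would come from your broker's metrics.

```python
from typing import List, Tuple

def growth_rates(samples: List[Tuple[float, int]]) -> List[float]:
    """Messages per second between consecutive (timestamp_seconds, depth) samples."""
    return [(d2 - d1) / (t2 - t1) for (t1, d1), (t2, d2) in zip(samples, samples[1:])]

def is_accelerating(samples: List[Tuple[float, int]]) -> bool:
    """True when the most recent growth rate exceeds the earliest one."""
    rates = growth_rates(samples)
    return len(rates) >= 2 and rates[-1] > rates[0]

# Hypothetical DLQ depth readings, one minute apart.
readings = [(0, 100), (60, 160), (120, 400)]
print(growth_rates(readings))     # [1.0, 4.0]
print(is_accelerating(readings))  # True
```

Steady growth suggests a constant failure rate. Accelerating growth suggests the cause is spreading, for example more consumers picking up a bad deploy.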
Consumer Connection Storm Revolving door jammed open
What it is: A large number of consumers repeatedly disconnect and reconnect in rapid succession, overwhelming the broker with connection state management. The broker spends more time handling connect/disconnect churn than delivering messages.
Signals:
- Broker connection count spiking and thrashing (rapid up-down pattern)
- High CPU on the broker despite low message throughput
- Consumer application logs showing repeated connection errors and retries
- Queue processing stalled even though consumers appear to be running
Common causes: Consumer crash loop (pod restarting repeatedly) · incorrect prefetch setting (consumer takes too many messages, times out, gets disconnected) · aggressive health-check misconfiguration forcing disconnections · network instability between consumer hosts and broker
IC actions:
- Check broker connection count over time — is there a churn pattern?
- Identify which consumer group or host is responsible for the churn
- Check for crash loops: kubectl get pods restart counts, or process monitor
- Check prefetch setting: a value too high causes slow ACKs, triggering disconnects
- Isolate and restart affected consumer group; monitor stabilisation
Pattern: Connection churn + consumer crash loop → fix the crash cause (bad code, OOM, bad config). Connection churn + consumer healthy → prefetch misconfiguration or network instability. Broker CPU high with low message rate → connection management overhead, not processing load.
OSI Model 7-floor building
Core understanding: The OSI model gives you a shared language to pinpoint where a problem lives. Different layers are owned by different teams — knowing the layer tells you who to call.
Analogy: A 7-floor building. A fire on floor 3 is a different team's problem than a broken window on floor 7. You need to know which floor is burning before you radio anyone.
IC use: "Which layer is failing?" is the first isolation question. Failing before connection (L1–L4) is a network/infra problem. Failing after connection (L5–L7) is an app or security problem. Different layers mean different on-call groups.
Example — browser connects to company login page:
- L7: Browser sends HTTPS GET. WAF inspects the request. App processes it.
- L6: TLS encrypts/decrypts the payload between browser and server.
- L5: Session is established and maintained between client and server.
- L4: TCP connection on port 443. Firewall checks source IP and port.
- L3: IP routing selects the path to the destination IP across the internet.
- L2: Ethernet frames hop between switches. MACs used within each segment.
- L1: Electrical or optical signal travels down the cable or Wi-Fi.
Key distinction — Hub vs Switch: A Hub (L1) blindly repeats signals to all ports — it doesn't understand addresses. A Switch (L2) reads MAC addresses and forwards frames only to the correct port. If a switch fails, specific segments lose connectivity. If a hub fails, everything on that segment drops.
IC question: "Does the problem affect all hosts or just hosts in a specific segment?" — L1 vs L2 distinction. "Is routing broken?" — L3. "Is a port blocked?" — L4.
WAF vs Firewall Customs vs border fence
Core understanding: Both are security controls that block traffic — but they operate at entirely different layers, filter different things, and are owned by different teams. Knowing which one is blocking traffic determines who you call.
Key distinction: A Firewall says "I don't care what's in the parcel — I only care where it came from and which door it's heading to." A WAF opens the parcel and reads it — if it contains malicious content, it blocks the specific request, not the sender's entire address.
IC triage:
- Whole IP/CIDR unreachable? → Check firewall rules (network team)
- Specific HTTP requests returning 403, others fine? → Check WAF rules (security team)
- All traffic through a port suddenly blocked? → Firewall rule change (network team)
- New deploy causing request failures with no code error? → WAF may be matching a new payload pattern (security team)
- Legitimate user traffic blocked after load spike? → WAF rate-limiting rule triggered (security team)
Common IC mistake: Assuming a 403 error is an application permission problem. It may be a WAF block — the app never even received the request. Check WAF logs before escalating to the app team.
Pattern: All requests blocked to an IP range → firewall. Only specific URL paths or payload patterns blocked → WAF. Sudden 403 spike after a deployment → WAF rule matched something in the new request format.
Why WAF comes before the firewall in modern cloud
The OSI comparison might suggest firewall (L4) sits in front of WAF (L7) because lower layers precede higher ones. In practice the order is the opposite — and for good reason.
- WAF lives at the edge — it is typically part of the CDN or reverse proxy layer, closest to the internet. Application attacks (SQL injection, XSS, credential stuffing) are blocked there, before traffic ever enters the cloud network.
- Early blocking saves compute — stopping a malicious request at the edge means the load balancer, firewall, and app tier never see it. Fewer resources consumed, lower blast radius.
- Firewall/NSGs protect internal resources — once traffic passes the WAF and load balancer it enters a VCN (virtual cloud network). Firewalls and security groups here enforce zone-to-zone rules: which tier can talk to which, on which ports. They are not designed to inspect HTTP payloads.
- Cloud providers separate edge security from network security — WAF/CDN is one product (e.g. OCI WAF, AWS WAF, Azure Front Door), firewalls/NSGs are another (e.g. OCI Security Lists, AWS Security Groups, Azure NSG). Different teams own each, different change-management processes apply.
What actually happens in modern cloud (OCI / AWS / Azure style): Internet → WAF (CDN / edge) → Load Balancer → Firewall / NSGs → App Tier
IC implication of this ordering: When a user reports they can't reach a service, the triage path follows this stack top-down. A block at the WAF produces a 403 and never reaches the load balancer. A firewall/NSG block causes a TCP timeout — no HTTP response at all. An app error produces a 5xx after a full connection is established. The failure signature tells you which layer to investigate first.
Why this matters for escalation: The WAF is owned by a different team than the NSGs, which are owned by a different team than the app. Calling the wrong team wastes critical incident minutes. Match the symptom to the layer, then call the right team once.
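The failure signatures described above can be written down as a small lookup. This is a sketch of the triage rule only: the function name is hypothetical, the team names follow this guide's ownership model, and the result is a first guess to confirm in logs rather than a diagnosis.

```python
from typing import Optional, Tuple

def first_layer_to_check(tcp_connected: bool, http_status: Optional[int]) -> Tuple[str, str]:
    """Map a failure signature to (layer, team to call first)."""
    if not tcp_connected:
        # TCP timeout with no HTTP response at all.
        return ("Firewall / NSG", "Network team")
    if http_status == 403:
        # Request blocked at the edge before reaching the load balancer.
        return ("WAF", "Security team")
    if http_status is not None and 500 <= http_status <= 599:
        # Full connection established, then the app returned an error.
        return ("App tier", "App team")
    return ("Unclassified", "Keep isolating")

print(first_layer_to_check(False, None))  # ('Firewall / NSG', 'Network team')
print(first_layer_to_check(True, 403))    # ('WAF', 'Security team')
print(first_layer_to_check(True, 503))    # ('App tier', 'App team')
```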
Physical Infrastructure Hardware Fundamentals
Every server, packet, and connection ultimately runs on physical hardware. When a networking problem can't be explained by software, config, or DNS, the answer may be at the physical layer — and physical failures are typically total, sudden, and clean-cut in monitoring.
Physical Server
A computer in a data centre. It has CPU, RAM, storage (disk/SSD), and one or more NICs. Physical problems — hardware failure, power loss, overheating — cause total server failure with no useful application-level error messages.
NIC — Network Interface Card
The hardware component connecting a server to the network. Operates at L1 (Physical) and L2 (Data Link) — handles electrical signals, MAC addresses, and frame transmission. A failed or misconfigured NIC means 100% packet loss for that server. NICs come in 1G, 10G, 25G, and 100G speeds; a speed mismatch with the switch port causes connectivity or performance problems.
Switch (Top-of-Rack / TOR)
Connects multiple servers in the same network segment. Operates at L2 — reads MAC addresses and forwards frames to the correct port. One TOR switch typically serves an entire rack. A switch failure takes down all servers in that rack simultaneously.
Fiber Optic Cable
Carries data as pulses of light. Used within data centres and between DCs. Much faster and longer-range than copper.
- Multi-mode: Shorter distances (within a DC, up to ~300m). Wider core, multiple light paths.
- Single-mode: Long distances (DC-to-DC, km scale). Narrower core, one light path. Used for backbone links.
A dirty fiber connector or bad end-face causes intermittent packet loss and CRC errors — frustrating to diagnose remotely because the link stays up but degrades unpredictably.
SFP — Small Form-factor Pluggable
A transceiver module plugged into a NIC or switch port to convert electrical signals to light for fiber connections. A failed SFP causes complete link loss on that port — from software, it looks exactly like the cable is unplugged.
IC Relevance — Scoping a Physical Fault
- One server unreachable: NIC, its patch cable, the SFP, or the switch port it connects to
- Whole rack unreachable: TOR switch failure or its uplink fiber
- Multiple racks / a zone: Aggregation switch or inter-DC uplink fiber
- Intermittent drops + CRC errors: Dirty fiber connector, failing SFP, or marginal cable — the link is up but unreliable
Key question for the DC team: "Has anyone done any cabling work, port moves, or hardware changes in that rack recently?"
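The scoping rules above amount to comparing the set of unreachable hosts with a host-to-rack inventory. The sketch below assumes such an inventory exists as a simple mapping. The host names, rack names, and function name are hypothetical, and the "partial rack" branch is an added assumption for the case the rules do not cover.

```python
from typing import Dict, Set

def fault_scope(rack_of: Dict[str, str], unreachable: Set[str]) -> str:
    """Suggest where to look based on which hosts are unreachable."""
    if not unreachable:
        return "no fault"
    if len(unreachable) == 1:
        return "one server: NIC, patch cable, SFP, or switch port"
    racks = {rack_of[host] for host in unreachable}
    if len(racks) > 1:
        return "multiple racks: aggregation switch or inter-DC uplink fiber"
    rack = next(iter(racks))
    hosts_in_rack = {host for host, r in rack_of.items() if r == rack}
    if unreachable == hosts_in_rack:
        return "whole rack: TOR switch or its uplink fiber"
    return "partial rack: check individual NICs, cables, and switch ports"

inventory = {"web1": "r1", "web2": "r1", "web3": "r1", "db1": "r2", "db2": "r2"}
print(fault_scope(inventory, {"web1", "web2", "web3"}))  # whole rack: TOR switch or its uplink fiber
```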
Proxy vs Reverse Proxy Forward vs Reverse
A proxy is a server that sits between two parties in a network connection — either on behalf of the client (forward proxy) or on behalf of the server (reverse proxy). The direction determines what it protects and what it hides.
Forward Proxy — represents the client
A forward proxy sits in front of the client. Client traffic passes through it on the way out to the internet.
- What it hides: the client's identity from the destination server
- Use cases: corporate content filtering, outbound traffic control, caching for groups of users, anonymity
- IC scenario: all users in an office can't reach external sites → suspect forward proxy misconfiguration or outage. Check proxy logs. The app isn't the problem — the outbound path is.
- Examples: Squid, corporate web proxy, VPN exit node
Reverse Proxy — represents the server
A reverse proxy sits in front of the server. External traffic reaches the reverse proxy first, which then routes it to the right backend.
- What it hides: the backend server's identity and internal topology from the client
- Use cases: TLS termination, load balancing across app servers, rate limiting, caching static content, WAF integration
- IC scenarios:
- 502 Bad Gateway: reverse proxy can't reach the upstream app (app crashed or connection refused)
- 504 Gateway Timeout: upstream app is alive but not responding fast enough
- 499 (Nginx-specific, non-standard): client closed the connection before the reverse proxy responded
- Examples: Nginx (see Cloud Infra tab), HAProxy, Caddy, AWS ALB, Cloudflare
The one-line difference: A forward proxy knows who you are and fetches the internet for you. A reverse proxy knows the internet is calling and routes it to the right server for you.
| A | Hostname → IPv4 address |
| AAAA | Hostname → IPv6 address |
| CNAME | Alias → another hostname (chain) |
| MX | Mail routing for domain |
| TXT | Verification, SPF, DKIM records |
| NS | Which nameserver is authoritative |
TTL (Time To Live) controls how long DNS answers are cached. After a change, old answers persist across the internet until every cache expires.
| Low TTL (60s) | Changes propagate fast |
| High TTL (3600s) | Changes take up to 1 hour to spread |
| "Works for me" | Your cache has new record; others still have old |
Server can't accept new connections. Cause: traffic spike or SYN flood attack. Connections fail before the app is even involved.
Lost packets trigger retransmit. Under load, retransmits add more traffic → more loss → feedback loop. Progressive slowdown that worsens without intervention.
Cars re-entering a traffic jam
- Failing before or after connection established?
- SYN queue depth — is it filling?
- Retransmit rate increasing?
- Is packet loss present on the link?
- Traffic spike or sustained high load?
- Is this an attack (SYN flood)?
- Layer first — DNS (name resolution) or TCP (connection) or app?
- Who sees it? — all users or subset? Points to DNS propagation
- What changed? — DNS record, IP, certificate, firewall rule?
- Failing before or after handshake? — pre-handshake = network; post = app
- TCP or UDP? — determines whether retransmit or silent drop
Producers publish to an exchange with a routing key. The exchange routes to queues based on its type. Consumers receive messages from queues and ACK each one to remove it.
| Direct | Exact routing key match |
| Fanout | Broadcast to all bound queues |
| Topic | Wildcard pattern match on key |
Failed, rejected, or TTL-expired messages are routed to the DLQ. A growing DLQ means the consumer is failing to process messages — without fixing the root cause, clearing the DLQ just refills it.
| DLQ growing fast | Consumer bug or schema mismatch |
| Recent deploy + DLQ spike | Code change broke the consumer |
| DLQ growing, consumer OK | TTL too low or routing error |
Consumers rapidly disconnect and reconnect, overwhelming the broker with state management. Broker CPU spikes with low message throughput — it's handling churn, not messages.
Prefetch controls how many unACKed messages a consumer holds at once. Too high → slow ACK → broker disconnects the consumer. Too low → consumer starved, slow throughput.
- Queue depth growing? → consumer keeping up?
- DLQ filling? → consumer errors, check logs
- Broker CPU high, low throughput? → connection churn
- Messages missing? → routing / exchange config
- Recent deploy? → schema or code change
Each floor handles a different job. A fire on floor 3 (Network) doesn't mean the top floors (App) are broken — but they can't work if floors below are burning.
Hub (L1): Repeats signal to all ports — no address awareness. Everything on the segment goes down together.
Switch (L2): Reads MAC addresses, forwards only to correct port. One port failure isolates one host.
- Pre-connection failure → L1–L4 (network/infra)
- Post-connection failure → L5–L7 (app/security)
- All hosts in range → L3 routing or L4 firewall
- Specific requests 403'd → L7 WAF
- TLS errors → L6 cert issue
| Control | Firewall |
| Filters by | IP address · port · protocol |
| Blocks | IP ranges · CIDR rules · ports |
| Sits at | Inside VCN — zone-to-zone rules |
| IC signal | TCP timeout — no HTTP response at all |
| Owned by | Network team |
| Control | WAF |
| Filters by | HTTP headers · URL · request body |
| Blocks | SQL injection · XSS · bad payloads |
| Sits at | Edge — CDN / reverse proxy (before LB) |
| IC signal | HTTP 403 — specific requests blocked |
| Owned by | Security / App team |
| Internet | Untrusted — all traffic starts here |
| ↓ WAF (CDN / edge) | Blocks app attacks early · HTTP 403 on match |
| ↓ Load Balancer | Distributes · TLS termination |
| ↓ Firewall / NSGs | Zone rules by IP/port · TCP drop on block |
| ↓ App Tier | App logic — only reached after all layers pass |
Common IC mistake: assuming a 403 is an app permission error. If the app logs show nothing, the request never reached the app because the WAF blocked it at the edge. Check WAF logs before escalating to the app team.
| HTTP 403, specific paths | WAF |
| TCP timeout, no response | Firewall / NSG |
| HTTP 5xx after connect | App tier |
| Connection refused | Nothing listening on the port, or a reject rule |
- HTTP 403 → Security team (WAF)
- TCP timeout → Network team (NSG/FW)
- 5xx after connect → App team
- Nothing logged anywhere → start at edge (WAF)
- 1 server unreachable: NIC, patch cable, SFP, or switch port
- Whole rack down: TOR (top-of-rack) switch or its uplink
- Multiple racks / zone: Aggregation switch or inter-DC fiber
- Intermittent drops + CRC errors: Dirty SFP, bad fiber connector
- "Has anyone done cabling work or hardware changes in that rack?"
- "Is it exactly one rack, or partial?" (scope the switch)
- "Are there CRC errors on the NIC?" (physical layer signal)
- "Can you try re-seating the SFP?" (quick physical fix)
- Sits in front of the client — traffic goes Client → Proxy → Internet
- Hides the client's identity from the destination
- Used for outbound content filtering, corporate traffic control, anonymity
- IC signal: all users behind a network can't reach external sites → check forward proxy health and config
- Examples: Squid, corporate web proxy
- Sits in front of the server — traffic goes Internet → Proxy → App
- Hides backend topology; handles TLS, load balancing, rate limiting
- 502 = upstream app is down · 504 = upstream too slow · 499 = client gave up
- IC signal: Nginx 502/504 → the problem is behind Nginx, not Nginx itself
- Examples: Nginx (Cloud Infra tab), HAProxy, AWS ALB
IDCS Global Authentication Failure Highway entrance closed
Core understanding: IDCS is a centralised cloud identity provider. It acts as the first gate users must pass through before reaching any system. If it becomes unavailable, users cannot authenticate anywhere — even though the underlying apps may still be healthy.
What it is: A shared login authority used across multiple systems.
What it does: Authenticates users and issues access tokens.
Problem in incident: IDCS outage or service disruption.
Effect (what you see):
- All apps inaccessible after login attempt
- 401/403 spike across every service simultaneously
Technical effect: No tokens issued — authentication cannot begin.
IC interpretation: Central dependency failure — the authentication hub is down.
Analogy: Highway entrance closed — all routes blocked even though the roads beyond are clear.
Incident signals: Login failures across all apps at once · drop in successful auth metrics.
IC questions: "Are all apps affected?" / "Is IDCS reachable?" / "When did auth success rate drop?"
Pattern recognition: All apps fail login simultaneously → suspect IDCS.
Token Expiry / Validation Issues Expired train ticket during journey
Core understanding: After login, users don't continuously re-authenticate — they use tokens as proof of identity. These tokens have rules like expiration time and validation checks. If those rules are misconfigured or systems disagree on time, valid users can suddenly appear invalid.
What it does: Maintains authenticated sessions across systems.
Problem in incident: Expired or misvalidated tokens.
Effect (what you see):
- Random mid-session logouts
- Intermittent 401 errors for users already logged in
Technical effect: Token rejected by applications.
IC interpretation: Misconfiguration or time sync issue — not an outage.
Analogy: Expired train ticket during the journey — you bought it, you're on the train, but the gate says it's invalid.
Incident signals: Token validation errors in logs · session drops without user action.
IC questions: "Are tokens expiring earlier than expected?" / "Is system time consistent across services?"
Pattern recognition: Random auth failures for already-logged-in users → token issue.
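The interaction between token lifetime and clock disagreement can be shown with a few integers. The sketch below assumes JWT-style `nbf` (not before) and `exp` (expiry) claims expressed in Unix seconds, which this guide does not specify. The function name, the leeway parameter, and the numbers are illustrative.

```python
def token_is_valid(nbf: int, exp: int, now: int, leeway: int = 0) -> bool:
    """Accept the token if 'now' falls inside [nbf, exp), widened by 'leeway' seconds."""
    return (nbf - leeway) <= now < (exp + leeway)

# Hypothetical token valid from t=1000 to t=1300 (a 5-minute lifetime).
real_time = 1100

# A service with a correct clock accepts the token.
print(token_is_valid(1000, 1300, now=real_time))  # True

# A service whose clock runs 400s fast rejects the same token mid-session,
# even with 60s of leeway configured.
print(token_is_valid(1000, 1300, now=real_time + 400, leeway=60))  # False
```

This is why "Is system time consistent across services?" belongs next to the expiry question: the token is unchanged, only the validator's idea of "now" differs.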
Federation / SSO Misconfiguration Two border checkpoints refusing each other
Core understanding: Federation allows one identity system to trust another (e.g., corporate login into cloud apps). This relies on precise configuration and certificates. If that trust breaks, users get stuck in login flows or cannot authenticate at all.
What it does: Enables login via external identity providers.
Problem in incident: Broken trust configuration or certificate mismatch.
Effect (what you see):
- Redirect loops — browser bounces between app and login page
- Login fails after being redirected to SSO
Technical effect: Authentication handshake fails between identity providers.
IC interpretation: Integration misconfiguration — the two systems no longer agree on trust.
Analogy: Two border checkpoints refusing to accept each other's stamps.
Incident signals: Repeated redirect errors · SSO-specific error codes · only SSO users affected.
IC questions: "Are only SSO users affected (local accounts still work)?" / "Any cert or config changes recently?"
Pattern recognition: Redirect loop → SSO / federation issue.
LDAP Latency (IDM) Traffic jam at ID checkpoint
Core understanding: LDAP is the directory service that stores user identities in IDM environments. During login, systems query LDAP to verify users. If LDAP is slow, every authentication request slows down — even if nothing is technically broken.
What it does: Provides user data for authentication queries.
Problem in incident: Slow directory responses.
Effect (what you see):
- Login takes much longer than normal (15–20s instead of 1–2s)
- Occasional timeouts for some users
Technical effect: Queued or delayed auth requests — high LDAP response times.
IC interpretation: Performance bottleneck — slowness, not failure.
Analogy: Traffic jam at the ID checkpoint — everyone gets through eventually, but very slowly.
Incident signals: High auth latency · complaints about slow login, not login failure.
IC questions: "Is login slow or actually failing?" / "What are LDAP query response times?" / "Any load increase recently?"
Pattern recognition: Login eventually works but is very slow → LDAP latency.
User Provisioning / Sync Issues Different checkpoints, different passenger lists
Core understanding: Users and permissions are synchronised across systems. If this process fails, different systems may have different views of who a user is or what they can access — creating inconsistent, hard-to-diagnose failures.
What it does: Keeps user identities and roles consistent across all systems.
Problem in incident: Sync delays or failures.
Effect (what you see):
- Some users fail while others succeed
- Permissions missing or incorrect for affected users
Technical effect: Data inconsistency across systems.
IC interpretation: State mismatch — not an outage, but a divergence between systems.
Analogy: Different checkpoints using different passenger lists.
Incident signals: Only specific users or groups affected · new users, recently changed roles, or recently onboarded teams impacted.
IC questions: "Who exactly is affected?" / "Any recent provisioning changes or new user onboarding?"
Pattern recognition: Partial user failures (not everyone) → sync or provisioning issue.
MFA Failure Second checkpoint blocked
Core understanding: MFA adds a second verification step after password authentication. This step often depends on external systems (SMS providers, authenticator apps). If it fails, users pass the password check but cannot complete login.
What it does: Provides additional identity verification beyond password.
Problem in incident: MFA system or provider failure.
Effect (what you see):
- Users stuck after entering their password
- MFA prompts that never arrive or fail to validate
Technical effect: Second authentication step cannot complete.
IC interpretation: Partial authentication failure — first step worked, second step blocked.
Analogy: Getting through the first checkpoint but being blocked at the second.
Incident signals: MFA error messages in logs · push notifications or SMS not arriving.
IC questions: "Where exactly does login stop — before or after MFA prompt?" / "Is this an external MFA provider?"
Pattern recognition: Login stalls after password entry → MFA failure.
OAuth / OIDC Misconfiguration Wrong key for one door
Core understanding: Applications must be correctly configured to trust IDCS tokens. This includes client IDs, secrets, and redirect URLs. A small mismatch can break authentication for a single app while others work fine.
What it does: Connects individual applications to the identity provider.
Problem in incident: Incorrect client configuration in one app.
Effect (what you see):
- One specific app fails login
- All other apps still work fine
Technical effect: Token rejected by the misconfigured application.
IC interpretation: App-specific misconfiguration — scope is narrow, not a platform issue.
Analogy: Wrong key for one door — master key still works on all others.
Incident signals: Single app impacted · OAuth error codes (invalid_client, redirect_uri_mismatch).
IC questions: "Is this only one app or multiple?" / "Any config deployment to this app recently?"
Pattern recognition: One app broken while others work → OAuth / OIDC misconfiguration.
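A quick way to confirm an app-specific misconfiguration is to compare the client settings registered with the identity provider against what the app has actually deployed. The sketch below is illustrative only: the key names (client_id, client_secret, redirect_uri) and the example values are assumptions, not a real IDCS export format. It shows how a single trailing slash is enough to count as a mismatch.

```python
from typing import Dict, List

# Keys whose values should never be printed in an incident channel (assumed name).
SENSITIVE_KEYS = {"client_secret"}


def config_mismatches(expected: Dict[str, str], deployed: Dict[str, str]) -> List[str]:
    """Return the sorted names of settings that differ between the two configs."""
    keys = expected.keys() | deployed.keys()
    return sorted(k for k in keys if expected.get(k) != deployed.get(k))


def describe(expected: Dict[str, str], deployed: Dict[str, str]) -> List[str]:
    """Build human-readable lines for each mismatch, redacting sensitive values."""
    lines = []
    for key in config_mismatches(expected, deployed):
        if key in SENSITIVE_KEYS:
            lines.append(f"{key}: values differ (redacted)")
        else:
            lines.append(
                f"{key}: expected {expected.get(key)!r}, deployed {deployed.get(key)!r}"
            )
    return lines


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    registered = {
        "client_id": "app-123",
        "client_secret": "s1",
        "redirect_uri": "https://app.example.com/callback",
    }
    in_app = {
        "client_id": "app-123",
        "client_secret": "s2",
        "redirect_uri": "https://app.example.com/callback/",
    }
    for line in describe(registered, in_app):
        print(line)
```

The comparison is deliberately exact: redirect URLs that look the same to a person can still differ by one character.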
Certificate Expiry Expired passport
Core understanding: Certificates establish trust between systems in authentication flows. They have expiration dates. When they expire, systems stop trusting each other — causing sudden, complete failures with no degraded middle period.
What it does: Secures and validates identity communication between systems.
Problem in incident: Expired certificate.
Effect (what you see):
- Sudden, complete login failure — was working, now completely broken
- SSO stops working
Technical effect: Trust validation fails — systems refuse to communicate.
IC interpretation: Preventable config failure — a known expiry date was missed.
Analogy: Expired passport — valid until midnight on the expiry date, then refused everywhere instantly.
Incident signals: Certificate error messages in logs · sudden complete outage with no deployment.
IC questions: "Did any certificate expire recently?" / "Was there a cert change or renewal attempt?"
Pattern recognition: Sudden auth break with no deployment → check certificate expiry first.
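Because the expiry date is known in advance, the check can be automated. The following is a minimal sketch, assuming you already have the certificate's notAfter value as text in the format Python's ssl module reports (for example "Jan  5 09:34:43 2018 GMT"). The 30-day warning threshold and the function names are assumptions for illustration.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before the certificate's notAfter time (negative if expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def expiry_status(not_after: str, warn_days: int = 30, now: Optional[float] = None) -> str:
    """Classify a certificate as EXPIRED, EXPIRING_SOON, or OK."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "EXPIRED"
    if remaining <= warn_days:
        return "EXPIRING_SOON"
    return "OK"


if __name__ == "__main__":
    # Example notAfter string; in practice this comes from the certificate itself.
    print(expiry_status("Jan  5 09:34:43 2018 GMT"))
```

Passing `now` explicitly keeps the function easy to test and lets you ask "would this have been expired at 09:12?" during timeline reconstruction.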
Rate Limiting / Throttling Road closed due to too much traffic
Core understanding: Identity systems protect themselves by limiting how many requests they accept per time window. During traffic spikes, legitimate users can be blocked if limits are hit — even when the identity system itself is completely healthy.
What it does: Prevents overload or abuse by capping request rates.
Problem in incident: Too many requests trigger the limit.
Effect (what you see):
- Login failures during peak usage times
- 429 (Too Many Requests) responses
Technical effect: Requests rejected or delayed by the rate limiter.
IC interpretation: Capacity or protection issue — the limit may be correct or may need tuning.
Analogy: Road closed due to too much traffic — the road is fine, volume exceeded what's allowed.
Incident signals: Traffic spike correlates exactly with login failure onset · 429 errors in logs.
IC questions: "Is there a traffic spike right now?" / "Are 429 errors visible?" / "What are the configured rate limit thresholds?"
Pattern recognition: Peak usage + login failures + 429 errors → throttling.
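To make the "requests per time window" idea concrete, here is a minimal fixed-window limiter sketch. It is not how any particular identity product implements throttling; the class name, the per-client keying, and the limit values are assumptions. It shows why a healthy system still returns 429 once the window's budget is spent, and why requests succeed again when the next window starts.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each fixed time window."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (client id, window index) -> requests seen in that window.
        # Old windows are never evicted here; a real limiter would expire them.
        self._counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def check(self, client_id: str, now: float) -> int:
        """Return the HTTP status the limiter would produce: 200 or 429."""
        window = int(now // self.window_seconds)
        key = (client_id, window)
        if self._counts[key] >= self.limit:
            return 429  # Too Many Requests: budget for this window is used up
        self._counts[key] += 1
        return 200


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit=3, window_seconds=60)
    print([limiter.check("client-a", now=0) for _ in range(4)])
    print(limiter.check("client-a", now=60))  # next window, budget reset
```

During an incident the useful question is whether the configured limit matches real peak traffic, which is exactly the `limit` and `window_seconds` pair above.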
Identity Dependency Failure Checkpoint staff can't access records
Core understanding: Identity systems rely on underlying services like databases, network, and storage. If those fail, identity services degrade or stop working — even if the identity system's own processes are healthy.
What it does: Depends on backend infrastructure to function.
Problem in incident: Database, network, or storage failure beneath IDCS.
Effect (what you see):
- Slow or failed login
- Auth errors combined with infrastructure alerts
Technical effect: Backend dependency unavailable — IDCS cannot complete auth lookups.
IC interpretation: Downstream dependency issue — the visible failure is auth, but the root cause is infrastructure.
Analogy: Checkpoint staff can't access the records database — they're present but unable to do their job.
Incident signals: Infra alerts fire alongside auth failures · auth latency spike coincides with DB / network alerts.
IC questions: "Are there DB or network alerts at the same time?" / "Is this auth-only or a wider infrastructure issue?"
Pattern recognition: Auth failures + infra alerts simultaneously → dependency failure.
Oracle RAC — Real Application Clusters Multiple highways, one shared tunnel
Core understanding: Oracle RAC is multiple servers running the same database at the same time, all connected to shared storage. It exists to improve availability and handle more load — but coordination between nodes introduces complexity and specific failure points.
What it does: Allows multiple servers to access the same database simultaneously, share workload across nodes, and continue operating if one server fails.
Problem in incident: Things go wrong when nodes stop syncing properly, one node becomes slow or fails, or the shared storage or interconnect network becomes a bottleneck.
Effect (what you see):
- Intermittent slowness — not a full outage
- Some requests fast, others very slow or timing out
- Random errors under load
- Latency spikes, especially during high traffic
Technical effect: Nodes are competing over shared data access. Delays in synchronisation between nodes. Traffic imbalance (some nodes overloaded). Possible node eviction from the cluster.
IC interpretation: Usually a contention problem (nodes competing), a coordination failure (cluster not in sync), or an infrastructure bottleneck (network or storage). Rarely a simple "server down" — more often partial degradation, not total failure.
Analogy: Multiple highways merging into one shared tunnel. Highways = servers, tunnel = shared database storage, traffic = queries. Too many cars → congestion. Poor coordination at the merge → traffic jams. One highway blocked → the others become overloaded.
Incident signals:
- "High DB latency" or "cluster node evicted"
- "Global cache wait" events in Oracle monitoring
- Connection timeouts under load
- Uneven CPU across nodes
- Spike in lock or enqueue waits
IC questions: "Is this affecting all users or intermittent?" / "Are all nodes healthy or is one degraded?" / "Is load evenly distributed?" / "Any recent scaling or config changes?" / "Is storage or the interconnect showing latency?"
Pattern recognition: Partial slowness (not full outage) + uneven CPU across nodes + intermittent timeouts → think RAC imbalance or coordination issue.
IDCS is the first gate — all apps depend on it. If IDCS is down, all apps are unreachable even if they're perfectly healthy.
Users authenticate successfully but get kicked out mid-session. Token has expired or systems disagree on expiry rules. Not an outage — a misconfiguration or time sync issue.
User passes password check but can't complete the second factor. Often an external MFA provider issue — not the identity system itself. Partial auth failure.
SSO relies on exact config and certificate trust between identity systems. Small mismatch = login loops or redirect failures. Only SSO users affected.
One app has wrong client ID, secret, or redirect URL. That app's auth breaks while all others work fine. App-specific, not platform-wide.
Certs have hard expiry dates. When they expire, systems instantly stop trusting each other — no degraded period. Complete, sudden failure. Entirely preventable.
| Issue | What it looks like |
| --- | --- |
| LDAP slow | Every auth request slows — not broken, just sluggish. Eventually works. |
| Provisioning lag | New user exists in one system, not another. Inconsistent access per system. |
| Sync failure | Different systems have different user states — specific users/groups only. |
| Rate limit | 429 errors during traffic spikes. Identity system is healthy — it's protecting itself. |
| Dependency failure | Identity DB or network fails. Auth service processes are up but can't function. Root cause is infra, not identity. |
Multiple servers run the same DB simultaneously using shared storage. Adds availability but adds coordination complexity.
- Intermittent failure — one node degraded, not all
- Load imbalance — sessions not evenly spread across nodes
- Interconnect slowness — block transfers between nodes cause latency
- Affecting all users or intermittent?
- Are all RAC nodes healthy?
- Is load evenly distributed across nodes?
- Any recent scaling or config changes?
- Is storage or the interconnect showing latency?
- All apps or one? — all = IDCS; one = app OAuth config
- All users or subset? — all = platform; subset = provisioning/sync
- Where does login stop? — password/MFA/redirect = different layer
- What changed? — cert, config, deploy, rotation
- Slow or failing? — slow = LDAP; failing = IDCS/cert/config
Framing the Incident (Impact First) Side street vs motorway
Core understanding: Framing means quickly defining what is broken and how bad it is. Without it, teams focus on the wrong things or move too slowly.
What it does: Aligns everyone on what matters most and how urgent the situation is.
Problem in incident: Engineers jump into debugging without confirming impact. Low-priority issues get the same attention as critical ones. No urgency → slow decisions.
Effect (what you see): People asking different questions, no shared sense of severity, delayed mitigation.
What it means (IC interpretation): This is a priority alignment problem. The system isn't just failing — the response is unfocused.
Analogy: An accident happens but no one knows if it's on a side street or a major motorway. If it's the motorway (checkout), you need immediate response and all resources focused.
Incident signals: "Is this actually impacting users?" / "How bad is this?" / "Are we sure this is critical?" / Multiple threads of investigation.
IC questions: "What is the user impact right now?" / "Which functionality is affected?" / "Is this revenue-critical (checkout/login)?" / "How many users are impacted?" / "When did this start?"
Then state clearly: "Checkout is failing → high priority → focus on mitigation."
Ownership Assignment Uncontrolled junction
Core understanding: Every critical task needs a clearly named person or team responsible. Without this, work is assumed, duplicated, or not done at all.
What it does: Ensures work happens without delay and everyone knows who is doing what.
Problem in incident: Tasks are suggested but not assigned. People assume "someone else is doing it." Gaps or duplication in work.
Effect (what you see): "I thought that was already happening." Silence after actions are suggested. Same task done twice or not at all.
What it means (IC interpretation): This is a responsibility gap. The system is slow because no one owns execution.
Analogy: Traffic lights exist but no one is assigned to operate them. Cars hesitate, collide, or stop moving entirely.
Incident signals: "Who is doing that?" / "Is that being worked on?" / Long pauses after instructions.
IC questions: "Who owns the app right now?" / "Who is handling DB investigation?" / "Who is managing infra/network?"
Then assign clearly: "App team → initiate rollback now. DBA → investigate queries. Network → prepare to drain nodes."
Timeline Tracking Sequence before the crash
Core understanding: Timeline tracking means keeping a clear sequence of events during the incident. This helps connect cause and effect quickly.
What it does: Identifies what changed before the failure. Prevents confusion during the incident.
Problem in incident: Events get mixed up. Teams argue about what happened first. Root cause becomes harder to identify.
Effect (what you see): "Wait, did that happen before or after the deploy?" Repeated questions. Confusion about sequence.
Technical effect: Slower diagnosis. Missed correlations (e.g., deploy → failure).
What it means (IC interpretation): This is a visibility problem over time. You can't solve what you can't sequence.
Analogy: Trying to understand a crash without knowing which car entered the junction first or when the collision happened.
Incident signals: Confusion about timing / "When did that happen?" repeated / Misaligned understanding across teams.
IC questions: "When did alerts start?" / "When was the last deploy?" / "When did user impact begin?"
Then state: "09:05 deploy → 09:12 alerts → likely related."
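The same reasoning can be sketched in code: sort the recorded events, find the first alert, then look for the most recent change shortly before it. This is an illustration under stated assumptions: the Event structure, the "change" and "alert" labels, and the 30-minute look-back window are all hypothetical, and a nearby change is a lead to investigate, not proof of cause.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    at: datetime
    kind: str  # "change" (deploy, config push, ...) or "alert"
    note: str


def last_change_before_first_alert(
    events: Iterable[Event], window: timedelta = timedelta(minutes=30)
) -> Optional[Event]:
    """Return the most recent change within `window` before the first alert, if any."""
    ordered = sorted(events, key=lambda e: e.at)
    first_alert = next((e for e in ordered if e.kind == "alert"), None)
    if first_alert is None:
        return None
    candidates = [
        e
        for e in ordered
        if e.kind == "change" and timedelta(0) <= first_alert.at - e.at <= window
    ]
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    # The date is arbitrary; the times mirror the example in the text.
    timeline = [
        Event(datetime(2024, 1, 1, 9, 12), "alert", "error-rate alert"),
        Event(datetime(2024, 1, 1, 9, 5), "change", "deploy"),
    ]
    suspect = last_change_before_first_alert(timeline)
    print(suspect.note if suspect else "no recent change found")
```

Events are sorted inside the function, so they can be logged in whatever order people report them.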
Parallel Work (Avoid Serial Investigation) Multi-lane road
Core understanding: Parallel work means multiple teams investigate different areas at the same time. Serial work (one after another) slows everything down.
What it does: Speeds up diagnosis and mitigation simultaneously.
Problem in incident: Teams wait for each other. Only one path investigated at a time. Bottlenecks form.
Effect (what you see): "Let's wait for DB before doing anything." Idle teams. Slow progress.
What it means (IC interpretation): This is a throughput problem. Not enough work happening simultaneously.
Analogy: Only opening one lane when multiple lanes are available — traffic builds up unnecessarily.
Incident signals: Teams waiting / Sequential updates / Slow momentum.
IC questions: "What can each team investigate right now?" / "Are we blocked or just waiting?" / "Can we run these in parallel?"
Then assign: App → deploy/rollback. DBA → queries. Network → traffic. All simultaneously.
Decisive Action (Mitigation First) Clear the road before the inquest
Core understanding: Incident command requires making fast, reasonable decisions to reduce impact — even without full information.
What it does: Stops user impact quickly. Buys time for deeper investigation.
Problem in incident: Over-analysis. Fear of making the wrong decision. Delayed action.
Effect (what you see): Endless discussion. No clear plan. Metrics not improving.
What it means (IC interpretation): This is a decision paralysis problem. The system isn't recovering because no action is taken.
Analogy: Seeing a blocked road but debating the causes instead of clearing it first.
Incident signals: "We're still investigating…" with no action taken / No improvement in metrics / Repeated theories.
IC questions: "What is the fastest way to reduce impact?" / "Can we roll back?" / "What is the safest immediate mitigation?"
Then decide: "We are rolling back — execute now."
Structured Communication (Who / What / Priority) Clear junction signs
Core understanding: Communication must be clear, direct, and structured so actions happen immediately.
What it does: Removes ambiguity. Speeds up execution.
Problem in incident: Vague instructions. Long explanations. Misunderstandings.
Effect (what you see): "Sorry, what was I doing?" Delayed responses. Confusion.
What it means (IC interpretation): This is a clarity problem. Work slows because instructions are unclear.
Analogy: Giving unclear directions at a busy junction — cars hesitate or go the wrong way.
Incident signals: Repeated clarifications / Tasks misunderstood / Slow execution after instruction.
Structure: Every instruction = Who is doing this + What exactly + Priority (now / next).
Example: "App team → roll back all nodes → priority now." (not "let's look into rollback")
The first thing an IC does is define what is broken and how bad it is. Without framing, teams focus on the wrong things or move too slowly.
| Dimension | Question to ask |
| --- | --- |
| User impact | What exactly can't users do right now? |
| Scope | All users or a subset? One service or many? |
| Severity | Is this revenue-critical (checkout / login)? |
| Start time | When did this begin? |
Good framing: "Checkout is broken for all users since 14:32 — zero orders completing."
Every critical task needs a clearly named person responsible. Without this, work is assumed, duplicated, or falls through the gaps.
A clear sequence of events connects cause and effect. Without it, the team wastes time re-discovering what happened.
| Question | Why it matters |
| --- | --- |
| When did alerts start? | First signal |
| Last deploy? | Common cause — always check |
| When did user impact begin? | May differ from first alert |
| What changed just before? | Config, data migration, traffic spike |
Multiple teams investigate different areas simultaneously. Serial investigation (one after another) is the most common time-waster in incidents.
IC must make fast, reasonable decisions to reduce impact — even without full information. Over-analysis during an active incident costs users time.
| Mitigation question | Why consider it |
| --- | --- |
| Can we roll back? | Usually fastest mitigation after a deploy |
| Can we redirect traffic? | Bypass broken component immediately |
| Can we disable a feature? | Reduce blast radius, keep rest working |
| Can we scale up? | Buy time if it's capacity-related |
Every IC instruction = Who + What + Priority. Ambiguous instructions don't get actioned immediately.
- Frame it — state impact, scope, and severity clearly to the room
- Assign owners — App / DB / Infra / Comms — named, not assumed
- Check the timeline — when did it start? What changed just before?
- Launch parallel investigation — don't wait for one team to finish
- First mitigation action — rollback? redirect? disable? Do it fast
- Communicate out — status to stakeholders, even if "investigating"
- Vague framing — "something's broken" → nobody knows urgency
- No named owner — "someone look into the DB" → nobody does
- Serial investigation — waiting for each team before the next starts
- Analysis paralysis — waiting for certainty before acting
- Unclear instructions — "maybe try rolling back?" → treated as optional
- No comms out — stakeholders escalate, creating noise
Docker, Kubernetes & Terraform — How They Fit Together The Full Picture
Docker packages an application and everything it needs into a container — so it runs the same everywhere.
Kubernetes runs and manages those containers at scale — scheduling, healing, and load-balancing them across machines.
Terraform builds the underlying infrastructure — servers, networks, and storage — using code.
Together they:
- Define — Terraform provisions the environment
- Run — Docker packages and isolates the app
- Manage — Kubernetes keeps it running at scale
The Port Analogy:
- Terraform → the company that builds the port (designs and provisions the docks, cranes, and warehouses)
- Kubernetes → the port authority running daily operations (decides which ship takes which container, reschedules when a ship is overloaded, and reroutes when one goes down)
- Docker → the standardised shipping container (sealed, identical, and portable — contents are the same no matter where it lands)
Inside a Docker container:
- Application code (e.g. Node.js, Python app)
- Runtime (Node, Python, Java, etc.)
- Dependencies (libraries, packages)
- Config needed to run
IC relevance: When an incident spans multiple layers, knowing which tool owns which layer helps you ask the right question first. Container crashing = Docker layer. Pod scheduling failing = Kubernetes layer. Servers missing = Terraform layer.
Docker Container packaging
What it does: Packages apps into containers. Ensures consistency across environments. Runs isolated processes on a host machine.
Problem in incident: Container crashes or restarts, resource limits hit (CPU/memory), misconfigured image or environment variables.
Symptoms:
- App randomly restarting
- Slow or failing requests
- "Service unavailable" errors
Technical effect:
- Container process dies or is killed by the OS
- Resource starvation — CPU throttled or memory limit hit
- Image or config mismatch between environments
What it means (IC interpretation): Usually resource exhaustion, a bad deploy or config issue, or the isolation hiding the root cause from standard monitoring.
Analogy: A standardised shipping container at a port. Every container is sealed with the app code, runtime, dependencies, and config inside — identical no matter which ship (host machine) carries it. If the contents are wrong, it fails the same way everywhere.
Incident signals: "Container restarted" · "OOMKilled" · High CPU / memory · CrashLoopBackOff
IC questions: Are containers restarting? Is resource usage high? Was there a recent deploy? Is this one container or all of them?
Kubernetes (K8s) Container orchestration
What it does: Runs containers at scale across multiple machines. Balances load, restarts failed workloads, and manages traffic routing between services.
Problem in incident: Pods not starting, traffic not reaching services, scaling or scheduling failures.
Symptoms:
- Intermittent outages — some requests succeed, others fail
- Services unreachable
- High latency across the cluster
Technical effect:
- Pods failing or stuck in Pending/CrashLoop state
- Networking or service routing issues
- Cluster imbalance — one node overloaded, others idle
What it means (IC interpretation): Usually a coordination failure, resource contention between pods, or a networking issue at the service mesh layer.
Analogy: The port authority running daily operations. Kubernetes decides which ship (node) takes which container (pod), manages the schedule, reroutes when a ship is overloaded, and replaces containers that fall into the sea (crash).
Incident signals: "Pod CrashLoopBackOff" · "Pending pods" · "Service unavailable" · Uneven latency
IC questions: Are pods running or pending? Is traffic reaching services? Any node overloaded? Any recent deploy?
Terraform Infrastructure as Code
What it does: Defines infrastructure using code (.tf files) and ensures the real system matches that definition. Creates and manages servers, networks, and storage automatically.
Problem in incident: Wrong infrastructure deployed, accidental deletion or change, drift between the expected and real state.
Symptoms:
- Sudden outages immediately after a deployment pipeline runs
- Missing resources — servers or services that should exist don't
- Wrong environment behaviour despite identical app code
Technical effect:
- Infrastructure changed or destroyed by a bad apply
- State mismatch — Terraform's state file diverges from reality
- Resources recreated with different config (different size, region, network)
What it means (IC interpretation): Usually a misconfiguration, a bad change rollout, or an automation error where Terraform enforced an incorrect "desired state".
Analogy: The company that builds the port itself — the docks, cranes, and warehouses. Terraform defines and provisions the physical infrastructure before any containers arrive. If the blueprint is wrong, the port doesn't exist or is misbuilt, and the port authority (Kubernetes) has nothing to work with.
Incident signals: "Resource deleted" · "Apply completed" · Sudden infra change · Missing instances
IC questions: Was Terraform run recently? What changed in the config? Was this intentional? Can we roll back or restore state?
Nginx Reverse proxy / Web server
What it is: Nginx is a high-performance web server and reverse proxy. In most production setups it sits in front of your application, handling incoming HTTP/HTTPS requests and forwarding them to the app server (e.g. Gunicorn).
Key roles:
- Reverse proxy — receives client requests and forwards them to the correct backend
- TLS termination — handles HTTPS so the app server only sees plain HTTP internally
- Static file serving — serves CSS, JS, images directly without touching the app
- Load balancing — distributes requests across multiple app instances
- Rate limiting / access control — rejects abusive clients before they reach the app
Analogy: The hotel front desk. Every guest walks in, the front desk decides where to route them — regular check-in, concierge, restaurant — without each department needing to handle its own door.
Common incident signals:
- 502 Bad Gateway — Nginx can't reach the upstream app server (app is down or restarting)
- 504 Gateway Timeout — app server is responding too slowly; Nginx gave up
- Connection refused — nothing is listening on the upstream socket/port
- High 499 rate — clients are closing connections before Nginx responds (slow backend)
IC questions: Is Nginx running? What do the Nginx error logs say? Is the upstream app server reachable on its port? Did a recent config change get reloaded?
Gunicorn Python WSGI app server
What it is: Gunicorn (Green Unicorn) is a Python WSGI HTTP server. It runs Python web applications (Django, Flask) by spawning multiple worker processes to handle concurrent requests. It typically sits behind Nginx in production.
What is WSGI? WSGI (Web Server Gateway Interface) is the standard protocol that defines how Python web frameworks communicate with a server. Think of it as the shape of the power socket: Flask and Django are appliances that plug into the WSGI socket; Gunicorn is the socket provider. Because they both speak WSGI, you can swap one framework for another without changing the server, or swap Gunicorn for uWSGI without changing your app. Without WSGI, every framework would need its own server.
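To show what "speaking WSGI" means in practice, here is a minimal sketch of a WSGI application: a callable that takes the request environment and a start_response function, and returns the response body as an iterable of bytes. The names application and call_app are illustrative choices; call_app is a small hypothetical helper that plays the role of the server using only the standard library.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A complete WSGI app: read the request from `environ`, reply with bytes."""
    body = f"You requested {environ['PATH_INFO']}\n".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app the way a server would, returning (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_app(application, "/health"))
```

If this were saved as a module named myapp.py (a hypothetical name), Gunicorn would be pointed at it as `myapp:application`. Flask and Django expose an object with this same calling convention, which is why the server can be swapped without changing the app.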
Key concepts:
- Worker processes — each sync worker handles one request at a time; more workers = more concurrency
- Worker types — sync (default), async (gevent/eventlet), or thread-based — chosen based on workload
- Master process — manages workers, restarts crashed ones, handles signals (reload, shutdown)
- Binding — listens on a TCP port (e.g. 8000) or Unix socket; Nginx connects to this
- Timeout — workers that don't respond within the timeout (default 30s) are killed and restarted
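The concepts above map directly onto Gunicorn settings. The sketch below is an example gunicorn.conf.py, not a recommended production configuration: the bind address and the max_requests values are assumptions, and the worker count uses the commonly cited (2 x cores) + 1 starting point, which should be tuned against real load.

```python
# gunicorn.conf.py (example values; tune for your own service)
import multiprocessing

# Binding: the address Nginx forwards requests to (assumed local port).
bind = "127.0.0.1:8000"

# Worker processes: a common starting point is (2 x CPU cores) + 1.
workers = multiprocessing.cpu_count() * 2 + 1

# Worker type: "sync" is the default, one request per worker at a time.
worker_class = "sync"

# Timeout: a worker silent for longer than this many seconds is killed and restarted.
timeout = 30

# Recycle each worker after roughly this many requests to contain slow memory leaks.
max_requests = 1000
max_requests_jitter = 100  # stagger restarts so workers do not all recycle together
```

A file like this is typically loaded with `gunicorn -c gunicorn.conf.py project.wsgi:application`. During an incident, comparing `workers` and `timeout` with the observed request rate and slowest dependency is often the quickest way to explain WORKER TIMEOUT messages.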
Analogy: The kitchen behind the hotel front desk. Nginx (front desk) routes the request; Gunicorn (kitchen) processes it using multiple chefs (workers). If the kitchen is too slow or understaffed, orders back up and the front desk starts returning "sorry, we're busy" errors.
Common incident signals:
- [CRITICAL] WORKER TIMEOUT — a worker didn't finish its request in time; was killed and restarted
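- 502 seen by clients — Gunicorn is down or a worker was killed mid-request, so Nginx gets no valid response; if all workers are merely busy, expect slow responses or 504s instead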
- High process memory — worker leak; workers grow until they're killed by OOM or max_requests
- Gunicorn not responding after deploy — new code failing to import; workers crash on start
IC questions: How many workers are configured vs request rate? Are workers timing out (slow DB call? external API?)? Is Gunicorn actually running? Did a recent code deploy cause worker crashes?
Node.js JavaScript runtime
What it is: Node.js is a JavaScript runtime built on Chrome's V8 engine. It runs server-side JavaScript using a single-threaded, non-blocking event loop — meaning it can handle many concurrent connections without spawning a thread per request. Commonly used for APIs, real-time apps, and microservices.
Key concepts:
- Event loop — a single loop processes callbacks; I/O operations are handed off asynchronously so the loop stays free for other work
- Non-blocking I/O — DB queries, file reads, and network calls don't block the loop; they return via callbacks, Promises, or async/await
- Single thread — CPU-intensive work blocks the event loop for everyone; offload to worker threads or a separate service
- npm — the package ecosystem; a missing or mismatched package version can cause startup failure
- Cluster mode / PM2 — spawns one process per CPU core to use multiple cores; PM2 also handles restarts and logging
Analogy: A single barista handling many orders at once — they pass each order to the coffee machine (async I/O) and move on. They can juggle 50 orders. But if one order requires them to stand and stir manually for 10 minutes (CPU block), every other customer waits.
Common incident signals:
- Event loop lag / high latency — CPU-intensive code blocking the loop; all requests slow down simultaneously
- Process exits with uncaught exception — unhandled Promise rejection or thrown error; app crashes until PM2/systemd restarts it
- Memory growth / OOM kill — listener leak or unbounded cache; process grows until killed
- EADDRINUSE on startup — port already in use; previous process didn't exit cleanly
IC questions: Is the event loop blocked (all requests slow at once)? Did a deploy introduce CPU-heavy code? Is the process actually running? Is memory growing per restart? Are there unhandled Promise rejections in logs?
Flask Python microframework
What it is: Flask is a lightweight Python web framework. It provides routing, request handling, and templating but has no built-in ORM, admin panel, or authentication — you add only what you need. Flask apps are WSGI applications, typically served by Gunicorn in production behind Nginx.
Key concepts:
- WSGI — Web Server Gateway Interface; the standard for Python web apps to communicate with a server like Gunicorn
- Routes — URL patterns mapped to Python functions using @app.route('/path')
- Application factory — a pattern where the Flask app is created inside a function, making config and testing cleaner
- Blueprints — modular groupings of routes; large Flask apps split into blueprints for each feature area
- Context — Flask uses a request context (per-request data) and app context (app-level data like DB connections)
Analogy: A pop-up food stall versus a full restaurant (Django). Flask gives you a table, a gas burner, and a knife — you bring the rest. Fast to set up, easy to keep simple, but you wire up every component yourself.
Common incident signals:
- 500 Internal Server Error — unhandled exception in a route; check Gunicorn/app logs for the traceback
- App fails to start after deploy — import error, missing env var, or broken dependency in requirements.txt
- Slow responses on specific routes — synchronous DB call, missing index, or external API call blocking a Gunicorn worker
- Working directory / config not found — Flask looks for files relative to the app root; a path mismatch breaks startup
IC questions: Is the app actually running (Gunicorn workers up)? Which route is failing — is it all routes or one? Did a deploy change requirements.txt or env vars? Is there a slow DB call on the failing route?
Django Python batteries-included framework
What it is: Django is a full-featured Python web framework. Unlike Flask, it includes an ORM, admin panel, authentication, form handling, and migrations out of the box. Also a WSGI app — served by Gunicorn behind Nginx in production. Its philosophy is "don't repeat yourself" — conventions reduce the amount of code needed.
Key concepts:
- ORM — Django's built-in Object-Relational Mapper translates Python model classes to SQL; powerful but can generate inefficient queries if used carelessly
- Migrations — schema changes are tracked as migration files; running manage.py migrate applies them to the database
- Settings — all configuration lives in settings.py: DEBUG, database credentials, allowed hosts, installed apps
- Admin panel — auto-generated at /admin; very useful for manual data inspection during incidents
- WSGI entry point — Gunicorn points at project.wsgi:application; if this import fails, no workers start
Analogy: A fully equipped commercial kitchen (vs Flask's pop-up stall). The oven, the walk-in fridge, the dishwasher — all included. More opinionated about layout, but you get to cooking faster. The trade-off: more moving parts that can break.
Common incident signals:
- App fails to start after deploy — unapplied migrations, missing settings, or a broken import in models/apps
- Slow queries / high DB CPU — N+1 query problem (one query per object in a loop); use select_related/prefetch_related
- DEBUG=True in production — shows full stack traces to users and keeps a record of every SQL query in memory, so worker memory grows over time; a major security and performance issue
- 500 on a specific URL — unhandled exception in a view; check Gunicorn logs for the traceback
- Migration conflicts after merge — two branches added migrations to the same app; resolve with manage.py makemigrations --merge, or re-number one branch's migrations
IC questions: Were migrations applied after the deploy? Is DEBUG True in production? Which view is causing 500s? Are there N+1 query patterns in the slow endpoint? Is the WSGI entry point importable?
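The N+1 pattern is easier to recognise once you have seen the query counts side by side. The sketch below uses plain sqlite3 rather than the Django ORM so that it runs with the standard library alone; the author and book tables and the function names are invented for illustration. The loop version mirrors what happens when a view touches a related object for every row, and the JOIN version mirrors what select_related asks the database to do for a foreign key.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with authors and their books."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE book (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            author_id INTEGER NOT NULL REFERENCES author(id)
        );
        INSERT INTO author VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO book VALUES (1, 'Notes', 1), (2, 'Compilers', 2), (3, 'Engines', 1);
        """
    )
    return conn


def titles_n_plus_one(conn):
    """One query for the books, then one more query per book for its author."""
    queries = 1
    books = conn.execute("SELECT title, author_id FROM book ORDER BY id").fetchall()
    result = []
    for title, author_id in books:
        (name,) = conn.execute(
            "SELECT name FROM author WHERE id = ?", (author_id,)
        ).fetchone()
        queries += 1
        result.append((title, name))
    return result, queries


def titles_joined(conn):
    """The same data in a single query using a JOIN."""
    rows = conn.execute(
        "SELECT book.title, author.name FROM book "
        "JOIN author ON author.id = book.author_id ORDER BY book.id"
    ).fetchall()
    return rows, 1


if __name__ == "__main__":
    db = build_db()
    print(titles_n_plus_one(db)[1], "queries vs", titles_joined(db)[1])
```

With three books the difference is small; with thousands of rows per request the loop version is what shows up as high DB CPU.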
OCI Physical Hierarchy OCI Infrastructure
Oracle Cloud Infrastructure organises resources in a three-level hierarchy: Region → Availability Domain → Fault Domain. Understanding which level a failure is at determines the blast radius and recovery options.
Region
A geographic area (e.g. uk-london-1, us-ashburn-1). Completely isolated from other regions — an outage in one region does not affect others. OCI has 40+ regions globally.
IC relevance: If users in only one country are affected, ask: "Which region do they connect to?" Regional failures are rare and escalated immediately to Oracle.
Availability Domain (AD)
Within a region there are 1–3 ADs. Each AD is a physically separate data centre with its own power, cooling, and networking. Failure in one AD does not cascade to others in the same region.
IC relevance: If some users are affected and others are not within the same region, ask: "Are the affected services deployed in only one AD? Is there cross-AD load balancing?"
Fault Domain (FD)
Each AD contains 3 FDs. A FD groups physical hardware — servers and top-of-rack switches — sharing a power circuit. A hardware failure (power circuit, rack switch) affects only the instances in that FD.
IC relevance: If some VMs within an AD are down but others are fine, ask: "Are all the affected instances in the same FD?" Spreading instances across all 3 FDs gives hardware-level redundancy inside an AD.
The Analogy
Region = the city. AD = a separate building in the city, with its own power supply and entrance — a fire in building A doesn't affect building B. FD = a floor within that building — a tripped circuit on floor 3 doesn't affect floors 1 and 2.
IC First Questions
- "Which region are the affected resources in?" — rules in/out a regional event
- "Are affected services in the same AD, or spread across ADs?" — narrows to AD-level failure
- "Which FD are the affected instances in?" — points to hardware-level fault
- "Are any other resources in the same FD also affected?" — confirms blast radius
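The scoping questions above amount to finding the narrowest level that contains every affected resource. A minimal sketch of that reasoning, assuming you already have region/AD/FD labels for each affected instance (the Placement type, the function name, and the sample labels are all illustrative, not an OCI API):

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Placement:
    region: str
    ad: str
    fd: str


def narrowest_common_scope(affected: Iterable[Placement]) -> str:
    """Return the smallest hierarchy level that contains all affected instances."""
    items = list(affected)
    if not items:
        return "none"
    if len({p.region for p in items}) > 1:
        return "multi-region"
    if len({p.ad for p in items}) > 1:
        return "region"
    if len({p.fd for p in items}) > 1:
        return "availability domain"
    return "fault domain"
```

The result is only a hypothesis about blast radius. It still needs the last question from the list: if other instances in the same FD are healthy, the cause is more likely the application or a single host than the fault domain itself.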
Java Garbage Collection Java GC
Java automatically reclaims heap memory that is no longer in use — this is garbage collection. The IC-relevant symptom is the stop-the-world (STW) pause: a brief period where the JVM halts every application thread to run GC. Under load, these pauses appear as periodic latency spikes (typically 200ms–2s) with no CPU, disk, or network cause visible in infrastructure monitoring. The JVM resumes normally after each pause. If heap is consistently near-full, GC runs more frequently and pauses grow longer, eventually causing a java.lang.OutOfMemoryError. Modern collectors (G1GC, ZGC) reduce pause duration, but insufficient heap or a memory leak will overwhelm any collector.
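If GC logging is enabled on the service, long pauses can be pulled out of the log and lined up against the latency spikes. The sketch below assumes log lines in the general shape produced by the JVM's unified logging (for example with -Xlog:gc on recent JDKs); the exact format varies by JDK version and collector, the regular expression is an assumption, and the sample lines in the usage note are made up.

```python
import re
from typing import Iterable, List, Tuple

# Assumed shape: "[<uptime>s]...Pause <details> <duration>ms"
PAUSE_LINE = re.compile(r"^\[(\d+\.\d+)s\].*?Pause (.+?) (\d+\.\d+)ms$")


def long_pauses(lines: Iterable[str], threshold_ms: float = 200.0) -> List[Tuple[float, float, str]]:
    """Return (uptime_seconds, pause_ms, details) for pauses at or above the threshold."""
    hits = []
    for line in lines:
        match = PAUSE_LINE.search(line.strip())
        if match is None:
            continue  # not a pause line (for example a concurrent phase)
        uptime, details, pause_ms = match.groups()
        if float(pause_ms) >= threshold_ms:
            hits.append((float(uptime), float(pause_ms), details))
    return hits
```

For example, given the invented lines "[12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 310M->120M(512M) 18.402ms" and "[47.900s][info][gc] GC(9) Pause Full (Allocation Failure) 498M->470M(512M) 1310.775ms", only the second is reported. A full pause that frees very little heap, as in that second line, fits the near-full heap pattern described above.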
Container Runtimes Beyond Docker
What is a container runtime? The low-level software that actually runs containers — it creates the isolated process, sets up namespaces and cgroups, and manages the container lifecycle. Docker is the most recognised but not the only option.
Why it matters as IC: Knowing which runtime is in use helps you read logs correctly and point to the right team. "docker ps" doesn't work if the environment uses containerd or CRI-O directly.
- Podman: Near drop-in replacement for Docker. Daemonless (no background service required), supports rootless containers (runs without root), same CLI syntax. Used where the Docker daemon is a security concern. Key difference: no daemon means no single point of failure; each container is a direct child process of the user.
- containerd: Lightweight runtime originally extracted from Docker; Docker still uses containerd under the hood. Kubernetes removed dockershim in v1.24, so the kubelet talks to containerd directly. Minimal API, no Docker-style CLI for end users. IC signal: in K8s environments on v1.24 or later, container state is in containerd, not Docker.
- CRI-O: Built specifically for Kubernetes. Implements the Container Runtime Interface (CRI) so K8s can talk to it directly. Even more minimal than containerd. Common in OpenShift environments. IC signal: if the cluster uses OpenShift, the runtime is almost certainly CRI-O.
- LXC / LXD: More like lightweight virtual machines than pure application containers. Each LXC container runs a full Linux userspace with init, systemd, and multiple processes, not just one application. Used for OS-level isolation rather than microservice packaging. Key difference: LXC feels like a VM; Docker feels like a process.
- rkt (CoreOS Rocket): Security-focused runtime, now deprecated. CoreOS was acquired by Red Hat in 2018, and rkt development stopped in 2019. Mentioned here for historical context; you may see it in older documentation.
- Kubernetes + pluggable runtimes: K8s itself is not a container runtime; it is an orchestrator. It manages containers via the Container Runtime Interface (CRI), which lets you swap the underlying runtime (containerd, CRI-O, etc.) without changing how K8s works.
Quick decision rule for ICs:
- Bare VM running a single app → likely Docker or Podman
- Kubernetes cluster → containerd or CRI-O (not Docker since K8s v1.24)
- OpenShift cluster → CRI-O
- OS-level multi-process isolation → LXC/LXD
A container is just a process with its own mini-filesystem and dependencies — an isolated app + everything it needs to run, packaged together.
Core value: portability + consistency. The app behaves the same on any machine, any environment.
| The food | = your app |
| The ingredients | = dependencies |
| The box | = isolation |
You can take it anywhere, and it's the same meal every time.
Nginx is a reverse proxy and web server that sits in front of your app — it receives every incoming HTTP/HTTPS request and decides where to send it.
The hotel front desk. Every guest walks in; the desk decides who handles them — restaurant, concierge, housekeeping. No department needs its own front door.
Gunicorn is a Python WSGI app server — it takes requests from Nginx and runs your Flask or Django app using a pool of worker processes (one request per worker at a time).
WSGI (Web Server Gateway Interface) is the standard protocol that lets Python web frameworks (Flask, Django) communicate with a server like Gunicorn. Think of it as the power socket shape — the framework plugs in, the server provides the socket, and they speak a common language regardless of which framework is used.
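To make the "socket shape" concrete, here is a minimal sketch of the WSGI contract using only the standard library. The application callable below is a toy, not Flask or Django; call_app is a hypothetical helper that plays the role of the server (what Gunicorn does for each request).

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """The framework side: receive the request dict, declare status, return body chunks."""
    body = f"hello from {environ['PATH_INFO']}".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path):
    """The server side: build the environ, call the app, collect the response."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Flask and Django each expose an object with this same call signature, which is why one server can run either without knowing which framework is behind it.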
The kitchen behind the hotel front desk. Nginx routes the order; Gunicorn processes it using N chefs (workers). If the kitchen is full or a chef takes too long — new orders back up and the front desk starts returning errors.
Node.js is a JavaScript runtime that handles many concurrent connections using a single-threaded event loop — async I/O keeps it free for other requests, but CPU-heavy code blocks every user at once.
A single barista juggling many orders — they hand each order to the machine (async I/O) and move on. But if they have to stand and manually grind beans for 10 minutes (CPU work), every other customer waits.
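The same failure mode can be shown with any single-threaded event loop. The sketch below uses Python's asyncio as a stand-in for Node.js (an assumption made to keep the examples in one language); the handler names and timings are illustrative. A synchronous time.sleep plays the part of CPU-heavy code.

```python
import asyncio
import time


async def quick_request(started: float) -> float:
    """A request that only waits on async I/O; returns its observed latency."""
    await asyncio.sleep(0.01)
    return time.perf_counter() - started


async def cpu_heavy_handler() -> None:
    # Synchronous work never yields, so the loop cannot serve anyone else.
    time.sleep(0.2)


async def measure(with_blocking_handler: bool) -> float:
    started = time.perf_counter()
    tasks = [asyncio.create_task(quick_request(started))]
    if with_blocking_handler:
        tasks.append(asyncio.create_task(cpu_heavy_handler()))
    results = await asyncio.gather(*tasks)
    return results[0]  # latency seen by the quick request
```

Running asyncio.run(measure(False)) and then asyncio.run(measure(True)) shows the quick request's latency jump from roughly its own 10 ms wait to at least the 200 ms the blocking handler holds the loop, even though the quick request itself did nothing expensive.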
Flask is a lightweight Python WSGI microframework — it gives you URL routing and request handling only. No ORM, no admin panel, no auth built in. You add exactly what you need.
A pop-up food stall. You get a table, a gas burner, and a knife — bring the rest yourself. Fast to set up, easy to keep simple, but you wire every component.
Django is a batteries-included Python WSGI framework — ORM, admin panel, auth, and migrations come built in. More moving parts than Flask but faster to build standard features.
A commercial kitchen fully equipped — everything is there when you arrive. Faster to cook a full meal, but more equipment means more things that can break.
Region: Geographic area (e.g. uk-london-1). Fully isolated from other regions. A regional failure affects all ADs and FDs within it.
Ask: "Is this one geography or global?"
Availability Domain (AD): Separate data centre within a region (1–3 per region). Own power and cooling. AD failure does not affect other ADs.
Ask: "Are affected services in the same AD?"
Fault Domain (FD): Hardware grouping within an AD (3 per AD). Shared power circuit + top-of-rack switch. Failure affects only instances in that FD.
Ask: "Are all downed VMs in the same FD?"
Region = city · AD = separate building in the city · FD = floor within the building.
A tripped circuit on one floor doesn't affect other floors or other buildings.
Scope first: Region → AD → FD. The level of the failure determines who you call and what options you have for recovery.
The JVM briefly halts all threads to reclaim heap memory. Symptom: periodic latency spikes (200ms–2s), no CPU/disk/network cause, clean recovery after each spike.
IC signal: intermittent spikes with no infrastructure alert → ask if it's a Java service → suspect GC.
- Docker: general-purpose app containers, best developer tooling
- Podman: drop-in Docker replacement, daemonless, rootless mode; preferred where security posture matters
- containerd: lightweight runtime used by Docker and by Kubernetes since v1.24 (replaced dockershim)
- CRI-O: Kubernetes-native only, OpenShift default, minimal footprint
- LXC / LXD: OS-level isolation, more like a lightweight VM than an app container
- rkt: deprecated (CoreOS acquired by Red Hat in 2018; rkt development stopped in 2019)
- Bare VM / single app → Docker or Podman: use docker ps
- Kubernetes cluster (v1.24+) → containerd: use crictl ps
- OpenShift cluster → CRI-O: use crictl ps
- Multi-process OS isolation → LXC/LXD: use lxc list
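When nobody on the call knows which runtime a host uses, checking which CLI is installed is a quick first hint. A minimal sketch, assuming the commands listed above plus podman ps (Podman's equivalent of docker ps, added here as an assumption); the mapping and function name are illustrative, and finding a binary on PATH does not prove that runtime is the one actually running the workload.

```python
import shutil

# Runtime -> command an IC would ask someone to run on the host.
INSPECT_COMMANDS = {
    "docker": ["docker", "ps"],
    "podman": ["podman", "ps"],
    "containerd": ["crictl", "ps"],
    "cri-o": ["crictl", "ps"],
    "lxc": ["lxc", "list"],
}


def available_inspect_commands(which=shutil.which):
    """Return {runtime: command} for every runtime whose CLI is found on PATH."""
    found = {}
    for runtime, argv in INSPECT_COMMANDS.items():
        if which(argv[0]) is not None:
            found[runtime] = " ".join(argv)
    return found
```

The which parameter exists only so the lookup can be replaced in tests; called with no arguments, the function inspects the real PATH.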
Key difference from Docker: Podman has no daemon — each container is a direct child process of the user, so there is no central point of failure.