
















































































This exam validates understanding of observability concepts, telemetry types (logs, metrics, traces), instrumentation, monitoring frameworks, distributed systems visibility, dashboards, alerts, and root-cause analysis. Intended for DevOps, SRE, and cloud engineers working on improving system insight and stability.
Typology: Exams

Question 1. Which of the following best describes the primary difference between monitoring and observability?
A) Monitoring only collects logs, while observability collects metrics.
B) Monitoring tells you what is happening; observability tells you why it is happening.
C) Monitoring is a subset of observability focused on alerting.
D) Observability requires manual instrumentation, monitoring does not.
Answer: B
Explanation: Monitoring provides visibility into system health (e.g., thresholds), whereas observability enables answering root‑cause questions by correlating metrics, logs, and traces.

Question 2. In the three pillars of observability, which pillar is most suited for answering “when did the error occur?”
A) Metrics
B) Logs
C) Traces
D) Dashboards
Answer: B
Explanation: Logs contain timestamped events, making them ideal for pinpointing the exact moment an error happened.

Question 3. Which of the following is NOT one of Google’s Four Golden Signals?
A) Latency
B) Traffic
C) Errors
D) Throughput
Answer: D
Explanation: The four signals are Latency, Traffic, Errors, and Saturation. Throughput is a metric but not a golden signal.

Question 4. A counter metric is best used to track which of the following?
A) Current memory usage
B) Number of HTTP requests served since process start
C) Distribution of request latency
D) Temperature of a CPU core
Answer: B
Explanation: Counters only increase; they are ideal for cumulative counts such as total requests.

Question 5. Which time‑series database characteristic is essential for high‑resolution metric storage?
A) Column‑family storage model
B) Immutable append‑only logs with compression
C) Relational ACID transactions
D) Document‑oriented indexing
Answer: B
Explanation: TSDBs store immutable points, often using compression to handle high‑frequency data efficiently.

Question 6. In the OODA loop applied to observability, the “Orient” step primarily involves:
A) Generating alerts from thresholds.
B) Correlating data from metrics, logs, and traces to understand context.
C) Deploying a new version of the service.
D) Restarting a failing container.
Answer: B
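
The counter behaviour described in Question 4 can be illustrated with a minimal Python sketch. The `Counter` class and the `per_second_rate` helper are hypothetical names written for this illustration and do not belong to any particular metrics library. The sketch assumes that a value lower than the previous sample means the process restarted and the counter began again from zero.

```python
from typing import List, Tuple


class Counter:
    """Cumulative counter: the value only goes up while the process lives."""

    def __init__(self) -> None:
        self._value = 0

    def inc(self, amount: int = 1) -> None:
        # Rejecting negative amounts is what makes this a counter, not a gauge.
        if amount < 0:
            raise ValueError("a counter can only increase")
        self._value += amount

    @property
    def value(self) -> int:
        return self._value


def per_second_rate(samples: List[Tuple[float, float]]) -> float:
    """Average per-second increase across (timestamp, value) samples.

    Assumption: a value lower than its predecessor is treated as a process
    restart, i.e. the counter started again from zero.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


# Hypothetical usage: count three served HTTP requests.
requests_served = Counter()
for _ in range(3):
    requests_served.inc()
```

Because the raw value only ever grows, it is usually the derived rate (requests per second) that is graphed or alerted on, not the cumulative total.
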
D) Cause‑based alerts
Answer: D
Explanation: Cause‑based alerts fire on root‑cause conditions, preventing multiple noisy symptom alerts.

Question 10. In structured logging, which field is most critical for linking a log entry to a trace?
A) log_level
B) timestamp
C) trace_id
D) hostname
Answer: C
Explanation: The trace_id enables correlation between logs and distributed traces.

Question 11. Which logging level should be used for messages that indicate an unexpected condition that does not stop program execution but may need attention?
A) DEBUG
B) INFO
C) WARN
D) FATAL
Answer: C
Explanation: WARN indicates a potentially problematic situation that isn’t fatal.

Question 12. Which component of the ELK stack is responsible for indexing and searching log data?
A) Elasticsearch
B) Logstash
C) Kibana
D) Beats
Answer: A
Explanation: Elasticsearch stores and indexes logs, enabling fast search queries.

Question 13. When collecting logs from a Kubernetes cluster, which sidecar pattern is commonly used?
A) Pushgateway sidecar
B) Log exporter sidecar (e.g., fluentd)
C) Prometheus scrape sidecar
D) Service mesh sidecar
Answer: B
Explanation: A logging sidecar (fluentd, vector) runs alongside the application container to ship logs.

Question 14. Which technique helps to reduce the storage cost of high‑volume log data while preserving diagnostic value?
A) Storing logs in plain text files.
B) Log sampling with a deterministic hash.
C) Disabling log rotation.
D) Increasing log verbosity.
Answer: B
Explanation: Sampling selects a representative subset, lowering volume while keeping useful signals.

Question 15. In distributed tracing, a span represents:
A) The entire end‑to‑end request.
B) A network latency issue.
C) A performance bottleneck.
D) A missing instrumentation point.
Answer: C
Explanation: Repeated long spans reveal operations that consume disproportionate time, suggesting a bottleneck.

Question 19. Which of the following best describes an exemplar in observability?
A) A sample log line used for debugging.
B) A metric data point that includes a reference to a trace ID.
C) A synthetic transaction used for testing.
D) A predefined alert rule.
Answer: B
Explanation: Exemplars attach trace identifiers to metric samples, enabling jump‑to‑trace from a metric spike.

Question 20. A well‑designed operational dashboard should:
A) Show every raw metric collected by the system.
B) Emphasize high‑level health indicators and actionable insights.
C) Use bright colors for all panels.
D) Refresh only once per hour.
Answer: B
Explanation: Dashboards should surface concise, actionable health signals, not exhaustive raw data.

Question 21. Service Level Indicators (SLIs) are:
A) The contractual penalties for missed targets.
B) The raw measurements used to assess a Service Level Objective.
C) The business revenue targets.
D) The same as Service Level Agreements.
Answer: B
Explanation: SLIs are quantitative metrics (e.g., request latency 99th percentile) that feed into SLOs.

Question 22. An error budget is defined as:
A) The total number of bugs allowed per release.
B) The fraction of SLO time that can be spent in error without violating the agreement.
C) The budget allocated for purchasing observability tools.
D) The amount of log storage purchased per month.
Answer: B
Explanation: Error budget = 1 – SLO target; it quantifies the permissible unreliability. For example, a 99.9 % SLO leaves a 0.1 % error budget.

Question 23. In a Kubernetes environment, which cgroup metric is most useful for monitoring container CPU pressure?
A) memory.usage_in_bytes
B) cpu.throttled_time_ns
C) blkio.io_service_bytes_total
D) network.packets_received_total
Answer: B
Explanation: cpu.throttled_time_ns indicates how long the container’s CPU was throttled, reflecting pressure. (In cgroup v1 this value is reported as throttled_time, in nanoseconds, in cpu.stat; cgroup v2 reports throttled_usec.)

Question 24. Which challenge is unique to observability in serverless platforms compared to traditional VMs?
A) Grafana
B) Datadog
C) Prometheus
D) OpenTelemetry Collector
Answer: B
Explanation: Datadog provides integrated observability across the three pillars.

Question 28. During an incident, the Detection phase most often relies on:
A) Post‑mortem documentation.
B) Automated alerts generated from observability data.
C) Manual code reviews.
D) Service deployment pipelines.
Answer: B
Explanation: Detection is the moment an anomaly is surfaced, typically via alerts.

Question 29. Which of the following is a key benefit of correlating logs with trace IDs?
A) Reducing the storage size of logs.
B) Enabling end‑to‑end request reconstruction across services.
C) Eliminating the need for metrics.
D) Automating code deployment.
Answer: B
Explanation: Trace IDs tie log entries to a specific request, facilitating cross‑service debugging.

Question 30. A dynamic threshold for alerting is typically based on:
A) A fixed numeric value defined at design time.
B) Historical baseline data that adapts over time.
C) The number of developers on call.
D) The size of the log file.
Answer: B
Explanation: Dynamic thresholds adjust according to recent data trends, reducing false positives.

Question 31. Which metric type would you use to represent “current number of active database connections”?
A) Counter
B) Gauge
C) Histogram
D) Summary
Answer: B
Explanation: Gauges represent a value that can go up or down, suitable for current connection counts.

Question 32. In the context of SRE, the term availability is best expressed as:
A) The percentage of time a service responds to requests within its SLO latency target.
B) The number of servers in a cluster.
C) The total amount of storage allocated.
D) The average CPU utilization.
Answer: A
Explanation: Availability measures the proportion of successful, timely responses.

Question 33. Which of the following is a common cause of high cardinality in metric labels, leading to performance issues?
A) Adding a label for static environment (prod, dev).
C) Store trace data in a relational database.
D) Generate alerts automatically.
Answer: B
Explanation: Parent‑child links define the execution tree of a request across services.

Question 37. Which of the following is an advantage of using automatic instrumentation libraries (e.g., OpenTelemetry auto‑instrumentation) over manual instrumentation?
A) Guarantees zero performance overhead.
B) Provides coverage for many popular libraries without code changes.
C) Allows custom business‑logic metrics only.
D) Eliminates the need for a collector.
Answer: B
Explanation: Auto‑instrumentation injects trace/metric hooks into existing libraries without modifying application code.

Question 38. When designing an executive‑level dashboard, which metric is most appropriate to include?
A) CPU usage per pod.
B) Number of 5xx errors per minute.
C) Overall Service Level Objective (SLO) compliance percentage.
D) Detailed request trace IDs.
Answer: C
Explanation: Executives need high‑level business‑impact metrics like SLO compliance.

Question 39. What is the primary purpose of a service map derived from tracing data?
A) To display real‑time CPU utilization.
B) To visualize dependencies and communication paths between services.
C) To store logs centrally.
D) To enforce network security policies.
Answer: B
Explanation: Service maps illustrate how services interact, helping identify coupling and failure propagation.

Question 40. Which of the following is a recommended practice for log retention in compliance‑heavy environments?
A) Keep logs indefinitely on local disks.
B) Rotate logs daily and store compressed archives for the required retention period.
C) Delete logs after one week to save space.
D) Store logs only in memory.
Answer: B
Explanation: Rotating and compressing logs balances storage costs with compliance retention requirements.

Question 41. In the context of error budgets, a team deciding to launch a new feature when the error budget is 80 % remaining is:
A) Violating the SLO.
B) Operating within acceptable risk.
C) Ignoring the SLO entirely.
D) Exceeding the budget.
Answer: B
Explanation: With 80 % of the budget left, the team still has ample margin before breaching the SLO.
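
The "80 % remaining" figure in Question 41 can be reproduced with a short Python sketch. It assumes a request-based SLO, where the budget is counted in failed requests and not in minutes of downtime. The function names and the request counts are hypothetical and were chosen only for illustration.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail: 1 - SLO target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = error_budget(slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 99.9 % SLO, one million requests, 200 failures.
# About 1,000 failures are allowed, so roughly 80 % of the budget is left.
remaining = budget_remaining(0.999, 1_000_000, 200)
print(f"error budget remaining: {remaining:.0%}")
```

A negative result would mean the budget is overspent, which is the situation described by option D.
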
Question 45. Which of the following is a common indicator of a memory leak observable via metrics?
A) Sudden drop in request latency.
B) Steady increase in resident set size (RSS) over time without corresponding load increase.
C) Decrease in CPU utilization.
D) Increase in network packet loss.
Answer: B
Explanation: A continuously growing RSS without load changes suggests memory not being released.

Question 46. In a microservice architecture, which observability pillar is most effective for identifying the exact code path that caused an error?
A) Metrics
B) Logs
C) Traces
D) Dashboards
Answer: C
Explanation: Traces map the request flow across services and code paths, pinpointing where the error originated.

Question 47. Which of the following best describes alert fatigue?
A) A condition where alerts are never triggered.
B) A situation where operators become desensitized due to excessive false or noisy alerts.
C) The process of automatically silencing alerts during maintenance windows.
D) The practice of sending alerts via multiple channels.
Answer: B
Explanation: Alert fatigue occurs when too many alerts cause operators to ignore or miss critical ones.

Question 48. Which of these is a primary benefit of centralized logging compared to local log files?
A) Reduced network traffic.
B) Ability to query across multiple hosts and services from a single interface.
C) Elimination of log rotation.
D) Guarantees zero log loss.
Answer: B
Explanation: Centralization enables cross‑host correlation and unified search.

Question 49. Which of the following is the most appropriate SLO for a user‑facing API that promises “99.9 % of requests respond within 200 ms”?
A) Latency ≤ 200 ms for 99.9 % of requests.
B) Error rate ≤ 0.1 % overall.
C) Throughput ≥ 10 k req/s.
D) CPU utilization ≤ 70 %.
Answer: A
Explanation: The SLO directly specifies the latency target and the percentile.

Question 50. Which of the following statements about ephemeral containers is correct?
A) They retain state across restarts.
B) Their lifecycle is tied to the pod, making continuous metric collection challenging.
C) They cannot be instrumented for tracing.
D) They always run on dedicated hardware.
C) A metric that measures CPU usage.
D) A tool for visualizing logs.
Answer: B
Explanation: SLAs are external agreements that reference SLOs and define penalties for breaches.

Question 54. What is the primary purpose of the Alertmanager in a Prometheus ecosystem?
A) Store time‑series data.
B) Visualize metrics on dashboards.
C) Route, deduplicate, and silence alerts generated by Prometheus rules.
D) Collect logs from containers.
Answer: C
Explanation: Alertmanager handles alert aggregation, routing, and silencing.

Question 55. Which of the following is a common indicator of network saturation observable via metrics?
A) Decrease in request latency.
B) Increase in TCP retransmission rate.
C) Drop in CPU usage.
D) Higher disk I/O throughput.
Answer: B
Explanation: Retransmissions rise when the network is congested, indicating saturation.

Question 56. When correlating logs and metrics, which identifier is typically used to join the two data sources?
A) hostname
B) process_id
C) trace_id or span_id
D) metric_name
Answer: C
Explanation: Trace or span IDs are written into log entries and can be attached to metric samples as exemplars, enabling correlation.

Question 57. Which of these is an example of a cause‑based alert for a database service?
A) Alert when query latency exceeds 500 ms.
B) Alert when the number of open connections reaches the max pool size.
C) Alert when the error rate of SELECT statements exceeds 2 %.
D) Alert when the CPU usage exceeds 80 %.
Answer: B
Explanation: Hitting the pool limit directly indicates the root cause of potential request failures.

Question 58. In the context of observability for containers, which cgroup metric best reflects memory pressure?
A) memory.usage_in_bytes
B) memory.limit_in_bytes
C) memory.stat.active_anon
D) memory.pressure_level (if available)
Answer: D
Explanation: memory.pressure_level (or a similar signal, such as the memory.pressure PSI file in cgroup v2) indicates the intensity of memory reclamation, directly reflecting pressure.

Question 59. Which of the following is a key characteristic of a good trace sampling policy?