DevOps Roadmap — Part 9: Monitoring & Observability

By Suraj Ahir · November 16, 2025


You cannot manage what you cannot measure. A production system without monitoring is flying blind — you only find out something is wrong when users start complaining. By then, the problem has often been affecting performance for hours. Monitoring and observability give you eyes inside your systems so you know about problems before users do, understand why they happened, and prevent them from recurring.

Monitoring vs Observability

These terms are related but distinct. Monitoring means watching predefined metrics and alerting when they cross thresholds — CPU above 90%, response time above 500ms, error rate above 1%. You know what to watch because you anticipated it. Observability means your system produces enough data (metrics, logs, traces) that you can answer any question about its internal state, including questions you did not anticipate asking. Highly observable systems make debugging much faster because the answers are already in the data.


Prometheus — Metrics Collection

Prometheus is the industry-standard metrics system in the cloud-native world. It scrapes (pulls) metrics from your applications and infrastructure at regular intervals and stores them in a time-series database. You query metrics using PromQL (Prometheus Query Language).

prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['server1:9100', 'server2:9100']

  - job_name: 'my-app'
    static_configs:
      - targets: ['app:5000']
    metrics_path: '/metrics'
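The 'my-app' job above expects the application to serve plain-text metrics at /metrics in the Prometheus exposition format. In practice you would use an official client library such as prometheus_client; this dependency-free Python sketch (the counter name and port are illustrative assumptions) only shows what a scrape actually returns:

```python
# Minimal /metrics endpoint in the Prometheus text exposition format.
# Real applications use a client library (e.g. prometheus_client); this
# stdlib-only sketch just illustrates the format Prometheus scrapes.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counter that application code would increment per request.
REQUEST_COUNT = {("GET", "/"): 42}

def render_metrics() -> str:
    lines = [
        "# HELP http_requests_total Total HTTP requests served.",
        "# TYPE http_requests_total counter",
    ]
    for (method, path), value in REQUEST_COUNT.items():
        lines.append(f'http_requests_total{{method="{method}",path="{path}"}} {value}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To expose metrics on the port Prometheus scrapes:
# HTTPServer(("", 5000), MetricsHandler).serve_forever()
```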

Node Exporter — Server Metrics

Install Node Exporter
# Run Node Exporter (exposes server metrics on port 9100)
docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter \
  --path.rootfs=/host

# Key metrics available:
# node_cpu_seconds_total
# node_memory_MemAvailable_bytes
# node_filesystem_avail_bytes
# node_network_receive_bytes_total

PromQL — Querying Metrics

Useful PromQL Queries
# CPU usage percentage
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory used
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# HTTP request rate per second
rate(http_requests_total[5m])

# Error rate percentage
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
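Conceptually, rate() is the per-second average increase of a counter over the window, with counter resets accounted for (a sample lower than its predecessor means the process restarted and the counter began again at zero). A simplified Python sketch of that idea; real PromQL additionally extrapolates to the window boundaries:

```python
# Simplified version of what PromQL's rate() computes over a window.
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) within the window."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # On a counter reset, the counter restarted from zero,
        # so the whole new value counts as increase.
        increase += cur if cur < prev else cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# 4 samples over 45s: counter goes 100 -> 130, resets, then 10 -> 40.
# Total increase is 30 + 10 + 30 = 70 over 45s, i.e. about 1.56/s.
print(simple_rate([(0, 100), (15, 130), (30, 10), (45, 40)]))
```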

Grafana — Visualization

Prometheus stores data. Grafana visualizes it. Grafana connects to Prometheus as a data source and lets you build beautiful dashboards. There are thousands of community-built dashboards available — for Node Exporter, databases, Kubernetes, and more — which you can import with a single click using their dashboard IDs.

Run Grafana
docker run -d \
  -p 3000:3000 \
  --name grafana \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  grafana/grafana

# Access at http://localhost:3000
# Add Prometheus as data source: http://prometheus:9090
# Import dashboard ID 1860 for Node Exporter
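Instead of clicking through the UI, Grafana can also provision the data source from a file at startup. A minimal sketch, assuming the file is mounted into /etc/grafana/provisioning/datasources/ inside the container:

```yaml
# datasource.yml — mount into /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```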

Alerting

alert_rules.yml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU is {{ $value }}% for 5 minutes"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
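Prometheus only evaluates these rules; the alerting block in prometheus.yml hands firing alerts to Alertmanager, which groups them and routes them to receivers. A minimal routing sketch (the Slack webhook URL and channel name are placeholders):

```yaml
# alertmanager.yml — route critical alerts to a hypothetical Slack channel
route:
  receiver: default
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: slack-critical
receivers:
  - name: default
  - name: slack-critical
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE_ME'
        channel: '#alerts'
```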

In Part 10, we will cover Kubernetes fundamentals — the container orchestration platform that runs the majority of the world's containerized production workloads.

The Three Pillars of Observability

Observability is the ability to understand the internal state of a system from its external outputs. The three pillars are metrics, logs, and traces. Metrics are numeric measurements over time — CPU usage, request count, error rate, latency percentiles. They are efficient to store and query, and ideal for dashboards and alerting. Logs are timestamped records of discrete events — an HTTP request completed, an error occurred, a user logged in. They provide rich context for debugging specific incidents. Distributed traces track a request as it flows through multiple services, showing exactly where time is spent and where failures occur. Each pillar provides different visibility; comprehensive observability requires all three.
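To make the logs pillar concrete: emitting structured (JSON) log lines turns discrete events into machine-queryable data, which is what lets a log aggregator filter by status code or user rather than grepping free text. A minimal Python sketch; the field names are illustrative:

```python
# Structured logging: each event becomes one JSON object per line.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "msg": record.getMessage(),
            **getattr(record, "ctx", {}),  # extra request context, if attached
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Attach context so a later query can filter by path or status.
logger.info("request completed", extra={"ctx": {"path": "/login", "status": 200}})
```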

The RED Method for Service Monitoring

The RED method (popularized by Weaveworks) provides a simple framework for monitoring any service: Rate (requests per second — how much traffic is the service handling?), Errors (error rate — what percentage of requests are failing?), and Duration (latency distribution — how long are requests taking, at P50, P95, P99?). These three metrics, monitored together, give you immediate visibility into service health. An alert on error rate above 1% and P99 latency above 500ms covers most production service health requirements. Pair the RED method with the USE method for resource monitoring: Utilization, Saturation, and Errors for CPU, memory, disk, and network.
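In production these numbers come from Prometheus counters and histograms via PromQL, but the arithmetic behind RED is simple enough to show directly. A sketch that derives rate, error percentage, and latency percentiles from an in-memory window of requests (the nearest-rank percentile is a deliberate simplification):

```python
# Compute RED metrics from a window of (status_code, duration_seconds) records.
def red_summary(requests, window_seconds):
    durations = sorted(d for _, d in requests)
    errors = sum(1 for status, _ in requests if status >= 500)

    def quantile(q):
        # Nearest-rank percentile; fine for a sketch.
        return durations[min(len(durations) - 1, int(q * len(durations)))]

    return {
        "rate_rps": len(requests) / window_seconds,   # Rate
        "error_pct": 100.0 * errors / len(requests),  # Errors
        "p50_s": quantile(0.50),                      # Duration
        "p99_s": quantile(0.99),
    }

sample = [(200, 0.05), (200, 0.07), (500, 0.40), (200, 0.06)]
print(red_summary(sample, window_seconds=60))
```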

Practice Exercise

Set up a basic monitoring stack using Docker Compose: run Prometheus (metrics collection), Grafana (visualization), and a simple application instrumented with Prometheus metrics. Configure Prometheus to scrape metrics from your application. In Grafana, create a dashboard showing the RED metrics for your service. Configure an alert rule that fires when the simulated error rate exceeds a threshold. This hands-on setup builds the practical skills needed for production monitoring work.
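A possible starting point for the exercise; service names, ports, and mounted paths are assumptions you can adapt:

```yaml
# docker-compose.yml — minimal monitoring stack for the exercise
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
  app:
    build: .            # your instrumented application, exposing /metrics on 5000
    ports: ["5000:5000"]
```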

The Continuous Improvement Mindset

DevOps is not a destination but a continuous journey of improvement. The practices covered here — automation, monitoring, infrastructure as code, CI/CD pipelines — are tools in service of a deeper goal: enabling teams to deliver software changes to production quickly, safely, and reliably. The measurement that matters is not which tools you use but how long it takes to go from a committed code change to running in production, and how confident you are in that process. The best DevOps teams measure their deployment frequency, lead time for changes, change failure rate, and mean time to recovery (the DORA metrics), and treat these as engineering objectives to improve over time.
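The four DORA metrics fall straight out of a deployment log. A sketch over a hypothetical in-memory log; the record layout and timestamps are invented for illustration:

```python
# DORA metrics from a hypothetical deployment log.
from datetime import datetime, timedelta

# (committed_at, deployed_at, caused_failure, restored_at)
deploys = [
    (datetime(2025, 11, 1, 9), datetime(2025, 11, 1, 11), False, None),
    (datetime(2025, 11, 2, 10), datetime(2025, 11, 2, 15), True, datetime(2025, 11, 2, 16)),
    (datetime(2025, 11, 3, 8), datetime(2025, 11, 3, 9), False, None),
]

days = (deploys[-1][1] - deploys[0][1]).days or 1
deploy_frequency = len(deploys) / days  # deployment frequency (per day)
lead_time = sum((d - c for c, d, _, _ in deploys), timedelta()) / len(deploys)
failures = [x for x in deploys if x[2]]
change_failure_rate = len(failures) / len(deploys)
mttr = sum((r - d for _, d, _, r in failures), timedelta()) / len(failures)

print(deploy_frequency, lead_time, change_failure_rate, mttr)
```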

Consistent application of the principles covered here, combined with ongoing learning and hands-on practice, is what separates those who understand technology conceptually from those who can build and operate real systems. The investment in depth pays dividends for years. Keep learning, keep building, and keep asking the questions that drive deeper understanding.
