DevOps Roadmap -- Part 9: Monitoring & Observability

By Suraj Ahir 2025-11-16 11 min read


You cannot manage what you cannot measure. Monitoring is not optional in production -- it is the difference between knowing your system is down and finding out from angry users. But modern observability goes beyond uptime monitoring. The three pillars -- metrics, logs, and traces -- give you a complete picture of what your distributed system is doing at any moment.

The Three Pillars of Observability

Metrics: Numerical measurements over time. CPU usage, request rate, error rate, response time. Ideal for alerting on quantifiable thresholds. Prometheus collects and stores metrics.

Logs: Timestamped text records of events. Application logs, system logs, audit logs. Ideal for debugging what happened. ELK Stack (Elasticsearch, Logstash, Kibana) or Loki aggregates logs.

Traces: Records of a request's journey through distributed services. Shows how long each service took and where errors occurred. Jaeger, Zipkin, or AWS X-Ray provide distributed tracing.
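
To make the relationship between the three pillars concrete, here is a minimal sketch in Python using only the standard library. It handles one fake request and produces all three signals: a counter increment (metric), a JSON log record (log), and a request ID that could follow the request across services (trace). The names `handle_request`, `request_count`, and the log field names are illustrative assumptions, not the API of Prometheus, Loki, or Jaeger.

```python
import json
import time
import uuid

# In-memory stand-in for a metrics counter (hypothetical; a real service
# would expose this to a metrics system instead).
request_count = {"total": 0, "errors": 0}


def handle_request(path, fail=False):
    """Handle one fake request and return its log record as a JSON string."""
    trace_id = uuid.uuid4().hex      # trace: one ID follows the request everywhere
    start = time.perf_counter()

    request_count["total"] += 1      # metric: a cheap numeric aggregate
    if fail:
        request_count["errors"] += 1

    duration_ms = (time.perf_counter() - start) * 1000
    record = {                       # log: one timestamped event with context
        "ts": time.time(),
        "level": "ERROR" if fail else "INFO",
        "path": path,
        "trace_id": trace_id,
        "duration_ms": round(duration_ms, 3),
    }
    return json.dumps(record)


if __name__ == "__main__":
    print(handle_request("/checkout"))
    print(handle_request("/checkout", fail=True))
    print(request_count)
```

Because the log record carries the same `trace_id` a tracing system would use, you can jump from an error-rate spike to the matching log lines and then to the slow span.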

Prometheus Setup

Docker Compose with Prometheus and Grafana
# docker-compose.yml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
  
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin123  # demo value only; change it before exposing Grafana
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  prometheus-data:
  grafana-data:

prometheus.yml configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
  
  - job_name: "myapp"
    static_configs:
      - targets: ["myapp:8000"]
    metrics_path: /metrics
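
The "myapp" job above only works if the application actually serves metrics at /metrics on port 8000. As a rough sketch of what Prometheus expects to find there, the Python below renders a counter in the Prometheus text exposition format using only the standard library. The `REQUESTS` dictionary, `render_metrics`, `MetricsHandler`, and `serve` are hypothetical names for illustration; a production service would normally use an official client library such as `prometheus_client` rather than hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter, keyed by HTTP status code.
REQUESTS = {"200": 0, "500": 0}


def render_metrics(counts):
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for status, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(REQUESTS).encode("utf-8")
            content_type = "text/plain; version=0.0.4"
        else:
            REQUESTS["200"] += 1     # count every non-metrics request
            body = b"ok\n"
            content_type = "text/plain"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve forever; port 8000 matches the myapp:8000 scrape target."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


if __name__ == "__main__":
    # Print what a scrape would return; call serve() to expose it over HTTP.
    print(render_metrics(REQUESTS), end="")
```

Calling `serve()` and then opening http://localhost:8000/metrics should show the same text that Prometheus scrapes every 15 seconds.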

Key Metrics to Monitor

Essential production metrics
# The RED Method (for services)
# Rate:   Request rate per second
sum(rate(http_requests_total[5m])) by (service)

# Errors: Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# Duration: Response time percentiles
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Infrastructure metrics
# CPU usage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage
(1 - (node_filesystem_free_bytes / node_filesystem_size_bytes)) * 100
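
The percentile queries are the least obvious of these, because a histogram only stores bucket counts, not individual response times. The sketch below is a simplified Python illustration of the estimate `histogram_quantile` makes: find the bucket that contains the requested rank, then interpolate linearly inside it. It assumes cumulative buckets, a lowest bucket that starts at 0, and a final +Inf bucket. It is not Prometheus's actual implementation, and the sample bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    where the last upper bound is math.inf (the "+Inf" bucket).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan

    rank = q * buckets[-1][1]        # how many observations lie below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Made-up latency buckets in seconds: (le, cumulative count).
    sample = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
    for q in (0.5, 0.9, 0.99):
        print(q, round(histogram_quantile(q, sample), 4))
```

The takeaway is that percentile accuracy depends on bucket boundaries: if your SLO threshold is 200ms, it helps to have a bucket boundary at 0.2.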

Prometheus Alerting Rules

Alert on critical conditions (Prometheus evaluates the rules; Alertmanager routes the resulting notifications)
# alerts.yml
groups:
  - name: production
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate {{ $value | humanizePercentage }}"
      
      - alert: DiskSpaceLow
        expr: (node_filesystem_free_bytes / node_filesystem_size_bytes) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10%"
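
The `for:` clause is what keeps a single bad scrape from paging you: the condition has to stay true for the whole duration before the alert fires. The Python sketch below models that behaviour as a small state machine. The function name `evaluate_alert` and the sample values are assumptions for illustration, and the model ignores details of the real rule evaluator such as per-label-set tracking.

```python
def evaluate_alert(samples, threshold, for_seconds):
    """Return the alert state after each evaluation.

    samples: list of (timestamp_seconds, value) in evaluation order.
    States: "inactive", "pending" (true, but not for long enough), "firing".
    """
    states = []
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts            # condition just became true
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None              # any false evaluation resets the timer
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Error ratio evaluated once a minute, threshold 5%, "for: 5m".
    ratios = [0.02, 0.08, 0.09, 0.07, 0.06, 0.08, 0.10, 0.01]
    samples = [(i * 60, r) for i, r in enumerate(ratios)]
    for (ts, ratio), state in zip(samples, evaluate_alert(samples, 0.05, 300)):
        print(f"t={ts:>3}s ratio={ratio:.2f} -> {state}")
```

In this run the ratio crosses 5% at t=60s but the alert only fires at t=360s, once the condition has held for the full five minutes.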

Frequently Asked Questions

What is the difference between monitoring and observability?

Monitoring tells you when something goes wrong (alerting). Observability tells you why it went wrong (debugging). Monitoring is about known unknowns. Observability is about unknown unknowns -- being able to ask arbitrary questions about your system's state.

What is the RED method?

Rate (requests per second), Errors (error rate), Duration (response time). These three metrics cover the external behaviour of any service. If all three look good, users are probably happy. If any degrades, you have a problem to investigate.

What is Grafana and how does it relate to Prometheus?

Prometheus collects and stores time-series metrics. Grafana visualises them with dashboards. Prometheus has a built-in UI for queries, but Grafana provides production-quality dashboards, alerting, and the ability to combine data from multiple sources.

How do I set up log aggregation?

For small setups: Promtail + Loki + Grafana (the lightweight stack), where Promtail ships logs from servers, Loki stores them, and Grafana queries and visualises them alongside metrics in one dashboard. For larger setups: Filebeat + Elasticsearch + Kibana (the Elastic Stack; the classic ELK stack uses Logstash to ship and process logs).

What should I alert on?

Alert on symptoms that affect users, not causes. Alert on: error rate above 1%, response time above 2 seconds, disk space below 15%, service down. Do not alert on CPU at 80% unless it is causing user impact. Every alert should require immediate action -- alert fatigue kills response.

In Part 10, we cover Kubernetes -- the container orchestration platform that runs containerised applications at scale.




SLIs, SLOs, and Error Budgets

Google's SRE framework introduced a disciplined approach to reliability that is now widely adopted. Service Level Indicators (SLIs) measure specific aspects of service behaviour. Service Level Objectives (SLOs) set targets for those indicators. Error Budgets define how much unreliability is acceptable before feature development must pause to focus on reliability work.

Defining SLIs and SLOs
# SLI: measurement of service behaviour
Availability SLI   = successful_requests / total_requests
Latency SLI        = requests_under_200ms / total_requests
Error rate SLI     = error_requests / total_requests

# SLO: target for the SLI (measured over a rolling window)
Availability SLO   = 99.9% (over 30 days)
Latency SLO        = 95% of requests under 200ms (p95)
Error rate SLO     = error rate below 0.1%

# Error budget = 100% - SLO
# 99.9% SLO => 0.1% error budget = 43.2 minutes of downtime per 30 days

# PromQL for the availability SLI over the 30-day SLO window
# (needs at least 30d of retention; the Prometheus default is 15d)
(1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
)) * 100
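
The error budget arithmetic is easy to check by hand. The Python sketch below works it through for a time-based budget (minutes of downtime) and a request-based budget (failed requests). The function names and the sample traffic numbers are assumptions for illustration.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, total_requests, failed_requests):
    """Fraction of the request-based error budget left (negative = overspent)."""
    allowed_failures = total_requests * (100 - slo_percent) / 100
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 0.1% of a 30-day window (43,200 minutes).
    print(round(error_budget_minutes(99.9), 1))
    # Made-up traffic: 1,000,000 requests with 250 failures against a 99.9% SLO.
    print(round(budget_remaining(99.9, 1_000_000, 250), 2))
```

A 99.9% SLO allows about 43.2 minutes over a 30-day window, or about 43.8 minutes if you measure over an average calendar month of roughly 30.44 days. In the second example, 250 failures out of an allowed 1,000 leaves about 75% of the budget unspent.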

Log-Based Alerting with Loki

Alert on log patterns, not just metrics
# Loki alert rule in loki-rules.yml
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorLogRate
        expr: |
          sum by (app) (rate({app="myapp"} |= "ERROR" [5m])) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error log rate for {{ $labels.app }}"
      
      - alert: DatabaseConnectionFailures
        expr: |
          count_over_time({app="myapp"} |= "connection refused" [5m]) > 5
        for: 1m
        labels:
          severity: critical
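
To see what the second rule is counting, here is a simplified Python model of a line filter (`|=`) combined with `count_over_time` for a single log stream. It is a sketch under stated assumptions, not Loki's implementation: the function name, the treatment of the window as (now - window, now], and the sample log lines are all illustrative.

```python
def count_over_time(entries, needle, now, window_seconds):
    """Count log lines containing `needle` with timestamps in (now - window, now].

    entries: list of (timestamp_seconds, line) for one log stream.
    """
    start = now - window_seconds
    return sum(
        1 for ts, line in entries
        if start < ts <= now and needle in line
    )


if __name__ == "__main__":
    # Made-up log lines with timestamps in seconds.
    logs = [
        (100, "ERROR db: connection refused"),
        (200, "INFO request served"),
        (350, "ERROR db: connection refused"),
        (400, "ERROR upstream timeout"),
    ]
    matches = count_over_time(logs, "connection refused", now=400, window_seconds=300)
    print(matches, "matching lines; alert condition met:", matches > 5)
```

Substring matching like this is cheap but literal, so it pays to alert on stable log messages rather than text that may change between releases.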