You cannot manage what you cannot measure. A production system without monitoring is flying blind — you only find out something is wrong when users start complaining. By then, the problem has often been affecting performance for hours. Monitoring and observability give you eyes inside your systems so you know about problems before users do, understand why they happened, and prevent them from recurring.
These terms are related but distinct. Monitoring means watching predefined metrics and alerting when they cross thresholds — CPU above 90%, response time above 500ms, error rate above 1%. You know what to watch because you anticipated it. Observability means your system produces enough data (metrics, logs, traces) that you can answer any question about its internal state, including questions you did not anticipate asking. Highly observable systems make debugging much faster because the answers are already in the data.
Prometheus is the industry-standard metrics system in the cloud-native world. It scrapes (pulls) metrics from your applications and infrastructure at regular intervals and stores them in a time-series database. You query metrics using PromQL (Prometheus Query Language).
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['server1:9100', 'server2:9100']

  - job_name: 'my-app'
    static_configs:
      - targets: ['app:5000']
    metrics_path: '/metrics'
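The 'my-app' job above expects the application to expose a /metrics endpoint on port 5000. A minimal sketch of what that looks like using the official prometheus_client Python library (`pip install prometheus-client`); the simulated latency and error rate are invented for illustration:

```python
# Minimal sketch of an app exposing Prometheus metrics via the official
# prometheus_client library. Metric names match the PromQL examples
# later in this section; the workload itself is simulated.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request():
    """Simulate one request: record its duration and final status."""
    with LATENCY.time():                      # observes elapsed time on exit
        time.sleep(random.uniform(0.001, 0.005))  # pretend to do work
    status = "500" if random.random() < 0.05 else "200"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(5000)   # serves /metrics on port 5000 in a background thread
    for _ in range(100):
        handle_request()
```

In a real service you would increment these metrics inside your request-handling middleware rather than a simulation loop; the exposition side stays the same.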
# Run Node Exporter (exposes server metrics on port 9100)
docker run -d --name node-exporter \
  --net="host" --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter --path.rootfs=/host
# Key metrics available:
# node_cpu_seconds_total
# node_memory_MemAvailable_bytes
# node_filesystem_avail_bytes
# node_network_receive_bytes_total
# CPU usage percentage
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# HTTP request rate per second
rate(http_requests_total[5m])
# Error rate percentage
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
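To build intuition for what histogram_quantile computes, here is a simplified model of its estimation: find the bucket containing the target rank, then interpolate linearly within it. The bucket counts below are invented, and Prometheus's real implementation additionally handles the +Inf bucket and several edge cases:

```python
# Simplified model of histogram_quantile: locate the cumulative bucket
# that contains the target rank, then interpolate linearly inside it.
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # fraction of the way through this bucket's observations
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 50 under 100ms, 80 under 250ms, 95 under 500ms, all under 1s
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100)]
print(histogram_quantile(0.95, buckets))  # 0.5 -> estimated P95 is 500ms
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile falls into.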
Prometheus stores data. Grafana visualizes it. Grafana connects to Prometheus as a data source and lets you build beautiful dashboards. There are thousands of community-built dashboards available — for Node Exporter, databases, Kubernetes, and more — which you can import with a single click using their dashboard IDs.
docker run -d -p 3000:3000 --name grafana -e GF_SECURITY_ADMIN_PASSWORD=admin grafana/grafana
# Access at http://localhost:3000
# Add Prometheus as data source: http://prometheus:9090
# Import dashboard ID 1860 for Node Exporter
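Instead of clicking through the UI, the data source can also be provisioned from a file at container startup. A sketch of Grafana's provisioning format, assuming the file is mounted under /etc/grafana/provisioning/datasources/:

```yaml
# Sketch of a Grafana data-source provisioning file, e.g. mounted at
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Provisioned data sources survive container rebuilds, which makes the setup reproducible rather than a one-off manual step.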
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU is {{ $value }}% for 5 minutes"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
In Part 10, we will cover Kubernetes fundamentals — the container orchestration platform that runs the majority of the world's containerized production workloads.
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars are metrics, logs, and traces. Metrics are numeric measurements over time — CPU usage, request count, error rate, latency percentiles. They are efficient to store and query, and ideal for dashboards and alerting. Logs are timestamped records of discrete events — an HTTP request completed, an error occurred, a user logged in. They provide rich context for debugging specific incidents. Distributed traces track a request as it flows through multiple services, showing exactly where time is spent and where failures occur. Each pillar provides different visibility; comprehensive observability requires all three.
The RED method (popularized by Weaveworks) provides a simple framework for monitoring any service: Rate (requests per second — how much traffic is the service handling?), Errors (error rate — what percentage of requests are failing?), and Duration (latency distribution — how long are requests taking, at P50, P95, P99?). These three metrics, monitored together, give you immediate visibility into service health. An alert on error rate above 1% and P99 latency above 500ms covers most production service health requirements. Pair the RED method with the USE method for resource monitoring: Utilization, Saturation, and Errors for CPU, memory, disk, and network.
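One way to wire the RED method into Prometheus is to precompute the three signals as recording rules. A sketch reusing the metric names from earlier in this section (the rule names follow the common level:metric:operations convention but are otherwise arbitrary):

```yaml
# Sketch of recording rules precomputing the RED signals per job.
groups:
  - name: red_method
    rules:
      - record: job:http_requests:rate5m        # Rate
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:ratio5m         # Errors
        expr: >
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_duration:p99_5m  # Duration
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

Recording rules are evaluated on the Prometheus server at each evaluation_interval, so dashboards and alerts can query the cheap precomputed series instead of re-running the expensive expressions.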
Set up a basic monitoring stack using Docker Compose: run Prometheus (metrics collection), Grafana (visualization), and a simple application instrumented with Prometheus metrics. Configure Prometheus to scrape metrics from your application. In Grafana, create a dashboard showing the RED metrics for your service. Configure an alert rule that fires when the simulated error rate exceeds a threshold. This hands-on setup builds the practical skills needed for production monitoring work.
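A sketch of what that Compose stack might look like; the ./prometheus.yml and ./app paths are placeholders you create yourself as part of the exercise:

```yaml
# Hypothetical docker-compose.yml for the monitoring exercise.
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus

  app:
    build: ./app          # your instrumented application from the exercise
    ports:
      - "5000:5000"
```

Because all three containers share the default Compose network, Prometheus can scrape the app at app:5000 and Grafana can reach Prometheus at prometheus:9090 by service name.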
DevOps is not a destination but a continuous journey of improvement. The practices covered here — automation, monitoring, infrastructure as code, CI/CD pipelines — are tools in service of a deeper goal: enabling teams to deliver software changes to production quickly, safely, and reliably. The measurement that matters is not which tools you use but how long it takes to go from a committed code change to running in production, and how confident you are in that process. The best DevOps teams measure their deployment frequency, lead time for changes, change failure rate, and mean time to recovery (the DORA metrics), and treat these as engineering objectives to improve over time.
Consistent application of the principles covered here, combined with ongoing learning and hands-on practice, is what separates those who understand technology conceptually from those who can build and operate real systems. The investment in depth pays dividends for years. Keep learning, keep building, and keep asking the questions that drive deeper understanding.