What is DevOps and why is it important?

DevOps combines development and operations to automate software delivery. It reduces deployment time, increases reliability, and enables teams to ship features faster with fewer failures.

What salary does a DevOps engineer earn in India?

Entry-level DevOps engineers in India earn Rs.5-10 LPA. Mid-level engineers with 2-4 years of experience earn Rs.12-25 LPA. Senior DevOps engineers with Kubernetes and Terraform expertise earn Rs.25-50 LPA.

Which tools should a DevOps engineer learn first?

Start with Linux, then Git, Docker, GitHub Actions for CI/CD, then Kubernetes and Terraform. The sequence builds logically: each tool prepares you for the next.

How long does it take to become a DevOps engineer?

With dedicated daily practice: 6-9 months full-time or 12-18 months part-time. The key milestone is completing the AWS Solutions Architect certification and building one complete portfolio project.

Is DevOps a good career in 2026?

Yes. DevOps and cloud engineering are among the fastest-growing tech roles globally. The combination of development and operations skills is rare and commands premium salaries across all major tech markets.

DevOps Roadmap Part 9 - Monitoring and Observability


By Suraj Ahir · 2025-11-16 · 11 min read


You cannot manage what you cannot measure. Monitoring is not optional in production -- it is the difference between knowing your system is down and finding out from angry users. But modern observability goes beyond uptime monitoring. The three pillars -- metrics, logs, and traces -- give you a complete picture of what your distributed system is doing at any moment.

The Three Pillars of Observability

Metrics: Numerical measurements over time. CPU usage, request rate, error rate, response time. Ideal for alerting on quantifiable thresholds. Prometheus collects and stores metrics.

Logs: Timestamped text records of events. Application logs, system logs, audit logs. Ideal for debugging what happened. ELK Stack (Elasticsearch, Logstash, Kibana) or Loki aggregates logs.

Traces: Records of a request's journey through distributed services. Shows how long each service took and where errors occurred. Jaeger, Zipkin, or AWS X-Ray provide distributed tracing.
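
To make the three pillars concrete, here is a minimal sketch using only Python's standard library. It records a single request as a metric, a log line, and a trace span that share one trace ID. The names handle_request, metrics, logs, and spans are hypothetical, and the in-memory containers are stand-ins for real backends such as Prometheus, Loki, and Jaeger.

```python
import json
import time
import uuid
from collections import Counter

# In-memory stand-ins, assumed for illustration only.
metrics = Counter()  # metric: a number aggregated across many requests
logs = []            # logs: one timestamped record per event
spans = []           # trace: timed steps that share a trace_id


def handle_request(path):
    """Pretend to serve a request and emit all three signals."""
    trace_id = uuid.uuid4().hex
    start = time.monotonic()
    status = 500 if path == "/fail" else 200  # hypothetical failure path
    duration = time.monotonic() - start

    # Metric: cheap to store, good for alerting on thresholds.
    metrics[("http_requests_total", str(status))] += 1

    # Log: describes what happened in this one request.
    logs.append(json.dumps({
        "ts": time.time(),
        "level": "ERROR" if status >= 500 else "INFO",
        "msg": f"GET {path} -> {status}",
        "trace_id": trace_id,
    }))

    # Trace span: how long this step took within the request.
    spans.append({
        "trace_id": trace_id,
        "name": "handle_request",
        "duration_s": duration,
    })
    return status


if __name__ == "__main__":
    handle_request("/ok")
    handle_request("/fail")
    print(dict(metrics))
    print(logs[-1])
    print(spans[-1])
```

The shared trace_id is the useful part: a spike in the error metric leads you to the matching log lines, and a log line leads you to the trace of that specific request.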

Prometheus Setup

Docker Compose with Prometheus and Grafana

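# docker-compose.yml
# "version" is ignored (with a warning) by Docker Compose v2;
# it is only needed for the legacy docker-compose v1 tool.
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest   # pin a specific tag in production
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus

  # Serves host metrics for the "node" scrape job in prometheus.yml
  node-exporter:
    image: prom/node-exporter:latest

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Demo password only; change it before exposing Grafana to a network
      GF_SECURITY_ADMIN_PASSWORD: admin123
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  prometheus-data:
  grafana-data: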

prometheus.yml configuration

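global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often alerting rules are evaluated

# To load alerting rules, list the rule file here and mount it into the container:
# rule_files:
#   - /etc/prometheus/alerts.yml

scrape_configs:
  # Prometheus scraping its own metrics
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Host metrics; needs a node-exporter container reachable under this name
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  # Your own application; it must expose metrics on port 8000
  - job_name: "myapp"
    static_configs:
      - targets: ["myapp:8000"]
    metrics_path: /metrics   # this is the default, shown for clarity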
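
The "myapp" job above assumes the application answers on port 8000 at /metrics in the Prometheus text exposition format. As a rough sketch of what that endpoint returns, here is a standard-library-only Python server. REQUEST_COUNTS, render_metrics, and Handler are hypothetical names; in practice the official prometheus_client library is the usual choice and handles this for you.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter, keyed by HTTP status code.
REQUEST_COUNTS = {"200": 0, "500": 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for status, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Prometheus scrapes this path every scrape_interval.
            body = render_metrics(REQUEST_COUNTS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            # Any other path counts as one successful application request.
            REQUEST_COUNTS["200"] += 1
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"hello\n")


if __name__ == "__main__":
    # Port 8000 matches the "myapp:8000" target in prometheus.yml.
    HTTPServer(("0.0.0.0", 8000), Handler).serve_forever()
```

Each line of the response is one time series: a metric name, optional labels in braces, and the current value. Prometheus adds the timestamp and the job and instance labels when it scrapes.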

Key Metrics to Monitor

Essential production metrics

# The RED Method (for services)
# Rate:   Request rate per second
sum(rate(http_requests_total[5m])) by (service)

# Errors: Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

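# Duration: Response time percentiles
# Sum the bucket rates by "le" first so the quantile is computed across all instances
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))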

# Infrastructure metrics
# CPU usage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage
(1 - (node_filesystem_free_bytes / node_filesystem_size_bytes)) * 100
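
The histogram_quantile queries above return estimates, not exact percentiles, because a histogram only stores how many requests fell under each bucket boundary. The following Python sketch shows the idea under simplifying assumptions: it linearly interpolates inside the bucket that contains the requested rank. It is an illustration of the technique, not Prometheus's actual implementation, and the bucket values in the usage example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, where the
    last upper_bound is +Inf, mirroring the "le" label in Prometheus.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the largest finite bucket,
                # so the best available answer is that bucket's bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Made-up data: 100 requests, bucket bounds in seconds.
    sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    for q in (0.50, 0.95, 0.99):
        print(q, histogram_quantile(q, sample))
```

The practical consequence is that bucket boundaries decide accuracy. If your latency target is 200ms, a bucket boundary at or near 0.2 seconds gives a much better answer than buckets at 0.1 and 0.5.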

Prometheus Alerting Rules (for Alertmanager)

Alert on critical conditions

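# alerts.yml
# Prometheus evaluates these rules (list this file under rule_files in
# prometheus.yml) and sends firing alerts to Alertmanager for routing.
groups:
  - name: production
    rules:
      - alert: HighErrorRate
        # Aggregate by service so {{ $labels.service }} is populated below
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate {{ $value | humanizePercentage }}"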
      
      - alert: DiskSpaceLow
        expr: (node_filesystem_free_bytes / node_filesystem_size_bytes) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10%"
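
The "for:" field in the rules above means the condition must stay true for that long before the alert fires, which filters out short spikes. A minimal Python sketch of that behaviour follows. AlertRule and its fields are hypothetical names chosen for illustration; only the state names inactive, pending, and firing come from Prometheus.

```python
class AlertRule:
    """Toy model of a threshold alert with a "for" duration."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.pending_since = None  # time the condition first became true

    def evaluate(self, value, now):
        """Return the alert state for one rule evaluation at time `now`."""
        if value <= self.threshold:
            # Condition is false: any pending timer is reset.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    rule = AlertRule(threshold=0.05, for_seconds=300)  # like "> 0.05" with "for: 5m"
    # (seconds since start, observed error ratio) pairs, made up for the demo
    for now, ratio in [(0, 0.10), (150, 0.10), (300, 0.10), (315, 0.01), (330, 0.10)]:
        print(now, ratio, rule.evaluate(ratio, now))
```

Note that a single evaluation below the threshold resets the timer, so a flapping condition can stay pending indefinitely. That is one reason to alert on rates over a window rather than on instantaneous values.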

Frequently Asked Questions

What is the difference between monitoring and observability?

Monitoring tells you when something goes wrong (alerting). Observability tells you why it went wrong (debugging). Monitoring is about known unknowns. Observability is about unknown unknowns -- being able to ask arbitrary questions about your system's state.

What is the RED method?

Rate (requests per second), Errors (error rate), Duration (response time). These three metrics cover the external behaviour of any service. If all three look good, users are probably happy. If any degrades, you have a problem to investigate.

What is Grafana and how does it relate to Prometheus?

Prometheus collects and stores time-series metrics. Grafana visualises them with dashboards. Prometheus has a built-in UI for queries, but Grafana provides production-quality dashboards, alerting, and the ability to combine data from multiple sources.
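
Grafana gets its data by sending PromQL to the Prometheus HTTP API, and you can do the same from a script. The sketch below uses the instant-query endpoint /api/v1/query with Python's standard library. The helper names build_query_url, parse_instant_vector, and query are hypothetical, and the base URL assumes Prometheus is reachable on localhost:9090 as in the Compose file.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload):
    """Turn an instant-vector response into {labels: float_value}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the value arrives as a string
        results[labels] = float(value)
    return results


def query(base_url, promql):
    """Run an instant query; needs a reachable Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=5) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    # Assumes the Compose stack from this article is running locally.
    for labels, value in query("http://localhost:9090", "up").items():
        print(dict(labels), value)
```

The built-in "up" metric is 1 for every target Prometheus scraped successfully and 0 otherwise, which makes it a quick check that your scrape configuration works.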

How do I set up log aggregation?

For small setups: Promtail + Loki + Grafana (the lightweight stack). For larger setups: Filebeat + Elasticsearch + Kibana (the Elastic Stack, formerly known as ELK, with Filebeat as the log shipper). In the Loki stack, Promtail ships logs from servers, Loki stores them, and Grafana queries and visualises them alongside metrics in one dashboard.

What should I alert on?

Alert on symptoms that affect users, not causes. Alert on: error rate above 1%, response time above 2 seconds, disk space below 15%, service down. Do not alert on CPU at 80% unless it is causing user impact. Every alert should require immediate action -- alert fatigue kills response.

In Part 10, we cover Kubernetes -- the container orchestration platform that runs containerised applications at scale.

Key takeaways

Shift security left: scan code, scan dependencies, scan containers, scan IaC — all in CI, before code reaches production.
SAST (static analysis), DAST (dynamic/runtime), SCA (dependency scanning), secrets scanning. Pick one tool per category, automate them all.
Threat modelling is cheap and effective. Spend an hour asking "what could go wrong?" before you spend a month building.
Security is a culture, not a checklist. The team that thinks security is "the security team's problem" will be breached. Don't be that team.


Written by

Suraj Ahir

Cloud & DevOps engineer running four live production services on my own AWS infrastructure. I write everything on this site myself — no ghostwriters, no AI filler.


Disclaimer: Educational content only.

SLIs, SLOs, and Error Budgets

Google's SRE framework introduced a disciplined approach to reliability that is now widely adopted. Service Level Indicators (SLIs) measure specific aspects of service behaviour. Service Level Objectives (SLOs) set targets for those indicators. Error Budgets define how much unreliability is acceptable before feature development must pause to focus on reliability work.

Defining SLIs and SLOs

# SLI: measurement of service behaviour
Availability SLI   = successful_requests / total_requests
Latency SLI        = requests_under_200ms / total_requests
Error rate SLI     = error_requests / total_requests

# SLO: target for the SLI (measured over a rolling window)
Availability SLO   = 99.9% (over 30 days)
Latency SLO        = 95% of requests under 200ms (p95)
Error rate SLO     = error rate below 0.1%

# Error budget = 100% - SLO
# 99.9% SLO => 0.1% error budget = 43.2 minutes of downtime per 30 days
# (0.001 * 30 days * 24 hours * 60 minutes)

# PromQL for availability SLO
(1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
)) * 100
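
The error budget arithmetic is easy to check in a few lines of Python. This is a minimal sketch with hypothetical function names and made-up request counts. It shows how an SLO converts into allowed downtime and how much of the budget a given number of failed requests has consumed.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, good_requests, total_requests):
    """Fraction of the error budget left.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = total_requests * (100 - slo_percent) / 100
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget at all
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures


if __name__ == "__main__":
    print(f"99.9% over 30 days: {error_budget_minutes(99.9):.1f} minutes")
    print(f"99.99% over 30 days: {error_budget_minutes(99.99):.2f} minutes")
    # Made-up traffic: 1,000,000 requests, 500 of them failed.
    left = budget_remaining(99.9, good_requests=999_500, total_requests=1_000_000)
    print(f"budget remaining: {left:.0%}")
```

A common policy is to keep shipping features while the remaining budget is positive and to prioritise reliability work once it reaches zero. The exact thresholds are a team decision, not something the framework dictates.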

Log-Based Alerting with Loki

Alert on log patterns, not just metrics

# Loki alert rule in loki-rules.yml
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorLogRate
        expr: |
          sum by (app) (rate({app="myapp"} |= "ERROR" [5m])) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error log rate for {{ $labels.app }}"
      
      - alert: DatabaseConnectionFailures
        expr: |
          count_over_time({app="myapp"} |= "connection refused" [5m]) > 5
        for: 1m
        labels:
          severity: critical