Infrastructure Monitoring with Prometheus and Grafana

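Prometheus and Grafana are a widely used open-source pairing for infrastructure monitoring. Prometheus scrapes metrics, stores them as time series, and evaluates alerting rules, while Grafana visualizes those metrics in dashboards and can alert on them as well. This guide covers deploying the stack on Kubernetes: installation, scrape configuration, recording and alerting rules, dashboards, and alert routing.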

Prometheus Architecture

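Prometheus Server: Scrapes targets over HTTP, stores the samples in its local time-series database, and evaluates rules
Exporters: Expose metrics from systems that do not serve them natively (for example, node_exporter for host metrics)
Alertmanager: Deduplicates, groups, and routes alerts to receivers such as Slack or PagerDuty
Pushgateway: Holds metrics pushed by short-lived batch jobs so that Prometheus can scrape them later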
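
To make the exporter role concrete, here is a minimal sketch of the text exposition format that Prometheus scrapes over HTTP. It uses only the Python standard library. The function names and the sample metric are hypothetical, and a real service would normally use an official client library instead of formatting lines by hand.

```python
def escape_label_value(value):
    # The text format requires escaping backslash, newline, and double quote.
    return str(value).replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')


def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples maps a tuple of (label_name, label_value) pairs to a number.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            pairs = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data, roughly what a /metrics endpoint would return.
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        {(("method", "GET"), ("status", "200")): 42},
    ))
```

Each scrape is a plain HTTP GET that returns lines like these, and Prometheus attaches the job and instance labels itself.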

Kubernetes Deployment

# Install kube-prometheus-stack (Prometheus Operator, Prometheus, Alertmanager, Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi

Prometheus Configuration

# prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/rules/*.yaml

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
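    relabel_configs:
    # Keep only pods annotated with prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    # Use the prometheus.io/path annotation as the metrics path when it is set
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # Point the scrape address at the port named in the prometheus.io/port annotation
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__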
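
The address rewrite in the last relabel rule is easy to misread, so here is a small sketch that mimics it in Python. Prometheus joins the source label values with ";" and anchors the regex at both ends. The function name is hypothetical, and the sketch covers only this one replace rule, not relabeling in general.

```python
import re

# Same pattern as the relabel rule: host, optional existing port, ";", annotated port.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Mimic the __address__ replace rule for a single target."""
    # Source labels are concatenated with ";" before the regex is applied.
    joined = f"{address};{port_annotation}"
    # Prometheus anchors relabel regexes, so the whole string must match.
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No match: a replace action leaves the target label unchanged.
        return address
    return match.expand(r"\1:\2")


if __name__ == "__main__":
    print(rewrite_address("10.0.0.5:8080", "9102"))
    print(rewrite_address("10.0.0.5", "9102"))
    print(rewrite_address("10.0.0.5:8080", ""))
```

A pod without the port annotation yields an empty second value, so the regex does not match and the discovered address is kept as-is.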

ServiceMonitor for Custom Apps

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
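metadata:
  name: api-monitor
  namespace: monitoring
  labels:
    # With default chart values, kube-prometheus-stack only selects
    # ServiceMonitors labeled with its Helm release name ("monitoring" above)
    release: monitoring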
spec:
  selector:
    matchLabels:
      app: api
  namespaceSelector:
    matchNames:
    - production
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Recording Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-rules
  namespace: monitoring
  labels:
    # Needed for the chart's default rule selector to pick up this object
    release: monitoring
spec:
  groups:
  - name: api.rules
    rules:
    - record: job:http_requests:rate5m
      expr: sum(rate(http_requests_total[5m])) by (job)
    
    - record: job:http_request_duration:p99
      expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
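
The p99 rule relies on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified Python version of that idea, assuming classic cumulative buckets with a +Inf bucket and 0 < q <= 1. It is not the Prometheus implementation, and the bucket values in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Simplified: expects 0 < q <= 1 and a final bucket with upper bound +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev_count = buckets[index - 1] if index > 0 else (0.0, 0.0)
    # Assume observations are spread evenly within the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds: le=0.1, 0.5, 1.0, +Inf.
    sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

Because the estimate can never be finer than the bucket boundaries, bucket bounds should sit close to the latency thresholds you intend to alert on.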

Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
  labels:
    # Needed for the chart's default rule selector to pick up this object
    release: monitoring
spec:
  groups:
  - name: api.alerts
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
        / sum(rate(http_requests_total[5m])) by (service) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate on {{ $labels.service }}"
        description: "Error rate is {{ $value | humanizePercentage }}"
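
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High p99 latency on {{ $labels.service }}"
        description: "p99 latency is {{ $value | humanizeDuration }}"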
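
The "for: 5m" clause means an alert fires only after its expression has been true at every evaluation for five minutes, and until then it is pending. A rough Python simulation of that state machine follows. The function name and sample values are hypothetical, and it assumes one sample per rule evaluation.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each evaluation.

    samples is a list of (timestamp_seconds, value), one per rule evaluation.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # A single false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical error ratios sampled once per minute.
    ratios = [0.01, 0.08, 0.09, 0.07, 0.06, 0.10, 0.10, 0.02]
    samples = [(i * 60, r) for i, r in enumerate(ratios)]
    print(alert_states(samples, threshold=0.05, for_seconds=300))
```

Here the ratio first exceeds 5% at t=60s, so the alert stays pending through t=300s and fires at t=360s. This delay is what keeps brief spikes from paging anyone.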

Grafana Dashboard

# Terraform - Grafana dashboard as code
resource "grafana_dashboard" "api" {
  config_json = jsonencode({
    title = "API Dashboard"
    panels = [
      {
        title      = "Request Rate"
        type       = "timeseries"
        gridPos    = { h = 8, w = 12, x = 0, y = 0 }
        datasource = { type = "prometheus", uid = "prometheus" }
        targets = [{
          expr         = "sum(rate(http_requests_total[5m])) by (service)"
          legendFormat = "{{ service }}"
        }]
      },
      {
        title      = "Error Rate"
        type       = "stat"
        gridPos    = { h = 4, w = 6, x = 12, y = 0 }
        datasource = { type = "prometheus", uid = "prometheus" }
        targets = [{
          expr = "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }]
        fieldConfig = {
          defaults = {
            unit       = "percent"
            thresholds = {
              steps = [
                { value = 0, color = "green" },
                { value = 1, color = "yellow" },
                { value = 5, color = "red" }
              ]
            }
          }
        }
      }
    ]
  })
}

Alertmanager Configuration

# alertmanager.yaml
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  # "matchers" replaces the deprecated "match" field (Alertmanager 0.22+)
  - matchers:
    - severity="critical"
    receiver: 'pagerduty'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/xxx'
    channel: '#alerts'
    
- name: 'pagerduty'
  pagerduty_configs:
  - service_key: 'xxx'
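
The routing tree above can be read as "critical alerts go to PagerDuty, everything else goes to Slack". The sketch below illustrates that first-match behavior in Python under simplifying assumptions: equality-only matching, no "continue" flag, and every child route naming its own receiver. The dictionary layout and the "equals" key are hypothetical and are not Alertmanager's configuration schema.

```python
# Hypothetical in-memory mirror of the routing tree shown above.
ROUTE = {
    "receiver": "slack-notifications",
    "routes": [
        {"equals": {"severity": "critical"}, "receiver": "pagerduty"},
    ],
}


def pick_receiver(route, labels):
    """Return the receiver for an alert's labels: first matching child wins."""
    for child in route.get("routes", []):
        wanted = child.get("equals", {})
        if all(labels.get(name) == value for name, value in wanted.items()):
            # Descend so that nested routes can refine the choice further.
            return pick_receiver(child, labels)
    # No child matched: fall back to this node's receiver.
    return route["receiver"]


if __name__ == "__main__":
    print(pick_receiver(ROUTE, {"alertname": "HighErrorRate", "severity": "critical"}))
    print(pick_receiver(ROUTE, {"alertname": "HighLatency", "severity": "warning"}))
```

Under this reading, HighErrorRate (critical) reaches PagerDuty and HighLatency (warning) falls through to the default Slack receiver.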

Conclusion

Prometheus and Grafana together cover metric collection, storage, visualization, and alerting. With recording rules that precompute expensive queries, alerts routed by severity, and dashboards defined as code, teams gain visibility into system health and can respond quickly to issues.
