Prometheus and Grafana form the foundation of modern infrastructure monitoring. Prometheus collects and stores metrics, while Grafana provides visualization and alerting. This guide walks through deploying the stack on Kubernetes with Helm, then configuring scraping, recording and alerting rules, a Grafana dashboard, and Alertmanager routing.
Prometheus Architecture
- Prometheus Server: Scrapes and stores metrics
- Exporters: Expose metrics from various systems
- Alertmanager: Handles alerts and notifications
- Pushgateway: For short-lived jobs
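To make the division of labor concrete, here is a sketch of how a Prometheus server might be pointed at an exporter and a Pushgateway. The host names `node-exporter` and `pushgateway` are hypothetical; 9100 and 9091 are the default ports of node_exporter and Pushgateway.

```yaml
# Sketch only: host names are placeholders, not taken from this guide.
scrape_configs:
  # An exporter exposes metrics over HTTP; the server pulls them on each scrape.
  - job_name: 'node'
    static_configs:
      - targets: ["node-exporter:9100"]
  # Short-lived jobs push to the Pushgateway, which the server then scrapes.
  - job_name: 'pushgateway'
    honor_labels: true   # keep the job/instance labels set by the pushing job
    static_configs:
      - targets: ["pushgateway:9091"]
```

Alertmanager does not appear here because it is not scraped for alerts; the server sends firing alerts to it, as configured under `alerting` later in this guide.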
Kubernetes Deployment
# Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi
Prometheus Configuration
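When Prometheus is installed through the Helm chart, settings such as retention and storage are chart values. The two `--set` flags from the install command map onto a values file like the one sketched here, which is easier to review and version than a long list of flags. The file name `values.yaml` is an assumption.

```yaml
# values.yaml (hypothetical file name)
# Mirrors the two --set flags used in the install command.
prometheus:
  prometheusSpec:
    retention: 30d            # how long samples are kept
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi  # size of the persistent volume claim
```

Passing this file with `-f values.yaml` in place of the two `--set` flags should produce the same release configuration.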
# prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
rule_files:
  - /etc/prometheus/rules/*.yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the prometheus.io/path annotation, if set, as the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the port from the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
ServiceMonitor for Custom Apps
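A ServiceMonitor selects Services rather than pods, and its `port` field refers to a Service port by name. For the ServiceMonitor in this section to find any targets, a Service along these lines would need to exist in the `production` namespace. This is a sketch: the Service name, the pod label, and port 8080 are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api              # hypothetical name
  namespace: production  # must be listed in the ServiceMonitor's namespaceSelector
  labels:
    app: api             # matched by the ServiceMonitor's spec.selector.matchLabels
spec:
  selector:
    app: api             # assumed label on the application pods
  ports:
    - name: metrics      # the ServiceMonitor endpoint refers to this port name
      port: 8080         # assumed port where the app serves /metrics
      targetPort: 8080
```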
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-monitor
  namespace: monitoring
  labels:
    # With chart defaults, Prometheus only selects ServiceMonitors labeled with the Helm release name
    release: monitoring
spec:
  selector:
    matchLabels:
      app: api
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Recording Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-rules
  namespace: monitoring
  labels:
    # With chart defaults, Prometheus only loads rules labeled with the Helm release name
    release: monitoring
spec:
  groups:
    - name: api.rules
      rules:
        - record: job:http_requests:rate5m
          expr: sum(rate(http_requests_total[5m])) by (job)
        - record: job:http_request_duration:p99
          expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
Alerting Rules
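Recorded series can be queried like any other metric, so an alert can read a precomputed value instead of recomputing it on every evaluation. As a sketch, assuming the `job:http_request_duration:p99` recording rule from the previous section is loaded, a latency alert could look like this. The rule name, alert name, and 1-second threshold are illustrative, and the `release` label assumes the chart's default rule selector.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-latency-recorded      # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring           # assumes the chart's default rule selector
spec:
  groups:
    - name: api.latency.recorded
      rules:
        - alert: HighLatencyByJob # illustrative alert name
          # Reads the recorded p99 series rather than calling histogram_quantile again
          expr: job:http_request_duration:p99 > 1
          for: 5m
          labels:
            severity: warning
```

Note that the recorded series is aggregated by `job`, so this variant cannot break results down by `service` the way the rules below do.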
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
  labels:
    # With chart defaults, Prometheus only loads rules labeled with the Helm release name
    release: monitoring
spec:
  groups:
    - name: api.alerts
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
          for: 5m
          labels:
            severity: warning
Grafana Dashboard
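Both panels in the dashboard definition reference a data source by `uid = "prometheus"`, so a data source with that UID has to exist in Grafana. The kube-prometheus-stack chart normally provisions a Prometheus data source itself. For a standalone Grafana provisioned from files, a definition along these lines would provide it. This is a sketch, and the URL is an assumption that depends on how Prometheus is exposed.

```yaml
# Grafana data source provisioning file (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus                # must match the uid used by the dashboard panels
    access: proxy                  # Grafana's backend proxies the queries
    url: http://prometheus:9090    # assumed address of the Prometheus server
    isDefault: true
```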
# Terraform - Grafana dashboard as code
resource "grafana_dashboard" "api" {
  config_json = jsonencode({
    title = "API Dashboard"
    panels = [
      {
        title      = "Request Rate"
        type       = "timeseries"
        gridPos    = { h = 8, w = 12, x = 0, y = 0 }
        datasource = { type = "prometheus", uid = "prometheus" }
        targets = [{
          expr         = "sum(rate(http_requests_total[5m])) by (service)"
          legendFormat = "{{ service }}"
        }]
      },
      {
        title      = "Error Rate"
        type       = "stat"
        gridPos    = { h = 4, w = 6, x = 12, y = 0 }
        datasource = { type = "prometheus", uid = "prometheus" }
        targets = [{
          expr = "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }]
        fieldConfig = {
          defaults = {
            unit = "percent"
            thresholds = {
              steps = [
                { value = 0, color = "green" },
                { value = 1, color = "yellow" },
                { value = 5, color = "red" }
              ]
            }
          }
        }
      }
    ]
  })
}
Alertmanager Configuration
# alertmanager.yaml
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
Conclusion
Prometheus and Grafana provide a powerful, flexible monitoring solution. By implementing proper recording rules, alerting, and dashboards, teams gain visibility into system health and can respond quickly to issues.


