K8sCalc
observability, 28 May 2026

Kubernetes Monitoring Stack Guide: Prometheus, Loki, Grafana, and Tempo

A complete guide to deploying the metrics, logs, traces, and dashboards stack on Kubernetes — including resource sizing, Helm configs, and alert rule generation.

A complete Kubernetes observability stack covers three signals, metrics (Prometheus), logs (Loki), and traces (Tempo), plus dashboards (Grafana) to view them. Each component has its own storage model, resource profile, and operational concerns. This guide walks through deploying all four on a production Kubernetes cluster, with real resource sizing so you don't run out of memory at 2 AM.

Before deploying, size your stack properly. The Resource Sizing Reference at the end of this guide gives baseline numbers.


Architecture Overview

┌─────────────────────────────────────────────────────┐
│                    Grafana UI                        │
│            (dashboards, explore, alerts)             │
└──────────┬──────────────┬──────────────┬────────────┘
           │              │              │
     (PromQL)        (LogQL)        (TraceQL)
           │              │              │
     Prometheus         Loki           Tempo
     (metrics)         (logs)         (traces)
           │              │              │
    node-exporter   promtail/         OTEL
    kube-state       alloy          Collector
    metrics

Prometheus scrapes metrics from all cluster components and application pods. Loki receives logs forwarded by Promtail or Grafana Alloy running as a DaemonSet. Tempo receives traces from your applications via OpenTelemetry. Grafana sits in front of all three and provides unified dashboards and alerting.


Step 1: Deploy kube-prometheus-stack

The kube-prometheus-stack Helm chart is the standard way to deploy Prometheus, Alertmanager, and a set of pre-built Kubernetes dashboards in Grafana. It installs everything you need to start monitoring the cluster itself.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml

A production-ready prometheus-values.yaml:

prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "45GB"
    scrapeInterval: "30s"
    evaluationInterval: "30s"
    resources:
      requests:
        cpu: "500m"
        memory: "2Gi"
      limits:
        cpu: "2000m"
        memory: "6Gi"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    # Select all PodMonitors and ServiceMonitors, not only those labeled by this Helm release
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        cpu: "50m"
        memory: "128Mi"
      limits:
        cpu: "200m"
        memory: "256Mi"

grafana:
  persistence:
    enabled: true
    storageClassName: longhorn
    size: 5Gi
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"
  adminPassword: "change-this-in-production"
  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true

Key decisions here:

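  • scrapeInterval: 30s. A 15s interval, common in example configs, roughly doubles sample volume (and with it storage and ingestion CPU) for minimal benefit in most cases
  • retentionSize caps the size of stored blocks so Prometheus deletes the oldest data before it fills the PVC and crashes; keep it below the PVC size to leave headroom for the write-ahead log
  • serviceMonitorSelectorNilUsesHelmValues: false means Prometheus will pick up every ServiceMonitor in the cluster, not just the ones carrying this Helm release's labels (podMonitorSelectorNilUsesHelmValues does the same for PodMonitors)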

Use the Prometheus Storage Calculator to calculate the right retention + storage size for your scrape target count and series cardinality.
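
For a rough cross-check by hand, the Prometheus documentation gives a rule of thumb: needed disk space is approximately retention time in seconds × ingested samples per second × bytes per sample, with a sample typically compressing to 1 to 2 bytes. The Python sketch below applies that formula. The figure of 500,000 active series is a hypothetical example rather than a measurement, and the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rule-of-thumb TSDB size: retention seconds * samples/s * bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster: 500k active series, 15d retention as configured above.
at_30s = estimate_tsdb_bytes(500_000, 30, 15)
at_15s = estimate_tsdb_bytes(500_000, 15, 15)

# Reported in GiB so the result can be compared with the 50Gi PVC.
print(f"30s interval: {at_30s / 2**30:.1f} GiB")
print(f"15s interval: {at_15s / 2**30:.1f} GiB")
```

Under these assumptions the 30s interval needs roughly 40 GiB and the 15s interval twice that, which is the reasoning behind pairing a 30s scrape interval with a 50Gi volume. Real usage depends on series churn and label cardinality, so treat the result as an order-of-magnitude check.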


Step 2: Deploy Loki for Log Aggregation

Loki stores logs as compressed chunks and indexes only their labels (there is no full-text index). This makes it far cheaper to run than Elasticsearch, at the cost that every query must first select streams by label and then scan the matching chunks for content. For Kubernetes log aggregation, this trade-off is almost always worth it.

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm upgrade --install loki grafana/loki \
  --namespace monitoring \
  --values loki-values.yaml

loki-values.yaml for a single-binary deployment (suitable for clusters up to ~50 pods):

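# Assumes chart 6.x, where the default deployment mode is simple-scalable
deploymentMode: SingleBinary
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0

loki:
  # Single-tenant: lets log shippers push without an X-Scope-OrgID header
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2026-01-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  storage:
    type: filesystem
  # retention_period is only enforced when the compactor runs retention
  compactor:
    retention_enabled: true
    delete_request_store: filesystem
  limits_config:
    ingestion_rate_mb: 16
    ingestion_burst_size_mb: 32
    max_streams_per_user: 10000
    retention_period: 744h  # 31 days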

singleBinary:
  replicas: 1
  persistence:
    enabled: true
    storageClass: longhorn
    size: 50Gi
  resources:
    requests:
      cpu: "200m"
      memory: "512Mi"
    limits:
      cpu: "1000m"
      memory: "2Gi"

For larger clusters or multi-tenant setups, use Loki's distributed (microservices) mode with object storage (S3/GCS) for the chunk store. Use the Loki Log Storage Calculator to estimate storage requirements based on your log ingestion rate and retention period.
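
As a back-of-the-envelope companion to the calculator, Loki's disk usage can be approximated as daily ingest × retention days ÷ compression ratio, plus some headroom. The sketch below is only that approximation. The compression ratio of 5 and the 20% headroom are assumptions chosen for illustration, and actual ratios vary widely with log format.

```python
def estimate_loki_storage_gb(ingest_gb_per_day, retention_hours,
                             compression_ratio=5.0, headroom=1.2):
    """Approximate on-disk size in GB for a given raw ingest rate.

    compression_ratio and headroom are illustrative assumptions,
    not values taken from Loki itself.
    """
    retention_days = retention_hours / 24
    compressed = ingest_gb_per_day * retention_days / compression_ratio
    return compressed * headroom


# 744h matches retention_period in the values file above.
small = estimate_loki_storage_gb(10, 744)
large = estimate_loki_storage_gb(50, 744)
print(f"10 GB/day: {small:.1f} GB on disk")
print(f"50 GB/day: {large:.1f} GB on disk")
```

If the estimate lands well above the PVC size in the values file, that is a signal to shorten retention or move chunks to object storage.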


Step 3: Deploy Promtail to Forward Logs

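Promtail runs as a DaemonSet and ships logs from every node to Loki. Grafana has deprecated Promtail in favor of Alloy, so prefer Alloy for new installations; for clusters that still use Promtail, install it with: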

helm upgrade --install promtail grafana/promtail \
  --namespace monitoring \
  --set 'config.clients[0].url=http://loki:3100/loki/api/v1/push'

Or use the values file to customize scrape configs:

config:
  clients:
    - url: http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push
  snippets:
    pipelineStages:
      - cri: {}
      - labeldrop:
          - filename
      - multiline:
          firstline: '^\d{4}-\d{2}-\d{2}'
          max_wait_time: 3s

resources:
  requests:
    cpu: "50m"
    memory: "64Mi"
  limits:
    cpu: "200m"
    memory: "128Mi"

The multiline stage is important for Java and Python stack traces — without it, each line of a stack trace becomes a separate log entry, making them nearly impossible to read in Grafana.
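
To make the grouping concrete, the sketch below imitates what a multiline stage does with the firstline pattern from the values file: a line that matches starts a new entry, and any other line is appended to the previous one. It is a simplified Python model, not Promtail's implementation, and it ignores max_wait_time. The sample log lines are invented.

```python
import re

# Same pattern as `firstline` in the pipeline above.
FIRSTLINE = re.compile(r"^\d{4}-\d{2}-\d{2}")


def group_multiline(lines, firstline=FIRSTLINE):
    """Merge continuation lines into the entry started by the last matching line."""
    entries = []
    for line in lines:
        if firstline.search(line) or not entries:
            entries.append(line)          # a new entry starts here
        else:
            entries[-1] += "\n" + line    # continuation of the previous entry
    return entries


raw = [
    "2026-05-28 02:00:01 ERROR request failed",
    "Traceback (most recent call last):",
    '  File "app.py", line 10, in handler',
    "ValueError: bad input",
    "2026-05-28 02:00:02 INFO retrying",
]

entries = group_multiline(raw)
for entry in entries:
    print(repr(entry))
```

The five raw lines collapse into two entries, with the traceback attached to the error that produced it. The pattern only works if every new log entry begins with a YYYY-MM-DD date, so adjust it to your application's timestamp format.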


Step 4: Deploy Tempo for Distributed Tracing

Tempo stores traces in an object-store-friendly format and integrates natively with Grafana for trace visualization:

helm upgrade --install tempo grafana/tempo \
  --namespace monitoring \
  --values tempo-values.yaml

tempo-values.yaml:

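tempo:
  reportingEnabled: false
  resources:
    requests:
      cpu: "200m"
      memory: "512Mi"
    limits:
      cpu: "1000m"
      memory: "2Gi"
  retention: 72h
  # Stated explicitly because the Grafana data source in Step 6 depends on it
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: local
      local:
        path: /var/tempo/traces
      wal:
        path: /var/tempo/wal
  # Receivers are set here rather than through the chart's top-level `config`
  # string: overriding `config` replaces the whole generated Tempo config,
  # including the storage and retention settings above
  receivers:
    otlp:
      protocols:
        http:
          endpoint: "0.0.0.0:4318"
        grpc:
          endpoint: "0.0.0.0:4317"

persistence:
  enabled: true
  storageClassName: longhorn
  size: 20Gi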

For production, replace the local backend with S3 or GCS. Trace blocks then live in object storage, so the pod keeps only its write-ahead log on local disk and the PVC can be much smaller. Use the Tempo Tracing Storage Calculator to estimate storage needs based on spans/second and retention.
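
The spans/second and retention inputs combine in a simple way: raw trace volume is spans per second × average span size × retention in seconds. The Python sketch below computes that product. The span rate and the 300-byte average span size are hypothetical, and the result ignores block compression, so read it as a rough upper bound for those inputs rather than a prediction.

```python
def estimate_tempo_storage_gb(spans_per_second, avg_span_bytes, retention_hours):
    """Uncompressed trace volume over the retention window, in GB (10^9 bytes)."""
    retention_seconds = retention_hours * 3600
    return spans_per_second * avg_span_bytes * retention_seconds / 1e9


# Hypothetical workload: 200 spans/s at ~300 bytes each, 72h retention as above.
estimate = estimate_tempo_storage_gb(200, 300, 72)
print(f"{estimate:.2f} GB of raw spans")
```

For this invented workload the raw volume is close to the 20Gi PVC in the values file, which shows how quickly trace storage grows with span rate.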


Step 5: Instrument Applications with OpenTelemetry

To get traces into Tempo, your applications need to export spans. The OpenTelemetry Collector can act as a gateway, receiving traces from apps and forwarding to Tempo:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.101.0
          args:
            - "--config=/conf/collector.yaml"
          volumeMounts:
            - name: config
              mountPath: /conf
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
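---
# Service so applications can reach the collector by DNS name
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
---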
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
    exporters:
      otlp:
        endpoint: "tempo.monitoring.svc.cluster.local:4317"
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]

Step 6: Configure Grafana Data Sources

Add Loki and Tempo as data sources via ConfigMap (avoids manual clicks). kube-prometheus-stack already provisions Prometheus as the default data source, so it is not repeated here:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      # Prometheus comes from the chart (assumed uid: prometheus); declaring a
      # second default data source makes Grafana's provisioning fail
      - name: Loki
        type: loki
        uid: loki
        url: http://loki.monitoring.svc.cluster.local:3100
      - name: Tempo
        type: tempo
        uid: tempo
        url: http://tempo.monitoring.svc.cluster.local:3200
        jsonData:
          tracesToLogsV2:
            datasourceUid: loki
            filterByTraceID: true
          serviceMap:
            datasourceUid: prometheus

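The tracesToLogsV2 config in Tempo's datasource enables trace-to-log correlation in Grafana: click on a span and jump directly to the relevant log lines. With filterByTraceID enabled, the generated Loki query filters on the trace ID, so your applications must write the trace ID into their log lines for the link to return results. This is the most useful Grafana feature most people don't set up.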
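
To show what that correlation amounts to, the sketch below builds the kind of LogQL query a trace-to-logs link resolves to: a label selector for the stream plus a line filter on the trace ID. This is an illustration of the idea in Python, not Grafana's code. The label names, the pod name, and the trace ID are hypothetical.

```python
def trace_to_logs_query(stream_labels, trace_id):
    """Build a LogQL query: a label selector plus a line filter on the trace ID."""
    selector = ", ".join(
        f'{name}="{value}"' for name, value in sorted(stream_labels.items())
    )
    return f'{{{selector}}} |= "{trace_id}"'


# Hypothetical span attributes mapped to Loki stream labels.
query = trace_to_logs_query(
    {"namespace": "shop", "pod": "checkout-7d9f"},
    "4bf92f3577b34da6a3ce929d0e0e4736",
)
print(query)
```

The query only returns results when the application logs contain the trace ID as text, which is why logging the trace ID is a prerequisite for this feature.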


Step 7: Set Up Alert Rules

Alerting on cluster health, pod failures, and disk pressure is table stakes. Use the Prometheus Alert Rules Generator to generate production-ready alert rules for your stack.

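Essential rules to have, shown in plain Prometheus rule-file format (with kube-prometheus-stack, the same groups go under the spec of a PrometheusRule resource):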

groups:
  - name: kubernetes.rules
    rules:
      - alert: NodeMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} has < 10% memory available"

      - alert: PersistentVolumeUsageHigh
        expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} is over 85% full"

      - alert: PodCrashLooping
        # Fires on repeated restarts rather than on any single restart
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"

      - alert: PrometheusStorageFull
        expr: (prometheus_tsdb_storage_blocks_bytes / prometheus_tsdb_retention_limit_bytes) > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus storage is over 80% of retention limit"
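
Each rule above pairs an expression with a `for` duration, and the duration matters as much as the threshold: the alert fires only after the condition has held continuously for that long. The Python sketch below models that behavior for a "value below threshold" rule such as NodeMemoryPressure. It is a simplified model of the pending-to-firing transition, not Prometheus's rule evaluator, and the sample series is invented.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which a `value < threshold` alert starts firing.

    samples: list of (timestamp_seconds, value) in ascending time order.
    Returns None if the condition never holds for the full duration.
    """
    pending_since = None
    for ts, value in samples:
        if value < threshold:
            if pending_since is None:
                pending_since = ts        # condition just became true
            if ts - pending_since >= for_seconds:
                return ts                 # held long enough: firing
        else:
            pending_since = None          # condition broke: reset the timer
    return None


# Invented memory-available ratio, evaluated every 30s: a short dip that
# recovers at t=150, then a sustained drop starting at t=180.
samples = [(t, 0.25 if t == 150 else 0.08) for t in range(0, 601, 30)]

fired_at = first_firing_time(samples, threshold=0.10, for_seconds=300)
print(f"alert fires at t={fired_at}s")
```

The short dip never fires because the recovery at t=150 resets the timer; the sustained drop fires five minutes after it begins. This is the mechanism that keeps brief spikes from paging anyone.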

Resource Sizing Reference

For a 20-node cluster with 100 application pods:

Component             CPU Request   Memory Request   Storage
Prometheus            500m          3 Gi             50 Gi (15d retention)
Alertmanager          50m           128 Mi           2 Gi
Grafana               100m          256 Mi           5 Gi
Loki                  300m          1 Gi             100 Gi (31d retention)
Promtail (per node)   50m           64 Mi            -
Tempo                 200m          512 Mi           20 Gi (72h retention)
OTEL Collector        100m          128 Mi           -
Total                 ~2.25 CPU     ~6.25 Gi         ~177 Gi

Totals count Promtail once per node (20 × 50m CPU and 20 × 64 Mi memory).

These numbers scale with log volume, metric cardinality, and trace volume. Use the Kubernetes Observability Stack Sizing Calculator to model your specific workload.
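
One way to total the CPU and memory requests is to sum the rows of the table, counting per-node components such as Promtail once for every node. The Python sketch below does that for the per-component figures listed above. The data structure and function name are illustrative.

```python
# (cpu millicores, memory MiB, runs on every node) per component.
COMPONENTS = {
    "prometheus":     (500, 3 * 1024, False),
    "alertmanager":   (50, 128, False),
    "grafana":        (100, 256, False),
    "loki":           (300, 1024, False),
    "promtail":       (50, 64, True),      # DaemonSet: one pod per node
    "tempo":          (200, 512, False),
    "otel-collector": (100, 128, False),
}


def total_requests(node_count):
    """Return (cpu_cores, memory_gib) requested by the whole stack."""
    cpu_m = 0
    mem_mib = 0
    for cpu, mem, per_node in COMPONENTS.values():
        copies = node_count if per_node else 1
        cpu_m += cpu * copies
        mem_mib += mem * copies
    return cpu_m / 1000, mem_mib / 1024


cpu_cores, memory_gib = total_requests(node_count=20)
print(f"CPU requests: {cpu_cores:.2f} cores")
print(f"Memory requests: {memory_gib:.2f} GiB")
```

Changing node_count shows how much of the footprint comes from the DaemonSet: on a 20-node cluster Promtail alone accounts for a full CPU core of requests.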

The biggest variable is Loki storage: a high-traffic application emitting structured logs can generate 10–50 GB/day. Set ingestion rate limits (ingestion_rate_mb in Loki config, a per-tenant limit in MB per second) to prevent runaway costs from a misconfigured application.
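
Because ingestion_rate_mb is expressed per second, it helps to convert it into a daily ceiling before relying on it as a cost guard. The sketch below does the conversion in both directions. It treats 1 MB as 10^6 bytes for simplicity, so the results are approximate, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def daily_ceiling_gb(ingestion_rate_mb):
    """Maximum sustained ingest per day, in GB, allowed by a per-second limit."""
    return ingestion_rate_mb * SECONDS_PER_DAY / 1000


def rate_mb_for_daily_budget(gb_per_day):
    """Per-second limit, in MB, that caps sustained ingest at a daily budget."""
    return gb_per_day * 1000 / SECONDS_PER_DAY


ceiling = daily_ceiling_gb(16)            # the limit used in loki-values.yaml
budget_rate = rate_mb_for_daily_budget(50)
print(f"16 MB/s allows up to {ceiling:.1f} GB/day")
print(f"50 GB/day corresponds to {budget_rate:.2f} MB/s sustained")
```

A 16 MB/s limit still permits more than a terabyte per day of sustained ingest, so it protects against extreme bursts rather than enforcing a 10–50 GB/day budget. A tighter value, with the burst size left higher, is closer to a real budget.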

Use the Grafana Resource Sizing Calculator if you have a large number of users or dashboards — Grafana's memory footprint scales with concurrent users and dashboard complexity more than most people expect.