k3s in Production: HA Setup, Gotchas, and Real-World Guide (2026)

k3s is not "Kubernetes lite" — it's a fully conformant Kubernetes distribution packaged into a single ~60 MB binary. That packaging comes with real trade-offs that matter in production. This guide is what I'd hand to an engineer setting up k3s for the first time with production workloads in mind.

Before sizing your cluster, use the k3s Resource Calculator to estimate node requirements, and the Hetzner k3s Cost Calculator to model full monthly spend.

When to Choose k3s

k3s is the right call when:

›Resource constraints are real. k3s agents run with a 512 MB RAM minimum (the k3s docs list higher minimums for server nodes, so check the requirements page for your version). Vanilla kubeadm clusters need at least 2 GB per node. For edge, IoT, or small VPS deployments, this matters.
›ARM64 support is a requirement. k3s has first-class arm64 support and ships pre-built binaries for arm64/armv7. Running Kubernetes on Raspberry Pi 5 or Oracle's ARM Ampere nodes is significantly easier with k3s than kubeadm.
›You want a single-binary install. k3s ships containerd, flannel, CoreDNS, Traefik, local-path-provisioner, and the metrics server bundled. A fresh VM goes from nothing to a working cluster in under 2 minutes.
›You're running dev/staging clusters that get rebuilt frequently. The install script handles everything; no Ansible playbooks needed for simple setups.
›Small production workloads where operational simplicity outweighs maximum control. A 3-node k3s cluster on Hetzner CPX21 nodes (~€18/mo) is a legitimate production environment for small SaaS products.

k3s is the wrong call when you need maximum control over component versions, have complex networking requirements that conflict with Flannel, or are running at a scale where the embedded etcd limitations become relevant (more on that below).

See the full comparison: k3s vs Kubernetes kubeadm.

What k3s Removes (and Replaces)

Understanding what k3s strips out prevents surprises:

Removed Component	k3s Replacement	Notes
Docker	containerd (bundled)	containerd is the correct runtime anyway
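CNI plugin (none bundled upstream)	Flannel, plus kube-router for network policy	Can be replaced with Cilium; kube-proxy itself still runs, embedded in the k3s binary
in-tree cloud providers	Minimal embedded cloud controller (or external CCM)	An external CCM is needed for cloud LB/storage integration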
in-tree storage drivers	local-path-provisioner	Swap for Longhorn for production storage
etcd (external)	embedded SQLite or etcd	SQLite is single-node only; etcd for HA
ingress controller (none default)	Traefik (v3 on k3s 1.32+, v2 earlier)	Can be disabled and replaced

The key architectural decision: embedded SQLite vs embedded etcd.

›SQLite (default, single-server): Zero setup, great for dev/edge. The datastore is a single local file with no replication, so it cannot back an HA control plane; do not use it for production HA.
›Embedded etcd: Used for HA clusters (3+ server nodes). Full etcd underneath, no external etcd cluster needed. This is what you want for production.
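The reason HA server counts are odd comes from etcd's majority quorum: a cluster of N voting members needs floor(N/2) + 1 of them alive. A minimal sketch of that arithmetic (the function names are made up for illustration):

```shell
#!/bin/sh
# Majority quorum for an etcd cluster of N voting members.
quorum() { echo $(( $1 / 2 + 1 )); }

# How many members can fail before quorum is lost.
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5; do
  printf '%d servers: quorum=%d, tolerated failures=%d\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

Note that 4 servers tolerate exactly as many failures as 3 (one), which is why an even count adds cost without adding resilience.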

HA Setup with Embedded Etcd

A production k3s HA cluster needs an odd number of server nodes (3 or 5) for etcd quorum. Here's the complete setup:

Step 1: First Server Node (cluster init)

curl -sfL https://get.k3s.io | K3S_TOKEN=your-secure-cluster-token sh -s - \
  server \
  --cluster-init \
  --tls-san=<YOUR_LOAD_BALANCER_IP> \
  --tls-san=<SERVER1_IP> \
  --disable=traefik \
  --disable=servicelb \
  --flannel-backend=vxlan \
  --write-kubeconfig-mode=644

--cluster-init bootstraps the embedded etcd cluster. --tls-san ensures the kube-apiserver certificate covers both the load balancer IP and individual server IPs.
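The your-secure-cluster-token placeholder should be a high-entropy secret, since anyone holding it can join nodes to the cluster. One way to generate it, as a sketch that assumes a Linux host with /dev/urandom (the function name is hypothetical):

```shell
#!/bin/sh
# Print a 64-character hex token built from 32 bytes of kernel randomness.
gen_cluster_token() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

token="$(gen_cluster_token)"
# Print only the length so the secret does not end up in terminal scrollback.
echo "generated token of length ${#token}"
```

Store the value in a secrets manager and reuse it as K3S_TOKEN on every server and agent; all nodes must present the same token.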

Step 2: Additional Server Nodes (join etcd cluster)

curl -sfL https://get.k3s.io | K3S_TOKEN=your-secure-cluster-token sh -s - \
  server \
  --server https://<SERVER1_IP>:6443 \
  --tls-san=<YOUR_LOAD_BALANCER_IP> \
  --disable=traefik \
  --disable=servicelb \
  --flannel-backend=vxlan

Repeat for the third server node. Wait for each node to reach Ready state before adding the next:

watch kubectl get nodes
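If you are scripting the join rather than watching by eye, the wait can be automated. The sketch below counts nodes that are not yet Ready from kubectl get nodes --no-headers output; the helper name and the sample node names are assumptions for illustration.

```shell
#!/bin/sh
# Reads `kubectl get nodes --no-headers` output on stdin and prints how many
# nodes have a STATUS that does not start with "Ready".
# ("Ready,SchedulingDisabled" still counts as ready here.)
count_not_ready() {
  awk '$2 !~ /^Ready/ { n++ } END { print n + 0 }'
}

# Demo with hypothetical output; on a real cluster, pipe kubectl in instead:
#   until [ "$(kubectl get nodes --no-headers | count_not_ready)" -eq 0 ]; do sleep 5; done
sample='server-1   Ready      control-plane,etcd,master   10m   v1.32.3+k3s1
server-2   NotReady   control-plane,etcd,master   20s   v1.32.3+k3s1'
printf '%s\n' "$sample" | count_not_ready
```

kubectl wait --for=condition=Ready node --all --timeout=300s achieves much the same thing in one command when you don't need the count.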

Step 3: Agent (Worker) Nodes

curl -sfL https://get.k3s.io | K3S_TOKEN=your-secure-cluster-token sh -s - \
  agent \
  --server https://<LOAD_BALANCER_IP>:6443

Workers join through the load balancer, not a specific server node — this makes failover transparent to agents.

Step 4: Load Balancer for the API Server

You need something in front of the 3 server nodes on port 6443. Options:

›HAProxy (most common, runs on a dedicated VM or the same nodes)
›Hetzner Load Balancer (clean, ~€4/mo, no management overhead)
›kube-vip (runs inside the cluster as a DaemonSet, provides a floating VIP — no external LB required)
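For the HAProxy option, the configuration is plain TCP passthrough to port 6443 on each server. A minimal sketch that renders such a config; the section names, server names, example IPs, and timeout values are all assumptions to adapt, not values from the k3s docs:

```shell
#!/bin/sh
# Print an HAProxy config that balances TCP 6443 across the given server IPs.
render_haproxy_cfg() {
  cat <<'EOF'
defaults
    mode tcp
    timeout connect 5s
    # Long timeouts so kubectl watch/exec streams are not cut off (assumed values).
    timeout client  1h
    timeout server  1h

frontend k3s_api
    bind *:6443
    default_backend k3s_servers

backend k3s_servers
    balance roundrobin
EOF
  i=1
  for ip in "$@"; do
    # "check" enables a basic TCP health check per server.
    printf '    server k3s-%d %s:6443 check\n' "$i" "$ip"
    i=$((i + 1))
  done
}

render_haproxy_cfg 10.0.0.11 10.0.0.12 10.0.0.13
```

Redirect the output to /etc/haproxy/haproxy.cfg and validate it with haproxy -c -f before reloading.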

kube-vip is the zero-infrastructure option:

# Install kube-vip as a DaemonSet (ARP mode for on-prem/Hetzner)
kubectl apply -f https://kube-vip.io/manifests/rbac.yaml

export VIP=192.168.1.100
export INTERFACE=eth0

kubectl apply -f "https://kube-vip.io/k3s-config?interface=${INTERFACE}&vip=${VIP}&mode=arp"

Use the k3s Install Script Generator to generate a complete, parameterized install script for your specific configuration without manually assembling flags.

The SQLite Limit Nobody Mentions

Single-server k3s uses SQLite as its datastore by default. SQLite is a local file. If that server's disk is lost and you have no copy of the database file, your cluster state is gone: there's no quorum, no replica, nothing to fail over to.

The SQLite datastore is explicitly not recommended for production by the k3s maintainers. The docs say it clearly; people ignore it because SQLite "works fine" in dev.

When to use SQLite:

›Local development
›Edge devices where HA is impossible (single node)
›Throwaway environments that get rebuilt from scratch

When to use embedded etcd:

›Any production workload
›Any cluster where downtime costs money
›Any cluster that won't be rebuilt from a script in 5 minutes
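If you inherit a cluster and aren't sure which datastore it runs, the on-disk layout is a quick tell. This sketch assumes the default data directory, where SQLite lives in state.db and embedded etcd in an etcd/ subdirectory; the function name is hypothetical, and the check needs root on a real server.

```shell
#!/bin/sh
# Guess the k3s datastore from the server's db directory.
# etcd is checked first because a leftover state.db may remain after a migration.
detect_datastore() {
  db_dir="${1:-/var/lib/rancher/k3s/server/db}"
  if [ -d "$db_dir/etcd" ]; then
    echo etcd
  elif [ -f "$db_dir/state.db" ]; then
    echo sqlite
  else
    echo unknown
  fi
}

detect_datastore
```

On a machine that is not a k3s server (or without root), this prints unknown.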

Traefik: The Default Ingress You Probably Don't Want

k3s ships Traefik as the default ingress controller (v3 on k3s 1.32 and later, v2 on earlier releases), enabled automatically. For teams already using nginx-ingress or planning to use Gateway API, this creates a conflict.

Disable it at install time (as shown in the server flags above) and install your preferred ingress controller:

# nginx-ingress via Helm
helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.type=LoadBalancer

If you're on a provider without native load balancer integration and plan to run MetalLB or another cloud-specific solution, also disable servicelb (k3s's built-in ServiceLB, which exposes LoadBalancer services through host ports on the nodes) so the two don't compete for the same services.
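For the MetalLB route, the minimum configuration after installing MetalLB itself is an address pool plus an advertisement. A sketch in L2 mode, which assumes a network that permits ARP announcements (many clouds do not); the resource names and the address range are placeholders:

```shell
#!/bin/sh
# Print a MetalLB IPAddressPool and L2Advertisement for the given address range.
render_metallb_pool() {
  range="$1"
  cat <<EOF
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - ${range}
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
EOF
}

render_metallb_pool 192.168.1.240-192.168.1.250
```

Pipe the output into kubectl apply -f - once the MetalLB controller is running.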

Local Path Provisioner: Fine for Dev, Not for Production

k3s ships local-path-provisioner as the default storage class. It creates PVs as directories on the node's local filesystem. This means:

›PVs are not portable across nodes. Each volume is pinned to the node where it was created, so a pod using it cannot be rescheduled elsewhere, and if that node is lost the data goes with it
›No replication
›No snapshots

For any stateful workload in production, replace it with Longhorn:

# Disable local-path-provisioner (add to k3s server flags)
--disable=local-storage

# Install Longhorn
helm repo add longhorn https://charts.longhorn.io
helm repo update
helm upgrade --install longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --create-namespace \
  --set defaultSettings.defaultReplicaCount=3

Use the Longhorn Storage Calculator to size disk capacity with your replication factor in mind.
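The core of that sizing is simple multiplication: every usable gigabyte is stored once per replica. A back-of-the-envelope sketch, where the headroom percentage is an assumed safety margin rather than a Longhorn setting:

```shell
#!/bin/sh
# raw_gib USABLE_GIB REPLICAS [HEADROOM_PERCENT]
# Raw cluster-wide disk needed to hold USABLE_GIB of volume data.
raw_gib() {
  usable="$1"
  replicas="$2"
  headroom="${3:-0}"
  echo $(( usable * replicas * (100 + headroom) / 100 ))
}

raw_gib 200 3       # 200 GiB of volumes at 3 replicas
raw_gib 200 3 25    # same, with 25% headroom
```

The two calls print 600 and 750. Divide the result by your node count to get per-node disk, keeping in mind that Longhorn also reserves part of each disk by default.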

ARM64 Support: What Actually Works

k3s treats ARM64 as a first-class target. Pre-built binaries for arm64 and armv7 ship with every release. Tested on:

›Raspberry Pi 4 / 5 (arm64)
›Oracle Cloud ARM (Ampere A1)
›AWS Graviton (t4g, c7g, m7g)
›Hetzner ARM (CAX series)

One real gotcha: not all container images ship arm64 variants. Before committing to ARM for production, verify your application images, your ingress controller, your storage driver, and your observability stack all have arm64 builds. The most common missing piece is custom internal images built only for amd64.

# Check if an image has arm64 support
docker manifest inspect nginx:latest | jq '[.manifests[].platform | select(.os == "linux")] | [.[].architecture]'
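To run the same check across a whole list of images without depending on jq, a grep over the manifest JSON is enough for a first pass. A sketch, with a hypothetical helper name and made-up sample JSON; note that single-architecture images may not report an architecture in plain docker manifest inspect output, so treat a miss as "go look", not as proof.

```shell
#!/bin/sh
# Succeeds if the manifest JSON on stdin lists the given architecture.
has_arch() {
  grep -Eq "\"architecture\":[[:space:]]*\"$1\""
}

# Demo on a trimmed, hypothetical manifest list.
sample='{"manifests":[{"platform":{"architecture":"amd64","os":"linux"}},{"platform":{"architecture":"arm64","os":"linux"}}]}'
printf '%s\n' "$sample" | has_arch arm64 && echo "arm64 present"

# Against real images (requires docker and registry access):
#   for img in nginx:latest registry.example.com/app:1.0; do
#     docker manifest inspect "$img" | has_arch arm64 || echo "no arm64 found: $img"
#   done
```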

Upgrading k3s: The Right Way

k3s provides the system-upgrade-controller for automated, rolling upgrades:

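# Recent releases ship the Plan CRD as a separate manifest; apply it along with the controller
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml \
  -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml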

Then create an upgrade plan:

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - { key: node-role.kubernetes.io/control-plane, operator: Exists }
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.32.3+k3s1

concurrency: 1 ensures server nodes upgrade one at a time — critical for maintaining etcd quorum during upgrades.
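The plan above only selects control-plane nodes, so agents need a second plan. The sketch below follows the pattern in the k3s upgrade docs: select nodes without the control-plane label, and use a prepare step that waits for the server plan to finish. The plan name k3s-agent is an assumption; k3s-server in the prepare args refers to the server plan's name.

```shell
#!/bin/sh
# Print an agent upgrade Plan for the given k3s version.
render_agent_plan() {
  version="$1"
  cat <<EOF
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-agent
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - { key: node-role.kubernetes.io/control-plane, operator: DoesNotExist }
  # Hold agent upgrades until the server plan (k3s-server) has completed.
  prepare:
    image: rancher/k3s-upgrade
    args: ["prepare", "k3s-server"]
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: ${version}
EOF
}

render_agent_plan v1.32.3+k3s1
```

Pipe the output into kubectl apply -f -, and keep the version identical to the server plan so agents never run ahead of the control plane.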

Key Config File Location

k3s reads its config from /etc/rancher/k3s/config.yaml on startup (servers and agents use the same path) and merges it with any CLI flags. Using the config file is better than long command-line flags for reproducibility:

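# /etc/rancher/k3s/config.yaml (first server node)
# cluster-init belongs on the first server only; joining servers set "server: https://<SERVER1_IP>:6443" instead
cluster-init: true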
tls-san:
  - "192.168.1.100"
  - "k3s.example.com"
disable:
  - traefik
  - servicelb
  - local-storage
flannel-backend: vxlan
write-kubeconfig-mode: "644"
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 10

The etcd snapshot settings are important — k3s takes embedded etcd snapshots to /var/lib/rancher/k3s/server/db/snapshots/ by default. Make sure these are backed up offsite (S3 or similar). The etcd Backup CronJob Generator can generate a CronJob that ships snapshots to S3 automatically.
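If you'd rather script the offsite copy yourself, the first step is finding the newest snapshot file. A minimal sketch, assuming the default snapshot directory; the function name, bucket name, and the use of the AWS CLI are all placeholders for whatever tooling you run:

```shell
#!/bin/sh
# Print the path of the most recently modified file in the snapshot directory.
# Returns non-zero if the directory is empty or unreadable.
latest_snapshot() {
  dir="${1:-/var/lib/rancher/k3s/server/db/snapshots}"
  newest="$(ls -1t "$dir" 2>/dev/null | head -n 1)"
  [ -n "$newest" ] && printf '%s/%s\n' "$dir" "$newest"
}

# Typical use on a server node, as root (bucket name is hypothetical):
#   k3s etcd-snapshot save --name pre-upgrade
#   aws s3 cp "$(latest_snapshot)" s3://my-backup-bucket/k3s/
latest_snapshot || echo "no snapshots found"
```

k3s also has built-in etcd-s3 options that upload snapshots directly; check the docs for your version before building your own pipeline. Whichever route you take, rehearse a restore (k3s server --cluster-reset with --cluster-reset-restore-path) on a scratch cluster before you need it.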