Description
Built a Valkey operator that does the failover/reshard/secret-rotation toil so the runbook doesn't have to (early, feedback welcome)
Running Redis/Valkey on Kubernetes usually means a StatefulSet + a generic Helm chart, with the real operational knowledge living in runbooks and people's heads. The chart boots a pod. It doesn't know how to fail over without dropping writes, reshard without leaving slots half-migrated, or rotate a password without a replication blip. Each of those becomes a runbook step — and a 3am page when the step gets skipped.
So I've been encoding that knowledge into an operator instead. It's called **wellcake** (Apache-2.0): [link]
The thing I care about most for day-2: **a routine config change shouldn't cause an outage.** A naive StatefulSet rollout restarts the primary, and clients eat a write outage of about 15s on something as boring as a config bump. wellcake does the handover first (promote the freshest replica, or issue `CLUSTER FAILOVER` / `SENTINEL FAILOVER`) *before* touching the old primary, so the write-outage window drops to roughly zero. This behavior is opt-in.
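To make the ordering concrete, here is a minimal Python sketch of the handover-first idea for a plain replication topology. It is not wellcake's code: `Node`, `pick_freshest`, and `primary_rollout_steps` are hypothetical names, and "freshest" is assumed to mean the replica with the highest replication offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    role: str         # "primary" or "replica"
    repl_offset: int  # assumed to be read from each node's replication info


def pick_freshest(nodes):
    """Return the replica with the highest replication offset, or None."""
    replicas = [n for n in nodes if n.role == "replica"]
    if not replicas:
        return None
    return max(replicas, key=lambda n: n.repl_offset)


def primary_rollout_steps(nodes):
    """Order the steps so the old primary is restarted only after handover."""
    primary = next(n for n in nodes if n.role == "primary")
    target = pick_freshest(nodes)
    if target is None:
        # No replica to hand over to: the restart will interrupt writes.
        return [("restart", primary.name)]
    return [("promote", target.name), ("restart", primary.name)]
```

With a primary `valkey-0` and replicas at offsets 990 and 998, the plan promotes the replica at 998 and only then restarts `valkey-0`. A real controller would re-read offsets immediately before promoting, since they move while the rollout is in progress.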
Other toil it absorbs:
- Cluster bootstrap / scale / reshard as declarative spec, with Atomic Slot Migration on Valkey 9.1+, so an interrupted reshard can't leave you with open slots.
- No-restart password + TLS-cert rotation (live `ACL SETUSER`, cert-manager auto-reload).
- S3 backup/restore, multi-region replication, Prometheus metrics + alerts, and a `kubectl valkey` plugin for whoever's on call.
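For the password-rotation item, the usual no-restart pattern relies on ACL users being able to hold several valid passwords at once. The sketch below is a hypothetical Python helper, not wellcake's implementation, and the two-phase split is an assumption about how such a rotation is typically sequenced. It only builds the commands; sending them to the server is left out.

```python
def rotation_commands(user, old_password, new_password):
    """Return the two ACL SETUSER phases as argument lists."""
    if old_password == new_password:
        raise ValueError("new password must differ from the old one")
    # Phase 1: the ">" rule adds a password, so old and new are both accepted.
    add_new = ["ACL", "SETUSER", user, ">" + new_password]
    # Phase 2: the "<" rule removes a password. Run it only after clients
    # and replicas have switched to the new credential.
    drop_old = ["ACL", "SETUSER", user, "<" + old_password]
    return add_new, drop_old
```

Keeping the phases separate is the point: between them both credentials work, so nothing has to reconnect at a fixed moment.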
One `ValkeyCluster` CRD covers all four topologies (Standalone, Replication, Sentinel, Cluster), and the design decisions live in ADRs rather than commit messages.
Honest status: solo-maintained v0, not battle-tested at scale yet. I'd love to hear where this breaks operationally — what's the failure mode you'd worry about most running it on call?
[link]
[handle]