Built a Valkey operator that does the failover/reshard/secret-rotation toil so the runbook doesn't have to (early, feedback welcome)

Running Redis/Valkey on Kubernetes usually means a StatefulSet + a generic Helm chart, with the real operational knowledge living in runbooks and people's heads. The chart boots a pod. It doesn't know how to fail over without dropping writes, reshard without leaving slots half-migrated, or rotate a password without a replication blip. Each of those becomes a runbook step — and a 3am page when the step gets skipped.

So I've been encoding that knowledge into an operator instead. It's called **wellcake** (Apache-2.0): [link]

The thing I care about most for day-2: **a routine config change shouldn't cause an outage.** A naive StatefulSet rollout restarts the primary, and clients eat a roughly 15s write outage on something as boring as a config bump. wellcake does the handover first (promote the freshest replica, or issue `CLUSTER FAILOVER` / `SENTINEL FAILOVER`) *before* touching the old primary, so the write-outage window drops to near zero. This behavior is opt-in.
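To make the handover-first ordering concrete, here is a minimal Python sketch. It is not wellcake's actual code: the `Pod` type, the function names, and the idea of ranking replicas by replication offset (as reported by `INFO replication`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    """Hypothetical view of one Valkey pod, for illustration only."""
    name: str
    role: str         # "primary" or "replica"
    repl_offset: int  # replication offset, e.g. read from INFO replication


def pick_freshest_replica(pods: list[Pod]) -> Pod:
    """Return the replica that has replicated the most data."""
    replicas = [p for p in pods if p.role == "replica"]
    if not replicas:
        raise ValueError("no replica available to promote")
    return max(replicas, key=lambda p: p.repl_offset)


def handover_plan(pods: list[Pod]) -> list[tuple[str, str]]:
    """Order the steps so the old primary is restarted only after a handover."""
    primaries = [p for p in pods if p.role == "primary"]
    if len(primaries) != 1:
        raise ValueError("expected exactly one primary")
    target = pick_freshest_replica(pods)
    return [
        ("promote", target.name),        # hand writes over first
        ("restart", primaries[0].name),  # old primary is no longer serving writes
    ]


pods = [
    Pod("valkey-0", "primary", 1000),
    Pod("valkey-1", "replica", 990),
    Pod("valkey-2", "replica", 998),
]
plan = handover_plan(pods)
```

With the sample pods above, the plan promotes `valkey-2` (the replica with the highest offset) and only then restarts `valkey-0`. A real controller would also need to re-read offsets immediately before promoting and handle the remaining replicas, which this sketch leaves out.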

Other toil it absorbs:

- Cluster bootstrap / scale / reshard as a declarative spec, with Atomic Slot Migration on Valkey 9.1+, so an interrupted reshard can't leave you with open slots.

- No-restart password + TLS-cert rotation (live `ACL SETUSER`, cert-manager auto-reload).

- S3 backup/restore, multi-region replication, Prometheus metrics + alerts, and a `kubectl valkey` plugin for whoever's on call.
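As an illustration of the no-restart password rotation item, the sketch below builds an overlap-style sequence of `ACL SETUSER` commands: add the new password, then remove the old one once clients have moved over. The `>password` and `<password` rules are standard ACL syntax. The function name and the two-phase flow are assumptions, and this is not necessarily how wellcake sequences the rotation.

```python
def rotation_commands(user: str, old_password: str, new_password: str) -> list[list[str]]:
    """Build ACL commands (as argument lists) for a two-phase password rotation."""
    if not new_password:
        raise ValueError("new password must not be empty")
    if old_password == new_password:
        raise ValueError("new password must differ from the old one")
    return [
        # Phase 1: add the new password. The old one still works,
        # so connected and reconnecting clients are unaffected.
        ["ACL", "SETUSER", user, f">{new_password}"],
        # Between the phases, clients and replicas switch to the new password
        # (out of band, not modeled here).
        # Phase 2: remove the old password once nothing uses it any more.
        ["ACL", "SETUSER", user, f"<{old_password}"],
    ]


commands = rotation_commands("app", "old-secret", "new-secret")
```

Each inner list can be passed to a client's raw command interface. During the window between the two commands both passwords are valid, which is what lets the rotation happen without restarting the server.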

One `ValkeyCluster` CRD covers all four topologies (Standalone, Replication, Sentinel, Cluster), and the design decisions live in ADRs rather than commit messages.

Honest status: solo-maintained v0, not battle-tested at scale yet. I'd love to hear where this breaks operationally — what's the failure mode you'd worry about most running it on call?

[link]
[handle]
