Zorky CRM

Stack
devops, github, helm, kubernetes, prometheus, redis, rest
Description
Built a Valkey operator that does the failover/reshard/secret-rotation toil so the runbook doesn't have to (early, feedback welcome)

Running Redis/Valkey on Kubernetes usually means a StatefulSet + a generic Helm chart, with the real operational knowledge living in runbooks and people's heads. The chart boots a pod. It doesn't know how to fail over without dropping writes, reshard without leaving slots half-migrated, or rotate a password without a replication blip. Each of those becomes a runbook step, and a 3am page when the step gets skipped.

So I've been encoding that knowledge into an operator instead. It's called **wellcake** (Apache-2.0): [link]

The thing I care about most for day-2: **a routine config change shouldn't cause an outage.** A naive StatefulSet rollout restarts the primary and clients eat a ~15s write outage on something as boring as a config bump. wellcake does the handover first (promote the freshest replica, or `CLUSTER FAILOVER` / `SENTINEL FAILOVER`) *before* touching the old primary, so the window is ~0 (opt-in).

Other toil it absorbs:

- Cluster bootstrap / scale / reshard as declarative spec, with Atomic Slot Migration on Valkey 9.1+, so an interrupted reshard can't leave you with open slots.
- No-restart password + TLS-cert rotation (live `ACL SETUSER`, cert-manager auto-reload).
- S3 backup/restore, multi-region replication, Prometheus metrics + alerts, and a `kubectl valkey` plugin for whoever's on call.

One `ValkeyCluster` CRD covers all four topologies (Standalone, Replication, Sentinel, Cluster), and the design decisions live in ADRs rather than commit messages.

Honest status: solo-maintained v0, not battle-tested at scale yet. I'd love to hear where this breaks operationally: what's the failure mode you'd worry about most running it on call?

[link] [handle]
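
To make the "handover first" ordering concrete, here is a minimal sketch of the decision, not wellcake's actual code. It assumes the controller already has a snapshot of each replica's replication offset and link state (the kind of data `INFO replication` reports); the names `Replica`, `pick_freshest`, and `plan_primary_restart` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    """Hypothetical snapshot of one replica's replication state."""
    name: str
    repl_offset: int  # how far this replica has applied the primary's stream
    link_up: bool     # whether its link to the primary is currently up


def pick_freshest(replicas: List[Replica]) -> Optional[Replica]:
    """Return the connected replica with the highest offset, or None if none qualifies."""
    healthy = [r for r in replicas if r.link_up]
    if not healthy:
        return None
    return max(healthy, key=lambda r: r.repl_offset)


def plan_primary_restart(primary: str, replicas: List[Replica]) -> List[str]:
    """Order the steps so the old primary is restarted only after a handover."""
    target = pick_freshest(replicas)
    if target is None:
        # No safe handover target: only the naive restart is left,
        # which is the case that costs a write outage.
        return [f"restart {primary}"]
    return [f"promote {target.name}", f"restart {primary}"]


if __name__ == "__main__":
    snapshot = [
        Replica("valkey-1", repl_offset=1000, link_up=True),
        Replica("valkey-2", repl_offset=1200, link_up=False),  # ahead but disconnected
        Replica("valkey-3", repl_offset=1100, link_up=True),
    ]
    for step in plan_primary_restart("valkey-0", snapshot):
        print(step)
```

The sketch skips a disconnected replica even when its offset is higher, on the assumption that a replica with a broken link is not a safe promotion target. A real operator would also need to wait for the promotion to complete and for clients to follow it before restarting the old primary.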