When config edits start feeling like deploys
A rollout is misbehaving. An operator opens the internal dashboard and pauses it. The UI flashes green. They come into the engineering channel and ask the question that makes everyone sit up:
is this actually applied yet?
The honest answer is: probably, in another thirty seconds, on most of the fleet.
That answer is fine on a quiet day. During an incident it falls apart. The value they edited lives in a database row, and a hundred application worker processes have that row cached in memory. The cache TTL is all that stands between the save click and everyone seeing the new value. This post is about how I got from "thirty seconds, mostly" to "yes, everywhere, within a second" without making the database the read path. Several alternatives sat on the whiteboard before I picked a direction, and the part that finally made the design feel right was less about the channel I picked and more about what I deliberately stopped putting in the channel.
The kind of config I'm talking about
Specifically: a rollout-allocation system (which variant does this user see for a given product change?) and a feature-toggle registry (is X currently on?). Both live as ordinary database rows. Both are read on every request through nested service calls. A hundred lookups per request isn't unusual. So the data has to be in process memory. Touching the database on the read path was off the table from day one.
What I wanted: writes hit the database. Every running process eventually reflects the new value. That "eventually" needs to be about a second across the fleet. If a save says "saved," every worker either converges or the save fails loudly. Inconsistent fleets where nobody knows it's inconsistent are the worst possible outcome. People stop trusting fast paths the moment they catch one lying.
A handful of properties shaped every alternative I weighed:
- The database is the only source of truth. Not a side cache, not a sidecar, nothing else.
- The read path stays in process memory.
- The propagation channel either delivers reliably, or fails loudly when it doesn't.
- No new piece of distributed infrastructure to keep alive if avoidable.
One framing note before I get into the alternatives. The specifics below are Rails on Postgres because that's what I built this on, but the bones of the design aren't Rails-specific. The structural requirements are a transactional commit hook on the database side, an in-process cache, and a Kubernetes client that supports the watch protocol. Watch-capable clients are mature in Go (client-go and controller-runtime, by far the best-documented and the reference implementation everyone else borrows from), Python, Java, and Rust. If you're on Django, a Spring service, or a Rust monolith, the same shape transposes and the framework name-drops below are scaffolding.
Approaches I considered
Periodic polling
The simplest answer. Cache the records in process memory and periodically refresh from the source on a timer. No new infrastructure, no propagation channel to operate. The dealbreaker for the shape I needed is the freshness floor. The convergence window is bounded by the TTL by construction, so making a config edit feel like a button click means a TTL of seconds, and a TTL of seconds means every pod hits the source on every refresh, mostly to rehydrate things that didn't change. The wider the fleet, the larger the load floor created by the polling itself. Polling is a legitimate answer when "minutes-fresh" is the right answer for the workload. It isn't here.
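For contrast, the polling shape is a handful of lines. RolloutAllocation here is a stand-in for whichever ActiveRecord model holds the config rows:

class PollingCache
  def initialize(ttl_seconds: 30)
    @ttl_seconds = ttl_seconds
    @entries = {}
    @refreshed_at = Time.at(0)
  end

  def fetch(id)
    refresh! if Time.now - @refreshed_at > @ttl_seconds
    @entries[id]
  end

  private

  # Every process re-reads every row on every refresh, changed or not.
  # Shrink the TTL to seconds and this runs fleet-wide, constantly.
  def refresh!
    @entries = RolloutAllocation.all.index_by(&:id)
    @refreshed_at = Time.now
  end
end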
A variant I briefly weighed was caching in Redis as a layer between Postgres and the application processes, with the application polling Redis instead of Postgres. That moves the load off the primary, but the freshness vs load tradeoff is fundamentally the same. You've just bought yourself a second cache to invalidate.
Push via Redis pub/sub
If polling is the problem, pushing is the answer. Publish a "this changed" event from after_commit, and every subscribed worker refreshes. Sub-second propagation in the happy path, lightweight to add to a stack that already runs Redis. The Redis documentation is also explicit about what it gives you: pub/sub is fire-and-forget, at-most-once delivery [1]. No persistence, no acknowledgment, no replay. A worker that's between subscriptions when a publish goes out misses it. A network blip lasting a few seconds drops every message published in that window with no error and no metric on the subscriber side. Pub/sub is a fanout primitive, not a state-synchronization primitive. For "every node converges on the latest value" you want either a channel with replay (Redis Streams, a durable queue) or a channel whose unreliability is loud rather than silent. Pub/sub is neither, and the silent-staleness failure mode is the worst possible thing for an operator who needs to trust the propagation path during an incident.
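A sketch of that shape, to make the failure mode concrete; the channel name and the cache call are invented:

Thread.new do
  Redis.new.subscribe("config.changed") do |on|
    on.message do |_channel, payload|
      cache.refresh_for(JSON.parse(payload)["id"])
    end
  end
end
# Publisher side: redis.publish("config.changed", { id: record.id }.to_json)
# Anything published while this thread is between subscriptions is gone.
# No replay, no acknowledgment, no error raised on the subscriber side.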
Vendor feature-flag platforms
A managed feature-toggle platform [2] like LaunchDarkly, Flagsmith, or Unleash handles propagation for you, with SDK-based polling or event-driven updates from the vendor. They've grown well past their boolean-toggle origins, and most do support JSON payloads now. The harder problem is structural. Pushing the rollout-allocation records into a vendor's value field means signing up for a dual source of truth on purpose, with the database holding the canonical row and the vendor holding a serialized projection of it. The two are supposed to agree, but the moment two stores claim truth, the next incident is "wait, which one was actually right?" The same instinct underlies the older Netflix-Archaius family of dynamic-config libraries. They're well built, but the problem they solve is broader than mine and the runtime layer carries flexibility I wouldn't use.
A dedicated distributed config store (Consul, etcd, ZooKeeper)
Run a separate coordination service with watch APIs [3]. The strongest version of "push, not poll." Watches are durable and replayable, the consistency model is well-defined (Raft on etcd, ZAB on ZooKeeper), and the read pattern matches what I wanted. The cost is operational. A new distributed system to keep alive, with its own backup story, upgrade story, network partition behavior, and on-call story. Production was already on Kubernetes, which means etcd was already in the picture, but as Kubernetes' backing store rather than something I could access directly. Reaching for a parallel etcd alongside the one Kubernetes was already running felt like the wrong shape for the size of the problem.
A related move I did make later was to use a Kubernetes object as the propagation channel and let the cluster's existing etcd carry the bytes. The next section is about how that landed.
What I built
I picked Kubernetes ConfigMap as the propagation channel. ConfigMaps are stored in etcd, the control plane replicates them, and there are two ways to subscribe to changes from inside a pod: mount the ConfigMap as a volume of files and watch the filesystem, or talk to the Kubernetes API directly and use its watch interface [4]. Both are documented. I started with the volume-mount approach because it was the smaller change.
The publisher patches the ConfigMap from after_commit on the model. The subscriber, in each running worker, watches for changes and refreshes the in-process cache for the affected record. The first version mounted the ConfigMap at /etc/config/... and used the listen gem (already in the Gemfile via Rails) to react to filesystem events.
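That first version looked roughly like this; the mount path and the cache call are stand-ins:

listener = Listen.to("/etc/config/markers") do |modified, added, _removed|
  (modified + added).each do |path|
    cache.refresh_for(File.basename(path).delete_suffix(".json").to_i)
  end
end
listener.start # the thread this starts is the one kubelet's sync kept killing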
In a minikube proof-of-concept it worked. In real Kubernetes, two things broke it.
Latency came first. Volume-mounted ConfigMap updates aren't instant. The Kubernetes docs phrase the worst case as "the kubelet sync period plus the cache propagation delay" [5], and on the default settings I was running the combination could stretch to a couple of minutes between the patch landing and the new file content showing up inside the pod. That was already worse than the polling baseline I had ruled out.
The symlink swap took longer to figure out. When kubelet updates a ConfigMap volume, it doesn't modify files in place. It writes them to a fresh timestamped directory and atomically swaps a ..data symlink to point at the new directory (Kubernetes calls this AtomicWriter) [6]. From inotify's perspective the user-visible file just got deleted (its old symlink target is gone), not modified. To handle this, your application has to interpret "deleted" as "atomically replaced," re-establish the watch on the new path, and not panic. My code assumed deleted meant deleted, and the listener threads kept dying with bare "thread terminated" lines. I kept thinking it was memory pressure, since the listen gem holds a persistent inotify file descriptor and its own worker threads, and in a busy worker that adds up. The actual cause was the listen gem's internal state being torn apart by the symlink flip every time kubelet did its sync.
I tried hardening the listener for a while. A supervisor thread that restarted on EACCES, a downgrade of the listen gem to an older version, more defensive event handling. I was hoping one of those would land. None of it stuck. The fix wasn't to harden the listen gem against ConfigMap semantics. The fix was to stop using filesystem-watching for this and use the Kubernetes API directly.
The kubeclient gem was already in the Gemfile, used to patch the ConfigMap from the publisher side. The same gem exposes a watch interface that talks to kube-apiserver directly. Switching to it gave me structured ADDED / MODIFIED / DELETED events instead of filesystem deletes pretending to be modifications, a resourceVersion cursor for resuming after disconnects, and no kubelet sync delay. Events fire as soon as kube-apiserver accepts the change. The Kubernetes documentation explicitly endorses this pattern as one of the supported ways for an application to subscribe to ConfigMap changes. The same shape, in Go rather than Ruby, is what client-go calls a Reflector and what controller-runtime wraps as an Informer [7].
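Stripped of everything else, the watch side is just this; the namespace and ConfigMap name are placeholders:

kube_client.watch_config_maps(
  namespace: "production",
  field_selector: "metadata.name=broadcast-channel",
).each do |event|
  event.type   # "ADDED", "MODIFIED", or "DELETED"
  event.object # the full ConfigMap as of this event's resourceVersion
end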
One more design call inside the ConfigMap-based approach mattered more than the channel choice. Early on, I stored the full record JSON inside the ConfigMap, with each key being the record id and each value being the serialized row. The watcher pulled the whole payload out of the watch event and refreshed its cache without going back to the database. I kept asking myself why I was using the same store for two different jobs, the bell that says "something changed" and the database that says "here is the value." I narrowed the payload to a marker shape within a day, for two reasons.
The size argument came first. The 1 MiB ConfigMap size limit Kubernetes documents and enforces is real, and with full record payloads the size budget shrinks faster than the entry count grows. A record might be a few hundred bytes today and a kilobyte tomorrow as a richer field gets added, and you can cross from "comfortable headroom" to "patches start failing" without anyone noticing. The failure mode would be nasty. The database write commits first, then after_commit fires, the ConfigMap patch fails with a 422, and from the dashboard's point of view the save succeeded while every running process kept the old value. The size headroom was real but finite and shrinking. The safer thing was to stop spending it on payload bytes.
The structural argument was the bigger one. By storing values inside the ConfigMap I'd given myself two sources of truth. The database had the canonical row. The ConfigMap had a serialized projection of it. The two were supposed to be identical, and most of the time they were, but the moment two stores claim truth, the next incident becomes "wait, which one was actually right?", and you find out about it by accident, which is the worst way to find out about anything.
The answer wasn't a bigger ConfigMap or validation between stores. The answer was that the ConfigMap shouldn't have been a store at all.
Markers, not values
The shipped design is one sentence. The ConfigMap doesn't hold values. It holds markers.
A marker is the smallest thing that lets a subscriber answer two questions: "did this entity change?" and "since when?" For the rollout-allocation records, that's {id, name, updated_at}. For the feature-toggle records, the marker carries one extra field, the boolean state, because the toggle's whole truth is a name and a bit and copying the bit into the marker is a verbatim copy of the row rather than a serialized projection that can drift.
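Concretely, the ConfigMap data ends up looking like this; the ids, names, and timestamps are invented:

data:
  "42.json": '{"id":42,"name":"checkout-redesign","updated_at":"2025-01-07T14:03:22Z"}'
  "57.json": '{"id":57,"name":"new-billing-page","state":true,"updated_at":"2025-01-07T09:41:05Z"}'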
The database stays the source of truth. The ConfigMap is the bell. When the bell rings, every running process walks back to the database, asks what changed, and updates its cache. If you've worked with event-driven invalidation before, you'll recognise the shape: this is Martin Fowler's Event Notification pattern, where the event is a thin "something happened" signal and the consumer fetches the actual state.
The publisher is two short methods. Build the marker, patch the ConfigMap with a single key in data. The after_commit hook is the trigger.
# Called from after_commit on the model.
def marker_for(record)
  { id: record.id, name: record.name, updated_at: record.updated_at.iso8601 }
end

def broadcast!(record)
  patch_body = { data: { "#{record.id}.json" => marker_for(record).to_json } }
  kube_client.patch_config_map(configmap_name, patch_body, namespace)
end
The subscriber is even smaller. One thread per worker process opens a watch and reacts to events.
# Runs in a long-lived thread; receives a watch event,
# diffs against the previous snapshot, refreshes per-record.
def apply_snapshot(data)
  data.each do |key, content|
    next if previous_snapshot[key] == content
    cache.refresh_for(id_from(key))
  end
  self.previous_snapshot = data.dup
end
Worth noticing what the subscriber is doing in those four lines. Kubernetes watches deliver full objects, not field-level deltas: every watch event on the ConfigMap carries the entire data map, regardless of which single key was actually patched. The diff against the previous in-memory snapshot is what works out which key in this event actually changed. Without it, every patch would invalidate every cached record on every pod, which collapses the whole point of doing this. The diff turns the firehose back into a per-record signal: one record changes, exactly one cache entry refreshes, on every pod, and untouched records stay warm.

That per-record granularity is itself a design choice, not a free property of the channel. Each record gets its own key in the ConfigMap ({id}.json) rather than the dataset being collapsed into a single value. Pack everything into one key and the diff degenerates to "the blob changed," with no per-entity signal to recover. I benchmarked the diff step on its own and it landed in nanoseconds per event, which surprised exactly nobody. The id_from(key) step parses "42.json" back to 42. The naming convention is the only thing tying a ConfigMap entry to a database row, which keeps the channel oblivious to the schema on the other side.
The interesting move isn't the ConfigMap, and it isn't the API watch. It's the refusal to put anything other than "this changed" into the propagation channel. Markers stay tiny (~100 bytes each), so the 1 MiB budget holds thousands of records before partitioning would matter. The database stays the only thing claiming truth. The dual-store class of bugs disappears entirely because there's only one store.
In production code, those methods don't sit loose. They're override hooks on two base classes that own the kubeclient setup, the patch wrapping, and the watch loop with 410 Gone recovery. A new record type added to the broadcast set is two short subclasses, one on each side.
# Publisher base. Subclasses override marker_for and pass in the
# target configmap_name. The kubeclient setup, the patch wrapping,
# and the key naming convention live in one place.
class BasePublisher
  def initialize(kube_client:, configmap_name:, namespace:)
    @kube_client = kube_client
    @configmap_name = configmap_name
    @namespace = namespace
  end

  def broadcast!(record)
    patch_body = { data: { key_for(record) => marker_for(record).to_json } }
    @kube_client.patch_config_map(@configmap_name, patch_body, @namespace)
  end

  # Subclass overrides this.
  def marker_for(record)
    raise NotImplementedError
  end

  private

  def key_for(record)
    "#{record.id}.json"
  end
end
The watcher base is where the operationally interesting code lives. The run! loop hydrates from a LIST, opens a watch from the resourceVersion the LIST returned, and re-LISTs whenever the watch tears down (including the routine 410 Gone recycle). Subclasses override refresh_for to wire the cache for their record type.
# Watcher base. Subclasses override refresh_for to wire the cache
# for their record type. LIST hydration, watch loop, 410 Gone
# recovery, and snapshot diffing all live here.
class BaseWatcher
  def initialize(kube_client:, configmap_name:, namespace:)
    @kube_client = kube_client
    @configmap_name = configmap_name
    @namespace = namespace
    @previous_snapshot = {}
  end

  def run!
    loop do
      cm = @kube_client.get_config_map(@configmap_name, @namespace)
      apply_snapshot(cm.data.to_h)
      watch_from(cm.metadata.resourceVersion)
    rescue Kubeclient::HttpError => e
      sleep(error_backoff(e)) # 410 Gone or transient API error.
    end
  end

  private

  def watch_from(resource_version)
    @kube_client.watch_config_maps(
      namespace: @namespace,
      resource_version: resource_version,
      field_selector: "metadata.name=#{@configmap_name}",
    ).each do |event|
      next unless %w[ADDED MODIFIED].include?(event.type)
      apply_snapshot(event.object.data.to_h)
    end
  end

  def apply_snapshot(data)
    data.each do |key, value|
      next if @previous_snapshot[key] == value
      refresh_for(id_from(key))
    end
    @previous_snapshot = data.dup
  end

  # Subclass overrides this.
  def refresh_for(id)
    raise NotImplementedError
  end

  def id_from(key)
    # kubeclient can surface data keys as symbols; normalize before parsing.
    key.to_s.delete_suffix('.json').to_i
  end

  def error_backoff(error)
    error.is_a?(Kubeclient::ResourceNotFoundError) ? 30 : 1
  end
end
A concrete subclass is small: a BasePublisher subclass with marker_for(record) defined, a BaseWatcher subclass with refresh_for(id) calling into the appropriate cache, and a boot-time call to start the watcher thread from the worker process initializer. Everything else lives in the bases.
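Spelled out anyway, as a sketch; RolloutAllocation and RolloutCache are hypothetical stand-ins:

class RolloutPublisher < BasePublisher
  def marker_for(record)
    { id: record.id, name: record.name, updated_at: record.updated_at.iso8601 }
  end
end

class RolloutWatcher < BaseWatcher
  def refresh_for(id)
    # Walk back to the database for the changed row; the marker holds no value.
    RolloutCache.refresh(id)
  end
end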
The flow end-to-end:
flowchart LR
UI[Dashboard] -->|save| DB[(Database)]
DB -->|after_commit| PUB[Publisher]
PUB -->|patch ConfigMap| KAPI[kube-apiserver]
KAPI --> CM[ConfigMap markers]
CM -->|WATCH event| W[Subscriber per worker]
W -->|read updated row| DB
W -->|refresh in-memory index| C[(Process cache)]
SVC[Request handler] -->|hot path| C
Each pod runs one subscriber thread per Puma worker process. That's a little redundant. Every worker on every pod opens its own watch stream against the same ConfigMap. The alternative is a sidecar or some intra-pod fanout, both of which add machinery I didn't want. Letting every process be responsible for its own correctness keeps the mental model simple.
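In Puma terms, that per-worker thread starts from the worker boot hook; start_watchers! is a hypothetical wrapper that builds the watcher subclasses and spawns their threads:

# config/puma.rb
on_worker_boot do
  # After fork, so each worker process owns its own connection
  # to kube-apiserver and its own in-process cache.
  ConfigBroadcast.start_watchers!
end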
The RBAC shape
The publisher and the subscriber need different things from RBAC. The publisher needs patch and get on the specific ConfigMap, and RBAC lets you pin those verbs to a single resourceNames entry. The publisher can't touch any other ConfigMap, only this one. The subscriber needs watch and list on ConfigMaps, and here Kubernetes makes a choice you don't get to opt out of: list and watch cannot be scoped by resourceNames [8]. The verb applies at the kind level. So every subscriber pod has list/watch on every ConfigMap in its namespace.
# Publisher: tight scope, single named resource.
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["broadcast-channel"]
  verbs: ["get", "patch"]
# Subscriber: list and watch cannot be pinned to a named resource,
# so the grant is namespace-wide by construction.
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]
That sounds worse than it is. The data on the wire is markers, not values, so a worker accidentally reading a neighbour ConfigMap learns nothing about anyone's data. The realistic mitigation, if anything sensitive lives in the same namespace, is to put the propagation ConfigMap in its own dedicated namespace. For a namespace whose other ConfigMaps are equally non-sensitive, the wide grant is acceptable. Worth naming, in any case. The asymmetry between named-verb scoping and list/watch scoping is one of those things you only learn when you write the Role and ask why your resourceNames-scoped watch rule never matches a request.
What this gets you, and what it doesn't
Convergence in production sits around 250ms median, under two seconds at p99. End-to-end per-pod processing cost per propagation, the part the subscriber spends from receiving an event to having the cache slot refreshed, sits around 5-10ms. Most of that is the database round trip to fetch the changed row. The slowest path through the system is pod-restart catch-up, which is hydration-bounded rather than watch-bounded. A fresh pod does a LIST against the ConfigMap, builds its cache from the database for the relevant records, then opens a watch from the resourceVersion the list returned. The watch failure path collapses to the same operation. When kube-apiserver returns a 410 Gone because the resourceVersion has aged out of the watch cache, the subscriber clears its cursor and re-hydrates. No special-case recovery code. Recovery is hydration plus a fresh watch.
What this doesn't give you is queue semantics. The Kubernetes watch is what I'm calling a convergence channel, not a delivery channel. Within a watch session, events arrive in resourceVersion order, and the list-then-watch pattern handles disconnects: when the watch returns 410 Gone, the subscriber re-lists and re-establishes from the new resourceVersion. That's enough to converge on the latest value. It isn't enough for "I need exactly-once, in-order processing of every change." For that, reach for a real durable queue.
Three questions I'd want any design in this space to answer. What happens if a process restarts mid-stream? What happens if the signal is missed? Who owns the truth, and where in the code is that ownership enforced? The shape that holds is the one where the channel carries no truth, the source of truth never moves, and the recovery path collapses to the same operation as the boot path.
Failure modes, old and new
The earlier alternatives each had a specific failure that ruled them out. How the shipped design fares against each:
- Polling's freshness floor: resolved. Convergence is sub-second.
- Pub/sub's silent message loss: resolved. The watch carries resourceVersion and re-lists on 410 Gone, so a missed sequence surfaces as "this watch is too far behind, here's the current state" rather than as quiet drift.
- Vendor feature flags' dual source of truth: resolved. Markers, not values, means the database stays the only thing claiming truth.
- A dedicated coordination service's operational cost: resolved. The cluster's existing etcd carries the bytes through the Kubernetes API.
- Volume-mount + listen's kubelet sync delay and AtomicWriter symlink swap: resolved. The watch interface fires the moment kube-apiserver accepts the patch.
What this design does add are two new failure surfaces, one on each side of the channel.
Publisher side: the dual-writes window
The publisher fires from after_commit, which guarantees the database write is durable before the patch goes out, but the database write and the ConfigMap patch are still two writes to two different systems. A window exists between the commit and the patch where the network can blip, kube-apiserver can return a 500, or the worker process can OOM. When a failure lands in that window, the row is durable but the marker never updates, and no subscriber refreshes.
The basic shipped design makes this loud rather than silent without anything special. The kubeclient patch raises on non-2xx, the after_commit exception propagates up to the controller, and the dashboard shows a 500 instead of the cheerful "saved" flash. The operator reads "save failed" and retries. For the kinds of edits this design carries, a flag flip or a percentage change, the retry is naturally idempotent because the same row update plus the same marker patch will either both succeed or both fail again with the same error. The database state survives a failed retry intact because the commit already happened.
Two improvements I considered and didn't ship, in case your shape needs more than the basic loud-failure path:
- Retry with backoff inside the after_commit block. Catches transient network blips that resolve in seconds and spares the operator the flake. Trade-off: it hides infrequent kube-apiserver issues you might want surfaced in error tracking.
- A slow reconciliation loop (every minute or two) that walks recently-updated rows and re-publishes any whose marker has fallen behind. Belt-and-suspenders if your subscribers can't tolerate the rare missed update. Costs a periodic scan and a definition of "stale enough to re-publish." Genuinely overkill for my workload, mentioned for shapes where the operator-retry path isn't enough; a sketch follows this list.
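What that loop would look like, roughly, as one more method on the publisher base; the model name and the ten-minute window are stand-ins:

# Run from a scheduled job every minute or two. Re-publishes any row
# whose marker in the ConfigMap has fallen behind the database.
def reconcile!
  published = @kube_client
    .get_config_map(@configmap_name, @namespace)
    .data.to_h.transform_keys(&:to_s)
  RolloutAllocation.where("updated_at > ?", 10.minutes.ago).find_each do |record|
    broadcast!(record) if published[key_for(record)] != marker_for(record).to_json
  end
end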
Subscriber side: a long-lived idle thread
Each subscriber is a Ruby thread blocked on a long-poll HTTP connection to kube-apiserver. In steady state the thread costs almost nothing. It sleeps in the kernel until the next event arrives, CPU is essentially zero, memory is one thread stack. The 410 Gone recycle is the only non-quiescent activity: when the watch cursor falls behind the apiserver's watch cache, the subscriber re-LISTs and re-establishes from the new resourceVersion. For a low-churn ConfigMap like this one the recycle is rare. For higher-churn resources it can fire every few minutes, still effectively free.
The failure modes are the ones long-lived idle threads tend to have. The thread can die on an uncaught exception and stop watching forever. The HTTP connection can hang half-open after a network event without raising an error. The re-LIST after a 410 Gone can itself fail. Each of these results in silent staleness on that one pod, with its cache no longer refreshing and no one knowing.
The mitigations are the standard set for long-lived consumers, none of them exotic:
- Wrap the watch loop in a supervisor that restarts the thread on uncaught exceptions. The same shape I tried with the listen gem (where it didn't save it from AtomicWriter). Here it works, because the failures are real exceptions rather than the listen gem's torn-apart internal state. A sketch follows this list.
- Set a connection read timeout so a half-open socket surfaces as an error rather than blocking forever. The supervisor then catches it.
- Emit a "time since last successful event" metric per pod and alert on the long tail. The one failure mode the supervisor can't see is the case where the thread is blocked but no events are arriving because the connection silently dropped without raising. The metric catches that.
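The first and third items, sketched; ErrorTracker and Metrics stand in for whatever error-reporting and metrics clients you already run:

def start_watcher!(watcher)
  Thread.new do
    loop do
      watcher.run! # only returns if something unexpected escapes its own rescue
    rescue StandardError => e
      ErrorTracker.notify(e)
      sleep 5 # back off instead of spinning on a persistent failure
    end
  end
end

# Inside apply_snapshot, bump a freshness gauge so a silently dead
# connection shows up as an aging metric rather than as nothing at all:
#   Metrics.gauge("config_watch_last_event_seconds", Time.now.to_i)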
For my workload, the operator-retry path on the publisher side was enough on its own. The reconciliation loop never earned its place.
It's been running in production for about six months now, happy and uneventful. No pages, no fire drills. The numbers above are steady-state observations, not cherry-picked best cases. Whatever else might bite at this scale would have shown up already.
If you've worked through something similar, or hit a failure mode I didn't, I'd love to hear about it. maria@runbookpages.com. War stories most welcome. The closer to "tried this, here's what broke" the better.
Footnotes
[1] Redis docs, Pub/Sub. The page is direct about the contract: messages are delivered to currently-subscribed clients only, with no acknowledgment and no replay. https://redis.io/docs/latest/develop/interact/pubsub/
[2] Pete Hodgson, Feature Toggles (aka Feature Flags), on martinfowler.com. The canonical taxonomy of toggle types and the operational tradeoffs of vendor-managed flag platforms. https://martinfowler.com/articles/feature-toggles.html
[3] etcd documentation, Why etcd. A short articulation of what a coordination service buys you and where the line sits between an application database and a metadata store backed by a consensus protocol. https://etcd.io/docs/latest/learning/why/
[4] Kubernetes API concepts, Efficient detection of changes. Documents the list-then-watch contract that the API exposes, the role of resourceVersion, and the 410 Gone semantics that make recovery deterministic. https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
[5] Kubernetes docs, ConfigMaps, section "Mounted ConfigMaps are updated automatically". Projected keys update on the kubelet's periodic sync plus an additional cache propagation delay that depends on configMapAndSecretChangeDetectionStrategy (watch propagation, TTL, or zero for direct API). On default settings the worst-case lag can stretch to minutes. https://kubernetes.io/docs/concepts/configuration/configmap/#mounted-configmaps-are-updated-automatically
[6] Kubernetes source, pkg/volume/util/atomic_writer.go. The implementation kubelet uses to project ConfigMap and Secret volumes. The header comment describes the timestamped-directory write plus ..data symlink swap that makes the update atomic from the consumer's perspective, and incidentally breaks any consumer watching the user-visible filenames for in-place modifications. https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/atomic_writer.go
[7] client-go, Reflector (godoc). The reference implementation of list-then-watch in Go, wrapped by the higher-level Informer abstraction that controller-runtime builds on. https://pkg.go.dev/k8s.io/client-go/tools/cache#Reflector
[8] Kubernetes RBAC docs, Referring to resources. The resourceNames field cannot apply to list, watch, or deletecollection verbs (or to create, since the object name isn't yet known), so wide grants are the only available option for those verbs. https://kubernetes.io/docs/reference/access-authn-authz/rbac/#referring-to-resources