Rolling PgBouncer without dropped queries
Somebody on the SRE channel pings: "PgBouncer rolling again?" The application graphs catch up half a minute later, with a clean little spike of 5xx and a clean little spike of PG::ConnectionBad. The numbers come down. The deploy completes. Nobody investigates because everybody knows what happened: PgBouncer rolled, some open connections got severed, the application was on the wrong end of it. The next deploy will look the same.
This post is about what it takes to make that spike go to zero. Not "smaller." Zero. The interesting part is that almost the whole answer lives in the client, not in PgBouncer.
What PgBouncer is, and why anyone runs one
Postgres backends aren't cheap. Each backend is a forked process that allocates a megabyte or two of memory after warmup (more under load) and adds per-backend coordination overhead on the server. A single Postgres instance handles a few hundred concurrent backends comfortably. A few thousand application worker processes (the typical shape of a Rails monolith on Kubernetes) can't each hold an open backend.
The standard answer is a connection pool. Rails ships with one inside ActiveRecord, scoped per process. That handles the per-process side, but it doesn't help you across processes: if every worker holds ten connections, a hundred-pod fleet holds a thousand connections, and you've blown past Postgres's max_connections before the first request of a busy hour.
PgBouncer sits between the application processes and Postgres and multiplexes. Application processes connect to PgBouncer with their usual Rails pool. PgBouncer holds a much smaller pool of real Postgres backends and hands them out per transaction (or per session, depending on configuration). The application never knows the difference. Postgres never sees more than the configured backend count.
A few alternatives exist. AWS RDS Proxy is the managed equivalent, useful if you don't want to operate a proxy yourself, with the trade-offs of per-vCPU billing, partial prepared-statement support, and an extra network hop. PgCat is a Rust rewrite with first-class sharding and better multithreading, less mature but worth tracking. Some teams skip pooling entirely by tuning Postgres backend memory down and provisioning a much larger instance. None of these change the shape of the problem this post is about. PgBouncer is what most production Rails-on-Kubernetes setups end up with.
The pooling modes you can pick from
PgBouncer offers three pooling modes[1], each with a different contract with the application.
Session pooling is the most permissive. PgBouncer assigns a backend to a client session and keeps it for the entire lifetime of the client connection. Same as if PgBouncer wasn't there, just with the bookkeeping handled. You get full SQL semantics, including session-scoped state like temp tables, session GUCs, and advisory locks. The cost is zero pooling benefit if your clients hold connections for a long time. Effectively a bouncer that never sends anyone home.
Transaction pooling is the mode most Rails teams settle on. PgBouncer assigns a backend at the start of a transaction and releases it when the transaction commits or rolls back. A single application process can multiplex many transactions over a much smaller PgBouncer pool. Session-scoped state breaks here. Anything Postgres tracks across statements outside a transaction can fail or behave surprisingly, including prepared statements (without specific PgBouncer config), session GUCs, and advisory locks. Most Rails apps don't rely on session state on the hot path, so this mode covers the vast majority of traffic.
Statement pooling is the most aggressive. A backend is assigned per statement and released immediately. Multi-statement transactions don't work at all in this mode, so it's rarely useful for Rails.
The real choice is session vs transaction pooling, and the answer is almost always transaction pooling unless you have a workload that genuinely needs session state. The trade-offs you sign up for, like advisory locks being unsafe and prepared statement caching needing care, are well documented and avoidable in application code.
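For concreteness, here's a minimal pgbouncer.ini sketch for transaction pooling. The hostname, sizes, and layout are illustrative, not recommendations:

; pgbouncer.ini — minimal transaction-pooling sketch; host and sizes illustrative
[databases]
app_production = host=10.0.0.5 port=5432 dbname=app_production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
; client sockets PgBouncer will accept
max_client_conn = 5000
; real Postgres backends per user/database pair
default_pool_size = 40
; recycle backends by age: the server-side twin of the client patch below
server_lifetime = 3600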
Three layers, not one
If you want to think about rollouts cleanly, you have to think about three connection layers, not one.
The first layer is the ActiveRecord pool, scoped to the application worker process. A Ruby object holding a list of connection wrappers, with a checkout/check-in dance for thread-safety. Each entry in this pool corresponds to a real TCP connection.
The second layer is the TCP socket itself. From the kernel's point of view this is just an open file descriptor talking to a remote endpoint. The kernel doesn't know it's a Postgres connection. It doesn't even know the endpoint is PgBouncer rather than Postgres. It only knows there's a socket, and that the socket either has bytes to read, or doesn't, or has been closed by the remote side.
The third layer is the backend connection from PgBouncer to Postgres, which PgBouncer manages independently. PgBouncer holds its own pool of these. The mapping between "client connection from a Rails worker" and "backend connection to Postgres" is dynamic. In transaction pooling mode, a single Rails connection might use ten different Postgres backends over its lifetime.
The reason the three-layer view matters is that a PgBouncer rollout breaks layer two from PgBouncer's side. The Rails worker still believes its layer-one entry is healthy. The kernel might or might not have noticed the close yet, depending on TCP timeout settings. The next time Rails checks out that connection and tries to issue a query, the socket fails and you get PG::ConnectionBad. That isn't a connection pool bug. It's a layer mismatch.
Why a rollout is the awkward case
PgBouncer is a stateful proxy, and rolling it without losing in-flight work is harder than rolling a stateless web tier. Rolls happen often: configuration changes, node draining for autoscaler activity, image bumps for security patches, taints from cluster maintenance, every deploy of the proxy itself. None of these are rare events. If a rollout costs you a spike of PG::ConnectionBad every time, you're paying that cost weekly.
The Kubernetes mechanics are clean enough. New pods come up and pass readiness, the Service stops routing new connections to an old pod as soon as it enters Terminating, and the kubelet sends that pod SIGTERM. The problem is what SIGTERM means to PgBouncer.
PgBouncer responds to SIGTERM by entering a state called WAIT_FOR_CLIENTS[2]. It stops accepting new client connections, lets in-flight transactions complete, and waits for existing clients to disconnect on their own. If the clients are well-behaved and disconnect quickly, the pod terminates cleanly. If the clients aren't well-behaved, and Rails workers aren't particularly well-behaved here because they have no reason to disconnect, the pod sits in Terminating until Kubernetes loses patience and sends SIGKILL after terminationGracePeriodSeconds. That SIGKILL severs every remaining connection at once, which is exactly the PG::ConnectionBad spike you were trying to avoid.
The other shutdown signal, SIGINT, behaves differently. PgBouncer enters WAIT_FOR_SERVERS, finishes any in-flight transactions on its backend connections, and then actively disconnects every remaining client. Faster shutdown. No SIGKILL risk. But the active disconnection is itself the failure mode you wanted to avoid: any application thread that's mid-request between two queries will see its connection severed, and the next query fails with the same PG::ConnectionBad.
You can build either of these. Most teams running PgBouncer have variants of both lying around. Neither is enough on its own.
The reason neither is enough is the same in both cases: the client isn't participating in the rollout. The client doesn't know PgBouncer is rotating. The client believes the connection is fine until it discovers the connection isn't. The shutdown approach is trying to fix this from the proxy side, which is the wrong side.
This isn't a gap I diagnosed alone. PgBouncer's maintainers have been clear about it. On the GitHub thread for zero-downtime pod rotation[3], a project contributor put it bluntly: "unfortunately there is not much pgbouncer can do to solve this. There is a patch [to add this functionality to the protocol] but it will be awhile before it gets merged, released, incorporated into drivers, etc." The maintainer's recommended workflow includes the line "and then all your clients should reconnect," with no mechanism for actually getting them to. The proposed fix lives at the Postgres protocol level and hasn't shipped, in PgBouncer or in any client driver. Until it does, the client side has to fill the gap.
The missing piece is in the client
The fix is to make the client recycle its own PgBouncer connections on a schedule, before PgBouncer ever needs to rotate.
Concretely: every connection in the ActiveRecord pool gets a birth time stamped on first checkout. On every subsequent checkout, the pool checks the connection's age. If the age exceeds a configurable maximum lifetime, the pool removes the entry, closes the underlying socket, and checks out a fresh one. The fresh connection gets routed by the Kubernetes service to whichever PgBouncer pod is currently healthy, which means traffic naturally drains off any pod that's about to be terminated.
End-to-end, a query lifecycle with the patch in place looks like this:
sequenceDiagram
participant App as Rails handler
participant Pool as ActiveRecord pool
participant Sock as TCP socket
participant PB as PgBouncer pod
participant DB as Postgres backend
App->>Pool: lease_connection (per query)
Note over Pool: check connection age vs MAX_LIFETIME
alt connection too old
Pool->>Sock: disconnect (recycle)
Pool->>Sock: open fresh socket
Note over Sock: K8s service routes the new TCP to a healthy pod
end
Pool-->>App: connection
App->>Sock: send SQL
Sock->>PB: forward
PB->>DB: assign backend (transaction-mode)
DB-->>PB: result
PB-->>Sock: bytes back
Sock-->>App: result rows
App->>Pool: check in
The lifetime check is the only piece outside what Rails already does. Everything else (per-query lease in 7.2+, the K8s service routing TCP on connect, PgBouncer's transaction-mode pooling) is stock behaviour.
Rails was the ecosystem holdout on this. Go's pgxpool has had MaxConnLifetime (and MaxConnLifetimeJitter) for years. Python's psycopg_pool has max_lifetime. Both ship the recycle-stale-connections pattern in the standard pool. Rails 7.x doesn't, so I shipped it as a small prepend patch on ActiveRecord::ConnectionAdapters::ConnectionPool. Rails 8.1 has since landed the same idea in core as a max_age connection-pool option[4], with pool_jitter to avoid every connection in the fleet recycling at exactly the same wall-clock instant. The in-core implementation retires connections at check-in rather than at check-out, but it addresses the same root cause. If you're on 8.1, the patch below is one less thing to maintain. For everyone still on 7.x, this is what I ran[5]:
module ConnectionMaxLifetime
  MAX_LIFETIME_SECONDS = ENV.fetch("CONNECTION_MAX_LIFETIME_SECONDS", "480").to_i

  def checkout(*args)
    connection = super
    return connection unless MAX_LIFETIME_SECONDS > 0

    # Monotonic clock: immune to NTP corrections and wall-clock jumps.
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    # Stamp a birth time on first checkout only.
    unless connection.instance_variable_defined?(:@birth_time)
      connection.instance_variable_set(:@birth_time, now)
    end

    age = now - connection.instance_variable_get(:@birth_time)
    return connection if age <= MAX_LIFETIME_SECONDS

    # Too old: evict from the pool, close the socket, and recurse to
    # check out a replacement (which gets a fresh birth time).
    remove(connection)
    connection.disconnect!
    checkout(*args)
  end
end

ActiveRecord::ConnectionAdapters::ConnectionPool.prepend(ConnectionMaxLifetime)
Two things make this work in practice. The first is Rails 7.2's move to per-query connection leasing[6]. In earlier Rails, the connection was checked out at the request boundary and held for the whole request, so a stale connection wouldn't get the lifetime check until the request finished. With per-query checkout, every query inside a request gets the lifetime check, which means stale connections get recycled mid-request, transparently.
The second is the interplay with idle_timeout (which Rails uses to reap unused pool entries; it defaults to 300 seconds) and the default reaping_frequency. Idle connections age out of the pool naturally; the lifetime patch handles the in-use ones. Together, no client connection lives longer than MAX_LIFETIME + idle_timeout, regardless of traffic shape.
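In database.yml terms, the knobs this section assumes look like the sketch below. The host is illustrative; the rest mirror the defaults discussed above:

production:
  adapter: postgresql
  host: pgbouncer.infra.svc.cluster.local  # the PgBouncer ClusterIP service, not Postgres
  port: 6432
  pool: 10
  idle_timeout: 300           # seconds before the reaper retires an unused entry
  reaping_frequency: 60       # seconds between reaper runs
  prepared_statements: false  # needed for transaction pooling without extra PgBouncer config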
The PgBouncer side gets one configuration: a terminationGracePeriodSeconds long enough to cover the worst case. With that math right, the rollout behaves cleanly. PgBouncer receives SIGTERM, enters WAIT_FOR_CLIENTS, and waits. New client connections route to the new pods (Kubernetes service does this for free once the old pod is Terminating). Existing client connections age out one by one, either via the lifetime patch on their next checkout, or via idle_timeout if they go idle long enough. By the time the grace period is up, no client is still holding a connection to the old pod. PgBouncer exits cleanly. No SIGKILL, no severed sockets, no PG::ConnectionBad. A rollout that used to spike connection errors goes to zero.
Tuning the timing math
The whole approach hinges on one inequality:
terminationGracePeriodSeconds > MAX_LIFETIME + idle_timeout + buffer
If the inequality holds, every existing client connection has time to recycle (either via the lifetime check on its next checkout, or via the idle reaper) before Kubernetes loses patience and sends SIGKILL. If it doesn't hold, SIGKILL severs whatever connections didn't make it out in time, and you're back to the original PG::ConnectionBad spike on every rollout.
A reasonable starting point for a Rails web tier:
- MAX_LIFETIME: 8 minutes (480s)
- idle_timeout: 5 minutes (300s, the Rails default)
- buffer: 2 minutes (120s) for TCP close handshakes, kernel cleanup, the worst-case in-flight transaction
- terminationGracePeriodSeconds: 15 minutes (900s)
Most pods drain in three to five minutes in practice, because traffic is spread across the fleet and most connections see a fresh checkout well within MAX_LIFETIME. The fifteen minutes is the headroom for the unlucky long-tail connection that just got checked out as PgBouncer started shutting down.
The other side of the inequality is the rolling-update behaviour. Kubernetes uses maxSurge and maxUnavailable to control how many pods rotate at once, and a Pod Disruption Budget with minAvailable: 1 keeps at least one PgBouncer pod always serving. As soon as an old pod enters Terminating, the K8s service stops sending it new connections (modulo a brief endpoint-propagation lag, which is why the standard pattern includes a preStop sleep of a few seconds before SIGTERM actually fires). Every fresh connection from a recycled client lands on a healthy pod that isn't about to be terminated. By the time the grace period expires, traffic has fully shifted to the new generation of pods and the old pod can exit cleanly.
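Put together, a hedged sketch of the Deployment and PDB shape this section assumes (the names and the five-second sleep are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  selector:
    matchLabels:
      app: pgbouncer
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      terminationGracePeriodSeconds: 900  # > MAX_LIFETIME + idle_timeout + buffer
      containers:
        - name: pgbouncer
          image: my-registry/pgbouncer:1.23  # illustrative
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "5"]  # cover endpoint-propagation lag before SIGTERM
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pgbouncer
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: pgbouncer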
One refinement worth knowing about: with a fixed MAX_LIFETIME, every connection that started around the same time tends to expire around the same time, which produces a brief spike in connection-establishment when they all recycle in a window. Rails 8.1's pool_jitter setting handles this in core. The patch above can do the same with MAX_LIFETIME * (1 + rand(0.0..0.2)) per connection, spreading the recycle over a window instead of concentrating it.
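On the 7.x patch, that refinement is a two-line change: stamp a jittered per-connection lifetime at birth and compare against that instead of the shared constant. A sketch of the changed portion of checkout:

# Inside ConnectionMaxLifetime#checkout, replacing the fixed-lifetime check:
unless connection.instance_variable_defined?(:@birth_time)
  connection.instance_variable_set(:@birth_time, now)
  # Up to 20% extra lifetime, fixed at birth, so fleet-wide expiries spread out.
  connection.instance_variable_set(:@lifetime, MAX_LIFETIME_SECONDS * (1 + rand(0.0..0.2)))
end
age = now - connection.instance_variable_get(:@birth_time)
return connection if age <= connection.instance_variable_get(:@lifetime)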
There's a parallel knob on the PgBouncer side. server_lifetime (default 3600s) is the same idea applied to backend connections from PgBouncer to Postgres: PgBouncer recycles a backend the next time it's released after that age. Different layer, same trick. The PgBouncer config docs are consistent with the client-side approach: they note that client_idle_timeout should be set larger than the client-side connection lifetime, which assumes such a thing exists on the other end.
What this gets you, and what it doesn't
What this gets you is zero connection errors during PgBouncer rollouts, every rollout, with no application-side retry logic to maintain. The cost is a slower rollout (you're paying the full MAX_LIFETIME + idle_timeout window, which can be ten to fifteen minutes per pod in practice) and the discipline of keeping the timing math in sync between the client and PgBouncer's grace period.
What it doesn't handle is any case where a single thread holds the connection for longer than the lifetime, and there's an important exception to call out: Rails 7.2's per-query checkout does not apply inside a transaction[7]. Inside a Model.transaction do ... end block the connection is pinned to the thread for the duration, every query reuses the same connection, and the lifetime check fires at checkout time only. So if a transaction runs for ten minutes and the lifetime is eight, the check never fires during those ten minutes. The same goes for any session-scoped state that has to live on a specific connection: pg_advisory_lock, session GUCs, prepared statements not configured for transaction-mode pooling. A long-running job that holds a transaction or an advisory lock blocks the lifetime check entirely. When PgBouncer terminates and the grace period expires, that connection gets severed and the job's next query fails.
PgBouncer offers one partial mitigation, idle_transaction_timeout, which forces a connection closed if it sits idle inside a transaction for too long. That helps with the idle-in-transaction case (a thread that opened a transaction and went off to do something else), but it doesn't help with a transaction that's actively running queries the whole time.
Two reasonable mitigations come up. One is to keep long-running jobs off PgBouncer entirely, with a separate database.yml entry pointing directly at the database, used by the relevant job classes (sketched below). This is what advisory-lock-using code paths and migration runners need anyway, since transaction pooling can't safely serve either.
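A minimal sketch of that split, assuming a hypothetical primary_direct entry in database.yml that points straight at Postgres:

# Abstract base for code that must bypass PgBouncer: advisory locks,
# migrations, long-held transactions.
class DirectRecord < ApplicationRecord
  self.abstract_class = true
  connects_to database: { writing: :primary_direct, reading: :primary_direct }
end

class NightlyBackfillJob < ApplicationJob
  def perform
    DirectRecord.connected_to(role: :writing) do
      # Safe to hold pg_advisory_lock or a multi-minute transaction here:
      # this connection never crosses transaction-mode PgBouncer.
    end
  end
end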
The other approach is to wrap the failure path in transparent retry logic at the ActiveRecord adapter level: if a query fails with PG::ConnectionBad and the failure isn't inside an open transaction, retry it on a fresh connection. The "not inside an open transaction" guard is what makes this safe. A connection severed mid-transaction means Postgres has already rolled the transaction back, and retrying a single statement from inside that transaction would execute it outside the transaction context, which is unsafe. Retry only when the failure is between transactions. Web requests handle the in-transaction case naturally because the request returns 500 and the client retries. Jobs need an in-process retry layer specifically for the between-transaction case so they don't get re-enqueued for an error that resolves in milliseconds.
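A hedged sketch of that retry layer; the PgbouncerRetry name and shape are mine, not a Rails or PgBouncer API:

module PgbouncerRetry
  RETRYABLE = [ActiveRecord::ConnectionFailed, PG::ConnectionBad].freeze

  # Retry the block once if the connection was severed, but only when the
  # caller is not inside an open transaction; retrying there would replay
  # a statement outside the transaction Postgres already rolled back.
  def self.with_retry(attempts: 2)
    tries = 0
    begin
      yield
    rescue *RETRYABLE
      raise if ActiveRecord::Base.connection.transaction_open?
      raise if (tries += 1) >= attempts
      ActiveRecord::Base.connection.reconnect!  # swap the dead socket for a fresh one
      retry
    end
  end
end

# Usage in a job body:
# PgbouncerRetry.with_retry { Order.pending.find_each(&:process!) }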
Neither approach is wrong. Both come up in practice. Most production Rails apps end up with some mixture: long-running batch jobs go direct, the bulk of web traffic and short jobs go through PgBouncer with the lifetime patch.
What about PgBouncer's deployment shape
PgBouncer is one binary, but how you run it on Kubernetes shapes the rollout story as much as the configuration. Three patterns come up:
DaemonSet (one PgBouncer per node, applications connect to localhost:6432): preserves locality, no extra network hop. Doesn't scale horizontally because the pool size is fixed per node, and rollouts are slow because every node has to drain in turn. Each PgBouncer instance has its own backend pool, so total connections to Postgres scale with node count rather than actual demand. For small clusters, fine. For anything past a few dozen nodes, the inefficiency starts showing.
Sidecar (one PgBouncer per application pod): maximum isolation, one backend pool per app pod. Defeats the point of pooling because you've now got hundreds of small pools instead of one big one. Mostly an anti-pattern.
Deployment + ClusterIP service (centralized PgBouncer pool, autoscaled, fronted by a Kubernetes service): the pattern most production setups settle on. One logical PgBouncer that scales horizontally on CPU or connection metrics. TCP connections are inherently sticky (once a client connects to a particular pod, all traffic stays there until the connection closes), so the service load-balances on connection establishment, not per-query. Rolling updates are clean if you've got the lifetime patch in the client. Pod Disruption Budgets keep at least one pod available during voluntary disruptions, anti-affinity spreads pods across nodes for failure isolation, and HPA or KEDA scales the fleet up under load.
The Deployment + service pattern is what you want past hobby-project scale. The lifetime patch above is what makes it operationally clean.
Other things PgBouncer ends up doing
Connection pooling is the primary job, but once you have PgBouncer in the path, a few other capabilities come for free.
Pause and resume for maintenance windows. PgBouncer's admin interface includes PAUSE and RESUME commands. PAUSE stops PgBouncer from forwarding queries to Postgres and waits for in-flight transactions to complete. The application keeps its connections open to PgBouncer, which now buffers them. This is useful during Postgres failovers and certain DDL operations that need a quiet moment, with the application none the wiser. RESUME flushes everything that queued up.
TLS termination. PgBouncer can speak TLS to the application and plaintext to Postgres, or vice versa. Useful when the application is on the other side of a network boundary and the database is on the same trusted network as the proxy, or the inverse.
Multi-database routing. The [databases] section in pgbouncer.ini lets one PgBouncer instance front many logical databases, mapping client connection strings to backend hosts and database names. A single proxy fleet can serve a primary, a read replica, multiple databases, and even databases on different hosts. This is the lever that makes "centralised PgBouncer for the whole org" practical.
Authentication proxy. PgBouncer can authenticate clients on its own without forwarding the credentials to Postgres, which keeps the real database password hidden inside the cluster.
Per-user, per-database configuration. Different pool modes, pool sizes, and connection limits per user or per database. Useful when you have a mix of long-running batch jobs and short transactional queries against the same database, or when you want service-level metrics on shared infrastructure.
What PgBouncer doesn't do, despite how often it gets asked, is read/write splitting on its own. PgBouncer doesn't inspect queries to decide whether they're reads or writes. The standard pattern when you want read/write splitting is to deploy two PgBouncer fleets, one in front of the primary and one in front of the read replica, and have the application route by connection target. PgCat does this natively if you'd rather have it in the proxy.
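In Rails, that two-fleet routing is ordinary multi-database configuration; a sketch assuming primary_pgbouncer and replica_pgbouncer entries in database.yml, each pointing at its own PgBouncer service:

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  # Each role targets a different PgBouncer fleet; the proxy never inspects SQL.
  connects_to database: { writing: :primary_pgbouncer, reading: :replica_pgbouncer }
end

# Route a read explicitly (or rely on Rails' automatic role-switching middleware):
ActiveRecord::Base.connected_to(role: :reading) do
  Report.expensive_aggregation  # illustrative model and scope
end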
Other ways to solve the same problem
Connection pooling is a real problem, and PgBouncer is one of several answers. A few worth knowing:
AWS RDS Proxy is a managed connection pool from AWS. Trade-off: it costs per vCPU per hour and adds a network hop. Prepared statement support is partial. Extended-protocol prepared statements are multiplexed (a recent improvement). The older PREPARE / DEALLOCATE / DISCARD text-protocol commands still cause connection pinning, which defeats pooling for the duration. Worth it if you don't want to operate a proxy yourself and your application uses the extended protocol (the default for most ORMs, including Rails). Cost climbs quickly with multiple databases since each one needs its own proxy instance.
PgCat is a Rust rewrite of PgBouncer with first-class sharding support and better multithreading. Less mature, smaller community, but a serious project worth tracking if you're hitting PgBouncer's single-threaded ceiling.
Application-side pooling only (no proxy at all) is viable if your application fleet is small enough that the total connection count stays inside Postgres's limits. The per-process Rails pool is well-engineered. You don't need PgBouncer until you do, and "until you do" is usually somewhere around the point where a single app fleet outgrows what one Postgres instance can host as direct connections.
Pgpool-II is the older alternative. Powerful but operationally heavier, with replication and load-balancing features that often aren't needed.
For most Rails-on-Kubernetes setups past a few dozen worker pods, PgBouncer is the right answer. The choice isn't usually "PgBouncer or something else." It's "PgBouncer with the missing client-side pieces or PgBouncer with a recurring rollout spike."
Closer
If you're running PgBouncer on Kubernetes and the rollout spike is something you've learned to live with, try the lifetime patch. The implementation is small, the operational gains are large, and the trade-offs (slow rollouts, timing math) are honest. It's been running cleanly in production for me, and the rollout-spike alert has been silent since the patch landed.
For the SQL-side perils of running PgBouncer (transaction-pooling gotchas around prepared statements, advisory locks, statement timeouts, listen/notify), JP Camara's "PgBouncer is useful, important, and fraught with peril" is the canonical companion read. He covers what to watch for inside the SQL layer. This post covers what to watch for at the rollout layer.
If you've worked through something similar with a different shape, or hit a failure mode I didn't, I'd love to hear about it. maria@runbookpages.com. Wrong-turns and what-broke stories most welcome. The closer to "tried this, here's what actually shipped" the better.
Footnotes
1. PgBouncer documentation, pool_mode. The canonical reference for the session/transaction/statement contract and the SQL features each one breaks. https://www.pgbouncer.org/config.html#pool-mode
2. PgBouncer documentation, Signals. SIGTERM puts PgBouncer into the same SHUTDOWN WAIT_FOR_CLIENTS state available via the admin console (the command of the same name), and SIGINT triggers WAIT_FOR_SERVERS. A second signal escalates to immediate shutdown. Note this graceful-SIGTERM behaviour is PgBouncer 1.23+; before 1.23, SIGTERM was an immediate shutdown. https://www.pgbouncer.org/usage.html#signals
3. PgBouncer GitHub issue #1468, Zero downtime on PgBouncer Pod Rotation. Maintainer and contributor responses confirm there is no proxy-side mechanism to gracefully tell clients to reconnect during rotation. The proposed protocol-level fix (a Postgres protocol extension) hasn't shipped in any driver as of writing.
4. ActiveRecord ConnectionPool API, max_age. Rails 8.1 ships connection-age retirement on the check-in path, complemented by pool_jitter (randomises expiry to avoid thundering-herd recycling), idle_timeout, and keepalive. The shape is different from the check-out patch but the goal is the same. https://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/ConnectionPool.html
5. Prior art on the patch shape: Samuel Cochran's ActiveRecord Connection Lifetime gist, motivated by autoscaling rebalance rather than rotation drain. Same checkout-time recycle pattern, different reason to reach for it. The Rails 8.1 max_age work in core was driven by similar autoscaling concerns and lands on the check-in path.
6. Rails 7.2 release notes, per-query connection leasing. Connections are leased for the duration of a single query rather than the whole request, which makes mid-request connection recycling possible. https://guides.rubyonrails.org/7_2_release_notes.html
7. Rails ConnectionPool, transaction pinning. Inside an open transaction the pool tracks the connection as pinned to its thread (@pinned_connection / @pinned_connections_depth) and skips the per-query lease/release cycle. The same connection is reused for every query in the transaction. This is necessary because Postgres transactions are connection-scoped and can't survive a connection swap mid-transaction.