Connection Leak in PgBouncer Behind AWS NLB

At a previous job, we had several instances of Ruby on Rails applications connecting to PostgreSQL through PgBouncer. The services and databases were deployed on bare-metal servers and chugged along fine. We then decided to migrate to AWS. Upon migration, we noticed that PgBouncer started leaking connections to PostgreSQL like a sieve, leading to exhaustion of the maximum number of connections configured in RDS.

The Architecture

Each Ruby on Rails service maintained a small pool of database connections within the service process. The services then connected to PostgreSQL through PgBouncer, which acted as the external connection pool. The pool_mode in PgBouncer was set to session. So PgBouncer kept a connection to PostgreSQL (server connection) associated with a connection from Rails (client connection), as long the client remained connected. Such server connections will not be re-used for new client connections. Server connections are re-used by PgBouncer, in pool_mode = session, only when the client connection it is associated with is closed by the client.

As part of migration to AWS, we placed 2 instances of PgBouncer behind AWS Network Load Balancer (NLB), for redundancy. PgBouncer instances then pooled connections to RDS instances running PostgreSQL.

Architecture

The Problem

When we routed traffic to AWS instances, we noticed that the number of active connections to RDS was much higher than expected and was climbing up steadily. The number of connections reported by Rails connection pool was much lower and at the expected level. It appeared that Rails was managing connections correctly, but PgBouncer did not recognize that the client connections were closed. So it created new server connections, every time Rails requested a new connection, instead of re-using existing ones.

The Rails services received fairly low traffic and may execute queries several minutes apart. They connected to PgBouncer, executed a query, and put the connection into their local connection pool. They may retrieve the connection from the local pool after, say, 30 minutes to execute another query.

NLB treats any TCP connection which does not have data flowing through it for 350 seconds as idle. If a client or server sends data after remaining idle for 350 seconds, it will receive an RST packet and notice that NLB has terminated the connection. NLB does not allow users to change the timeout.

In our case, the Rails service received the RST packet when it attempted the second query and realized that the connection is defunct. So it established a new connection to PgBouncer for the second query. However, since PgBouncer does not send any data until client queries, it never received an RST. So it continued to assume that the first server connection was still associated with a client connection. So when Rails established a new connection for the second query, PgBouncer created a new server connection, instead of re-using the old one. Thus a connection to PostgreSQL was leaked every time a connection from Rails timed out.

The Solution

NLB does not close a TCP connection, if either the server or the client sends a TCP keepalive packet before the 350 second timeout, even if no other data is sent. Luckily for us, PgBouncer allows sending keepalive packets periodically by setting the following configuration options.

tcp_keepalive = 1
tcp_keepidle = 300

With these set, PgBouncer sends keepalive packets if a connection remains idle for 300 seconds. You can also set tcp_keepcnt and tcp_keepintvl to tweak keepalive behaviour further.

With keepalive enabled, NLB never closed connections even if no query was executed on it. So closing of DB connections were entirely in the control of Rails and PgBouncer. This also meant that Rails’ local connection pool worked more effectively.

If you liked what you read, consider subscribing to the RSS feed in your favourite feed reader.

10 May, 2020 postgresql / aws / database