This scenario is admittedly operator error, but I am curious if there's anything River could do differently to self-heal.
We have a writer and a reader in an AWS RDS cluster. Our service connects through the cluster endpoint, which routes to the current writer. We needed to perform an OS patch, so we upgraded the reader and then failed over so it became the writer. Then we upgraded the former writer before failing back over to return to the original state. River's leader ended up pegged to the reader and could not do anything, so we restarted the service to force a new connection, which fixed the issue. The root cause is that a pgxpool connection obtained from the cluster endpoint is trusted as a writer for its entire lifetime, but Aurora flips roles without closing sockets.
Some of the errors we saw:
```
ERROR: cannot execute UPDATE in a read-only transaction (SQLSTATE 25006)
ERROR: cannot execute INSERT in a read-only transaction (SQLSTATE 25006)
error listening on topic "river_control": ERROR: cannot execute LISTEN during recovery (SQLSTATE 25006)
error scheduling jobs: ERROR: cannot execute SELECT FOR UPDATE in a read-only transaction (SQLSTATE 25006)
error cleaning jobs: ERROR: cannot execute DELETE in a read-only transaction (SQLSTATE 25006)
ERROR: cannot access temporary or unlogged relations during recovery (SQLSTATE 0A000)
```
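For what it's worth, these all carry SQLSTATE codes that could be matched on. A minimal sketch of that classification — `looksReadOnly` is a hypothetical helper, and it parses the code out of the message text only to stay dependency-free here; with pgx you'd normally use `errors.As` with `*pgconn.PgError` and read its `Code` field instead:

```go
package main

import (
	"fmt"
	"regexp"
)

// sqlstateRE pulls the five-character SQLSTATE code out of a Postgres
// error message of the form "... (SQLSTATE 25006)".
var sqlstateRE = regexp.MustCompile(`\(SQLSTATE ([0-9A-Z]{5})\)`)

// looksReadOnly (hypothetical) reports whether an error message suggests
// the connection landed on a read-only (recovering) instance: 25006 is
// read_only_sql_transaction, and 0A000 covers the "cannot access temporary
// or unlogged relations during recovery" case seen above. Treating all
// 0A000 this way is a heuristic specific to this failover scenario.
func looksReadOnly(msg string) bool {
	m := sqlstateRE.FindStringSubmatch(msg)
	if m == nil {
		return false
	}
	return m[1] == "25006" || m[1] == "0A000"
}

func main() {
	for _, msg := range []string{
		`ERROR: cannot execute UPDATE in a read-only transaction (SQLSTATE 25006)`,
		`ERROR: cannot access temporary or unlogged relations during recovery (SQLSTATE 0A000)`,
		`ERROR: relation "jobs" does not exist (SQLSTATE 42P01)`,
	} {
		fmt.Println(looksReadOnly(msg))
	}
}
```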
If River could realize it's incorrectly connected to a read-only instance by inspecting errors or running something like pg_is_in_recovery(), it could attempt to reconnect and, if that keeps failing, ultimately stop. Again, I realize this is somewhat specific to our scenario ✌🏼