Skip to content

river stuck connected to read-only postgres instance #1225

@maxbeatty

Description

@maxbeatty

This scenario is admittedly operator error, but I am curious if there's anything River could do differently to self-heal.

We have a writer and reader in an AWS RDS cluster. Our service connects through a cluster endpoint which routes to the current writer. We needed to perform an OS patch so we upgraded the reader and then failed over so it became the writer. Then, we upgraded the former writer before failing back over to return to the original state. River's leader ended up pegged to the reader and could not do anything so we restarted the service to force a new connection which fixed the issue. The root cause is that a pgxpool connection obtained from the cluster endpoint is trusted as a writer for its entire lifetime, but Aurora flips roles without closing sockets.

Some of the errors we saw:

ERROR: cannot execute UPDATE in a read-only transaction (SQLSTATE 25006)

ERROR: cannot execute INSERT in a read-only transaction (SQLSTATE 25006)

error listening on topic "river_control": ERROR: cannot execute LISTEN during recovery (SQLSTATE 25006)

error scheduling jobs: ERROR: cannot execute SELECT FOR UPDATE in a read-only transaction (SQLSTATE 25006)

error cleaning jobs: ERROR: cannot execute DELETE in a read-only transaction (SQLSTATE 25006)

ERROR: cannot access temporary or unlogged relations during recovery (SQLSTATE 0A000)

If River could realize it's incorrectly connected to a read-only instance by inspecting errors or running something like pg_is_in_recovery(), it could attempt to reconnect and ultimately stop if it won't be successful. Again, I realize this is somewhat specific to our scenario ✌🏼

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions