[Bug/Optimization] Implement Transactional Outbox Pattern to Prevent Neo4j/MongoDB Distributed State Desync #225

@vakrahul

Description

Context & Problem

Our data architecture splits structural entity states across two separate database engines: document states live in MongoDB (xmem-go/internal/database/mongo.go) while structural/dependency graph states live in Neo4j (src/graph/neo4j_client.py).

Currently, memory ingestion (src/pipelines/ingest.py) and compaction tasks write to the two databases sequentially inside the application execution loop, with no shared transaction. This is a serious reliability gap: if a background worker commits an updated memory segment to MongoDB and then crashes, or the network fails or times out, before the corresponding graph mutations run in Neo4j, the two stores diverge and the system cannot recover from the desync on its own. This leaves orphaned document references in MongoDB and "ghost" nodes/edges in Neo4j, which eventually surface as downstream exceptions or missing context during retrieval.

Proposed Solution

To guarantee eventual consistency without a slow, blocking distributed commit protocol (such as two-phase commit), we should implement an asynchronous Transactional Outbox Pattern at the database layer.

Key Implementation Steps

  • Outbox Collection: Create an outbox collection in MongoDB. Every write to a domain collection (such as project stores or user memories) must also insert an outbox event document within the same MongoDB transaction, so that either both commit or neither does.
  • Event Publisher Loop: Build a lightweight background relay thread (using MongoDB Change Streams) that reads events from the outbox collection and asynchronously applies them through the Neo4j client. Note that both multi-document transactions and Change Streams require MongoDB to run as a replica set or sharded cluster.
  • Idempotent Graph Operations: Refactor the Cypher queries in src/graph/neo4j_client.py to be idempotent (using MERGE instead of raw CREATE) so that an event processed more than once during a retry leaves the graph state unchanged.
  • Acknowledge and Purge: Once Neo4j confirms that the write transaction has committed, the worker marks the outbox event as processed or purges it.
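
The steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the collection names (`memories`, `outbox`), the `Memory` label, the `memory.upserted` event type, and all function names are hypothetical. It assumes a PyMongo `MongoClient` connected to a replica set and a Neo4j Python driver 5.x session are passed in by the caller. For brevity the relay polls for pending events instead of tailing a Change Stream.

```python
import uuid
from datetime import datetime, timezone

# Idempotent upsert: MERGE matches on id, so replaying an event is harmless.
MERGE_MEMORY = (
    "MERGE (m:Memory {id: $id}) "
    "SET m.text = $text, m.last_event_id = $event_id"
)


def make_outbox_event(memory_id, payload):
    """Build the event document stored next to the domain write."""
    return {
        "_id": str(uuid.uuid4()),
        "type": "memory.upserted",  # hypothetical event type
        "aggregate_id": memory_id,
        "payload": payload,
        "status": "pending",
        "created_at": datetime.now(timezone.utc),
    }


def write_memory_with_outbox(client, db_name, memory):
    """Write the memory and its outbox event in one MongoDB transaction."""
    db = client[db_name]
    event = make_outbox_event(memory["_id"], {"text": memory.get("text", "")})
    with client.start_session() as session:
        # Commits on normal exit, aborts if an exception is raised.
        with session.start_transaction():
            db["memories"].replace_one(
                {"_id": memory["_id"]}, memory, upsert=True, session=session
            )
            db["outbox"].insert_one(event, session=session)
    return event["_id"]


def apply_event(neo4j_session, event):
    """Apply one outbox event to the graph inside a managed write transaction."""
    def work(tx):
        tx.run(
            MERGE_MEMORY,
            id=event["aggregate_id"],
            text=event["payload"].get("text", ""),
            event_id=event["_id"],
        ).consume()

    neo4j_session.execute_write(work)


def relay_pending(db, neo4j_session, limit=100):
    """Apply pending events oldest first, then mark each one processed."""
    processed = 0
    cursor = db["outbox"].find({"status": "pending"}).sort("created_at", 1).limit(limit)
    for event in cursor:
        apply_event(neo4j_session, event)
        # A crash between these two calls replays the event later,
        # which is why the Cypher above has to be idempotent.
        db["outbox"].update_one(
            {"_id": event["_id"]}, {"$set": {"status": "processed"}}
        )
        processed += 1
    return processed
```

Marking events as processed rather than deleting them keeps an audit trail; purging could be done by a separate cleanup job or a TTL index if the collection grows too large.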

Why This Matters

This takes Neo4j, and the network path to it, off the critical write path. Even if Neo4j goes down or sockets time out during high-load ingestion, MongoDB writes still succeed, pending events accumulate in the outbox, and the graph catches up once Neo4j is reachable again.
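
As a rough illustration of that behaviour, the relay could retry each event with exponential backoff and leave it pending when Neo4j stays unreachable. The sketch below uses only the standard library and in-memory dicts; `drain_outbox`, the `attempts` field, and the default limits are assumptions made for the example. In a real worker the `retryable` tuple would likely hold the driver's connectivity errors rather than `ConnectionError`.

```python
import time


def drain_outbox(events, apply, retryable=(ConnectionError,),
                 max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Try to apply each pending event; return the ones still pending."""
    still_pending = []
    for event in events:
        if event.get("status") != "pending":
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                apply(event)
            except retryable:
                # Record the failure and back off: 0.5s, 1s, 2s, ...
                event["attempts"] = attempt
                if attempt < max_attempts:
                    sleep(base_delay * 2 ** (attempt - 1))
            else:
                event["status"] = "processed"
                break
        if event["status"] == "pending":
            # Nothing is lost: the event stays in the outbox for the next run.
            still_pending.append(event)
    return still_pending
```

Because a failed event is never dropped, an outage only delays graph updates; the next relay run picks up where this one stopped.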

Impacted Files

  • src/pipelines/ingest.py
  • src/graph/neo4j_client.py
  • xmem-go/internal/database/mongo.go
