Beyond Eventual Consistency: Implementing Zanzibar-Style Zookies in Distributed CQRS for Causal Consistency

Reading from the write database destroys the core benefits of CQRS. This article explores a pattern using causal tokens to guarantee session consistency without sacrificing write database performance.

Software Architecture CQRS Distributed Systems System Design

Command Query Responsibility Segregation (CQRS) scales asymmetric workloads by separating write logic from read paths and optimizing data models for each purpose. However, this decoupling introduces data consistency challenges across independent stores. Systems must balance replication lag and performance against the user expectation of data freshness.

This article examines consistency mechanisms in asynchronous CQRS, moving from write-database bypasses to distributed causal tokens. The text explores a pattern using decoupled data and signaling channels to enforce strict session consistency without degrading write database performance.

1. Basic Model

In a traditional application architecture, a single database handles all operations. Every time the system updates a record or fetches data for the user interface, it interacts with the exact same database instance and the same data model.

Basic Model

The Problem

This single-database approach creates severe bottlenecks when a system faces asymmetric workloads where read operations are significantly more frequent than writes. Relational databases are typically normalized to ensure data integrity and prevent duplication. While normalization is excellent for safe and predictable writes, it is highly inefficient for frequent reads.

  • Complex Queries: Fetching data for a single user interface screen often requires complex database JOINs across multiple normalized tables. As read traffic grows, executing these JOINs repeatedly consumes massive CPU resources.
  • Resource Competition: Reads and writes compete for the same database engine resources. A heavy read query can lock tables or exhaust memory, which directly slows down critical write operations.
  • Conflicting Optimizations: A database optimized for fast, safe write transactions cannot be simultaneously optimized for rapid, complex data retrieval.

2. CQRS Single Database

To improve read performance without managing two separate databases, a system can use materialized views within a single database instance.

Basic Model

The Problem

While materialized views make read operations fast, they introduce a major data synchronization problem. Materialized views are static snapshots and do not update automatically when the underlying table data changes.

  • The Trigger Trap: Configuring database triggers to refresh the materialized view immediately after every write operation destroys write performance.
  • The Refresh Lag: Refreshing the views asynchronously on a schedule creates a replication gap. This causes the “Read Your Own Writes” problem, where a user updates their profile, the page reloads, but the user still sees old information because the background refresh has not executed yet.

3. Distributed CQRS

To scale systems, the architecture completely separates the infrastructure into two independent databases, a Write DB and a Read DB. The Command Service processes updates and writes them directly to the Write DB. It then publishes an update event to a PubSub message channel. A separate background worker, the Read Updater Daemon, listens to this channel, consumes the events, and updates the Read DB to match the new state.

Distributed CQRS

The Problem

Because data flows through an asynchronous PubSub channel, there is a time gap between writing data and updating the read model. This delay introduces eventual consistency.

The main issue is the “Read Your Own Writes” problem. If a user modifies data and immediately redirects to a page that displays it, the Query Service reads from the Read DB. If the Read Updater Daemon has not processed the event yet due to network lag or high traffic, the user will see their old data, creating a confusing user experience.

4. Distributed CQRS + Causal Tokens

This approach solves the “Read Your Own Writes” problem by enforcing causal consistency with tokens. When the Command Service completes a write operation, it returns a unique Causal Token to the client. The Read DB stores a corresponding Bound Token alongside each data object, which is typically a sequential number representing the latest change version of that specific object.

When making a query, the client includes their Causal Token in the request. The Query Service compares this token to the Bound Token stored in the Read DB. If the client Causal Token is greater than the Bound Token, the read model is outdated. The Query Service then pauses the request and waits for synchronization. This delay only impacts the specific user or the small group requiring the fresh data. Performance remains completely unaffected for the massive volume of other users who continue to read asynchronously without delay.

Distributed CQRS

The Problem

The primary disadvantage is the extreme complexity of the architecture and implementation. Distributed CQRS is already difficult to manage due to separate databases and message queues. Adding causal tokens introduces heavy infrastructure overhead, requiring object versioning on the write side, atomic token updates on the read side, and a low-latency signaling mechanism to wake up paused HTTP requests.

Real-World Example: Google Zanzibar

Google Zanzibar applies this exact concept to global permission management using tokens called Zookies. When a file’s access rights change, Zanzibar returns a Zookie to the application backend, which stores it within the file’s metadata. When a user requests access, the backend passes this token to the authorization engine. The distributed system then ensures its local replica is updated to at least this file version, forcing a temporary pause if the permission sync is still in progress.