Replication setup guide
This guide walks you through setting up QuestDB Enterprise replication.
Prerequisites: Read the Replication overview to understand how replication works.
Setup steps
- Configure object storage (AWS S3, Azure Blob, GCS, or NFS)
- Configure the primary node
- Take a snapshot of the primary
- Configure replica node(s)
1. Configure object storage
Choose your object storage provider and build the connection string for
replication.object.store in server.conf.
AWS S3
Create an S3 bucket following AWS documentation.
Recommendations:
- Select a region close to your primary node
- Disable blob versioning
- Set up a lifecycle policy to manage WAL file retention (see Snapshot and expiration policies)
Connection string:
replication.object.store=s3::bucket=${BUCKET_NAME};root=${DB_INSTANCE_NAME};region=${AWS_REGION};access_key_id=${AWS_ACCESS_KEY};secret_access_key=${AWS_SECRET_ACCESS_KEY};
DB_INSTANCE_NAME can be any unique alphanumeric string (dashes allowed). Use
the same value across all nodes in your replication cluster.
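If you use the AWS CLI, a minimal sketch of creating the bucket might look like this (bucket name, region, and instance name are illustrative):
aws s3 mb s3://questdb-replication --region eu-west-1
The corresponding server.conf entry would then be (substitute real credentials, or supply the whole string via the QDB_REPLICATION_OBJECT_STORE environment variable described in the Configuration reference):
replication.object.store=s3::bucket=questdb-replication;root=questdb-node-1;region=eu-west-1;access_key_id=<access-key-id>;secret_access_key=<secret-access-key>;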
Azure Blob Storage
Create a Storage Account following Azure documentation, then create a Blob Container.
Recommendations:
- Select a region close to your primary node
- Disable blob versioning
- Set up Lifecycle Management for WAL file retention
Connection string:
replication.object.store=azblob::endpoint=https://${STORE_ACCOUNT}.blob.core.windows.net;container=${BLOB_CONTAINER};root=${DB_INSTANCE_NAME};account_name=${STORE_ACCOUNT};account_key=${STORE_KEY};
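If you use the Azure CLI, one way to look up the account key referenced in the string above (resource group and account names are illustrative):
az storage account keys list --resource-group questdb-rg --account-name questdbstore --query "[0].value" --output tsv
Avoid committing the key to server.conf in plain text where possible; the QDB_REPLICATION_OBJECT_STORE environment variable shown in the Configuration reference is one alternative.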
Google Cloud Storage
Create a GCS bucket, then create a service account with Storage Admin (or
equivalent) permissions. Download the JSON key and encode it as Base64:
cat <key>.json | base64
Connection string:
replication.object.store=gcs::bucket=${BUCKET_NAME};root=/;credential=${BASE64_ENCODED_KEY};
Alternatively, use credential_path to reference the key file directly.
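For example, the credential_path variant might look like this (bucket and key path are illustrative; the file must be readable by the QuestDB user):
replication.object.store=gcs::bucket=questdb-replication;root=/;credential_path=/var/lib/questdb/gcs-key.json;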
NFS
Mount the shared filesystem on all nodes. Ensure the QuestDB user has read/write permissions.
Important: Both the WAL folder and scratch folder must be on the same NFS mount to prevent write corruption.
Connection string:
replication.object.store=fs::root=/mnt/nfs_replication/final;atomic_write_dir=/mnt/nfs_replication/scratch;
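A minimal sketch of preparing the mount on a node, assuming an export at nfs-server:/export/questdb and a service user named questdb (both are assumptions; adjust to your environment):
sudo mkdir -p /mnt/nfs_replication
sudo mount -t nfs nfs-server:/export/questdb /mnt/nfs_replication
sudo mkdir -p /mnt/nfs_replication/final /mnt/nfs_replication/scratch
sudo chown -R questdb:questdb /mnt/nfs_replication
Add a matching /etc/fstab entry so the mount persists across reboots.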
2. Configure the primary node
Add to server.conf:
| Setting | Value |
|---|---|
| replication.role | primary |
| replication.object.store | Your connection string from step 1 |
| cairo.snapshot.instance.id | Unique UUID for this node |
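Put together, the primary's additions to server.conf might look like the following sketch (the S3 connection string and UUID are illustrative; uuidgen is one way to generate the ID):
replication.role=primary
replication.object.store=s3::bucket=questdb-replication;root=questdb-node-1;region=eu-west-1;access_key_id=<access-key-id>;secret_access_key=<secret-access-key>;
cairo.snapshot.instance.id=9c27c1e4-5b7a-4a3d-9c62-2f3e8a1b4d70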
Restart QuestDB.
3. Take a snapshot
Replicas are initialized from a snapshot of the primary's data. This involves creating a backup of the primary and preparing it for restoration on replica nodes.
See Backup and restore for the full procedure.
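As a rough sketch only, assuming a default installation under /var/lib/questdb and the PostgreSQL wire protocol on port 8812 (the exact SQL differs between QuestDB versions, so defer to Backup and restore):
psql -h primary-host -p 8812 -U admin -d qdb -c "SNAPSHOT PREPARE;"
tar -czf questdb-snapshot.tar.gz -C /var/lib/questdb db
psql -h primary-host -p 8812 -U admin -d qdb -c "SNAPSHOT COMPLETE;"
Depending on your version, the freeze and release statements may use a different (checkpoint-based) syntax; the principle is the same: freeze, copy the db directory, release.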
Set up regular snapshots (daily or weekly). See Snapshot and expiration policies for guidance.
4. Configure replica node(s)
Create a new QuestDB instance. Add to server.conf:
| Setting | Value |
|---|---|
| replication.role | replica |
| replication.object.store | Same connection string as primary |
| cairo.snapshot.instance.id | Unique UUID for this replica |
Do not copy server.conf from the primary. Two nodes configured as primary
with the same object store will break replication.
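A matching replica sketch, reusing the illustrative S3 string from the primary example but with its own UUID:
replication.role=replica
replication.object.store=s3::bucket=questdb-replication;root=questdb-node-1;region=eu-west-1;access_key_id=<access-key-id>;secret_access_key=<secret-access-key>;
cairo.snapshot.instance.id=4f8d2a61-0c5e-47b9-8a3f-6b1d9e2c7a45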
Restore the db directory from the primary's snapshot, then start the replica.
It will download and apply WAL files to catch up with the primary.
Configuration reference
All replication settings go in server.conf. After changes, restart QuestDB.
Use environment variables for sensitive settings:
export QDB_REPLICATION_OBJECT_STORE="azblob::..."
| Property | Default | Reloadable | Description |
|---|---|---|---|
| replication.role | none | No | Role of this instance. Defaults to none, which disables replication; set to primary or replica to enable it. |
| replication.object.store | | No | A configuration string that allows connecting to an object store. The format is scheme::key1=value1;key2=value2;…. The keys and values for each provider are detailed in step 1 above. Ignored if replication is disabled. There is no default, as the value is installation-specific. |
| cairo.wal.segment.rollover.size | 2097152 | No | The size of the WAL segment before it is rolled over. Default is 2 MiB. |
| cairo.writer.command.queue.capacity | 32 | No | Maximum writer ALTER TABLE and replication command capacity. Shared between all the tables. |
| replication.primary.throttle.window.duration | 10000 | No | The millisecond duration of the sliding window used to process replication batches. Default is 10 seconds. |
| replication.requests.max.concurrent | 0 | No | A limit to the number of concurrent object store requests. The default is 0. |
| replication.requests.retry.attempts | 3 | No | Maximum number of times to retry a failed object store request before logging an error and reattempting later after a delay. Default is 3. |
| replication.requests.retry.interval | 200 | No | How long to wait before retrying a failed operation. Default is 200 milliseconds. |
| replication.primary.compression.threads | calculated | No | Max number of threads used to perform file compression operations before uploading to the object store. The default value is calculated as half the number of CPU cores. |
| replication.primary.compression.level | 1 | No | Zstd compression level. Defaults to 1. |
| replication.replica.poll.interval | 1000 | No | Millisecond polling rate of a replica instance to check for the availability of new changes. |
| replication.primary.sequencer.part.txn.count | 5000 | No | Sets the txn chunking size for each compressed batch. Smaller is better for constrained networks (but more costly). |
| replication.primary.checksum | service-dependent | No | Whether a checksum should be calculated for each uploaded artifact. Required for some object stores. Other options: never, always. |
| replication.primary.upload.truncated | true | No | Skip trailing, empty column data inside a WAL column file. |
| replication.requests.buffer.size | 32768 | No | Buffer size used for object-storage downloads. |
| replication.summary.interval | 1m | No | Frequency for printing replication progress summary in the logs. |
| replication.metrics.per.table | true | No | Enable per-table replication metrics on the Prometheus metrics endpoint. |
| replication.metrics.dropped.table.poll.count | 10 | No | How many scrapes of the Prometheus metrics endpoint before dropped tables no longer appear. |
| replication.requests.max.batch.size.fast | 64 | No | Number of parallel requests allowed during the 'fast' process (non-resource constrained). |
| replication.requests.max.batch.size.slow | 2 | No | Number of parallel requests allowed during the 'slow' process (error/resource constrained path). |
| replication.requests.base.timeout | 10s | No | Replication upload/download request timeout. |
| replication.requests.min.throughput | 262144 | No | Expected minimum network speed for replication transfers. Used to expand the timeout and account for network delays. |
| native.async.io.threads | cpuCount | No | The number of async (network) io threads used for replication (and in the future cold storage). The default should be appropriate for most use cases. |
| native.max.blocking.threads | cpuCount * 4 | No | Maximum number of threads for parallel blocking disk IO read/write operations for replication (and other tasks). These threads are ephemeral: they are spawned as needed and shut down after a short idle period. They are not CPU-bound, hence the relatively large number. The default should be appropriate for most use cases. |
For tuning options, see the Tuning guide.
Snapshot and expiration policies
WAL files are typically read by replicas shortly after upload. To optimize costs, move files to cooler storage tiers after 1-7 days.
Recommendations:
- Take snapshots every 1-7 days
- Keep WAL files for at least 30 days
- Ensure snapshot interval is shorter than WAL expiration
Example: Weekly snapshots + 30-day WAL retention = ability to restore up to 23 days back. Daily snapshots restore faster but use more storage.
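On AWS S3, for example, the recommendations above could be captured in a lifecycle configuration like the following (rule ID, prefix, and day counts are illustrative), saved as lifecycle.json and applied with aws s3api put-bucket-lifecycle-configuration --bucket questdb-replication --lifecycle-configuration file://lifecycle.json:
{
  "Rules": [
    {
      "ID": "questdb-wal-retention",
      "Status": "Enabled",
      "Filter": { "Prefix": "questdb-node-1/" },
      "Transitions": [{ "Days": 7, "StorageClass": "STANDARD_IA" }],
      "Expiration": { "Days": 30 }
    }
  ]
}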
Disaster recovery
Failure scenarios
| Node | Recoverable | Unrecoverable |
|---|---|---|
| Primary | Restart | Promote replica, create new replica |
| Replica | Restart | Destroy and recreate |
Network partitions
Temporary partitions cause replicas to lag and then catch up once connectivity is restored. This is normal operation.
Permanent partitions require emergency primary migration.
Instance crashes
If a crash corrupts a transaction, affected tables may be suspended on restart. You can skip the corrupted transaction and reload the missing data, or follow the emergency migration flow.
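A hedged sketch of the inspect-and-resume path, using standard QuestDB SQL (check the ALTER TABLE RESUME WAL reference for the exact skip syntax in your version):
-- list tables and their WAL status; suspended tables stop applying new transactions
SELECT * FROM wal_tables();
-- resume a suspended table; a FROM TXN clause can be added to skip past the corrupted transaction
ALTER TABLE my_table RESUME WAL;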
Disk failures
Symptoms: high latency, unmounted disk, suspended tables. Follow the emergency migration flow to move to new storage.
Migration procedures
Planned primary migration
Use when the current primary is healthy but you want to switch to a new one.
- Stop the primary
- Restart with replication.role=primary-catchup-uploads (a server.conf sketch follows this list)
- Wait for uploads to complete (the process exits with code 0)
- Follow emergency migration steps below
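For the catch-up run, the only change on the old primary is the role; the object store string and instance ID stay as they are:
# server.conf on the current primary, for its final catch-up run
replication.role=primary-catchup-uploads
Start the instance and let it run until it exits on its own; exit code 0 means all pending WAL data has been uploaded and it is safe to continue.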
Emergency primary migration
Use when the primary has failed.
- Stop the failed primary (ensure it cannot restart)
- Stop the replica
- Set replication.role=primary on the replica
- Create an empty _migrate_primary file in the installation directory
- Start the replica (now the new primary)
- Create a new replica to replace the promoted one
Data committed to the primary but not yet replicated will be lost. Use planned migration if the primary is still functional.
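A sketch of the promotion steps above, assuming a systemd service named questdb and an installation directory of /var/lib/questdb (both are assumptions; adjust paths and commands to your deployment):
sudo systemctl stop questdb
sudo sed -i 's/^replication.role=replica/replication.role=primary/' /var/lib/questdb/conf/server.conf
sudo touch /var/lib/questdb/_migrate_primary
sudo systemctl start questdb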
Point-in-time recovery
Restore the database to a specific historical timestamp.
- Locate a snapshot from before your target timestamp
- Create a new instance from the snapshot (do not start it)
- Create a _recover_point_in_time file containing:
  replication.object.store=<source object store>
  replication.recovery.timestamp=YYYY-MM-DDThh:mm:ss.mmmZ
- If using a snapshot, create a _restore file to trigger recovery
- Optionally configure server.conf to replicate to a new object store
- Start the instance
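An illustrative _recover_point_in_time file, reusing the example S3 store from earlier with an arbitrary target timestamp:
replication.object.store=s3::bucket=questdb-replication;root=questdb-node-1;region=eu-west-1;access_key_id=<access-key-id>;secret_access_key=<secret-access-key>;
replication.recovery.timestamp=2025-03-01T12:00:00.000Z
If the instance was created from a snapshot, also create the empty _restore file before starting it.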
Next steps
- Multi-primary ingestion - Scale write throughput with multiple primaries
- Tuning guide - Optimize replication performance