We Replaced Our Kafka Cluster with a Single SQLite File
It sounds like heresy in the world of big data, but it’s true. We tore down our complex, expensive, and operationally demanding infrastructure and did the unthinkable: we replaced our Kafka cluster with a single SQLite file. This move not only slashed our cloud bill by over 90% but also eliminated a whole class of infrastructure headaches, freeing up our engineering team to focus on building features instead of managing brokers. If you're tired of the operational overhead that comes with powerful but complex distributed systems, this is the story of how we found a simpler, more efficient path.
For years, Apache Kafka has been the undisputed king of event streaming. Its power to handle massive throughput is legendary. But with great power comes great complexity. We were running a multi-broker Kafka cluster to handle a core part of our data pipeline—a job queue for processing user-generated events. What started as a reasonable choice quickly spiraled into a major resource sink.
The Problem with Our Kafka Pipeline: Complexity Creep
Our journey with Kafka began, as many do, with a genuine need for a reliable message queue. It worked beautifully at first. Producers sent events, consumers processed them, and the world kept spinning. However, as our application grew, so did the burden of maintaining the system that was supposed to make our lives easier.
The operational complexity became our primary pain point. A production-ready Kafka setup isn't just a few brokers; it's an entire ecosystem you have to manage:
ZooKeeper: A critical dependency for coordination, adding another distributed system to monitor and maintain.
Broker Management: Scaling brokers, managing disk space, and handling leader elections became a constant chore.
Monitoring Overhead: We had a sprawling dashboard of metrics—lag, throughput, disk usage, CPU—and an alert fatigue problem to match.
Skyrocketing Costs: Our cloud bill for the managed Kafka service, plus the dedicated virtual machines, was one of our largest infrastructure expenses, despite our throughput being moderate at best.
We realized we were paying a heavy "complexity tax" for a system that was vastly over-engineered for our actual needs. We weren't processing millions of events per second; we were in the range of a few hundred to a thousand. We needed a change, and that led us to reconsider a tool many dismiss as a "toy" database.
Why SQLite? The Unlikely Hero of Data Engineering
When you think of robust data pipelines, SQLite is probably the last thing that comes to mind. It's often associated with mobile apps or simple desktop software. However, modern SQLite is a remarkably capable and battle-tested database. The decision to replace Kafka with SQLite hinged on understanding its modern capabilities and embracing simplicity as a core architectural principle.
The Power of WAL Mode
The game-changer for using SQLite in a concurrent environment is its Write-Ahead Logging (WAL) mode. In WAL mode, changes are written to a separate -wal file before being committed to the main database file. This provides several key benefits for our use case:
Concurrent Readers: While one process is writing to the database, other processes can continue to read from it without being blocked.
Atomic Commits: Transactions are still fully ACID-compliant, ensuring data integrity.
Performance: Writing to the log is extremely fast, as it's a sequential append operation.
This read/write model perfectly mimics a single-partition message queue. We have one writer process (the producer) and one or more reader processes (the consumers).
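Switching a database into WAL mode is a single pragma. A minimal sketch in Python (the file name here is illustrative); note that the journal mode is persistent, so once set, later connections to the same file also use WAL:

```python
import sqlite3

# Open (or create) the queue database and switch it to WAL mode.
conn = sqlite3.connect("queue.db")
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # "wal" once the switch succeeds
conn.close()
```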
Simplicity as a Feature, Not a Bug
The most compelling argument for this SQLite-based architecture is its radical simplicity. By embedding the database directly within our application, we eliminated entire layers of infrastructure:
No Network Latency: Writing an event is a local function call, not a network round-trip to a broker. This resulted in a massive reduction in write latency.
No Separate Servers: We no longer needed a dedicated cluster of machines. The database lives on the same VM as our application.
Simplified Deployment: Deploying our application now means just copying a binary. There's no external message queue to provision or configure.
This architectural shift from a distributed system to an application-embedded database was the key to unlocking massive operational savings.
The Architecture: How We Replaced Kafka with SQLite
Our new data pipeline is elegantly simple. It consists of a producer, a consumer, a single SQLite database file, and a crucial component for durability: Litestream.
Here's a breakdown of the implementation:
The "Queue" Table: We created a simple table in our queue.db file.
```sql
CREATE TABLE events (
  id INTEGER PRIMARY KEY,
  created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%d %H:%M:%f', 'now')),
  payload BLOB NOT NULL,
  processed_at TEXT
);

CREATE INDEX idx_events_unprocessed ON events(id) WHERE processed_at IS NULL;
```
The Producer: Our application's producer simply executes a SQL INSERT statement to add a new event to the events table. It's a single, atomic transaction that is incredibly fast.
```python
# Example Python Producer
import sqlite3

def publish_event(payload):
    conn = sqlite3.connect("queue.db")
    cursor = conn.cursor()
    cursor.execute("INSERT INTO events (payload) VALUES (?)", (payload,))
    conn.commit()
    conn.close()
```
The Consumer: The consumer process periodically queries the table for unprocessed events, using the primary key id as an offset.
```python
# Example Python Consumer
import sqlite3
import time

def process_events():
    conn = sqlite3.connect("queue.db", isolation_level=None)  # Autocommit mode
    cursor = conn.cursor()
    last_processed_id = 0  # Or load from a persistent state file
    while True:
        cursor.execute(
            "SELECT id, payload FROM events WHERE id > ? AND processed_at IS NULL ORDER BY id LIMIT 100",
            (last_processed_id,),
        )
        rows = cursor.fetchall()
        if not rows:
            time.sleep(1)
            continue
        for row_id, payload in rows:
            # ... do the actual work ...
            print(f"Processing event {row_id}")
            # Mark as processed
            cursor.execute(
                "UPDATE events SET processed_at = strftime('%Y-%m-%d %H:%M:%f', 'now') WHERE id = ?",
                (row_id,),
            )
            last_processed_id = row_id
```
Achieving Durability and Replication with Litestream
The obvious question is: "What happens if the server holding the SQLite file dies?" This is where Litestream comes in. Litestream is a standalone tool that runs in the background, continuously replicating the changes from our SQLite file to an S3-compatible object storage bucket.
It captures changes at the page level from the WAL file, providing near-real-time disaster recovery. If our primary node fails, we can spin up a new one and restore the database state from our S3 backup in seconds. This gives us the fault tolerance we need for a production system without the complexity of distributed consensus machinery like Raft or ZooKeeper.
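A minimal Litestream configuration for this kind of setup might look like the following (the database path and bucket name are illustrative, not our actual values):

```yaml
# /etc/litestream.yml
dbs:
  - path: /var/lib/app/queue.db
    replicas:
      - url: s3://my-backup-bucket/queue
```

On a fresh node, `litestream restore -o queue.db s3://my-backup-bucket/queue` rebuilds the database file from the replica.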
The Results: Quantifying the Switch from Kafka to SQLite
The impact of this architectural change was immediate and profound.
Cost Savings: Our monthly bill for a managed Kafka cluster was approximately $1,500. Our new infrastructure—a single application VM and the negligible cost of S3 storage for Litestream backups—costs less than $100 per month. That's a 93% reduction.
Performance Gains: Average event publish latency dropped from ~15ms (across the network to Kafka) to under 100 microseconds (a local file write).
Operational Simplicity: Our on-call alerts related to the data pipeline have dropped to zero. We no longer spend hours debugging broker issues, managing disk space, or worrying about Zookeeper. The system just works.
This migration from Kafka to an SQLite-based solution has been a clear win across the board.
Is This Approach Right for You?
This solution is not a universal replacement for Kafka. A single-node database has inherent limitations. However, a surprising number of use cases fall into a sweet spot where this approach excels.
Consider replacing your message queue with SQLite if:
Your write throughput is manageable for a single machine (10k+ writes/sec is often feasible).
You have a single primary writer for your queue or stream.
Your primary goals are low cost, low latency, and operational simplicity.
You don't need the advanced stream-processing ecosystem that Kafka provides.
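The single-machine throughput figure above is easy to sanity-check for yourself. A rough benchmark sketch in Python (absolute numbers depend heavily on your disk and hardware; this only demonstrates the measurement approach, and the file name is illustrative):

```python
import sqlite3
import time

# Rough benchmark: batched inserts into a WAL-mode table.
conn = sqlite3.connect("bench_queue.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute(
    "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload BLOB NOT NULL)"
)

n = 10_000
start = time.perf_counter()
with conn:  # one transaction for the whole batch
    conn.executemany(
        "INSERT INTO events (payload) VALUES (?)",
        ((b"x" * 64,) for _ in range(n)),
    )
elapsed = time.perf_counter() - start
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(f"{n} inserts in {elapsed:.3f}s ({n / elapsed:,.0f}/sec)")
conn.close()
```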
You should stick with Kafka or another distributed system if:
You require massive, multi-million event-per-second throughput.
You need true multi-writer capability across many nodes.
You rely heavily on features like Kafka Streams or ksqlDB for complex, real-time analytics.
Conclusion: Challenge Your Assumptions
Our journey to replace a Kafka cluster with a single SQLite file taught us a valuable lesson: always question the default solution. The modern tech landscape often pushes us toward complex, distributed systems by default, but sometimes the simplest tool is the most powerful. By leveraging the unsung power of modern SQLite and the brilliant simplicity of Litestream, we built a data pipeline that is cheaper, faster, and infinitely easier to manage.
Before you spin up your next distributed cluster, ask yourself: could a single file do the job? The answer might surprise you.
What complex systems in your stack could be simplified? Share your thoughts in the comments below!