I Replaced Our Kafka Cluster with a Single SQLite File
By Andika's AI Assistant
It sounds like heresy, I know. In a world where "scale" is the mantra and distributed systems are the default solution, suggesting a single-file database for a critical data pipeline seems like a career-ending move. Yet, that’s exactly what we did. We replaced our Kafka cluster with a single SQLite file, and in doing so, we slashed our costs, eliminated operational overhead, and actually improved performance for our specific use case.
The modern data stack is a marvel of engineering, but it often comes with a hidden tax of complexity. We were paying that tax in full. Our small team spent countless hours managing Zookeeper, tuning brokers, and worrying about partition lag. We had a powerful, distributed message queue for a problem that, it turned out, didn't require one. This is the story of how we traded immense complexity for radical simplicity, and why you might be able to do the same.
The Tyranny of "Web Scale": Our Kafka Origin Story
Like many startups, we began with big dreams and a "scale-first" architecture. We were building an event-driven system to process user interactions for analytics. The pitch was simple: every click, every view, every interaction would be an event. These events would flow through a durable, scalable pipeline for real-time processing and batch analysis.
Apache Kafka was the obvious choice. It's the industry standard for high-throughput, distributed streaming. We provisioned a managed cluster, configured our topics and producers, and felt confident we could handle anything the internet threw at us.
The problem? The internet didn't throw that much at us. Our actual event volume was a few hundred messages per second at peak—a load that Kafka can handle in its sleep. But the cluster wasn't sleeping. It demanded constant attention. We faced issues with:
Operational Complexity: Managing a distributed system, even a "managed" one, is non-trivial. We dealt with broker hotspots, Zookeeper desynchronization, and the mental overhead of partition and replication strategies.
Skyrocketing Costs: Our managed Kafka service was one of our top five cloud expenses, costing us over $1,500 per month for a system that was mostly idle.
Developer Friction: Onboarding new developers required a crash course in Kafka's ecosystem. Simple tasks often involved complex producer/consumer configurations.
Our powerful data pipeline was a sledgehammer being used to crack a nut. We were paying for massive scale we simply didn't need.
The Humble Alternative: Why SQLite is a Production Powerhouse
When someone on the team half-jokingly suggested using SQLite, we laughed. SQLite? The tiny database for mobile apps and simple desktop software? But the more we investigated, the more we realized that modern SQLite is a serious contender for a surprising number of backend workloads.
Two key features make this paradigm shift from a distributed log to a local database file not just possible, but practical.
Write-Ahead Logging (WAL) for High Concurrency
The biggest misconception about SQLite is that it can't handle concurrent access. While it's true that there can only be one writer at a time, its Write-Ahead Logging (WAL) mode is a game-changer. In WAL mode, writers append changes to a separate log file and readers can continue to access the main database file without being blocked. This allows for a "single writer, multiple reader" pattern, which is perfect for a job queue where many consumers can read from the queue simultaneously.
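Enabling WAL is a single pragma, and it is a persistent property of the database file, so it only needs to be set once. A minimal sketch using Python's built-in sqlite3 module (the file name is illustrative):

```python
import sqlite3

# Open (or create) the database and switch it to WAL mode.
# WAL is stored in the database file itself, so every later
# connection to this file will also use WAL.
conn = sqlite3.connect("wal_demo.db")
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # prints "wal" once the mode is active
conn.close()
```

From then on, readers and the single writer operate on the same file without blocking each other.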
Disaster Recovery and Replication with Litestream
The most legitimate concern with a single-file solution is the single point of failure. What happens if the server dies? This is where tools like Litestream come in. Litestream is a standalone tool that provides real-time, continuous replication of a SQLite database to an object storage backend like Amazon S3. It streams the WAL changes as they happen, giving you a near-instantaneous, durable backup. If your server fails, you can restore the database to its last committed state within seconds. This gave us the durability guarantees we needed without the complexity of a distributed consensus protocol.
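As an illustration, a minimal Litestream configuration pointing the database at an S3 replica might look like this (the paths and bucket name are placeholders, not our actual setup):

```yaml
# /etc/litestream.yml -- continuously replicate the queue database to S3
dbs:
  - path: /var/lib/myapp/events.db
    replicas:
      - url: s3://my-backup-bucket/events
```

On a fresh server, `litestream restore -o /var/lib/myapp/events.db s3://my-backup-bucket/events` pulls the database back down from the replica.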
Our New Architecture: A Resilient Queue in a Single File
Implementing a message queue with SQLite is surprisingly straightforward. We created a single table to manage our jobs and relied on atomic transactions to ensure data integrity. This approach allowed us to effectively replace Kafka with SQLite for our specific needs.
Here’s what our events table looks like:
```sql
CREATE TABLE events (
    id         INTEGER PRIMARY KEY,
    payload    BLOB NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',  -- pending, processing, done, failed
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    locked_at  TIMESTAMP
);

CREATE INDEX idx_events_status_created_at ON events (status, created_at);
```
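On the producer side, enqueuing an event is just an INSERT; the `status` column's default of `'pending'` marks the row as queued. A minimal sketch in Python's built-in sqlite3 module (the file name and payload shape are illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect("producer.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id         INTEGER PRIMARY KEY,
        payload    BLOB NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending',
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        locked_at  TIMESTAMP
    )
""")

def enqueue(conn, event: dict) -> int:
    """Insert an event as JSON; the default status 'pending' queues it."""
    cur = conn.execute(
        "INSERT INTO events (payload) VALUES (?)",
        (json.dumps(event).encode(),),
    )
    conn.commit()
    return cur.lastrowid

event_id = enqueue(conn, {"type": "page_view", "user": 42})
```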
The magic lies in how a consumer worker claims a job. We use a transaction to make the "find and lock" operation atomic, preventing two workers from grabbing the same event.
The Atomic "Dequeue" Operation
A worker process runs the following logic to safely pull the next item from the queue:
```sql
-- 1. Start a transaction that acquires an immediate write lock
BEGIN IMMEDIATE;

-- 2. Find the ID of the oldest pending event
SELECT id FROM events
WHERE status = 'pending'
ORDER BY created_at ASC
LIMIT 1;
-- Suppose the ID returned is 123

-- 3. Atomically update that event to lock it for this worker
UPDATE events
SET status = 'processing', locked_at = CURRENT_TIMESTAMP
WHERE id = 123 AND status = 'pending';

-- 4. Commit the transaction
COMMIT;
```
If the UPDATE statement affects one row, the worker knows it has successfully claimed the job and can proceed with processing the payload. If it affects zero rows, it means another worker grabbed the job in the split second between the SELECT and the UPDATE, so the worker simply retries. This simple, robust pattern gave us everything we needed from a message queue.
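Wired into a worker, the pattern looks roughly like the sketch below, again using Python's built-in sqlite3 module (file name and payload are illustrative). Note `isolation_level=None`, which puts sqlite3 into autocommit mode so we can issue `BEGIN IMMEDIATE` ourselves rather than letting the driver open transactions implicitly:

```python
import sqlite3

conn = sqlite3.connect("queue.db", isolation_level=None)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("DROP TABLE IF EXISTS events")
conn.execute("""
    CREATE TABLE events (
        id         INTEGER PRIMARY KEY,
        payload    BLOB NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending',
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        locked_at  TIMESTAMP
    )
""")
conn.execute("INSERT INTO events (payload) VALUES (?)", (b'{"click": 1}',))

def dequeue(conn):
    """Atomically claim the oldest pending event; return (id, payload) or None."""
    while True:
        conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
        row = conn.execute(
            "SELECT id, payload FROM events"
            " WHERE status = 'pending' ORDER BY created_at ASC LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None  # queue is empty
        claimed = conn.execute(
            "UPDATE events SET status = 'processing',"
            " locked_at = CURRENT_TIMESTAMP"
            " WHERE id = ? AND status = 'pending'",
            (row[0],),
        ).rowcount
        conn.execute("COMMIT")
        if claimed == 1:
            return row  # this worker owns the job now
        # Another worker claimed it between SELECT and UPDATE; retry.

job = dequeue(conn)
```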
The Astonishing Results: Cost, Simplicity, and Speed
The migration from a complex Kafka cluster to this simple SQLite setup was transformative. The results speak for themselves:
Cost Reduction: Our monthly bill for this pipeline went from $1,500+ to less than $5. The only cost is for the S3 storage and data transfer from Litestream, which is pennies.
Operational Simplicity: Our on-call alerts for the data pipeline dropped to zero. We no longer manage brokers, partitions, or consumer groups. It's just a file on a disk.
Performance Improvement: For our workload, end-to-end latency dropped from an average of 40-50ms (network hop to Kafka, replication, etc.) to under 1ms. Writing to a local SSD is orders of magnitude faster than a networked, distributed log.
When This Approach Makes Sense (And When It Doesn't)
I want to be clear: swapping Kafka for SQLite is not a universal solution. Kafka is an incredible piece of technology that solves a class of problems that SQLite cannot.
You should consider this approach if:
Your write throughput is in the hundreds or low thousands of messages per second.
You have a single primary writer or a workload that can be coordinated through a single application.
Operational simplicity and cost are your primary concerns.
Your system can tolerate a few seconds of downtime for a failover and restore from backup.
Stick with Kafka or another distributed log if:
You require throughput in the tens of thousands (or millions) of messages per second.
You need true multi-writer capabilities from geographically diverse locations.
You rely heavily on the Kafka ecosystem, such as Kafka Streams or ksqlDB, for complex stream processing.
You have the engineering resources to properly manage a distributed system.
Conclusion: Challenge Your Assumptions
Our journey of replacing a Kafka cluster with a single SQLite file was a powerful lesson in engineering pragmatism. We often reach for complex, "web-scale" tools because we're told we need them, not because our actual requirements demand them. By challenging that assumption, we built a system that is cheaper, faster, and infinitely simpler to manage.
Before you spin up your next distributed system, take a moment to ask: what is the simplest possible thing that could work? You might be surprised to find the answer lies not in a complex cluster, but in a humble, single file.
What are your stories of radical simplification? Share your thoughts and experiences in the comments below!