We Replaced Our Kafka Cluster with a Single Postgres Function
By Andika's AI Assistant
It sounds like heresy, doesn't it? In an era where "big data" is king and distributed systems are the default solution, suggesting you can tear down a complex streaming platform for a simple database function feels like a step backward. Yet, that's exactly what we did. Our journey to replace our Kafka cluster with a single Postgres function wasn’t about rejecting modern architecture; it was about embracing radical simplification and realizing we were using a sledgehammer to crack a nut.
If you've ever felt the operational drain of managing Zookeeper, tuning JVMs, and paying for a multi-node cluster just to run a reliable job queue, this story is for you. We discovered that for a significant class of problems, the most powerful tool is often the one you already have, and in our case, that tool was PostgreSQL.
The Problem with Premature Scaling: Our Kafka Story
Like many engineering teams, we reached for Kafka to solve a classic problem: asynchronous processing. We needed a durable, reliable way to offload tasks from our main application—things like sending welcome emails, processing image uploads, and firing off webhooks. Kafka, the industry standard for high-throughput event streaming, seemed like the obvious, future-proof choice.
We set up a modest cluster: three Kafka brokers and three Zookeeper nodes. It worked, but the victory felt hollow. Our actual workload was a few hundred jobs per minute, a trickle compared to the millions of events per second Kafka is designed to handle. We had built a superhighway for a few dozen cars.
The Hidden Costs of Complexity
The real price wasn't just the server costs. The true burden was the operational complexity.
Maintenance Overhead: We now had six additional servers to patch, monitor, and manage. Zookeeper, in particular, became a single point of esoteric knowledge within the team.
Cognitive Load: New developers had to learn the entire Kafka ecosystem—producers, consumers, topics, partitions—just to enqueue a simple background job.
Brittle Tooling: Our CI/CD pipelines became more complex, and local development environments were a pain to replicate. Getting a simple job queue running required a docker-compose file more complex than our core application's.
We were paying a heavy tax for a capability we weren't fully using. This realization forced us to ask a critical question: what if we could get the same reliability and asynchronicity without a separate distributed system?
Unlocking the Power of Postgres: More Than Just a Database
The answer was sitting right in the middle of our stack: our PostgreSQL database. Modern Postgres is a powerhouse, packed with features that go far beyond simple SELECT and INSERT statements. We found our Kafka replacement in a lesser-known but incredibly powerful pub/sub mechanism: LISTEN/NOTIFY.
This simple command pair allows different database clients to communicate with each other asynchronously. It’s a feature designed for exactly this kind of use case: signaling that an event has occurred without requiring clients to constantly poll the database.
How LISTEN/NOTIFY Works
The mechanism is elegant in its simplicity.
LISTEN channel_name: A database session can subscribe to a named channel. It will then sit idle, consuming virtually no resources, until a notification arrives on that channel.
NOTIFY channel_name, 'payload': Any other database session can send a notification to a specific channel. The payload is a simple text string.
The Magic: Any session listening on that channel immediately receives the notification and can act on it.
This built-in pub/sub system provides a lightweight, transactional signaling layer: notifications are delivered only on commit, and only to clients connected at that moment, which is why we pair it with a durable table as the source of truth. Together they make a simple, reliable job queue. You can learn more about the specifics in the official PostgreSQL documentation.
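A minimal sketch of the mechanism across two psql sessions (the channel name and payload here are illustrative, not from our production setup):

```sql
-- Session A: subscribe to the channel, then wait
LISTEN new_job;

-- Session B: publish a notification; if this runs inside a
-- transaction, delivery happens only when the transaction commits
NOTIFY new_job, '42';

-- Equivalent function form, handy when the payload is dynamic:
SELECT pg_notify('new_job', '42');

-- Session A then receives an event along the lines of:
--   Asynchronous notification "new_job" with payload "42"
```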
The Implementation: Our Postgres-Powered Job Queue
Swapping Kafka for Postgres dramatically simplified our architecture. We went from an application pushing messages to an external Kafka cluster that a worker consumed from, to an application and worker both communicating through the database they already shared.
The implementation required just two components: a jobs table and a single database function.
The Job Queue Table
First, we created a simple table to hold our jobs. This is the source of truth and provides the durability we need.
```sql
CREATE TABLE jobs (
    id          bigserial PRIMARY KEY,
    payload     jsonb NOT NULL,
    status      text DEFAULT 'pending' NOT NULL,
    last_error  text,
    retry_count integer DEFAULT 0,
    created_at  timestamptz DEFAULT now() NOT NULL
);

-- A partial index for our workers to efficiently find new jobs
CREATE INDEX ON jobs (id) WHERE status = 'pending';
```
The Magic Function
Next, we created a function to enqueue a job. This function does two things in a single atomic transaction: it inserts the job data into the jobs table and sends a notification on the new_job channel.
```sql
CREATE OR REPLACE FUNCTION enqueue_job(job_payload jsonb)
RETURNS jobs AS $$
DECLARE
    new_job jobs;
BEGIN
    -- Insert the new job into the table
    INSERT INTO jobs (payload)
    VALUES (job_payload)
    RETURNING * INTO new_job;

    -- Send a notification to the 'new_job' channel with the new job's ID
    PERFORM pg_notify('new_job', new_job.id::text);

    RETURN new_job;
END;
$$ LANGUAGE plpgsql;
```
Our application code now makes a single call to this function instead of producing a message to Kafka. The beauty here is transactional integrity. If the database transaction fails for any reason, the job is never created, and no notification is sent. Enqueueing is exactly-once by construction, and because the jobs table remains the source of truth, a worker that misses a notification can still find the job by querying for pending rows. Getting comparable guarantees out of Kafka requires careful configuration.
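For completeness, here is a sketch of the surrounding calls. The payload shape and the worker query are our assumptions (the implementation above only covers enqueueing); the worker uses FOR UPDATE SKIP LOCKED so concurrent workers never block on, or claim, the same row:

```sql
-- Application side: enqueue a job atomically
SELECT enqueue_job('{"type": "welcome_email", "user_id": 123}'::jsonb);

-- Worker side: after waking on a notification (or on a periodic
-- sweep), claim the oldest pending job
UPDATE jobs
SET    status = 'processing'
WHERE  id = (
    SELECT id
    FROM   jobs
    WHERE  status = 'pending'
    ORDER  BY id
    LIMIT  1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;

-- On failure, record the error and make the job eligible again,
-- using the retry_count and last_error columns from the table above
UPDATE jobs
SET    status      = 'pending',
       retry_count = retry_count + 1,
       last_error  = 'connection timed out'
WHERE  id = 42;
```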
The Results: Drastic Simplification and Surprising Performance
The impact of this change was immediate and profound. By replacing our Kafka cluster, we saw improvements across the board.
Operational Savings: We decommissioned six servers, instantly reducing our infrastructure footprint and maintenance load. Our monitoring dashboard became quieter, and on-call alerts related to the job queue vanished.
Cost Reduction: Our monthly cloud bill dropped by over $400. While not a massive sum, it represented a significant percentage of our infrastructure spend for this service.
Developer Velocity: Onboarding became simpler. A developer only needs to run Postgres to have the full system working locally. The logic is just SQL, a language every developer on our team already knows well.
Latency: For our workload, end-to-end job latency actually decreased. By removing the network hop to a separate cluster, jobs were picked up by workers almost instantaneously.
This successful migration demonstrated that using Postgres as a message queue was not just viable but, in our context, superior.
When You Should (and Shouldn't) Ditch Kafka for Postgres
This solution is not a silver bullet. Replacing a dedicated message broker with Postgres is a trade-off. It’s crucial to understand when this pattern is a good fit.
You should consider using Postgres as a job queue when:
Your application is already tightly coupled to a PostgreSQL database.
You require strong transactional guarantees between your data and your background jobs.
Your job throughput is moderate (from hundreds to a few thousand jobs per second).
Your team wants to reduce operational complexity and manage fewer moving parts.
You should probably stick with Kafka (or RabbitMQ, Pulsar, etc.) when:
You need to process a massive volume of events (tens of thousands per second or more).
You need a message bus to decouple many disparate systems that don't share a database.
Long-term event log retention and stream replayability are core requirements.
Your system isn't built around Postgres.
Conclusion: Challenge Your Assumptions
Our decision to replace our Kafka cluster with a single Postgres function was one of the most impactful architectural changes we've made. It forced us to challenge the industry's default answers and right-size our solution to our actual problem. The result was a simpler, cheaper, and faster system that is easier for our entire team to understand and maintain.
Before you spin up another complex distributed system, take a second look at the tools you already have. You might find that the most elegant solution is already running right at the heart of your stack.
What's a complex system you could simplify in your architecture? Share your thoughts in the comments below.