We Replaced Our Search Cluster with a Single DuckDB File
It sounds like heresy in the world of big data, but it's true: we replaced our search cluster with a single DuckDB file, and our stack has never been simpler, faster, or cheaper. For years, we operated under the assumption that any serious search or analytics workload required a distributed system—a sprawling cluster of Elasticsearch or OpenSearch nodes, meticulously configured and constantly monitored. The operational overhead was just the cost of doing business. But what if that assumption is wrong? This is the story of how we dismantled our complex infrastructure in favor of a solution that is orders of magnitude simpler and, for our use case, significantly more performant.
The Hidden Costs of Our "Scalable" Search Cluster
Before our migration, our architecture was textbook. We had a multi-node Elasticsearch cluster responsible for powering an internal analytics dashboard. It ingested millions of documents, indexed them, and served queries for our data science and operations teams. On the surface, it worked. But beneath the surface, the cracks were showing.
Spiraling Infrastructure Bills
A "small" three-node cluster isn't small on a cloud provider's bill. We were paying for dedicated instances with high memory and fast SSDs, plus the associated costs for data transfer and snapshots. Our monthly bill for the search infrastructure alone hovered around $2,000. As our data grew, the only answer was to add more nodes, pushing costs even higher. This model felt fundamentally broken for a workload that was 99% reads and only updated once per day.
The Operational Overhead Nightmare
The financial cost was only part of the problem. The operational burden was a constant drain on our engineering team. We dealt with:
Version Upgrades: A notoriously delicate process in distributed systems that always felt like open-heart surgery.
Plugin Maintenance: Keeping our analysis and security plugins up-to-date and compatible.
Monitoring and Alerting: A dedicated suite of tools just to tell us if the cluster was healthy, consuming even more resources.
This complexity was a tax on every new feature and a source of constant, low-level anxiety. We were spending more time maintaining the search cluster than actually leveraging the data within it.
Enter DuckDB: The Unlikely Hero for In-Process Analytics
Our journey to a simpler solution began when we started exploring modern data tools. We stumbled upon DuckDB, which describes itself as "an in-process SQL OLAP database system." Let's break that down:
In-Process: DuckDB runs inside your application. There is no server to install, manage, or connect to. It’s a library, not a separate service.
SQL: It speaks standard SQL, making it immediately accessible to anyone familiar with databases.
OLAP: It’s designed for Online Analytical Processing—meaning it's incredibly fast at running complex queries over large datasets.
The real magic is its architecture. DuckDB uses a columnar data format and vectorized query execution, allowing it to process billions of rows per second on a single machine. And its entire database, including tables and indexes, can be stored in a single file. This was the revelation. Could we just put all our data into one of these files and query it directly?
How We Migrated from Elasticsearch to a Single DuckDB File
The idea was radical, but the execution was surprisingly straightforward. Our goal was to replace the live Elasticsearch cluster with a daily-generated DuckDB file stored in a cloud bucket like S3.
Step 1: Data Preparation: From JSON to Parquet
First, we changed our data pipeline. Instead of pushing JSON documents into the Elasticsearch API, our ETL job now writes the data into the highly efficient Apache Parquet format. Parquet is a columnar storage file format, which is a perfect match for DuckDB's engine. This simple change made our data portable and decoupled it from any specific database technology.
Our daily pipeline now looks like this:
Pull raw data from source systems.
Clean, transform, and structure the data.
Write the final dataset as a partitioned Parquet file to S3.
Step 2: Building the Search and Analytics Layer
With our data in Parquet, creating the search engine was trivial. We wrote a small Python script that runs daily:
import duckdb

# Connect to a new or existing DuckDB file
con = duckdb.connect('search_analytics.db')

# The httpfs extension lets DuckDB read directly from S3
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Use DuckDB's Parquet reader to load all data from S3
# This is incredibly fast and memory-efficient
con.execute("""
    CREATE OR REPLACE TABLE documents AS
    SELECT * FROM read_parquet('s3://our-data-bucket/documents/*.parquet');
""")

# Install and load the full-text search (FTS) extension
con.execute("INSTALL fts;")
con.execute("LOAD fts;")

# Create a full-text index on the relevant columns
con.execute("""
    PRAGMA create_fts_index(
        'documents', 'doc_id', 'title', 'content'
    );
""")

con.close()
This script generates a single search_analytics.db file containing all our data and a pre-built full-text search index. This file, roughly 50GB for our dataset of 300 million documents, becomes our entire database.
Our application backend, a simple Python Flask API, now just downloads this file from S3 on startup. When a user performs a search, the API uses the DuckDB library to query the local file. The network call to a search cluster is replaced by a function call.
The Surprising Results: Performance, Cost, and Simplicity
After replacing our search cluster with this single DuckDB file, the results were staggering.
Performance: For our most common analytical queries (e.g., aggregations over date ranges with full-text filters), query latency dropped by an average of 40%. By eliminating network overhead and using a hyper-optimized columnar engine, queries that took 800ms on Elasticsearch now run in under 500ms directly on the application server.
Cost Savings: Our monthly infrastructure bill for this system plummeted from ~$2,000 to less than $50. The cost is now just S3 storage for the Parquet and DuckDB files, plus a marginal increase in CPU on our existing application servers during queries. That’s a 97% reduction in cost.
Simplicity: Our architecture diagram went from a complex web of nodes, load balancers, and monitoring agents to a box labeled "App Server" with an arrow pointing to a file in S3. There are no servers to patch, no shards to balance, and no daemons to restart. It's a version-controlled, stateless, and incredibly robust system.
Is This Approach Right for You?
This solution isn't a silver bullet. Replacing a distributed search cluster with DuckDB works best under specific conditions.
This approach is a great fit if:
Your data size is manageable on a single modern server (from gigabytes up to a few terabytes).
Your workload is read-heavy, with data updates happening in batches (e.g., hourly or daily).
Your primary need is fast analytical queries and full-text search, not real-time, per-document indexing.
Your team values operational simplicity and wants to reduce infrastructure dependency.
You should probably stick with a traditional search cluster if:
You require real-time indexing with sub-second latency for new documents.
Your dataset is in the multi-terabyte or petabyte range and requires horizontal scaling.
You rely heavily on advanced relevance tuning (custom analyzers, boosting, function scoring), which is far more mature in Lucene-based systems. DuckDB's FTS extension offers BM25 scoring, but little beyond it.
Conclusion: Challenge Your Infrastructure Assumptions
Our experience demonstrates a powerful trend in modern data engineering: the "unbundling" of the monolithic database. By pairing portable data formats like Parquet with powerful, embeddable query engines like DuckDB, you can build incredibly performant and resilient systems without the crushing weight of traditional infrastructure.
We successfully replaced our search cluster with a single DuckDB file, and in doing so, we traded immense complexity and cost for speed and simplicity. The next time you reach for a distributed system by default, ask yourself: could a simpler, more modern tool do it better?
Ready to simplify your data stack? Start by exploring how you can export one of your datasets to Parquet and run a few analytical queries with DuckDB. The results might just surprise you.