TL;DR
  • Under 50M rows, under 5 reporting users, under 20 queries/day: you do not need a warehouse yet
  • A read replica and Metabase will take you further than most people expect
  • The trigger is combining 3+ data sources, not just volume alone
  • Going too early is just as costly as going too late — and more common

There are two ways to get data infrastructure badly wrong. The first is building a data warehouse at 20 employees because a new hire used Snowflake at their last job. The second is still running off Google Sheets and a production database at 200 employees, wondering why nobody trusts the numbers.

Both failures are expensive. This guide tells you where you sit in your data journey and what infrastructure choices actually make sense at each stage.

What a data warehouse actually is (and why timing matters)

  • A database optimised for analytical reads, not transactional writes — columnar storage instead of row-based
  • Separates your analytical workload from your production system so neither slows the other
  • Built for scanning millions of rows across multiple tables — exactly the queries your app database handles badly
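To make the row-store/column-store distinction concrete, this is the shape of query a warehouse is built for — it touches every row but only three columns, so a columnar engine reads a fraction of the data a row-store would. Table and column names here are illustrative, not from any particular schema:

```sql
-- Aggregate scan over a large table: reads only created_at,
-- country, and amount_cents in a columnar store; a row-based
-- app database has to read every column of every row.
SELECT
    date_trunc('month', created_at) AS month,
    country,
    sum(amount_cents) / 100.0       AS revenue
FROM orders            -- assume tens of millions of rows
GROUP BY 1, 2
ORDER BY 1;
```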

Do you need a data warehouse now?

Don't need one yet
  • Fewer than 50M rows in any single table
  • Under 5 people running reports
  • Under 20 analytical queries per day
  • Data lives in 1-2 places
  • Slowest dashboard query runs in under 30 seconds on a read replica
Time to build one
  • Tables above 50-100M rows, or growing at 10M+ rows/month
  • 10+ people regularly pulling reports
  • 50+ analytical queries per day
  • 3+ data sources you need to combine (CRM, payments, app DB)
  • Analytical queries consistently over 60 seconds even with indexes
The clearest signal in practice: the moment you are manually joining a CRM export to a finance spreadsheet to a database dump to answer one question, you have a data integration problem that a warehouse solves cleanly.

What to use at each stage

1. Early stage — Postgres read replica + Metabase

Point Metabase at a read replica of your production database. Add materialised views for slow queries and refresh them nightly. Cost is near zero beyond the replica itself.
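A sketch of the materialised-view approach, with illustrative table and column names. One caveat: Postgres materialised views are created and refreshed on the primary (replicas are read-only), then replicate to the replica that Metabase reads from:

```sql
-- Precompute a slow aggregate once a night instead of on every
-- dashboard load (run on the primary; replicates to the replica).
CREATE MATERIALIZED VIEW daily_signups AS
SELECT created_at::date AS day, count(*) AS signups
FROM users
GROUP BY 1;

-- CONCURRENTLY avoids locking readers during the refresh,
-- but requires a unique index on the view.
CREATE UNIQUE INDEX ON daily_signups (day);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_signups;
```

Schedule the `REFRESH` from cron or your job runner; Metabase then queries `daily_signups` like any other table.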

Best for: under 10M rows, under 10 data users
2. Middle stage — DuckDB

DuckDB is an in-process analytical database that queries Parquet files, your Postgres replica, and CSV exports with zero infrastructure. A SQL-comfortable analyst can build a full reporting layer in a day at a fraction of the cost of a managed warehouse.
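A sketch of what that looks like in practice — joining a Parquet export, a CRM CSV, and the live Postgres replica in one DuckDB query. File paths, the connection string, and all table/column names are illustrative; the Postgres attach requires DuckDB's postgres extension:

```sql
-- Attach the read replica so its tables are queryable alongside files.
ATTACH 'dbname=app host=replica.internal' AS pg (TYPE postgres, READ_ONLY);

-- One query across three sources: Parquet payments, a CSV from the CRM,
-- and a users table in Postgres.
SELECT c.plan,
       count(DISTINCT c.account_id) AS paying_accounts,
       sum(p.amount)                AS revenue
FROM 'exports/payments/*.parquet' p
JOIN read_csv('crm_accounts.csv') c ON c.account_id = p.account_id
JOIN pg.public.users u             ON u.id = c.owner_id
GROUP BY c.plan;
```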

Best for: 10-100M rows, 2-4 data sources
3. Scale stage — BigQuery, Snowflake, or Redshift

At this point managed cloud warehouses earn their cost. BigQuery suits unpredictable query patterns (pay per query). Snowflake wins on multi-cloud flexibility. Redshift is the natural fit for AWS-heavy teams. MotherDuck (hosted DuckDB) is worth evaluating before jumping to enterprise pricing.
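Under pay-per-query pricing, how you lay tables out directly determines cost. A hedged BigQuery sketch (dataset and column names are made up) showing date partitioning, which limits each dashboard query to the partitions it actually needs:

```sql
-- Partition by event date so queries scan days, not the whole table;
-- clustering by user_id speeds up per-user filters within a partition.
CREATE TABLE analytics.events
PARTITION BY DATE(event_ts)
CLUSTER BY user_id AS
SELECT * FROM raw.events_export;

-- A 30-day DAU query now scans ~30 partitions instead of all history.
SELECT DATE(event_ts) AS day, count(DISTINCT user_id) AS dau
FROM analytics.events
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day;
```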

Best for: 100M+ rows, 5+ analysts, multiple integrated sources

When to bring in help

The inflection point: self-serve tooling gets you through the data-access stage. Bring in a specialist when the work shifts to data modelling — dimensional schemas, idempotent ELT pipelines, slowly changing dimensions. Getting the modelling layer wrong creates debt that compounds with every report built on top of it.
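To give one flavour of that modelling work: a type-2 slowly changing dimension keeps history instead of overwriting it, so last quarter's reports still reflect last quarter's reality. A minimal illustrative schema, not tied to any particular warehouse:

```sql
-- Type-2 SCD: a new row per change, old rows closed out rather
-- than updated in place. "Which plan was this customer on in March?"
-- becomes answerable.
CREATE TABLE dim_customer (
    customer_id  bigint,
    plan         text,
    valid_from   date,
    valid_to     date,     -- NULL while the row is current
    is_current   boolean
);
```

Designing when rows open and close, and keeping the pipeline that maintains them idempotent, is exactly the kind of work worth paying a specialist for.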