TL;DR
  • Under 50M rows, under 5 reporting users, under 20 queries/day: you do not need a warehouse yet
  • A read replica and Metabase will take you further than most people expect
  • The trigger is combining 3+ data sources, not just volume alone
  • Going too early is just as costly as going too late — and more common

There are two ways to get data infrastructure badly wrong. The first is building a data warehouse at 20 employees because a new hire used Snowflake at their last job. The second is still running off Google Sheets and a production database at 200 employees, wondering why nobody trusts the numbers.

Both failures are expensive. This guide tells you where you sit in your data journey and what infrastructure choices actually make sense at each stage.

What a data warehouse actually is (and why timing matters)

  • A database optimised for analytical reads, not transactional writes — columnar storage instead of row-based
  • Separates your analytical workload from your production system so neither slows the other
  • Built for scanning millions of rows across multiple tables — exactly the queries your app database handles badly
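To make the row-store/column-store distinction concrete, this is the shape of query a warehouse is built for — it touches every row but only three columns, so a columnar engine reads a fraction of the data a row-store would. Table and column names here are illustrative, not from any particular schema:

```sql
-- Aggregate scan over a large table: reads only created_at,
-- country, and amount_cents in a columnar store; a row-based
-- app database has to read every column of every row.
SELECT
    date_trunc('month', created_at) AS month,
    country,
    sum(amount_cents) / 100.0       AS revenue
FROM orders            -- assume tens of millions of rows
GROUP BY 1, 2
ORDER BY 1;
```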

Do you need a data warehouse now?

Don't need one yet
  • Fewer than 50M rows in any single table
  • Under 5 people running reports
  • Under 20 analytical queries per day
  • Data lives in 1-2 places
  • Slowest dashboard query runs in under 30 seconds on a read replica
Time to build one
  • Tables above 50-100M rows, or growing at 10M+ rows/month
  • 10+ people regularly pulling reports
  • 50+ analytical queries per day
  • 3+ data sources you need to combine (CRM, payments, app DB)
  • Analytical queries consistently over 60 seconds even with indexes
The clearest signal in practice: the moment you are manually joining a CRM export to a finance spreadsheet to a database dump to answer one question, you have a data integration problem that a warehouse solves cleanly.

What to use at each stage

1. Early stage — Postgres read replica + Metabase

Point Metabase at a read replica of your production database. Add materialised views for slow queries and refresh them nightly. Cost is near zero beyond the replica itself.
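A sketch of the materialised-view approach, with illustrative table and column names. One caveat: Postgres materialised views are created and refreshed on the primary (replicas are read-only), then replicate to the replica that Metabase reads from:

```sql
-- Precompute a slow aggregate once a night instead of on every
-- dashboard load (run on the primary; replicates to the replica).
CREATE MATERIALIZED VIEW daily_signups AS
SELECT created_at::date AS day, count(*) AS signups
FROM users
GROUP BY 1;

-- CONCURRENTLY avoids locking readers during the refresh,
-- but requires a unique index on the view.
CREATE UNIQUE INDEX ON daily_signups (day);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_signups;
```

Schedule the `REFRESH` from cron or your job runner; Metabase then queries `daily_signups` like any other table.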

Best for: under 10M rows, under 10 data users
2. Middle stage — DuckDB

DuckDB is an in-process analytical database that queries Parquet files, your Postgres replica, and CSV exports with zero infrastructure. A SQL-comfortable analyst can build a full reporting layer in a day at a fraction of the cost of a managed warehouse.
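A sketch of what that looks like in practice — joining a Parquet export, a CRM CSV, and the live Postgres replica in one DuckDB query. File paths, the connection string, and all table/column names are illustrative; the Postgres attach requires DuckDB's postgres extension:

```sql
-- Attach the read replica so its tables are queryable alongside files.
ATTACH 'dbname=app host=replica.internal' AS pg (TYPE postgres, READ_ONLY);

-- One query across three sources: Parquet payments, a CSV from the CRM,
-- and a users table in Postgres.
SELECT c.plan,
       count(DISTINCT c.account_id) AS paying_accounts,
       sum(p.amount)                AS revenue
FROM 'exports/payments/*.parquet' p
JOIN read_csv('crm_accounts.csv') c ON c.account_id = p.account_id
JOIN pg.public.users u             ON u.id = c.owner_id
GROUP BY c.plan;
```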

Best for: 10-100M rows, 2-4 data sources
3. Scale stage — BigQuery, Snowflake, or Redshift

At this point managed cloud warehouses earn their cost. BigQuery suits unpredictable query patterns (pay per query). Snowflake wins on multi-cloud flexibility. Redshift is the natural fit for AWS-heavy teams. MotherDuck (hosted DuckDB) is worth evaluating before jumping to enterprise pricing.
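Under pay-per-query pricing, how you lay tables out directly determines cost. A hedged BigQuery sketch (dataset and column names are made up) showing date partitioning, which limits each dashboard query to the partitions it actually needs:

```sql
-- Partition by event date so queries scan days, not the whole table;
-- clustering by user_id speeds up per-user filters within a partition.
CREATE TABLE analytics.events
PARTITION BY DATE(event_ts)
CLUSTER BY user_id AS
SELECT * FROM raw.events_export;

-- A 30-day DAU query now scans ~30 partitions instead of all history.
SELECT DATE(event_ts) AS day, count(DISTINCT user_id) AS dau
FROM analytics.events
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day;
```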

Best for: 100M+ rows, 5+ analysts, multiple integrated sources

When to bring in help

The inflection point: self-serve tooling gets you through the data-access stage. Bring in a specialist when the work shifts to data modelling — dimensional schemas, idempotent ELT pipelines, slowly changing dimensions. Getting the modelling layer wrong creates debt that compounds with every report built on top of it.
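To give one flavour of that modelling work: a type-2 slowly changing dimension keeps history instead of overwriting it, so last quarter's reports still reflect last quarter's reality. A minimal illustrative schema, not tied to any particular warehouse:

```sql
-- Type-2 SCD: a new row per change, old rows closed out rather
-- than updated in place. "Which plan was this customer on in March?"
-- becomes answerable.
CREATE TABLE dim_customer (
    customer_id  bigint,
    plan         text,
    valid_from   date,
    valid_to     date,     -- NULL while the row is current
    is_current   boolean
);
```

Designing when rows open and close, and keeping the pipeline that maintains them idempotent, is exactly the kind of work worth paying a specialist for.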