- Under 50M rows, under 5 reporting users, under 20 queries/day: you do not need a warehouse yet
- A read replica and Metabase will take you further than most people expect
- The real trigger is combining 3+ data sources, not volume alone
- Going too early is just as costly as going too late — and more common
There are two ways to get data infrastructure badly wrong. The first is building a data warehouse at 20 employees because a new hire used Snowflake at their last job. The second is still running off Google Sheets and a production database at 200 employees, wondering why nobody trusts the numbers.
Both failures are expensive. This guide helps you work out where you sit in your data journey and which infrastructure choices actually make sense at each stage.
What a data warehouse actually is (and why timing matters)
- A database optimised for analytical queries and reading, not writing — columnar storage, not row-based
- Separates your analytical workload from your production system so neither slows the other
- Built for scanning millions of rows across multiple tables — the kind of query your app database handles badly
Do you need a data warehouse now?
You probably don't need one yet if:
- Fewer than 50M rows in any single table
- Under 5 people running reports
- Under 20 analytical queries per day
- Data lives in 1-2 places
- Slowest dashboard query runs in under 30 seconds on a read replica
You probably do if:
- Tables above 50-100M rows, or growing at 10M+ rows/month
- 10+ people regularly pulling reports
- 50+ analytical queries per day
- 3+ data sources you need to combine (CRM, payments, app DB)
- Analytical queries consistently over 60 seconds even with indexes
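The thresholds above can be codified as a rough heuristic. This is a sketch using the cut-offs from this guide, not hard rules; the weighting (treating 3+ sources as the strongest trigger) is an assumption consistent with the point above:

```python
def needs_warehouse(max_table_rows, monthly_row_growth,
                    report_users, daily_queries, data_sources,
                    slowest_query_seconds):
    """Rough heuristic built from the thresholds in this guide."""
    signals = [
        max_table_rows >= 50_000_000,
        monthly_row_growth >= 10_000_000,
        report_users >= 10,
        daily_queries >= 50,
        data_sources >= 3,
        slowest_query_seconds > 60,
    ]
    # Combining 3+ sources is the strongest trigger on its own;
    # otherwise require at least two signals before saying yes.
    return data_sources >= 3 or sum(signals) >= 2

# A 20-person startup: one Postgres DB, 3 report users, light usage.
print(needs_warehouse(5_000_000, 500_000, 3, 15, 1, 10))  # → False
```

The point of the two-signal rule: any single threshold can be an outlier, but two or more usually means the read-replica setup is genuinely straining.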
What to use at each stage
Early stage — Postgres read replica + Metabase
Point Metabase at a read replica of your production database. Add materialised views for slow queries and refresh them nightly. Cost is near zero beyond the replica itself.
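The materialised-view pattern is simple: precompute the slow aggregate into its own relation and refresh it on a schedule. Postgres supports this natively (`CREATE MATERIALIZED VIEW` / `REFRESH MATERIALIZED VIEW`); the sketch below emulates it with Python's stdlib sqlite3 (which lacks materialised views) so it runs anywhere — table names and data are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, created_day TEXT);
    INSERT INTO orders VALUES
        (1, 100.0, '2024-01-01'),
        (2,  50.0, '2024-01-01'),
        (3,  75.0, '2024-01-02');
""")

def refresh_daily_revenue(con):
    # In Postgres this whole function is one statement:
    #   REFRESH MATERIALIZED VIEW daily_revenue;
    # Here we emulate it by rebuilding a summary table.
    con.executescript("""
        DROP TABLE IF EXISTS daily_revenue;
        CREATE TABLE daily_revenue AS
            SELECT created_day, SUM(amount) AS revenue
            FROM orders
            GROUP BY created_day;
    """)

refresh_daily_revenue(con)  # schedule this nightly (cron, pg_cron, etc.)
print(con.execute(
    "SELECT revenue FROM daily_revenue WHERE created_day = '2024-01-01'"
).fetchone()[0])  # → 150.0
```

Dashboards then query the small precomputed table instead of scanning the raw one on every page load.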
Best for: under 10M rows, under 10 data users
Middle stage — DuckDB
DuckDB is an in-process analytical database that queries Parquet files, your Postgres replica, and CSV exports with zero infrastructure. A SQL-comfortable analyst can build a full reporting layer in a day at a fraction of the cost of a managed warehouse.
Best for: 10-100M rows, 2-4 data sources
Scale stage — BigQuery, Snowflake, or Redshift
At this point managed cloud warehouses earn their cost. BigQuery suits unpredictable query patterns (pay per query). Snowflake wins on multi-cloud flexibility. Redshift is the natural fit for AWS-heavy teams. MotherDuck (hosted DuckDB) is worth evaluating before jumping to enterprise pricing.
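Pay-per-query pricing is worth a back-of-envelope check before choosing. The sketch below uses $6.25/TB scanned, BigQuery's published on-demand rate at the time of writing; treat all figures as assumptions to re-verify against current pricing:

```python
# Assumed on-demand rate in USD per TB scanned (re-check before relying on it).
PRICE_PER_TB_SCANNED = 6.25

def monthly_on_demand_cost(queries_per_day, avg_gb_scanned, days=30):
    """Rough monthly cost of pay-per-query analytics."""
    tb_scanned = queries_per_day * avg_gb_scanned * days / 1024
    return tb_scanned * PRICE_PER_TB_SCANNED

# 50 queries/day, each scanning ~2 GB after partitioning:
print(round(monthly_on_demand_cost(50, 2), 2))  # → 18.31
```

If the number that comes out is small, unpredictable query patterns are cheap and BigQuery's model fits; if it rivals a flat-rate commitment, the Snowflake/Redshift model (or MotherDuck) deserves a closer look.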
Best for: 100M+ rows, 5+ analysts, multiple integrated sources
When to bring in help
- Most of our data engineering and infrastructure design work is fixing warehouses built without the right schema and architecture from the start; retrofitting is almost always more expensive than building it correctly the first time
- Not sure where you sit? Contact us for a technical assessment: honest advice on data strategy, with no unnecessary infrastructure sold