Architecting Data Lakes for CRM + Market Data: Schema and Storage Choices

2026-02-17

Design a domain‑aware lakehouse for CRM and market feeds: schema, partitioning, storage backends, and cost controls for 2026+.

Why your CRM + market feeds lakehouse is probably costing you time and money

If you're responsible for analytics infrastructure, you know the tension: CRM systems generate relational, slowly changing customer records while commodity market feeds produce high‑velocity, high‑cardinality tick data. Putting both into a single analytical lakehouse promises unified insights — but without deliberate schema and storage choices you'll hit exploding costs, poor query performance and operational complexity.

The concise answer (read this first)

Design a domain‑aware lakehouse that keeps CRM and market feed data in distinct table families, uses an open table format (Iceberg/Delta/Hudi) for ACID and time travel, stores columnar Parquet (or Parquet v2) for analytics, partitions market feeds by time + symbol bucketing and CRM by tenant/region + activity date, and enforces lifecycle + tiering policies to control costs. Add query acceleration (materialized views, Z‑ordering/data skipping, caching) and multi‑region read replicas to meet latency needs without full cross‑region egress.

As of 2026, the market has coalesced around a few reliable patterns, outlined in the sections below.

High‑level architecture recommendations

Design the lakehouse as two coordinated domains:

  1. CRM domain — customer master, accounts, interactions, support tickets. Low velocity, highly relational, often SCD type 2.
  2. Market domain — ticks, quotes, trades, order books. Extremely high velocity, time‑series heavy, very large cardinality (symbols, exchanges).

Between them, create a small set of linking dimensions (dim_customer, dim_instrument, dim_time) so analysts can join without duplicating whole datasets.
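
For illustration, a hedged sketch of the kind of join this enables. The customer_instrument_activity table is built later in this article, and the column names (customer_sk, trade_count, notional_usd, activity_date) are assumptions rather than a fixed contract:

-- Join curated aggregates through the linking dimensions instead of raw ticks or raw CRM objects
SELECT
  c.customer_id,
  i.asset_class,
  SUM(a.trade_count)  AS trades,
  SUM(a.notional_usd) AS notional_usd
FROM customer_instrument_activity a
JOIN dim_customer   c ON a.customer_sk = c.surrogate_key AND c.current_flag
JOIN dim_instrument i ON a.symbol = i.symbol
WHERE a.activity_date >= DATE '2026-02-01'
GROUP BY c.customer_id, i.asset_class;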

Why separate domains?

Keeping domains distinct preserves optimized schemas and partitioning strategies for very different access patterns. You still get unified analytics through well‑designed joins and materialized aggregates rather than forcing a one‑size‑fits‑all schema that degrades both workload types.

Schema design patterns

Below are pragmatic schema patterns tailored to CRM and market feeds and then combined patterns for analyst consumption.

CRM schema (operational + analytical)

  • Use a normalized source layer mirroring CRM objects (customers, accounts, opportunities) stored in Avro/Parquet as raw snapshots or CDC streams.
  • Implement an SCD Type 2 dim_customer for historical analysis (customer_id, surrogate_key, valid_from, valid_to, current_flag, attributes...); a merge sketch follows this list.
  • Create a thin analytics layer of denormalized star schemas for reporting (fact_customer_activity, fact_sales).
  • Include tenant_id or region in primary key/partitions for multi‑tenant setups.
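
A minimal sketch of folding CDC changes into that SCD Type 2 table with Spark SQL MERGE (Delta and Iceberg both support it). The customer_changes staging table, its change_ts column, and the hash-based surrogate key are assumptions for illustration:

-- Step 1: close the currently active row for customers that changed
MERGE INTO dim_customer d
USING customer_changes c
  ON d.customer_id = c.customer_id AND d.current_flag = true
WHEN MATCHED THEN UPDATE SET
  valid_to = c.change_ts,
  current_flag = false;

-- Step 2: insert the new version as the current row
INSERT INTO dim_customer
SELECT
  c.customer_id,
  abs(hash(c.customer_id, c.change_ts)) AS surrogate_key,  -- simplistic surrogate for the sketch
  c.tenant_id,
  c.change_ts            AS valid_from,
  TIMESTAMP '9999-12-31' AS valid_to,
  true                   AS current_flag,
  c.attributes
FROM customer_changes c;

In production you would typically generate surrogate keys from a sequence or identity column and deduplicate the change stream before merging.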

Market feeds schema (high velocity)

  • Store raw feed events in append‑only files (Avro/JSON for raw; then convert to Parquet for analytics).
  • Design facts for ticks and trades: fact_ticks(symbol, exchange, timestamp_utc, bid, ask, last, volume, sequence_id, partition_key).
  • Keep instrument metadata in dim_instrument (symbol, isin, asset_class, venue, currency).
  • Use lightweight summary tables (minute_bars, hourly_ohlc) as pre‑aggregated materialized views for most analytics; a build sketch follows this list.
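
As an example of those summary tables, a hedged sketch that rolls one day of raw ticks into minute bars. min_by/max_by are available in recent Spark and Trino, and the minute_bars table and its columns are assumed:

-- Roll one day of ticks into OHLCV minute bars
INSERT INTO minute_bars
SELECT
  symbol,
  exchange,
  date_trunc('minute', ts)  AS bar_ts,
  min_by(last, sequence_id) AS open,   -- earliest tick in the minute
  MAX(last)                 AS high,
  MIN(last)                 AS low,
  max_by(last, sequence_id) AS close,  -- latest tick in the minute
  SUM(volume)               AS volume
FROM market_ticks
WHERE ts >= DATE '2026-02-16' AND ts < DATE '2026-02-17'
GROUP BY symbol, exchange, date_trunc('minute', ts);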

Unified analytics schema

Avoid joining raw tick streams to CRM records in ad‑hoc queries. Instead:

  • Create curated datasets: e.g., customer_instrument_activity that maps customer trades/engagement to instruments — derived via batch/stream processing and stored as compact Parquet tables (a sketch follows this list).
  • Use a dim_time table with multiple granularities (minute/hour/day) to speed truncation/aggregation.
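
The curated dataset above can be materialized as its own compact table. A sketch using Spark SQL CTAS on Iceberg, where fact_trades and its customer_sk/quantity/price columns are assumptions about how trades were already linked to customers upstream:

-- Derive a per-customer, per-instrument daily activity table from an assumed fact_trades table
CREATE TABLE customer_instrument_activity
USING iceberg
PARTITIONED BY (activity_date)
AS
SELECT
  t.customer_sk,
  t.symbol,
  CAST(t.trade_ts AS DATE)  AS activity_date,
  COUNT(*)                  AS trade_count,
  SUM(t.quantity * t.price) AS notional_usd
FROM fact_trades t
GROUP BY t.customer_sk, t.symbol, CAST(t.trade_ts AS DATE);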

Partitioning strategies: reduce scan and cost

Partitioning is the single biggest lever for query performance and cost reduction. But wrong partitions create hotspots and small files. Use these rules:

Market feeds — time + symbol bucketing

  • Primary partition: date (YYYY‑MM‑DD) and hour for intra‑day workloads. This keeps queries scanning a small number of files for recent windows.
  • Secondary bucketing: hash(symbol) into N buckets (e.g., 64 or 128) or use a dedicated bucket function supported by Iceberg/Delta. Bucketing distributes writes and avoids hotspots for extremely active symbols.
  • Store sequence_id and file rollup metadata to enable efficient incremental reads.

CRM — tenant/region + activity date

  • Primary partition: tenant_id or region for multi‑tenant systems, then activity_date (or month) to avoid many small partitions for low‑volume tenants.
  • For customer master records, avoid over‑partitioning — SCD tables are often compact and benefit more from clustering than deep partition trees.

General partitioning rules

  • Target file sizes of 256MB–1GB for Parquet files to maximize throughput and minimize per‑file overhead.
  • Use compaction jobs to merge small files (daily for market feeds, weekly for CRM, depending on volume); a maintenance sketch follows this list.
  • Leverage table‑format partitioning features (Iceberg partition transforms, Delta’s Z‑ordering) to support both partition pruning and clustering.
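
A hedged maintenance sketch for the compaction and clustering rules above. The first command assumes Databricks-style Delta OPTIMIZE; the second assumes Iceberg's Spark rewrite_data_files procedure, with placeholder catalog and database names:

-- Delta: compact small files and cluster for (symbol, ts) predicates
OPTIMIZE market_ticks ZORDER BY (symbol, ts);

-- Iceberg: rewrite small files toward ~512MB targets
CALL catalog.system.rewrite_data_files(
  table   => 'db.market_ticks',
  options => map('target-file-size-bytes', '536870912')
);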

Storage backend choices and tradeoffs

Choose the storage backend to match durability, cost, and access patterns.

Cloud object storage (S3 / GCS / Azure Blob) — default choice

  • Pros: low cost per TB, native integration with analytics engines, lifecycle tiering, cross‑region replication options.
  • Cons: egress costs between regions and from cloud provider to external compute, eventual consistency corner cases (mostly resolved by modern providers).
  • Best for: raw and cold/warm data, large historical market feeds, CRM snapshots.

Managed lakehouse storage (Databricks, Snowflake, cloud provider lakehouses)

  • Pros: integrated compute + metadata, optimizations like auto‑compaction and optimized file formats, easier operational overhead.
  • Cons: higher effective cost, vendor lock‑in considerations.
  • Best for: teams that prioritize developer velocity and can negotiate enterprise pricing.

Distributed file systems (HDFS / on‑prem)

  • Pros: local control, predictable egress, potentially lower data transfer costs inside datacenter.
  • Cons: operational complexity; not ideal for multi‑region scale.

Hybrid recommendations

Use cloud object storage for the lake, but deploy lightweight, regional read replicas (or object cache layers) for latency‑sensitive consumers. In 2026, many engines support read‑through caches and regional query routing, which reduces cross‑region egress.

Open table formats: why they matter in 2026

Iceberg, Delta, and Hudi give you ACID semantics, partition evolution, and manifest files that keep metadata scalable. For combined CRM + market use cases, they enable:

  • Transactional upserts for CRM SCDs without rewriting huge partitions.
  • Efficient time travel for audits (customer history, trade recon) and reproducible analytics; a query sketch follows this list.
  • Fine‑grained partition/manifest pruning so queries only touch relevant files — essential when ticks run into billions of rows per day.
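
A small time-travel sketch; syntax differs by engine and table format, and the timestamps are placeholders:

-- Delta (Spark SQL / Databricks): reproduce a report as of a prior point in time
SELECT COUNT(*) FROM dim_customer TIMESTAMP AS OF '2026-02-01 00:00:00';

-- Iceberg (Trino): audit ticks from an earlier snapshot
SELECT COUNT(*) FROM market_ticks FOR TIMESTAMP AS OF TIMESTAMP '2026-02-01 00:00:00 UTC';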

Query performance optimizations

Prioritize these actions:

  • Materialized aggregates (minute/hour/day OHLC, customer engagement summaries) to avoid scanning raw ticks for common reports.
  • Z‑ordering / clustering on (symbol, timestamp) for market data and (tenant_id, customer_id) for CRM to reduce IO for common predicates.
  • Data skipping indexes (min/max statistics, bloom filters) — Parquet row group stats and Iceberg metadata accelerate predicate pushdown; see the sketch after this list.
  • Vectorized query engines (Trino, Starburst, Spark with vectorized readers) and pushdown filters for aggressive CPU and IO savings.
  • Materialized views and query result caches for dashboarding layers; set TTLs to refresh at business-appropriate cadences for market data (e.g., 1–5 minutes for many use cases).
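
For the data-skipping bullet, one concrete knob is a Parquet bloom filter on the symbol column. The property below uses Iceberg's naming and should be checked against your table-format version:

-- Write Parquet bloom filters for symbol so point lookups can skip row groups
ALTER TABLE market_ticks SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.symbol' = 'true'
);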

Streaming ingestion patterns

In 2026, real‑time flows are standard. For pragmatic design:

  • Ingest raw market feed messages into a high‑throughput streaming system (Kafka, Pulsar) and write compact micro‑batches to the lake using an Iceberg/Delta writer or a streaming sink (Flink, Spark Structured Streaming); a Flink SQL sketch follows this list.
  • Write CRM CDC events (Debezium or cloud CDC) directly into an upsertable table format with change‑capture semantics to maintain SCD tables.
  • Separate the hot path (in‑memory or fast object store for the last N minutes) from the warm path (daily Parquet/Iceberg files) to balance latency and storage costs.
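
A Flink SQL sketch of the streaming hop from Kafka into the lake table. The topic, broker address, payload schema, and the assumption that market_ticks is already registered in Flink's catalog are all illustrative:

-- Source: raw tick messages on Kafka (schema and connector options are assumptions)
CREATE TABLE kafka_ticks (
  symbol STRING,
  exchange STRING,
  ts TIMESTAMP(3),
  bid DOUBLE,
  ask DOUBLE,
  `last` DOUBLE,
  volume BIGINT,
  sequence_id BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'market.ticks',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

-- Sink: continuous append into the lake table registered in the catalog
INSERT INTO market_ticks
SELECT symbol, exchange, CAST(ts AS TIMESTAMP(6)), bid, ask, `last`, volume, sequence_id
FROM kafka_ticks;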

Cost controls and operational guardrails

Data teams often forget that queries cost money. Use both architectural and policy controls:

Storage cost controls

  • Tiering: hot (last 7–30 days) in standard object storage, warm (30–365 days) in infrequent access, cold (archival) on Glacier/Archive classes.
  • Lifecycle rules: auto‑compact and move older tick files to cheaper tiers; keep small, frequently accessed aggregated tables in hot tier.
  • Compression: prefer ZSTD or Brotli over Snappy for Parquet v2; in many cases they give a better compression/CPU tradeoff (a sketch follows this list).
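
A compression sketch for the last bullet. The property shown is Iceberg's; Delta and plain Spark Parquet writers use settings such as spark.sql.parquet.compression.codec instead:

-- Switch new Parquet files for the tick table to ZSTD compression
ALTER TABLE market_ticks SET TBLPROPERTIES (
  'write.parquet.compression-codec' = 'zstd'
);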

Query cost controls

  • Enforce query quotas and concurrent query limits at the SQL engine level.
  • Tag queries by team and apply budgeting and alerting to prevent runaway usage.
  • Use cost‑based query planning and row‑count estimates; reject full table scans of market tick tables unless explicit approval is granted.

Network and egress savings

  • Prefer regionally colocated compute for large analytic jobs. Use read replicas for cross‑region readers instead of cross‑region scans.
  • Cache aggregated datasets at edge/CDN for dashboard consumers; precompute API outputs and serve from cache to avoid repeated heavy queries.

Operational playbook: daily/weekly tasks

  1. Daily: run micro‑compaction for market feed partitions, refresh materialized aggregates for dashboards, monitor failed ingestion offsets.
  2. Weekly: run larger compaction and clustering jobs for CRM domains, review small file metrics and manifest file growth.
  3. Monthly: evaluate cross‑region replication costs, revise lifecycle tiers, and review query cost reports to identify rogue queries.

Real‑world example (anonymized)

"A mid‑size commodities trading firm consolidated CRM (sales + client profiles) and 24/7 market feeds into one lakehouse. By re‑partitioning ticks to date + 128 symbol buckets, switching to Parquet v2 with ZSTD, and creating minute_bars + customer_activity materialized views, they reduced interactive query costs by ~30% and cut storage egress by routing reads to regional replicas."

This mirrors outcomes we've seen: focused partitioning, compaction discipline and targeted materializations reduce both latency and cost.

Practical examples: SQL & config snippets

Use these as templates. Adapt for your table format and engine.

Example: create an Iceberg table for market ticks (pseudo‑SQL)

-- Spark SQL with an Iceberg catalog; data files are written as Parquet
CREATE TABLE market_ticks (
  symbol STRING,
  exchange STRING,
  ts TIMESTAMP,
  bid DOUBLE,
  ask DOUBLE,
  last DOUBLE,
  volume BIGINT,
  sequence_id BIGINT
)
USING iceberg
-- daily partitions plus 64 symbol buckets to spread writes for hot symbols
PARTITIONED BY (days(ts), bucket(64, symbol))
TBLPROPERTIES ('write.format.default' = 'parquet');

Example: CRM SCD Type 2 (pseudo‑SQL)

-- Spark SQL on Delta Lake; tenant_id must be part of the schema to partition by it
CREATE TABLE dim_customer (
  customer_id STRING,
  surrogate_key BIGINT,
  tenant_id STRING,
  valid_from TIMESTAMP,
  valid_to TIMESTAMP,
  current_flag BOOLEAN,
  attributes MAP<STRING, STRING>
)
USING DELTA
PARTITIONED BY (tenant_id);

Schedule compaction/OPTIMIZE jobs after heavy upserts on Delta and Iceberg tables.

Migration checklist: moving CRM + market feeds into the lakehouse

  1. Inventory sources: CRM tables, CDC endpoints, market feed topics, instrument reference data.
  2. Define retention policy per domain and map to storage tiers.
  3. Proof‑of‑value: pick a single instrument group and a single tenant for a pilot; measure query latency and storage growth.
  4. Automate ingestion pipelines (CDC + streaming) and establish source‑to‑lake schema mapping.
  5. Implement table formats and initial partitioning; run compaction strategy and test common queries for cost and latency.
  6. Roll out cross‑region read replicas and caches for global teams — consider operational playbooks and regional routing patterns.

Common pitfalls and how to avoid them

  • Too many small files: set minimum file size thresholds and run frequent compaction jobs.
  • Over‑partitioning: avoid partitioning by high‑cardinality attributes like raw symbol for long time horizons — prefer bucketing.
  • Unbounded hot storage: set lifecycle policies and clear ownership for raw feed retention.
  • Ad‑hoc joins on raw ticks: create curated aggregates and mapping tables for common joins to protect performance.

Future proofing for 2027 and beyond

Expect further convergence: cloud providers will continue to add native Iceberg/Parquet v2 optimizations, engines will improve row‑level indexing and adaptive caching, and cross‑cloud replication patterns will become more cost‑efficient. Architect with open formats, avoid deep vendor lock‑in, and codify partitioning/compaction policies as code so you can adapt quickly.

Actionable takeaways (your immediate next steps)

  • Audit current file sizes and partition layout — set a target of 256MB–1GB per Parquet file.
  • Separate CRM and market domains in your catalog and apply domain‑specific partitioning.
  • Adopt an open table format (Iceberg/Delta/Hudi) for ACID, upserts, and time travel.
  • Implement lifecycle and tiering policies today to cap storage costs.
  • Build materialized aggregates for common dashboards to avoid raw tick scans.

Closing / Call to action

Combining CRM and commodity market feeds into a single lakehouse unlocks powerful analytics — but only if you design for domain‑specific schemas, disciplined partitioning and cost‑aware storage. If you want a ready‑to‑run checklist and partitioning templates tailored to your environment, download our architect's playbook or schedule a technical review with our engineers to map these recommendations onto your stack.
