Cyber security / Data & analytics · 9 months

Consolidating fragmented detection telemetry into one governed threat-intelligence warehouse

A managed detection and threat-intelligence provider

The challenge

The provider monitors its customers’ estates and sells the intelligence that comes out of it, so its whole value rests on answering questions that cross every sensor it runs: how an indicator seen on one customer’s network relates to activity across the rest of the fleet, over time. But that telemetry lived in a dozen disconnected stores: endpoint logs in one system, network and DNS data in another, enrichment feeds in a third, each with its own schema, retention and definition of seemingly shared terms. Every cross-source report was stitched together by hand, took analysts days, and was quietly contradicted by the next one.

Because the data was customer telemetry (often regulated, sometimes OFFICIAL-SENSITIVE), the ad-hoc extracts were also a governance liability: nobody could say with confidence which analyst could see which tenant’s data, where a given detection figure had come from, or whether a customer offboarding had actually purged their records everywhere they had been copied. A reporting problem was, underneath, an accountability problem.

The brief was not "build a dashboard". It was to stand up a single source of truth the SOC and the customers could trust (defensible to an analyst, to an auditor, and to a client’s own security team), and to hand it over as something the in-house team could run and extend without us.

Our approach

We started with the modelling, because that is where these programmes succeed or fail. We mapped the genuinely shared entities across the telemetry sources (assets, identities, indicators, detections) and built conformed dimensions and fact tables in the Kimball tradition, with slowly-changing dimensions wherever an indicator’s history had to be preserved rather than overwritten. The warehouse was layered explicitly: a raw landing zone that is never edited, a cleaned and conformed core, and analyst- and customer-facing marts, so that every published figure traces back, through version-controlled dbt models, to the exact source event that produced it.

For ingestion we treated the source systems’ change logs as the system of record. Rather than hammering live detection platforms with nightly extract queries, we used log-based change-data-capture (Debezium onto Kafka) to stream inserts, updates and deletes as they happened, and made every load idempotent so a pipeline could be safely re-run without double-counting. Each downstream table is a derived, rebuildable view of that change stream, which means a bad transformation is a redeploy, not a data-loss incident.

We designed for the schemas changing underneath us, because across a dozen sensors and third-party feeds they always do. Producer teams and vendors were held to explicit data contracts, pipelines used backward- and forward-compatible encodings (Avro and Parquet), and every schema migration was tested against real historic data before it ran. Data-quality assertions (Great Expectations) run on every load, so a feed silently changing a field’s meaning trips an alert instead of corrupting a quarter of detections.

Governance was engineered in, not bolted on. Records are classified and tagged by tenant at the point of ingestion, access is least-privilege and fully auditable through role-based controls with hard tenant isolation, lineage is captured automatically from source to dashboard, and retention and erasure are enforced by the pipeline, so a customer offboarding provably purges every derived table rather than relying on someone remembering the copies. The whole estate is defined in Terraform so it can be rebuilt from a known state.

What we delivered

A single governed Snowflake warehouse consolidating a dozen telemetry sources, layered into raw, conformed-core and analyst- and customer-facing marts.
Log-based change-data-capture ingestion (Debezium / Kafka) with idempotent, replayable loads and orchestrated dependencies in Airflow.
Version-controlled, tested dbt models giving end-to-end lineage from any published detection figure back to its source event.
Automated data-quality gates, classification and tenant-tagging at ingestion, least-privilege auditable access with hard tenant isolation, and pipeline-enforced retention and erasure.
Full infrastructure as code in Terraform, plus runbooks and a handover that left the in-house team owning the platform.

For the first time we can give the same answer twice and show exactly which sensor it came from. That is worth more than any dashboard.

VP of Threat Intelligence

Outcome

Cross-source threat questions that previously took analysts days of manual stitching are now answered against the warehouse in seconds, from numbers the SOC can defend.

Every published detection figure is traceable to source, and tenant access and customer erasure are provable, turning a standing audit risk into an answerable question.

The in-house team now operates and extends the warehouse independently, onboarding new telemetry sources against the same conformed model rather than starting another silo.

Working on something similar?

Discuss a programme