Cyber security and AI · 7 months
A self-hosted RAG pipeline that took threat attribution from hours to seconds
A multinational cyber threat intelligence firm

The challenge
The firm sells threat intelligence to enterprise and government clients, and its whole value rests on getting to the right answer faster than the adversary moves. It wanted to dramatically improve the speed, accuracy and scale at which emerging threats were identified, triaged and reported, and it had landed on large language models as the obvious lever.
The obvious lever was also the one it could not pull. The data is sensitive, the clients are regulated, and the analysts are sceptical by trade. Sending feeds out to a commercial cloud LLM API was a non-starter on every count. Whatever we built had to run on customer-controlled infrastructure, with provenance you could audit line by line, or it would not be used.
Underneath that sat a more ordinary, and more fundamental, problem: there was no trustworthy data foundation. Raw feeds arrived faster than anyone could read them, in dozens of incompatible formats, from sources whose schemas changed without warning; historic context lived in people’s heads; and a meaningful share of every shift went on chasing false positives rather than the incidents that mattered. No retrieval layer would be worth anything sitting on top of data nobody could trust or trace.
Our approach
Before any model, we built the data platform underneath. We stood up high-volume ingestion on Apache Kafka to absorb terabytes of heterogeneous live and historic feeds, normalised them into a consistent schema with explicit data contracts so a source changing its format tripped an alert rather than poisoning the store, and orchestrated the batch enrichment in Airflow. Structured records landed in PostgreSQL and the raw, immutable corpus in Parquet on S3: a layered, rebuildable store where every record carries the provenance of where it came from and when.
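A minimal sketch of the kind of data contract described above, using only the Python standard library. The field names, the `ContractViolation` exception and the `validate_and_stamp` helper are hypothetical illustrations rather than the firm's actual schema or code. The idea is simply that a record which no longer matches the agreed shape fails loudly, and every record that passes is stamped with where it came from and when.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical contract: field name -> required Python type.
INDICATOR_CONTRACT = {"indicator": str, "indicator_type": str, "first_seen": str}


class ContractViolation(Exception):
    """Raised when a record does not match the agreed schema."""


@dataclass(frozen=True)
class NormalisedRecord:
    payload: dict      # only the fields named in the contract
    source: str        # which feed the record came from
    ingested_at: str   # when it was ingested (UTC, ISO 8601)


def validate_and_stamp(raw: dict, source: str) -> NormalisedRecord:
    """Check a raw record against the contract and attach provenance."""
    missing = [k for k in INDICATOR_CONTRACT if k not in raw]
    if missing:
        raise ContractViolation(f"{source}: missing fields {missing}")
    wrong = [k for k, t in INDICATOR_CONTRACT.items() if not isinstance(raw[k], t)]
    if wrong:
        raise ContractViolation(f"{source}: wrong types for {wrong}")
    payload = {k: raw[k] for k in INDICATOR_CONTRACT}
    return NormalisedRecord(payload, source, datetime.now(timezone.utc).isoformat())
```

In a pipeline of this shape, the exception would be caught at the ingestion boundary and turned into an alert, so the offending record never reaches the store.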
On that foundation we designed and built a retrieval-augmented generation pipeline in Python and LangChain, with vector databases indexing embeddings of vast volumes of threat intelligence for similarity search. The point of it was plain-language interrogation: an analyst could ask a question of both historic and live data and get a grounded answer back, with the underlying sources, and their full lineage, attached rather than asserted.
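To make "attached rather than asserted" concrete, here is a toy sketch of source-carrying retrieval. It is an assumption-laden illustration, not the production pipeline: a bag-of-words vector stands in for a real embedding model, an in-memory list stands in for the vector database, and the names `embed`, `retrieve` and `build_prompt` are hypothetical rather than LangChain APIs. What it shows is the shape of the idea: every retrieved passage keeps its source identifier all the way into the prompt.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, corpus: list[dict], k: int = 2) -> list[dict]:
    """Return the k most similar documents, each with its source and score."""
    q = embed(question)
    scored = [
        {"text": d["text"], "source": d["source"], "score": cosine(q, embed(d["text"]))}
        for d in corpus
    ]
    return sorted(scored, key=lambda d: d["score"], reverse=True)[:k]


def build_prompt(question: str, hits: list[dict]) -> str:
    """Assemble a grounded prompt in which every passage carries its source id."""
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return (
        "Answer using only the sources below and cite them.\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Because the source identifiers travel with the text, an answer can be checked against the passages that were actually retrieved.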
To enrich the raw feeds before they ever reached an analyst, we developed automated classification and entity-extraction models, deployed behind scalable FastAPI microservices on Docker and Kubernetes. That gave the firm a horizontally scalable enrichment layer that ran entirely inside its own boundary.
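As a rough illustration of what entity extraction produces, the sketch below pulls a few common indicator types out of free text. It swaps in plain regular expressions for the trained models described above, so it is a simplified stand-in rather than the technique we deployed. The patterns and the `extract_entities` name are assumptions made for the example. In a service of the kind described, a function like this would sit behind an HTTP endpoint.

```python
import re

# Deliberately simple patterns for illustration; real feeds need more care
# (defanged indicators, IPv6, domains, other hash lengths and so on).
IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
SHA256 = re.compile(r"\b[a-fA-F0-9]{64}\b")
CVE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_entities(text: str) -> dict[str, list[str]]:
    """Return de-duplicated, sorted indicators found in the text, by type."""
    return {
        "ipv4": sorted(set(IPV4.findall(text))),
        "sha256": sorted({m.lower() for m in SHA256.findall(text)}),
        "cve": sorted({m.upper() for m in CVE.findall(text)}),
    }
```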
On top of the pipeline we built automated reporting: LLM-driven summarisation that generates the customer-facing PDF intelligence reports the firm sells, with structured formatting and visualisations rather than a wall of generated text.
Because this was AI in a regulated setting, the engineering discipline mattered as much as the models. We wired up end-to-end CI/CD with pytest and Cypress and built an evaluation harness around data provenance, model quality and responsible deployment, so a change to a prompt or a model was tested like any other change, not trusted on faith.
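One kind of check such a harness can run is sketched below, written as a pytest-style test. It assumes answers cite sources in square brackets, such as `[report-002]`. That citation format and the `provenance_check` helper are hypothetical conventions chosen for the example, not a description of the actual harness. The test fails if an answer cites nothing, or cites a source that was never retrieved.

```python
import re

# Assumed citation convention: source ids in square brackets, e.g. [report-002].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def cited_sources(answer: str) -> set[str]:
    """Collect every source id the answer cites."""
    return set(CITATION.findall(answer))


def provenance_check(answer: str, retrieved_ids: set[str]) -> dict:
    """Report whether the answer cites anything, and any citations with no retrieved source."""
    cited = cited_sources(answer)
    return {
        "has_citation": bool(cited),
        "unsupported": sorted(cited - retrieved_ids),
    }


def test_answer_cites_only_retrieved_sources():
    answer = "The activity overlaps with earlier intrusions [report-002]."
    result = provenance_check(answer, {"report-001", "report-002"})
    assert result["has_citation"]
    assert result["unsupported"] == []
```

Run under pytest alongside the ordinary unit tests, a check like this lets a prompt or model change that starts producing uncited or unsupported claims fail the build like any other regression.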
What we delivered
- A high-volume ingestion and storage platform (Kafka, Airflow, PostgreSQL and Parquet on S3) normalising terabytes of heterogeneous feeds behind explicit, alerting data contracts.
- A retrieval-augmented generation pipeline (Python, LangChain) over Pinecone and Weaviate, queryable in natural language with full source lineage.
- Classification and entity-extraction models enriching raw feeds, served by FastAPI microservices on Kubernetes.
- An automated reporting pipeline producing structured, customer-facing PDF intelligence reports.
- End-to-end CI/CD with pytest and Cypress, plus an evaluation harness tracking data provenance and model quality.
“We were told self-hosted LLMs were a research project, not a product. They shipped us one that our analysts actually trust.”
Outcome
A trustworthy, fully traceable data foundation now underpins every query: terabytes of once-siloed feeds normalised into a single governed store.
Analyst time-to-insight on attribution and historic-data queries fell from hours to seconds.
False-positive rates dropped significantly, freeing analysts to concentrate on the highest-priority incidents.
The firm now has a proven, end-to-end capability for self-hosted, domain-tuned LLMs over a governed data platform in a regulated, customer-controlled environment: the thing the market said could not be done safely.