Biotech · Pharma

From scattered assays
to a unified
discovery pipeline.

Bespoke data pipelines for proteomics-led drug discovery: connecting assays, public data and third-party tools. Automating everything that slows down the science.

Working with discovery teams at Matchpoint Tx, HotSpot Tx, Roche and others.

The status quo:

01
Experimental data scattered across instruments, formats and folders: every analysis starts with a hunt.
02
Off-the-shelf platforms force a choice: bend the science to fit the tool, or live with manual workarounds.
03
Scientists queue for data support, slowing cycle times when speed matters most.
04
Critical insights buried in spreadsheets, hard to revisit between experiments.

What we believe

Discovery data infrastructure
should fit the science,
not the other way around.

The teams pushing the frontiers of drug discovery don’t run a standard workflow. They run experiments that don’t exist in commercial software, blend public and proprietary data in non-obvious ways, and refine their methods continuously.

Off-the-shelf platforms force a choice: bend the science to fit the tool, or carry the cost of manual workarounds. We believe the right answer is neither.

A small, well-designed pipeline — built around the team’s actual experiments and the team’s actual mental model — pays back many times over. Scientists move from result to insight without waiting in line; science accelerates.

The cost of building infrastructure has collapsed. Discovery teams have the most to gain.

Case — Cloud Data Pipeline · Covalent Chemoproteomics

A unified pipeline for an ACE discovery platform

Matchpoint Therapeutics is a Boston-based biotech building precision covalent medicines through its Advanced Covalent Exploration (ACE) platform — integrating chemoproteomics, machine learning and covalent chemistry library evolution. A high volume of non-standard experiments and tight discovery cycles meant the platform team needed data infrastructure that could keep pace with the science, not constrain it.

Together with Matchpoint’s science and platform-technology teams, we conceptualised the pipeline in a series of workshops and then implemented it piece by piece in their Google Cloud environment. In-house assay results, public data and computational tools (including custom Fortran code) flow into a unified data lake and warehouse. Web-based assistants guide ingestion and quality control, custom dashboards reflect Matchpoint’s specific way of looking at the data, and external annotations layer in automatically.
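To make guided ingestion and quality control concrete, here is a minimal Python sketch of the kind of check an upload assistant might run before a file reaches the data lake. It is an illustration only: the column names, the rules and the `check_upload` function are assumptions made for this example, not Matchpoint’s actual standards or code.

```python
import csv
import io

# Hypothetical ingestion standard: every results file must carry these columns.
REQUIRED_COLUMNS = {"sample_id", "protein_id", "intensity"}


def check_upload(csv_text):
    """Return (valid_rows, problems) for the text of an uploaded CSV file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [], [f"missing columns: {', '.join(sorted(missing))}"]

    valid, problems = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            intensity = float(row["intensity"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: intensity is not a number")
            continue
        if not row["sample_id"] or not row["protein_id"]:
            problems.append(f"line {line_no}: empty identifier")
            continue
        valid.append({**row, "intensity": intensity})
    return valid, problems


# Example upload with one clean row and two rows that should be flagged.
upload = (
    "sample_id,protein_id,intensity\n"
    "S1,P04637,12.5\n"
    "S2,,7.0\n"
    "S3,Q9Y6K9,n/a\n"
)
valid, problems = check_upload(upload)
for message in problems:
    print(message)
```

Returning the problems as readable messages, rather than rejecting the whole file, lets the scientist fix an upload on the spot instead of queueing for data support.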

Pipeline architecture

Four-stage pipeline. Sources fan in on the left, expert-guided curation enforces standards, the data lake and warehouse run in your own cloud, and custom dashboards land each experiment in a decision.
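The four stages can be sketched in a few lines of Python. This is a deliberately simplified illustration using only the standard library: the `Measurement` record, the curation rule and the in-memory dictionary standing in for the lake and warehouse are all assumptions for the example, not the delivered architecture.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Measurement:
    source: str        # hypothetical labels such as "assay" or "public"
    protein_id: str
    intensity: float


def fan_in(*sources):
    """Stage 1: merge records from all sources into a single stream."""
    for records in sources:
        yield from records


def curate(records):
    """Stage 2: enforce simple standards and split accepted from rejected."""
    accepted, rejected = [], []
    for record in records:
        ok = bool(record.protein_id.strip()) and record.intensity > 0
        (accepted if ok else rejected).append(record)
    return accepted, rejected


def load(warehouse, records):
    """Stage 3: store curated records by protein (stand-in for lake and warehouse)."""
    for record in records:
        warehouse.setdefault(record.protein_id, []).append(record)
    return warehouse


def summarise(warehouse):
    """Stage 4: compute the kind of aggregate a dashboard would display."""
    return {pid: mean(r.intensity for r in rows) for pid, rows in warehouse.items()}


assay = [Measurement("assay", "P04637", 10.0), Measurement("assay", "", 5.0)]
public = [Measurement("public", "P04637", 20.0), Measurement("public", "Q9Y6K9", -1.0)]

accepted, rejected = curate(fan_in(assay, public))
warehouse = load({}, accepted)
summary = summarise(warehouse)
print(summary)
```

In a real deployment each stage would map onto managed cloud services; the point of the sketch is only that every stage has one narrow responsibility and a clear hand-off to the next.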

Three weeks from kick-off to PoC pipeline, eight weeks to delivery

Experiment-to-insight in real time, fully automated

Cross-functional teams work independently of data support

It is a pleasure working with the idalab team on our data and machine learning pipeline. They are an outstanding strategic partner, collaborating seamlessly with our science team. Fast, clear communication, structured — yet always happy to adapt ad hoc, if necessary. We are looking forward to continuing the collaboration.

Suresh Singh, PhD
Senior Vice President, Computational Sciences

Getting started

Pipeline Architecture Design Sprint

Three weeks to a decision-ready plan: a reference architecture, an implementation roadmap, and the organisational alignment for both.

Results

Reference architecture
A pipeline design fitted to your science, your data and your team’s workflow.
Implementation roadmap
Sprints with effort, risk and dependencies; ready to execute.
Team alignment
Every voice heard, requirements reconciled, decisions on the record.
Frontend prototype (optional)
A critical ingestion or analysis app fleshed out for pressure-testing.

How the three weeks run

Weeks 1–2 (Discovery)

Listen. Map. Revise.

Interviews and a synthesis workshop across science, platform and business. Data and tooling review. Requirement synthesis.

Week 3 (Design)

Draft. Validate. Read out.

Architecture sketch. Iteration with your teams. Roadmap drafting. Closing readout for sponsors and leadership.

Technology & Analysis Stack

From identification to decision making: the workflows and tools we have worked with over the last few years.

Identification & quantification

Label-free quantification (LFQ): MaxQuant · FragPipe · Spectronaut
Data-independent acquisition (DIA): DIA-NN · Spectronaut · MaxDIA
Isobaric labelling (TMT, iTRAQ): FragPipe-TMT · IsobarQuant · MaxQuant
PTM site localization: FragPipe-PTM · MSFragger · MaxQuant
Cross-linking MS (XL-MS): pLink · XlinkX · MeroX

Statistics & modelling

Differential abundance & testing: MSstats · limma · DEqMS · Perseus
Batch correction & normalisation: ComBat · RUV-III · vsn · median-MAD
Missing-value imputation: MissForest · MICE · MinDet
Time-course & longitudinal modelling: limma splines · lme4 · MEFISTO
ML for biomarker discovery: scikit-learn · XGBoost · SHAP · PyTorch

Downstream & integration

Pathway & gene-set enrichment: fgsea · clusterProfiler · Reactome · MSigDB
PPI & network analysis: STRING · IntAct · Cytoscape
Structure-prediction integration: AlphaFold 2/3 · ColabFold · ESMFold
Multi-omics integration: MOFA · mixOmics · DIABLO
Affinity & target deconvolution: SAINT · ProHits · mineCETSA · TPP-TR

Workflow integration

Custom dashboards: Streamlit · Dash · Plotly · Observable
Ingestion & QC web apps: Streamlit · FastAPI · Pydantic
Pipeline orchestration: Snakemake · Nextflow · Airflow · Prefect
ELN / external-data connectors: Benchling · UniProt · ChEMBL · custom REST
Notification & reporting: Slack · Quarto · Jupyter · papermill

How we work

Co-designed with science. We work alongside your scientists and platform team. Pipeline structure, dashboards and ingestion flows are shaped together, not handed over at the end.
Built for the way you actually work. Dashboards mirror your team's specific analyses. Ingestion interfaces enforce your standards. Nothing generic, nothing forced.
Deployed in your infrastructure. For maximum security and control, the pipeline lives in your cloud (Google Cloud, AWS, Azure or hybrid). Sealed off from external access where you need it.
You own what we build. From data structure to interface code, what we deliver is yours, including all IP. And we’re happy to support your internal team in taking over, or to help you hire one.

Articles

From volcano plots to biologically stratified effect plots in proteomics

Past the volcano plot lies something more useful: visualisations that put the biology, not the p-value, back at the center of the story.

Are volcano plots really the best tool to understand your data?

From calling them a mere starting point to a complete visual distraction, proteomics experts are challenging the status quo. Here is why.

Clients

Frequently asked questions

What does an engagement look like?

We support you through the entire process. Working closely with your science team, we conceptualise the pipeline. From there we implement it piece by piece in your cloud environment, keeping the science team in the loop throughout.

How long does it take to build such a pipeline?

Three to six weeks from the kick-off workshop to the delivery of the initial pipeline is a realistic baseline. From there, additional capability extends the pipeline incrementally in 2–4 week sprints.

What kind of data sources or computational tools can be integrated?

Anything goes — even custom tools written in Fortran can be brought into a 21st-century cloud pipeline.
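One common way to bring a legacy command-line tool into a Python pipeline is to wrap the compiled program in a thin function that feeds it input and parses its output. The sketch below assumes a tool that reads whitespace-separated numbers on standard input and writes numeric rows to standard output; the function names are hypothetical, and a small Python one-liner stands in for the compiled Fortran binary so the example runs anywhere.

```python
import subprocess
import sys


def run_legacy_tool(command, input_text, timeout_s=60):
    """Run an external command-line tool, passing input on stdin and returning stdout."""
    result = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


def parse_columns(stdout):
    """Parse whitespace-separated numeric rows into lists of floats."""
    return [
        [float(value) for value in line.split()]
        for line in stdout.splitlines()
        if line.strip()
    ]


# Stand-in for a compiled Fortran executable (assumption for this example):
# it doubles every number it reads. In practice `command` would be the path
# to the real binary plus its arguments.
stand_in = [
    sys.executable,
    "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(*(2 * float(x) for x in line.split()))",
]

rows = parse_columns(run_legacy_tool(stand_in, "1.0 2.5\n3.0 4.0\n"))
print(rows)
```

A wrapper like this can then be packaged in a container or a serverless function, so the rest of the pipeline only ever sees a plain function call.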

Can our own data engineering team operate this pipeline?

Definitely. We would love to team up with your data engineering team, and actively support phasing us out when the time is right. Everything we build is yours.

What technology do you use?

Choices are made together with your team to fit existing infrastructure and skills. A common stack:

Programming language: Python
Web-app development: Streamlit
Data lake: cloud object storage (Google Cloud Storage, AWS S3, Azure Blob)
Data warehouse: cloud-native (BigQuery, Snowflake, Redshift)
Web-app deployment: managed app platform (Cloud Run, App Engine, ECS)
Securing web-app access: identity-aware proxy or SSO
Integration of external tools: serverless functions (Cloud Functions, Lambda)

We’re cloud-provider agnostic: happy with Google, Amazon, Microsoft or Nebius.

How do you handle confidential and regulated data?

All work is covered by a mutual NDA and, where applicable, a data processing agreement. The pipeline runs entirely in your environment, under your security and access controls. We work with clients operating under HIPAA, GxP and equivalent regimes.

Let's talk

Benjamin Häusler

Senior Consultant

Benjamin leads our drug discovery data engineering work, partnering with biotech R&D teams to build the data infrastructure their science actually needs.

Let’s start a conversation.
