Biotech · Pharma

From scattered assays
to a unified
discovery pipeline.

Bespoke data pipelines for proteomics-led drug discovery: connecting assays, public data and third-party tools. Automating everything that slows down the science.

The status quo:

  1. Experimental data scattered across instruments, formats and folders: every analysis starts with a hunt.

  2. Off-the-shelf platforms force a choice: bend the science to fit the tool, or live with manual workarounds.

  3. Scientists queue for data support, slowing cycle times when speed matters most.

  4. Critical insights buried in spreadsheets, hard to revisit between experiments.

What we believe

Discovery data infrastructure
should fit the science,
not the other way around.

The teams pushing the frontiers of drug discovery don’t run a standard workflow. They run experiments that don’t exist in commercial software, blend public and proprietary data in non-obvious ways, and refine their methods continuously.

Off-the-shelf platforms force a choice: bend the science to fit the tool, or carry the cost of manual workarounds. We believe the right answer is neither.

A small, well-designed pipeline — built around the team’s actual experiments and the team’s actual mental model — pays back many times over. Scientists move from result to insight without waiting in line; science accelerates.

The cost of building infrastructure has collapsed. Discovery teams have the most to gain.

Case — Cloud Data Pipeline · Covalent Chemoproteomics

A unified pipeline for an ACE discovery platform

Matchpoint Therapeutics is a Boston-based biotech building precision covalent medicines through its Advanced Covalent Exploration (ACE) platform — integrating chemoproteomics, machine learning and covalent chemistry library evolution. A high volume of non-standard experiments and tight discovery cycles meant the platform team needed data infrastructure that could keep pace with the science, not constrain it.

Together with Matchpoint’s science and platform-technology teams, we conceptualised the pipeline in a series of workshops and then implemented it piece by piece in their Google Cloud environment. In-house assay results, public data and computational tools, including custom Fortran code, flow into a unified data lake and warehouse. Web-based assistants guide ingestion and quality control, custom dashboards reflect Matchpoint’s specific way of looking at the data, and external annotations are layered in automatically.

Pipeline architecture
Figure: Drug-discovery proteomics data pipeline, in four stages.
  • Stage 01, Ingest: Internal assay results · Public databases · External tools & code · Any format welcomed
  • Stage 02, Curate: Guided ingestion UI · Quality checks built in · Standards by design · Expert input where it matters
  • Stage 03, Store & Compute: Data lake → warehouse · Reproducible workflows · Your cloud, your control · Plug in any analysis
  • Stage 04, Explore & Decide: Custom dashboards · External annotations · Self-serve for scientists · Experiment → decision
Deployed in your cloud · Co-designed with your science team
Four-stage pipeline. Sources fan in on the left, expert-guided curation enforces standards, the data lake and warehouse run in your own cloud, and custom dashboards land each experiment in a decision.
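
To make the Curate stage concrete, here is a minimal sketch of what "quality checks built in" can look like in Python. It is an illustration only: the `AssayRecord` schema, the field names and the individual checks are assumptions made for this example, not the schema or rules of any client pipeline, and plain dataclasses stand in for whichever validation library a team actually chooses.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AssayRecord:
    """One row of an uploaded assay result (hypothetical schema)."""
    sample_id: str
    protein_id: str
    intensity: Optional[float]


def check_record(record: AssayRecord) -> list[str]:
    """Return human-readable QC problems; an empty list means the record passes."""
    problems = []
    if not record.sample_id.strip():
        problems.append("missing sample_id")
    if not record.protein_id.strip():
        problems.append("missing protein_id")
    if record.intensity is None:
        problems.append("missing intensity")
    elif record.intensity < 0:
        problems.append("negative intensity")
    return problems


def curate(records: list[AssayRecord]):
    """Split a batch into accepted records and rejected (record, problems) pairs."""
    accepted, rejected = [], []
    for record in records:
        problems = check_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected
```

Returning the reasons alongside each rejected record, rather than failing on the first error, is what lets a guided ingestion interface show a scientist exactly which rows need attention before anything reaches the data lake.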

01

Three weeks from kick-off to proof-of-concept (PoC) pipeline, eight weeks to delivery

02

Experiment-to-insight in real time, fully automated

03

Cross-functional teams work independently of data support

It is a pleasure working with the idalab team on our data and machine learning pipeline. They are an outstanding strategic partner, collaborating seamlessly with our science team. Fast, clear communication, structured — yet always happy to adapt ad hoc, if necessary. We are looking forward to continuing the collaboration.
Suresh Singh, PhD · Senior Vice President, Computational Sciences

Getting started

Pipeline Architecture Design Sprint

Three weeks to a decision-ready plan: a reference architecture, an implementation roadmap, and the organisational alignment for both.

Results

  1. Reference architecture

    A pipeline design fitted to your science, your data and your team’s workflow.

  2. Implementation roadmap

    Sprints with effort, risk and dependencies; ready to execute.

  3. Team alignment

    Every voice heard, requirements reconciled, decisions on the record.

  4. Frontend prototype (optional)

    A critical ingestion or analysis app fleshed out for pressure-testing.

How the three weeks run

Weeks 1 and 2 (Discovery)

Listen. Map. Revise.

Interviews and a synthesis workshop across science, platform and business. Data and tooling review. Requirements synthesis.

Week 3 (Design)

Draft. Validate. Read out.

Architecture sketch. Iteration with your teams. Roadmap drafting. Closing readout for sponsors and leadership.

Technology & Analysis Stack

From identification to decision making: the workflows and tools we have worked with over the last few years.

Identification & quantification
  • Label-free quantification (LFQ): MaxQuant · FragPipe · Spectronaut
  • Data-independent acquisition (DIA): DIA-NN · Spectronaut · MaxDIA
  • Isobaric labelling (TMT, iTRAQ): FragPipe-TMT · IsobarQuant · MaxQuant
  • PTM site localization: FragPipe-PTM · MSFragger · MaxQuant
  • Cross-linking MS (XL-MS): pLink · XlinkX · MeroX
Statistics & modelling
  • Differential abundance & testing: MSstats · limma · DEqMS · Perseus
  • Batch correction & normalisation: ComBat · RUV-III · vsn · median-MAD
  • Missing-value imputation: MissForest · MICE · MinDet
  • Time-course & longitudinal modelling: limma splines · lme4 · MEFISTO
  • ML for biomarker discovery: scikit-learn · XGBoost · SHAP · PyTorch
Downstream & integration
  • Pathway & gene-set enrichment: fgsea · clusterProfiler · Reactome · MSigDB
  • PPI & network analysis: STRING · IntAct · Cytoscape
  • Structure-prediction integration: AlphaFold 2/3 · ColabFold · ESMFold
  • Multi-omics integration: MOFA · mixOmics · DIABLO
  • Affinity & target deconvolution: SAINT · ProHits · mineCETSA · TPP-TR
Workflow integration
  • Custom dashboards: Streamlit · Dash · Plotly · Observable
  • Ingestion & QC web apps: Streamlit · FastAPI · Pydantic
  • Pipeline orchestration: Snakemake · Nextflow · Airflow · Prefect
  • ELN / external-data connectors: Benchling · UniProt · ChEMBL · custom REST
  • Notification & reporting: Slack · Quarto · Jupyter · papermill
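
As an example of the simplest entry under batch correction and normalisation, median-MAD scaling can be sketched in a few lines of standard-library Python. This is a hedged illustration, assuming one list of (typically log-transformed) intensities per sample; the function name `median_mad_normalise` is made up for this example, and some implementations additionally multiply the MAD by a consistency constant, which is not done here.

```python
from statistics import median


def median_mad_normalise(values: list[float]) -> list[float]:
    """Centre one sample on its median and scale by its median absolute deviation."""
    centre = median(values)
    # MAD: median of the absolute deviations from the sample median.
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the sample has no spread to scale by")
    return [(v - centre) / mad for v in values]
```

Because both the centre and the scale are medians, a single extreme intensity shifts the result far less than it would with mean and standard deviation.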

How we work

  1. Co-designed with science. We work alongside your scientists and platform team. Pipeline structure, dashboards and ingestion flows are shaped together, not handed over at the end.
  2. Built for the way you actually work. Dashboards mirror your team's specific analyses. Ingestion interfaces enforce your standards. Nothing generic, nothing forced.
  3. Deployed in your infrastructure. For maximum security and control, the pipeline lives in your cloud (Google Cloud, AWS, Azure or hybrid). Sealed off from external access where you need it.
  4. You own what we build. From data structure to interface code, what we deliver is yours, all IP included. We're happy to support your internal team in taking over, or to help you hire one.

Clients

Roche · Arkuda Therapeutics · Bayer · Biotronik · Charité · Helios Kliniken · HotSpot Therapeutics · Kintiga · Kymera Therapeutics · Matchpoint Therapeutics · Schwind eye-tech · Sofinnova Partners

Frequently asked questions

What does an engagement look like?
We support you through the entire process. Working closely with the science team, we conceptualise the pipeline. From there we implement it piece by piece in your cloud environment, keeping the science team in the loop throughout.
How long does it take to build such a pipeline?
Three to six weeks from the kick-off workshop to the delivery of the initial pipeline is a realistic baseline. From there, additional capability extends the pipeline incrementally in 2–4 week sprints.
What kind of data sources or computational tools can be integrated?
Anything goes — even custom tools written in Fortran can be brought into a 21st-century cloud pipeline.
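
One common way to bring a compiled command-line tool, Fortran or otherwise, into a Python pipeline is to call it as a subprocess. The sketch below assumes the tool reads its input from stdin and writes results to stdout; `run_legacy_tool` and the stand-in command are hypothetical names for this example and are not taken from any client project.

```python
import subprocess
import sys


def run_legacy_tool(command: list[str], input_text: str, timeout_s: float = 60.0) -> str:
    """Run a command-line tool, feed it input on stdin, and return its stdout."""
    result = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a compiled binary: a Python one-liner that upper-cases stdin.
    stand_in = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_legacy_tool(stand_in, "peptide\n"), end="")
```

In a cloud deployment the same wrapper would typically sit inside a container or serverless function, so the rest of the pipeline sees an ordinary Python call and never needs to know what language the tool was written in.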
Can our own data engineering team operate this pipeline?
Definitely. We would love to team up with your data engineering team, and actively support phasing us out when the time is right. Everything we build is yours.
What technology do you use?

Choices are made together with your team to fit existing infrastructure and skills. A common stack:

  • Programming language: Python
  • Web-app development: Streamlit
  • Data lake: cloud object storage (Google Cloud Storage, AWS S3, Azure Blob)
  • Data warehouse: cloud-native (BigQuery, Snowflake, Redshift)
  • Web-app deployment: managed app platform (Cloud Run, App Engine, ECS)
  • Securing web-app access: identity-aware proxy or SSO
  • Integration of external tools: serverless functions (Cloud Functions, Lambda)

We’re cloud-provider agnostic: happy with Google, Amazon, Microsoft or Nebius.

How do you handle confidential and regulated data?
All work is covered by a mutual NDA and, where applicable, a data processing agreement. The pipeline runs entirely in your environment, under your security and access controls. We work with clients operating under HIPAA, GxP and equivalent regimes.

Let's talk

Benjamin Häusler

Senior Consultant

Benjamin leads our drug discovery data engineering work, partnering with biotech R&D teams to build the data infrastructure their science actually needs.

Let’s start a conversation.