Nextflow and the Metadata Gap: How Quilt Bridges Science and Scale
By Kevin Moore, CEO, Quilt
I recently had the privilege of speaking at the 2025 Nextflow Summit, which allowed me to explore one of the most persistent challenges in life sciences data—metadata—and how Quilt uses the abstraction of data containers to help teams bridge the gap between the raw data researchers generate and the context they need to actually use it.
In most scientific environments, your data isn’t the problem – your metadata is.
You may have petabytes of raw sequencing data stored in S3, collected across multiple systems, platforms, and labs. But when it’s time to analyze it, share it, or reproduce a result six months later, things fall apart. Why?
Because critical metadata is still scattered across spreadsheets, internal databases, LIMS platforms like Benchling, or worse, forgotten in the minds of scientists who have since moved on.
At Quilt, we believe that fixing this isn’t just a matter of governance hygiene – it’s a strategic necessity. AI can’t scale on top of brittle, tribal-knowledge workflows. Reproducibility and reuse won’t happen when files are unlabelled, siloed, or stale.
That’s why we’re building a better foundation.
Why Metadata Matters More Than Ever
As I related in the presentation:
“We talk to biopharma companies all the time who say, ‘Yeah, I have a lot of FASTQs. I’m really not sure what they are.’”
It’s not just about having data. It’s about knowing where it came from, what it means, and how to use it. Metadata is what transforms raw files into reusable, trustworthy, queryable information.
When that metadata is scattered or incomplete, even the simplest analytical questions – like identifying cancer cell lines with high EGFR expression – require tedious manual effort.
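To make that concrete, here is a toy sketch of what the manual effort looks like once metadata actually lives alongside the data. This is not Quilt's API; `find_packages` and the manifest shape are hypothetical illustrations of filtering packages by metadata fields instead of chasing spreadsheets.

```python
def find_packages(packages, **filters):
    """Return packages whose metadata exactly matches every filter.

    `packages` is a list of dicts, each with a 'metadata' dict --
    a hypothetical stand-in for real package manifests.
    """
    return [
        p for p in packages
        if all(p["metadata"].get(k) == v for k, v in filters.items())
    ]

# Hypothetical manifests for two experiments
catalog = [
    {"name": "run-001", "metadata": {"cell_line": "A549", "egfr_status": "high"}},
    {"name": "run-002", "metadata": {"cell_line": "HeLa", "egfr_status": "low"}},
]

hits = find_packages(catalog, egfr_status="high")
```

With metadata attached at packaging time, the EGFR question becomes a one-line filter rather than a manual audit.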
Rather than asking researchers to stop and engineer schema-heavy warehouses upfront, we meet teams where they are – with messy, real-world data – and provide them with tools to wrap that data in logic, context, and structure.
Quilt Data Packages: Docker for Data
We designed Quilt Data Packages to bring structure, portability, and traceability to raw scientific data, just like Docker containers do for software.
- Keep the raw files in native formats (FASTQs, BAMs, OME-TIFFs, FCS)
- Annotate those files with rich metadata
- Track every version, link outputs to inputs, and maintain reproducibility by default
The result: self-contained, reproducible units of data and metadata that can be shared, searched, governed, and trusted.
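The mechanics can be sketched in a few lines of plain Python. This is a toy model of the container idea, not the `quilt3` library: a package couples logical file keys, their content digests, and metadata under a single top-level hash, so any change to either data or context yields a new version.

```python
import hashlib
import json

def make_package(files: dict, metadata: dict) -> dict:
    """Toy 'data package': map logical keys to content digests,
    attach metadata, and seal the whole thing with a top-level hash
    so every change produces a distinct, traceable version."""
    manifest = {
        "entries": {
            key: hashlib.sha256(data).hexdigest()
            for key, data in files.items()
        },
        "metadata": metadata,
    }
    manifest["top_hash"] = hashlib.sha256(
        json.dumps(manifest, sort_keys=True).encode()
    ).hexdigest()
    return manifest

# Raw files stay in native formats; metadata travels with them.
run = make_package(
    files={"reads/sample1.fastq": b"@SEQ_ID\nGATTACA\n+\n!!!!!!!\n"},
    metadata={"instrument": "NovaSeq", "cell_line": "A549", "gene": "EGFR"},
)
```

Because the hash covers both entries and metadata, relabeling a file is itself a versioned event, which is the property that makes lineage auditable.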
Instead of building the data warehouse first and re-engineering your workflows to fit it, the warehouse emerges as your current pipeline creates packages for every run and every experiment.
Automate Metadata Integration for Nextflow and Benchling
Many teams already use LIMS platforms, such as Benchling, or sequencing tools like BaseSpace. Quilt doesn’t replace those systems – it connects them.
Using our new Quilt Packaging Engine, we:
- Listen for events (e.g., file uploads to Amazon S3)
- Package those files automatically
- Pull metadata from systems like Benchling via API
- Keep raw data and context tightly linked in one reproducible unit
Support for standards like RO-Crate enables us to ingest and build packages automatically, converting your sequencing workflows into versioned, queryable datasets without the need for copy-pasting metadata from system to system.
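The event-driven flow above can be sketched as a small handler. The event shape mirrors the standard S3 notification format; `fetch_lims_metadata` is a hypothetical stand-in for a real Benchling API call, and the package record is a simplified illustration, not the Packaging Engine's actual output.

```python
def fetch_lims_metadata(sample_id: str) -> dict:
    """Hypothetical placeholder for a LIMS lookup (e.g. Benchling).
    In practice this would call the LIMS REST API."""
    return {"sample_id": sample_id, "assay": "RNA-seq"}

def handle_s3_event(event: dict) -> list:
    """For each uploaded object, pull LIMS context keyed off the
    object path and emit a package record linking data to metadata."""
    packages = []
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        sample_id = key.split("/")[0]  # assumes keys like '<sample>/<file>'
        packages.append({
            "files": [key],
            "metadata": fetch_lims_metadata(sample_id),
        })
    return packages

# A minimal S3-style notification event
event = {"Records": [{"s3": {"object": {"key": "S123/reads.fastq"}}}]}
result = handle_s3_event(event)
```

The point is the shape of the automation: the upload itself triggers packaging, so metadata is captured at the moment of ingest rather than reconstructed later.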
What This Means for Governance, Compliance, and AI Readiness
Metadata governance shouldn’t require perfection – it should prioritize capturing context, connecting it to the data, and keeping the two together.
By encapsulating data and metadata together:
- You improve data lineage and auditability
- You ensure compliance-readiness for GxP environments
- You make data reusable and collaborative across teams
- You remove barriers to scaling AI and ML workflows
Perhaps most critically, you reduce your dependency on “hero workflows”—those undocumented, one-off manual hacks that never scale.
Final Thought
Quilt Data Packages are to scientific data what Docker containers were to software: a clean abstraction that makes something chaotic finally tractable. They bring reproducibility, context, and composability to the scientific stack, ensuring your datasets are versioned, searchable, and compliant with the increasing demands of regulated environments.
If your team is struggling to find, reuse, or trust your data, Quilt can help.
Reach out, or better yet, come see a data package in action.