S3 is an excellent object store and an inadequate catalog. Most teams discover this the same way: one shared bucket becomes per-team buckets, then per-project prefixes, then a naming convention nobody follows, then a spreadsheet of where the good data lives, then the person who maintained the spreadsheet leaves the company.
By the time anyone uses the word "catalog," the problem isn't storage. It's findability, version control, and trust. The layer that fixes those problems is sometimes built in-house and sometimes purchased. Either way, the requirements are similar. The list below describes the ten capabilities we look for when evaluating any versioned S3 data catalog, written from the perspective of a biotech R&D team that has to support scientists, engineers, and QA from the same substrate.
S3 versioning gives you per-object history. That's necessary but not sufficient. What teams actually need is collection-level versioning: the ability to point at "v3 of the KRAS RNA-seq dataset" and get back the exact bundle of FASTQs, sample manifest, QC report, and README that existed at that revision, even if individual files have since been rewritten or moved.
The right primitive is a single cryptographic hash per dataset revision. The hash addresses the whole bundle. If anything inside changes, the hash changes. The hash is the thing you cite in your validation document or your paper.
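The mechanics are worth sketching, because the property falls out of one decision: hash a manifest of the bundle, not the files one by one. Below is a minimal sketch in plain Python against a hypothetical local bundle layout; Quilt calls the resulting value a package's top hash, though its actual manifest format differs from this toy version.

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Stream a single file through SHA-256."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def bundle_hash(root: Path) -> str:
    """Hash the bundle: a digest over the sorted (path, digest) manifest.

    Changing, adding, removing, or renaming any file inside the bundle
    changes the top-level hash.
    """
    manifest = sorted(
        (str(p.relative_to(root)), file_sha256(p))
        for p in root.rglob("*")
        if p.is_file()
    )
    h = hashlib.sha256()
    for rel_path, digest in manifest:
        h.update(f"{rel_path}\n{digest}\n".encode())
    return h.hexdigest()

# e.g. bundle_hash(Path("kras-001/rnaseq/v3")) -> the revision hash you cite
```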
"You can attach metadata" appears on every catalog vendor's website. The question is whether the catalog enforces a schema, so that tissue: liver doesn't appear alongside Tissue: Liver and tissue_type: hepatic in the same search results.
What works in production is enforcement at write time: keys and allowed values come from a schema, and metadata that violates it never gets registered in the first place.
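A minimal sketch of that check using the jsonschema library; the field names and allowed values here are illustrative, not a recommended vocabulary.

```python
from jsonschema import Draft7Validator

# Hypothetical metadata schema: keys and allowed values are fixed up front.
METADATA_SCHEMA = {
    "type": "object",
    "required": ["project", "assay", "tissue"],
    "properties": {
        "project": {"type": "string", "pattern": r"^[A-Z]+-\d{3}$"},
        "assay": {"enum": ["rna-seq", "wgs", "atac-seq"]},
        "tissue": {"enum": ["liver", "kidney", "lung"]},  # one spelling, lowercase
    },
    "additionalProperties": False,
}

validator = Draft7Validator(METADATA_SCHEMA)

def check_metadata(meta: dict) -> list[str]:
    """Return human-readable violations; an empty list means the metadata is registrable."""
    return [error.message for error in validator.iter_errors(meta)]

# check_metadata({"project": "KRAS-001", "assay": "rna-seq", "Tissue": "Liver"})
# -> ["'tissue' is a required property", "Additional properties are not allowed ('Tissue' ...)"]
```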
A scientist looking for "every RNA-seq dataset from the KRAS-001 project, on liver tissue, processed with nf-core/rnaseq 3.14 or later" should not be writing Athena queries. The catalog should index structured metadata, path components, and file content (README, manifest CSV, parquet column names), with sub-second response times across millions of objects. If search feels like Google, scientists will use it. If it feels like a niche internal tool, they will fall back to Slack.
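Under the hood that usually means a document index over structured metadata, path components, and extracted text. A rough sketch of the query such an index has to answer, written with opensearch-py against a hypothetical endpoint, index, and field names; the scientist types a search string and clicks facets, never this.

```python
from opensearchpy import OpenSearch

# Hypothetical search endpoint and index maintained by the catalog.
client = OpenSearch(hosts=["https://search.catalog.internal:9200"])

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"meta.project": "KRAS-001"}},
                {"term": {"meta.tissue": "liver"}},
                {"term": {"meta.pipeline": "nf-core/rnaseq"}},
                # "3.14 or later" needs a sortable version field; elided here
            ],
            # free text matched against README, manifest CSV, and column names
            "must": [{"match": {"content": "rna-seq"}}],
        }
    },
    "size": 20,
}

hits = client.search(index="catalog-packages", body=query)
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["package_name"], hit["_score"])
```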
Downloading a 12GB BAM file to see what's in it is how data reuse dies. A catalog that earns its keep lets a scientist inspect a dataset in the browser (README, sample manifest, column names, a handful of rows) without pulling a single large object to local disk.
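What that looks like from code is instructive even though the catalog should do it for you in the browser: with pyarrow and s3fs, only the Parquet footer and a handful of pages travel over the wire instead of the whole file. Bucket and key below are placeholders.

```python
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()  # uses your normal AWS credentials

# Placeholder path; the catalog would do this server-side and render it in the browser.
path = "s3://org-genomics/kras-001/rnaseq/v3/counts.parquet"

with fs.open(path, "rb") as f:
    pf = pq.ParquetFile(f)
    print(pf.schema_arrow)        # column names and types, read from the footer
    print(pf.metadata.num_rows)   # row count, also from the footer
    preview = next(pf.iter_batches(batch_size=5)).to_pandas()
    print(preview)                # first few rows, without fetching the full file
```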
The underrated feature of a real catalog is the property that makes a good Python package usable: a README inside the artifact, in the same revision, accessible with the same permissions. When the data changes, the README changes in the same commit. Markdown is the right format because the same file is readable by a new team member, an external collaborator, and an LLM. The format already works for code; there is no reason it shouldn't work for data.
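With Quilt's Python client the README is literally one more entry in the revision. A sketch with placeholder package, bucket, and path names; quilt3 API details vary by version.

```python
import quilt3

pkg = quilt3.Package()
pkg.set_dir("fastq/", "run_42/fastq/")                  # the data
pkg.set("sample_manifest.csv", "run_42/manifest.csv")
pkg.set("README.md", "run_42/README.md")                # the docs, in the same revision
pkg.set_meta({"project": "KRAS-001", "assay": "rna-seq", "tissue": "liver"})

pkg.push(
    "genomics/kras-001-rnaseq",
    registry="s3://org-genomics-packages",              # placeholder bucket
    message="Reprocessed with nf-core/rnaseq 3.14; see README for changes",
)
```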
Splitting scientists into "Python users" and "web users" with different tools is one of the fastest ways to fragment a data culture. Both audiences should reach the same packages, with the same metadata, through interfaces appropriate to them. In practice that means a Python client that can install a package the way pip installs a library, a web UI that exposes the same package with previews and search, and an API surface (including MCP) so AI agents inherit the same access patterns rather than a parallel set of their own.
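The consumption side, again with quilt3 and placeholder names; the web catalog and any MCP-connected agent see the same package and the same revision hash. Exact signatures vary by quilt3 version.

```python
import quilt3

# Pin the exact revision by hash: the same hash cited in the validation document.
pkg = quilt3.Package.browse(
    "genomics/kras-001-rnaseq",
    registry="s3://org-genomics-packages",
    top_hash="3f8e1c",               # placeholder; use the full revision hash
)
print(pkg.meta)                       # structured metadata captured at registration
pkg["README.md"].fetch("README.md")   # pull just the docs

# Or materialize the whole bundle, pip-install style:
quilt3.Package.install(
    "genomics/kras-001-rnaseq",
    registry="s3://org-genomics-packages",
    dest="./kras-001-rnaseq",
)
```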
"Where did this come from?" should be a one-click answer on every dataset. The information that needs to travel with the package includes the producing pipeline (and version, and parameters), the upstream packages by hash, the user or service account that registered it, and for derived data a clickable path back to raw inputs. Emitting OpenLineage events is useful for teams that have invested in broader data observability, but the lineage itself should live with the package, not in a parallel system.
Most catalogs let teams store anything. Catalogs that hold up under regulatory scrutiny let teams refuse to store certain things. The pattern is a workflow contract: an NGS output package must contain a sample manifest, a QC report, and a pipeline version metadata field, with specific schema requirements satisfied, before it can be registered as releasable. This is how the dataset that ends up in front of an FDA reviewer still has the artifacts it needs six months later. Inari runs this pattern in production with workflows defined in a Python configuration deployed alongside their Quilt instance.
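The generic shape of the gate is small enough to sketch in plain Python; the required artifacts and schema below are illustrative, and in Quilt the equivalent check is attached to the registration step itself rather than run by hand.

```python
from jsonschema import ValidationError, validate

REQUIRED_ENTRIES = {"sample_manifest.csv", "multiqc_report.html", "README.md"}
RELEASE_METADATA_SCHEMA = {
    "type": "object",
    "required": ["pipeline", "pipeline_version", "project"],
    "properties": {"pipeline_version": {"type": "string"}},
}

def can_register_as_releasable(entry_names: set[str], metadata: dict) -> tuple[bool, str]:
    """Refuse registration unless the workflow contract is met; return (ok, reason)."""
    missing = REQUIRED_ENTRIES - entry_names
    if missing:
        return False, f"missing required artifacts: {sorted(missing)}"
    try:
        validate(metadata, RELEASE_METADATA_SCHEMA)
    except ValidationError as e:
        return False, f"metadata violates release schema: {e.message}"
    return True, "ok"
```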
What separates a usable audit trail from a CloudTrail dump is whether a reviewer can answer who changed which dataset, which revision, and when, without reconstructing the story from raw events.
If the inspection story requires writing a sixty-line SQL query against CloudTrail logs, there isn't an audit trail yet. There are raw materials for one.
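The distance between the two is easy to feel from the raw APIs. With boto3 (bucket name is a placeholder), CloudTrail's LookupEvents returns management events recorded against the bucket, object-level data events only land in log files you would then query with Athena, and neither tells a reviewer which dataset revision changed or why.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudtrail = boto3.client("cloudtrail")

# Raw materials: management events that touched the bucket in the last week.
resp = cloudtrail.lookup_events(
    LookupAttributes=[{
        "AttributeKey": "ResourceName",
        "AttributeValue": "org-genomics-packages",   # placeholder bucket
    }],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
    MaxResults=50,
)
for event in resp["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username"))
# Still unanswered: which package, which revision, and why it changed.
```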
The last item is architectural, and it determines whether the previous nine are even possible. A real S3 catalog runs inside your AWS account, under your IAM, KMS, and CloudTrail. It stores data in your S3 buckets, in standard formats. It works with the AWS services your stack already uses (Glue, Athena, SageMaker, Bedrock, HealthOmics). And it lets your security team enforce posture using the controls they already audit. A catalog that requires shipping your data into someone else's account to be searchable is not a catalog of your data.
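A quick way to check that posture during an evaluation: the objects the catalog manages should be readable with your own credentials and encrypted with your own KMS key, verifiable with nothing fancier than boto3. Bucket and key below are placeholders.

```python
import boto3

s3 = boto3.client("s3")  # your credentials, your account

head = s3.head_object(
    Bucket="org-genomics-packages",
    Key="genomics/kras-001-rnaseq/fastq/sample_01_R1.fastq.gz",
)
print(head.get("ServerSideEncryption"))  # e.g. "aws:kms"
print(head.get("SSEKMSKeyId"))           # your key, not a vendor's

enc = s3.get_bucket_encryption(Bucket="org-genomics-packages")
print(enc["ServerSideEncryptionConfiguration"]["Rules"])
```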
We built the Quilt Data Platform and the Quilt Web Catalog against this list, so it would be odd to claim anything else. Rather than take our word for it, hold Quilt to the same ten requirements and judge the fit yourself.
Some teams should build this internally; cataloging is a strategic differentiator in a few specific contexts. The teams we've talked to who tried it without that strategic reason describe a similar arc: a thin Python wrapper in year one, a web UI in year two, a request for lineage and audit trails in year three that requires rewriting the metadata model, and by year four a small product owned by three engineers who would prefer to be doing science. If cost is the primary reason to build, the math rarely holds. If the catalog is the differentiator, build it deliberately.
To walk specific datasets through the list together, the Quilt team is happy to set up a working session: quilt.bio/demo.