Scientists and security people use the word "provenance" to mean different things. Scientists ask where the data came from. Security people ask whether it has been altered. In a regulated environment, both meanings have to work together. A defensible result depends on being able to show, on demand, that the dataset under it has not changed since it was signed and that it traces cleanly back to its inputs.
This post is an explainer of three primitives that combine to give that property on Amazon S3: cryptographic hashing and signing, S3 object versioning with Object Lock, and immutable audit logs. Each one is useful on its own. The composition is where the defensibility comes from. The post ends with what a catalog layer (Quilt, in our case) adds on top.
For any regulated dataset, three claims need to be supportable. First, integrity: the data has not been altered since it was registered. Second, attribution: it was produced by a specific party using a specific pipeline. Third, a temporal claim: it existed in this form on this date and in this state. A system that can support all three has provenance. A system that supports two of three does not.
A cryptographic hash function (SHA-256 is the right default in 2026) maps any input to a fixed-length fingerprint. The hash is deterministic, so the same input produces the same hash, and collision-resistant, so finding two different inputs with the same hash is computationally infeasible. Publishing the hash of a file today, then recomputing it a year later and comparing the two, is a near-perfect integrity check.
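The check is mechanical. A minimal sketch in Python's standard library, streaming the file so large inputs never need to fit in memory:

```python
import hashlib

def sha256_file(path: str) -> str:
    """Stream a file through SHA-256 and return the hex fingerprint."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()
```

Publish the hex digest alongside the file; anyone can rerun the function later and compare against the published value.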
What hashing handles well is single-file integrity. What it does not handle is the fact that a regulated record is usually a bundle: a sample manifest, the raw reads, the QC report, the pipeline parameters, and a README, all of which have to be intact together for the record to be meaningful. The natural extension is top-level hashing: hash every file in the bundle, then hash the sorted list of those hashes (plus the metadata) to produce one fingerprint for the whole package. That single hash is what you cite when claiming "version 3 of the KRAS-001 NGS package." Change one byte in any file and both the leaf hash and the top-level hash change.
# Conceptual model of a package hash: one fingerprint over files + metadata
import hashlib, json

sha256 = lambda b: hashlib.sha256(b).digest()
leaf_hashes = [sha256(open(f, "rb").read()) for f in package_files]
metadata_hash = sha256(json.dumps(metadata, sort_keys=True).encode())
package_hash = sha256(b"".join(sorted(leaf_hashes) + [metadata_hash])).hex()
Signing layers identity on top of hashing. The signer uses a private key they alone control to produce a signature over the hash, and anyone with the corresponding public key can verify two things: that the signature came from the key holder, and that the hash still matches the data. For 21 CFR Part 11 eSignatures, what gets signed is the package top-hash plus a signature manifest containing the signer's identity, UTC timestamp, and meaning of signature (approval, review, authorship). The signed bundle becomes a durable claim of authorship over a specific record state.
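What goes under the signature matters as much as the key. A sketch of the canonical bytes to sign, with illustrative field names (not a Part 11 schema); the digest of this canonical form is the input to whatever signing mechanism is in use, such as a KMS asymmetric key:

```python
import hashlib
import json
from datetime import datetime, timezone

def signature_manifest(package_hash: str, signer: str, meaning: str) -> bytes:
    """Canonical bytes to sign: the top-hash plus the signer's context."""
    manifest = {
        "package_hash": package_hash,
        "signer": signer,                 # identity of the key holder
        "meaning": meaning,               # approval | review | authorship
        "signed_at": datetime.now(timezone.utc).isoformat(),
    }
    # sort_keys makes the serialization canonical, so the signer and a
    # later verifier hash byte-identical input
    return json.dumps(manifest, sort_keys=True).encode()

digest = hashlib.sha256(
    signature_manifest("a1b2c3", "alice@example.com", "approval")
).digest()  # this digest is what the private key signs
```

Verification runs the same canonicalization, recomputes the digest, and checks it against the signature with the public key.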
Hashing proves after the fact that data has not changed. Storage controls handle the harder problem of refusing to let it change in the first place.
S3 Versioning makes every PutObject create a new version. Deletes become "delete markers" that hide the latest version without destroying prior ones, and restoration is a single API call. Versioning solves the accidental-overwrite problem and the ransomware problem cleanly. It does not solve the privileged-deletion problem: an admin with sufficient privileges can still permanently delete versions or disable versioning. Versioning gives continuity, not tamper-evidence.
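The single-call restoration works because a delete marker is itself a version: deleting the marker makes the prior version current again. A sketch against a boto3 S3 client (the client is passed in; bucket and key are placeholders):

```python
def undelete(s3, bucket: str, key: str) -> bool:
    """Undo a soft delete on a versioned bucket by removing the delete marker."""
    resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
    for marker in resp.get("DeleteMarkers", []):
        if marker["Key"] == key and marker["IsLatest"]:
            # Deleting the delete marker re-exposes the newest real version
            s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])
            return True
    return False  # no delete marker on top; nothing is hidden
```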
S3 Object Lock adds the tamper-evidence layer. Enabled on a versioned bucket, it offers two retention modes plus legal holds. Compliance mode prevents anyone (including the root account) from deleting or altering an object version before its retention period expires, and is appropriate for records with hard regulatory retention windows. Governance mode lets privileged accounts override the lock with a documented bypass, useful when an explicit admin escape hatch is needed. A legal hold is indefinite and removable only by a specific permission, intended for litigation or active investigations.
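Once Object Lock is enabled on the bucket, the bucket-wide default retention is one API call. A sketch with boto3 (client passed in; the seven-year window is a placeholder, not a recommendation):

```python
COMPLIANCE_LOCK = {
    "ObjectLockEnabled": "Enabled",
    "Rule": {
        # Placeholder retention window; set this to the regulatory requirement
        "DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}
    },
}

def apply_default_lock(s3, bucket: str) -> None:
    """Set a bucket-wide default so every new version lands already locked."""
    s3.put_object_lock_configuration(
        Bucket=bucket, ObjectLockConfiguration=COMPLIANCE_LOCK
    )
```

With a default in place, writers need no per-object retention headers; every PutObject inherits the compliance lock.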
The pattern that holds up for regulated data: compliance-mode Object Lock on a dedicated registry bucket, KMS-encrypted, versioned, replicated to a second region with the same posture. Records land once and stay for the retention period. That storage commitment is what backs up the cryptographic hashes. Saying "the hash of genomics/kras-001-rnaseq@a1b2c3 is X" only matters if the underlying bytes cannot be quietly retracted.
The third primitive is the temporal claim: proving when things happened and who did them. A log that can be modified by a sufficiently privileged user is not an audit log. It's a comment box. An audit log that lives in the same place as the data, under the same permissions, has the same problem.
The pattern that works on AWS: CloudTrail delivering API events to its own dedicated bucket, locked and versioned like the registry bucket but under separate permissions, paired with a catalog-level event log that is append-only and tamper-evident in its own right.
Why two logs? CloudTrail records what happened at the AWS API layer ("an object was put"). The catalog log records what happened at the records layer ("version 3 of the KRAS package was released by Alice"). Both matter. Both should be tamper-evident. Both should survive privileged users.
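Tamper-evidence at the records layer can be as simple as hash chaining: each entry carries the hash of the previous entry, so editing or dropping any entry breaks every hash after it. A minimal sketch with illustrative field names:

```python
import hashlib
import json

def append_event(log: list, event: dict) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev = log[-1]["entry_hash"] if log else "genesis"
    body = json.dumps({"event": event, "prev": prev}, sort_keys=True)
    log.append({"event": event, "prev": prev,
                "entry_hash": hashlib.sha256(body.encode()).hexdigest()})

def verify_chain(log: list) -> bool:
    """Recompute every link; any edit or deletion breaks the chain."""
    prev = "genesis"
    for entry in log:
        body = json.dumps({"event": entry["event"], "prev": prev}, sort_keys=True)
        if entry["prev"] != prev or \
           entry["entry_hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = entry["entry_hash"]
    return True
```

Anchoring the latest entry hash somewhere outside the log (a locked bucket, a signed manifest) extends the guarantee to truncation of the tail as well.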
None of the primitives is sufficient alone. The useful property is what emerges when they compose. A chain of custody for a single regulated dataset under this architecture: the package is hashed file by file and rolled up into a top-hash; the top-hash and its signature manifest are signed by the releasing party; the bytes land in a versioned, compliance-locked, replicated registry bucket; CloudTrail records the API-level writes; and the catalog records the release event, with who released it and when.
Each step makes a claim backed by a primitive that does not depend on the others to be trustworthy. If a primitive fails (an admin disables CloudTrail for a window, say), the other layers still hold, and the gap itself is evidence in the catalog's event log.
Three patterns reliably undermine an otherwise sound posture. Logs stored in the same place as the data, under the same permissions, are not logs. S3 versioning treated as tamper-evidence creates a false sense of security; it protects against accidents, not against privileged deletion, which is what Object Lock exists for. And hashing files without hashing the package leaves dataset-level integrity uncertain; "version 3" remains a label rather than a verifiable claim.
The primitives above are AWS-native. They can be composed by hand. The Quilt Data Platform exists to make that composition operational without rebuilding it for every workflow: packages with top-level hashes, signature manifests tied to releases, and a catalog-level event log that records the records-layer history alongside the CloudTrail trail.
The right starting point for any provenance posture on S3 is the three primitives. The catalog comes after. To walk one regulated dataset through the full chain together, the Quilt team is happy to do a working session: quilt.bio/demo.