Next-generation sequencing (NGS) has revolutionized our understanding of molecular mechanisms underpinning human disease, driven the development of clinically-relevant biomarkers, and fueled the discovery of novel drug targets. Despite the ever-growing volume of NGS data generated, tools to efficiently access, share and query sequencing data lag behind technological advances. Data inaccessibility represents a persistent barrier hindering the productivity of interdisciplinary research teams, in both academia and industry. This article sheds light on the common challenges encountered by researchers when generating, processing, sharing, and analyzing genomic data, while also exploring Quilt as a solution to make large-scale genomic data more accessible to all researchers, irrespective of their programming abilities.
Unlocking the potential of NGS data can be a difficult endeavor, one that routinely presents researchers with a unique array of challenges. As detailed below, these challenges arise from the immense scale and complexity of NGS data, and from the deep domain knowledge and specialized tools needed to process and analyze it effectively.
NGS data is big! Each sequencing run produces millions to billions of short DNA fragments, called reads. When dealing with multiple samples or replicates, NGS datasets can often balloon to terabytes of raw sequencing data. The sheer volume of data generated, even for average-sized profiling studies, requires careful management to avoid overwhelming computational resources and storage limits. Consequently, NGS data is often stored and accessed via high-performance computing (HPC) or cloud solutions, which introduces a prominent accessibility barrier for team members less versed in programming.
NGS data is often stored in boutique file formats unique to the genomics community and requires specialized applications to interact with the data. For example, NGS pipelines often store results in formats such as BAM (Binary Alignment/Map), a compressed binary format, or VCF (Variant Call Format), a structured text format that is frequently compressed. These formats are optimized for computational efficiency but are not user-friendly for data exploration. Accessing and extracting specific information from these files typically requires command-line tools and scripts, which presents a significant barrier for scientists who are not proficient in command-line operations or comfortable navigating complex data structures.
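To give a sense of the scripting involved, here is a minimal Python sketch that pulls the standard columns out of VCF text and keeps the records that pass filters. The two records are invented for illustration (they are not CCLE calls), and the `parse_vcf_lines` helper is hypothetical; real analyses typically rely on dedicated tools such as bcftools rather than hand-rolled parsing.

```python
from typing import Dict, Iterable, Iterator

# Invented example records for illustration only (not real CCLE calls).
EXAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t1000\t.\tG\tA\t50\tPASS\tDP=42
chr2\t2000\t.\tC\tT\t12\tLowQual\tDP=7
"""


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per variant record, keyed by the VCF header columns."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            header = line.lstrip("#").split("\t")
            continue
        if header is None:
            raise ValueError("VCF record found before the #CHROM header line")
        yield dict(zip(header, line.split("\t")))


records = list(parse_vcf_lines(EXAMPLE_VCF.splitlines()))
passing = [r for r in records if r["FILTER"] == "PASS"]
print(f"{len(passing)} of {len(records)} records pass filters")
```

Even this toy version has to know about header conventions and column order, which is the kind of format knowledge that keeps non-programmers away from the raw files.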
Processing of NGS data involves multiple interconnected stages, including quality control, alignment to reference genomes, variant calling, and downstream analysis. Each of these stages generates intermediate files and logs. These files are often scattered across different directories and storage locations, making it difficult to track the entire data workflow and pinpoint specific files — particularly when conducting a search long after the initial pipeline execution.
Ensuring the reproducibility of analyses is a fundamental aspect of rigorous scientific research. However, maintaining a precise record of particular data versions, packages, and analysis pipelines used in a given study is challenging. Traditional methods often rely on ad hoc documentation and file naming conventions, making it difficult to establish clear data versioning and lineage. In the biopharmaceutical industry, where precision is paramount, reproducibility and versioning are critical throughout the drug development lifecycle, from research inception to clinical trials and post-market monitoring.
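One common way to make data versions explicit is to record a content hash for every file, so that any change to an input or output yields a different fingerprint. The sketch below is a generic illustration of that idea using SHA-256 from Python's standard library. It is not a description of Quilt's internal manifest format, and the function names and file names are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def manifest_version(manifest: dict) -> str:
    """Hash the manifest itself to get one identifier for the whole dataset."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Demo on a throwaway directory: editing one file changes the dataset version.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "calls.vcf").write_text("##fileformat=VCFv4.2\n")
    before = manifest_version(build_manifest(root))
    (root / "calls.vcf").write_text("##fileformat=VCFv4.2\n#edited\n")
    after = manifest_version(build_manifest(root))

print("version changed after edit:", before != after)
```

Because the identifier is derived from the contents rather than from a file name, two collaborators holding the same identifier can be confident they are looking at the same data.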
Genomics research is inherently multidisciplinary, involving numerous teams, from project management and clinical operations to data engineering, computational biology, and experimental science, all of whom require access to various results from genomics pipelines. Effective collaboration across diverse teams is essential to advance research agendas, but can be hindered by the piecemeal solutions available for interacting with and interpreting genomics data, especially when not all team members are comfortable using command-line tools.
Significant strides have been made to enhance the accessibility of genomics data to the broader scientific community through interactive web interfaces. However, these portals usually deliver data in its final processed form, such as discrete mutation calls or normalized gene expression values, often omitting valuable information regarding the underlying pipeline parameters, as well as upstream raw values and intermediate files, which can be crucial for internal research efforts. Collectively, these challenges underscore the need for a comprehensive solution that can bridge the gap in data accessibility, enabling research teams to seamlessly query, track, and share NGS data throughout its entire lifecycle.
Quilt tackles NGS data challenges by bundling all elements of genomics pipelines, including raw data, execution commands, logs, outputs, and downstream analysis, into cohesive version-controlled packages. In addition to supporting interactive visualizations, these packages simplify the process of sharing and querying data, thereby fostering effortless collaboration among colleagues across diverse disciplines, regardless of programming proficiency.
The remainder of this article presents a real-world example of how Quilt data packages can streamline processing, analysis, visualization, and sharing of a specific type of NGS data known as whole exome sequencing (WES). The workflow mirrors current state-of-the-art data protocols used by bioinformaticians and computational biologists, while further leveraging the advantages of Quilt packages to expedite traditionally time-consuming and fragmented tasks and ultimately unlock the maximum utility of genomic data.
Genomic pipelines need data! For this case study, we downloaded raw WES data in the form of FASTQs from 9 samples in the Cancer Cell Line Encyclopedia (CCLE) — a groundbreaking database of gene expression, genotype, and drug sensitivity data for hundreds of in vitro human tumor models, called cancer cell lines.
After uploading the FASTQs to the project’s Quilt package using a simple quilt3 push, we processed the data with nf-core/sarek, a community-maintained Nextflow pipeline for WES pre-processing, somatic variant calling, and annotation. By enabling the nf-quilt plugin, we can read input data from the project’s Quilt package and automatically write pipeline outputs back to it by adding a single -plugins parameter to the execution command.
nextflow run nf-core/sarek -profile docker \
  -plugins nf-quilt@0.4.5 \
  -resume \
  --outdir "quilt+s3://quilt-example-bucket#package=wes/result" \
  --input "quilt+s3://quilt-example-bucket#package=wes/result@3fdbe6d509e1c19b373126deba20bf967773c4c16961ddef1b67865618035f49&path=input_fastqs%2Fsarek_samplesheet.csv" \
  --wes true \
  --save_mapped true \
  --save_output_as_bam true \
  --tools "freebayes,mpileup,mutect2,strelka,vep"
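The --input value in the command above packs several pieces of information into one URI: the bucket, the package name, a specific package version hash, and a path inside the package. As a rough illustration of that layout, the sketch below splits the URI using only Python's standard library. The `split_quilt_uri` helper is hypothetical and is not part of quilt3 or nf-quilt; it only reflects the URI structure shown in the command.

```python
from urllib.parse import parse_qs, urlsplit


def split_quilt_uri(uri: str) -> dict:
    """Break a quilt+s3 URI into bucket, package name, version hash, and path."""
    parts = urlsplit(uri)
    if parts.scheme != "quilt+s3":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    # The fragment is laid out like a query string; parse_qs also decodes %2F.
    fragment = parse_qs(parts.fragment)
    name, _, top_hash = fragment["package"][0].partition("@")
    return {
        "bucket": parts.netloc,
        "package": name,
        "hash": top_hash or None,  # None when no version is pinned
        "path": fragment.get("path", [None])[0],
    }


uri = (
    "quilt+s3://quilt-example-bucket#package=wes/result"
    "@3fdbe6d509e1c19b373126deba20bf967773c4c16961ddef1b67865618035f49"
    "&path=input_fastqs%2Fsarek_samplesheet.csv"
)
info = split_quilt_uri(uri)
for key, value in info.items():
    print(f"{key}: {value}")
```

Pinning the hash in --input means a rerun reads exactly the same samplesheet, while the unpinned --outdir URI lets each run write a new version of the package.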
“By combining the reproducibility of Nextflow workflows with the versioning and query capabilities of Quilt, analysts can easily access and manage data across multiple pipeline iterations, while benefiting from auto-generated run documentation.”
NGS pipelines often fragment data, intermediate files and logs across multiple directories, complicating tracking the flow of data and pinpointing specific files. With Quilt packages, every element, from input data to custom outputs and visualizations, is unified and versioned, with added facet search capabilities to easily locate items based on their contents.
The Quilt catalog is an intuitive, user-friendly interface on top of S3 buckets for accessing and interacting with data without the need for programming. It can render previews directly in the browser for multiple file types, including HTML, CSV, and genomics-specific file types such as FASTQs and VCFs. For example, researchers can easily view the interactive MultiQC report produced by the nf-core/sarek WES pipeline to quickly evaluate the quality of sequencing runs without having to dig around their computational resources.
Additionally, Quilt makes it almost effortless to explore genomic alignments and variants across samples using Integrative Genomics Viewer (IGV). For example, we can directly load BAM files generated by the nf-core/sarek pipeline to browse somatic mutations in genes such as KRAS, and evaluate the quality and quantity of read alignments across samples at different genomic loci.
In addition to the more traditional JavaScript implementation of IGV, the IGV team has developed a Python application called igv-reports to generate HTML reports that consist of a table of genomic sites or regions and associated IGV views for each site. These reports can be rendered directly within Quilt packages using Quilt’s package file server.
In contrast to the browsable JavaScript implementation, the package file server IGV rendering is exclusively focused on the genomic sites specified within the HTML report. It does not permit browsing via the search bar, such as entering gene names or new genomic positions. Nevertheless, the IGV package file server solution offers a significant advantage by eliminating the need to load entire BAM files, as is required by the JavaScript version. As a result, this approach ensures lightning-fast browsing of the variant table, making it particularly well-suited for focused analyses at specific genomic sites — such as visually inspecting alignments at common cancer hotspots.
Every piece of data in a Quilt package has versioned shareable links, ensuring everyone views the same exact data, reducing confusion and misinterpretation. Quilt links eliminate the need to email large genomic files, simplifying communication between team members. No more mutation_file_final_FINAL_v2b.csv! Quilt’s user-friendly interface, efficient data packaging, and real-time collaboration capabilities empower researchers to focus on their science rather than wrestling with data accessibility.
With Quilt, computational and bench scientists can collaboratively interact with large-scale genomic data, such as WES, in real time. Quilt lets you view computational biology code and Jupyter notebooks directly in the Quilt catalog interface. This is particularly helpful when you need to review or understand the nitty-gritty details of an analysis performed by a colleague, such as the variant filtering strategy used in our example to whittle down WES VCFs to a high-confidence list of somatic mutations.
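To make the idea of a variant filtering strategy concrete, the sketch below applies a few commonly used criteria: the caller's own filter status, read depth, supporting reads, and allele fraction. The thresholds, field names, and records are all hypothetical, and this is not the filtering used in the example package. It is only meant to show the kind of logic a colleague might want to review in a notebook.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Variant:
    sample: str
    gene: str
    filter_status: str  # value of the VCF FILTER column
    depth: int          # total reads covering the site
    alt_reads: int      # reads supporting the alternate allele

    @property
    def allele_fraction(self) -> float:
        return self.alt_reads / self.depth if self.depth else 0.0


def high_confidence(
    variants: Iterable[Variant],
    min_depth: int = 20,      # hypothetical threshold
    min_alt_reads: int = 5,   # hypothetical threshold
    min_af: float = 0.05,     # hypothetical threshold
) -> List[Variant]:
    """Keep variants that pass the caller's filters and all three thresholds."""
    return [
        v
        for v in variants
        if v.filter_status == "PASS"
        and v.depth >= min_depth
        and v.alt_reads >= min_alt_reads
        and v.allele_fraction >= min_af
    ]


# Invented records for illustration only.
calls = [
    Variant("SAMPLE_A", "KRAS", "PASS", depth=80, alt_reads=24),
    Variant("SAMPLE_A", "TP53", "PASS", depth=12, alt_reads=6),      # too shallow
    Variant("SAMPLE_B", "TP53", "LowQual", depth=90, alt_reads=30),  # failed caller filter
    Variant("SAMPLE_C", "KRAS", "PASS", depth=100, alt_reads=3),     # too few alt reads
]
kept = high_confidence(calls)
print([(v.sample, v.gene) for v in kept])
```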
Quilt empowers users to perform analysis on the fly directly within the interface. Computational biologists can push results files generated from custom analyses into Quilt packages to unlock a suite of interactive queries and visualizations. For example, we can query a WES results table from our example containing >25,000 mutations (rows) to quickly identify which of the CCLE samples harbor KRAS mutations — a popular cancer hotspot mutation of interest to many research groups.
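For readers who prefer code, the same lookup can be approximated in a few lines of Python. This is a sketch under stated assumptions: the column names (`sample`, `gene`, `protein_change`) and the rows are invented for illustration and do not reflect the actual results table in the example package.

```python
import csv
import io
from collections import defaultdict

# Invented rows for illustration only; the real table has >25,000 mutations.
EXAMPLE_TABLE = """\
sample,gene,protein_change
SAMPLE_A,KRAS,p.G12D
SAMPLE_A,TP53,p.R175H
SAMPLE_B,TP53,p.R273H
SAMPLE_C,KRAS,p.G12V
"""


def samples_with_gene(table_file, gene: str) -> dict:
    """Return {sample: [protein changes]} for every row that hits the given gene."""
    hits = defaultdict(list)
    for row in csv.DictReader(table_file):
        if row["gene"] == gene:
            hits[row["sample"]].append(row["protein_change"])
    return dict(hits)


kras_hits = samples_with_gene(io.StringIO(EXAMPLE_TABLE), "KRAS")
for sample, changes in kras_hits.items():
    print(sample, ", ".join(changes))
```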
In addition to quick lookups, Quilt’s interactive visualizations can be used to perform analyses in real-time and discover new associations in genomic data. By enabling Perspective visualizations on a simple csv file generated from our WES analysis, we can quantify the number of mutations per cell line using bar plots, compare mutation rate between cancer types using a swarmplot and explore the relationship between doubling time and percent genome altered with a scatterplot — all without writing a single line of code!
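The first of those views, mutations per cell line, boils down to a simple group-and-count. The sketch below shows that aggregation with Python's standard library and prints a rough text bar chart. The cell line labels and rows are placeholders, not values from the WES analysis.

```python
from collections import Counter
from typing import Dict, Iterable


def mutations_per_cell_line(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count mutation rows per cell line (assumes one row per mutation)."""
    return Counter(row["cell_line"] for row in rows)


# Placeholder rows for illustration only.
rows = [
    {"cell_line": "LINE_1", "gene": "KRAS"},
    {"cell_line": "LINE_2", "gene": "TP53"},
    {"cell_line": "LINE_1", "gene": "TP53"},
    {"cell_line": "LINE_3", "gene": "EGFR"},
    {"cell_line": "LINE_1", "gene": "PIK3CA"},
    {"cell_line": "LINE_2", "gene": "KRAS"},
]

counts = mutations_per_cell_line(rows)
for cell_line, n in counts.most_common():
    print(f"{cell_line:<8} {'#' * n} ({n})")
```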
Quilt’s versatile visualization toolbox can simplify meeting prep by providing an accessible platform for discussions, eliminating the need to generate extensive slide decks. The user-friendly catalog interface and streamlined data sharing capabilities can enhance communication, reduce confusion around data, and support effective decision-making across organizational levels within biopharmaceutical companies.
The challenge with next generation sequencing data, such as WES, lies not only in its sheer volume but also in its accessibility, organization, and the ability to facilitate collaboration across diverse research teams. Quilt offers a way to package all components of a data pipeline, from raw data and execution commands to logs, outputs, and downstream analysis, into a cohesive and versioned package.
Quilt packages facilitate seamless collaboration across bench and computational researchers, enabling efficient access and analysis of large-scale genomic data. By reducing barriers to data access, Quilt empowers teams to discover, interpret, and progress their research rapidly to maximize discoveries from their genomic data.
Enabling widespread institutional access to large-scale genomic data not only enriches scientific exploration, but also has the potential to position biopharmaceutical companies at the forefront of genomics research, fostering innovation and maintaining competitiveness in the rapidly evolving field.
Post questions in the comments or visit https://quiltdata.com/ to learn more about managing and efficiently interacting with genomic data through Quilt data packages.
The whole exome sequencing Quilt package described above is publicly available for browsing at https://open.quiltdata.com/b/quilt-example/packages/examples/whole-exome-sequencing.
The associated Quilt Boston Workshop presentation from September 6, 2023 is also available.