1.19.23
We Integrated FSx for Lustre into AWS Genomics Workflows and Decreased Costs by One-Third
And you can reproduce our architecture in your own AWS environment. Here’s how.
by
Adam Tebbe, Henrique Silva, Eva Fast
and Sarthak Vilas Patel

A version of this article was first published on AWS for Industries on April 4, 2022. Since then, AWS has started using Loka's version of Cromwell.

Chronic kidney disease (CKD) affects 780 million people globally, one in every ten. Most of these patients have no good treatment options. To discover novel therapeutics for this underserved population, Goldfinch Bio built a Structural Variant Analysis capability on AWS.

Few data sets are available to commercial organizations in the public domain, so Goldfinch invested significant financial and labor resources into whole-genome sequencing on thousands of patients with kidney disease. Previously, much of the analysis performed on these data sets relied solely on the identification of single nucleotide polymorphisms (SNPs, single base pair changes in the DNA). With the recent development of Broad’s GATK-SV pipeline, the performance of Amazon FSx for Lustre and the help of AWS Advanced Consulting Partner Loka, Goldfinch Bio utilized its existing sequencing data to identify structural variants that can lead to new medicines for patients suffering from chronic kidney disease. As a result, the analytical value of Goldfinch’s existing genomic data increased overnight.

Goldfinch Bio believes that analysis of structural variation is a fundamental piece of the genomic puzzle, with the potential to provide a missing link between genomic variation and kidney disease.

Solution Overview

The GATK-SV pipeline was developed by the Broad Institute, written in Workflow Definition Language (WDL) and orchestrated using the Cromwell engine. The team knew that the large intermediate and reference files required by the pipeline would be best served with a high-performance file system. They also needed a solutions integrator with deep cloud experience to tie it all together in a reproducible way so the scientists at Goldfinch could use it on their own.

Pipeline and HPC Orchestration: We chose Broad’s GATK-SV pipeline and AWS’ own Genomics Workflows on AWS for orchestration. The pipeline was developed to call, filter and integrate structural variants across large cohorts using short-read sequencing data. This is particularly important because it can leverage existing data, as opposed to alternative sequencing approaches that would require a significant investment. The Genomics Workflows on AWS reference architecture makes it easy to stand up Cromwell environments on AWS Batch.

High Speed Storage: GATK-SV is a complex and resource-intensive pipeline that runs about 12,500 Batch jobs and generates millions of files and objects along the way (figures from an execution on 156 samples from the 1000 Genomes Project). The pipeline was initially ported to work with S3: for each Batch job it pulled down (localized) the required files, performed the task and then uploaded the output and staging files back to S3. This was a major performance bottleneck, which led us to explore other scalable, high-performance options.

Amazon FSx for Lustre was a great fit for this scenario due to its scalability, performance, provisioning and mounting capabilities, its wide range of file system types (SSD, HDD and Scratch) and its customizable throughput ranges. Its operating cost is also low compared to alternatives.

We then integrated FSx with AWS Batch, modified Cromwell (the workflow orchestration engine) and the infrastructure templates to handle FSx, and saw a performance boost right away. As the benchmark table later in this post shows, execution time dropped from 3+ days to 1.3 days, which saves effort, reduces the overall cost of the EC2 instances used and enables quicker analysis of the results. The pipeline can now use either FSx or S3, depending on the use case and performance/cost requirements.
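
To put the runtime figures in perspective, here is a minimal arithmetic sketch. It assumes a baseline of exactly 3.0 days, which is only the lower bound of the "3+ days" quoted above, so the results should be read as "at least". The function name `runtime_reduction` is ours, for illustration only.

```shell
#!/usr/bin/env bash
# runtime_reduction BEFORE_DAYS AFTER_DAYS
# Prints the relative reduction in runtime and the speedup factor.
runtime_reduction() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    printf "reduction=%.1f%% speedup=%.2fx\n", (before - after) / before * 100, before / after
  }'
}

# 3.0 is an assumed lower bound for the S3 run; 1.3 is the FSx run.
runtime_reduction 3.0 1.3   # prints: reduction=56.7% speedup=2.31x
```

Note that this is a runtime comparison only; the cost reduction reported later in this post is a separate measurement.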

Deeply Qualified Solutions Integrator: Modifying the pipeline and storage required changing not only the code in the WDL files but also Cromwell (the workflow orchestration engine) and the AWS infrastructure templates. Loka is an Advanced Tier partner with a deep understanding of both cloud and scientific workflow definitions, as well as strong familiarity with storage. Our specialized knowledge of HPC, AWS Batch, open-source packages, open data sets and processes like parallelization (which enables thousands of jobs to run simultaneously) delivered Goldfinch best-of-breed solutions faster than it could have achieved on its own.

How to Deploy GATK-SV in Your Own AWS Environment

The following steps will help you reproduce our architecture in your own AWS environment. Note that these steps offer high-level guidance. You can find more detailed deployment instructions on GitHub.

Step 1) Deploy the Genomics Workflows on AWS

  1. Open and follow the Genomics Workflows on AWS deployment guide

  2. When deploying the Genomics Workflow Core stack, supply the following:

    1. For CreateFSx, choose Yes

    2. For Cromwell FSxStorageType, choose Scratch

    3. For FSxStorageVolumeSize, enter 24000 (with MELT) or 16800 (without MELT)

  3. When deploying the Cromwell Resources stack, supply the following:

    1. For FSxFileSystemID, supply the id from the gwfcore template output tab

    2. For FSxFileSystemMount, supply the mount name from the gwfcore template output tab

    3. For FSxSecurityGroupId, supply the id from the gwfcore template output tab
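
If you prefer the command line to the console's output tab, the values above can also be read with the AWS CLI. The following is a minimal sketch, not part of the deployment guide: the helper name `get_stack_output`, the stack name and the output key shown in the usage comment are assumptions, so check the actual names in your own gwfcore stack.

```shell
#!/usr/bin/env bash
# get_stack_output STACK_NAME OUTPUT_KEY
# Prints the value of one CloudFormation stack output.
get_stack_output() {
  local stack_name="$1" output_key="$2"
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query "Stacks[0].Outputs[?OutputKey=='${output_key}'].OutputValue" \
    --output text
}

# Hypothetical usage (stack name and key are placeholders):
#   get_stack_output my-gwfcore-stack FSxFileSystemID
```

Running `aws cloudformation describe-stacks --stack-name <your stack>` without the `--query` filter lists every output key, which is a quick way to find the right names.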

Step 2) Deploy GATK-SV onto your Cromwell host

We built a small CloudFormation template that deploys an AWS SSM command document to run against your Cromwell server. This command document clones Broad’s GATK-SV repo and tweaks the Cromwell config files to support connecting with Amazon FSx.

  1. Deploy the SSM command:

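    # Download the raw template (the /blob/ page URL would return HTML, not YAML)
    wget https://raw.githubusercontent.com/goldfinchbio/aws-gatk-sv/master/templates/cf_ssm_document_setup.yaml
    # Deploy it as a CloudFormation stack
    aws cloudformation deploy --stack-name "gatk-sv-ssm-deploy" --template-file cf_ssm_document_setup.yaml --capabilities CAPABILITY_IAM
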
  2. Open the AWS Systems Manager Console

  3. In the navigation pane, under Documents, choose “Owned by me”

  4. Search for gatk-sv-ssm-deploy in the search bar

  5. Click the Execute Automation button

    1. For Instance Id, choose the listed Cromwell server

    2. For S3OrFSXPath, use the mount name in the Stack Output from Step 1
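
The console search in steps 3 and 4 can also be done from the CLI. This is a minimal sketch under the assumption that the deployed document's name contains the stack name `gatk-sv-ssm-deploy`; the helper name `find_owned_ssm_documents` is ours, and the actual document name may differ in your account.

```shell
#!/usr/bin/env bash
# find_owned_ssm_documents NAME_FRAGMENT
# Lists SSM documents owned by the current account whose name contains the fragment.
find_owned_ssm_documents() {
  local name_fragment="$1"
  aws ssm list-documents \
    --filters "Key=Owner,Values=Self" \
    --query "DocumentIdentifiers[?contains(Name, '${name_fragment}')].Name" \
    --output text
}

# Hypothetical usage:
#   find_owned_ssm_documents gatk-sv-ssm-deploy
```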

Step 3) Execute and monitor the pipeline

  1. Start a shell session on the Cromwell server

  2. Run cromshell submit commands

    1. The below will run the pipeline for 156 (1000 Genomes) samples.

      cromshell submit /home/ec2-user/gatk-sv/gatk_run/wdl/GATKSVPipelineBatch.wdl /home/ec2-user/gatk-sv/gatk_run/aws_GATKSVPipelineBatch.json /home/ec2-user/gatk-sv/gatk_run/opts.json /home/ec2-user/gatk-sv/gatk_run/wdl/dep.zip
  3. Monitor the pipeline

    1. cromshell status
    2. Alternatively, consider using the get_batch_status.py script, which gathers information from AWS Batch and CloudWatch logs to give a consolidated view of resources and job completion details, along with higher-level and module-level summaries.
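
For a quick look at progress without any extra scripts, job counts per status can be pulled straight from AWS Batch. The sketch below is not the get_batch_status.py script and makes no claim about how that script works; the helper name `count_batch_jobs` and the queue name in the usage comment are assumptions.

```shell
#!/usr/bin/env bash
# count_batch_jobs JOB_QUEUE
# Prints one "STATUS<TAB>COUNT" line per AWS Batch job status for the given queue.
count_batch_jobs() {
  local queue="$1" status
  for status in SUBMITTED PENDING RUNNABLE STARTING RUNNING SUCCEEDED FAILED; do
    # JSON output lets the CLI merge all result pages before applying --query.
    printf '%s\t%s\n' "$status" "$(aws batch list-jobs \
      --job-queue "$queue" \
      --job-status "$status" \
      --query 'length(jobSummaryList)' \
      --output json)"
  done
}

# Hypothetical usage (replace with a queue created by your Genomics Workflows stack):
#   count_batch_jobs my-cromwell-job-queue
```

With thousands of jobs per run, each call may take a while because the CLI pages through the full job list before counting.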

The AWS-GATK-SV reference architecture diagram featuring Cromwell, AWS Batch, and Amazon FSx.

Measuring Quality, Performance and Cost

Quality
Using a trial dataset that contains 156 individuals from the 1000Genomes project, we showed concordance of our results with the original pipeline.

Concordance between our pipeline and Broad’s published standard

DEL – Deletion, DUP – Duplication, CNV – copy number variant, INS – Insertion, INV – Inversions, CPX – Complex SV, OTH – (Breakends and Translocation)

The counts of each structural variant type, such as deletions (DEL) or duplications (DUP), are consistent across the four conditions (see figure above). We performed two FSx runs, one with and one without the proprietary structural variant caller MELT; as expected, the run without MELT reported fewer insertions (INS, 19,001 vs 10,091).

To further investigate consistency between pipelines, we compared the exact genomic positions at which structural variants were detected by our three callsets (S3 and the two FSx runs) and by the original pipeline (Broad). This evaluation is stringent because variants that are offset by only a few base pairs are also counted as false positives or false negatives. Comparing Amazon FSx (with MELT) against the Broad gold standard, we reached a precision of 0.95 and a recall/sensitivity of 0.92. Comparing the FSx run (without MELT) against the S3 run, with S3 as the reference dataset, gave a precision of 0.95 and a recall/sensitivity of 0.96.
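
To make the exact-position comparison concrete, here is a minimal sketch of how precision and recall fall out of two callsets. It is an illustration of the metric, not the evaluation code used for the numbers above: the tab-separated file format (chromosome, position, variant type), the helper name `precision_recall` and the toy data are all assumptions.

```shell
#!/usr/bin/env bash
# precision_recall CALLS_FILE REFERENCE_FILE
# Both files hold one variant per line: chrom<TAB>position<TAB>type.
# Lines are assumed to be unique within each file.
precision_recall() {
  awk -F'\t' '
    NR == FNR { ref[$1 FS $2 FS $3] = 1; n_ref++; next }  # first file: reference set
    { n_calls++; if (($1 FS $2 FS $3) in ref) tp++ }      # second file: calls, exact match only
    END {
      printf "precision=%.2f recall=%.2f\n",
        (n_calls ? tp / n_calls : 0), (n_ref ? tp / n_ref : 0)
    }
  ' "$2" "$1"
}

# Toy data: the second call is offset by 3 bp, so it counts as a miss.
ref_file="$(mktemp)"; calls_file="$(mktemp)"
printf 'chr1\t1000\tDEL\nchr1\t5000\tDUP\nchr2\t2000\tINS\nchr2\t7000\tDEL\nchr3\t300\tINV\n' > "$ref_file"
printf 'chr1\t1000\tDEL\nchr1\t5003\tDUP\nchr2\t2000\tINS\nchr2\t7000\tDEL\n' > "$calls_file"
precision_recall "$calls_file" "$ref_file"   # prints: precision=0.75 recall=0.60
rm -f "$ref_file" "$calls_file"
```

In the toy data, three of four calls match the reference exactly (precision 0.75) and three of five reference variants are recovered (recall 0.60). The duplication called at position 5003 instead of 5000 is penalized twice, once as a false positive and once as a false negative, which is why exact matching is a stringent test.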

Performance and Cost
Our optimizations decreased runtime and reduced our overall costs by a third. The table below shows how long each module took to complete on Amazon FSx and S3, with and without MELT.

GATK-SV module runtimes and costs with Amazon FSx and MELT

Conclusion

By analyzing structural variation in short-read sequencing data and studying different types of genomic rearrangements, we can uncover new insights and better understand certain diseases. This novel process helps us gain greater value out of existing data.

Resources such as the UK Biobank and All of Us are increasing the number of population-scale whole-genome sequencing datasets that lend themselves to structural variant research. This is a rapidly evolving methodology, however, and leveraging a pipeline like GATK-SV can be overwhelming for researchers with limited technical resources.

It is our hope that novel drug targets can be discovered to help patients in need of therapeutic intervention. By releasing our customizations and optimizations of this pipeline as an open-source reference architecture, we aim to let the community apply these improvements to additional data and generate novel scientific insights.

Want to dive deeper?

Adam Tebbe

Adam Tebbe is a VP of Computational Data Science and Technology at Goldfinch Bio, Inc. and leads the computational group at Goldfinch. He has an extensive background in IT, informatics, software engineering and data science. Adam has been developing software, tools, and platforms in the cloud for more than 10 years, with a background working across technical teams in small biotech and large pharma. He’s interested in applications of technology and platform development to enable scientific discovery that will lead to transformations in patient care. 

Henrique Silva

Henrique Silva is a Machine Learning Lead at Loka. He began his career in robotics in academia and evolved into the Machine Learning field. There he developed several projects that involve cutting-edge technologies and high-performing cloud architectures. In the last couple of years he has applied his skills working with large amounts of data and high-performance computing in the field of life sciences, contributing to several open source projects. When not working on a new data engineering challenge, Henrique enjoys playing paddle tennis, known around the world as padel.

Eva Fast

Eva Fast is a Senior Computational Biologist at Goldfinch Bio, Inc. She came from an experimental biology background before getting excited by the rapid data growth within healthcare and switching her focus to computational biology. She has experience in various genetics, genomics, clinical data and imaging analyses and enjoys collaborating to transform these workflows into scalable pipelines. In her free time she likes to bike in and around the Boston area.

Sarthak Vilas Patel

Sarthak Vilas Patel is a Senior Data Engineer at Goldfinch Bio, Inc. He’s an expert in architecting, building, maintaining, testing, and supporting highly scalable applications, infrastructures and CI/CD pipelines in diverse industries. He is passionate about cloud computing, solving complex problems and learning new tools and technologies. In his spare time he enjoys exploring new places, watching movies and hiking.
