broadinstitute/gatk 4.1.8.0


Download release: gatk-4.1.8.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.8.0 release:

  • A major new release of GenomicsDB (1.3.0), with enhanced support for shared filesystems such as NFS and Lustre, support for MNVs, and better compression leading to a roughly 50% reduction in workspace size in our tests. This also includes a fix for an error in GenotypeGVCFs that several users were encountering when reading from GenomicsDB.

  • A major overhaul of the PathSeq microbial detection pipeline containing many improvements

  • Initial/prototype support for reading from HTSGET services in GATK

    • Over the next several releases, we intend for HTSGET support to propagate to more tools in the GATK
  • Fixes for a couple of frequently-reported errors in HaplotypeCaller and Mutect2 (#6586 and #6516)

  • Significant updates to our Python/R library dependencies and Docker image

Full list of changes:

  • New Tools

    • HtsgetReader: an experimental tool to localize files from an HTSGET service (#6611)
      • Over the next several releases, we intend for HTSGET support to propagate to more tools in the GATK
    • ReadAnonymizer: a tool to anonymize reads with information from the reference (#6653)
      • This tool is useful when you want to use data for analysis but cannot publish it without anonymizing the sequence information; see the sketch below.
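
As a quick illustration of the new ReadAnonymizer tool, here is a minimal invocation sketch. It uses GATK's standard engine arguments (-I/-R/-O); the file names are placeholders, and any tool-specific options are omitted.

```bash
# Hypothetical example: sample.bam, reference.fasta, and the output path are
# placeholders. Reads are rewritten using sequence from the reference so that
# the original sequence information is not disclosed.
gatk ReadAnonymizer \
    -I sample.bam \
    -R reference.fasta \
    -O anonymized.bam
```
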
  • HaplotypeCaller/Mutect2

    • Fixed an "evidence provided is not in sample" error in HaplotypeCaller when performing contamination downsampling (#6593)
      • This fixes the issue reported in #6586
    • Fixed a "String index out of range" error in the TandemRepeat annotation with HaplotypeCaller and Mutect2 (#6583)
      • This addresses an edge case reported in #6516 where an alt haplotype starts with an indel, and hence the variant start is one base before the assembly region due to padding a leading matching base
    • Better documentation for FilterAlignmentArtifacts (#6638)
    • Updated the CreateSomaticPanelOfNormals documentation (#6584)
    • Improved the tests for NuMTFilterTool (#6569)
  • PathSeq

    • Major overhaul of the PathSeq WDLs (#6536)
      • This new PathSeq WDL redesigns the workflow for improved performance in the cloud.
      • Downsampling can be applied to BAMs with high microbial content (i.e., >10M reads) that would otherwise cause performance issues.
      • Removed the microbial FASTA input, as only the sequence dictionary is needed.
      • Broke the pipeline down into smaller tasks. This helps reduce costs by (a) provisioning fewer resources at the filter and score phases of the pipeline and (b) reducing job wall time to minimize the likelihood of VM preemption.
      • Added a filter-only option that can be used to cheaply estimate the number of microbial reads in a sample.
      • Metrics are now parsed so they can be fed as output to the Terra data model.
      • Added CRAM-to-BAM conversion capability.
      • Updated the WDL README.
      • Deleted the unneeded WDL JSON configuration, as the configuration can be provided in Terra.
    • Added an --ignore-alignment-contigs argument to PathSeq filtering that lets users specify any contigs that should be ignored. (#6537)
      • This is useful for BAMs aligned to hg38, which contains the Epstein-Barr virus decoy contig (chrEBV); see the sketch after this section.
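
To make the new filtering argument concrete, here is a hedged sketch of a PathSeqFilterSpark invocation; the file names are placeholders, the output argument names are assumed to follow the existing PathSeq filter conventions, and other filtering inputs (e.g., the host k-mer library and BWA image) are omitted for brevity.

```bash
# Hypothetical example: skip the Epstein-Barr virus decoy contig (chrEBV),
# which is present in hg38 alignments, during PathSeq host-read filtering.
gatk PathSeqFilterSpark \
    -I aligned_to_hg38.bam \
    --ignore-alignment-contigs chrEBV \
    --paired-output filtered_paired.bam \
    --unpaired-output filtered_unpaired.bam
```
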
  • GenomicsDB

    • Upgraded to GenomicsDB version 1.3.0 (#6654)
      • Added a new argument --genomicsdb-shared-posixfs-optimizations to help with shared POSIX filesystems such as NFS and Lustre. It disables file locking and, during GenomicsDB import, minimizes writes to disk. On one of the GATK datasets, importing about 10 samples on NFS went from 23.72 minutes to 6.34 minutes, comparable to importing to a local filesystem. Hopefully this helps with issues #6487 and #6627; it also fixes issue #6519. (See the sketch after this section.)
      • This version of GenomicsDB also uses pre-compression filters for offset and compression files in new workspaces and GenomicsDB arrays. For the same dataset and 10 samples as above, the total size of the GenomicsDB workspace went from 313MB to 170MB with no change in import and query times. Smaller GenomicsDB arrays also help performance on distributed and cloud file systems.
      • This version adds support for handling MNVs similarly to deletions, as described in issue #6500.
      • GenomicsDBImport can now place multiple contigs in the same GenomicsDB partition/array, which should help import times for users with many thousands of contigs. Changes are still needed on the GATK side to take advantage of this support.
      • Logging has been improved: the native C/C++ code now uses spdlog and fmt, and the Java layer uses Apache log4j with the log4j.properties provided by the application. Informational messages such as "No valid combination operation found for INFO field AA - the field will NOT be part of INFO fields in the generated VCF records" are now emitted only once per operation.
    • Made VCFCodec the default for query streams from GenomicsDB (#6675)
      • This fixes the frequently-reported NullPointerException in GenotypeGVCFs when reading from GenomicsDB (see #6667)
      • Added a --genomicsdb-use-bcf-codec argument to opt back in to using the BCFCodec, which is faster but prone to the above error on certain datasets; see the sketch after this section.
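
To make the two new GenomicsDB arguments concrete, here is a hedged sketch; the workspace path, sample GVCFs, and intervals are placeholders.

```bash
# Import to a workspace on a shared POSIX filesystem (e.g., NFS or Lustre)
# with file locking disabled and import-time disk writes minimized.
gatk GenomicsDBImport \
    --genomicsdb-workspace-path /nfs/path/to/my_workspace \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz \
    -L chr20 \
    --genomicsdb-shared-posixfs-optimizations

# Joint genotyping now defaults to the safer VCFCodec; opt back in to the
# faster (but occasionally error-prone) BCF codec only if needed.
gatk GenotypeGVCFs \
    -R reference.fasta \
    -V gendb:///nfs/path/to/my_workspace \
    --genomicsdb-use-bcf-codec \
    -O joint_genotyped.vcf.gz
```
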
  • CNV Tools

    • DetermineGermlineContigPloidy can now process interval lists with a single contig (#6613)
    • FilterIntervals now filters out any singleton intervals (#6559)
    • Fixed an inaccurate error message in SVDDenoisingUtils (#6608)
  • Docker/Conda Overhaul (#5026)

    • Our docker image is now built off of Ubuntu 18.04 instead of 16.04
      • This brings in newer versions of several important packages such as samtools
    • Updated many of the Python libraries installed via our conda environment and included in our Docker image to newer versions, resolving several outstanding issues in the process
    • R dependencies are now installed via conda in our Docker build instead of the now-removed install_R_packages.R script
      • Due to this change, we recommend that tools that use R packages (e.g., to create plots) be run using the GATK Docker image or the conda environment; see the example after this section.
    • NOTE: significant updates and changes to the Ubuntu version, native packages, and R/python packages may result in corresponding numerical changes in results.
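
For example, an R-dependent tool such as AnalyzeCovariates (which generates BQSR plots) can be run inside the updated image. This sketch assumes the image is tagged with the release version, as with previous releases; the /data paths are placeholders.

```bash
# Pull the release image and run a plotting tool inside it, mounting host
# data into the container; AnalyzeCovariates needs the bundled R packages.
docker pull broadinstitute/gatk:4.1.8.0
docker run --rm -v /data:/data broadinstitute/gatk:4.1.8.0 \
    gatk AnalyzeCovariates \
    -bqsr /data/recal.table \
    -plots /data/recal_plots.pdf
```
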
  • Mitochondrial Pipeline

    • Minor updates to the mitochondrial pipeline WDLs (#6597)
  • Notable Enhancements

    • RevertSamSpark now supports CRAMs (#6641)
    • Fixed a VariantAnnotator performance issue that could cause the tool to run very slowly on certain inputs (#6672)
    • More flexible matching of dbSNP variants during variant annotation (#6626)
      • All dbSNP IDs that match a particular variant are now added to the variant's ID field, instead of just the first one found in the dbSNP VCF.
      • Matching is also less brittle to variant normalization issues, so differing representations of the same underlying variant are matched. This is implemented by splitting and trimming multiallelics before checking for a match; unnormalized multiallelics are suspected to be the predominant cause of these matching failures.
    • Added a --min-num-bases-for-segment-funcotation argument to FuncotateSegments (#6577)
      • This allows segments shorter than 150 bases to be annotated if a smaller value is given at run time (the default of 150 bases preserves the previous behavior).
    • SplitIntervals can now handle more than 10,000 shards (#6587); example invocations of this and the FuncotateSegments argument above follow this section.
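
Hedged sketches of the SplitIntervals and FuncotateSegments options above; all paths are placeholders, and the FuncotateSegments arguments other than the new one are assumed to follow the usual Funcotator conventions.

```bash
# Scatter an interval list into more than 10,000 shards.
gatk SplitIntervals \
    -R reference.fasta \
    -L intervals.interval_list \
    --scatter-count 20000 \
    -O scattered_intervals/

# Annotate segments shorter than the previous 150-base minimum.
gatk FuncotateSegments \
    --segments segments.seg \
    -R reference.fasta \
    --ref-version hg38 \
    --data-sources-path funcotator_dataSources/ \
    --output-file-format SEG \
    --min-num-bases-for-segment-funcotation 50 \
    -O annotated_segments.seg
```
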
  • Bug Fixes

    • Fixed interval summary files being empty in DepthOfCoverage (#6609)
    • Fixed a crash in the BQSR R script with newer versions of R (#6677)
    • Fixed a crash that occurred when reporting an error while trying to build GATK with a JRE (#6676)
    • Fixed an issue where ReadsSourceSpark.getHeader() wasn't propagating the reference when a CRAM input resided on GCS, so it always resulted in a "no reference was provided" error, even when a reference was provided. (#6517)
    • Fixed an issue where ReadsSourceSpark.checkCramReference() always tried to create a Hadoop Path object for the reference regardless of the file system it lived on, which failed when using a reference on GCS. (#6517)
    • Fixed an issue where the tab completion integration tests weren't emitting any output (#6647)
  • Miscellaneous Changes

    • Created a new ReadsDataSource interface (#6633)
    • Migrated read arguments and downstream code to GATKPath (#6561)
    • Renamed GATKPathSpecifier to GATKPath. (#6632)
    • Add a read/write roundtrip Spark integration test for a CRAM and reference on HDFS. (#6618)
    • Deleted redundant methods in SVCigarUtils, and rewrote and moved the rest to CigarUtils (#6481)
    • Re-enabled tests for HTSGET now that the reference server is back to a stable version (#6668)
    • Disabled SortSamSparkIntegrationTest.testSortBAMsSharded() (#6635)
    • Fixed a typo in a SortSamSpark log message. (#6636)
    • Removed incorrect logger from DepthOfCoverage. (#6622)
  • Documentation

    • Fixed annotation equation rendering in the tool docs. (#6606)
    • Added a note on how to filter on MappingQuality in DepthOfCoverage (#6619); see the example after this section
    • Clarified the docs for the --gcs-project-for-requester-pays argument to mention the need for storage.buckets.get permission on the bucket being accessed (#6594)
    • Fixed a dead forum link in the SelectVariants documentation (#6595)
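
As a concrete illustration of the DepthOfCoverage MappingQuality note above, here is a hedged sketch using the standard MappingQualityReadFilter engine filter; the file names are placeholders.

```bash
# Count coverage only from reads with mapping quality >= 20.
gatk DepthOfCoverage \
    -R reference.fasta \
    -I sample.bam \
    -L targets.interval_list \
    --read-filter MappingQualityReadFilter \
    --minimum-mapping-quality 20 \
    -O sample_coverage
```
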
  • Dependencies

    • Updated HTSJDK to 2.22.0 (#6637)
    • Updated Picard to 2.22.8 (#6637)
    • Updated Barclay to 3.0.0 (#4523)
    • Updated Spark to 2.4.5 (#6637)
    • Updated Disq to 0.3.6 (#6637)
    • Updated the version of Cromwell used on Travis to v51 (#6628)
