broadinstitute/gatk 4.1.1.0 on GitHub

Highlights of the 4.1.1.0 release:

A substantial (~33%) speedup to the HaplotypeCaller in GVCF mode (-ERC GVCF).
Major updates to Mutect2, including completely overhauled filtering and smarter handling of overlapping read pairs.
A tensorflow update for CNNScoreVariants that speeds up the tool by roughly 2X when using the 2D model.
Important updates to the mitochondrial calling pipeline, and improved memory usage in the CNV pipeline.
Important bug fixes to Funcotator, VariantEval, GenomicsDBImport, and other tools, as well as to the --pedigree argument for annotations.

Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes:

HaplotypeCaller
- Greatly improved the performance of the ReferenceConfidenceModel using dynamic programming and caching (#5607)
  - This speeds up whole-genome GVCF mode calling (-ERC GVCF) by ~33% in our tests!
- Optimized some additional performance hotspots in the ReferenceConfidenceModel (#5616) (#5469) (#5652)
- Can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
- Don't output variants with no ALT allele if the * (spanning deletion) allele gets dropped (#5844)
- Added a --force-active argument that marks all regions as active. Useful for debugging/diagnostics. (#5635)
- HaplotypeCallerSpark: made performance improvements to allow the tool to run on WGS in strict mode (#5721)
- Fixed a rare infinite recursion bug in KBestHaplotypeFinder (also affects Mutect2) (#5786)
Mutect2
- Overhaul of FilterMutectCalls, which now applies a single threshold to an overall error probability (#5688)
  - FilterMutectCalls automatically determines the optimal threshold.
  - The new somatic clustering model learns tumors' allele fraction spectra and overall SNV and indel mutation rates in order to improve filtering.
  - Includes a rewrite of the Mutect2 documentation, which is better organized and now includes command-line examples in addition to the math.
- Mutect2 now modifies base and indel qualities of overlapping paired reads to account for PCR error rather than discarding reads (#5794)
  - This especially improves indel sensitivity.
- Optimized Mutect2 read orientation filtering by collecting F1R2 counts from within Mutect2 itself, greatly reducing wall-clock and CPU time (#5840)
- New Mutect2 panel of normals workflow using GenomicsDB for scalability (#5675)
  - The panel of normals removes germline variants so that it contains only technical artifacts, and it records information about artifact prevalence.
- Rewrote the Mutect2 active region likelihood as a special case of the full somatic likelihoods model, which reduces runtime by 5% (#5814)
- Funcotator updates in Mutect2 WDL (#5742) (#5735)
- Prune assembly graph before checking for cycles (#5562)
- Refactor Mutect2 inheritance so that it doesn't have inactive arguments (#5758)
- Added CRAM support to the Mutect2 WDL (#5668)
- Split MNPs in Mutect2 PON WDL, fixing a potential bug (#5706)
- Handle negative infinity log likelihoods from PairHMM in Mutect2 (#5736)
- Fixed overfiltering in Mutect2 in GGA alleles mode with no reads (#5743)
- Correct some Mutect2 VCF header lines (#5792)
- Handle unmarked duplicates with mate MQ = 0 in Mutect2 (#5734)
- Output sample names in Mutect2 PON header (#5733)
- Avoid error due to finite precision error in Mutect2 PON creation (#5797)
- Update Mutect2 javadoc to reflect v4.1 changes. (#5769)
- Renamed the OxoGReadCounts annotation to OrientationBiasReadCounts (#5840)
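
As a rough illustration of the FilterMutectCalls overhaul listed above (a single threshold applied to each call's overall error probability, with the threshold chosen automatically), the sketch below picks the threshold that maximizes an expected F-score. The F-score criterion, the function name `optimal_error_threshold`, and the example probabilities are assumptions made for illustration; these notes do not say how the tool actually selects its threshold.

```python
from typing import Sequence, Tuple


def optimal_error_threshold(error_probs: Sequence[float],
                            beta: float = 1.0) -> Tuple[float, float]:
    """Return (threshold, expected F-beta score).

    A call passes the filter when its error probability is <= threshold.
    Hypothetical helper: the objective used here is an assumption.
    """
    probs = sorted(error_probs)
    # Expected number of real variants among all candidate calls.
    total_true = sum(1.0 - p for p in probs)

    best_threshold, best_score = 0.0, 0.0
    exp_tp = 0.0  # expected true positives among passing calls
    exp_fp = 0.0  # expected false positives among passing calls
    b2 = beta * beta
    for p in probs:
        # Lowering the bar to include this call adds (1 - p) expected
        # true positives and p expected false positives.
        exp_tp += 1.0 - p
        exp_fp += p
        exp_fn = total_true - exp_tp  # real variants that would be filtered
        denom = (1.0 + b2) * exp_tp + b2 * exp_fn + exp_fp
        score = (1.0 + b2) * exp_tp / denom if denom > 0 else 0.0
        if score > best_score:
            best_threshold, best_score = p, score
    return best_threshold, best_score


if __name__ == "__main__":
    # Three confident calls and two likely artifacts (made-up values).
    threshold, score = optimal_error_threshold([0.01, 0.02, 0.05, 0.9, 0.95])
    print(threshold, round(score, 3))
```

With these made-up inputs, the chosen threshold keeps the three confident calls and filters the two likely artifacts. Raising `beta` above 1 weights sensitivity more heavily than precision.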
CNNScoreVariants
- We now use the latest Intel-optimized tensorflow (#5725)
  - This speeds up the 2D CNN by roughly 2X in our tests!
- FilterVariantTranches is out of beta (#5628)
- Fixed CNNScoreVariants hanging when the conda environment is not set up (#5819)
  - We now make sure that the GATK tool Python package is present before executing streaming Python commands.
- Extensive updates to the CNN WDLs (#5251)
Mitochondrial Calling Pipeline
- Added an option to recover all dangling branches, on by default for MT calling (#5693)
  - Fixes a large number of missed calls
- Use adaptive pruning in the mitochondria pipeline (#5669)
- Changed defaults in mitochondria mode in response to Mutect2 filtering overhaul (#5827)
- Allowed the MT pipeline to work on bams with a mix of single and paired-end reads (#5818)
- Added a hard filter to M2 for polymorphic NuMTs and low VAF sites (#5842)
- Updated the haplochecker version to 0.1.2 to fix a bug with flipping the major and minor hg headers in its output (#5760)
- Added the rest of the mitochondria joint-calling pipeline (#5673)
  - Merging and genotyping "somatic" GVCFs from Mutect2
- Added a read filter for unmapped reads and their mates (#5826)
- Refactored the MT WDL to make validations easier (#5708)
- Updated a variable name in MT WDL to match gatk-workflows version (#5694)
GenotypeGVCFs
- Added an option to merge intervals for better GenotypeGVCFs performance on GenomicsDB exome input (#5741)
- Trim per-allele FORMAT annotations and optionally retain raw AS annotations (#5833)
  - GenotypeGVCFs now uses the header info to determine if FORMAT lists need to be subset when alleles are dropped
  - Fixes "F1R2 and F2R2 annotations not updated by GenotypeGvcfs" (#5704)
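
The per-allele trimming described above can be pictured with a small sketch. Under the VCF specification, a header `Number` of `R` means one value per allele including the reference, and `A` means one value per ALT allele, so dropping an ALT allele means dropping the matching entries. The function name `subset_format_values` and the sample values are hypothetical, and this is not the GenotypeGVCFs implementation.

```python
from typing import List, Sequence


def subset_format_values(values: Sequence, number: str,
                         kept_alt_indices: Sequence[int]) -> List:
    """Subset a per-allele FORMAT list after some ALT alleles are dropped.

    values           -- the original list of values for one sample
    number           -- the Number declared in the VCF header for this field
    kept_alt_indices -- 0-based indices (into the original ALT list) to keep
    """
    if number == "A":
        # One value per ALT allele.
        return [values[i] for i in kept_alt_indices]
    if number == "R":
        # One value per allele, reference first, so ALT i sits at i + 1.
        return [values[0]] + [values[i + 1] for i in kept_alt_indices]
    if number == "G":
        # Per-genotype fields need genotype-index arithmetic; not covered here.
        raise NotImplementedError("Number=G is outside this sketch")
    # Fixed-count or unbounded fields are not tied to the allele list.
    return list(values)


if __name__ == "__main__":
    # Original ALT alleles: [A, T]; only T (index 1) is retained.
    print(subset_format_values([10, 4, 7], "R", [1]))   # REF and T counts
    print(subset_format_values([0.2, 0.3], "A", [1]))   # value for T only
```

Reading `Number` from the header rather than hard-coding field names means that any annotation declared as per-allele is trimmed consistently.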
Funcotator
- Non-locatable data sources can create funcotations again (#5774)
  - Fixes a bug where Funcotator was not adding funcotations from non-locatable data sources
- Fixed handling of symbolic alleles when determining best transcript for GencodeFuncotation creation. (#5834)
- FilterFuncotations: support for multi-allelic variants (#5588)
- FilterFuncotations: added gnomAD support for allele frequencies in ClinVarFilter and LofFilter, with a new argument to select whether gnomAD or ExAC is used (#5691)
- Added # as a character to be sanitized by VCFOutputRenderer (#5817)
- Added in Markdown files for Funcotator forum posts (#5630)
- Updated Funcotator documentation with a FAQ section to respond to user comments (#5755)
CNV Tools
- Improved memory usage in gCNV (#5781)
- Improved memory requirements of CollectReadCounts (#5715)
- Added some fixes for minor CNV issues (#5699)
- Added io_commons.read_csv to address issues with formatting of sample names in gCNV (#5811)
- Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation (#5732)
Miscellaneous Changes
- SelectVariants can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
- VariantEval bug fix: don't require the output file to already exist (#5681)
- Fixed the --pedigree argument in the PossibleDeNovo annotation (#5663)
- GenomicsDBImport: fixed a core dump when querying overlapping deletions (#5799)
- GatherPileupSummaries: a new tool that combines the output of GetPileupSummaries from disjoint scatter jobs (#5599)
- VariantsToTable: add splitting for allele-specific annotations and ADs (#5697)
- CalculateGenotypePosteriors: fix reported bug where no-call genotypes with no reads get genotype posterior probabilities and calls (#5667)
- Added a new argument to Spark tools enabling the user to control whether to sort the reads on output (#4874)
- ReadsPipelineSpark: fixed an "Interval not within the bounds of a contig" error (#5645)
- Concordance: fixed the tool to allow for no variation alleles in the truth data. (#5718)
- ReblockGVCF: fix sites with zero AD to actually use the site-level DP value as intended (#5835)
- Change UpdateVCFSequenceDictionary to use the specified dictionary uniformly (#5093)
- Fixed gatk-nightly Docker builds (https://hub.docker.com/r/broadinstitute/gatk-nightly/) (#5759)
- Print the Picard/HTSJDK versions in addition to the GATK version when running with --version (#5757)
- IndexFeatureFile: fixed a crash on VCFs with 0 records (#5795)
- PrintBGZFBlockInformation: removed the file extension check so that we can accept bams (#5801)
- Added a new read filter: IntervalOverlapReadFilter (#5656)
- Add NIO Path support to TableReader and TableWriter (#5785)
- Replaced IntervalsSkipList with OverlapDetector (#4154)
- Removed some unused arguments in VCF merging code (#5745)
- Kebab-case some arguments in LocusWalker and LocusWalkerSpark (#5770)
- Removed an unnecessary IllegalArgumentException in PairHMM (#5705)
- Removed accidental uses of log4j v1 (#5682)
- Improvements to Spark evaluation scripts (#5815)
- Extract tests from PrintReadsIntegrationTest to share with the Spark version. (#5689)
Documentation
- Improved the documentation for the StrandOddsRatio annotation (#5703)
- Fixed the descriptions of some HaplotypeCaller arguments (#5658)
- Update VariantRecalibrator example code to reflect new tagged argument syntax (#5710)
- Corrected javadoc for the InbreedingCoeff annotation (#5768)
- CalculateGenotypePosteriors: minor updates to javadoc and logger type (#5601)
- Added and updated javadoc for SortSamSpark and MarkDuplicatesSpark (#5672)
- Added a link to a "GitHub basics for researchers" article at top of the GATK README (#5643)
- Updated the main GATK README to remove outdated references to the Intel conda environment (#5753)
- Trimmed overly-long tool one-line summaries to shorten --list display width. (#5551)
Dependencies
- Updated HTSJDK to 2.19.0 (#5812)
- Updated Picard to 2.19.0 (#5812)
- Updated Disq to 0.3.0 (#5812)
- Updated google-cloud-nio to 0.81.0 (#5752)