broadinstitute/gatk 4.0.5.0 on GitHub

Highlights of this release include the ability to emit MNPs in Mutect2 and HaplotypeCaller via a new --max-mnp-distance argument, much better active region detection for low allele fractions in Mutect2, new priors for variant sites and homRef blocks in HaplotypeCaller, a new tool FilterAlignmentArtifacts to filter false positive alignment artifacts in the Mutect2 pipeline, performance improvements to CNNScoreVariants and Funcotator, and a new --sites-only-vcf-output GATK engine argument to suppress genotypes when writing VCFs.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

Mutect2
- Made Mutect2 active region determination much better for low allele fractions (#4832)
  - In particular, this makes Mutect2 vastly better for mitochondrial and cfDNA calling
- Mutect2 can now emit MNPs according to an adjustable distance threshold specified via --max-mnp-distance (#4650)
- Tweaked Mutect2 read position filter to handle non-biological (e.g. FFPE) insertions better (#4851)
- Fixed Mutect2 bug where triallelic normal artifacts were sometimes hidden from filtering engine (#4809)
- Mutect2 STR filter now also looks at insertions (#4845)
  - This lowers the indel false positive rate dramatically.
- Mutect2 WDL:
  - now outputs MAF segmentation (#4837)
  - now runs FilterAlignmentArtifacts (#4848)
  - now uses lenient validation in SortSam (#4844)
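
As an illustration of the --max-mnp-distance option listed above, here is a minimal Python sketch of distance-based MNP merging. It is not the Mutect2 implementation. It assumes that substitutions already phased onto the same haplotype are merged when neighbouring positions differ by at most the threshold, so that 0 disables merging. The names group_by_distance and to_mnp, and the tuple layout, are hypothetical.

```python
from typing import List, Tuple

# A substitution on one haplotype: (position, ref_base, alt_base).
Snv = Tuple[int, str, str]


def group_by_distance(snvs: List[Snv], max_mnp_distance: int) -> List[List[Snv]]:
    """Group substitutions whose neighbouring positions are close enough.

    Assumption: inputs are on the same haplotype and at distinct positions.
    Neighbours are joined when their positions differ by at most
    max_mnp_distance, so a threshold of 0 never merges anything.
    """
    groups: List[List[Snv]] = []
    for snv in sorted(snvs):
        if groups and snv[0] - groups[-1][-1][0] <= max_mnp_distance:
            groups[-1].append(snv)
        else:
            groups.append([snv])
    return groups


def to_mnp(group: List[Snv], reference: str, ref_start: int) -> Snv:
    """Collapse one group into a single (position, ref, alt) record.

    reference is the reference sequence and ref_start is the genomic
    position of its first base; unchanged bases between substitutions
    are copied from the reference into both alleles.
    """
    first, last = group[0][0], group[-1][0]
    ref = reference[first - ref_start:last - ref_start + 1]
    alt = list(ref)
    for pos, _ref_base, alt_base in group:
        alt[pos - first] = alt_base
    return first, ref, "".join(alt)


if __name__ == "__main__":
    reference = "ACGTAC"  # hypothetical reference starting at position 100
    snvs = [(101, "C", "T"), (102, "G", "A"), (105, "C", "G")]
    for distance in (0, 1, 3):
        merged = [to_mnp(g, reference, 100) for g in group_by_distance(snvs, distance)]
        print(distance, merged)
```

Under these assumptions, raising the threshold merges more distant substitutions into one record and pads the gap with reference bases.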
Added new tool FilterAlignmentArtifacts (#4698)
- Filters false positive alignment artifacts (that is, apparent variants due to reads being mapped to the wrong genomic locus) from a VCF callset by checking variant-supporting reads and their mates.
- By considering the realignment of the read and its mate, it saves a lot of variants, especially in low-complexity regions, from being filtered as mapping errors.
HaplotypeCaller
- HaplotypeCaller can now emit MNPs according to an adjustable distance threshold specified via --max-mnp-distance (#4650)
- New HaplotypeCaller priors for variant sites and homRef blocks (#4793)
  - Added new --population-callset argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
  - Added new --num-reference-samples-if-no-call argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
  - As a side effect of this change, CalculateGenotypePosteriors now supports indels.
- GCS/NIO output support for the -bamout argument (#4721)
- -new-qual in HaplotypeCaller/Mutect2/GenotypeGVCFs no longer counts spanning deletions as support for variant qual (#4801)
CNNScoreVariants
- Performance improvements to the prep of the input tensors in the 2D model (#4735)
- Bug fix to prevent a crash on the ends of the mitochondrial contig (#4751)
GATK Engine
- Added a new traversal type TwoPassVariantWalker that does two passes over its input variants (#4744)
- Enable the -L argument to read feature files (such as .bed or .vcf files) from non-local Paths, including GCS buckets (#4854)
- Added --sites-only-vcf-output argument to the GATK engine to suppress genotype fields when writing VCFs (#4764)
- Tools that use annotations now use the barclay annotation plugin (#4674)
- Added new ReadQueryNameComparator (#4731)
- Automatically schedule temporary resource files for delete on exit (#4616)
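
To make the effect of --sites-only-vcf-output concrete, the following Python sketch shows what a sites-only VCF looks like at the text level: only the eight fixed columns (CHROM through INFO) remain, and the FORMAT and per-sample columns are dropped. This illustrates the file format only and is not how the GATK engine implements the argument. The function name to_sites_only is hypothetical, and passing "##" meta lines through unchanged is an assumption.

```python
from typing import Iterable, List

# The fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
FIXED_COLUMNS = 8


def to_sites_only(vcf_lines: Iterable[str]) -> List[str]:
    """Return VCF lines with FORMAT and per-sample columns removed.

    Assumption: '##' meta lines are passed through unchanged; the
    '#CHROM' header line and every record are cut to the fixed columns.
    """
    out: List[str] = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            out.append(line)
        else:
            out.append("\t".join(line.split("\t")[:FIXED_COLUMNS]))
    return out


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n",
        "chr1\t100\t.\tA\tT\t50\tPASS\tDP=10\tGT\t0/1\n",
    ]
    print("\n".join(to_sites_only(example)))
```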
Spark tools
- Added support for g.vcf.gz files in Spark. #4274 (#4463)
- Spark tools can now write SAM files #4295. (#4471)
- Added a --output-shard-tmp-dir argument to specify the parts directory for un-sharded BAM writing (#4666)
MarkDuplicatesSpark
- Fixed MarkDuplicatesSpark so it handles supplementary reads with unmapped mates properly (#4785)
- Added a distinction between PCR orientation and Optical Duplicates orientation in MarkDuplicatesSpark (#4752)
- Fixed serialization crash in MarkDuplicatesSpark (#4778)
- Fixed queryname partitioning bug where asking for queryname sort would result in reads with the same name being split between partitions (#4765)
- Changed MarkDuplicatesSpark to sort non-queryname sorted bams before processing to ensure marking is consistent across shards (#4732)
- Renamed some MarkDuplicatesSpark arguments to follow the "kabob-style" convention (#4715)
- MarkDuplicatesSpark now uses the Picard OpticalDuplicatesFinder directly (#4750)
- MarkDuplicatesSpark now uses Picard metrics code directly (#4779)
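
The queryname partitioning fix above concerns reads with the same name landing in different partitions. One way to picture the invariant is the Python sketch below, which splits a queryname-sorted list into partitions of roughly equal size and extends each boundary until the name changes. It works on plain lists rather than Spark partitions, it is not the MarkDuplicatesSpark code, and the function name is hypothetical.

```python
from typing import List


def split_keeping_names_together(names: List[str], target_size: int) -> List[List[str]]:
    """Split queryname-sorted read names into partitions.

    Each partition holds about target_size reads, but a boundary is pushed
    forward while the next read has the same name as the previous one, so
    no read name is ever split across two partitions.
    """
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    partitions: List[List[str]] = []
    start = 0
    while start < len(names):
        end = min(start + target_size, len(names))
        # Extend the partition until the read name changes.
        while end < len(names) and names[end] == names[end - 1]:
            end += 1
        partitions.append(names[start:end])
        start = end
    return partitions


if __name__ == "__main__":
    reads = ["r1", "r1", "r2", "r2", "r2", "r3"]
    print(split_keeping_names_together(reads, 3))
```

With a naive fixed-size split of 3, the three "r2" reads in this example would straddle a boundary; extending the boundary keeps them together at the cost of uneven partition sizes.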
BwaSpark: disable sequence dictionary validation when aligning reads #4131 (#4308)
Funcotator
- Major performance improvements due to added caching and other optimizations (#4740)
- Various fixes (#4783) (#4817) (#4770)
  - Sanitize special characters when outputting VCF so that VCF validation passes
  - Fixed cases where the ordering specified in the header did not match the variants, and where hg19/b37 VCF datasources were being processed inconsistently, causing many missed annotations.
  - Added Funcotator tests for Clinvar and Gencode v28 in hg38, and mixed chr/no-chr GENCODE.
  - Eased restrictions so that Gencode v28 would be recognized as a valid gtf. Future versions of Gencode will not fail just based on the version number; a warning will be emitted instead.
  - Refined handling of transcripts with missing sequence info.
  - Refactored UTR VariantClassification handling.
  - Added a warning when a transcript in the UTR has no sequence info (now the same behavior as in protein-coding regions).
  - Added tests to prevent regression on data source date comparison bug.
  - Fixed DNA Repair Genes getter script.
  - Fixed an issue in COSMIC to make it robust to bad COSMIC data.
  - Gencode no longer crashes when given an indel that starts just before an exon.
  - Fixed the SimpleKeyXsvFuncotationFactory to allow any characters to work as delimiters (including characters used in regular expressions, such as pipes).
  - Modified several methods to allow for negative start positions in preparation for allowing indels that start outside exons.
  - Fixed an issue in 5' UTR processing that would cause variant alleles with length > 1 to throw an exception (fixes issue #4712).
  - Fixed a bug in the version detection for Funcotator data sources that would prevent newer data source versions from being detected as compatible (date comparison error).
- Gencode data sources now have names preserved from config files. (#4823)
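
The delimiter fix for SimpleKeyXsvFuncotationFactory listed above touches a common pitfall: a delimiter such as a pipe has a special meaning when it is interpreted as a regular expression. The Python sketch below illustrates the general problem and one usual remedy, which is to escape the delimiter before splitting. It is an analogy in another language, not the Funcotator code, and the function name is hypothetical.

```python
import re
from typing import List


def split_on_delimiter(line: str, delimiter: str) -> List[str]:
    """Split a line on a literal delimiter, even if it is a regex metacharacter."""
    # re.escape turns characters such as '|' or '.' into literal matches.
    return re.split(re.escape(delimiter), line)


if __name__ == "__main__":
    row = "GENE1|chr1|12345"
    # Unescaped, '|' means "empty pattern OR empty pattern" to the regex
    # engine, so this call does not split the row into its three fields.
    print(re.split("|", row))
    # Escaped, the pipe is treated as an ordinary character.
    print(split_on_delimiter(row, "|"))
```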
GCNV kernel tunings (#4720)
- Fixed a minor issue in sampling error estimation that could lead to NaN (as a result of division by zero)
- Introduced separate internal and external admixing rates
- Introduced two-stage inference for cohort denoising and calling
- Capped phred-scaled qualities to maximum values permitted by machine precision in order to avoid NaNs and overflows.
- Took a first step toward tracking and logging parameters during inference, starting with the ELBO history.
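
The quality-capping item above can be illustrated with a short Python sketch. A phred-scaled quality is -10 * log10(p) for an error probability p, which is undefined at p = 0 and grows without bound as p approaches 0. The cap used here, the phred value of the smallest positive normal double, is an assumption for illustration. The release notes do not state the value gCNV uses, and capped_phred is a hypothetical name.

```python
import math
import sys

# Assumed cap: the phred value of the smallest positive normal double.
MAX_PHRED = -10.0 * math.log10(sys.float_info.min)


def capped_phred(error_probability: float) -> float:
    """Convert an error probability to a phred-scaled quality with a cap.

    A probability of exactly 0 would make log10 fail, so it maps to the
    cap directly; very small probabilities are clipped to the same cap.
    """
    if not 0.0 <= error_probability <= 1.0:
        raise ValueError("error_probability must be in [0, 1]")
    if error_probability == 0.0:
        return MAX_PHRED
    return min(-10.0 * math.log10(error_probability), MAX_PHRED)


if __name__ == "__main__":
    for p in (0.1, 1e-30, 5e-324, 0.0):
        print(p, capped_phred(p))
```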
Validation of sequence dictionaries from multiple BAMs now emits a warning instead of throwing an exception in CNV workflows. (#4758)
SV tools
- Tweak BWA to allow "gappier" alignments in local assemblies (#4708)
- Added a new experimental tool named CpxVariantReInterpreterSpark to extract barebone-annotated simple variants from a VCF containing complex variants produced by the GATK-SV discovery pipeline (#4602)
- Fix "UnhandledCaseSeen" error in StructuralVariationDiscoveryPipelineSpark (#4677)
Added new SingleSequenceReferenceAligner class to align against an on-the-fly single contig reference using Bwa-Mem (#4780)
Updates to the conda environment for Python-based tools (#4749)
- Fix #4741, where newer versions of conda appear to treat relative references in the environment yml as being relative to the yml file instead of relative to the cwd (based on observation).
- Add a second conda yml file (gatkcondaenv.intel.yml) for environments that use Intel hardware acceleration and the Intel Tensorflow package (based on #4735).
- Added a gradle task (condaEnvironmentDefinition) to generate the conda yml files from a single template to ensure that all the environment definitions remain in sync. This task also generates the Python package archive.
- Added a gradle task (localDevCondaEnv) to create or update a local (non-Intel) conda environment. This is a shortcut for use during development when you're iteratively changing/testing Python code and want to update the conda env.
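
The relative-path behavior described in the #4741 fix above can be pictured with the Python sketch below. It resolves the same relative reference against two different base directories: the directory containing the yml file, and the current working directory. All paths and the function name are hypothetical, and the sketch does not call conda or reproduce its resolution logic.

```python
from pathlib import PurePosixPath


def resolve_reference(reference: str, yml_path: str, cwd: str,
                      relative_to_yml: bool) -> PurePosixPath:
    """Resolve a reference found inside an environment yml file.

    Absolute references are returned unchanged. Relative references are
    resolved against the yml file's directory or against cwd, depending
    on relative_to_yml.
    """
    ref = PurePosixPath(reference)
    if ref.is_absolute():
        return ref
    base = PurePosixPath(yml_path).parent if relative_to_yml else PurePosixPath(cwd)
    return base / ref


if __name__ == "__main__":
    # Hypothetical layout: the yml lives in a subdirectory of the cwd.
    yml = "/repo/scripts/env.yml"
    cwd = "/repo"
    print(resolve_reference("pkg/archive.zip", yml, cwd, relative_to_yml=True))
    print(resolve_reference("pkg/archive.zip", yml, cwd, relative_to_yml=False))
```

The two interpretations agree only when the yml file sits in the working directory, which is why a reference that worked under one convention can break under the other.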
Added a new WEX test bam to src/test/resources/large, with a companion target interval list (#4756)
Add slightly modified version of GATK3 github issue template (#4796)
Updated htsjdk to 2.15.1 (#4830)