Highlights of this release include the ability to emit MNPs in Mutect2 and HaplotypeCaller via a new --max-mnp-distance argument, much better active region detection for low allele fractions in Mutect2, new priors for variant sites and homRef blocks in HaplotypeCaller, a new tool FilterAlignmentArtifacts to filter false positive alignment artifacts in the Mutect2 pipeline, performance improvements to CNNScoreVariants and Funcotator, and a new --sites-only-vcf-output GATK engine argument to suppress genotypes when writing VCFs.
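To make the idea behind --max-mnp-distance concrete, here is a minimal Python sketch of merging nearby substitutions into MNPs. It is not GATK's implementation. It assumes the substitutions are single-base, already known to sit on the same haplotype, and that "distance" means the difference between their positions. The names `Substitution`, `collapse`, and `merge_into_mnps` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Substitution:
    pos: int  # 1-based position on the contig
    ref: str  # reference base(s)
    alt: str  # alternate base(s)


def collapse(group: List[Substitution], ref_bases: Dict[int, str]) -> Substitution:
    """Turn one group of nearby substitutions into a single record."""
    if len(group) == 1:
        return group[0]
    start, end = group[0].pos, group[-1].pos
    alt_at = {s.pos: s.alt for s in group}
    span = range(start, end + 1)
    ref = "".join(ref_bases[p] for p in span)
    # Positions between the substitutions keep their reference base.
    alt = "".join(alt_at.get(p, ref_bases[p]) for p in span)
    return Substitution(start, ref, alt)


def merge_into_mnps(subs: List[Substitution],
                    ref_bases: Dict[int, str],
                    max_distance: int) -> List[Substitution]:
    """Merge substitutions whose positions differ by at most max_distance.

    Assumption for this sketch: a max_distance of 0 leaves every
    substitution as its own record.
    """
    groups: List[List[Substitution]] = []
    for sub in sorted(subs, key=lambda s: s.pos):
        if groups and sub.pos - groups[-1][-1].pos <= max_distance:
            groups[-1].append(sub)
        else:
            groups.append([sub])
    return [collapse(g, ref_bases) for g in groups]
```

In this sketch, raising `max_distance` from 1 to 2 lets two substitutions separated by one unchanged reference base join the same MNP, with the unchanged base copied into both the REF and ALT strings.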
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
- Mutect2
  - Made Mutect2 active region determination much better for low allele fractions (#4832)
    - In particular, this makes Mutect2 vastly better for mitochondrial and cfDNA calling
  - Mutect2 can now emit MNPs according to an adjustable distance threshold specified via --max-mnp-distance (#4650)
  - Tweaked Mutect2 read position filter to handle non-biological (e.g. FFPE) insertions better (#4851)
  - Fixed Mutect2 bug where triallelic normal artifacts were sometimes hidden from the filtering engine (#4809)
  - Mutect2 STR filter now also looks at insertions (#4845)
    - This lowers the indel false positive rate dramatically.
  - Mutect2 WDL:
- Added new tool FilterAlignmentArtifacts (#4698)
  - Filters false positive alignment artifacts (that is, apparent variants due to reads being mapped to the wrong genomic locus) from a VCF callset by checking variant-supporting reads and their mates.
  - By considering the realignment of the read and its mate, it saves a lot of variants, especially in low-complexity regions, from being filtered as mapping errors.
- HaplotypeCaller
  - HaplotypeCaller can now emit MNPs according to an adjustable distance threshold specified via --max-mnp-distance (#4650)
  - New HaplotypeCaller priors for variant sites and homRef blocks (#4793)
    - Added new --population-callset argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
    - Added new --num-reference-samples-if-no-call argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
    - As a side effect of this change, CalculateGenotypePosteriors now supports indels.
  - GCS/NIO output support for the -bamout argument (#4721)
- -new-qual in HaplotypeCaller/Mutect2/GenotypeGVCFs no longer counts spanning deletions as support for variant qual (#4801)
- CNNScoreVariants
- GATK Engine
  - Added a new traversal type TwoPassVariantWalker that does two passes over its input variants (#4744)
  - Enabled the -L argument to read feature files (such as .bed or .vcf files) from non-local Paths, including GCS buckets (#4854)
  - Added --sites-only-vcf-output argument to the GATK engine to suppress genotype fields when writing VCFs (#4764)
  - Tools that use annotations now use the Barclay annotation plugin (#4674)
  - Added new ReadQueryNameComparator (#4731)
  - Automatically schedule temporary resource files for deletion on exit (#4616)
- Spark tools
  - MarkDuplicatesSpark
    - Fixed MarkDuplicatesSpark so it handles supplementary reads with unmapped mates properly (#4785)
    - Added a distinction between PCR orientation and Optical Duplicates orientation in MarkDuplicatesSpark (#4752)
    - Fixed serialization crash in MarkDuplicatesSpark (#4778)
    - Fixed queryname partitioning bug where asking for queryname sort would result in reads with the same name being split between partitions (#4765)
    - Changed MarkDuplicatesSpark to sort non-queryname sorted bams before processing to ensure marking is consistent across shards (#4732)
    - Renamed some MarkDuplicatesSpark arguments to follow the "kabob-style" convention (#4715)
    - MarkDuplicatesSpark now uses the Picard OpticalDuplicatesFinder directly (#4750)
    - MarkDuplicatesSpark now uses Picard metrics code directly (#4779)
  - BwaSpark: disable sequence dictionary validation when aligning reads #4131 (#4308)
- Funcotator
  - Major performance improvements due to added caching and other optimizations (#4740)
  - Various fixes (#4783) (#4817) (#4770)
    - Sanitize special characters when outputting VCF so that VCF validation passes
    - Fixed two bugs: the ordering specified in the header did not match the variants, and hg19/b37 VCF datasources were being inconsistently processed, causing many missed annotations.
    - Added Funcotator tests for Clinvar and Gencode v28 in hg38, and mixed chr/no-chr GENCODE.
    - Eased restrictions so that Gencode v28 would be recognized as a valid gtf. Future versions of Gencode will not fail just based on the version number; a warning will be emitted instead.
    - Refined handling of transcripts with missing sequence info.
    - Refactored UTR VariantClassification handling.
    - Added warning statement when a transcript in the UTR has no sequence info (now the same behavior as in protein coding regions).
    - Added tests to prevent regression on data source date comparison bug.
    - Fixed DNA Repair Genes getter script.
    - Fixed an issue in COSMIC to make it robust to bad COSMIC data.
    - Gencode no longer crashes when given an indel that starts just before an exon.
    - Fixed the SimpleKeyXsvFuncotationFactory to allow any characters to work as delimiters (including characters used in regular expressions, such as pipes).
    - Modified several methods to allow for negative start positions in preparation for allowing indels that start outside exons.
    - Fixed an issue in 5' UTR processing that would cause variant alleles with length > 1 to throw an exception (fixes issue #4712).
    - Fixed a bug in the version detection for Funcotator data sources that would prevent newer data source versions from being detected as compatible (date comparison error).
  - Gencode data sources now have names preserved from config files. (#4823)
- GCNV kernel tunings (#4720)
  - Fixed a minor issue in sampling error estimation that could lead to NaN (as a result of division by zero)
  - Introduced separate internal and external admixing rates
  - Introduced two-stage inference for cohort denoising and calling
  - Capped phred-scaled qualities to maximum values permitted by machine precision in order to avoid NaNs and overflows.
  - Took a first step toward tracking and logging parameters during inference, starting with the ELBO history.
- Validation of sequence dictionaries from multiple BAMs now throws a warning instead of an exception in CNV workflows. (#4758)
- SV tools
  - Tweak BWA to allow "gappier" alignments in local assemblies (#4708)
  - Added a new experimental tool named CpxVariantReInterpreterSpark to extract barebone-annotated simple variants from a VCF, produced by the GATK-SV discovery pipeline, that contains complex variants (#4602)
  - Fix "UnhandledCaseSeen" error in StructuralVariationDiscoveryPipelineSpark (#4677)
- Added new SingleSequenceReferenceAligner class to align against an on-the-fly single contig reference using Bwa-Mem (#4780)
- Updates to the conda environment for Python-based tools (#4749)
  - Fix #4741, where newer versions of conda appear to treat relative references in the environment yml as being relative to the yml file instead of relative to the cwd (based on observation).
  - Add a second conda yml file (gatkcondaenv.intel.yml) for environments that use Intel hardware acceleration and the Intel Tensorflow package (based on #4735).
  - Added a gradle task (condaEnvironmentDefinition) to generate the conda yml files from a single template to ensure that all the environment definitions remain in sync. This task also generates the Python package archive.
  - Added a gradle task (localDevCondaEnv) to create or update a local (non-Intel) conda environment. This is a shortcut for use during development when you're iteratively changing/testing Python code and want to update the conda env.
- Added a new WEX test bam to src/test/resources/large, with a companion target interval list (#4756)
- Add slightly modified version of GATK3 github issue template (#4796)
- Updated htsjdk to 2.15.1 (#4830)
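As an illustration of what "suppressing genotype fields" means for the --sites-only-vcf-output argument listed above, the following Python sketch reduces VCF text lines to their eight fixed columns (CHROM through INFO), dropping FORMAT and the per-sample columns. This is only a conceptual sketch and not how the GATK engine implements the argument. The function name `to_sites_only` is hypothetical, and the sketch makes no attempt to prune FORMAT definitions from the meta-information header.

```python
from typing import Iterable, Iterator


def to_sites_only(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines with FORMAT and sample columns removed."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines pass through unchanged in this sketch.
            yield line
        else:
            # The #CHROM header line and every record keep only the
            # eight fixed, tab-separated columns.
            yield "\t".join(line.split("\t")[:8])
```

A line that already has eight or fewer columns passes through unchanged, so running the function on a sites-only file leaves it as it was.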