Highlighting this release are some important fixes and improvements to the HaplotypeCaller, in particular support for genotyping spanning deletions and a fix to the reference confidence calculation around indels. This release also brings support for "Requester Pays" GCS (Google Cloud Storage) buckets, fasta.gz support to the -R/--reference argument, a port of LeftAlignAndTrimVariants from GATK3, a new tool FuncotatorDataSourceDownloader to download Funcotator datasources, and bug fixes to Mutect2, VariantRecalibrator, and SelectVariants.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
-
HaplotypeCaller
- Fixed the reference confidence calculation upstream of indels (#5172)
- Improve hom-ref GQs near indels in GVCFs. Also consider bases on either side of indels informative if local assembly has been performed.
- The previous behavior generated some PL=0,0,0 no-calls because the CIGAR of reads containing indels wasn't taken into account when determining which reads were informative for the indel reference confidence model. The local realignment wasn't being used inside the active region previously either, which has been fixed. A related change considers bases on either side of indels informative if local assembly has been performed (but not during active region detection). Both result in far fewer 0,0,0 calls. Unfortunately there are still some 0,0,X homRef calls related to #5171.
- Make HaplotypeCaller genotype and output spanning deletions (#4963)
- Modifies HaplotypeCaller so that it can output and genotype spanning deletion alleles represented by the * allele.
- Fixes #2960
- Previously, the output of HaplotypeCaller would not include spanning deletion alleles when run in single sample VCF mode or in genotype given alleles mode, even when that genotype would be more appropriate. In the joint calling workflow GenotypeGVCFs adds genotypes for spanning deletions, although the input likelihoods will not be broken out to specifically account for spanning deletion alleles.
- Simplify HaplotypeBAMWriter code. #944 (#5122)
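To illustrate why the PL=0,0,0 records mentioned above behave as no-calls: under the usual VCF convention, GQ reflects the gap between the two smallest PLs, so identical PLs carry no confidence at all. The sketch below shows that convention only; it is not HaplotypeCaller's code, and the helper name `gq_from_pls` and the cap of 99 are assumptions made for the illustration.

```python
def gq_from_pls(pls, cap=99):
    """Genotype quality as the gap between the two smallest PLs, capped."""
    if len(pls) < 2:
        raise ValueError("need at least two genotype likelihoods")
    best, second = sorted(pls)[:2]
    return min(second - best, cap)


print(gq_from_pls([0, 0, 0]))     # all genotypes equally likely
print(gq_from_pls([0, 0, 45]))    # hom-ref and het cannot be told apart
print(gq_from_pls([0, 30, 400]))  # hom-ref is clearly favored
```

A 0,0,X record therefore still has GQ 0, even though the hom-var genotype has been ruled out.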
-
Mutect2
- Mutect2 now emits DP values in the FORMAT field (#5185)
- Add --get-af-from-ad option to recalculate the allele fraction based on AD instead of the Bayesian estimate (#5118)
- Recommended for mitochondrial applications
- Fixed a StringIndexOutOfBoundsException crash in the ReferenceBases annotation when a variant is within 10 base pairs of the end of a chromosome (#5151)
- Restore base quality filter code that got removed unintentionally in #4895. (#5123)
- Remove extra space in the MutectVersion header line (previously was Mutect Version) (#5184)
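As a rough illustration of what recalculating the allele fraction from AD involves: AD holds per-allele read depths with the reference allele first, and each alternate allele's fraction is its depth divided by the total. This is a sketch of the arithmetic only, not Mutect2's implementation; `af_from_ad` is a hypothetical helper, and returning 0.0 when there is no coverage is an assumption.

```python
def af_from_ad(ad):
    """Return one allele fraction per ALT allele from an AD list [ref, alt1, ...]."""
    total = sum(ad)
    if total == 0:
        # Assumption: report 0.0 for every ALT allele when there is no coverage.
        return [0.0 for _ in ad[1:]]
    return [alt / total for alt in ad[1:]]


print(af_from_ad([95, 5]))       # one ALT allele
print(af_from_ad([50, 30, 20]))  # two ALT alleles
```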
-
Added support for "Requester Pays" GCS (Google Cloud Storage) buckets via new --gcs-project-for-requester-pays argument (#5140)
-
Added fasta.gz support to the -R/--reference argument in walker tools (#5120)
-
Added GCS/NIO support to the --tmp-dir argument (#4469)
-
Upgraded google-cloud-java to the official 0.62.0 release, and moved off of our custom fork of the library. This release includes the retry for transient 502 errors that we added to our fork in GATK 4.0.8.0 (#5194) (#5135)
-
Ported the LeftAlignAndTrimVariants tool from GATK3 (#5144)
-
VariantRecalibrator: the serialized model now sets annotation order (#3655)
- This addresses a problem where serialized GMMs for VQSR assumed that the annotation order would be the same between the commands that generated them and the commands that used them. VQSR no longer depends on the commandline order of the annotations.
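The ordering problem can be sketched as follows: if a serialized model records the annotation names it was trained with, values can be looked up by name and arranged in the model's order, so the command-line order stops mattering. The names used here (`model_annotations`, `order_for_model`) are hypothetical and say nothing about the actual VQSR serialization format.

```python
def order_for_model(model_annotations, values_by_name):
    """Arrange annotation values in the order stored with the model."""
    missing = [name for name in model_annotations if name not in values_by_name]
    if missing:
        raise KeyError(f"annotations required by the model are missing: {missing}")
    return [values_by_name[name] for name in model_annotations]


model_annotations = ["QD", "FS", "MQ"]      # order recorded at training time
site = {"MQ": 60.0, "QD": 12.5, "FS": 1.3}  # order seen on a later command line
print(order_for_model(model_annotations, site))
```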
-
SelectVariants: Drop sites with the * allele as the only ALT when running with --exclude-non-variants (#5129)
-
Funcotator:
- Created a new FuncotatorDataSourceDownloader tool to download data sources. (#5150)
- Add an experimental FilterFuncotations tool (#4991)
- Updated COSMIC to annotate protein change strings with their counts. (#5181)
- Fix INDEL start/stop position and alleles for VCF gencode output. (#5131)
- Get datasource version from a manifest file instead of the README (#5149)
- Extract a new FuncotatorEngine to make it easier to write additional tools in the future that leverage Funcotator's annotation engine (#5134)
- Handle character encoding error cases. (#5124)
-
CNNScoreVariants:
-
CNV tools:
-
SV tools:
- Bug fix to read name mangling in ExtractOriginalAlignmentRecordsByNameSpark (#5107)
- Added an InsertSizeDistribution class to represent expected insert-size distribution (normal and log-normal distributed) parameterized by insert size mean and stddev (#4827)
- Added documentation clarification and additional validation to SVInterval (#5157)
- Test and utils clean up (#5116)
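For readers unfamiliar with parameterizing a log-normal by the mean and standard deviation of the insert size itself (rather than of its logarithm), the conversion can be sketched as below. This is a generic moment-matching illustration, not the GATK class; whether InsertSizeDistribution performs exactly this conversion is an assumption, and all function names are hypothetical.

```python
import math


def lognormal_params(mean, stddev):
    """Convert the mean/stddev of X into (mu, sigma) of ln(X) for a log-normal X."""
    sigma2 = math.log(1.0 + (stddev / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


def normal_pdf(x, mean, stddev):
    """Density of a normal distribution at x."""
    z = (x - mean) / stddev
    return math.exp(-0.5 * z * z) / (stddev * math.sqrt(2.0 * math.pi))


def lognormal_pdf(x, mean, stddev):
    """Density at x of a log-normal whose own mean and stddev are given."""
    if x <= 0:
        return 0.0
    mu, sigma = lognormal_params(mean, stddev)
    return normal_pdf(math.log(x), mu, sigma) / x


mu, sigma = lognormal_params(300.0, 50.0)
print(math.exp(mu + sigma ** 2 / 2.0))  # recovers the requested mean, up to rounding
```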
-
MarkDuplicatesSpark:
-
Clone read base qualities rather than reference them directly in the read clipper code to prevent unsafe array operations (#4926)
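The hazard being fixed here is aliasing: if a clipping routine edits a quality array that the read still owns, the original read changes too. The sketch below shows the general idea in Python with a hypothetical `clip_quals` helper that zeroes qualities in a range; the actual fix lives in GATK's Java read clipper code and is not reproduced here.

```python
def clip_quals(quals, start, end, copy=True):
    """Zero out qualities in [start, end); work on a copy unless told otherwise."""
    out = list(quals) if copy else quals  # copy=False aliases the caller's list
    for i in range(start, min(end, len(out))):
        out[i] = 0
    return out


original = [30, 31, 32, 33]
clipped = clip_quals(original, 0, 2)
print(original)  # unchanged, because clip_quals worked on a copy
print(clipped)
```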
-
Fix three bugs in the AlignmentUtils class (#3494)
- The treatment of D-over-D in function applyCigarToCigar() was backward.
- In function createReadAlignedToRef() the read start position passed to the leftAlignIndel() call was incorrect if the haplotype has an indel relative to reference.
- When the leftAlignIndel() call drops any leading D operator in the result cigar, the read start position needs to be adjusted accordingly.
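The third fix can be illustrated with a small sketch: a leading deletion consumes reference bases but no read bases, so if it is dropped from the CIGAR the alignment start has to move forward by its length. The helper below is hypothetical and handles only simple CIGAR strings; it is not the AlignmentUtils code.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def drop_leading_deletion(start, cigar):
    """Remove leading D operators and shift start by the reference bases they covered."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    while ops and ops[0][1] == "D":
        length, _ = ops.pop(0)
        start += length
    return start, "".join(f"{n}{op}" for n, op in ops)


print(drop_leading_deletion(100, "3D10M2I5M"))  # start shifts by 3
print(drop_leading_deletion(100, "10M3D5M"))    # internal deletion is left alone
```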
-
Test infrastructure improvements:
-
Documented use of --temp-dir with GenomicsDBImport. (#5047)
-
Deleted obsolete experimental tool MarkDuplicatesGATK in favor of MarkDuplicatesSpark (#5166)
-
Deleted obsolete experimental tool BaseRecalibratorSparkSharded (#5192)
-
Upgraded htsjdk to version 2.16.1 (#5168)
-
Upgraded Picard to version 2.18.13. (#5173)