broadinstitute/gatk 4.2.0.0 on GitHub

Download release: gatk-4.2.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.2.0.0 release:

We've worked closely with Illumina to port a number of significant innovations for germline short variant calling from their DRAGEN pipeline to GATK. These improvements will form the basis of the upcoming open-source implementation of the DRAGEN pipeline, which we're calling DRAGEN-GATK.
A number of other HaplotypeCaller fixes and improvements that improve the phasing of variant calls and address edge cases with indels and spanning deletions
A new pipeline for gCNV exome joint calling

Full list of changes:

DRAGEN-GATK (#6634) (#7063)
- With this release we've worked closely with Illumina to make improvements to the GATK HaplotypeCaller to allow it to output germline short variant calls that are functionally equivalent to the calls made by their DRAGEN 3.4.12 pipeline. See our blog post on DRAGEN-GATK for more details on these improvements. A full DRAGEN-GATK pipeline that leverages these new features will be released in the near future as a WDL workflow script in the WARP repo on GitHub as well as a featured workspace in Terra.
- Below is a summary of the improvements we've ported from DRAGEN in this release. We recommend that most users wait until the complete DRAGEN-GATK pipeline is released as a WDL workflow before evaluating these features, though advanced users comfortable with building their own pipelines are welcome to try them out now:
  - DragSTR: a port of DRAGEN's model for STRs (Short Tandem Repeats) that adjusts HMM indel priors based on empirical reference contexts for better indel calling.
    - Using DragSTR involves running two new tools prior to the HaplotypeCaller:
      - ComposeSTRTableFile: scans a reference for STR sites and outputs a table file with a subsample of the available STR sites across the genome.
      - CalibrateDragstrModel: given the STR table for a reference produced by ComposeSTRTableFile and the reads for a specific sample, generates a model for potential sequencing errors for STR sites of various sizes for that sample.
    - After running these tools, you then run HaplotypeCaller with the --dragstr-params-path argument to pass it the DragSTR model generated by CalibrateDragstrModel.
  - BQD (Base Quality Dropout) and FRD (Foreign Read Detection): two new genotyper error models ported from DRAGEN
    - The Base Quality Dropout (BQD) model accounts for read-context-dependent sequencing errors. It penalizes variants with low average base quality scores and high average sequencing cycle counts, measured across both the genotyped reads and the reads that were otherwise excluded from the genotyper.
    - The Foreign Read Detection (FRD) model uses an adjusted mapping quality score as well as read strandedness information to penalize reads that are likely to have originated from somewhere else on the genome or from contamination.
    - To activate the BQD and FRD models, run HaplotypeCaller with the --dragen-mode argument.
  - Added a new variant QUAL score model that reports the variant QUAL score as the posterior of the reference genotype based on the sample-dependent DRAGEN STR and flat SNP priors.
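For advanced users who want to try these features before the WDL workflow is released, the three steps described above (ComposeSTRTableFile, CalibrateDragstrModel, then HaplotypeCaller) might be chained roughly as follows. This is a minimal sketch that only builds the command lines and prints them; it does not run GATK. The --dragstr-params-path and --dragen-mode arguments come from the notes above. The file names are placeholders, and the -R/-I/-O and --str-table-path arguments are assumptions that should be checked against each tool's --help output.

```python
import shlex


def dragstr_commands(reference, bam, str_table, dragstr_params, out_vcf):
    """Build (but do not run) the three commands of a DragSTR-enabled run."""
    # Step 1: scan the reference for STR sites (once per reference).
    compose = ["gatk", "ComposeSTRTableFile",
               "-R", reference,          # assumed standard reference argument
               "-O", str_table]          # assumed standard output argument
    # Step 2: fit the per-sample DragSTR model from the STR table and reads.
    calibrate = ["gatk", "CalibrateDragstrModel",
                 "-R", reference,
                 "-I", bam,
                 "--str-table-path", str_table,  # assumed argument name
                 "-O", dragstr_params]
    # Step 3: call variants with the DragSTR model plus the BQD/FRD models.
    call = ["gatk", "HaplotypeCaller",
            "-R", reference,
            "-I", bam,
            "-O", out_vcf,
            "--dragstr-params-path", dragstr_params,  # from the release notes
            "--dragen-mode"]                          # from the release notes
    return [compose, calibrate, call]


# Placeholder file names, for illustration only.
commands = dragstr_commands("ref.fasta", "sample.bam", "str_table.zip",
                            "sample.dragstr.txt", "sample.vcf.gz")
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping each command as a list of arguments makes it straightforward to hand the same lists to subprocess.run once the argument names have been verified.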
HaplotypeCaller
- We now add physical phasing information (PGT/PID/PS attributes) to genotypes with spanning deletion alleles (#6937)
- Fixed two phasing bugs (#7019)
  - Fixed "HaplotypeCaller emitting incorrect phasing when genotyping hom-het-het" (#6463)
  - Fixed "Phased variants do not have the same phase set identifier" (#6845)
- Fixed quality score calculation for sites with spanning deletions (#6859)
  - This fixes a bug in the AlleleFrequencyCalculator that was causing quality to be overestimated for sites with * alleles representing spanning deletions.
- Added the ability for indels to be recovered from dangling heads in the assembly graph, and a new --num-matching-bases-in-dangling-end-to-recover argument for filtering dangling ends (#6113) (#7086)
- Improved handling of indels/spanning deletions in the cigar base quality adjustment code. (#6886)
  - This aims to better handle the edge cases that come up when mates have mismatching numbers of bases at the start or end of the reads relative to each other.
- Fixed a bug where overlapping reads in subsequent assembly regions could have invalid base qualities (#6943)
- Non-ACGT IUPAC bases are now converted to N in HaplotypeCaller prior to assembly to prevent a crash (#6868)
- Renamed the --mapping-quality-threshold argument to --mapping-quality-threshold-for-genotyping, and updated its documentation to be less confusing (#7036)
- Added an option for HaplotypeCaller and Mutect2 to produce a bamout without artificial haplotypes (#6991)
- Updated the --debug-graph-transformations argument to emit the assembly graph both before and after chain pruning (#7049)
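The IUPAC conversion fix listed above (#6868) is easy to picture in isolation. The following is a minimal sketch of the idea, not GATK's code: every base other than A, C, G, or T is replaced with N before further processing. The function name is hypothetical, and leaving lowercase acgt untouched is an assumption.

```python
def convert_non_acgt_to_n(bases: bytes) -> bytes:
    """Return a copy of `bases` with every non-ACGT byte replaced by N."""
    keep = b"ACGTacgt"  # assumption: lowercase acgt are kept as-is
    # Iterating over bytes yields ints, so this handles all 256 byte values.
    return bytes(b if b in keep else ord("N") for b in bases)


# R, Y, K and M are IUPAC ambiguity codes; each becomes N.
print(convert_non_acgt_to_n(b"ACGTRYKMN"))  # prints b'ACGTNNNNN'
```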
Mutect2
- Fixed the --dont-use-soft-clipped-bases argument in Mutect2 to actually work as intended (#6823)
  - Due to a bug, this option previously had no effect, because soft clips were being discarded from a copy of the original reads rather than from the reads themselves. We removed the unnecessary mapping quality filtering (which was entirely redundant with the M2 read filter), so the assembly region that gets finalized (and has its soft clips discarded, if requested) is now built from the original reads, not a copy.
- Fixed a bug in the Mutect2 engine active region code that could affect the ability to call tumor alts when the normal has a different alt at the same site (#6908)
- Removed an obsolete cram to bam conversion step in the Mutect2 WDL (#6970)
- Updated the Mutect2 whitepaper in docs/mutect/mutect.pdf to accurately reflect current filter names, and updated the section on FilterAlignmentArtifacts (#6967)
CNV Calling
- A new pipeline for gCNV exome joint calling (#6554)
  - Added a new tool (JointGermlineCNVSegmentation) and associated workflow (scripts/cnv_wdl/germline/joint_call_exome_cnvs.wdl) to combine gCNV segments and calls across samples
  - JointGermlineCNVSegmentation segments and genotypes CNV calls from the germline CNV pipeline jointly across multiple samples.
  - The workflow in scripts/cnv_wdl/germline/joint_call_exome_cnvs.wdl produces a joint, multi-sample genotyped VCF.
  - For whole genomes, we recommend calling CNVs as part of a full SV callset with https://github.com/broadinstitute/gatk-sv (soon to be added to Terra)
- GermlineCNVCaller now restarts inference once with a new random seed when inference diverges. Also added a new entry point to PythonScriptExecutor that returns ProcessOutput. (#6866)
  - This is intended to alleviate transient issues with GermlineCNVCaller inference in which the ELBO converges to a NaN value, by calling the python gCNV code with an updated random seed input.
- CreateReadCountPanelOfNormals: fixed a bug in the logic for filtering zero-coverage samples and intervals (#6624)
- FilterIntervals: fixed a bug in the tool logic when filtering on annotations and -XL is used to exclude intervals (#7046)
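The GermlineCNVCaller restart behavior described above (#6866) follows a familiar retry-once pattern. Here is a minimal sketch under stated assumptions: run_inference is a hypothetical callable that returns the final ELBO for a given seed, and the new seed is simply the old seed plus one, since the notes do not say how GATK chooses it.

```python
import math


def run_with_one_restart(run_inference, seed):
    """Run inference; if the final ELBO is NaN, retry once with a new seed.

    Returns a (final_elbo, seed_used) tuple.
    """
    elbo = run_inference(seed)
    if math.isnan(elbo):
        seed = seed + 1  # assumption: any different seed will do
        elbo = run_inference(seed)  # a second NaN is returned, not retried
    return elbo, seed


def fake_inference(seed):
    """Stand-in for the real inference call: diverges only for seed 0."""
    return float("nan") if seed == 0 else -123.4


result = run_with_one_restart(fake_inference, seed=0)
print(result)
```

Limiting the retry to a single attempt matches the "restarts inference once" wording above; a persistent NaN is still surfaced to the caller rather than looping indefinitely.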
SV Calling
- PrintSVEvidence: a new tool that prints any of the Structural Variation evidence file types: read count (RD), discordant pair (PE), split-read (SR), or B-allele frequency (BAF) (#7026)
  - This tool is used frequently in the GATK-SV pipeline for retrieving subsets of evidence records from a bucket over specific intervals. Evidence file formats comply with the current specifications in the existing GATK-SV pipeline.
GenomicsDB
- Introduced a new feature for GenomicsDBImport that allows merging multiple contigs into fewer GenomicsDB partitions (#6681)
  - Controlled via the new --merge-contigs-into-num-partitions argument to GenomicsDBImport
  - This should produce a huge performance boost in cases where users have a very large number of contigs. Prior to this change, GenomicsDB would create a separate folder/partition for each contig, which slowed down import to a crawl when there were many contigs.
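As an illustration of how the new argument might be passed, the sketch below builds (but does not run) a GenomicsDBImport command line. Only --merge-contigs-into-num-partitions comes from the notes above; the other arguments, the file names, and the partition count of 25 are placeholders and assumptions to verify against the tool's --help output.

```python
def genomicsdb_import_command(workspace, sample_map, intervals, num_partitions):
    """Build a GenomicsDBImport command that merges contigs into partitions."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return ["gatk", "GenomicsDBImport",
            "--genomicsdb-workspace-path", workspace,  # assumed argument
            "--sample-name-map", sample_map,           # assumed argument
            "-L", intervals,                           # assumed argument
            "--merge-contigs-into-num-partitions", str(num_partitions)]


# Placeholder paths and an arbitrary partition count, for illustration only.
import_cmd = genomicsdb_import_command("my_workspace", "samples.map",
                                       "contigs.list", 25)
print(" ".join(import_cmd))
```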
Funcotator
- Added sorting by strand order for transcript subcomponents (#7065)
  - This fixes an issue where the coding sequence, protein prediction, and other annotations could be incorrect for the hg19 version of Gencode, due to the individual elements of each transcript appearing in numerical order, rather than the order in which they appear in the transcript at transcription time.
- Updated the Funcotator tutorial link in the tool documentation. (#6920) (#6925)
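The strand-order fix above (#7065) comes down to the difference between genomic coordinate order and transcription order. The following is a minimal sketch of that idea, not Funcotator's implementation: on the + strand the two orders agree, while on the - strand the elements are transcribed from the highest coordinate to the lowest. The function name and the (start, end) tuple representation are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) in genomic coordinates


def sort_in_transcription_order(exons: List[Exon], strand: str) -> List[Exon]:
    """Order transcript elements as they appear at transcription time."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Ascending by start on the + strand, descending on the - strand.
    return sorted(exons, key=lambda exon: exon[0], reverse=(strand == "-"))


exons = [(100, 200), (500, 600), (300, 400)]
print(sort_in_transcription_order(exons, "+"))
print(sort_in_transcription_order(exons, "-"))
```

Concatenating exon sequences in plain numerical order for a - strand transcript would assemble the coding sequence out of order, which is the kind of incorrect annotation the fix addresses.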
Mitochondrial pipeline
- Simplified the max_reads_per_alignment_start argument in mitochondria_m2_wdl/AlignAndCall.wdl (#6904)
- Removed the unused "autosomal_coverage" parameter from the Filter task in mitochondria_m2_wdl/AlignAndCall.wdl (#6888)
Notable Enhancements
- Added a -O option to save the output to a file in the following tools: FlagStat, CountBases, CountReads, CountVariants, and CountBasesInReference (#7072)
- DepthOfCoverage: added a new gene_statistics output file (#7025)
- ReblockGVCF: now allows reblocking with no PLs (#6757)
Bug Fixes
- Fixed a ClosedChannelException error when doing multiple queries on remote CRAM files, and added a test to verify proper stream management (#7066)
- SelectVariants: Fixed an issue where SelectVariants could generate duplicate VCF header lines in some circumstances, resulting in an invalid VCF (#7069)
- VariantAnnotator: fixed a NullPointerException by adding a validation check that all samples in the input bam are present in the provided vcf before running (#6944)
- SplitNCigarReads: fixed an error where the read mate key was not sufficiently strict about read names, causing cigar errors (#6909)
- CalculateGenotypePosteriors: now ensures that resources have the same sequence dictionary as the input VCF (#6430)
- MarkDuplicatesSpark: fixed a NullPointerException when a null ReadNameRegex was provided (#7002)
- GnarlyGenotyper: fixed a bug in the QUALapprox calculation, added tolerance for missing VarDP, and added support for AS_QUALapprox when QUALapprox is missing (#7061)
- Fixed the GATK version number in the docker image when doing releases to not end in "-SNAPSHOT" (#6883)
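The VariantAnnotator fix above (#6944) adds an up-front check that every sample in the input BAM is also present in the VCF. A minimal sketch of that kind of validation, with hypothetical function names and sample names supplied as plain lists rather than read from real file headers:

```python
def missing_samples(bam_samples, vcf_samples):
    """Return the sorted sample names found in the BAM but not in the VCF."""
    return sorted(set(bam_samples) - set(vcf_samples))


def validate_samples(bam_samples, vcf_samples):
    """Fail early with a readable message instead of a later null error."""
    missing = missing_samples(bam_samples, vcf_samples)
    if missing:
        raise ValueError(
            "Samples in the BAM but not in the VCF: " + ", ".join(missing))


# Passes silently: every BAM sample is present in the VCF.
validate_samples(["NA12878"], ["NA12878", "NA12891"])
print(missing_samples(["NA12878", "NA12891"], ["NA12878"]))
```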
Miscellaneous Changes
- Switched GATK to the Apache 2.0 license (#7079)
- We now print the current Spark version on GATK startup (#7028)
- Added a log warning message when the total size of the PL arrays for a variant will likely exceed 100,000 (#6334)
- Added a script to publish GATK tool WDLs for each release (#6980)
- Migrated the GATKPath base class to HtsPath (#6763)
- Migrated additional tools to GATKPath (#6718)
- Made BaseUtils.convertIUPACtoN() and BaseUtils.simpleBaseToBaseIndex() methods more robust to handle all possible byte values (#7010)
- Enabled CARROT integration for triggering test runs from PR comments (#6917) (#6986)
- Added loci information to several annotation warnings (#6891)
- VariantRecalibrator: added locus information to a ref allele mismatch error message (#6964)
- ReferenceConfidenceVariantContextMerger: corrected AS annotation warning message to use GATK4 annotation names (#6985)
- Made the CNNScoreVariants task in cnn_variant_wdl/cnn_variant_common_tasks.wdl robust to the reads and index being in different locations. (#6900)
- Updated gcloud docker commands in build_docker.sh (#7078)
- Added version number to the dockstore yml file (#6905)
- Switched travis gcloud installation to use noninteractive mode (#6974)
- Deleted the obsolete tool FixCallSetSampleOrdering (#7022)
- The log file is now echoed after a failed travis run. (#7020)
- Temporarily disabled the PairHMMUnitTest on Java 11. (#7044)
- Pinned our h5py version to 2.10.0. (#6955)
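To give a sense of when the PL-size warning above (#6334) might trigger, the sketch below works through the arithmetic. The number of genotypes, and therefore PL values, for a sample with a given ploidy and allele count is the standard multiset count C(alleles + ploidy - 1, ploidy). How GATK itself totals the PL arrays is not stated in these notes, so multiplying by the number of samples is an assumption, and the function names are hypothetical.

```python
import math

PL_WARNING_THRESHOLD = 100_000  # threshold quoted in the release notes


def genotype_count(num_alleles: int, ploidy: int = 2) -> int:
    """Number of unordered genotypes, i.e. PL values per sample."""
    return math.comb(num_alleles + ploidy - 1, ploidy)


def total_pl_size(num_samples: int, num_alleles: int, ploidy: int = 2) -> int:
    """Assumed total: PL values per sample times the number of samples."""
    return num_samples * genotype_count(num_alleles, ploidy)


# 100 diploid samples: 6 alleles give 21 PLs each, 50 alleles give 1275 each.
for alleles in (6, 50):
    size = total_pl_size(100, alleles)
    print(alleles, size, size > PL_WARNING_THRESHOLD)
```

Because the genotype count grows quadratically with the number of alleles for diploid samples, highly multi-allelic sites in large cohorts are where such a warning would be expected.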
Documentation
- Added a link to the new gatk-tool-wdls repository to the README (#6982)
- Updated JEXL documentation website link in SelectVariants and VariantFiltration (#7029)
- Updated the ApplyVQSR docs to consistently use the GATK4 tool name: ApplyRecalibration -> ApplyVQSR
- Modified the README to reflect the current download size for Git LFS files (#6933)
- Fixed a typo in the conda environment YML documentation. (#6935)
- Removed reference to -Dtest.single from the README (#6914)
- Fixed a typo in a javadoc comment in HaplotypeCallerEngine (#7033)
Dependencies
- Updated HTSJDK to 2.24.0 (#7073)
- Updated Picard to 2.25.0 (#7075)