cov-lineages/pangolin v2.0 on GitHub

Release notes pangolin 2.0

This release of pangolin comes with some major changes, including a significant speed-up and improvements in assignment accuracy for larger lineages. The new assignment algorithm, which we have termed pangoLEARN, is described in detail below. One significant benefit of this approach over the previous algorithm is that it allows us to incorporate all of the diversity of the large lineages into the assignment system rather than just picking a select few sequences. It should also improve our handling of homoplasies in the phylogeny, as these sites are unlikely to be informative. We have pulled out the informative sites, and this information is included in the pangoLEARN data release: those files detail the top SNPs that are most positively and negatively associated with a given lineage.

Practical information for the user includes the following:

Data is now pulled from cov-lineages/pangoLEARN rather than cov-lineages/lineages. This is accounted for in the conda environment.yml file, but those not using conda will need to pip install this data. Other new dependencies include minimap2 and datafunk (also pip installable via git+https://github.com/cov-ert/datafunk.git).
The previous algorithm is still accessible using the --legacy flag, but to get assignments based on the most recent data release we encourage you to use the new pangolin 2.0 algorithm.
Usage of pangolin remains the same: pangolin <your-query-fasta>
The output csv now has only a single support column (assignment probability) rather than the previous UFbootstrap and aLRT values. The original format is output if using --legacy.
Going forward, we intend to phase out the legacy algorithm, as it was struggling to scale with the increasing number of lineages and sequences, but it is still available in the current release of pangolin.
pangoLEARN contains information about the top SNPs that are most positively and negatively associated with a given lineage. The lineage recall report is also available in this repository.

pangoLEARN details

pangoLEARN is an alternative algorithm for lineage assignment, implemented as of pangolin 2.0. This new algorithm, which relies on machine learning, offers much faster lineage assignment than the previous phylogenetic approach, which was struggling to scale with the increasing number of lineages that needed to be represented in the guide tree. The new approach also takes into account all of the diversity present within a lineage rather than just selecting a representative few sequences. As a consequence, recall and precision have improved significantly for large lineages. We are continuing to develop more sophisticated approaches to machine learning for lineage assignment, which we hope will offer further improvements in both speed and accuracy.

The current version of pangoLEARN uses multinomial logistic regression, but the pipeline has been written so that, as more complex models are developed, the user will be able to choose which model to use to assign their lineages.

While standard linear regression finds a line of best fit for a set of training data, representing a linear relationship between the variables of interest, logistic regression fits a sigmoid function to the training data in order to tell two different classes apart. Multinomial logistic regression extends standard logistic regression so that it can be used to classify more than two classes. The assignment (i.e. lineage) can be modeled as a set of n-1 independent binary choices (sigmoid functions), each comparing one class against a reference class, where n is the number of classes.
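
As a minimal sketch of the n-1 formulation described above (not pangoLEARN's actual code), the following pure-Python example turns n-1 log-odds scores, each comparing one class against a reference class, into n class probabilities. The function names and the example scores are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Logistic function used by binary logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


def multinomial_probabilities(log_odds):
    """Convert n-1 log-odds (each class vs. a reference class) into n probabilities.

    The reference class has an implicit log-odds of 0 and is returned last.
    """
    exp_scores = [math.exp(score) for score in log_odds]
    denom = 1.0 + sum(exp_scores)
    return [e / denom for e in exp_scores] + [1.0 / denom]


# With two classes the formulation reduces to the ordinary sigmoid.
two_class = multinomial_probabilities([0.7])
assert abs(two_class[0] - sigmoid(0.7)) < 1e-12

# Three hypothetical lineages: two scored against a reference lineage.
probs = multinomial_probabilities([2.0, -1.0])
print(probs)
```

The probabilities always sum to one, and the class with the highest probability would be reported as the assignment, with that probability serving as its support value.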

The model was trained using 30,000 SARS-CoV-2 sequences from GISAID (acknowledgements here), with their lineages assigned by manually curating the global ML (maximum likelihood) tree, as is the standard lineages data release procedure for pangolin. Each base of each genome was one-hot encoded. This left us with a large number of parameters to train, which is why training this model takes approximately 14 hours on our hardware (this will vary with different hardware). The model was built using the standard scikit-learn implementation of multinomial logistic regression. The code for this process is available in the cov-lineages/cov-support repository.
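
To illustrate what one-hot encoding each base means, here is a small sketch in plain Python. It is not the code from cov-lineages/cov-support; the four-letter alphabet and the choice to encode any other symbol (such as N or a gap) as all zeros are assumptions made for this example.

```python
BASES = "ACGT"  # assumed alphabet; other symbols (N, gaps) encode as all zeros


def one_hot_encode(sequence, alphabet=BASES):
    """Return a flat 0/1 feature vector with len(alphabet) entries per base."""
    features = []
    for base in sequence.upper():
        features.extend(1 if base == symbol else 0 for symbol in alphabet)
    return features


encoded = one_hot_encode("ACGTN")
print(encoded)
# One feature per (position, base) pair, so a sequence of L bases gives 4 * L features.
print(len(encoded))
```

Under this encoding a genome of roughly 30,000 bases yields on the order of 120,000 features, and a multinomial model learns a weight for each feature for each lineage, which is consistent with the large parameter count noted above.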

Multinomial logistic regression is a very commonly used model, as it assigns probabilities to class assignments simply and intuitively. However, it does not incorporate any hierarchical structure, and we are currently developing new models that do. Despite the limitations of this simple model, it has performed surprisingly well with this data. While more complex models may offer improvements in assignment accuracy for smaller lineages, logistic regression has the advantages of being intuitive, easy to implement, and relatively fast to train.
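
One reason a linear model like this is easy to interpret is that each one-hot feature has a single coefficient per class. These notes do not state how the top SNPs for each lineage were derived, so the sketch below only shows one plausible way to rank features by coefficient. The function name, the "position:base" labels, and the weights are all made up for illustration.

```python
def top_associated_features(weights, k=2):
    """Return the k most positive and k most negative features for one class.

    `weights` maps a feature label (here "position:base") to its coefficient.
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    positive = [name for name, weight in ranked[:k] if weight > 0]
    negative = [name for name, weight in ranked[::-1][:k] if weight < 0]
    return positive, negative


# Made-up coefficients for a single hypothetical lineage.
example_weights = {
    "100:A": 1.8,
    "250:T": 1.2,
    "375:G": 0.4,
    "420:C": -0.9,
    "999:T": -1.5,
}
positive, negative = top_associated_features(example_weights)
print(positive)
print(negative)
```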

Contributions

Emily Scher and Áine O'Toole have worked together to develop pangolin 2.0.
