Methods

K-mer signatures are calculated using the method outlined in Bauer et al. (2020). Briefly, a multiple sequence alignment of all SARS-CoV-2 genome sequences is performed with MAFFT. To limit the impact of sequencing errors and artefacts, the alignment is trimmed so that only positions with at least 95% coverage are retained. Identical sequences are then collapsed into a single representative sequence (see Table below). The frequencies of all possible 10-mers are then calculated for each genome, and the resulting frequency vector is reduced to two principal components, which are plotted.
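The trimming, collapsing and k-mer counting steps described above can be illustrated with a minimal Python sketch. It is not the code of Bauer et al. 2020: the function names, the treatment of `-` and `N` as uncovered positions, and the skipping of k-mers that contain non-ACGT characters are all assumptions made for illustration. The alignment itself (MAFFT) and the final reduction to two principal components are not shown, and a toy alignment with k = 2 stands in for real genomes and 10-mers.

```python
from collections import Counter

# Assumption: gap and ambiguous characters count as "not covered".
UNCOVERED = {"-", "N"}


def trim_alignment(aligned, min_coverage=0.95):
    """Keep only columns where at least min_coverage of sequences have a base."""
    n = len(aligned)
    keep = [
        i for i in range(len(aligned[0]))
        if sum(seq[i].upper() not in UNCOVERED for seq in aligned) / n >= min_coverage
    ]
    return ["".join(seq[i] for i in keep) for seq in aligned]


def collapse_identical(seqs):
    """Collapse identical sequences into (representative, copy count) pairs."""
    return list(Counter(seqs).items())


def kmer_frequencies(seq, k=10):
    """Relative frequency of each k-mer observed in seq (sparse representation)."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")  # assumption: skip k-mers with gaps/ambiguity
    )
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


# Toy aligned sequences (hypothetical data, equal length as after alignment).
aligned = ["ACGTAC-T", "ACGTACGT", "ACGTAC-T", "ACCTAC-T"]

trimmed = trim_alignment(aligned, min_coverage=0.95)   # drops the mostly-gap column
representatives = collapse_identical(trimmed)          # unique sequences with counts
signatures = {seq: kmer_frequencies(seq, k=2) for seq, _ in representatives}
```

Each signature is stored sparsely as a dictionary because the full space of 10-mers has 4^10 = 1,048,576 entries, most of which are zero for any single genome. To reproduce the final step, these dictionaries would be expanded into a genomes-by-k-mers matrix and passed to a PCA implementation of choice.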