GMM4TB: Disentangling mixed infection in tuberculosis

Jul 23, 2024

Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), is increasingly complicated by drug resistance and mixed-strain infections (MSIs). This study applies Gaussian Mixture Models (GMMs) to whole genome sequencing (WGS) data to analyse MSIs in TB. GMMs accurately predict drug resistance profiles and deconvolute mixed-strain infections, providing a computational method to understand and manage these complex cases. This approach enhances our ability to diagnose and treat drug-resistant TB, offering valuable insights for clinical decision-making and infection control. (This is also an analysis of the study Mixed infections in genotypic drug-resistant Mycobacterium tuberculosis)

TB and Drug resistance

Tuberculosis (TB), caused by the bacterium Mycobacterium tuberculosis (MTB), is a major global health problem, leading to over 1 million deaths each year. This article explores the complexities of mixed infections in drug-resistant TB, using advanced techniques like whole genome sequencing (WGS) and statistical models to gain better insights into these challenging cases.

Mixed strain infection (MSI)

High transmission rates and inadequate diagnosis can lead to MSIs, with multiple strains of M. tuberculosis present within a single host.
MSIs may arise from multiple infection events, often seen in relapse cases.
M. tuberculosis shows no horizontal gene transfer, i.e. resistance-coding genes are not passed between bacteria in proximity.

Characteristics and Implications of MSIs

The strains in an MSI can belong to the same or different lineages, and they remain distinct within the host, in varying proportions.
Different drug resistance patterns can exist among the strains, as horizontal gene transfer is not observed in M. tuberculosis.

Treatment Challenges with MSIs

Heteroresistance, where resistance mutations aren't fixed in the bacterial population, often occurs in MSIs, leading to higher treatment failure rates.
Inadequate treatment approaches can trigger MSIs, resulting in treatment non-compliance or failure, reduced treatment options, and potential exposure to less effective second/third-line drugs.

Proportion of TB heteroresistance. Countries with no data are coloured on the map in light grey.

Mixed-strain infections (MSIs) appear in WGS data as a mix of genetic types. Identifying and separating these different strains is important for accurate drug resistance profiles. The GMM approach provides a way to do this without growing the bacteria in a lab. This helps doctors make better treatment decisions.

In conclusion, combining WGS data with GMM statistical analysis helps us better understand mixed infections in drug-resistant TB. This can improve diagnosis, treatment, and infection control, ultimately helping to save lives and prevent the spread of resistant TB strains.

Essential Link: Understanding SNP Frequencies in VCF and Differentiating Strains

Genetic variation between different strains is crucial in studying tuberculosis (TB) mixed infections. One key type of genetic variation is the single nucleotide polymorphism (SNP), a single base change in the DNA sequence. The frequency of these SNPs helps us understand the composition of mixed infections. This data is often stored in Variant Call Format (VCF) files.

What is a VCF?

A VCF file is a standardised text file used in bioinformatics to store information about genetic variants. It includes details like the variant's position on the genome, the reference (original) base, the alternate (mutated) base(s), and other metadata. See my other post on creating and extracting information from VCF files. The Bcftools guide explains why and how to manipulate VCF files.

Here's a simplified example of what a VCF entry might look like:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       123456  .       A       G       50      PASS    DP=100;AF=0.3

In this example:

#CHROM: the chromosome the variant is on.
POS: the position of the SNP on the chromosome.
REF: the reference base.
ALT: the alternate base.
DP (depth, in the INFO column): the number of reads covering this position.
AF (allele frequency, in the INFO column): the frequency of the alternate allele (0.3 or 30%).
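To make the column layout concrete, here is a minimal Python sketch that splits the example entry above into its fields. The function name `parse_vcf_line` is made up for this post, and the sketch assumes that both DP and AF are present in the INFO column, which is not guaranteed for every VCF. For real files a dedicated VCF parser is a safer choice.

```python
def parse_vcf_line(line):
    """Split one VCF data line into a dict, expanding the INFO column."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.split()[:8]
    info_fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_fields[key] = value  # flag-style entries get an empty string
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),  # several alternate bases are comma-separated
        "depth": int(info_fields["DP"]),        # assumes DP is present
        "alt_freq": float(info_fields["AF"]),   # assumes a single AF value
    }

# The example entry from above
record = parse_vcf_line("1\t123456\t.\tA\tG\t50\tPASS\tDP=100;AF=0.3")
print(record["pos"], record["ref"], record["alt"], record["alt_freq"])
```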

How to access the data from VCF

In the context of TB mixed infections, VCF files contain SNP data for the different strains present in a sample. The allele frequency (AF) is particularly important as it tells us the proportion of reads supporting the alternate allele compared to the total reads at that position. This helps identify which SNPs are from which strains.

The command below extracts the chromosome, genomic position, reference allele, alternative allele and alternative allele frequency from the VCF file:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AF\n' input.vcf > allele_frequencies.tsv
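Once the table exists, it can be loaded into Python for the later modelling steps. The sketch below is one possible way to do it, not the code used in the study. The helper name `read_allele_frequencies` and the two example rows are made up, and an in-memory string stands in for the real `allele_frequencies.tsv` file.

```python
import csv
import io

def read_allele_frequencies(handle):
    """Return a list of (chrom, pos, ref, alt, af) tuples from the bcftools query output."""
    rows = []
    for chrom, pos, ref, alt, af in csv.reader(handle, delimiter="\t"):
        if af == ".":  # bcftools prints "." when the AF tag is missing
            continue
        # keep the first value if a multi-allelic site lists several frequencies
        rows.append((chrom, int(pos), ref, alt, float(af.split(",")[0])))
    return rows

# Stand-in for open("allele_frequencies.tsv"); the two rows are made-up examples
example = io.StringIO("1\t123456\tA\tG\t0.3\n1\t234567\tC\tT\t.\n")
snps = read_allele_frequencies(example)
frequencies = [af for *_, af in snps]
print(frequencies)
```

The `frequencies` list is the kind of input a GMM can later be fitted to, one value per SNP.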

Differentiating Strains Using SNP Frequencies

When dealing with mixed infections, the frequency of each SNP can indicate the presence of different strains.

Different strains carry different sets of SNPs across the genome, so in an MSI the allele frequency histogram often shows more than one peak.
If a VCF file shows an SNP with an allele frequency of 0.5, it suggests that the sample might be a mix of two strains, each contributing equally to the genetic material at that position. Researchers can separate the mixture into its component strains by analysing these frequencies across many SNPs.

GMM plot on top of the alternative allele frequency histogram. The 2/8 in the middle graph is an example alternative allele count: 2 out of the 8 reads covering the SNP site carry the alternative allele (C).

In an MSI, we expect to see both the reference (original) and the alternative (mutated) allele at the same position, and the mixing of the two strains determines their ratio. Hence, the frequencies of the SNPs reflect the underlying strain frequencies.
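A small simulation can illustrate why SNP frequencies track strain proportions. The sketch below uses made-up numbers (a 70/30 mix, a read depth of 100 and 200 strain-specific SNPs per strain) and assumes each read comes from a strain with probability equal to that strain's share of the sample. It is an illustration only, not data from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

major_fraction = 0.7   # assumed share of the major strain in the sample
depth = 100            # assumed read depth at every SNP site
n_snps = 200           # assumed number of strain-specific SNPs per strain

# SNPs carried only by the major strain: a read shows the alternative allele with probability 0.7
major_alt = rng.binomial(depth, major_fraction, size=n_snps)
# SNPs carried only by the minor strain: a read shows the alternative allele with probability 0.3
minor_alt = rng.binomial(depth, 1 - major_fraction, size=n_snps)

major_freq = major_alt / depth
minor_freq = minor_alt / depth

# The two groups of frequencies centre on the two strain proportions
print(round(major_freq.mean(), 2), round(minor_freq.mean(), 2))
```

Plotting `major_freq` and `minor_freq` together as a histogram would show two separate peaks, one per strain.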

The SNP frequencies arising from each strain are assumed to follow a Gaussian distribution, and a GMM models several such distributions at once. The figure above shows aligned reads at a site with reference allele T and alternative allele C.

The alternative allele frequency at that site is 2/8 = 0.25 (number of reads carrying the alternative allele divided by the total number of reads). The frequencies from all SNP sites are then fitted with a GMM. The peaks of the two Gaussian distributions should reflect the proportions of the two strains within the mix.
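The 2/8 calculation can be written out in a few lines of Python. The list of bases below is a made-up stand-in for the eight reads in the figure (six T, two C), and `alt_allele_frequency` is a name chosen for this sketch.

```python
from collections import Counter

def alt_allele_frequency(bases, alt):
    """Fraction of reads at one site that carry the alternative base."""
    counts = Counter(bases)
    return counts[alt] / len(bases)

# Eight reads covering the SNP site: six carry the reference T, two carry the alternative C
bases_at_site = ["T", "T", "C", "T", "T", "T", "C", "T"]
frequency = alt_allele_frequency(bases_at_site, alt="C")
print(frequency)  # 2 / 8 = 0.25
```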

Why GMM

When tackling complex problems like mixed infections in tuberculosis (TB), we need powerful tools to make sense of the data. One such tool is the Gaussian Mixture Model (GMM). But what exactly is a GMM, and why is it useful for understanding mixed infections?

What is a Gaussian Mixture Model (GMM)?

Imagine you have many data points, like the heights of monkeys of a single species in a jungle. If you plotted these heights, you'd probably see a bell-shaped curve, or what statisticians call a "normal distribution" or "Gaussian distribution." This curve tells you that most monkeys are of average height, with fewer monkeys being very short or very tall.

Now, let's say you have data that looks like it's a mix of several bell-shaped curves. Maybe you're looking at heights in a jungle with multiple distinct species, each with its own average height and variation. SNP frequency also works in a similar way (in fact, one may speculate whether it is the SNP frequency distribution that resulted in the similar height distribution.)

"The sampling distribution of the mean will be approximately normally distributed, as long as the sample size is large enough" (central limit theorem).

A GMM helps you understand this kind of mixed data. It assumes that the data is a combination of several Gaussian distributions, each representing a different group.
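Written as a formula, a one-dimensional GMM describes the density of a data point x (here, an SNP frequency) as a weighted sum of K Gaussians. The symbols are the standard textbook ones rather than notation taken from the study: \pi_k is the weight of component k, \mu_k its mean (the peak position) and \sigma_k^2 its variance (the spread).

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x \mid \mu_k, \sigma_k^2\right),
\qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0
```

For a mix of two strains, K = 2 and the two means are the values expected to sit near the two strain proportions.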

How Does a GMM Work?

A GMM tries to find the best combination of these bell-shaped curves to explain your data. Here's a simplified rundown of how it works:

Initialisation: Start with an initial guess of how many groups (or clusters) there are and what their distributions look like.
Expectation Step: Calculate the probability that each data point belongs to each group.
Maximisation Step: Update the parameters of the Gaussian distributions (e.g. the peak, spread and relative weight of each bell-shaped curve) to better fit the data, based on the probabilities calculated in the previous step (the probability of each point belonging to each Gaussian distribution).
Iteration: Repeat the Expectation and Maximisation steps until the model stabilises and the best fit is found.
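The four steps above can be sketched in plain NumPy for the one-dimensional case. This is a teaching sketch under simplifying assumptions, not the implementation used by Scikit-learn or by GMM4TB. It runs a fixed number of iterations instead of checking for convergence, initialises the means from data quantiles, and is tested only on made-up frequencies around 0.3 and 0.7. The function name `fit_gmm_1d` is invented for this post.

```python
import numpy as np

def fit_gmm_1d(x, n_components=2, n_iter=100):
    """Fit a one-dimensional GMM with plain EM; returns weights, means, variances."""
    x = np.asarray(x, dtype=float)
    # Initialisation: equal weights, means spread over the data quantiles, shared variance
    weights = np.full(n_components, 1.0 / n_components)
    means = np.quantile(x, np.linspace(0.1, 0.9, n_components))
    variances = np.full(n_components, x.var() + 1e-6)
    for _ in range(n_iter):
        # Expectation step: probability that each point belongs to each component
        diff = x[:, None] - means[None, :]
        dens = np.exp(-0.5 * diff**2 / variances) / np.sqrt(2 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # Maximisation step: re-estimate weights, means and variances
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances

rng = np.random.default_rng(1)
# Synthetic frequencies: 300 SNPs around 0.3 and 300 around 0.7 (made-up data)
x = np.concatenate([rng.normal(0.3, 0.03, 300), rng.normal(0.7, 0.03, 300)])
weights, means, variances = fit_gmm_1d(x)
print(np.round(np.sort(means), 2))
```

The small constant added to the variances keeps a component from collapsing onto a single point, which is a common safeguard in EM implementations.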

Why is GMM Suitable for Mixed Infections?

Mixed infections in TB mean that a patient has more than one strain of the TB bacterium. This can happen because of different infections or because the bacterium evolves within the patient. Identifying and understanding these mixed infections is crucial because different strains might respond differently to treatment.

Here's why GMM is particularly useful for this:

Flexibility: GMM can handle the complexity of mixed infections by modelling each strain as a separate Gaussian distribution. This allows it to accurately represent the mix of strains in a patient.
Probabilistic Nature: GMM calculates the probability that each data point (in this case, each genetic feature of the bacteria) belongs to each strain. This probabilistic approach is powerful for teasing apart the contributions of different strains.
Scalability: GMMs can work with large datasets, making them suitable for analysing comprehensive genomic data from many patients.

Using Gaussian Mixture Models (GMMs), researchers can further refine this process. GMMs analyse the SNP frequency data to probabilistically assign each SNP to a specific strain, based on the assumption that the data is composed of multiple Gaussian distributions. This allows for a precise understanding of the strain composition within a sample, aiding in the diagnosis and treatment of drug-resistant TB.

In summary, SNP frequency data stored in VCF files is critical for differentiating between strains in mixed infections. The allele frequency information, combined with advanced statistical models like GMMs, provides a detailed picture of the genetic landscape, enabling better management of TB.

The study Mixed infections in genotypic drug-resistant Mycobacterium tuberculosis demonstrates the potential of combining WGS data with GMM statistical analysis to provide a deeper understanding of mixed infections in drug-resistant TB, ultimately improving diagnosis, treatment, and infection control efforts.

Methods

The approach: Gaussian Mixture Model (GMM) for SNP frequencies

1. Clinical Isolates and Sequence Analysis:

Analysed MTB isolates using Illumina NGS technology.
Applied direct association (i.e. library search) software, e.g. TB-Profiler, to infer drug resistance.
Used bcftools to extract SNP information from VCF files. See my other post on how to create and extract information from VCF files: Bcftools guide: why and how to manipulate VCF files.

2. Gaussian Mixture Model:

Built GMMs using Scikit-learn to analyse allele counts across SNPs, estimating the number of mixtures and their proportions.

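import numpy as np
from sklearn.mixture import GaussianMixture

# SNP frequency input: one alternative allele frequency per SNP (list truncated here).
# scikit-learn expects a 2D array, so reshape to one column with one row per SNP.
X = np.array([0.1, 0.9, 1, 0.6, 0.5]).reshape(-1, 1)

# Create the GMM; n_components is the number of mixes in the sample (2 assumed here)
gmm = GaussianMixture(n_components=2, covariance_type='full')

# Train the model
gmm.fit(X)

# Predict the component (strain) label of each SNP
labels = gmm.predict(X)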
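Estimating the number of mixtures can be done by fitting several candidate models and comparing them. The sketch below uses the Bayesian information criterion (BIC) that Scikit-learn's `GaussianMixture` exposes through its `bic` method. Whether GMM4TB selects the number of components this way is not stated in this post, so treat it as one reasonable option. The input frequencies are synthetic and assume a 70/30 mix.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Made-up frequencies for a 70/30 mix: one column, one row per SNP
X = np.concatenate([rng.normal(0.7, 0.03, 300), rng.normal(0.3, 0.03, 300)]).reshape(-1, 1)

# Fit models with 1 to 3 components and keep the one with the lowest BIC
candidates = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in (1, 2, 3)]
best = min(candidates, key=lambda model: model.bic(X))

print(best.n_components)     # number of Gaussians selected
print(best.means_.ravel())   # peak positions, expected near the strain proportions
print(best.weights_)         # share of SNPs assigned to each Gaussian
```

Note that `weights_` is the share of SNPs falling under each Gaussian, while the peak positions in `means_` are the values that correspond to the strain proportions discussed earlier.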

3. Assigning drug resistance to individual strains

Each mutation responsible for drug resistance is run through the model to determine which strain it belongs to, which in turn gives the drug resistance of each strain. The results compare the normal drug resistance prediction for the whole sample (drug resistance (DR) profile) with the disentangled drug resistance (major strain DR mutations, minor strain DR mutations).
Drug resistance information can be obtained by comparing SNPs against a known library (direct association). One such tool for TB is TB-Profiler.
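The assignment step can be sketched as follows. The background frequencies are synthetic (a 70/30 mix), the mutation names and their frequencies are hypothetical placeholders rather than real TB-Profiler output, and the rule "higher mean = major strain" is an assumption made for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Made-up genome-wide SNP frequencies for a sample with a 70/30 mix
X = np.concatenate([rng.normal(0.7, 0.03, 300), rng.normal(0.3, 0.03, 300)]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# The component with the higher mean is taken to be the major strain
major = int(np.argmax(gmm.means_.ravel()))

# Hypothetical resistance mutations and their alternative allele frequencies
resistance_snps = {"mutation_A": 0.68, "mutation_B": 0.31}

assignment = {}
for name, freq in resistance_snps.items():
    posterior = gmm.predict_proba([[freq]])[0]   # probability of each component
    component = int(np.argmax(posterior))
    assignment[name] = "major" if component == major else "minor"

print(assignment)
```

Keeping the full `posterior` rather than only the most likely component would also show how confident each assignment is, which matters for mutations whose frequency falls between the two peaks.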

*Mixing proportions: (TB-Profiler prediction) [GMM prediction]; PAS

As can be seen in the first sample ERR4796347, drug resistance only comes from the Major strain, whereas the following two samples also show unequal drug resistance for each strain.

Code Implementation for GMM Analysis

The complete Python implementation of the Gaussian Mixture Model analysis for TB MSI study (GMM4TB) in the form of a runnable script can be found at: https://github.com/linfeng-wang/GMM4TB

Conclusion

This study showcases the potential of combined WGS data and GMM statistical analysis to provide insights into mixed infections of MTB. Accurate profiling of mixed infections can significantly aid in the treatment and management of drug-resistant TB. The provided code demonstrates a practical implementation of the GMM approach for analysing mixed infections.

For more details, you can access the complete research paper here.