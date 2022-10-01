High-throughput genome sequencing is changing the way we practice medicine by providing new ways to identify the genetic causes of disease. DNA sequencing applications often rely on curated databases for the analysis and interpretation of results. However, it is almost impossible to keep small amounts of DNA that do not belong to the organism of interest out of the sequencing process. These contaminants come from a variety of sources, including laboratory personnel, the reagents used, and even the samples themselves. When the samples under investigation are human, microbial contaminants can be misinterpreted as infectious agents. Conversely, when the samples are bacterial, human contaminants may be unknowingly assembled into the reference sequence of a bacterial genome, where they become a source of misleading matches when those sequences are detected in subsequent studies.

A new study (Sci Rep 2022; 12:9863-9863) illustrates how this happens. Unassembled DNA sequencing “reads” were collected from nearly 5,000 people and scanned to identify viruses, bacteria, and archaea. The reads that did not match the human genome were collated and analyzed to produce a picture of the human “contaminome,” which can be divided into three categories (Figure 1A): viral reads associated with the human virome (i.e., the collection of all viruses found in humans); bacterial or viral reads introduced through sample collection (e.g., the normal microbiota of the sample site) and handling, propagation of cell lines, or the laboratory reagents and kits used for sequencing (i.e., experimental contaminants); and bacterial reads that are mismatches caused by human sequence contaminating bacterial genome databases (i.e., computational contamination).

Figure 1: Understanding the human contaminome

The danger of the third category is that it can lead to spurious associations between microbes and disease. The study illustrates this with an intriguing finding: after identifying all the bacterial reads present in the nearly 5,000 human DNA samples, the authors detected more than 50 bacteria that were significantly more common in men than in women (sex was treated as a binary variable). Rather than jump to the conclusion that these results reflected actual bacterial infections that were more common in men, the authors instead asked what would happen if bacterial genomes were contaminated with fragments of the human Y chromosome (Figure 1B). In that case, reads derived from the Y chromosome would (erroneously) match the bacterial genomes. Supporting this hypothesis was the state of the Y-chromosome reference sequence, which, as of earlier this year, was still incomplete. From the reads that aligned with bacterial genomes, the authors identified 77,647 short DNA sequences that were significantly more common among men.
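To make the idea of a bacterium being "significantly more common in men" concrete, here is a minimal sketch of a one-sided Fisher's exact test on a single taxon. It is an illustration only: the article does not say which statistical test the authors used, and the function name and the counts below are hypothetical.

```python
from math import comb


def fisher_one_sided(males_with, males_total, females_with, females_total):
    """Probability of seeing at least `males_with` carriers among the males
    if carrying the taxon were unrelated to sex (hypergeometric upper tail).

    All arguments are counts of individuals; the names are illustrative.
    """
    carriers = males_with + females_with
    total = males_total + females_total
    # Number of ways to choose which individuals are male.
    denominator = comb(total, males_total)
    tail = 0
    for k in range(males_with, min(carriers, males_total) + 1):
        # k carriers fall among the males, the remaining males are non-carriers.
        tail += comb(carriers, k) * comb(total - carriers, males_total - k)
    return tail / denominator


# Hypothetical counts for one taxon: detected in 60 of 2,400 men
# and in 5 of 2,400 women.
p_value = fisher_one_sided(60, 2400, 5, 2400)
print(f"one-sided p-value: {p_value:.3g}")
```

A small p-value only says that the reads assigned to the taxon are unevenly distributed between the sexes. As the study shows, it cannot distinguish a real infection pattern from human Y-chromosome sequence hiding in the bacterial reference. When many taxa are tested, a multiple-testing correction would also be needed.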

Fortunately, the final sequence of the Y chromosome was published this year. The 77,647 “bacterial” sequences identified by the study were aligned with the Y chromosome sequence, and 73,691 of them (95%) were found to match, indicating that these sequences are in fact human, confirming the previous hypothesis. This result emphasizes the need to be cautious when interpreting the results of large DNA sequencing projects, and raises the question of whether reported associations between particular microbial species and cancer, blood, or autoimmune diseases still hold. Could some of them be artifacts of computational contaminants?
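The check described above can be sketched in a few lines. This is a toy version under stated assumptions: it uses exact substring matching on either strand instead of a real aligner, and the reference string and the three sequences are invented for illustration. Only the two counts at the end come from the study.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def fraction_matching(sequences, reference):
    """Fraction of sequences found exactly in the reference on either strand."""
    hits = sum(
        1
        for s in sequences
        if s in reference or reverse_complement(s) in reference
    )
    return hits / len(sequences)


# Toy data (hypothetical): two of the three sequences occur in the reference,
# one of them only as its reverse complement.
toy_reference = "ACGTTGCAAGGCT"
toy_sequences = ["GTTGCA", "AGCCTT", "TTTTTT"]
print(fraction_matching(toy_sequences, toy_reference))

# The figures reported in the study: 73,691 of 77,647 sequences matched.
reported_fraction = 73_691 / 77_647
print(f"{reported_fraction:.1%}")  # about 95%
```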

The problem of human sequence contamination in bacterial genomes extends beyond the Y chromosome. More than 3,000 microbial genomes have been reported to contain small human fragments. This complicates the clinical use of microbial genomes in the diagnosis of infectious diseases, in which sequencing of human samples is the basis for pathogen identification. The method involves comparing the DNA or RNA sequences from a patient sample with those of all known microbial genomes (viruses, bacteria, fungi, and parasites) to identify the cause of the infection. In this context, distinguishing microbial reads that come from a true pathogen from those that come from contaminants is essential to avoid misdiagnosis. Despite the best efforts of researchers, computational contaminants can affect even the most robust databases.
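One common-sense safeguard is to screen each read against the human genome before assigning it to a microbe. The sketch below illustrates the idea with exact k-mer matching on toy strings. It is not the method used in the study or in any particular diagnostic pipeline, and every sequence and name in it is hypothetical.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read, human_kmers, microbial_kmers, k):
    """Label a read, giving the human reference priority over microbes.

    human_kmers: set of k-mers from a (toy) human reference.
    microbial_kmers: dict mapping taxon name to its set of k-mers.
    """
    read_kmers = kmers(read, k)
    # Screen against human first, so human sequence embedded in a
    # microbial genome cannot produce a false microbial hit.
    if read_kmers & human_kmers:
        return "human-like"
    for taxon, index in microbial_kmers.items():
        if read_kmers & index:
            return taxon
    return "unclassified"


K = 5
human_index = kmers("ACGTACGTGG", K)
# A toy bacterial genome whose tail carries a human fragment.
microbial_index = {"toy_bacterium": kmers("TTGACCAGGACGTAC", K)}

for read in ["CGTACGT", "TTGACCAG", "GGGGGGG"]:
    print(read, "->", classify_read(read, human_index, microbial_index, K))
```

The first read shares a k-mer with the contaminated bacterial genome, but is set aside because it also matches the human reference. The catch, as the Y-chromosome example shows, is that this screen is only as good as the human reference: sequence missing from that reference passes straight through.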

In addition to computational contaminants, experimental contaminants can be difficult to discern, especially in low-biomass samples (those in which microbial DNA makes up only a small proportion of the total, relative to host DNA), such as blood and cerebrospinal fluid. Some of the most common experimental contaminants belong to groups that include known pathogens, such as staphylococci, Pseudomonas, and mycobacterial species, to name a few. Fortunately, well-designed experimental controls can be applied to detect these contaminants, provided they are known to exist.
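As a rough illustration of how a negative control can be used, the sketch below flags any taxon that appears in a no-template control and is not clearly more abundant in the patient sample. The function name, the tenfold threshold, and the counts are assumptions made for this example, and the counts are assumed to be already normalized for sequencing depth.

```python
def flag_likely_contaminants(sample_counts, control_counts, min_fold=10.0):
    """Return taxa seen in the control whose sample count is less than
    `min_fold` times the control count.

    Both arguments map taxon name to a depth-normalized read count.
    """
    flagged = []
    for taxon, n_sample in sample_counts.items():
        n_control = control_counts.get(taxon, 0)
        # Taxa absent from the control are never flagged by this rule.
        if n_control > 0 and n_sample < min_fold * n_control:
            flagged.append(taxon)
    return sorted(flagged)


# Hypothetical normalized read counts.
sample = {"Staphylococcus": 12, "Pseudomonas": 500, "Escherichia": 40}
control = {"Staphylococcus": 10, "Pseudomonas": 5}
print(flag_likely_contaminants(sample, control))
```

A rule like this only catches contaminants that show up in the control, which is the limitation the paragraph above points to.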

The reported results illustrate how every human genome sequencing project also captures other life forms, including DNA sequences from bacteria and viruses, and how false associations between infectious agents and traits (such as sex) can arise. This work emphasizes the need for complete and accurate genomic sequences to avoid computational contamination of reference sequences and to improve diagnostic accuracy. It also underscores the need for standard protocols to identify the “contaminome” and so ensure the fidelity of sequencing-based diagnostics and testing.