Existing genomic databases are full of potential clues about future epidemics that could strike humanity.
In the space of a few months, SARS-CoV-2 and its many variants wreaked unprecedented havoc on the planet, so much so that more than two years after the start of the pandemic, we are still not out of trouble. But while a way out of the crisis finally seems to be on the horizon, virologists have learned their lesson: they now seek to anticipate the next threats. This is the aim of the work of an American team, spotted by Futura, whose researchers have just discovered more than 100,000 new viruses, including nine coronaviruses.
“This is fundamental work,” says bioinformatician Rodney Brister, who is not affiliated with the study. “This demonstrates how little we know about this group of organisms,” he explains in an interview with the prestigious journal Science.
To achieve this, bioinformatician Artem Babaian and high-performance computing expert Jeff Taylor began by gathering an enormous amount of genomic data that researchers across many fields had collected over the years. In total, they ended up with 16 TB of genetic analyses of a wide variety of organisms, “from fish to human microbiota to farm soil”.
The devil is in the details
But even more interesting than this immense genetic register are the treasures hidden within it. We know, for example, that the genomes of the viruses that infect the organisms in question are also captured during sequencing. However, this genetic material tends to blend into the mass of data produced by sequencing, and therefore slips through the cracks.
For biologists, this is truly frustrating, because they know full well that this situation deprives them of potentially important discoveries. Unfortunately, it is simply inconceivable to sift through all this information manually in the hope of stumbling upon a viral genome by chance. Babaian and Taylor therefore chose to bet on high-performance computing.
They turned to the Serratus cloud platform, a system optimized for very large-scale sequence comparison. With this tool, they combed through this mountain of data in search of the gene for a rather special enzyme: RNA-dependent RNA polymerase, a central part of the machinery of RNA viruses. This allowed them to flush out small fragments of viral genes scattered throughout their database.
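To give a rough idea of what "searching sequencing data for a gene" can look like, here is a minimal Python sketch. It is not the Serratus pipeline: it simply translates a DNA read in all six reading frames and counts how many short peptide words (k-mers) it shares with a query protein. The read and the query peptide are made-up toy values, not real polymerase sequences, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Yield the six possible protein translations of a DNA read."""
    seq = seq.upper()
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            yield translate(strand[offset:])


def kmers(text, k):
    """Return the set of all substrings of length k."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def shared_kmers(read, query_protein, k=4):
    """Best count of peptide k-mers shared between any frame and the query."""
    query = kmers(query_protein, k)
    return max(len(query & kmers(frame, k)) for frame in six_frame(read))


# Toy data (assumption): a made-up peptide and a read that encodes it.
toy_query = "MSGDDKL"
toy_read = "ATGTCTGGTGATGATAAACTG"
print(shared_kmers(toy_read, toy_query))
```

A real search at this scale relies on dedicated alignment tools and far more sensitive scoring. The sketch only illustrates why a protein-level comparison can still catch a viral gene whose DNA sits on either strand and in any reading frame.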
By compiling all of this, they were able to follow the trail to more than 100,000 new viruses, including nine from the well-known Coronaviridae family. Note that this is a large family of viruses and that, at present, nothing suggests that these new members pose a public health risk.
A virological “surveillance network”
These fragments are only partial information; they do not map a full genome the way proper sequencing would. But each of them is a piece of a large evolutionary puzzle, and researchers can use them to reconstruct genuine viral family trees.
This lineage is very important when faced with a new microorganism. Indeed, if a relationship with an already documented virus can be identified, it is much easier to understand how the newcomer works and, if necessary, to take appropriate measures within an acceptable time frame. “We have transformed this database into a real virological surveillance network,” explains Babaian.
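As a simplified illustration of linking a new fragment to a documented virus, the Python sketch below picks the reference sequence with the highest fraction of identical positions. It assumes the fragments are already aligned to the same length, and all names and sequences are invented for the example. Real phylogenetic work uses proper alignment and tree-building methods.

```python
def identity(a, b):
    """Fraction of matching positions between two aligned fragments."""
    if len(a) != len(b) or not a:
        raise ValueError("fragments must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closest_relative(fragment, references):
    """Return (name, identity) of the most similar reference sequence."""
    scored = ((name, identity(fragment, seq)) for name, seq in references.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical, made-up aligned protein fragments (assumption).
references = {
    "documented_virus_A": "MSGDDKLV",
    "documented_virus_B": "MTAEEKLV",
}
new_fragment = "MSGDDRLV"

name, score = closest_relative(new_fragment, references)
print(name, score)
```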
This is precisely the kind of groundwork that was lacking before the Covid-19 crisis. With a more proactive approach, it might have been possible to prevent the situation from turning into a disaster after the first cases appeared. Hopefully researchers will continue to carry out similar studies; they are our best hope of responding to future viral threats in time. One day, this type of work may allow us to fight an even more dangerous microorganism more effectively, without being caught off guard.
The text of the study is available here.