SARS-CoV-2 Sequencing Data: The Devil Is in the Genomic Detail

Oct. 28, 2020

The study of SARS-CoV-2 whole genome sequencing (WGS) data has led to many important findings about this pathogen, and we will need more sequence data from samples from all over the world in order to come up with effective approaches to control and prevent COVID-19 infections. Scientists from across the globe are collaborating to generate and share this invaluable information and apply it to disease diagnosis and control efforts.

Genome Characteristics of Betacoronaviruses and SARS-CoV-2

Betacoronaviruses are a group of enveloped, positive-sense, single-stranded RNA viruses in the subfamily Coronavirinae in the family Coronaviridae. The genomes of these viruses, ranging from 27 to 32 kb in size, are the largest among RNA viruses. Each genome encodes polyproteins that undergo proteolysis to become nonstructural proteins of various functions, such as viral proteases (3CL, PL) and RNA-dependent RNA polymerase (RdRP), all of which are integral to transcription and replication. These genomes also encode several structural proteins, including spike protein (S), membrane protein (M), envelope small membrane protein (E) and nucleocapsid protein (N). A seminal paper published in Cell earlier in 2020 reviews the architecture of the SARS-CoV-2 transcriptome and the mechanism of viral gene expression.

The genome of SARS-CoV-2 showing transcription sites and protein coding domains.
Source: ViralZone

Prior to the identification of SARS-CoV-2, betacoronaviruses found among humans included endemic human coronaviruses causing respiratory tract infections (such as OC43 and HKU1) and epidemic human coronaviruses. The latter are believed to have crossed over from animals to humans, and include MERS-CoV, which causes Middle East Respiratory Syndrome (MERS), and SARS-CoV, which causes Severe Acute Respiratory Syndrome (SARS). In January 2020, when an RNA virus was identified as the etiologic agent of the disease soon to be named COVID-19, scientists immediately sequenced its genome. The virus had 79.0% sequence identity to SARS-CoV and an even higher sequence identity of 86.7% to 89% with SARS-like coronaviruses originating in bats, but only 50% sequence identity with MERS-CoV. The International Committee on Taxonomy of Viruses (ICTV) named the new virus SARS-CoV-2. Although evidence suggests that bats are likely a reservoir for the virus, their ecological separation from humans indicates that other mammalian species may have acted as "intermediate" or "amplifying" hosts.
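To make the idea of "sequence identity" concrete, here is a minimal Python sketch of one common way to compute it from two sequences that have already been aligned. The function name, the gap-handling rule (columns with a gap in either sequence are skipped) and the toy sequences are assumptions for illustration only; published identity figures depend on the alignment method and on how gaps are counted.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip alignment gaps (an assumption; other conventions exist)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Toy aligned fragments, made up for illustration (not real viral sequence).
query = "ATGGCTAGCT-ACGT"
subject = "ATGGCAAGCTTACGA"
print(f"{percent_identity(query, subject):.1f}% identity")
```

On whole genomes the same calculation would be applied to an alignment of roughly 30,000 columns produced by a dedicated alignment tool.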

One of the most remarkable features of coronaviruses is their intrinsic proofreading mechanism. The replication process for RNA viruses generally has high error rates and results in quasispecies: populations of viruses that reside in the same host but carry different genomic mutations acquired through replication errors. However, coronaviruses encode a protein called nonstructural protein 14 (nsp14) that possesses proofreading activity. This proofreading mechanism is believed to be important to coronaviruses due to their large and complex genomes. Without it, the high mutation rates typically associated with RNA virus replication would have a detrimental effect on the fitness of coronaviruses. Although the mutation rate of coronaviruses (including SARS-CoV-2) is approximately 10-fold lower than those of other RNA viruses, these viruses still acquire some mutations as they spread from host to host. For SARS-CoV-2, epidemiologists estimate a mutation rate of 33 genomic mutations per year. Scientists use the presence of these mutations in SARS-CoV-2 genomes to assign a lineage or clade to each strain. One of the schemes, recently published in Nature Microbiology by scientists in the U.K., assigns each virus strain to one of two lineages (A or B), followed by numerical values based on phylogenetic evidence of emergence from an ancestral lineage into another geographically distinct population.
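The estimate of 33 genomic mutations per year can be put on a per-site and per-month scale with simple arithmetic. The Python sketch below does this under one stated assumption that is not in the text above: a genome length of about 29,903 nucleotides, the approximate size of a SARS-CoV-2 reference genome. The function names are hypothetical.

```python
MUTATIONS_PER_YEAR = 33      # estimate quoted in the text
GENOME_LENGTH_NT = 29_903    # assumption: approximate SARS-CoV-2 reference genome length


def per_site_rate(mutations_per_year: float, genome_length: int) -> float:
    """Substitutions per site per year implied by a whole-genome rate."""
    return mutations_per_year / genome_length


def expected_mutations(mutations_per_year: float, days: float) -> float:
    """Mutations expected to accumulate along a lineage over a number of days."""
    return mutations_per_year * days / 365.0


rate = per_site_rate(MUTATIONS_PER_YEAR, GENOME_LENGTH_NT)
print(f"{rate:.2e} substitutions per site per year")
print(f"{expected_mutations(MUTATIONS_PER_YEAR, 30):.1f} mutations expected over 30 days")
```

Under these assumptions a lineage picks up only two to three changes per month, which is why even a single base substitution can be informative when tracing transmission.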

Global Collaborations to Collect and Analyze SARS-CoV-2 Sequence Data

After the first SARS-CoV-2 genome was published, scientists all over the world soon recognized the urgent need to obtain as much genetic information from as many SARS-CoV-2 strains as possible. At the beginning of the pandemic, many research groups developed their own protocols to obtain SARS-CoV-2 sequencing data from cultures or clinical specimens that tested positive for the virus, and multiple approaches have been implemented. In an effort to standardize sequencing procedures, an international workgroup called Advancing Real-Time Infection Control Network (ARTIC), consisting of scientists from the U.K., Belgium and the U.S., devised a method of SARS-CoV-2 WGS on the Oxford Nanopore Technologies sequencing platforms. The protocol has since been adapted for other sequencing platforms, allowing more laboratories to study the genome of the virus. The Office of Advanced Molecular Detection (AMD) at the Centers for Disease Control and Prevention (CDC) maintains a GitHub page containing a comprehensive list of protocols, tools and resources for SARS-CoV-2 WGS on various platforms, including Illumina, PacBio and Ion Torrent.

During a pandemic, it is imperative that sequence data of the pathogen of interest are shared in publicly accessible repositories. The World Health Organization (WHO) strongly supports public access to sequence data to inform public health and research decision-making during outbreaks. One of the largest curated international repositories of SARS-CoV-2 sequence data is hosted by GISAID (Global Initiative on Sharing All Influenza Data). As of September 2020, almost 100,000 full SARS-CoV-2 genomic sequences, along with key contextual information (metadata) associated with each sequence, have been uploaded and shared on the GISAID SARS-CoV-2 Genomic Epidemiology (EpiCoV) platform. The open-source bioinformatics tools Nextstrain and Nextclade use GISAID data, allowing users to create highly customizable visualizations.

Map of possible global transmission pattern of SARS-CoV-2 clades from a subset of GISAID whole genome sequencing data.
Source: Nextstrain

In the U.S., the National Center for Biotechnology Information (NCBI) continues to lead efforts making SARS-CoV-2 sequence data sharable and accessible. The NCBI SARS-CoV-2 Resources Page allows researchers to submit assembled or raw SARS-CoV-2 sequence data directly to GenBank or Sequence Read Archive (SRA) Databases. Additionally, the CDC’s AMD program established a new initiative called the SARS-CoV-2 Sequencing for Public Health Emergency Response, Epidemiology and Surveillance (SPHERES). This national consortium provides a platform for public health agencies and other stakeholders to discuss updates in methods and procedures, challenges and other issues related to genomic epidemiology. Certain states in the U.S. also have their own collaborative network of public health laboratories to share sequence data and track SARS-CoV-2 strains that are spreading within their states.

WGS Informs Mitigation Strategies Through Genomic Epidemiology

Genomic epidemiology is defined as the use of genomic sequence data to understand infectious disease transmission and population dynamics. Whole genome sequencing provides the highest-resolution data and enables the efficient relatedness analysis that is instrumental to outbreak investigations at all levels, ranging from spread within a community to intercontinental spread during a pandemic. A study published in the New England Journal of Medicine used SARS-CoV-2 genomic epidemiology to guide the investigation of an outbreak within a skilled nursing facility. Investigators were able to determine relatedness between different SARS-CoV-2 virus sequences from various patients, concluding that pre-symptomatic patients most likely contributed to transmission.
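A simple way to picture relatedness analysis is to count the nucleotide differences between every pair of genomes in an outbreak. The Python sketch below does this for a handful of made-up, pre-aligned toy sequences. The sample names, sequences and function names are hypothetical, and this is not the method of the study cited above; real investigations work from full-length alignments and usually build phylogenetic trees rather than relying on raw counts alone.

```python
from itertools import combinations

UNAMBIGUOUS = set("ACGT")


def snp_distance(a: str, b: str) -> int:
    """Count aligned positions where both genomes have an unambiguous base and differ."""
    if len(a) != len(b):
        raise ValueError("genomes must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in UNAMBIGUOUS and y in UNAMBIGUOUS and x != y
    )


def pairwise_distances(genomes: dict) -> dict:
    """Return {(name_i, name_j): distance} for every pair of samples."""
    return {
        (i, j): snp_distance(genomes[i], genomes[j])
        for i, j in combinations(sorted(genomes), 2)
    }


# Toy aligned fragments, made up for illustration (not real viral sequence).
genomes = {
    "resident_1": "ACGTACGTAC",
    "resident_2": "ACGTACGTAC",
    "staff_1": "ACGTACGTAT",
    "community": "ATGTACCTAT",
}

for pair, dist in pairwise_distances(genomes).items():
    print(pair, dist)
```

Samples separated by zero or one difference are candidates for a shared transmission chain, while larger distances point toward separate introductions; the threshold used in practice depends on the pathogen's mutation rate and the time between samples.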

Additionally, genomic epidemiology is one of the cornerstones of large-scale transmission dynamic studies that help answer questions about how a pathogen enters and moves around in a certain geographical area, such as between countries or continents. For example, WGS data in GISAID provide a real-time overview of the distribution of different clades of SARS-CoV-2 among geographical regions, shedding light on the diversity of strains and possible intra- and inter-continental transmissions. A genomic epidemiology study published in Science examined SARS-CoV-2 cases in Northern California from late January to mid-March 2020. WGS and phylogenetic analyses demonstrated the cryptic introduction of at least 7 different SARS-CoV-2 lineages into California, including epidemic WA1 strains associated with Washington state. Lineages associated with outbreak clusters in 2 counties were defined by a single base substitution in the viral genome.

Findings from these large-scale genomic epidemiology studies provide health authorities and other stakeholders with valuable insight that could inform policy decisions about mitigation strategies for targeted interventions. A recent study published in Nature Medicine described the use of near real-time SARS-CoV-2 sequencing analysis, in conjunction with epidemiological data, to assess the dynamics of SARS-CoV-2 transmission in the Netherlands, leading to the decision to implement stricter national measures to limit the spread of COVID-19.

Tweaking COVID-19 Diagnostics and Developing Therapeutics With WGS Data

SARS-CoV-2 sequence data allow scientists to develop new targets for molecular assays and track the trends of mutations that may lead to reduced sensitivity of existing assays. For example, GISAID routinely performs common diagnostic primer checks against high-quality genomes in the collection to monitor trends of mutations that may affect clinical diagnostics testing.
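The idea behind a primer check can be sketched in a few lines of Python: slide a primer along each genome, find the best ungapped match and flag genomes whose best match still contains mismatches. This is an illustrative sketch only and not GISAID's actual procedure. The primer, the sample sequences and all function names are hypothetical, and a real check would also handle reverse-strand primers, probes and degenerate bases.

```python
def count_mismatches(primer: str, target: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for p, t in zip(primer, target) if p != t)


def best_primer_site(primer: str, genome: str) -> tuple:
    """Return (start, mismatches) of the best ungapped forward-strand match."""
    primer, genome = primer.upper(), genome.upper()
    if len(primer) > len(genome):
        raise ValueError("primer is longer than the genome")
    n = len(primer)
    start = min(
        range(len(genome) - n + 1),
        key=lambda s: count_mismatches(primer, genome[s:s + n]),
    )
    return start, count_mismatches(primer, genome[start:start + n])


def flag_genomes(primer: str, genomes: dict, max_mismatches: int = 0) -> dict:
    """Return {sample: (start, mismatches)} for genomes exceeding the allowed mismatches."""
    flagged = {}
    for name, seq in genomes.items():
        start, mismatches = best_primer_site(primer, seq)
        if mismatches > max_mismatches:
            flagged[name] = (start, mismatches)
    return flagged


# Hypothetical primer and toy genome fragments, made up for illustration.
primer = "GACCCCAAAATC"
samples = {
    "sample_A": "TTTGACCCCAAAATCAGC",
    "sample_B": "TTTGACCCTAAAATCAGC",  # one substitution inside the primer site
}
print(flag_genomes(primer, samples))
```

Tracking how often a given primer is flagged across newly deposited genomes gives an early signal that an assay may be losing sensitivity.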

Additionally, the availability of SARS-CoV-2 sequence data allows researchers to identify potential therapeutic targets and provides a basis for epitope mapping and modeling along with the prediction of immune response to the virus, all of which could help guide therapeutics and vaccine development. From the beginning of the pandemic, scientists have been using WGS data to perform epitope mapping and structural modeling. A very recent report in Proceedings of the National Academy of Sciences (PNAS) analyzed 18,514 SARS-CoV-2 sequences sampled since December 2019. The authors noted that the rare mutations across the genomes were likely due to neutral evolution and not adaptive selection. The authors hypothesized that due to the limited genomic diversity seen in SARS-CoV-2, one vaccine may be able to provide universal protection against most, if not all, SARS-CoV-2 strains.

WGS Uncovers Fundamental Virus Biology

In order to understand the pathogenesis and immune response in COVID-19, scientists may analyze SARS-CoV-2 genome sequence data in an effort to explain the mechanism behind their observations. For example, a recent study published in Cell demonstrated that patients infected with SARS-CoV-2 strains containing the D614G mutation in the S gene (encoding the spike protein) had higher upper respiratory tract viral loads. An experiment using pseudotyped virions suggested that the D614G variant possessed a fitness advantage compared to the parent strain, and may be associated with increased infectivity. Another study reported that this mutation resulted in a conformational shift of the S protein toward a state that allowed for more efficient binding with the ACE2 receptor. These findings support the observation that SARS-CoV-2 strains harboring the D614G mutation are becoming more prevalent globally. Characterization of the D614G variant allows scientists to better understand viral entry mechanisms, which could lead to development of therapeutic agents or vaccines that effectively block SARS-CoV-2 infection. Without a large, comprehensive database of SARS-CoV-2 sequences, researchers would not have been able to make this seminal discovery.
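The name D614G follows a standard convention: the reference amino acid (D, aspartate), the residue position in the protein (614) and the amino acid found in the variant (G, glycine). The Python sketch below shows how such a label can be derived by translating a reference and a sample coding sequence with the standard genetic code and comparing them. The three-codon fragment and its assumed starting residue (613) are chosen for illustration and are not taken from a curated reference; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a DNA coding sequence; unknown codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def amino_acid_changes(ref_cds: str, sample_cds: str, first_residue: int = 1) -> list:
    """List substitutions such as 'D614G' between two aligned coding sequences."""
    ref_aa, sample_aa = translate(ref_cds), translate(sample_cds)
    return [
        f"{r}{first_residue + i}{s}"
        for i, (r, s) in enumerate(zip(ref_aa, sample_aa))
        if r != s
    ]


# Illustrative fragment assumed to span spike residues 613-615 (Q, D, V).
reference_fragment = "CAGGATGTT"
sample_fragment = "CAGGGTGTT"  # single A-to-G change in the middle codon
print(amino_acid_changes(reference_fragment, sample_fragment, first_residue=613))
```

A single nucleotide change (GAT to GGT in this toy fragment) is enough to swap aspartate for glycine, which illustrates how one base substitution in the S gene can alter the spike protein.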

SARS-CoV-2 genome sequencing may also help in the diagnosis of COVID-19 when the nature of an infection is unique or complicated. This is well demonstrated by the use of WGS to determine whether individuals who got sick twice within a short period of time had relapsed or been reinfected with a different strain of SARS-CoV-2. Evidence suggesting that reinfection is possible has been reported in several locations, including Hong Kong, Nevada (USA), India and the Netherlands.

In summary, WGS data for SARS-CoV-2 are invaluable to our disease control and prevention efforts. International communities of researchers, laboratories, commercial entities and public health agencies must continue contributing these data in order to maintain extensive and robust sequence databases. As we learn more about the biology and transmission dynamics of the virus, we may be able to come up with sustainable and effective strategies to put a stop to this pandemic.

The statements and opinions expressed in this article are those of the author and do not necessarily reflect those of Los Angeles County Department of Public Health and Public Health Laboratories nor of the American Society for Microbiology.

Author: Peera Hemarajata, M.D., Ph.D., D(ABMM)

Peera Hemarajata, M.D., Ph.D., D(ABMM) is an Assistant Director at the Los Angeles County Public Health Laboratories.