Analysis of Genomic Characteristics and Transmission Routes of Patients With Confirmed SARS-CoV-2 in Southern California During the Early Stage of the US COVID-19 Pandemic

Key Points
Question: During the early phase of the outbreak, what were the transmission routes and genomic characteristics of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spread in Los Angeles, California?
Findings: This case series of 192 patients found that 82% of SARS-CoV-2 isolates from Los Angeles were most similar to isolates originating in Europe, whereas 15% were most similar to those from Asia. Using the variation signature of the viral genomes, 2 main clusters were identified, with the top variants sharing genomic features with European SARS-CoV-2 isolates, and several subclusters of SARS-CoV-2 outbreaks represented trackable community spread in Los Angeles.
Meaning: These findings suggest that SARS-CoV-2 genomes in Los Angeles were predominantly related to isolates originating from Europe, a distribution similar to the viral strain distribution in New York, New York; a smaller subgroup of SARS-CoV-2 genomes shared similarities with those originating from Asia, indicating multiple sources of viral introduction within the Los Angeles community.


Introduction
The emergence of the coronavirus disease 2019 (COVID-19) global pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) 1 presents the scientific community with an urgent need to understand all aspects of this novel virus. The SARS-CoV-2 genome sequences deposited in public databases 2,3 are pivotal resources in understanding its virulence and for guiding approaches to therapeutics and vaccines. 4 Assessing core genomic features across all global populations can be used for comparative analysis to identify features unique to SARS-CoV-2 as well as assist in epidemiologic and public health endeavors. 2,5-15 SARS-CoV-2 is a coronavirus with a 29 903-base pair (bp) single-stranded RNA genome 16 containing 14 open reading frames and 27 estimated proteins. 17 Viral genome annotation can assess the conserved wild-type sequence across all patients with COVID-19. Genomic epidemiology has emerged as a useful tool to track sources of transmission and SARS-CoV-2 evolution within communities and throughout the world. 9,10,13,18 The consortium Global Initiative on Sharing All Influenza Data (GISAID) 2,3 classifies the global distribution of SARS-CoV-2 into 2 main clades differing in their origins: (1) clade 19A, originating from China, and (2) clade 20A, originating from Europe.
Clade 20B was seeded by a strain from China, but once in Europe, its variation profile became the predominant strain of the European pandemic. 19 The first patient with confirmed COVID-19 in the US presented on January 19, 2020, in Washington state. 20 While Seattle recorded the first observed transmission of SARS-CoV-2 from China, the largest SARS-CoV-2 outbreak in the US to date was in New York, New York. 9,12 New York isolates were seeded by multiple introductions from Europe. 9 A study by Deng et al 13
Los Angeles, California, is the largest city on the US West Coast and had its first patient with confirmed COVID-19 in late January 2020. 21 Accordingly, it was one of the first major US cities to take precautionary measures and restrict the population to their homes as fatalities increased in early March 2020. 22 As of August 10, 2020, more than 200 000 confirmed SARS-CoV-2-positive cases and 4996 COVID-19-related deaths 3 have been recorded in Los Angeles County. Cedars-Sinai Medical Center (CSMC), located in Los Angeles, serves more than 1 million people and is the largest health service center west of the Mississippi River. A reverse transcription-polymerase chain reaction (RT-PCR) diagnostic test for SARS-CoV-2 infection was adopted March 21, 2020, allowing our clinical laboratory to rapidly screen and identify patients with SARS-CoV-2 infection. After transmission from China, our timeline for SARS-CoV-2 infection follows other reported introductions into different global populations. 5,11,14,15,23-26 At the time of our study, the only Los Angeles SARS-CoV-2 genome deposited in GISAID was not linked to a particular model of introduction. 3 Based on these cumulative findings, we hypothesized that the local Los Angeles community was likely exposed to a US West Coast SARS-CoV-2 strain, which was directly transmitted from China.
In an effort to further understand this evolving virus, we sought to perform next-generation sequencing (NGS) analysis on patients with confirmed SARS-CoV-2 infection. We conducted phylogenetic analyses on this unique West Coast population to identify local community spread within the greater Los Angeles area. A broad geographic distribution comparison of SARS-CoV-2 isolates in Southern California from early in the COVID-19 US outbreak with isolates in New York, Washington state, and China was conducted to ascertain transmission pathways of SARS-CoV-2 dissemination into Los Angeles. In this case series, we report potential sources of SARS-CoV-2 introduction into the Los Angeles community.

Sample Collection
Appropriate regulatory review was completed by the CSMC Office of Research Compliance and Quality Improvement. A waiver of informed consent was granted per institutional policy because the study did not require interaction or intervention with participants, posed no more than minimal risk to privacy of individuals, did not impact patients' clinical care, could not be practically conducted without access to protected health information, and a requirement to obtain consent would render the research impracticable, as some patients were no longer receiving care at time of the study.

Sample Preparation
Total nucleic acid was extracted using the QIAamp Viral RNA Mini Kit on the QIAcube Connect   27 All samples with greater than 50% of the SARS-CoV-2 genome covered with more than 10× depth were included in the study, which totaled 133 isolates.
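The inclusion rule stated above (more than 50% of the genome covered at more than 10× depth) can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. The function names and the assumption that per-position depths are already available as a list of integers (one entry per reference position) are hypothetical, and both thresholds are treated as strict inequalities.

```python
from typing import Sequence


def covered_fraction(depths: Sequence[int], min_depth: int = 10) -> float:
    """Return the fraction of positions whose depth is strictly above min_depth.

    `depths` is assumed to hold one read-depth value per reference position.
    """
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d > min_depth)
    return covered / len(depths)


def passes_inclusion(
    depths: Sequence[int],
    min_depth: int = 10,
    min_fraction: float = 0.5,
) -> bool:
    """Apply the stated rule: more than 50% of the genome at more than 10x depth."""
    return covered_fraction(depths, min_depth) > min_fraction
```

With toy depths, a sample in which 6 of 10 positions exceed 10× would be retained, whereas a sample with exactly half of its positions above the threshold would not.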

Targeted NGS and Phylogenetic Analyses
These genomes passed quality control assessment by Nextclade 28 and were retained for downstream phylogenetic analysis. Duplicated reads were labeled with Picard, 29 and BCFtools 30 was used to generate consensus sequences. Data used in this study have been deposited to GISAID (eTable 2 in the Supplement). The mapping ratio was calculated by Samtools, 31

JAMA Network Open | Infectious Diseases
Genomic Characteristics and Transmission Routes of SARS-CoV-2 in Southern California within Nextstrain global clades. 3 As of September 2020, global SARS-CoV-2 clades were designated into clades 19A and 19B of Asian origin and clades 20A, 20B, and 20C of European origin. P values were 2-sided, and statistical significance was set at .05.

Sequenced SARS-CoV-2 Specimens From CSMC
We sequenced 192 specimens with RT-PCR results positive for SARS-CoV-2 using the Illumina targeted respiratory virus panel. These specimens were collected among 192 patients (median Overall, low mapping ratios with less than 50% genome coverage correlated with samples with increased Ct value (>30 cycles) in the RT-PCR diagnostic test.

Analyses of Coinfection of Other Respiratory Pathogens and SARS-CoV-2
Sequencing reads from across the sample cohort were mapped to all 41 respiratory viral pathogens (eTable 1 in the Supplement). Despite finding fragmentary reads from other viruses, no samples had non-SARS-CoV-2 viral genomes with mapped ratios greater than 5% of total mapped reads in samples with total mapping. Accordingly, there was no evidence of coinfection with other respiratory viral pathogens in our SARS-CoV-2-positive sample population.
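As a minimal sketch of the 5% criterion described above, the following Python function flags any non-SARS-CoV-2 pathogen whose share of total mapped reads exceeds the threshold. The function name, the dictionary input of per-pathogen read counts, and the pathogen labels in the example are assumptions for illustration, not part of the study's workflow.

```python
from typing import Dict, List


def coinfection_candidates(
    mapped_reads: Dict[str, int],
    target: str = "SARS-CoV-2",
    threshold: float = 0.05,
) -> List[str]:
    """List non-target pathogens whose mapped-read share exceeds the threshold.

    `mapped_reads` maps a pathogen name to its number of mapped reads
    (a hypothetical input format).
    """
    total = sum(mapped_reads.values())
    if total == 0:
        return []
    return sorted(
        name
        for name, count in mapped_reads.items()
        if name != target and count / total > threshold
    )
```

For example, a sample with 9500 SARS-CoV-2 reads, 300 influenza A reads, and 200 respiratory syncytial virus reads would return an empty list, because neither secondary pathogen exceeds 5% of the 10 000 mapped reads.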

Variant Landscape
Whole-genome comparison of the CSMC samples revealed more than 99.8% identity with the SARS-CoV-2 reference genome. Variation analyses of these isolates revealed a total of 518 variation sites detected across the length of the SARS-CoV-2 genome (Figure 1). A total of 436 variants (84.3%) were private variations and 5 variants (0.1%) were found in more than 50% of all samples (Table). In total, 82 sites had variants in more than 2 isolates containing a mean (SD) of 5.
Figure 1 legend: Orange dots indicate the top 20 altered sites; blue dots, the rest of the variations detected.
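The tallies reported above (total variation sites, private variations, and variants present in more than half of the samples) can be sketched in Python as follows. This is an illustration only: the input format (a mapping from sample identifier to a set of variant labels) is hypothetical, and a private variation is assumed here to mean a variant observed in exactly 1 isolate.

```python
from collections import Counter
from typing import Dict, List, Set


def summarize_variants(variants_by_sample: Dict[str, Set[str]]) -> Dict[str, object]:
    """Count variation sites and classify them by how many samples carry them."""
    counts: Counter = Counter()
    for variants in variants_by_sample.values():
        counts.update(variants)  # each sample contributes at most 1 per variant

    n_samples = len(variants_by_sample)
    # Assumption: "private" means observed in exactly one isolate.
    private: List[str] = sorted(v for v, c in counts.items() if c == 1)
    # Variants carried by strictly more than 50% of samples.
    majority: List[str] = sorted(
        v for v, c in counts.items() if n_samples and c / n_samples > 0.5
    )
    return {"total_sites": len(counts), "private": private, "majority": majority}
```

On a toy cohort of 4 samples in which C241T appears in 3 samples and a hypothetical C100T appears in only 1, the summary would list C241T as a majority variant and C100T as private.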

From our most-observed variation sites, 4 variants have been previously reported, including in the 5′-UTR(C241T), along with C3037T, C14408T, and A23403G. 37 We found 125 samples (65.1%) with all 4 variants present in the genome. While C3037T causes a synonymous variation in nsp3(F105F), C14408T and A23403G resulted in amino acid changes in RNA primase (ie, nsp12, P323L). The China and Northern California variation 10,13 in the S protein (D614G) was observed in this Los Angeles cohort. Variations at G25563T(ORF3a) and C1059T(nsp2) have been reported to be coexpressed. 37 The Washington state and China variants, 38 C8782T(nsp4) and T28144C(ORF8), were also frequently altered in the Los Angeles isolates.
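Substitutions in this article are written in 2 notations: a compact form such as C241T here and an arrow form such as 241C>T in the Conclusions. The following Python sketch, with hypothetical helper names, parses either form into reference base, position, and alternate base so that labels can be compared consistently. It handles single-nucleotide substitutions only.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    ref: str  # reference base
    pos: int  # 1-based genome position
    alt: str  # alternate base


_COMPACT = re.compile(r"^([ACGT])(\d+)([ACGT])$")  # e.g. C241T
_ARROW = re.compile(r"^(\d+)([ACGT])>([ACGT])$")   # e.g. 241C>T


def parse_substitution(label: str) -> Substitution:
    """Parse a single-nucleotide substitution written in either notation."""
    text = label.strip().upper()
    match = _COMPACT.match(text)
    if match:
        return Substitution(match.group(1), int(match.group(2)), match.group(3))
    match = _ARROW.match(text)
    if match:
        return Substitution(match.group(2), int(match.group(1)), match.group(3))
    raise ValueError(f"Unrecognized substitution label: {label!r}")


def to_compact(label: str) -> str:
    """Normalize a label to the compact form (reference, position, alternate)."""
    sub = parse_substitution(label)
    return f"{sub.ref}{sub.pos}{sub.alt}"
```

Normalizing labels before comparison avoids counting 14408C>T and C14408T as different variants when combining tables from different sources.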

Phylogenetic Analysis
We performed phylogenetic analysis of 133 samples with more than 50% of the genome covered and more than 10× genome depth to identify which SARS-CoV-2 isolates were most similar (Figure 2).
From the top 6 variation sites along the phylogenetic tree (Figure 3), we observed a minimum of 2 groups containing distinct variant signatures. Within these groups, the bottom subclade of the tree contained all 6 variants. A subset of 4 variants that tracked together, as previously described, 37 were in 2 main clusters (Figure 3A, C, D, and E). While these variants tightly segregated into 2 main clusters of the tree, they did not track with sample collection date (eFigure 2 in the Supplement). The genomic diversity in our population was present from the earliest samples collected and remained throughout the study time frame.

Phylogenetic Tree Traces of Community Transmission in the Early Stage of the COVID-19 Pandemic
A phylogenetic tree of all Los Angeles isolates was constructed to track SARS-CoV-2 genome differences. A cluster was defined as a group of patients with SARS-CoV-2 strains that originated from the same branching point in the tree. From our local phylogenetic tree analysis, 13 patients, representing more than 10% of our sample population, were identified in 1 cluster (Figure 2). Analysis of the patients' demographic data revealed that they all lived in the same or adjacent postal codes, within a 3.81 km² radius of each other, and were all members of the same religious denomination. The variant exclusively shared among these patients' viral genomes was C18877T within the nonstructural protein nsp14 (eFigure 3 in the Supplement). A community transmission event with known close contact was observed within a tightly associated cluster containing 5 patients, in which all 5 viral genomes shared 3 variants: T13575C, T16506C, and C25466T. Additionally, we observed a cluster of 10 isolates in which 5 patients were known residents of the same skilled nursing facility (SNF) and another patient was a resident of a nearby (ie, within 1 block) SNF. Three additional isolates from this cluster belonged to health care workers with likely contact with patients from the same SNF. The last patient in this cluster was related to one of the patients in the SNF. We did not observe other clear connections within samples outside of these 3 clusters.
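Clusters in this study were defined by shared branching points in the phylogenetic tree. As a simplified illustration only, and not the method used here, the following Python sketch lists the samples whose genomes carry every variant in a given signature, such as the 3 variants shared by the 5-patient cluster. The function name, sample identifiers, and input format are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def carriers(
    variants_by_sample: Dict[str, Set[str]],
    signature: Iterable[str],
) -> List[str]:
    """Return sorted sample identifiers that carry all variants in `signature`."""
    required = set(signature)
    return sorted(
        sample
        for sample, variants in variants_by_sample.items()
        if required <= variants  # subset test: every required variant is present
    )
```

A shared signature can suggest, but does not establish, a common source; epidemiologic information such as residence or known contact, as described above, is still needed to interpret such a group.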

Joint Phylogenetic Analysis
To  with genomes from New York, Washington state, and China found that they shared similarities to all subclades derived from these regional locations (eFigure 4 in the Supplement).

Discussion
To our knowledge, this case series is the first comprehensive study of a COVID-19 sample population from Los Angeles, one of the major outbreak centers in the US. A caveat to our sample collection is that emergency departments are less frequented by younger patients and biased to patients 18 years and older. Thus, the mean age of CSMC patients was approximately 60 years, which is consistent with older adults being more susceptible to COVID-19. 5,21,24 Patients with higher viral loads detected by RT-PCR also correlated with a higher percentage of SARS-CoV-2 genome coverage by sequencing.
From a technical perspective, 48 patients with lower sequencing coverage (less than 50% of the total cohort) were diagnostically confirmed to have SARS-CoV-2 infection by RT-PCR testing at more than 30 cycles. 39 Thus, when using NGS approaches for diagnostic purposes, a potential caveat is that genome sequencing favors patients with higher viral titers and may not capture those who have low viral copy numbers.
Analysis of 40 other respiratory viruses did not reveal coinfection with SARS-CoV-2 in our cohort, which is consistent with other studies, indicating that rates of coinfection are low in patients with SARS-CoV-2 infection. 40 However, we could not rule out the possibility of coinfection or superinfection for viruses with low copy numbers but the high viral load of SARS-CoV-2 made it
The local phylogenetic tree found 2 large clusters, which were mainly defined by 6 high-frequency variations. Phylogenetic analysis of these samples by collection date reveals that the main variants that defined these 2 large clusters were observed throughout March and April; therefore, they were present in the community prior to our collection date,
Previous studies highlight religious communities being at particularly acute risk in a pandemic owing to large communal events, such as services, weddings, and funerals. 19,41 Moving forward, community leaders should be aware of the unique risks posed to their congregations and plan accordingly. The remaining patients lived across many postal codes, providing further evidence of community transmission across the larger metropolitan area.
A third cluster showed widespread transmission within a single SNF. Such facilities have been a hotbed for viral spread worldwide, and it is not surprising to observe this type of clustering.

Limitations
This study has some limitations, including that SARS-CoV-2 genomes were all from patients who were hospitalized for COVID-19 and may be a biased representation of more severe cases. These samples were obtained early during the US pandemic, when testing was limited, and a high proportion of individuals with asymptomatic infection or mild symptoms are absent in this and similar studies. 46,50 These missing SARS-CoV-2 infections will affect the collective assessment of transmission both in the US and globally. When attempting to infer causality, Villabona-Arenas et al 51 provided examples of pitfalls that can occur by performing epidemiological analysis on viral genomes alone, especially when the virus is novel. The possibility remains that multiple seed events in Los Angeles, Europe, and New York occurred simultaneously, thus confounding the ability to draw directionality from the data.
Considering the timing of the COVID-19 spread and the known transmission patterns from Europe to New York, we consider this unlikely. What may be more plausible, and should be considered, is that travelers from Europe seeded New York and Los Angeles simultaneously. Lu et al 18 also highlight how phylogenetic analysis can be misleading, as clusters thought to represent community spread can include multiple introductions from genomically undersampled locations. Their study was biased by the fact that data were collected primarily during the spring festival period surrounding the Chinese New Year, the period of the largest annual human migration in the world. 52 As expected, a significantly larger portion of cases than normal were imported from outside regions. There was no such event in Los Angeles at the time of the early outbreak, and the data in this study were generated several weeks after state-ordered limitations on travel and gatherings had been enacted. Although we have a limited sample number (133 patients), the integration of CSMC SARS-CoV-2 genomes into the Washington state, New York City, and China data sets (eFigure 4 in the Supplement) provided helpful insight into determining the introduction of SARS-CoV-2 into the Los Angeles community.

Conclusions
In this case series, consistent with other studies, the combination of the 4 variants (ie, C241T, C3037T, C14408T, and A23403G) coevolving together has been seen in European isolates in other tracked populations. 9,37 From our variant analysis, 2 of our highly altered sites, G25563T(ORF3a) and C1059T(nsp2), have been reported exclusively in US isolated sequences collected since March 2020, 7 a timeline that corresponds to this study's sample collection date. These variants were found to be closely associated within a cluster containing mainly SARS-CoV-2 genomes from New York, suggesting that these genomes were introduced from a strain that emerged from the US East Coast population. From the variants found in our samples, 4 variants, 5′-UTR (241C>T), 3037C>T, 14408C>T, and 23403A>G, agree with other studies that found that these variations coevolved. 37 Such a high proportion of our patients having all 4 variations indicates the seeding of our population by a strain originating in Europe. This finding is further validated in our local phylogenetic tree, which separates into 2 main clusters, and in our global tree, in which our population most closely resembles SARS-CoV-2 genomes from New York, 9 followed by a smaller percentage resembling genomes from Washington state, together identifying possible routes for the dissemination of SARS-CoV-2 into the Southern California populace. Given that Seattle, Washington, was the first documented US appearance of SARS-CoV-2, the introduction of the virus from Washington state 13,20 is consistent with our phylogenetic tree and the time frame of our data sampling, concordant with our hypothesis. However, despite our earlier estimates, an even larger portion of our sample population had a significant resemblance to genomes from New York, the epicenter of the SARS-CoV-2 outbreak in the US. 9,12,44 The appearance of the majority of our samples within different subclades of New York isolates suggests that SARS-CoV-2 likely spread through multiple introductions from New York. Furthermore, the CSMC population interspersed with Washington state and China isolates suggests multiple dissemination routes from Asia and the US Northern West Coast to Southern California, appearing as a major cluster in our local population. Although we restricted our analyses to these 3 geographical origins, we found high genomic diversity among the CSMC SARS-CoV-2 isolates. The large impact of COVID-19 on the Los Angeles community likely originated from independent disseminations of the virus from multiple routes, with some geographical strains having greater prevalence than others.