Molecular clusters of clonal template DNA are generated onboard the HiSeq 2500 and MiSeq instruments. These instruments then take 40 and 27 hours, respectively, to generate 151 base paired-end reads (ie, each individual DNA fragment is sequenced or read from both ends). Bioinformatics analysis then follows, starting with individual sequence reads. Further details are available in the eSupplement. STEC indicates Shiga-toxigenic Escherichia coli.
a A sequencing library is a collection of DNA fragments from a sample that are ready for sequencing. These fragments have short adapter molecules with known sequence ligated to each end and a sample-specific bar-code sequence used to identify the source of the fragment after sequencing.
b A draft genome is a usable collection of sequences from a genome, which may still contain ambiguities and uncertainties about the order of fragments.
Each point on the scatter plot shows the GC content (x-axis) and total depth of coverage (y-axis, log10-scale) colored by taxon for each environmental gene tag (EGT) in the outbreak metagenome. Numerical values for the EGTs presented in each panel are available in the eSupplement.
The E coli O104:H4 outbreak genome reconstructed from environmental gene tags (EGTs) within the outbreak metagenome is shown. The EGTs have been arranged into a linear pseudochromosome, with a total length of 5.26 million bases. Each point on the chart represents an individual EGT. The total depth of coverage across all samples is shown on the y-axis. Each EGT is color coded to indicate the number of German samples in which it is present. Core regions of the E coli genome, representing sequence shared with nonoutbreak E coli strains, are recognizable by having a greater coverage depth and being present in a greater number of samples. Accessory regions of the genome, corresponding to outbreak-strain-specific genes, are generally present at lower coverage than core regions; one example is an EGT of 4.5 kb encoding an aminoglycoside-resistance gene (pictured top left). The Shiga-toxin-encoding prophage region is clearly visible at around 3.1 megabases with a coverage depth of around 2 times the mean coverage. An EGT of 2 kb from this region encoding the Shiga-toxin type 2 A and B subunits is pictured top middle. The EGTs belonging to plasmids are shown at the far right of the plot. An EGT of 4.9 kb belonging to a plasmid, pAA, encoding part of the aggregative adhesion fimbrial cluster type 1 is pictured top right.
Loman NJ, Constantinidou C, Christner M, et al. A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of shiga-toxigenic Escherichia coli O104:H4. JAMA. doi:10.1001/jama.2013.3231
eTable 1. Clinical diagnosis and information recovered from STEC-positive samples using conventional microbiology
eTable 2. Information recovered from STEC-positive samples using diagnostic metagenomics
eTable 3. Information recovered from other pathogens using diagnostic metagenomics
eTable 4. Outbreak-specific genes in the draft genome of the outbreak strain obtained by metagenomics
eFigure 1. Genome depth of coverage plots for selected samples
eFigure 2. Genome coverage plots focusing on the region of the E coli O104:H4 STEC 280 genome around the Shiga-toxin encoding phage
Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ, Quick J, Weir JC, Quince C, Smith GP, Betley JR, Aepfelbacher M, Pallen MJ. A Culture-Independent Sequence-Based Metagenomics Approach to the Investigation of an Outbreak of Shiga-Toxigenic Escherichia coli O104:H4. JAMA. 2013;309(14):1502–1510. doi:10.1001/jama.2013.3231
Importance Identification of the bacterium responsible for an outbreak can aid in disease management. However, traditional culture-based diagnosis can be difficult, particularly if no specific diagnostic test is available for an outbreak strain.
Objective To explore the potential of metagenomics, which is the direct sequencing of DNA extracted from microbiologically complex samples, as an open-ended clinical discovery platform capable of identifying and characterizing bacterial strains from an outbreak without laboratory culture.
Design, Setting, and Patients In a retrospective investigation, 45 samples were selected from fecal specimens obtained from patients with diarrhea during the 2011 outbreak of Shiga-toxigenic Escherichia coli (STEC) O104:H4 in Germany. Samples were subjected to high-throughput sequencing (August-September 2012), followed by a 3-phase analysis (November 2012-February 2013). In phase 1, a de novo assembly approach was developed to obtain a draft genome of the outbreak strain. In phase 2, the depth of coverage of the outbreak strain genome was determined in each sample. In phase 3, sequences from each sample were compared with sequences from known bacteria to identify pathogens other than the outbreak strain.
Main Outcomes and Measures The recovery of genome sequence data for the purposes of identification and characterization of the outbreak strain and other pathogens from fecal samples.
Results During phase 1, a draft genome of the STEC outbreak strain was obtained. During phase 2, the outbreak strain genome was recovered from 10 samples at greater than 10-fold coverage and from 26 samples at greater than 1-fold coverage. Sequences from the Shiga-toxin genes were detected in 27 of 40 STEC-positive samples (67%). In phase 3, sequences from Clostridium difficile, Campylobacter jejuni, Campylobacter concisus, and Salmonella enterica were recovered.
Conclusions and Relevance These results suggest the potential of metagenomics as a culture-independent approach for the identification of bacterial pathogens during an outbreak of diarrheal disease. Challenges include improving diagnostic sensitivity, speeding up and simplifying workflows, and reducing costs.
The outbreak of Shiga-toxigenic Escherichia coli (STEC), which struck Germany in May-June 2011, illustrated the effects of a bacterial epidemic on a wealthy, modern, industrialized society, with more than 3000 cases and more than 50 deaths.1 During an outbreak, rapid and accurate pathogen identification and characterization is essential for the management of individual cases and of an entire outbreak. Traditionally, clinical bacteriology has relied primarily on laboratory isolation of bacteria in pure culture as a prerequisite to identification and characterization of an outbreak strain. Often, however, in vitro culture proves slow, difficult, or even impossible, and recognition of an outbreak strain can be difficult if it does not belong to a known variety or species for which specific laboratory tests and diagnostic criteria already exist. For example, during the German outbreak, infection was caused by an unusual serotype (STEC O104:H4) that had not previously been seen in the context of epidemic disease and could not be detected easily with the standard microbiological methods in use at the start of the outbreak for diagnosing STEC infection.
The term metagenomics is applied to the open-ended sequencing of nucleic acids recovered directly from samples without target-specific amplification or enrichment.2 A list of terms used in this article appears in the Box. Metagenomics has been used in a clinical diagnostic setting to identify the cause of outbreaks of viral infection.3 Drawing on examples from virology and on recent advances in sequencing technologies,4,5 we sought to extend the scope of metagenomics as a clinical discovery platform, exploiting this approach to identify and characterize an outbreak-associated bacterial strain directly from clinical samples without the need for laboratory culture. We explored the potential of this approach on human fecal samples collected during the German STEC outbreak of 2011, performing high-throughput sequencing on 2 Illumina instruments (MiSeq and HiSeq 2500).
Box. Terms for the Study
Coverage: The number of times a portion of the genome is sequenced in a sequencing reaction; often expressed as “depth of coverage” and numerically as 1X, 2X, 3X, etc.
Environmental gene tags: Short sequences of DNA that contain genes in whole or in part that can be used to identify and characterize the organisms from which they originate.
Metagenomics: Open-ended sequencing of nucleic acids recovered directly from samples without culture or target-specific enrichment or amplification; usually applies to the study of microbial communities.
Read: A discrete segment of sequence information generated by a sequencing instrument; read length refers to the number of nucleotides in the segment.
For a complete list of genomic terms, see the Appendix in this issue.
Stool samples were collected at the University Medical Centre Hamburg-Eppendorf during the STEC outbreak of May-July 2011. High-throughput sequencing was performed in August-October 2012. Bioinformatics analyses were performed in November 2012 and February 2013. None of the samples have been analyzed in any previously published study, although clinical and microbiological data from some of the patients were analyzed in 2 previous studies.1,6,7 This study was approved by the ethics panel of the University Medical Centre Hamburg-Eppendorf. Because all samples were made anonymous and no human DNA sequences were released into the public domain, patient consent was waived by the panel.
On arrival in the laboratory, the samples were homogenized and then divided into aliquots. One aliquot from each sample was subjected to routine diagnostic microbiological processing; the others were stored at −20°C until used in metagenomics analyses.
Culture media and conditions used for conventional pathogen detection complied with the recommendations of the American Society for Microbiology,8 with some minor additions. For detection of STEC during the outbreak, stool samples were spread on sorbitol MacConkey agar (Oxoid) and ESBL agar (Biomérieux) and incubated at 36°C for up to 48 hours. A 10-μL loop of bacteria from the lawn of grown colonies was suspended in 500 μL of TE buffer, treated with heat at 95°C for 10 minutes, and centrifuged for 2 minutes at 10 000 g; 3 μL of the supernatant was subjected to stx polymerase chain reaction (PCR).9 Up to 20 E coli colonies from stx -positive cultures were isolated on Columbia blood agar (Oxoid) and individually tested for the presence of stx genes. The stx -positive strains were further characterized by PCR genotyping to identify O104:H4 outbreak isolates.10 After the outbreak, retrospective analyses were performed on frozen stocks from the stool samples, including quantitative culture, an Stx enzyme-linked immunosorbent assay (ELISA), and an stx PCR. The Ridascreen Verotoxin Enzyme Immunoassay (r-Biopharm AG) was performed on supernatants of overnight enrichment cultures in tryptone soy broth according to the manufacturer's instructions. Quantitative PCR was performed on DNA extracted from samples according to a published protocol.9
Campylobacter spp were detected by selective culturing on Karmali agar (Oxoid) at 42°C under microaerophilic conditions for 48 hours. Species identification of Campylobacter isolates was performed by MALDI-TOF mass spectrometry fingerprinting.11 Salmonella enterica was detected by overnight enrichment in selenite broth at 36°C followed by selective culturing on xylose-lysine-desoxycholate and Salmonella-Shigella agar (Oxoid) at 36°C for 24 hours. Species identification of S enterica isolates was performed by MALDI-TOF mass spectrometry fingerprinting11 and serological detection of group-specific antigens.
Presence of Clostridium difficile toxins A and B in stool samples was detected with the C diff Quik Chek Complete test (Techlab) according to the manufacturer's instructions. C difficile isolates were recovered by selective culturing on CLO agar (Biomérieux) at 36°C under anaerobic conditions for 48 hours, identified by MALDI-TOF mass spectrometry fingerprinting,11 and also tested for toxin production with the C diff Quik Chek Complete test, according to the manufacturer's instructions.
The 300-mg aliquots of each stool sample were mixed with 1.4 mL of ASL buffer (Qiagen) and transferred to a SK38 stool-grinding tube (Precellys). Samples were homogenized for 2 × 30 at 6000 rpm in a Precellys 24-tissue homogenizer, incubated for 10 minutes at 95°C, and then centrifuged for 2 minutes at 12 000 rpm. The DNA was extracted from a 1.2-mL sample of each supernatant using the QIAamp stool kit (Qiagen) according to the manufacturer's instructions. Samples were quantified with a Quant-iT PicoGreen dsDNA Assay Kit (Life Technologies) and the total amount of DNA for each sample varied between 140 ng and 3 μg.
Calculations suggested that 48 samples could be analyzed to the desired depth of coverage on a single HiSeq 2500 in rapid-run mode at Illumina Inc. These samples were prepared for sequencing at the University of Birmingham. Bar-coded DNA fragment libraries were generated with 0.25 ng input of DNA using a Nextera XT (Illumina) sample preparation kit and the 24 indices from the Nextera XT Index Kit following the manufacturer's instructions. The distribution of fragment sizes within libraries was analyzed using a BioAnalyzer (Agilent). Average fragment lengths varied from 430 to 990 base pairs (bp). Two pools were prepared (24 samples in each pool), containing equal volumes of each of the final, single-stranded normalized libraries. Each pool was sequenced on a single MiSeq run (2 × 151 paired-end sequencing). The resulting information on the cluster number and stoichiometric distribution of each sample in the pools was then used to prepare 2 new pools, which together contained DNA from 39 samples in equimolar concentrations (roughly equivalent to the throughput of a single MiSeq run) and DNA from the 5 samples that had yielded pathogens other than STEC in a 10-fold excess concentration. The 2 pools were sequenced using a HiSeq 2500 pilot instrument, with 1 pool per flow cell, and 2 × 151 rapid paired-end sequencing was performed. A density of 800 000 to 1 000 000 clusters per mm² was targeted to achieve a run throughput of 180 GB in 40 hours.
Ten samples were also sequenced on an Illumina MiSeq instrument at the University of Birmingham. A separate Illumina library was prepared from each of the samples. Extracted genomic DNA was fragmented with a BioRuptor instrument (Diagenode) using a 100-μL volume and 30 cycles. The fragments were end-repaired, ligated to adapters from the Illumina Multiplexing Sample Preparation Oligonucleotide kit, and then size-selected (300-600 bp) using the Beckman SPRIworks Fragment Library System I (Beckman Coulter). The size-selected fragments were amplified (18 cycles using Phusion DNA Polymerase) and DNA was purified with Agencourt AMPure XP beads (Beckman Coulter). The average fragment size of the final libraries was 380 to 480 bp, as assessed with a 2100 BioAnalyzer High Sensitivity DNA Kit (Agilent). Libraries were quantified with a Quant-iT PicoGreen dsDNA kit and diluted to 10 pM. Eight of the libraries were sequenced on individual runs on the Illumina MiSeq instrument (300 cycles, 2 × 150 bp on a paired-end protocol); 1 sample (4096) was subjected to 2 MiSeq runs. The instrument took 27 hours to complete each run.
The bioinformatics workflow included 3 phases (Figure 1): the assembly phase, the alignment phase, and the phylogenetic phase (eSupplement).
In phase 1, the assembly phase, we adopted a de novo assembly approach to identify and characterize the genome of the outbreak-specific strain. We initially screened out human DNA sequences and then assembled all the microbial sequence reads into a collection of environmental gene tags (EGTs) (ie, short sequences of DNA that contain genes in whole or in part that can be used to identify and characterize the organisms from which they originate). We analyzed these reads by GC content and by taxonomic affiliation (Figure 2).
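As a rough illustration of the GC-content step described above, the following Python sketch computes the GC fraction of each assembled sequence. The function name, the toy sequences, and the decision to ignore ambiguous bases are assumptions made for this example; the source does not specify how GC content was calculated.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C.

    Ambiguous bases such as N are ignored (an assumption for this sketch).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


# Hypothetical toy EGT sequences, far shorter than real assembled fragments.
egts = {"egt_1": "ATGCGCGC", "egt_2": "ATATATGC"}
gc_by_egt = {name: gc_content(seq) for name, seq in egts.items()}
```

Values computed this way, paired with a taxonomic label per EGT, would supply the x-axis of a scatter plot like the one in Figure 2.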
We aligned reads from individual samples from the outbreak to the outbreak-specific metagenome and discarded any EGTs that were not found in at least 20 samples. We then took sequence reads from a collection of fecal samples from healthy individuals available through the MetaHIT project12 and aligned these against the EGTs in the outbreak metagenome. We subtracted any EGTs that matched MetaHIT reads to enrich for the outbreak-specific reads likely to represent the outbreak strain. This set of outbreak-specific EGTs was used to recruit additional EGTs from the reference assembly in an iterative process, using connections determined by paired-end information from the sequence reads to reconstruct a draft genome of the outbreak strain.
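The two filtering steps in this paragraph (keep EGTs seen in at least 20 outbreak samples, then subtract EGTs matched by reads from healthy individuals) can be sketched as a simple set operation. The data structures and names below are hypothetical, and the sketch assumes that read alignment has already been summarized as presence or absence per sample; the iterative recruitment step that uses paired-end links is not shown.

```python
def outbreak_specific_egts(presence, healthy_matches, min_samples=20):
    """Select candidate outbreak-specific EGTs.

    presence: dict mapping EGT ID -> set of outbreak sample IDs with aligned reads.
    healthy_matches: set of EGT IDs matched by reads from healthy-control samples.
    min_samples: minimum number of outbreak samples an EGT must appear in.
    """
    return {
        egt
        for egt, samples in presence.items()
        if len(samples) >= min_samples and egt not in healthy_matches
    }


# Toy input: "a" is widespread and absent from controls, "b" is too rare,
# and "c" is widespread but also found in healthy controls.
presence = {"a": set(range(25)), "b": set(range(5)), "c": set(range(30))}
selected = outbreak_specific_egts(presence, healthy_matches={"c"})
```

In this toy input only "a" survives both filters, which mirrors the enrichment logic described in the text.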
In phase 2, the alignment phase, we adopted a mapping-against-reference approach, using a completed reference genome from the 2011 outbreak,13 to determine the depth of coverage of the E coli outbreak strain in each sample.
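Mean depth of coverage is commonly estimated as the total number of aligned bases divided by the length of the reference genome. The sketch below applies that definition and bins the result using the greater than 10-fold and greater than 1-fold cut-offs quoted in this article. The function names and the example numbers are illustrative assumptions, not the authors' pipeline.

```python
def mean_depth(aligned_base_counts, genome_length):
    """Mean depth = total aligned bases / reference genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(aligned_base_counts) / genome_length


def coverage_tier(depth):
    """Bin a mean depth using the >10-fold and >1-fold cut-offs from the text."""
    if depth > 10:
        return ">10-fold"
    if depth > 1:
        return ">1-fold"
    return "<=1-fold"


# Illustrative numbers only: 1000 aligned reads of 151 bases each
# against a hypothetical 50 000-base reference.
example_depth = mean_depth([151] * 1000, 50_000)
example_tier = coverage_tier(example_depth)
```

The tiers in this sketch are mutually exclusive; sample counts reported for each threshold may instead be cumulative.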
In phase 3, the phylogenetic phase, we exploited the Metaphlan tool from the Human Microbiome Project14 to identify pathogens other than the outbreak strain from samples taken during the outbreak. This program performs a taxonomic assignment of short sequencing reads, using a database of lineage-specific markers.
Forty-five archived samples were chosen for metagenomic analysis on the basis of the findings from routine microbiology. Forty STEC-positive samples from 34 patients were chosen to represent STEC-positive cases (Table 1 and eTables 1-2) with a range of clinical conditions (diarrhea, hemolytic-uremic syndrome; both early and later after onset) and colony counts retrieved from stools (high numbers, intermediate numbers, extremely low numbers). Four patients were sampled twice and 1 patient was sampled 3 times.
Five samples came from patients who presented with diarrhea, but turned out not to have STEC infections. Two of these samples were positive for C difficile on routine testing; 1 sample was culture-positive for Campylobacter jejuni and 2 were culture-positive for S enterica (Table 2 and eTable 3).
During phase 1, the assembly phase of the analysis (Figure 1), we assembled microbial sequences from the German outbreak samples into more than 1.5 million EGTs. More than half of the bases in this assembly fell into EGTs that were greater than 1.5 kilobases in length. When visualized by taxonomic assignment and GC content, these fell into numerous clusters, widely dispersed in taxonomic and sequence space (Figure 2). Nonetheless, it was clear that EGTs from the German outbreak samples were dominated by the Enterobacteriales, the order that contains E coli.
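The statement that more than half of the assembled bases fell into EGTs longer than 1.5 kilobases resembles the N50 statistic often reported for assemblies, although the source does not name it as such. A minimal sketch of that statistic, with a hypothetical function name and toy lengths, is:

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy fragment lengths, not values from the study.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

Applied to the real EGT lengths, a value above 1500 would be consistent with the description in the text.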
When we selected EGTs that had to be present in at least 20 German outbreak samples, this led to considerable simplification of taxonomic clustering, but still failed to identify any outbreak-associated strains unambiguously. When we then subtracted EGTs that had matches in samples from healthy individuals, we were left with just 450 outbreak-specific EGTs. When subjected to a taxonomic analysis, nearly two-thirds (65%) were assigned to the Enterobacteriales. Apart from 6 other sequences from diverse taxa, the remaining one-third was not assigned to a specific bacterial taxon.
These outbreak-specific EGTs from the Enterobacteriales were used as seeds in a clustering process that drew on reads in the original set of sequences from the outbreak metagenome to reconstruct the accessory genome of the E coli outbreak strain. We performed a functional annotation of this genome, which confirmed the presence of numerous important strain-specific genes, including the Shiga-toxin genes, an aggregative adherence fimbriae (type 1) locus, the O-antigen determining cluster, and antibiotic-resistance genes, including an extended-spectrum beta-lactamase of type CTX-M-15 (Figure 3 and eTable 4).
During phase 2, the alignment phase, we mapped reads from the German outbreak samples against a reference genome sequence of the STEC outbreak strain, obtaining abundant coverage of the genome of the outbreak strain (>10-fold) from 10 samples and at least modest coverage (>1-fold) in 26 samples (Table 1 and eFigure 1). Sequences from the Shiga-toxin genes (stxAB) were detected in the metagenomes of 27 of the 40 STEC-positive samples (67%), including 6 samples that were negative in the Stx ELISA. In 13 of the STEC-positive samples, we found a difference in copy number between the Stx phage genome and other strain-specific chromosomal loci (Table 2 and eFigure 2). By using homology searches to retrieve informative sequences from each sample, we were also able to confirm the flagellar H antigen serotype (H4) and the MLST sequence type for the outbreak strain (Table 1 and eTable 2).
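One simple way to express a copy-number difference between the Stx phage and the rest of the chromosome is the ratio of mean depth inside the phage region to mean depth outside it. The sketch below assumes a per-base depth list for one sample and hypothetical region coordinates; it is an illustration of the idea, not the method used in the study.

```python
from statistics import mean


def region_depth_ratio(depths, region_start, region_end):
    """Ratio of mean depth inside a region to mean depth outside it.

    depths: per-base depth of coverage along the reference.
    region_start, region_end: 0-based, half-open coordinates of the region.
    """
    inside = depths[region_start:region_end]
    outside = depths[:region_start] + depths[region_end:]
    if not inside or not outside:
        raise ValueError("region must be non-empty and smaller than the reference")
    background = mean(outside)
    if background == 0:
        raise ValueError("background depth is zero")
    return mean(inside) / background


# Toy profile: background depth 10, with a region at depth 20.
toy_depths = [10] * 8 + [20] * 2
toy_ratio = region_depth_ratio(toy_depths, 8, 10)
```

A ratio near 1 would suggest equal copy number, whereas values well above or below 1 would correspond to the over- and underrepresentation of the phage genome discussed later in the article.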
During phase 3, the phylogenetic phase of the analysis, we recovered genome sequences at greater than 1-fold coverage of C jejuni and C difficile from the metagenomes of samples positive for these pathogens on routine microbiological investigation (Table 2 and eTable 3). We also recovered C difficile -specific reads from a second C difficile -positive sample and Salmonella -specific reads from 1 of the 2 Salmonella -positive samples, but in both cases without complete genomic coverage.
In that second sample that had been reported as positive for C difficile by conventional microbiology, we recovered around 1000-fold more reads that mapped to Campylobacter concisus (a fastidious bacterium that has been described as an emergent pathogen of the human intestinal tract15) than mapped to C difficile (eFigure 2). We also recovered C difficile -specific reads from several of the STEC-positive samples (Table 1). We were able to draw molecular epidemiological inferences from the analysis of sequences from potential pathogens other than STEC (eTable 3).
Using metagenomics, we have been able to recover a draft genome sequence of the German STEC strain without the need for laboratory culture. We found that in most patients with STEC-positive samples, the outbreak strain of E coli accounted for a sizeable proportion of microbial sequences. We were also able to recover C jejuni, C difficile, and S enterica sequences from STEC-negative samples. Furthermore, we have shown that this approach can detect unknown unknowns. For example, in a sample that was positive for C difficile using conventional approaches, we recovered more than 1000-fold more sequences from another potential pathogen, C concisus, than from C difficile. We also found C difficile sequences in several of our STEC-positive samples.
Our discovery of multiple potential pathogens in some samples casts doubt on the reliability of inferring that disease was caused by a single detected pathogen, particularly when using a selective diagnostic approach. Such findings also raise the question of how far changes in microbial community composition and synergistic interactions between potential pathogens play a role in the development of pathology.16,17
We also made some unexpected observations on the abundance of the bacteriophage that encodes the Shiga toxin. Among the STEC-positive samples, we found variable coverage of the Shiga-toxin-phage genome relative to sequences from the STEC chromosome (eFigure 2). Potential explanations for this over- and underrepresentation of the phage genome include detection of sequences from bacteriophage particles released during bacterial cell lysis, dynamic gain and loss of integrated prophages across enteric populations of E coli, or multiple prophage insertions or duplications within individual E coli genomes. Further investigation will be needed to clarify the relative contributions of these processes.
The data presented herein do not allow a formal evaluation of metagenomics as a diagnostic tool. However, with a sensitivity of 67% (compared with culture) on STEC-positive samples, it is clear that this technology cannot yet deliver adequate performance for prospective use in a clinical setting. Nonetheless, our findings do illustrate the potential of metagenomics in pathogen discovery and detection and highlight the need for future prospective evaluations against standard approaches. Furthermore, although metagenomics relies on relatively sophisticated analytical pipelines and high-end instrumentation, with reagent costs in the tens of thousands of dollars, such effort and expense may be justified when faced with an outbreak of a pathogen that eludes standard diagnostic procedures. In addition, obtaining a draft genome sequence of an outbreak strain may facilitate the development of simpler and cheaper diagnostic tests of the required sensitivity and specificity, as was shown during the STEC outbreak.18
In conclusion, these results illustrate the potential of metagenomics as an open-ended, culture-independent approach for the identification and characterization of bacterial pathogens during an outbreak of diarrheal disease. Challenges include speeding up and simplifying workflows, reducing costs, and improving diagnostic sensitivity, all of which are likely to depend in turn on improvements in sequencing technologies.4
Corresponding Author: Mark J. Pallen, MA, MD, PhD, Division of Microbiology and Infection, Warwick Medical School, University of Warwick, Coventry, United Kingdom, CV4 7AL (email@example.com).
Author Contributions: Dr Pallen had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Drs Loman, Constantinidou, and Christner contributed equally to this work. Drs Aepfelbacher and Pallen contributed equally to this work.
Study concept and design: Loman, Rohde, Aepfelbacher, Pallen.
Acquisition of data: Loman, Constantinidou, Christner, Rohde, Chan, Quick, Weir, Smith, Betley, Aepfelbacher.
Analysis and interpretation of data: Loman, Rohde, Chan, Quince, Pallen.
Drafting of the manuscript: Loman, Constantinidou, Chan, Quick, Smith, Aepfelbacher, Pallen.
Critical revision of the manuscript for important intellectual content: Loman, Christner, Rohde, Weir, Quince, Betley, Aepfelbacher.
Statistical analysis: Loman, Quince.
Obtained funding: Loman, Pallen.
Administrative, technical, or material support: Loman, Constantinidou, Christner, Rohde, Chan, Quick, Weir, Smith, Betley, Aepfelbacher.
Study supervision: Rohde, Aepfelbacher, Pallen.
Conflict of Interest Disclosures: The authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Dr Rohde reported receiving speakers fees from Novartis and Gilead; and receiving travel reimbursement from Novartis and Merck Sharp Dohme. Ms Weir and Drs Smith and Betley are employees of and own stock in Illumina Inc, which manufactures the MiSeq and HiSeq 2500 instruments. No other authors reported disclosures.
Funding/Support: This work was supported in Germany by the Medical Faculty of the University Medical Center Hamburg–Eppendorf. Work in Birmingham, England, was supported by a grant from the UK's Biotechnology and Biological Sciences Research Council supporting the xBASE project, by a grant from the UK's National Institute for Health Research awarded to the Surgical Reconstruction and Microbiology Research Centre (MRC), and by an MRC Special Training Fellowship in Biomedical Informatics to Dr Loman. The HiSeq2500 sequencing was supported by Illumina Inc.
Role of the Sponsor: Neither the UK's Biotechnology and Biological Sciences Research Council, the MRC, nor the UK's National Institute for Health Research had any role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; and preparation, review, or approval of the manuscript.
Additional Contributions: We are indebted to the laboratory staff in the clinical microbiology laboratory at the University Medical Centre Hamburg-Eppendorf who performed conventional microbiological analyses as part of routine management of patients, to Richard Brown, BSc, and Gemma Kay, PhD, for technical support in the laboratory at University of Birmingham, and to Holly Duckworth, BSc, and Peter Saffrey, PhD, for technical support in the sequencing laboratory at Illumina. The persons listed were not compensated for their contributions beyond their normal salaries.