Multiomics Characterization of Preterm Birth in Low- and Middle-Income Countries

This diagnostic/prognostic study describes the use of cell-free transcriptomics, urine metabolomics, and plasma proteomics for identifying the biological measurements associated with preterm birth.


Study Design
Inclusion criteria for samples, which consisted of plasma and urine were: collection ≤ 20 weeks of GA, and determination of GA by ultrasound ( < 37 weeks' GA for PTBs and > 37 weeks' GA for term births). Medically-indicated preterm deliveries were excluded.
Our study population consisted of 81 pregnant women selected from the following GAPPS-and AMANHI-supported birth cohorts: (1) the GAPPS Preterm and Stillbirth Study in Matlab, Bangladesh (PreSSMat Study, icddr,b, Matlab, Bangladesh), a prospective cohort study designed to assess biological, environmental, and social determinants of adverse pregnancy outcomes; (2) the GAPPS Preventing Preterm Birth Initiative in Zambia (ZAPPS Study, UNC-CH/UTH, Lusaka, Zambia) [8], a prospective cohort study and biorepository designed to characterize the factors associated with PTB and outcomes in Zambia; and (3) the Alliance for Maternal and Neonatal Health Improvement (AMANHI) biorepository study in Bangladesh Sylhet, Pakistan Karachi and Pemba Tanzania. All pregnant women provided written informed consent for participation in the original study, and for future utilization of specimens. For the current studies ethical exemptions were sought from the respective incountry IRBs and regulated under necessary material transfer and data transfer agreements.
At all AMANHI and GAPPS cohorts, trained phlebotomists collected blood samples for centrifugation and aliquoting of serum, plasma, and buffy coat for storage and future analyses. In addition, maternal urine was collected in parallel. With a view to facilitate the future of omics study, special care was taken to ensure sample storage at −80 • C in each biobank. Unique study identification numbers were assigned to all samples, which were linked to each participant. Outcome assessment was done by birth surveillance through phone calls and household visits [12].
Collection and processing of all sample types was performed following standard operating procedures at all study cohorts [7]. Blood collected in EDTA tubes was cold centrifuged at 3, 000 rpm for 10 mins within 4 hrs. Plasma was separated and stored at −80 • C until shipment. 1.0 mL of plasma for transcriptome, 0.5 mL of plasma for proteome, and two aliquots of 2 mL each of urine for metabolome analysis were shipped from each biorepository. Samples were shipped on dry ice as a single batch and under continuous temperature monitoring.
©2020 Jehan F et al. JAMA Network Open eMethods 2 Biological Modalities 2.1 Transcriptomics cfRNA was extracted from 1mL of plasma using a Plasma/Serum Circulating and Exosomal RNA Purification mini kit (Norgen, cat510000) following manufacturer's instructions. The residue of DNA was digested using Baseline-ZERO DNase (Lucigen, DB0715K) and then cleaned by RNA Clean and Concentrator-5 kit (Zymo, R1013). RNA was eluted to 12 µL in the elution buffer.
Eight mL of the eluted RNA was used for sequencing library preparation using SMARTer Stranded Total RNAseq kit v2 -Pico Input Mammalian (Clontech, cat634413) according to the manufacturer's instructions. Short read sequencing was performed using the Illumina NovaSeq S2 2-Lanes (2 × 75 bp) platform to the depth of more than 10 million reads per sample. The sequencing reads were mapped to human reference genome (hg38) using STAR aligner [11]. Duplicates were removed by PICARD [14] and then gene counts were quantified using unique reads with htseq-count [3]. Prior results demonstrated a strong correlation between this assay and RT-qPCR measurements [19].

Metabolomics
Global metabolic profiling of urine samples was performed using a broad spectrum liquid chromatography coupled with mass spectrometry platform (LC-MS). Urine aliquots were prepared and analyzed as previously described [9]. Briefly, urine samples were thawed on ice and centrifuged at 17, 000rcf for 10 minutes. The supernatants were diluted by a factor of four with 75% acetonitrile and 100% water including 13 internal standards (IS) for HILIC-and RPLC-MS experiments, respectively. Samples for HILIC-MS experiments were further centrifuged at 21, 000g for 10 min at 4 • C to precipitate proteins.
Metabolic extracts were analyzed four times using HILIC and RPLC separation in both positive and negative ionization modes. Data were acquired on a Thermo Q Exactive HF mass spectrometer that was equipped with a HESI-II probe and operated in full MS scan mode. MS/MS data were acquired at different fragmentation energies (NCE 25, 35 and 50) on pool samples (QC) consisting of an equimolar mixture of all samples in the study. HILIC experiments were performed using a ZIC-HILIC column 2.1 x 100mm, 3.5µm, 200Å (Merck Millipore) and mobile phase solvents consisting of 10mM ammonium acetate in 50/50 acetonitrile/water (A) and 10mM ammonium acetate in 95/5 acetonitrile/water (B). RPLC experiments were performed using a Hypersil GOLD column 2.1 × 150mm, 1.9µm, 175Å (Thermo Scientific) and mobile phase solvents consisting of 0.06% acetic acid in water (A) and 0.06% acetic acid in methanol (B).
Data quality was ensured by (1) sample randomization for metabolite extraction and data acquisition, (2) multiple injections of a pool sample to equilibrate the LC-MS system prior to run the sequence (12 and 6 injections for HILIC and RPLC methods, respectively), (3) spike-in labeled IS during sample preparation to control for extraction efficiency and evaluate LC-MS performance, (4) checking mass accuracy, retention time and peak shape of IS in every samples and (5) injection of a pool sample every 10 injections to control for signal deviation with time.
Data from each mode were independently analyzed using Progenesis QI software (v2.3) (Nonlinear Dynamics). Metabolic features from blanks and that did not show sufficient linearity upon dilution in QC samples (r < 0.6) were discarded. Only metabolic features present in > 2/3 of the samples were kept for further analysis. Inter-and intra-batch variation was corrected using the LOESS (locally estimated scatterplot smoothing Local Regression) normalization method on pool samples injected repetitively along the batches (span = 0.75). Missing values were imputed by drawing from a random distribution of low values in the corresponding sample. Data from each mode were merged to obtain a dataset containing 6,630 putative metabolites. Dilution effect was corrected by using the probabilistic quotient normalization (PQN) [10]. Metabolic features were annotated by matching the experimental accurate mass (±5 ppm) to a local database containing 60, 000+ metabolites. This database was created by compiling metabolites from various public databases including HMDB, FoodB and DrugBank [25].

Proteomics
The proteomic analysis was performed by O-link Proteomics (Watertown, MA) with a highly multiplex proteomic platform using proximity extension technology [6]. For this study, eleven panels were used, each measuring 92 different proteins simultaneously in 1µL of plasma. Each protein was detected by a matched pair of antibodies that were coupled to unique and partially complementary oligonucleotides. When in close proximity, a new and protein-specific DNA reporter sequence was formed by hybridization and extension, which was then amplified and quantified by real-time PCR [20].
Relative amounts of protein were quantified as normalized protein expression (NPX). NPX was derived by subtracting the Ct value of the extension control reaction from the raw Ct-value (threshold cycle) to adjust for technical variations (dCT), then subtracting differences in Ct-values between plates (inter-plate control) from the dCt-value (ddCt-value) to adjust for inter-assay variability, and then subtracting the ddCt-value from a correction factor to adjust for background noise and invert the scale. An increase of 1 NPX corresponded to a doubling of the relative protein concentration (log 2 scale).
Quality control (QC) was performed at the assay and sample level [1]. At the assay level (internal controls) each sample was spiked with two non-human antigens (incubation control), an antibody coupled with a unique pair of DNA tags (extension control), and a double-stranded DNA amplicon (detection control) to monitor the three major procedural steps (immunoreaction, extension, and amplification/detection). At the sample level three controls were added to each plate. A synthetic sample containing 92 antibodies with one pair of unique DNA tags in fixed proximity was added in triplicate to monitor and compensate for inter-run and inter-plate variations (inter-plate control). A negative control was added in triplicate to monitor for background noise. Finally, a pooled plasma sample was added in duplicate to monitor for intra-and interassay variability and determine coefficient of variations. A plate passes QC if the standard deviation of internal controls was less than 0.2 NPX. Individual samples pass QC if values of internal controls deviated by less than 0.3 NPX from the plate median. In this study, the plate passed QC as did 97.7% of the samples. Of all assayed proteins 88.8% were detected in more than 75% of samples. The median intra-assay coefficient of variation was 7%. Prior studies have demonstrated strong associations between this assay and ELISA analysis (e.g., [5,24,16,15].)

Quality Control
Additional control was performed by visualization of all subjects using unsupervised analysis, supplemented by objective and quantitative analysis using a supervised algorithm as described in eFigure 1. While cohort-specific signatures and signatures associated with storage time were observed, overall data quality was consistent across all modalities. (A) To investigate cohort-specific data signatures, principal component analysis (PCA) was used to create a two-dimensional representation of the entire cohort for each biolog-ical modality as well as all modalities combined. This analysis demonstrated that the largest source of variation in the data was not driven by fundamen-tal differences between the cohorts, underscoring the decreased likelihood that there was bias induced by different sampling or processing protocols. Super-vised linear discriminant analysis (LDA) confirmed the existence of more subtle cohort-specific signatures that were not significant enough to be visualized in an unsupervised PCA. (C) The presence of cohortspecific signatures was con-firmed using random forest analysis (subject to crossvalidation for prediction of the sampling site of previously unseen patients exclusively based on each bi-ological modality. Overall, this confirmed the presence of consistent yet limited cohort-specific variations in the datasets. (D) The impact of sample storage time was quantified with random forest analysis subjected to cross-validation in which the number of days between sample collection and laboratory anal-yses was used as a continuous prediction target. The random forest results on previously unseen patients were statistically significant only in the case of the urine metabolomics dataset, indicating the potential for sample degradation over time. However, this did not confound the design of this study as gestational age (GA) at delivery did not correlate with storage time ( p > 0.41, r = −0.092). Previous bioinformatics work detailing multiomics data integration fall within two major categories: multi-staged, in which measurements of the same biological factors (e.g., genes) are available for alignment of the feature space [22]; and meta-dimensional, in which direct connections between the measured features are not available a priori [21], [13]. Given the diversity of the biological modalities that must be integrated in this study and the lack of preexisting biological connections between all measured factors, a meta-dimensional approach was designed in which each modality is first analyzed independently, and then combined with a higher level integration layer to increase predictive power. Multivariate predictive modeling was performed using a random forest algorithm as implemented in [18] using default parametrization. A comparison against other machine learning algorithms using a similar cross-validation strategy is presented in eFigure 3A. To ensure the generalizability of the models, a Leave-One-Out Cross-Validation (LOOCV) strategy was used to test the predictions on previously unseen patients. In this setting, a model was trained on all available patients except for one. The model was then tested on the blinded subjects. This process was repeated for all subjects until a blinded prediction was calculated for all patients. Final results were reported using these blinded predictions. Cross-validation folds were synchronized be-tween the models built on individual omics datasets and the integrated model to leave out the same data points at all levels of the analysis. Importantly, this guaranteed that not only the aggregate model, but also its input features (i.e. the final predictions from each dataset) were blinded to the same subject during cross-validation. For the prediction of PTB, to account for cohort-specific signatures, we implemented an additional variable-filtering step to reduce the overall search space. Specifically, for a data matrix X O of J features from O omics platforms corresponding to cohort I, we train a RF model Ξ I = RF Train (X O −iJ I ) where X O −iJ I = {i = i∧X i J I ∈ X IJ I } denotes the removal of patient i from the analysis for cross-validation and J I is the set of features that are selected by statistical testing between term and preterm cases on X −iJ I . After training, the blinded prediction p O Ii = RF Predict (Ξ I , X O IiJ I ) can be calculated for each omics dataset and combined into the final prediction vectorŷ

Combined
where ω o k I is the classification performance of omics dataset k, on cohort I, which was calculated using an internal nested cross-validation layer. A comparison against other integration strategies (simple merging of all datasets and stacked generalization) using a similar cross-validation strategy is presented in eFigure 3B.
To calculate a lower bound for the analysis pipeline (to confirm that the strong results are not due to a coding problem that results in information leakage in the cross-validation scheme), a negative example using random data was used. The poor performance of the model on random data (eFigure 4) confirms that that the strong performance in the real dataset is not due to model overfitting. To validate the computational pipeline, data from all patients were randomly assigned to either a case or a control group. The three biological modalities were used to predict these random labels. The pipeline used in Figure 2 was not able to predict the randomly created labels subject to cross-validation. This indicates that the strong performance of the algorithm was not due to model overfitting.

Mixed-Effect Modeling
To account for cohort-specific variations, we employed a linear mixed effect model with cohort encoded as a random effect. Particularly Y is = β 0 + S 0s + β 1 X i + e is , where e is ∼ N (0, σ 2 ) and Y is is a binary vector indicating PTB for patient i in cohort s, with a fixed-term intercept β 0 , fixed-term slope β 1 , fixed-term predictor variable X i , random-effect intercept S 0s for cohort s, and observationlevel error e is for patient i in cohort s with variance σ 2 .

Multiomics Visualization
The features from the three omics data were visualized using a dimension reduction strategy designed to balance the size and modularity of each dataset. The top 10 PCs of each omics dataset were used as a 30-dimensional latent space. The correlation matrix between each measurement and this latent space was visualized using the tSNE dimension reduction algorithm [18]. This ensures equal contributions to the visualization layout by all datasets.

Clinical Covariates
Field workers were trained to collect detailed phenotypic and demographic data from the women and their families through scheduled household visits during pregnancy and post-partum. Clinical covariates were manually harmonized across all five cohorts. Of all variables collected, only the weight of the baby and gestational age at delivery were significantly correlated with the final outcome of the model predicting PTB (Supplemental Table S1 and Supplemental Figure  S5). This confirmed that the model was not confounded by the other measured clinical covariates.

Ex Vivo Whole-blood Immuno-assay
The presence of inflammatory mediators among the features most correlated with PTB is consistent with previous studies suggesting that dysfunctional immune adaptations during pregnancy is central to the pathogenesis of PTB. However, the predictive model also highlighted a set of proteomic features with no known inflammatory properties, that were highly correlated with features from the inflammatory module. These proteins included, protein-arginine deiminase type II (PADI2), a peptidylarginine deiminase responsible for protein citrullination and implicated in parturition and sensing infections [17,4]; transferrin receptor (TfR) which is implicated in iron transport; angiopoietin-like 4 (ANGPTL4) which regulates glucose homeostasis and lipid metabolism (48); and RARRES2, an adipokine increased in metabolic syndrome and gestational diabetes [23,26]. To determine whether observed correlations between these proteins and the inflammatory module reflected biologically-relevant inflammatory properties, we examined the capacity of each of these factors to stimulate human peripheral blood leukocytes using an ex-vivo mass cytometry assay. Mass cytometry, an advanced flow cytometry technique, is capable of measuring up to 50 markers in hundreds of thousands of single cells, resulting in detailed functional profiling of all major immune cell types. Using this assay, the activity of major intracellular signaling responses previously implicated in maternal immune adaptations during pregnancy (including pSTAT1, pSTAT3, pSTAT5, pSTAT6, pP38, pMK2, pERK, prpS6, pNFkB, and total IkB) were assessed at baseline and after a 15 minutes stimulation in all major innate and adaptive immune cell-types. Whole blood was collected from healthy, nonpregnant volunteers and stimulated for 15 minutes at 37 • C with lipopolysaccharide (1 ug/mL, InvivoGen) and interferon alpha (100 ng/mL, PBL Assay Science),Transferrin Receptor (1 ug/mL, R&D Systems), ANGPTL4 (1 ug/mL, R&D Systems), PADI2 (5 ug/mL, Abnova), RARRES2 (1 ug/mL, R&D Systems), CCL3(1 ug/mL, Invitrogen), G-CSF (100 ng/mL, R&D Systems), and IL-6 (100 ng/mL, R&D Systems) or left unstimulated.
Samples were processed using a standardized protocol for fixation (Smart Tube Inc), barcoding, and staining with antibodies for mass cytometry by time of flight analysis (CyTOF), as described previously [2].

Data Repositories and Source Code
The measured features from all three omics datasets, the algorithms and source codes for reproduction of the results, as well as an interactive website capable of visualizing the entire dataset, the feature evaluation scores for PTB and GA at sampling, and pathway enrichment analysis is available at: https://nalab.stanford.edu/multiomicsmulticohortpreterm/