Application of a Neural Network Whole Transcriptome–Based Pan-Cancer Method for Diagnosis of Primary and Metastatic Cancers

This cross-sectional diagnostic study evaluates the accuracy of a machine learning method that uses the whole transcriptome to identify gene markers in primary and metastatic tumors.


eAppendix 1. Model Selection
As the 66 training cohorts varied widely in the number of representative samples, using accuracy as a performance metric would not necessarily demonstrate the ability of a classifier to discriminate between all 66 output classes.
Therefore, the performance of the trained models was evaluated using the F1-score. The F1-score is the harmonic mean of precision and recall, with 1.0 being the most desirable value (precision and recall of 1.0 for all output categories) and 0 representing a classifier with 0 precision and 0 recall on all output categories.
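As a concrete illustration of this metric, the per-class F1 and its macro average can be computed as follows (a minimal sketch; the helper names are hypothetical):

```python
# Illustration of the F1-score: for each class, F1 is the harmonic mean of
# precision and recall; a macro average then weights all output classes
# equally, regardless of class size.
def f1(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_class_pr):
    """Average per-class F1 over all classes (hypothetical helper)."""
    return sum(f1(p, r) for p, r in per_class_pr) / len(per_class_pr)

# A perfect classifier (precision = recall = 1.0 on every class) scores 1.0;
# one with zero precision and recall everywhere scores 0.0.
print(f1(1.0, 1.0))                                   # 1.0
print(macro_f1([(1.0, 1.0), (0.5, 0.5), (0.0, 0.0)]))  # 0.5
```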

Algorithmic Model Selection
Several supervised learning algorithms were evaluated, including a baseline linear comparator discussed below. For the selection of an initial classifier based on the training data, we evaluated the performance of 4 supervised learning methods: multi-class support vector machines (SVM), random forests (RF), extra trees ensembles (ET), and an ensemble of neural networks. Figure 1 shows the cross-validation results of the best-performing model in each algorithm category. The optimal model for each comparator was selected by grid search across a space of model-specific parameters, using 5-fold cross-validation (5-CV) to identify the best configuration.
Shallow (1 hidden layer) feed-forward neural networks were trained with varying hidden-layer sizes. In selecting the optimal parameter set for the neural networks, our search space spanned a range of values for the learning rate and the L1 and L2 regularizers.
SVMs, RFs, and ETs were implemented using the scikit-learn package in Python, and training was done on CPU machines. The custom neural networks were defined and trained using the Lasagne library [6] in Python, with training performed on NVIDIA GeForce GTX TITAN X GPUs. The F1-score was used as the main metric of assessment, to account for class imbalances in the training and test sets.
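The selection procedure for the scikit-learn comparators can be sketched as follows; the parameter grids and synthetic data are placeholders, not the grids or cohorts used in the study:

```python
# Illustrative 5-fold CV grid search over the three scikit-learn comparators
# (SVM, random forest, extra trees), scored by macro-F1 to account for class
# imbalance. The parameter grids below are placeholders only.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic multi-class dataset standing in for the expression data.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

searches = {
    "SVM": (SVC(), {"C": [0.1, 1, 10]}),
    "RF": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
    "ET": (ExtraTreesClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

best = {}
for name, (model, grid) in searches.items():
    gs = GridSearchCV(model, grid, scoring="f1_macro", cv=5)
    gs.fit(X, y)
    best[name] = gs.best_score_  # mean macro-F1 of the best parameter set
```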

Baseline Linear Comparator
Using pairwise ANOVA, 3000 genes were identified as significantly discriminative among the 66 training tissue and cancer types. Cancer samples in the metastatic cohort were evaluated by computing a simple pairwise Spearman correlation over these genes between each test sample and the TCGA samples (separated by cancer type). The most highly correlated cancer/normal type was chosen as the predicted class by this approach.
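The baseline assignment rule can be sketched as follows (random placeholder profiles stand in for the TCGA reference data, and `predict_by_spearman` is a hypothetical helper):

```python
# Sketch of the baseline comparator: each test sample is assigned the class
# whose reference expression profile (over the ANOVA-selected genes) has the
# highest Spearman correlation with the sample. Data here are random
# placeholders standing in for TCGA expression values.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_genes = 50  # stand-in for the 3000 ANOVA-selected genes
class_profiles = {c: rng.random(n_genes) for c in ["BRCA", "LUAD", "PRAD"]}

def predict_by_spearman(sample, profiles):
    """Return the class label with the highest Spearman rho (hypothetical helper)."""
    rhos = {c: spearmanr(sample, prof)[0] for c, prof in profiles.items()}
    return max(rhos, key=rhos.get)

# A lightly perturbed copy of the BRCA profile should correlate best with BRCA.
sample = class_profiles["BRCA"] + rng.normal(0, 0.01, n_genes)
print(predict_by_spearman(sample, class_profiles))
```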

SCOPE Ensemble Construction
First, we found that synthetic minority oversampling technique (SMOTE) [5] improved classification of rare classes compared with training on the class-imbalanced dataset (eFigure 3).
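The SMOTE interpolation idea can be sketched as follows (a simplified, self-contained illustration of the technique cited as [5], not the implementation used in the study):

```python
# Simplified illustration of SMOTE: a synthetic minority-class sample is
# created by interpolating between a real minority sample and one of its
# k nearest minority-class neighbours.
import numpy as np

def smote_sample(X_minority, k=3, rng=None):
    """Generate one synthetic sample from a minority-class matrix (rows = samples)."""
    rng = rng or np.random.default_rng()
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    # k nearest neighbours of x within the minority class (excluding x itself)
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return x + gap * (X_minority[j] - x)

# Four toy minority samples; the synthetic point lies on a segment between
# two of them, so it stays inside their bounding box.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote_sample(X_min, k=2, rng=np.random.default_rng(0))
```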
Second, comparison of different data normalizations showed that using rank-transformed RPKM improved classification accuracy for the overall model. Based on these results, we extended the classifier into an ensemble of neural networks combining these data transformation techniques. The additional neural networks were selected using 5-CV. We also observed that each of these 5 machines was better at classifying different classes within the training dataset, reflecting that each had learnt different (unknown) modalities for classification of the 66 tissue and cancer types. Based on this observation, we adopted a weighted majority voting approach to obtain predictions for new samples from the ensemble. This produced a re-weighted vector of 66 values from each network. These re-weighted predictions were then pooled using majority voting, providing an averaged probability score and a similarity confidence for each class.
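The pooling described above can be sketched as follows; the per-network probabilities and class weights are random placeholders, and the exact weighting scheme used in the study may differ:

```python
# Sketch of the ensemble pooling: each of the 5 networks produces a
# probability vector over the 66 classes; per-network, per-class weights
# (e.g. derived from per-class CV performance) re-weight those vectors,
# which are then pooled into an averaged score and a vote count.
import numpy as np

rng = np.random.default_rng(1)
n_networks, n_classes = 5, 66

probs = rng.dirichlet(np.ones(n_classes), size=n_networks)  # softmax-like outputs
weights = rng.random((n_networks, n_classes))               # placeholder weights

reweighted = probs * weights        # one re-weighted 66-vector per network
avg_prob = reweighted.mean(axis=0)  # averaged probability score per class
# number of machines whose top prediction is each class (majority voting)
votes = np.bincount(probs.argmax(axis=1), minlength=n_classes)
predicted = int(avg_prob.argmax())
```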
The architecture and details of data pre-processing of each network are as described in eTable 3.
The learning rate was 0.001, with early stopping when the validation cost increased for more than 3 epochs or was stable for 3 consecutive epochs [3]. The maximum number of allowed epochs was 1000; training therefore continued until 1000 epochs or the early-stopping threshold, whichever came first.
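The stopping rule can be sketched as follows (`stop_epoch` is a hypothetical helper encoding the thresholds above):

```python
# Sketch of the early-stopping rule: training halts when the validation cost
# has increased for more than 3 consecutive epochs, or has been stable for
# 3 consecutive epochs, or after 1000 epochs at most.
def stop_epoch(val_costs, patience=3, max_epochs=1000, tol=1e-6):
    """Return the (1-based) epoch at which training would stop (hypothetical helper)."""
    rising = stable = 0
    prev = None
    for epoch, cost in enumerate(val_costs, start=1):
        if epoch > 1:
            delta = cost - prev
            rising = rising + 1 if delta > tol else 0
            stable = stable + 1 if abs(delta) <= tol else 0
            if rising > patience or stable >= patience:
                return epoch
        prev = cost
        if epoch >= max_epochs:
            return epoch
    return len(val_costs)

# Cost rises every epoch: stops once it has increased more than 3 times in a row.
print(stop_epoch([1.0, 1.1, 1.2, 1.3, 1.4, 1.5]))  # 5
```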

Metastatic Disease and Cancers of Unknown Primary (CUPs)
The Personalized OncoGenomics (POG) project at the BC Cancer Agency (BCCA) was established in 2011 with the aim of sequencing patients whose cancers no longer respond to standard treatment [4]. The project analyses the genomic and gene expression (transcriptomic) data of each patient in order to identify drugs that can target the individual cancer. The vast majority of patients enrolled into POG presented with metastatic disease with a known primary diagnosis and location.
Biopsies are collected from the metastatic site in most cases. In a minority of cases, the tissue is taken from the original site of disease presentation (noted as such in the main text, Table 2); this could be due to the intrinsic biology of the cancer in these cases. The table also shows the number of training samples available for each cancer type.
Additionally, the POG cohort contained 16 cases where the primary site of origin could not be determined by the initial pathology analysis. Genomic and transcriptomic analysis as part of the POG project determined the corresponding cancer type for 15 of these cases, which was used as the gold standard for assessing the classifier's predictions. Classification was performed retrospectively, after the closest suitable cancer type had been determined based on detailed pathway-level and genomic analyses of the cancer.

eAppendix 4. Metrics Used
Since the trained cancer and normal cohorts had variable representation in the training set (ranging from 3 samples for adjacent normal subcutaneous melanoma to 1095 for breast cancer), we used the F1-score as the measure of overall classifier performance.
The F1-score is the harmonic mean of precision and recall, with 1.0 being the most desirable value (precision and recall of 1.0 for all output categories), and 0 representing a classifier that has 0 precision and 0 recall on all output categories.
For a given input, the ensemble generates a pooled confidence score for each of the 66 output classes. Predicted classes are jointly ordered by the confidence score and the number of machines in agreement. This max vote-pooling method was used to obtain a quantitative confidence score for each category, which was taken as a proxy for a differential diagnosis when assessing metastatic samples. Thus, in the event that the prediction from the ensemble classifier was split between different cancer types, the correctness of the prediction was assessed by comparing the diagnosed cancer type against the pool of confident predictions.

Cohort  Sample type      Organ system      Marker genes
LGG     Tumor            —                 —
—       Tumor            Gastrointestinal  AGR2, CEACAM5, CHGB, CTRB1, CTRB2, GCG, INS, PNLIP, PPY, REG1A, REG4, S100P, SFRP2, SPINK1, SST, TFF1, TFF2, TTR
PCPG    Adjacent Normal  Endocrine         CYP11B1, CYP17A1, DLK1, GSTA1, STAR
PCPG    Tumor            Endocrine         CHGA, CHGB, DBH, DLK1, NPY, PENK
PRAD    Adjacent Normal  Urologic          ACPP, KLK2, KLK3, KLK4, NPY, OLFM4, PIP, SEMG1
PRAD    Tumor            Urologic          ACTG2, AZGP1, DES, FOLH1, FOXA1, KLK2, KLK3, KLK4, NKX3-1, NPY, PLA2G2A, SLC45A3
SARC    Tumor            Soft Tissue       DLK1
SKCM    Adjacent Normal  Skin              DCT, MLANA, PRAME, TYR
SKCM    Tumor            Skin              APOD, DCT, EDNRB, KRT6B, MLANA, PLP1, PRAME, S100A1, SERPINE2, SOX10, TYR, TYRP1, VGF
STAD    Adjacent Normal  Gastrointestinal  ACTG2, APOA1, APOA4, CLDN18, DES, GKN1, HSPB6, PGA4, PGC, PI3, REG3A
STAD    Tumor            Gastrointestinal  ACTG2, CEACAM6, CST1, MALAT1, PGC, REG4

The x-axis has been ordered by increasing class size, indicated in the first panel, and performance is shown on the CV folds (black) and on the held-out set (yellow). As can be seen, the difference between CV-fold performance and held-out performance is large for small classes but tapers off as the class size approaches >100.
When the classifier is augmented by the addition of synthetic samples in the training folds (last panel), there is an overall increase in performance for the rare classes, and the gap between mean CV precision and held-out precision is minimized. The line of best fit (loess) is indicated for each model, with standard error bounds in grey. The performance in different CV folds is shown by the black points (mean) with 1 standard deviation bars.

eFigure 3. Performance of SCOPE on the Held-out Set

All 66 classes are shown: (a) performance of tumors versus their normal counterpart classes, and (b) cross-calling patterns originating from the normal-class samples in the held-out set.
Performance of the individual neural networks that make up SCOPE (n=5) is shown as grey points. (a) The average and 1 standard deviation spread of class-specific F1-scores across these networks is shown in black (black point = mean; error bars = standard deviation). The two panels separate the normal (n=26) and tumor (n=40) tissue classes.
(b) The average and 1 standard deviation spread of class-specific F1-scores across the ensemble machines is shown by the colored error bars. "_TS" and "_NS" indicate tumor and normal tissue classes, respectively. The width of the curves indicates the proportion of cross-called samples from the originating class (direction indicated by the arrow), with the smallest width corresponding to 15% of samples. As is evident, cross-calling occurs mostly between normal tissues from the same organ system of origin.