Multimodal Machine Learning Workflows for Prediction of Psychosis in Patients With Clinical High-Risk Syndromes and Recent-Onset Depression

This prognostic study evaluates whether psychosis transition can be predicted in patients with clinical high-risk syndromes or recent-onset depression by multimodal machine learning that optimally integrates clinical and neurocognitive data, structural magnetic resonance imaging, and polygenic risk scores for schizophrenia.


Inclusion and exclusion criteria, study protocol and CONSORT chart
The study inclusion and exclusion criteria used to recruit patients with clinical high-risk (CHR) states for psychosis, patients with recent-onset depression (ROD), and healthy volunteers across the seven PRONIA sites are described in eTable 1. The geographical distribution of the study sites, the follow-up scheme of the study, and the CONSORT chart are shown in eFigure 1.
For the present study, the PRONIA consortium screened 5547 individuals for study eligibility between 2/15/2014 and 5/1/2017 across seven academic early recognition services, which cover a European catchment population of 5,384,000 persons. Among the screened cohort, 263 patients met inclusion criteria for CHR states and 246 patients fulfilled criteria for ROD. Ninety-six candidate CHR patients and 79 ROD patients had to be excluded from this cohort because (1) inclusion criteria could not be validated, (2) they met exclusion criteria (eTable 1), including maximum medication thresholds (eTable 2), (3) they did not finish the baseline examination, or (4) they were lost to follow-up after the 6-month visit. From the remaining study population of 167 CHR and 167 ROD individuals, a risk calculator discovery cohort (N=246) was extracted that comprised patients who had been followed for at least 18 months (PRONIA+18M sample) and therefore provided higher certainty concerning their transition-related outcomes. 1 The rest of the study population (N=88, PRONIA-18M sample), which consisted of non-transition cases only, was used as an independent validation sample to further probe the specificity of the trained risk prediction models. The rationale for this split of the PRONIA database was two-fold: First, we were able to assess the generalizability of our models in a population that was lost to follow-up earlier and thus more realistically approximated the real-world situation of early recognition services. Second, all predictive data domains were available in the PRONIA-18M validation cohort, in contrast to the external samples described below, which only partially overlapped with the PRONIA feature space. This complete overlap allowed us to validate the human raters' prognostic performance, the unimodal and multimodal risk calculators, the cybernetic model, and the prognostic sequence.
To train and validate the structural MRI-based risk calculator, the study analyzed the baseline imaging data of a total of 326 CHR and ROD individuals (discovery sample: 126 CHR/116 ROD patients; validation sample: 39 CHR/45 ROD patients), which were acquired based on a minimally harmonized structural MRI protocol (eTable 6) and passed the quality control described in section 1.3. For the genetic risk calculator, the DNA of 298 patients (discovery sample: 116 CHR/110 ROD patients; validation sample: 39 CHR/33 ROD patients) could be successfully extracted from the blood samples and genotyped as described in section 1.4. In addition, we extracted 334 healthy controls (HC) from the PRONIA database, which were matched one-to-one for age, sex, and site with the 334 clinical participants. Baseline study groups are compared in eTable 3.

Clinical and neurocognitive features
For the clinical and neurocognitive modelling part of the analysis, we used established assessment tools capturing phenotypes that were previously reported to be predictive of transition to psychosis. 2 The description of variables measuring these phenotypes are provided in eTable 5Error! Reference source not found., and, in summary, covered the following five phenotypic domains: i. Sociodemographic variables consisting of age and sex, ii. Interview-based psychopathological symptom assessments (Structured Interview for Psychosis-Risk Syndromes 3 with positive, negative, disorganized and general subscale items; Schizophrenia Proneness Instrument -Adult and Child/Youth version, 4 with items measuring cognitive attentional impediments (B items), cognitive disturbances (C items), disturbances in experiencing the self and surroundings (D items), perception disturbances (F items), and optional items with a positive predictive value of equal

MRI data acquisition and processing
When setting up the PRONIA study, we decided to record data representing the MR scanner and sequence heterogeneity encountered in real clinical settings. The aim of this strategy was to strengthen the generalizability and clinical applicability of the risk calculators developed in our machine learning analyses. Thus, a minimal MRI harmonization protocol was implemented across the PRONIA consortium that only required the sites to (1) acquire isotropic or nearly isotropic voxel sizes of preferably 1 mm resolution, (2) set the Field Of View (FOV) parameters as needed to guarantee full 3D coverage of the brain including all parts of the cerebellum, and (3) define the repetition time (TR) and echo time (TE) as well as other imaging parameters in a way that would maximize the contrast between the cortical ribbon and the white matter and enhance the signal-to-noise ratio of the images. The parameters defining the structural MR sequences across the 7 PRONIA sites are listed in eTable 6.
At each PRONIA site, all 660 images (329 clinical participants, 331 healthy controls) were visually inspected, automatically defaced, and anonymized using an in-house Freesurfer-based script prior to data centralization. Then, the open-source CAT12 toolbox (version r1155; http://dbm.neuro.unijena.de/cat12/), an extension of SPM12 16 designed for the processing and analysis of structural brain images, was used to segment the images into grey matter (GM), white matter (WM), and cerebrospinal fluid (CSF) maps, and then to high-dimensionally register them to the stereotactic space of the Montreal Neurological Institute (MNI-152 space). The manual of the CAT12 toolbox (http://www.neuro.unijena.de/cat12/CAT12-Manual.pdf) details all processing steps applied to the structural images. In summary, they consisted of (1) a first denoising step based on Spatially Adaptive Non-Local Means (SANLM) filtering; 17 (2) an Adaptive Maximum A Posteriori (AMAP) segmentation technique, which models local variations of intensity distributions as slowly varying spatial functions and thus achieves a homogeneous segmentation across cortical and subcortical structures; 18 (3) a second denoising step using a Markov Random Field approach, which incorporates spatial prior information of adjacent voxels into the segmentation estimation generated by AMAP; 18 (4) a Local Adaptive Segmentation (LAS) step, which adjusts the images for WM inhomogeneities and varying GM intensities caused by differing iron content in, e.g., cortical and subcortical structures; the LAS step is carried out before the final AMAP segmentation; (5) a partial volume segmentation algorithm that is capable of modeling tissues with intensities between GM and WM, as well as between GM and CSF, and is applied to the AMAP-generated tissue segments; (6) a high-dimensional DARTEL registration of the image to an MNI template generated from the MRI data of 555 healthy controls in the IXI database (http://www.braindevelopment.org). The registered GM images were multiplied with the Jacobian determinants obtained during registration to produce GM volume (GMV) maps. GMV maps were resliced to 3 mm isotropic voxel resolution and smoothed with a 4 to 8 mm full-width-at-half-maximum (FWHM) Gaussian kernel, with the kernel width being optimized in the subsequent machine learning analysis (see section 1.5.3). The Quality Assurance framework of CAT12 was used to quantitatively check the quality of the GMV maps (eFigure 10, A). This procedure produced a weighted quality score (from excellent [A] to failed [F]) for each image, which for 95.7% of the images used in the current study fell between B+ (81.6%) and B (14.1%). We excluded two images because they were rated C (satisfactory) due to white matter segmentation errors.
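To make the final reslicing and smoothing step concrete, the following Python sketch reproduces it on a synthetic image using nilearn; the CAT12 segmentation and DARTEL registration themselves were performed in MATLAB/SPM and are not part of this illustration, and the synthetic image is only a stand-in for an actual modulated GMV map.

```python
# Minimal sketch of the post-registration steps applied to a modulated GMV map:
# reslice to 3 mm isotropic voxels and smooth with a Gaussian kernel whose FWHM
# (4, 6 or 8 mm) is treated as a hyperparameter of the later machine learning step.
import numpy as np
import nibabel as nib
from nilearn import image

# Synthetic 1 mm image standing in for a DARTEL-registered, modulated GMV map
# (in practice this would be loaded from disk, e.g. image.load_img("gmv_map.nii")).
data = np.random.default_rng(0).random((121, 145, 121)).astype("float32")
gmv_map = nib.Nifti1Image(data, affine=np.eye(4))

# Reslice to 3 mm isotropic resolution.
resliced = image.resample_img(gmv_map, target_affine=np.diag((3.0, 3.0, 3.0)),
                              interpolation="continuous")

# Smooth with each candidate FWHM; the optimal kernel is selected later in CV1.
smoothed_maps = {fwhm: image.smooth_img(resliced, fwhm=fwhm) for fwhm in (4, 6, 8)}
```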

Genotyping
DNA could be extracted from the whole blood samples of 499 participants (25 PT [psychosis transition] and 273 NT [non-transition] patients [89.2% of the clinical study cohort], 209 healthy controls) who had provided blood for, and consented to, genetic testing. DNA was genotyped using Illumina's Infinium Global Screening Array-24 (GSA) BeadChip version 2 with Psych content. The GSA includes more than 650,000 markers and offers broad genomic coverage and imputation performance. The Psych content comprises 50,000 variants associated with common psychiatric disorders such as schizophrenia, bipolar disorder, and autism spectrum disorders. After standard, stringent quality control using PLINK 19 (e.g., sample call rate > 0.98; variant call rate > 0.98; minor allele frequency > 0.01; removal of variants deviating from Hardy-Weinberg equilibrium, P < 10E-6; sex check and heterozygosity outlier analysis), a total of 505,687 variants remained in the dataset.
Subsequently, the dataset was imputed with the Haplotype Reference Consortium panel (r1.1; www.haplotype-reference-consortium.org; n=32,470 samples) using the Positional Burrows-Wheeler Transform as implemented in the Sanger Imputation Server (imputation.sanger.ac.uk). To include only reliable variants in the polygenic risk score analysis, we excluded imputed variants with low prediction accuracy (i.e., info score < 0.6). After successful imputation, 7,420,913 variants were available in the PRONIA dataset.
Finally, we computed polygenic risk scores for schizophrenia (PRS-SCZ) by means of the "clumping plus threshold" method, employing the summary statistics of the SCZ genome-wide association study meta-analysis provided by the Psychiatric Genomics Consortium (PGC2, https://www.med.unc.edu/pgc/download-results/; n=36,989 patients and n=113,075 controls). Clumping was performed using PLINK to filter for independent genetic signals (r² < 0.1) within the target dataset having the highest association with the SCZ summary statistics, considering non-overlapping 250 kb regions across the genome. PRS-SCZ were computed as the sum of the risk alleles weighted by the association estimates for SCZ (beta of PGC-SCZ2), including all common variants (i.e., with minor allele frequency > 1%) in the clumped dataset at a given P value threshold. PRS were generated over a range of ten P value thresholds from the genome-wide significance threshold to the full model (i.e., 5.0E-8, 3.2E-7, 2.1E-6, 1.4E-5, 8.8E-5, 5.6E-4, 3.7E-3, 2.4E-2, 1.5E-1, 1), thus including an increasing number of variants (i.e., 155, 232, 389, 773, 1640, 1874, 11255, 35586, 119001, 324299). PRS were standardized to allow a direct comparison of PRS values across the sample at the different P value thresholds. Moreover, since PRS values can also be influenced by population structure, we extracted the first 10 principal components (PC) from the post-QC genotype data. The genotyping data were pruned prior to computation of the PCs to limit the effect of linkage disequilibrium (LD) across markers and thus to better represent the population structure in the eigenvectors. Genome-wide pruning was performed with PLINK using window sizes of 50 variants and steps of 5 variants, with a threshold of 0.5 for the R² correlation between paired variants. Both the 10 PCs modeling population structure and the 10 PRS-SCZ modeling the genetic liability for schizophrenia were included in the machine learning analysis described below.
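The scoring step of the "clumping plus threshold" approach can be illustrated with a short Python sketch; the dosage matrix and summary statistics below are simulated stand-ins, and clumping itself is assumed to have been carried out in PLINK as described above, so this only schematizes the weighting, thresholding and standardization logic.

```python
# Minimal sketch of "clumping plus threshold" polygenic scoring, assuming the
# variants have already been clumped in PLINK. 'dosages' holds per-subject risk
# allele dosages (subjects x variants), 'gwas' per-variant beta and P values of the
# discovery GWAS; both are hypothetical example inputs.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
variants = [f"rs{i}" for i in range(1000)]
dosages = pd.DataFrame(rng.integers(0, 3, size=(200, 1000)), columns=variants)
gwas = pd.DataFrame({"beta": rng.normal(0, 0.05, 1000),
                     "p": 10 ** rng.uniform(-9, 0, 1000)}, index=variants)

thresholds = [5.0e-8, 3.2e-7, 2.1e-6, 1.4e-5, 8.8e-5,
              5.6e-4, 3.7e-3, 2.4e-2, 1.5e-1, 1.0]

prs = {}
for t in thresholds:
    snps = gwas.index[gwas["p"] <= t]               # variants surviving the P threshold
    score = dosages[snps].to_numpy() @ gwas.loc[snps, "beta"].to_numpy()
    prs[t] = (score - score.mean()) / score.std()   # standardize across the sample

prs_matrix = pd.DataFrame(prs)                      # subjects x 10 PRS thresholds
```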

Overall strategy
In the first step of our machine learning strategy (see eFigure 2 and section 1.5.3), we trained an optimal sequential prognostic workflow for psychosis risk stratification that integrated unimodal and multimodal (stacked) algorithmic predictions with expert-based prognostic estimates in the PRONIA discovery sample (PRONIA+18M). To this end, we developed a sequential prognostic algorithm that identified the optimal sequence of predictive components to be combined into a stacked model, 20 as well as the optimal decision thresholds determining the cases to be propagated to the next predictive component in the sequence (see section 1.5.4). We tested the optimal prognostic algorithm as well as its predictive components by assessing their generalizability to geographically distinct patient samples following the internal-external validation approach recommended by Steyerberg and Harrell. 21 In our case, this approach took the form of a repeated nested leave-site-out cross-validation (LOSOCV) scheme, in which the inner pooled cross-validation cycle (CV1) trained and optimized the prognostic workflow's component models, and the workflow itself, while the outer leave-site-out cross-validation cycle (CV2) iteratively tested their generalizability to each held-back validation site (eFigure 3). This LOSOCV scheme captured the geographic transportability of our models within a European context because it measured model generalizability against multiple sources of population variance caused by the multi-center design of PRONIA (e.g. varying pathways to and organizational forms of mental healthcare systems, diverse genetic backgrounds of patients recruited at the different PRONIA sites, different scanner hardware and software, etc.). First, we assessed whether significant site-related variation existed in the follow-up duration, the age and sex distributions of our study population, as well as the time to transition in patients who developed psychosis (eTable 9). Then, we tested explicitly whether our models' prognostic estimates were moderated by image quality (eFigure 10), baseline treatments or treatment history (eTable 7), follow-up duration and availability (eTable 8, eFigure 11), as well as by site-related effects (eTable 10). Furthermore, our internal-external validation approach was extended to include a conservative validation of model specificity by applying all risk calculators and the prognostic workflow trained in the PRONIA+18M sample to the PRONIA-18M cohort (eFigure 2), which included non-transition patients with shorter follow-up intervals compared to the discovery cohort ( Table 1, main manuscript).
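As a structural illustration of this internal-external validation scheme, the following scikit-learn sketch nests a pooled inner cross-validation cycle (CV1) for hyperparameter selection inside an outer leave-one-group-out loop over site labels (CV2); the data are simulated and the actual analyses were run in NeuroMiner, so this only schematizes the cross-validation logic.

```python
# Structural sketch of the repeated nested leave-site-out cross-validation (LOSOCV):
# the outer loop (CV2) holds out one site at a time, the inner pooled loop (CV1)
# selects hyperparameters. Simulated data; the real analyses were run in NeuroMiner.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(334, 50))            # e.g. clinical-neurocognitive features
y = rng.integers(0, 2, size=334)          # transition (1) vs non-transition (0)
site = rng.integers(0, 7, size=334)       # 7 PRONIA recruitment sites

outer_cv = LeaveOneGroupOut()             # CV2: leave-site-out
inner_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)  # CV1

site_bac = []
for train_idx, test_idx in outer_cv.split(X, y, groups=site):
    grid = GridSearchCV(
        LinearSVC(class_weight="balanced", max_iter=20000),
        param_grid={"C": [2.0 ** k for k in range(-4, 5, 2)]},
        cv=inner_cv, scoring="balanced_accuracy")
    grid.fit(X[train_idx], y[train_idx])              # CV1: model/hyperparameter selection
    y_pred = grid.predict(X[test_idx])                 # CV2: held-out site prediction
    site_bac.append(balanced_accuracy_score(y[test_idx], y_pred))

print("leave-site-out BAC per site:", np.round(site_bac, 2))
```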
The second step of the analytical strategy was to gain a deeper insight into the signatures of the prognostic workflow models by: (1) extracting their reliable and significantly predictive elements using cross-validation ratio and sign-based consistency mapping (see section 1.5.5); (2) assessing whether the predictive signatures of the clinical-neurocognitive, sMRI-based and PRS-based models represented a deviation from normality by comparing the predicted outcome groups to age-, sex- and site-matched healthy volunteers (see section 1.5.6); (3) (a) exploring whether our risk calculators and clinical raters were influenced by study-group-related clinical-neurocognitive, PRS-based, and imaging-related differences between the CHR and ROD patients, and (b) evaluating whether the inclusion of the ROD group in the training of the unimodal and multimodal risk calculators improved prediction performance in the CHR sample (these analyses are described in section 1.5.7); and (4) (a) evaluating whether the estimates generated by the workflow components were not only predictive of psychosis but also of psychiatric diagnostic outcomes more broadly, and (b) assessing whether these estimates delineated distinct CHR syndrome trajectories, which were constructed from the patients' longitudinal data using unsupervised machine learning methods and linear mixed models (see section 1.5.8).
In the third analytic step, we assessed the statistical significance of our models' performance, which was necessary because of the limited number of transition cases and the imbalance of the outcome classes in our study sample. To maximize the sample size for this and the subsequent analysis step (external validation), we first retrained all risk calculators and evaluated their LOSOCV performance using the entire PRONIA cohort (N=334 cases). Then, we performed outcome label permutations and compared the permuted component risk calculators' and the workflow's leave-site-out performance against the respective models trained using the original labels. 22 Section 1.5.9 details this permutation analysis and Table 2 in the main manuscript reports the results.
In the final step of our analysis strategy, we tested the external validity of several workflow components, namely a condensed version of the clinical-neurocognitive risk calculator, the sMRI-based risk calculator, and a stacked risk calculator that combined the predictions of both former models. To this end, we applied the risk calculators retrained on the entire PRONIA sample to patients drawn from completely independent projects: the ZInEP, 23 FePsy, 24 and BEARS-Kid 25 studies.

Data domain-specific feature preprocessing
As outlined in eFigure 3, each data domain underwent specific preprocessing steps as part of model optimization. Importantly, to exclude any information leakage between the training, test, and validation samples, NeuroMiner wraps all preprocessing steps entirely into the nested cross-validation design. Within this design, the CV1 training samples are used to compute the preprocessing parameters (e.g. the mean and standard deviation of a z-normalization model), the CV1 test data are employed to select the optimally predictive model parameters, and the CV2 validation data serve to determine the leave-site-out generalizability of the models. Thus, the preprocessing and subsequent machine learning steps are jointly optimized and become integrated parts of the predictive model. eFigure 4 exemplifies this process for the training of the sMRI-based risk calculator. For the interview-based, patient-reported and neurocognitive data, the preprocessing involved the following procedures: (1) feature-wise scaling of the data matrix to the range [0,1], followed by (2) k-nearest-neighbor (kNN)-based imputation using the Jaccard distance as dissimilarity measure, and (3) standardization by mean-centering each feature and dividing it by the respective feature's standard deviation. The percentage of missing values in the data matrix was 3.2%.
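The following sketch illustrates how such a preprocessing chain can be wrapped into a cross-validated pipeline so that all parameters are estimated on training folds only; note that scikit-learn's KNNImputer does not offer the Jaccard distance used in NeuroMiner, so its default nan_euclidean metric is substituted here purely for illustration, and the data are simulated.

```python
# Sketch of the clinical-neurocognitive preprocessing chain wrapped into a CV pipeline
# so that scaling, imputation and standardization parameters are estimated on CV1
# training folds only (no leakage between folds).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.impute import KNNImputer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(334, 141))
X[rng.random(X.shape) < 0.032] = np.nan    # ~3.2% missing values, as in the data matrix
y = rng.integers(0, 2, size=334)

pipe = Pipeline([
    ("scale01", MinMaxScaler()),                       # (1) feature-wise scaling to [0, 1]
    ("impute", KNNImputer(n_neighbors=5)),             # (2) kNN imputation (metric differs, see above)
    ("standardize", StandardScaler()),                 # (3) mean-centering / unit variance
    ("svm", LinearSVC(class_weight="balanced", max_iter=20000)),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy"))
```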
For the sMRI data, the GMV maps were resliced to a 3 mm isotropic voxel resolution and smoothed with 4, 6 and 8 mm full-width-at-half-maximum Gaussian kernels (eFigure 4). Then, the resliced and smoothed maps were age-corrected using partial correlation analysis. The smoothed, resliced and age-corrected GMV maps were thresholded for between-scanner voxel reliability, as described previously. 26 In brief, we applied generalizability theory 27 to the GMV maps of six travelling healthy volunteers to produce a g coefficient map that measured the voxels' between-site reliability. This map was resliced and smoothed as described above and applied to the patients' smoothed, resliced and age-adjusted GMV maps to remove unreliable voxels at the 25%, 50% and 75% percentile thresholds. Finally, the dimensionality of the thresholded maps was reduced by means of principal component analysis (PCA), 28 projecting the image information onto 15, 20 and 25 eigenvariates. The patients' eigenvariate scores were standardized and then forwarded to the wrapper-based machine learning algorithm described below.
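A schematic of the reliability thresholding and PCA reduction step is given below; the GMV matrix and g coefficient map are simulated placeholders, and in the actual analysis these operations were fitted within the CV1 training data of the nested cross-validation rather than on the full sample.

```python
# Sketch of the voxel reliability thresholding and PCA reduction step. 'gmv' is a
# subjects x voxels matrix of smoothed, resliced, age-adjusted GMV values and 'gmap'
# the voxel-wise g coefficient map from the travelling-volunteer analysis.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
gmv = rng.normal(size=(326, 5000))
gmap = rng.random(5000)

percentile, n_components = 50, 20                    # candidate hyperparameters: {25, 50, 75} / {15, 20, 25}
reliable = gmap >= np.percentile(gmap, percentile)   # remove unreliable voxels
gmv_thr = gmv[:, reliable]

pca = PCA(n_components=n_components)
eigenvariates = pca.fit_transform(gmv_thr)                       # project images onto principal components
eigenvariates = StandardScaler().fit_transform(eigenvariates)    # standardized scores for the SVM wrapper
```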
For the genetic data domain, we used partial correlation analysis to correct the PRS score matrix feature-wise for ancestry-related variance using the 10 PCA-based ancestry components. The corrected PRS scores were then standardized before entering the subsequent machine learning analysis steps.
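The ancestry correction can be sketched as a simple residualization of each PRS column on the 10 ancestry principal components; the arrays below are simulated, and in the actual analysis the correction model was estimated within the training data of the nested cross-validation.

```python
# Sketch of the partial-correlation style correction: each PRS column is regressed on
# the 10 ancestry principal components and replaced by its residuals before
# standardization. 'prs' and 'ancestry_pcs' are simulated placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
prs = rng.normal(size=(298, 10))            # 10 PRS-SCZ thresholds
ancestry_pcs = rng.normal(size=(298, 10))   # first 10 genotype principal components

residualizer = LinearRegression().fit(ancestry_pcs, prs)        # fit on training data only in practice
prs_adjusted = prs - residualizer.predict(ancestry_pcs)         # remove ancestry-related variance
prs_adjusted = StandardScaler().fit_transform(prs_adjusted)     # standardize before the SVM analysis
```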

Machine learning algorithms and feature selection
Wrapper-based feature selection strategies 29 using linear, non-kernelized, L2-regularized, L1-loss support vector machines (SVM; provided by the LIBLINEAR library 30 in NeuroMiner) were used to find the optimal set of predictive features for the three unimodal risk calculators. We employed cost-sensitive SVM formulations to automatically account for the highly unbalanced class distribution in our study population. Models were first trained using the CV1 training data and then selected based on their prediction performance in the entire CV1 data partition (training + test data). More specifically, the wrapper-based feature selection process was repeated for each hyperparameter combination and CV1 partition, thus producing a bag of models with variable feature sets. We picked optimally predictive models from this bag by first computing the SVM models' average balanced accuracy (BAC), defined as

$$\mathrm{BAC} = \frac{\mathrm{sensitivity} + \mathrm{specificity}}{2},$$

across the k=50 (=5 repetitions × 10 folds) models available in the bag for each hyperparameter combination. Then, we combined these performance measures into a hyperparameter performance profile as shown in eFigure 4. For the sMRI-based risk calculator, the hyperparameter space consisted of (i) three Gaussian smoothing kernel widths of 4, 6 and 8 mm, (ii) three g map percentile thresholds of 25%, 50%, and 75%, (iii) three dimensionalities defined for data reduction using PCA (15, 20 and 25 principal components), and (iv) five SVM C parameters picked from the range of 2^-4 to 2^+4 with an exponent stepping of 2, totaling 135 parameter combinations.
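The selection criterion can be sketched as follows; for simplicity, each model is evaluated on its held-out CV1 test fold only (rather than on the entire CV1 partition as described above), the data are simulated, and only the C parameter is varied.

```python
# Sketch of the model selection criterion: for each hyperparameter combination, the
# balanced accuracy BAC = (sensitivity + specificity) / 2 is computed for every one of
# the k = 50 (5 repetitions x 10 folds) CV1 models and averaged into a hyperparameter
# performance profile. Cost-sensitive weighting mirrors a class_weight='balanced'
# SVM formulation.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(4)
X, y = rng.normal(size=(246, 40)), rng.integers(0, 2, size=246)

cv1 = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
profile = {}
for C in [2.0 ** k for k in range(-4, 5, 2)]:          # C in {2^-4, 2^-2, ..., 2^4}
    bag = []
    for tr, te in cv1.split(X, y):                     # k = 50 CV1 models per C value
        model = LinearSVC(C=C, class_weight="balanced", max_iter=20000).fit(X[tr], y[tr])
        bag.append(balanced_accuracy_score(y[te], model.predict(X[te])))
    profile[C] = np.mean(bag)                          # mean BAC across the bag

best_C = max(profile, key=profile.get)
```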
The wrapper-based feature selection strategies varied per data domain: For the optimization of the clinical-neurocognitive risk calculator, we performed a greedy sequential forward search (SFS) 29 at each SVM C regularization parameter, with early stopping once 10% of the features had been selected. The rationale was to find a restricted feature set among the 141 input variables (see eTable 5) that would most economically predict the outcome labels and hence would be more suitable for real-world application than the full feature set. After validating the model using LOSOCV, we used sign-based consistency mapping (section 1.5.5) to identify significant variables and retrained the risk calculator using only those. This reduced model was then externally validated in the ZInEP and BEARS-Kid samples and became part of a clinically scalable prognostic workflow (see section 1.5.4).
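A minimal sketch of such a greedy forward search with early stopping at 10% of the features is given below; it uses a single train/test split and simulated data instead of the full CV1 structure and NeuroMiner wrapper, so it only illustrates the selection logic.

```python
# Minimal sketch of the greedy sequential forward search (SFS): features are added one
# at a time, keeping the candidate that maximizes balanced accuracy on held-out data,
# and the search stops once 10% of the features have been selected.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(5)
X, y = rng.normal(size=(246, 141)), rng.integers(0, 2, size=246)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

max_features = int(0.10 * X.shape[1])      # early stopping at 10% of the 141 features
selected, remaining = [], list(range(X.shape[1]))

while len(selected) < max_features:
    scores = {}
    for j in remaining:                    # try adding each remaining feature in turn
        cols = selected + [j]
        model = LinearSVC(class_weight="balanced", max_iter=20000).fit(X_tr[:, cols], y_tr)
        scores[j] = balanced_accuracy_score(y_te, model.predict(X_te[:, cols]))
    best = max(scores, key=scores.get)     # greedily keep the best-performing addition
    selected.append(best)
    remaining.remove(best)

print("selected feature indices:", selected)
```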
Similarly, we applied SFS to the processed imaging data to select parsimonious combinations of principal components that were most predictive of transition to psychosis. To mitigate the risk of overfitting, we did not apply the SFS algorithm at all 135 parameter combinations of the optimization problem. Instead, in each CV1 partition, we selected the best three SVM models across this parameter range and optimized them using SFS. To further reduce the risk of overfitting, this feature selection process stopped when the increase in prediction performance started leveling out, as detected by means of a knee-point detection algorithm, or when 50% of the features had been selected; we used 50% instead of 10% because, unlike clinical and neurocognitive data, which are laborious to collect, there is no clinical need to maximally reduce the feature set derived from an MRI scan.
Finally, we used greedy sequential backward elimination (SBE) 29 on the ancestry-adjusted PRS data to remove noise and variance unrelated to the prediction task. The SBE algorithm is less aggressive than the greedy forward search approach used for the clinical-neurocognitive and imaging domains and thus more suitable for feature sets characterized by a high degree of collinearity. The SBE search was stopped when 50% of the PRS features had been removed from the feature pool.
To further reduce the risk of wrapper overfitting, we did not select a single winning model from the BAC profile computed on the combined CV1 training and test data, but picked the best three ensemble models, which were retrained using the entire CV1 partition data and then combined into a bag of SVM ensembles (see eFigure 4). This ensemble bag was then applied without any further modification to the CV2 validation data of a given held-back PRONIA site, thus producing decision scores for each CV2 validation patient, which were further accumulated across the CV2 leave-site-out partitions to produce a bag of leave-site-out decision scores. This bagged ensemble finally predicted a CV2 patient's outcome based on the class membership probability calculated by means of majority voting. 31 The median across the ensemble of decision scores constituted the out-of-training (OOT) decision score of each CV2 validation patient.
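The aggregation of the ensemble outputs can be sketched as follows; the decision score matrix is a simulated placeholder for the bag of leave-site-out decision scores described above.

```python
# Sketch of the out-of-training (OOT) aggregation step: the bag of SVM ensemble
# decision scores collected for a CV2 validation patient is turned into a class
# prediction by majority voting and into an OOT decision score by taking the median.
import numpy as np

rng = np.random.default_rng(6)
decision_scores = rng.normal(size=(88, 150))            # patients x ensemble members

votes = np.sign(decision_scores)                        # each ensemble member votes via its score sign
prob_transition = (votes > 0).mean(axis=1)              # class membership probability (majority voting)
predicted_class = (prob_transition >= 0.5).astype(int)  # 1 = predicted transition
oot_score = np.median(decision_scores, axis=1)          # median decision score = OOT score per patient
```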
Then, we built three different stacking models that integrated modality-specific models as follows: i. a model combining the full clinical-neurocognitive, sMRI-based, and PRS-based models, ii. a model combining clinical-neurocognitive, sMRI-based, PRS-based, and human expert predictions, and finally iii. a stacked model combining the condensed clinical-neurocognitive model and the sMRI model for external validation (see section 1.5.10).
For the training of the stacked models, we employed linear-kernel, L2-regularized, L1-loss SVMs as provided by the LIBSVM library, without wrapper-based feature selection. 32 The input features of the stacking algorithm were the CV1-test decision scores, which were extracted across the three unimodal outcome predictors and aggregated into a new training data matrix using the same CV1 cross-validation structure as for the training of the unimodal classifiers. This new CV1 training score matrix was scaled to the range [-1, 1] using its column-wise minima and maxima, and the resulting scaling model was then applied to the respectively aggregated CV1-test and CV2-validation score matrices. Because of cases with missing data in the PRS and structural MRI domains, we applied kNN-based imputation to the training, test and validation matrices as described above for the preprocessing of the clinical-neurocognitive data domain. Iterating through the CV1 cycle, the scaled and imputed training data matrices were forwarded to the SVM algorithm, which determined the optimal C parameter within the range of 2^-4 to 2^+4 (integer exponent steps) using the same optimization process as in the unimodal classification analyses. The final outcome prediction for the CV2 validation patients was carried out as described above. Furthermore, following the principles of expert-augmented machine learning, 33 we extended this algorithmic stacking procedure by incorporating the raters' prognostic estimates into the decision score matrix. These estimates were coded as yes/no replies to the question 'Do you think that the patient is likely to transition to psychosis?', which our raters answered at the end of the baseline examination. Inspired by Norbert Wiener's foundational work on cybernetics, 34,35 we use the term 'cybernetic risk calculator' throughout this work to differentiate this type of stacked risk calculator from purely algorithmic multimodal prediction.
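A schematic of such a stacked (or, with rater estimates included, 'cybernetic') meta-learner is given below; the unimodal decision scores and rater replies are simulated, and the pipeline mirrors the scaling, imputation and linear-kernel SVM steps described above using scikit-learn rather than LIBSVM.

```python
# Sketch of the stacking meta-learner: unimodal OOT decision scores (plus, optionally,
# rater estimates) form the feature matrix of a linear-kernel SVM, after column-wise
# scaling to [-1, 1] and kNN imputation of missing modalities.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(7)
n = 246
scores = np.column_stack([
    rng.normal(size=n),                          # clinical-neurocognitive decision scores
    rng.normal(size=n),                          # sMRI-based decision scores
    rng.normal(size=n),                          # PRS-based decision scores
    rng.integers(0, 2, size=n).astype(float),    # rater estimates (cybernetic variant)
])
missing = rng.random(n) < 0.1
scores[missing, 2] = np.nan                      # e.g. missing PRS modality
y = rng.integers(0, 2, size=n)

stacker = Pipeline([
    ("scale", MinMaxScaler(feature_range=(-1, 1))),
    ("impute", KNNImputer(n_neighbors=5)),
    ("svm", SVC(kernel="linear", class_weight="balanced")),
])
grid = GridSearchCV(stacker, {"svm__C": [2.0 ** k for k in range(-4, 5)]},
                    cv=5, scoring="balanced_accuracy").fit(scores, y)
```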

Case propagation-based sequential stacking
Building on our previous work, 26 we implemented a sequential stacking algorithm in NeuroMiner to generate cost-effective prognostic machine learning tools for clinical care. The algorithm optimizes the application sequence of predictive models so that prognostic accuracy is maximized while the number of per-case examinations needed to achieve this performance is reduced. More specifically, the objective of the sequence optimizer is to search through the space of all stacking combinations (see eTable 17) while intelligently choosing the individuals for whom prognostic (or diagnostic) accuracy would improve from the addition of new modalities to the stack. The stacking combinations were divided into four sequential nodes representing unimodal predictions, predictions based on two data modalities, predictions based on three modalities, and predictions based on four modalities (i.e., the decision scores produced by the clinical-neurocognitive, PRS-based, and sMRI-based risk calculators, as well as the estimates provided by the clinical raters). To identify individuals who need additional examinations, we employed a method that maximizes the decision score margin between cases of opposite classes within specific percentile windows.
Specifically, for a given predictive sequence (eTable 17), the algorithm first ranks the training cases according to their decision scores in the current prediction node and then, starting from the decision boundary with a 5% step width, determines the optimal upper and lower decision score percentiles for which case propagation to the next prediction node would maximize the decision score margin between cases of opposite classes. Establishing increasing percentile windows around the decision boundary places a gradient from individuals with the most ambiguous decision scores to individuals with an unequivocally predicted class membership. This procedure is repeated across all subsequent prognostic nodes of the given sequence, for all prognostic sequences to be tested, and for each CV1 data partition, so that a performance profile can be computed across the three-dimensional hyperparameter cube of the algorithm, i.e. the sequence pool and the lower (L) and upper (U) case propagation percentiles. In the current work, we defined 64 candidate sequences (eTable 17) and three L(-)/U(+) thresholds around the decision boundary ([±15%, ±25%, ±50%] of cases), resulting in 64 × 3 × 3 = 576 hyperparameter combinations. The winning prognostic sequence was defined by the hyperparameter combination that was most frequently chosen by the algorithm based on the BAC as optimization criterion (eFigure 14, I). This sequence was further analyzed in terms of the number and percentage of propagated PT and total cases, the predicted classes, and the prognostic performance per examination node (eFigure 14, II).
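The core propagation rule can be sketched as follows; this is a deliberate simplification that forwards a symmetric percentile window of the most ambiguous cases (those with the smallest absolute decision scores), whereas the NeuroMiner implementation optimizes separate lower and upper percentile cut-offs that maximize the between-class decision score margin.

```python
# Simplified sketch of the case propagation rule of the sequential stacking algorithm:
# cases whose decision score in the current prediction node lies inside a percentile
# window around the decision boundary (the most ambiguous cases) are forwarded to the
# next examination node, while unambiguous cases keep their current prediction.
import numpy as np

def propagate(scores, window):
    """Return a boolean mask of cases to forward to the next prognostic node."""
    ranks = np.argsort(np.argsort(np.abs(scores)))     # rank by distance from the boundary (0)
    cutoff = int(window * len(scores))                  # e.g. window = 0.25 -> 25% most ambiguous cases
    return ranks < cutoff

rng = np.random.default_rng(8)
node1_scores = rng.normal(size=246)                     # e.g. clinical-neurocognitive node
forward = propagate(node1_scores, window=0.25)
print(f"{forward.sum()} of {forward.size} cases propagated to the next examination node")
```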
Furthermore, to facilitate translation into clinical care, we tested whether the integration of sparsity regularization in the sequence optimization process would reduce the number of examinations needed to achieve levels of prognostic performance similar to the non-regularized sequence. Examination cost was operationalized as the fraction of cases reaching the final examination step of the given sequence. More specifically, following the regularization approach proposed by Boardman and Trappenberg, 36 we down-weighted BAC by examination cost as follows:

$$\mathrm{BAC}_{\Gamma}(S) = \mathrm{BAC}(S) - \Gamma \cdot \left(\frac{n_S}{N}\right)^{\beta},$$

where $n_S$ denotes the number of individuals among the total of $N$ cases that reached the final examination step of sequence $S$. The regularization parameter $\beta$ was always kept at 1, while the regularization strength Γ was set to either 0.5 or 1.0, thus imposing increasing levels of examination sparsity on the optimal sequence identification process (eFigure 15, A). The OOT performance of the two newly optimized prognostic sequences was measured using leave-site-out cross-validation in the PRONIA+18M and complete PRONIA cohorts and is displayed in terms of false positive rate, sensitivity, specificity and BAC in eFigure 15, C1-C2. Furthermore, we compared the CV2-level BAC measures of the two new sequences to the BAC of the non-regularized workflow using the Wilcoxon paired signed rank test (eFigure 15, C2). In addition, we analyzed the fraction of patients reaching each examination node (eFigure 15, B) to assess the effect of regularization strength on examination sparsity. Finally, we tested whether the replacement of the full clinical-neurocognitive model with the condensed version negatively impacted the regularized and non-regularized workflows' false positive rate in the PRONIA-18M sample (eFigure 15, C1).

Visualization of predictive pattern elements using ensemble learning
Two computational approaches, as implemented in NeuroMiner, were used to visualize the predictive pattern elements of the unimodal and stacked risk calculators trained on the PRONIA+18M sample. First, we calculated the pattern element stability, termed cross-validation ratio (CVR), by computing the mean and standard error of all SVM weight vectors concatenated across the entire nested cross-validation structure (eFigure 3). The cross-validation ratio as a measure of pattern stability was inspired by the bootstrap ratio commonly used in the Partial Least Squares literature and described in Krishnan et al. 37 Similarly to the bootstrap ratio, the CVR of pattern element $j$, be it a clinical-neurocognitive variable, a polygenic risk score, an image voxel or the decision score of a unimodal risk calculator, was defined as

$$\mathrm{CVR}_j = \frac{\frac{1}{M}\sum_{m=1}^{M} \hat{w}_{j,m}}{\sigma_{\hat{w}_j} / \sqrt{M}}, \qquad M = P_1 \times K_1 \times P_2 \times K_2,$$

where $M$ is the size of the SVM ensemble, $P_1$ is the number of CV1 permutations, $K_1$ the number of CV1 folds, $P_2$ the number of CV2 repetitions, $K_2$ the number of CV2 folds, $\hat{w}_{j,m}$ the $j$th element of the $m$th normalized weight vector $\hat{w} = w / \lVert w \rVert$ in the SVM ensemble, 38-40 and $\sigma_{\hat{w}_j}$ the standard deviation of $\hat{w}_{j,m}$ across the ensemble. Akin to Z-scores, the CVR vectors or images (see Figures in main manuscript) were thresholded at CVR=±3 to delineate stable pattern elements across the cross-validation experiment.
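Assuming the mean-over-standard-error form given above, the CVR computation can be sketched as follows; the weight matrix is a simulated placeholder for the concatenated SVM weight vectors of the nested cross-validation structure.

```python
# Sketch of the cross-validation ratio (CVR) computation: weight vectors of all SVMs
# in the ensemble are L2-normalized, and the CVR of each pattern element is its mean
# weight divided by the standard error across the ensemble; elements with |CVR| >= 3
# are considered stable.
import numpy as np

rng = np.random.default_rng(9)
weights = rng.normal(size=(500, 141))                                # ensemble members x features

w_norm = weights / np.linalg.norm(weights, axis=1, keepdims=True)    # w / ||w||
M = w_norm.shape[0]
cvr = w_norm.mean(axis=0) / (w_norm.std(axis=0, ddof=1) / np.sqrt(M))

stable = np.abs(cvr) >= 3                                            # threshold at CVR = +/-3
```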
Furthermore, we implemented a sign-consistency-based method to statistically probe the relevance of clinical-neurocognitive variables, polygenic risk scores or image voxels in our SVM ensembles. To this end, we adopted the approach proposed by Gómez-Verdejo et al. 41 and extended it toward wrapper-based feature selection strategies. The variable importance $I_j$ of the $j$th pattern element in the three unimodal risk calculators was defined as

$$I_j = \frac{\lvert M_j^{+} - M_j^{-} \rvert}{M} \cdot \left( 1 - \frac{M_j^{0}}{M} \right),$$

where $M_j^{+}$ and $M_j^{-}$ denote the numbers of SVMs in the ensemble of size $M$ that assigned a positive or negative weight to the $j$th pattern element, and $M_j^{0}$ the number of SVMs that de-selected the element. The first part of the equation measures the consistency of the weights assigned by the SVM ensemble to the given pattern element. The importance is reduced by the second part of the equation, which measures the fraction of SVMs that de-selected the given pattern element during the wrapper-based optimization process (see section 1.5.3). Hence, $I_j = 1$ when the weights of the $j$th pattern element all share the same sign and the element has been selected by all classifiers in the ensemble, and $I_j = 0$ when positive and negative weights occur equally across the ensemble or the given pattern element has been omitted by all classifiers in the ensemble. We defined a hypothesis test for $I_j$ with the null hypothesis $H_0$ that positive and negative weights occur with equal probability for the $j$th element, i.e. that the element carries no consistent predictive information. Significance thresholds for this hypothesis test were derived using the Z-statistic

$$Z_j = \frac{I_j - \mu_0}{\sigma_0},$$

where $\mu_0$ and $\sigma_0$ denote the expected value and standard deviation of $I_j$ under $H_0$. We used the normal cumulative distribution function to obtain the right-tailed P value corresponding to the Z-score of the variable importance $I_j$. The P values were corrected for multiple comparisons using the false-discovery rate 42 and statistical significance was defined at α=0.05.

Predictive signature validation
As in our previous work, 26 we used the HC sample to assess whether the multivariate signatures of our risk calculators (i.e., the clinical-neurocognitive variables, voxels of GMV increments/reductions or ancestry-adjusted PRS increments/reductions selected as the most predictive of PT) could be interpreted as patterns indicating a deviation from normality. For the clinical-neurocognitive risk calculator, we drew a sample of 244 HC participants from the age-, sex- and site-matched HC cohort (eTable 3) for whom no more than 25% of the 141 variables used in the training of the clinical-neurocognitive risk calculator were missing. 43 The following preprocessing steps were performed to compare the patients' data to HC: First, the data matrix of the HC participants was scaled to [0,1] and missing values were imputed using the same nearest-neighbor-based method as in the main risk calculator analysis. Then, the mean ($\bar{x}_{HC}$) and standard deviation ($s_{HC}$) were computed for each of the reliably PT-predictive variables in the scaled and imputed data matrix of the HC sample (see CVR profile of clinical-neurocognitive variables in Figure 1, A, main manuscript). Finally, the HC-based scaling and imputation models were applied to the data of the clinical study participants. The HC-based means and standard deviations were used to standardize the scaled and imputed data of the HC, CHR, and ROD participants using the formula $z = (x - \bar{x}_{HC}) / s_{HC}$, thus creating a measure of deviation from normality across these variables. Several variables (DANVA, CTQ item 13, DSST, SVF and PVF, ROCF, SOPT, WAIS) were inverted to facilitate score interpretation, i.e. a higher positive score corresponded to a higher PT-related abnormality of the given variable. The resulting abnormality profiles were visually compared between HC, predicted PT and predicted NT patients (eFigure 7); the latter two groups were defined by the prognostic assignments of the clinical-neurocognitive risk calculator. Then, the three groups were statistically compared using multivariate analysis of variance (MANOVA) as provided by the Statistical Package for the Social Sciences (SPSS v.25, IBM Inc.). The right side of eFigure 7 shows the results of the between-group ANOVAs, which were conducted following a significant omnibus test. The P values of these ANOVAs were corrected for multiple comparisons using the false-discovery rate. In significant ANOVAs, we performed pairwise post hoc analyses and adjusted the obtained P values for multiple comparisons using Tukey's Honestly Significant Differences (HSD) method.
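The deviation-from-normality scoring can be sketched as follows; the HC and patient matrices are simulated, and the kNN imputation again uses scikit-learn's default metric rather than the Jaccard distance employed in NeuroMiner.

```python
# Sketch of the deviation-from-normality scoring: scaling and imputation models are
# fitted on the healthy control (HC) sample only and then applied to patients, and
# each predictive variable is standardized with the HC mean and standard deviation,
# z = (x - mean_HC) / sd_HC.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

rng = np.random.default_rng(10)
X_hc, X_pat = rng.normal(size=(244, 141)), rng.normal(size=(334, 141))
X_pat[rng.random(X_pat.shape) < 0.03] = np.nan

scaler = MinMaxScaler().fit(X_hc)                        # HC-based scaling model
imputer = KNNImputer(n_neighbors=5).fit(scaler.transform(X_hc))

hc_scaled = imputer.transform(scaler.transform(X_hc))
pat_scaled = imputer.transform(scaler.transform(X_pat))

mu, sd = hc_scaled.mean(axis=0), hc_scaled.std(axis=0, ddof=1)
z_pat = (pat_scaled - mu) / sd                           # abnormality relative to HC norms
```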
For the PRS-based risk calculator, we conducted a similar MANOVA by defining the 10 ancestry-adjusted PRS scores for schizophrenia computed at the different genome-wide significance thresholds (see section 1.4) as dependent variables. A three-group dummy matrix consisting of the two predicted outcome groups and a subgroup of 209 healthy volunteers for whom PRS were available was entered as a fixed factor into the analysis. After assessing the omnibus test significance using Wilks' λ, univariate ANOVAs were computed for each PRS score and an alpha correction was carried out using the FDR. In significant ANOVAs, post hoc comparisons were performed to evaluate pairwise group differences. Tukey's HSD method was employed to correct the alpha level of the pairwise comparisons. Additionally, we repeated the analysis using the observed transition labels and the baseline study groups to test the specificity of findings. All PRS-related MANOVA results are shown in eFigure 8.
For the sMRI-based risk calculator, we performed voxel-based morphometry (VBM) with threshold-free cluster enhancement (TFCE) 44 in Statistical Parametric Mapping (SPM12) 16 to compare the predicted outcome samples to our site-, age- and sex-matched sample of 331 healthy volunteers. To this end, we smoothed the study participants' GMV maps with a 6 mm Gaussian kernel, proportionally scaled them to their global GMV, and masked the images with a binarized version of the CVR map shown in Figure 1 (main manuscript) to restrict the analysis to stable parts of the sMRI-based risk calculator's signature. The smoothing kernel width of 6 mm was chosen based on the parameter combination that was most frequently selected across the sMRI-based SVM ensemble. The smoothed, scaled and masked GMV maps entered a non-parametric analysis of variance as implemented in the TFCE toolbox for SPM12 (http://www.neuro.uni-jena.de/tfce/). For each of the three comparisons [predicted PT vs. HC], [predicted NT vs. HC], and [predicted PT vs. predicted NT], a voxel-wise null distribution of TFCE values was computed using 5000 random label permutations. The resulting P value maps were log-transformed, corrected using the false-discovery rate and visualized in eFigure 9.

Diagnostic moderators of prognostic performance
To evaluate whether differences between the CHR and ROD groups had influenced the prognostic performance of our unimodal, stacked and cybernetic risk calculators, we first replaced the transition outcomes with the diagnostic labels in all classification experiments conducted in the complete PRONIA sample and re-trained all risk calculators with the identical algorithmic setup used for the prognostic analyses (eTable 12). Then, we computed a cross-correlation matrix of the prognostic and differential diagnostic OOT decision scores and inspected the pairwise explained variances R 2 and corresponding P values (eTable 13). An alpha correction was carried out using FDR.
Furthermore, we evaluated whether training our risk calculators on a broader risk population, which encompassed both CHR and ROD patients, led to better prediction performance in the CHR sample compared with training and cross-validating only in the CHR population. We hypothesized that widening the risk spectrum by including the lower-risk ROD patients (1.8% transitions compared with 13.8% in the CHR group) would lead to a more accurate and generalizable prediction of the CHR patients' transition risk across all data domains. To test this hypothesis, we conducted two sets of analyses in the complete PRONIA cohort, using the identical algorithmic setup as for the main analyses. First, we removed all ROD patients from our sample and retrained and cross-validated all unimodal risk calculators and the stacked model using only the CHR patients. Then, we tested whether the reduced model performance observed in these analyses was simply caused by the reduction of the training sample size. To this end, we replaced the 167 ROD patients in the complete PRONIA cohort with 167 age-, sex- and site-matched HC individuals drawn from our HC cohort (eTable 3). Again, we retrained all unimodal and stacked risk calculators and measured the prognostic performance in the CHR sample using LOSOCV. The results of these ROD depletion and substitution analyses are reported in eTable 14 alongside the original performances of the risk calculators in the CHR group. Furthermore, to statistically test our hypothesis, we performed a Quade test for each unimodal risk calculator and the stacked model. If the given omnibus test was significant at α=0.05, we conducted pairwise post hoc tests of performance differences between the original model and the 'ROD-depleted' classifier, between the original model and the 'ROD-replaced' classifier, and between the 'ROD-depleted' and 'ROD-replaced' models. The obtained P values were corrected for multiple comparisons using the false-discovery rate and are visualized in eFigure 12.

Prognostic generalization of raters and risk calculators to outcome targets beyond PT
Similar to Koutsouleris et al. 2018, 26 we assessed whether our raters' estimates and the risk calculators' predictions of PT generalized to other prediction targets. First, we tested whether their psychosis transition (PT) estimates were sensitive to non-remission or de-novo occurrence of CHR syndromes, as an intermediate outcome between transition and CHR symptom remission (eTable 15). More specifically, we created a new three-group outcome label that differentiated between transition to psychosis (PT), non-remission/de-novo occurrence of CHR syndromes (P-CHR), and asymptomatic courses (NP-CHR). The P-CHR outcome was defined by the presence of PRONIA CHR criteria in at least one follow-up examination conducted after the 6-month follow-up interval. Consequently, NP-CHR patients had to show no CHR criteria at any of the visits after the 6-month timepoint. Then, we used a non-parametric permutation approach to test whether the raters' outcome estimates or the OOT decision scores of the risk calculators trained to differentiate between PT and NT cases also significantly separated the outcomes in the three classification tasks 'PT vs. P-CHR', 'PT vs. NP-CHR', and 'P-CHR vs. NP-CHR'. We established significant discriminative effects by computing the null distributions of BAC using 5000 conjoint permutations of all outcome predictions and comparing the observed BAC of the respective prediction models in the given classification tasks to their respective null distributions. Based on the conjoint permutation approach, it was also possible to evaluate whether our binary risk prediction models were differentially associated with the three-group outcome label: we computed the null distributions of BAC differences (ΔBAC) and compared the observed ΔBAC to these distributions. Alpha corrections were carried out for these two analysis steps using the FDR and significance was established at α=0.05.
Secondly, we evaluated whether the prognostic estimates of our raters and risk calculators, as well as the observed outcome classes, were associated with more fine-grained CHR and functional syndrome trajectories. Similar to Koutsouleris et al. 2018, 26 these syndromes were extracted from the clinical baseline data using unsupervised machine learning methods. More specifically, we first scaled the patients' psychometric baseline data, including items from the SIPS, SPI-A, and the GF:S and GF:R questionnaires, to the range [0, 1] and imputed missing data using the approach described in section 1.5.2. We applied orthogonal non-negative matrix factorization 46 to the scaled and imputed matrix in order to reduce it to four factor scores (eFigure 13, A). After sorting the features according to their factor weights, we interpreted the four factors as paranoid-perceptual disturbances (F1), disturbances of volition and affect (F2), functional disturbances (F3) and cognitive disturbances (F4). Then, the scaling, imputation and factorization models were applied without further modification to the clinical data available for the T1, T2, T3 and T4 timepoints to produce follow-up scores for each of the four factor domains. Finally, a matrix of univariate linear mixed-effects models was computed to detect differences between prognostic assignments (eFigure 13, B1) and observed outcomes (eFigure 13, B2). In each of the 24 (=4 factors × 6 predictors) analyses, we defined the respective factor score as the response variable and added intercept, group (predicted or observed) and its interaction with timepoint as fixed effects, while subject- and site-level intercepts were entered as random effects into the model. An alpha correction of the P values for the main and interaction effects was performed using the FDR. Statistical significance was defined at α=0.05.
Thirdly, we evaluated whether the raters' and risk calculators' transition estimates were predictive of non-psychotic diagnostic outcomes over the follow-up period of the study. For each of the affective, anxiety, substance dependency and eating disorder sections of the DSM-IV-TR, 45 we distinguished between four possible outcomes: 'no diagnostic criteria met', 'remission from baseline diagnosis', 'non-remission from baseline diagnosis', and 'de-novo occurrence of diagnosis'. These outcomes were delineated in the following order: First, 'de-novo occurrence of diagnosis' was defined as meeting DSM-IV criteria for any condition in the given diagnostic section during follow-up but not at baseline. If the criteria for de-novo occurrence were not fulfilled, 'non-remission from baseline diagnosis' for the given diagnostic section was considered. Non-remission was defined as symptomatic criteria being met both at baseline and at least at one follow-up examination (T1, T2, T3, or T4; see eFigure 1) for any diagnostic entity in the given section. Finally, if these two outcomes were not observed, remission was considered, defined as the criteria for any disease in the given diagnostic domain being met at baseline but no longer at any follow-up timepoint. Major depressive disorder was treated as a separate domain because of its high prevalence in the study population. In addition, we analyzed whether the observed transition vs. non-transition outcomes were associated with these non-psychotic outcomes. eTable 16 summarizes the results of these χ² analyses across diagnostic domains (rows) and transition predictors (columns). An alpha correction for multiple comparisons was carried out using the FDR and statistical significance was determined at α=0.05.

Permutation analysis
Given the relatively low number of transition cases in the PRONIA database, we tested whether the SVM-based risk calculators had learned prediction rules that only accidentally generalized within our sample. We therefore tested the statistical significance of our risk calculators' prediction performance by performing label permutation analyses. 22 More specifically, we used LOSOCV (as stated in section 1.5.1) to retrain and cross-validate all unimodal, stacked, cybernetic and sequentially stacked risk calculators on the entire PRONIA sample in order to maximize the available sample size. Then, we constructed a permutation structure consisting of $N_{perm}$ = 1000 random transition label shuffles and used 1000 permuted LOSOCV experiments to probe our models. Specifically, at each label permutation, we retrained the SVM ensembles using the same cross-validation setup and respective feature sets as in the original label analyses. Hence, we accumulated the random models' predictions into a permuted ensemble prediction for each CV2 patient. This procedure generated a null distribution of OOT classification performance for each risk calculator, and this distribution was used to compute the significance of the observed performance as

$$P = \frac{1 + \#\{\pi : \mathrm{BAC}_{\pi} \geq \mathrm{BAC}_{\mathrm{observed}}\}}{N_{perm} + 1}.$$

The obtained P values were corrected for multiple comparisons using the FDR. The corrected significance threshold was defined at α=0.05. Corrected P values are reported in Table 2 of the main manuscript.
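Assuming the standard permutation P value given above, the final significance computation can be sketched as follows; the observed BAC and the permuted null distribution are simulated placeholders for the values produced by the 1000 permuted LOSOCV experiments.

```python
# Sketch of the permutation-based significance test: the observed leave-site-out BAC
# is compared with the null distribution obtained from models retrained on shuffled
# outcome labels.
import numpy as np

rng = np.random.default_rng(11)
observed_bac = 0.74                                       # hypothetical observed performance
null_bac = rng.normal(loc=0.50, scale=0.04, size=1000)    # stand-in for the permuted BAC values

p_value = (1 + np.sum(null_bac >= observed_bac)) / (len(null_bac) + 1)
print(f"permutation P = {p_value:.4f}")
```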

External validation of risk calculators
We validated an abbreviated, pragmatic version of our clinical-neurocognitive risk calculator by first extracting the 7 features that were significant in the sign-based consistency visualization analysis (see 1.5.5, Figure 1, main manuscript, and eFigure 2) from the original space of 142 features. We retrained the SVM ensemble using these features and the identical training parameters as in the original analysis, except for the wrapper-based feature selection, which was skipped because of our feature pre-selection strategy. Then, we tested this condensed risk calculator in the independent cohort of 88 PRONIA non-transition cases to probe our model's specificity. After completing this procedure, we retrained the condensed risk calculator using the entire PRONIA dataset and applied it to two independent patient samples extracted from the ZInEP 23 and BEARS-Kid 47 cohorts (eFigure 2). The ZInEP (Zürcher Impulsprogramm zur Nachhaltigen Entwicklung der Psychiatrie) study was a prospective naturalistic study of patients with CHR states, aged 13 to 35, who were recruited similarly to PRONIA using the SIPS 3 and SPI-A/-CY 4,48 instruments. Patients were enrolled in the Zurich area, Switzerland, and examined following a multi-level approach which included psychopathological, neuropsychological, and magnetic resonance imaging measures. Patients were followed every 6 months for a maximum follow-up duration of 36 months. In contrast, the BEARS-Kid project (Bi-national Evaluation of At-Risk Symptoms in Children and Adolescents) was a study of CHR symptoms in individuals aged 8 to 40 who were sampled from the 384,000 persons included in the population registry of Canton Bern, Switzerland. Study participants were followed over a period of 24 months and assessed with the SIPS and SPI-A/-CY among other clinical instruments. For the purpose of externally validating the clinical-neurocognitive risk calculator, we extracted a sample of 462 patients with various mental conditions from the full BEARS-Kid cohort (see eTable 19). Importantly, the transition rate in this sample was 2.8% and hence significantly lower than in the CHR sample of the ZInEP study (11.0%; df=2, χ²=16.2, P<.001). This difference allowed us to test the generalizability of the clinical-neurocognitive risk calculator at two distinct pre-test risk enrichment levels, as they could occur in a general youth mental health service vs. a service specifically targeting patients at risk of psychotic disorders. 49 To qualitatively compare these two cohorts to the PRONIA sample, we selected sociodemographic and clinical variables that closely matched the variables analyzed in Table 1, main manuscript, and performed group-level comparisons of transition vs. non-transition cases, which are reported in eTable 18 (ZInEP) and eTable 19 (BEARS-Kid). Alpha corrections were carried out as in the PRONIA group-level analyses using the FDR and statistical significance was defined at α=0.05. In both samples, all five SIPS variables needed for the application of the risk calculator were available, but not the CTQ item ('People in my family looked out for each other') and the DANVA item ('No. of correctly identified facial expressions'). Prior to the application of the SVM ensemble, these missing variables were imputed by the trained data preprocessing module of the clinical-neurocognitive risk calculator (see 1.5.2).
Furthermore, we tested the external validity of our sMRI-based risk calculator using the imaging data provided by the ZInEP (n=146) and FePsy 24 (n=37) studies. The FePsy cohort analyzed in the present work is identical to the one previously described in Koutsouleris et al. 50 The transition rate in this sample was 43.2% and thus significantly higher (df=2, χ²=21.3, P<.001) than among the ZInEP patients (11.0%). The structural images of these two samples were pre-processed using the same CAT12 toolbox and parameters employed for the pre-processing of the PRONIA images (see 1.5.2). No specific calibration of the pre-processed images to the PRONIA data was performed. Then, the sMRI-based SVM ensemble retrained on the entire PRONIA cohort was used to generate outcome predictions for both external samples. We report the OOCV performance of the classifier in Table 2, main manuscript.
Because both psychopathological and imaging data were available for the 146 patients of the ZInEP cohort, we were able to test the external validity of a reduced stacked risk calculator that was trained on the two unimodal risk calculators described above. Thus, we were able to evaluate whether a multimodal prognostic system outperformed its constituent unimodal components in independent and external datasets. First, we trained the stacked prognostic system on the PRONIA patients with 18-month follow-up data and then validated the system in the 88 PRONIA non-transition cases, using the same preprocessing and machine learning settings employed for the training and cross-validation of the fully stacked and cybernetic risk calculators. Then, we performed the external validation of the stacked risk calculator by retraining it on the entire PRONIA cohort and applying it to the predictions generated by the unimodal clinical-neurocognitive and sMRI-based risk calculators in the ZInEP sample. Finally, we compared the predictive performance of the clinical-neurocognitive, sMRI-based, and stacked models in the ZInEP data using the Quade test (Figure 2, main manuscript).

eTable 2. Antipsychotic Medication Thresholds Based on the Previous Version of the S3 Guidelines of the German Association for Psychiatry, Psychotherapy and Psychosomatics
Candidate CHR and ROD patients were excluded if they had received antipsychotic medication (1) for more than 30 cumulative days at or above the minimum target dosage threshold for the treatment of first-episode psychosis, or (2) within the past 3 months before the psychopathological baseline assessment at or above the minimum target dosage threshold for the treatment of first-episode psychosis. Abbreviations: DI, dosage interval. ² Maximum recommended dosage according to prescribing information.

eTable 3. Group-Level Comparison of CHR, ROD and HC Individuals
Groups are compared across sociodemographic, functioning, psychopathological and selected neurocognitive domains. P values were corrected for multiple comparisons using the False-Discovery Rate (PFDR). Significant comparisons in bold.

Variables (fragment of the Self-Ordered Pointing Task [SOPT] variable list):
SOPT Max corr responses before error 6 elemes 01-03 | Maximum correct responses before error by 6 elements, trials 1-3 | Self-ordered pointing task
SOPT Max corr responses before error 8 elemes 01-03 | Maximum correct responses before error by 8 elements, trials 1-3 | Self-ordered pointing task
SOPT Max corr responses before error 10 elemes 01-03 | Maximum correct responses before error by 10 elements, trials 1-3 | Self-ordered pointing task

Correlations Between Decision Scores of Unimodal and Cybernetic Risk Calculators and Patients' Maximum Follow-up Intervals and Number of Follow-up Examinations
Correlation strengths were measured using Spearman's rho (ρ). P values were corrected for multiple comparisons using the False-Discovery Rate (PFDR). See also eFigure 11 for a visual analysis relating the number of follow-up examinations and the longest follow-up duration to the predictive performance of these four different risk calculators.
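A minimal sketch of this correlation analysis, assuming hypothetical decision-score arrays and a scipy/statsmodels toolchain rather than the study's actual code, is shown below.

```python
# Spearman's rho between each calculator's decision scores and a follow-up
# variable, with FDR (Benjamini-Hochberg) correction across all tests.
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

def correlate_scores_with_followup(score_dict, followup_variable):
    """score_dict maps calculator names to decision-score arrays."""
    names, rhos, pvals = [], [], []
    for name, scores in score_dict.items():
        rho, p = spearmanr(scores, followup_variable)
        names.append(name)
        rhos.append(rho)
        pvals.append(p)
    _, p_fdr, _, _ = multipletests(pvals, method="fdr_bh")
    return {n: (r, p) for n, r, p in zip(names, rhos, p_fdr)}
```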

Variables: Clin-NC, PRS-based, sMRI-based, Stacked, and Cybernetic risk calculators. [Table body: Spearman's ρ and PFDR values relating each calculator's decision scores to the maximum follow-up interval and the number of follow-up examinations.]

One-way analyses of variance (F) were carried out to assess site-related effects on follow-up duration and age. The Kruskal-Wallis test (H) was used to analyze site differences in the time-to-transition variable, while the χ2 test was used to test for an interaction between site and sex. P values were corrected for multiple comparisons using the False-Discovery Rate (PFDR) and statistical significance was determined at α=0.05. Significant comparisons in bold. Because Udine did not observe any psychosis transitions (PT) during the follow-up period, the χ2 tests were carried out separately for true-positive and false-negative predictions (sensitivity, Udine excluded) and for true-negative and false-positive predictions (specificity, Udine included). P values were corrected for multiple comparisons using the False-Discovery Rate separately for the sensitivity- and specificity-related analyses. Significant associations in bold.
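The site-effect testing scheme described in this legend can be sketched as follows; the DataFrame layout and column names are illustrative assumptions.

```python
# One-way ANOVA for continuous variables, Kruskal-Wallis for time to
# transition, chi-square for the site-by-sex contingency, FDR correction.
import pandas as pd
from scipy.stats import f_oneway, kruskal, chi2_contingency
from statsmodels.stats.multitest import multipletests

def site_effect_tests(df):
    # df is a hypothetical DataFrame with columns:
    # 'site', 'age', 'followup', 'time_to_transition', 'sex'
    groups = [g for _, g in df.groupby("site")]
    pvals = {
        "followup_anova": f_oneway(*[g["followup"] for g in groups]).pvalue,
        "age_anova": f_oneway(*[g["age"] for g in groups]).pvalue,
        "ttt_kruskal": kruskal(*[g["time_to_transition"].dropna() for g in groups]).pvalue,
        "site_by_sex_chi2": chi2_contingency(pd.crosstab(df["site"], df["sex"]))[1],
    }
    _, p_fdr, _, _ = multipletests(list(pvals.values()), method="fdr_bh")
    return dict(zip(pvals.keys(), p_fdr))
```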

Heterogeneity of Raters' and Models' Predictive Performance
Cross-site means and standard deviations of sensitivity, specificity, balanced accuracy (BAC), and area under the curve (AUC) were computed for the raters' prognostic estimates and the out-of-training (OOT) predictions of the unimodal, stacked, cybernetic, and sequentially stacked risk calculators. For the computation of the mean (SD) of sensitivity, BAC, and AUC, we excluded Udine because the site lacked patients who developed psychosis during the follow-up period.

Site-level Sensitivity
Site-level Specificity
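A hedged sketch of how such per-site performance summaries could be computed, with a site that observed no transitions excluded from the sensitivity-dependent metrics, is given below; the input arrays and helper name are hypothetical.

```python
# Per-site sensitivity, specificity, BAC, and AUC from labels (y_true),
# binary predictions (y_pred), decision scores, and site identifiers.
import numpy as np
from sklearn.metrics import roc_auc_score

def per_site_metrics(y_true, y_pred, scores, sites):
    out = {}
    for site in np.unique(sites):
        m = sites == site
        yt, yp, sc = y_true[m], y_pred[m], scores[m]
        spec = np.mean(yp[yt == 0] == 0)
        if yt.sum() == 0:                     # no transitions at this site
            out[site] = {"specificity": spec}
            continue
        sens = np.mean(yp[yt == 1] == 1)
        out[site] = {"sensitivity": sens, "specificity": spec,
                     "bac": (sens + spec) / 2, "auc": roc_auc_score(yt, sc)}
    return out
```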

eTable 14. ROD Depletion and Substitution Analyses Assessing the Performance Effects Induced by the ROD Group in the Prediction of Psychosis Transitions in the CHR Sample
ROD depletion and substitution analyses were carried out to assess the performance effects induced by training the unimodal and multimodal risk calculators with ROD patients. Both sets of analyses were carried out in the complete PRONIA cohort using the same parameter settings as in our main analyses. First, out-of-training performance was measured only in the CHR sample using Leave-One-Site-Out Cross-Validation (LOSOCV, see section 1.5.1). Second, in the depletion analysis, we removed the ROD patients from the training population and re-ran the analysis in the CHR patients only. Third, in the substitution analysis, we replaced the ROD patients with 167 age-, sex-, and site-matched HCs and again measured the prediction performance of these prognostic classifiers only in the CHR patients. For the sMRI-based risk calculator, we were able to use the FePsy and ZInEP cohorts to externally validate the losses of prognostic performance observed in the PRONIA CHR cohort. See section 1.5.7 for further methodological details and eFigure 12 for a statistical analysis of performance differences (BAC) between the original models and their ROD-depleted or HC-substituted counterparts.
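The LOSOCV evaluation restricted to CHR patients, which underlies both the depletion and the substitution analyses, can be sketched as follows; the classifier, feature matrix, and group masks are illustrative placeholders rather than the study's pipeline. For the depletion analysis, the same function would simply be called on the cohort with ROD patients removed; for the substitution analysis, on the cohort in which ROD patients are replaced by matched HCs.

```python
# Leave-one-site-out cross-validation: train on all sites but one, evaluate
# only on the CHR patients of the held-out site, and average the BAC.
import numpy as np
from sklearn.svm import SVC

def losocv_chr_performance(X, y, sites, is_chr, make_clf=lambda: SVC(kernel="linear")):
    bacs = []
    for site in np.unique(sites):
        train = sites != site
        test = (sites == site) & is_chr
        if test.sum() == 0:
            continue
        clf = make_clf().fit(X[train], y[train])
        pred = clf.predict(X[test])
        sens = np.mean(pred[y[test] == 1] == 1) if (y[test] == 1).any() else np.nan
        spec = np.mean(pred[y[test] == 0] == 0) if (y[test] == 0).any() else np.nan
        bacs.append(np.nanmean([sens, spec]))
    return float(np.nanmean(bacs))
```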

eTable 16. Comparison of Nonpsychotic Diagnostic Outcomes Between Patients With a Predicted Transition vs Predicted Nontransition to Psychosis During the Follow-up Period of the Study
Patients were included in the analysis if they had a baseline and at least one SCID-IV follow-up examination. Comparisons were carried out to explore whether raters or risk calculators predicted not only transition to psychotic disorders but also other relevant diagnostic outcomes. In addition, the comparisons were carried out for the observed outcome groups. The definition of diagnostic outcomes is detailed in section 1.5.8. All P values were corrected for multiple comparisons using the False-Discovery Rate (PFDR). The winning sequence is marked in red. See also eFigure 14 for an in-depth analysis of the winning sequence.

[…] to the respective CV1 training and test data and calculating the average BAC in the given partition. To reduce the risk of overfitting, the top three most predictive models were selected from the CV1 test performance profile, which was computed by averaging across the 5 × 10 CV1 test data partitions. This SVM model ensemble (comprising 3 models × 5 CV1 permutations × 10 CV1 folds = 150 SVM models) was then applied to the CV2 validation data provided by the respective held-back site.
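The ensemble-construction step described in the preceding fragment can be sketched as follows; the hyperparameter grid, the linear-SVM choice, and the BAC-based selection criterion are assumptions made for illustration, not the study's NeuroMiner settings.

```python
# Inner (CV1) loop with 5 permutations x 10 folds: every candidate setting is
# fitted in each partition, the three settings with the best mean CV1 test BAC
# are retained, and their 3 x 50 = 150 fitted models form the ensemble that is
# applied to the held-back site (CV2).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import balanced_accuracy_score

def build_cv1_ensemble(X_cv1, y_cv1, C_grid=(0.01, 0.1, 1, 10)):
    inner = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
    models, mean_bac = {}, {}
    for C in C_grid:
        fitted, bacs = [], []
        for tr, te in inner.split(X_cv1, y_cv1):
            clf = SVC(kernel="linear", C=C).fit(X_cv1[tr], y_cv1[tr])
            fitted.append(clf)
            bacs.append(balanced_accuracy_score(y_cv1[te], clf.predict(X_cv1[te])))
        models[C], mean_bac[C] = fitted, np.mean(bacs)   # CV1 test performance profile
    top3 = sorted(mean_bac, key=mean_bac.get, reverse=True)[:3]
    return [m for C in top3 for m in models[C]]          # 3 x 50 = 150 models

def predict_cv2_site(ensemble, X_site):
    scores = np.mean([m.decision_function(X_site) for m in ensemble], axis=0)
    return (scores >= 0).astype(int)
```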

eFigure 5. Predictive Signature Underlying the sMRI-Based Risk Calculator
A: The reliability of predictive pattern elements is visualized by means of cross-validation ratio mapping. B: Sign-based consistency mapping depicts pattern elements that contributed significantly to the risk calculator's predictive pattern. Cool colors indicate voxels with increased brain volumes in PT patients, while warm colors represent increased brain volumes in NT patients.

Analysis of group differences in variables that were reliably predictive of PT in the clinical-neurocognitive risk signature (see Figure 1, main manuscript). See section 1.5.6 for methodological details. Z score profiles were visually compared between HC (gray line), predicted PT (red line), and predicted NT patients (blue line); the latter two groups were defined by the prognostic assignments of the clinical-neurocognitive risk calculator. The three groups were statistically compared using Multivariate Analysis of Variance (MANOVA), which produced a significant omnibus test (Wilks' λ=0.282; F=20.89; P<.001). The right side of the figure shows the results of the between-group ANOVAs that followed the significant omnibus test (F scores, black line; -log10(PFDR) scores, purple line). Pairwise post hoc analyses were carried out for significant ANOVAs, and the obtained P values were adjusted for multiple comparisons using Tukey's HSD method.

Group comparison of the polygenic risk scores (PRS) between HC (gray line) and predicted transition (red line) vs. predicted nontransition patients (blue line); the latter two groups were defined by the PRS-based risk calculator (A). Similar comparisons were conducted between observed outcomes (PT, NT) and HC (B), as well as between the CHR, ROD, and HC groups (C). The lower graph in each plot shows the F scores (black line) and respective P values of a Multivariate Analysis of Variance (MANOVA) carried out to assess main effects of group across 10 genome-wide significance thresholds. The omnibus test was significant at P<.001 (Wilks' λ=0.598; F=14.27). Post hoc analyses compared both predicted outcome groups with the healthy volunteers, and the obtained P values were corrected for multiple comparisons using Tukey's HSD method. As in eFigure 9, the analysis showed that both predicted outcome groups differed significantly from the HC population. In similar MANOVAs performed for the observed outcome groups vs. HC (B) and the study groups vs. HC (C), we did not observe reduced polygenic risk in nontransition cases or ROD patients compared with the HC participants.
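The omnibus-plus-post-hoc scheme used in these legends can be sketched as follows; the DataFrame, dependent-variable names, and the statsmodels-based implementation are illustrative assumptions, not the study's code.

```python
# MANOVA omnibus test across several dependent variables, followed by
# per-variable ANOVAs (FDR-corrected) and pairwise Tukey HSD comparisons.
from scipy.stats import f_oneway
from statsmodels.multivariate.manova import MANOVA
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def omnibus_and_posthoc(df, dep_vars, group_col="group"):
    # Omnibus MANOVA (reports Wilks' lambda among other statistics)
    formula = " + ".join(dep_vars) + " ~ " + group_col
    omnibus = MANOVA.from_formula(formula, data=df).mv_test()
    # Follow-up ANOVAs, FDR-corrected across dependent variables
    pvals = [f_oneway(*[g[v] for _, g in df.groupby(group_col)]).pvalue
             for v in dep_vars]
    _, p_fdr, _, _ = multipletests(pvals, method="fdr_bh")
    # Pairwise Tukey HSD only for significant ANOVAs
    tukey = {v: pairwise_tukeyhsd(df[v], df[group_col])
             for v, p in zip(dep_vars, p_fdr) if p < 0.05}
    return omnibus, dict(zip(dep_vars, p_fdr)), tukey
```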

eFigure 9. Univariate Volumetric Comparisons Between Healthy Volunteers and Patients Labeled With a Transition or Nontransition to Psychosis by the sMRI-Based Risk Calculator
All analyses were carried out in SPM12 (MATLAB 2019b). Structural images were smoothed with an 8-mm full-width-at-half-maximum Gaussian kernel before entering a non-parametric analysis of pairwise group differences performed using the Threshold-Free Cluster Enhancement (TFCE) toolbox for SPM12 (http://www.neuro.uni-jena.de/tfce/). To assess the significance of differences, 5000 permutations of group membership were performed and the respective null distributions of the TFCE scores were constructed. The figure shows brain regions where differences between the predicted outcome groups and healthy controls were significant at α=0.05 after correction for multiple comparisons using the False-Discovery Rate (FDR).
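A generic sketch of label-permutation testing, assuming a user-supplied statistic function rather than the SPM12/TFCE toolbox internals, is given below.

```python
# Build a null distribution of a test statistic by shuffling group labels
# (here: pooling and re-splitting two groups) and compare the observed value
# against it; the statistic function is a placeholder for the TFCE-enhanced
# group-difference maps computed by the toolbox.
import numpy as np

def permutation_pvalue(stat_fn, data_a, data_b, n_perm=5000, rng=None):
    rng = np.random.default_rng(rng)
    observed = stat_fn(data_a, data_b)
    pooled = np.concatenate([data_a, data_b])
    n_a = len(data_a)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(pooled)
        null[i] = stat_fn(perm[:n_a], perm[n_a:])
    # Two-sided p value relative to the permutation null distribution
    return (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)
```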