Use of Machine Learning for Predicting Escitalopram Treatment Outcome From Electroencephalography Recordings in Adult Patients With Depression

This prognostic study of patients with major depressive disorder estimates how accurately an outcome of escitalopram treatment can be predicted from electroencephalographic data.


Inter-site Data Harmonization and Pre-processing
All EEG datasets were standardized to the following parameters: the 58 electrodes common to the equipment across all sites, a 0.05-100 Hz bandpass filter, average reference, and a 512 Hz sampling rate. All filtering was done using 2nd-order Butterworth filters applied to the data twice, in the forward and reverse directions, to eliminate any phase distortion.
The standardization was performed with the EEGLAB toolbox (v12.0.2.6b)4. Complete descriptions of the EEG standardization procedure across sites employed by the current study can be found in Farzan et al.5
Data pre-processing was conducted with the EEGERP toolbox5. Channels contaminated by large sporadic artifacts were identified by a human analyst and deleted. EEG data were bandpass-filtered between 1 and 80 Hz, and notch-filtered at 60 Hz. Independent component analysis was used to remove eye-movement and blink, electromyography, and electrode-discontinuity artifacts. The deleted EEG channels were then interpolated using spherical spline interpolation6.
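The filtering steps above were performed in MATLAB toolboxes; an equivalent zero-phase bandpass and notch filter can be sketched in Python with SciPy (the data here are synthetic, and this is an illustrative sketch rather than the study's actual pipeline):

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

sfreq = 512.0
rng = np.random.default_rng(0)
eeg = rng.standard_normal((58, int(sfreq * 10)))   # 58 channels, 10 s of data

# 1-80 Hz bandpass, 2nd-order Butterworth, applied forward and reverse
# (filtfilt) so that the net phase distortion is zero
b, a = butter(2, [1.0, 80.0], btype="bandpass", fs=sfreq)
filtered = filtfilt(b, a, eeg, axis=1)

# 60 Hz notch filter to suppress power-line noise
bn, an = iirnotch(60.0, Q=30.0, fs=sfreq)
filtered = filtfilt(bn, an, filtered, axis=1)
```

The notch quality factor Q=30 is an assumption for illustration; the source does not specify the notch design.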

Feature Computation
Once the data were harmonized and pre-processed, we proceeded to feature computation. For each feature, we segmented the data into continuous intervals of equal length, computed the feature for each interval independently, and averaged the results to summarize the feature in a single value per patient. We used 2-second intervals for all features except MSE, for which an interval duration of 20 seconds was chosen based on previous studies7,8.
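The segment-then-average scheme can be sketched as follows (the variance feature and the helper name are hypothetical; any per-interval feature function could be substituted):

```python
import numpy as np

def epoch_average(data, sfreq, win_sec, feature_fn):
    """Split a continuous recording (channels x samples) into equal-length
    windows, apply feature_fn to each window independently, and average
    the per-window results into a single value per channel."""
    win = int(win_sec * sfreq)
    n_win = data.shape[1] // win
    vals = [feature_fn(data[:, i * win:(i + 1) * win]) for i in range(n_win)]
    return np.mean(vals, axis=0)

# Example: per-channel variance averaged over 2-second windows at 512 Hz
rng = np.random.default_rng(0)
eeg = rng.standard_normal((58, 512 * 60))          # 58 channels, 60 s
summary = epoch_average(eeg, sfreq=512, win_sec=2.0,
                        feature_fn=lambda w: w.var(axis=1))
print(summary.shape)  # one averaged value per channel
```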
The four classes of features used in this study are defined as follows.

Electrode-level spectral features
We computed electrode-level spectral features using the EEGLAB function spectopo to obtain the power spectrum for each electrode. The log-transformed absolute power was obtained for each channel in each of the frequency bands defined above. Asymmetry between the left and right hemispheres was also considered.
To that end, all electrodes located off the midline (50 electrodes in our montage) were split into 25 pairs symmetrical with respect to the midline, e.g., FP1/FP2, AF3/AF4, etc. For each pair, the absolute power at the left electrode was divided by the absolute power at the right electrode, yielding 25 features per frequency band that describe the lateralization of EEG power.
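The asymmetry computation reduces to a ratio per symmetric pair; a minimal sketch (power values are hypothetical, and only two of the 25 pairs are shown):

```python
# Hypothetical absolute band power per electrode, for one frequency band
power = {"FP1": 4.2, "FP2": 3.8, "AF3": 2.9, "AF4": 3.1}

# Left/right electrode pairs symmetric about the midline
# (the full montage has 25 such pairs)
pairs = [("FP1", "FP2"), ("AF3", "AF4")]

# Lateralization feature: left-hemisphere power divided by right-hemisphere
# power, one feature per pair per frequency band
asymmetry = {f"{l}/{r}": power[l] / power[r] for l, r in pairs}
```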

Source-level spectral features
Current source density analysis was performed using the eLORETA algorithm as implemented in the LORETA-KEY software9. The transformation matrix (58 channels to 6239 voxels) was derived by coregistering electrode coordinates (international 10-10 system) to the MNI152 MRI head template10, using a relative regularization parameter of 1. The eLORETA solution space was restricted to cortical gray matter plus parts of the hippocampus and amygdala. The MNI152 template brain volume was divided into 6239 cortical gray matter voxels at 5-mm spatial resolution10. From the scalp-recorded electrical potential distribution, eLORETA computes the three-dimensional intracerebral distribution of current density for each of the frequency bands specified above. The following regions were selected based on the previous literature: the anterior cingulate cortex (ACC), rostral ACC (rACC), and the medial orbitofrontal cortex (mOFC)11,12.

Multiscale-entropy-based features
Multiscale entropy analysis was performed using previously described methods13,14. Using the sample entropy equation, multiscale entropy was examined across all 58 electrodes, with the coarse-graining process applied for 70 scales. Sample entropy quantifies the variability of a time series by estimating the predictability of amplitude patterns across the series. In our analysis, two consecutive data points were used for pattern matching (m = 2), and data points were considered to match if their absolute amplitude difference was less than 15% (i.e., r = 0.15) of the standard deviation of the time series, similar to the previous literature13,14.
We also computed asymmetry between the left and right hemispheres, in a fashion similar to the asymmetry of power at specific frequency bands (see above): multiscale entropy in the left hemisphere was divided by multiscale entropy in the right hemisphere for the 25 channel pairs FP1/FP2, AF3/AF4, …, O1/O2.
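Sample entropy with m = 2 and r = 0.15, together with the coarse-graining step, can be sketched as below. This is an illustrative implementation, not the exact code of the cited methods13,14, and only three of the 70 scales are evaluated here:

```python
import numpy as np

def sample_entropy(x, m=2, r_frac=0.15):
    """Sample entropy: template length m, tolerance r = r_frac * std(x)."""
    x = np.asarray(x, dtype=float)
    r = r_frac * x.std()
    n = len(x)

    def match_count(mm):
        # count template pairs (i < j) whose Chebyshev distance is below r
        t = np.array([x[i:i + mm] for i in range(n - mm + 1)])
        c = 0
        for i in range(len(t) - 1):
            c += np.sum(np.max(np.abs(t[i + 1:] - t[i]), axis=1) < r)
        return c

    b, a = match_count(m), match_count(m + 1)
    return -np.log(a / b) if a > 0 and b > 0 else np.inf

def coarse_grain(x, scale):
    # average consecutive, non-overlapping blocks of length `scale`
    n = len(x) // scale
    return np.asarray(x[:n * scale], dtype=float).reshape(n, scale).mean(axis=1)

rng = np.random.default_rng(1)
signal = rng.standard_normal(2000)
mse = [sample_entropy(coarse_grain(signal, s)) for s in (1, 2, 5)]
```

Higher sample entropy indicates a less predictable signal; the multiscale curve is the sequence of these values across coarse-graining scales.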

Microstate-based features
Microstate analysis followed the standard procedure outlined elsewhere15,16 and was implemented using CARTOOL17. Prior to microstate analysis, pre-processed EEG data were bandpass-filtered from 1 to 30 Hz. The topographical maps at the local maxima of the global field power curve were clustered to derive the four prototypical microstate classes18. In this study, the topographical atomize-agglomerate hierarchical clustering algorithm19 was applied to cluster samples from each EEG dataset into four states (microstate maps). Finally, the topographical map at each local maximum of the global field power curve was assigned to the microstate class with which it had the highest spatial Pearson product-moment correlation coefficient20. Three features were calculated for each of the four microstate classes: (i) average duration, (ii) frequency, and (iii) coverage. Average duration is the average time, in milliseconds, that a microstate class remains stable when it appears; frequency is the number of occurrences of each microstate class per second; and coverage is the percentage of the recording covered by each microstate class.
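Given a per-sample sequence of microstate labels (the output of the assignment step), the three features reduce to run-length statistics. A minimal sketch with a hypothetical toy label sequence:

```python
import numpy as np

def microstate_metrics(labels, sfreq):
    """Average duration (ms), frequency (1/s), and coverage (%) per class,
    given a per-sample sequence of microstate labels."""
    labels = np.asarray(labels)
    change = np.flatnonzero(np.diff(labels)) + 1       # run boundaries
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(labels)]))
    total_sec = len(labels) / sfreq
    metrics = {}
    for k in np.unique(labels):
        mask = labels[starts] == k
        seg_len = ends[mask] - starts[mask]            # run lengths, in samples
        metrics[int(k)] = {
            "duration_ms": seg_len.mean() / sfreq * 1000,
            "frequency_hz": len(seg_len) / total_sec,
            "coverage_pct": seg_len.sum() / len(labels) * 100,
        }
    return metrics

# toy 1-second label sequence at 512 Hz with two classes (0 and 1)
labels = [0] * 256 + [1] * 128 + [0] * 128
m = microstate_metrics(labels, sfreq=512)
```

In this toy sequence, class 0 occupies two runs (256 and 128 samples), so its coverage is 75% and it occurs twice per second.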
Features for the Early Change source were calculated as (post - pre)/pre for multiscale-entropy-based, microstate-based, and source-level spectral features, and as (post - pre) for electrode-level spectral features.
The total number of features for the Early Change source was 6424. For the Combined source, the total number of features was 12,848 (6424 + 6424).
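The feature-source constructions can be sketched as follows. The feature values are hypothetical, and the reading of the Combined source as a concatenation of the Baseline and Week 2 feature vectors is an assumption inferred from the 6424 + 6424 count:

```python
import numpy as np

baseline = np.array([2.0, 3.0, 4.0])   # hypothetical Baseline feature values
week2 = np.array([2.5, 2.4, 4.4])      # the same features measured at week 2

# Early Change: relative change for MSE-based, microstate-based, and
# source-level spectral features
relative_change = (week2 - baseline) / baseline

# Early Change: absolute change for (log-transformed) electrode-level power
absolute_change = week2 - baseline

# Combined: Baseline and Week 2 feature vectors side by side (6424 + 6424)
combined = np.concatenate([baseline, week2])
```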

Feature Ranking
We used a t-test-based filtering method to rank the features according to their predictive power. The method repeats the steps below for 100 iterations:
1. Generate a training set by randomly selecting 80% of responders and 80% of non-responders from the whole data sample.
2. For each feature, test with a t-test whether the feature's values differ significantly between responders and non-responders in the training set only. If the test passes the significance threshold (α < 0.05), the iteration counts as one vote for the feature; otherwise, as zero votes.
The total number of votes for each feature over the 100 iterations therefore varies between 0 and 100, with more votes indicating features that are more robustly predictive of the treatment outcome.
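The voting procedure can be sketched as below (synthetic data; the helper name `vote_rank` is ours, and SciPy's two-sample t-test stands in for whatever test implementation the study used):

```python
import numpy as np
from scipy.stats import ttest_ind

def vote_rank(X, y, n_iter=100, frac=0.8, alpha=0.05, seed=0):
    """Rank features by how often they separate the two groups: each
    iteration draws 80% of each class and votes for every feature whose
    two-sample t-test reaches p < alpha on that training subset."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(X.shape[1], dtype=int)
    idx_r, idx_n = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    for _ in range(n_iter):
        tr_r = rng.choice(idx_r, int(frac * len(idx_r)), replace=False)
        tr_n = rng.choice(idx_n, int(frac * len(idx_n)), replace=False)
        _, p = ttest_ind(X[tr_r], X[tr_n], axis=0)
        votes += (p < alpha).astype(int)
    return votes  # 0..n_iter votes per feature

# synthetic example: 60 patients, 5 features, one truly discriminative
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5))
y = np.repeat([1, 0], 30)              # 30 responders, 30 non-responders
X[y == 1, 0] += 2.0                    # feature 0 separates the groups
votes = vote_rank(X, y)
```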

Classifier Construction and Performance Estimation
An SVM is a binary classifier that finds a hyperplane (linear SVM) or hypersurface (non-linear SVM) that best separates the input feature space into two pre-defined groups (in our case, responders and non-responders)21. An RBF kernel uses a nonlinear mapping to transform the data into a higher-dimensional space and determine an optimal hypersurface for classification22. The optimal hypersurface separates the two groups with the largest margin (i.e., the distance between the hypersurface and the closest data points). In this study we use the LIBSVM toolbox's23 implementation of RBF SVMs.
The RBF version of the SVM constitutes a family of classification algorithms parameterized by two hyperparameters, denoted here C (also known as the penalty) and γ. Each particular combination of C and γ values essentially yields a different classifier with different learning behavior. To find a combination (C, γ) that yields an optimal classifier for our data, we use a grid-search method23 over a restricted range of C and γ values:
1. We restrict the range of potential values to C ∈ {2^-3, 2^-1, 2^1, 2^3} and γ ∈ {2^-12, 2^-10, 2^-8, 2^-6}.
2. Using 10-fold cross-validation, we estimate the balanced accuracy of predicting the treatment outcome for each of the resulting 16 parameter combinations.
3. We select the combination yielding the best estimate.
We use the above procedure to estimate the balanced accuracy of SVM prediction of treatment outcome for the four feature sources (Baseline, Week 2, Early Change and Combined) and several different vote thresholds T ∈ {50, 60, 70, 80, 90}.
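The grid search can be sketched in Python with scikit-learn's SVC (which wraps LIBSVM); the data below are synthetic stand-ins, not the study's features:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# synthetic stand-in data: 80 patients, 10 selected features, binary outcome
rng = np.random.default_rng(0)
X = rng.standard_normal((80, 10))
y = (X[:, 0] + 0.5 * rng.standard_normal(80) > 0).astype(int)

best_params, best_acc = None, -np.inf
for C in [2.0 ** k for k in (-3, -1, 1, 3)]:           # penalty grid
    for gamma in [2.0 ** k for k in (-12, -10, -8, -6)]:  # kernel-width grid
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        # balanced accuracy estimated by 10-fold cross-validation
        acc = cross_val_score(clf, X, y, cv=10,
                              scoring="balanced_accuracy").mean()
        if acc > best_acc:                              # keep the best combo
            best_params, best_acc = (C, gamma), acc
```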

Assessment of Bounds on Achievable Classification Accuracy
Prediction accuracy of any classifier is limited by the intrinsic noise of the target variable. In our case, the target variable (treatment response) is derived from MADRS scores, which are known to contain significant noise. To assess the impact of the imperfect reliability of the MADRS scoring procedure on our study, we performed the simulation described below.
Consider a hypothetical perfect classifier that achieves 100% classification accuracy on all the data at our disposal. This accuracy is achieved with respect to the binary target variable (responder/non-responder) that was computed from the MADRS scores obtained from the patients. However, the imperfect reliability of the scoring procedure means that if we were to repeat the scoring for the same patients at the same levels of depression severity, we would obtain somewhat different results. This, in turn, would lead to a somewhat different assignment of responder/non-responder labels to the patients.
If we were to test our perfect classifier on these alternative target labels, we would observe a classification accuracy of less than 100%. This degraded accuracy reflects the inherent limit on performance imposed by the noise in the target variable, rather than a shortcoming of any particular classifier design. It also provides an upper bound on the performance achievable by the best possible model trained on an infinite amount of data.
To carry out the above simulation in practice, we first need a procedure for realistically modeling MADRS scoring noise. We used a (somewhat optimistic) estimate of MADRS scoring reliability reported by Williams and Kobak24.
In that study, the researchers conducted two identical scoring sessions with each of the study participants and compared the scores. The similarity between the two datasets, as measured by the intraclass correlation coefficient (ICC), was 0.93.
We generated 10,000 surrogate instances of the MADRS scores by adding random Gaussian noise to the original scores (for both week 0 and week 8). The noise had zero mean, and its variance was selected so that the average ICC between the original and surrogate scores was approximately 0.93. For each surrogate instance, we computed the classification error of the "perfect" classifier, i.e., the one that attains 100% accuracy on the original data.
Introducing noise into the MADRS data degraded the maximal achievable classification accuracy from 100% to 86.7% on average (standard deviation, 2.8%).
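The simulation can be sketched as below. The MADRS score distributions and the response criterion (≥50% reduction from baseline) are assumptions for illustration, and the Pearson correlation is used as a simple stand-in for the ICC when calibrating the noise variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_sim, icc = 200, 10_000, 0.93

# hypothetical MADRS totals at week 0 (baseline) and week 8 (endpoint)
w0 = rng.uniform(25, 45, n_patients)
w8 = w0 * rng.uniform(0.2, 1.0, n_patients)

# response criterion assumed here: at least 50% reduction from baseline
resp = (w0 - w8) / w0 >= 0.5

def noisy(x):
    # noise SD chosen so corr(x, x + noise) ~= icc; Pearson correlation
    # serves as a simple proxy for the scoring procedure's ICC
    sd = x.std() * np.sqrt(1.0 / icc ** 2 - 1.0)
    return x + rng.normal(0.0, sd, x.shape)

# accuracy of the "perfect" classifier (which reproduces the original
# labels) when judged against labels recomputed from surrogate scores
acc = np.empty(n_sim)
for i in range(n_sim):
    nw0, nw8 = noisy(w0), noisy(w8)
    acc[i] = np.mean(((nw0 - nw8) / nw0 >= 0.5) == resp)
print(f"max achievable accuracy: {acc.mean():.3f} +/- {acc.std():.3f}")
```

The mean of `acc` plays the role of the 86.7% bound reported above, though the exact value depends on the assumed score distributions.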