Association of Environmental Uncertainty With Altered Decision-making and Learning Mechanisms in Youths With Obsessive-Compulsive Disorder

Key Points Question Is decision-making associated with environmental uncertainty in youths with obsessive-compulsive disorder (OCD)? Findings In this cross-sectional study of 103 individuals, hierarchical reinforcement learning models fitted to 2 clinical data sets indicated that youths 12 to 19 years of age with OCD showed atypical trial-by-trial performance on a probabilistic reversal learning 2-choice task. However, on a deterministic set-shifting task, youths with OCD did not show marked differences from healthy controls. Meaning Obsessive-compulsive disorder in youths was associated with impaired decision-making during probabilistic tasks but not deterministic tasks, contributing to growing evidence that youths with OCD may have difficulty coping with environmental uncertainty.


WCST
The WCST used in this study was run on a laptop via the Psychology Experiment Building Language programme 8.
The WCST contains up to 128 trials. Participants were shown 4 decks, each displaying a different combination of colours, numbers, and shapes. They were instructed to sort cards appearing at the bottom of the screen according to one of three rules at a time: number, colour, or shape. The rule had to be discovered by trial and error using the visual feedback received after each card was sorted. Cards were sorted by clicking on the chosen deck using the laptop mousepad. If a card was sorted correctly, the feedback 'Correct' was shown; if it was sorted incorrectly, the feedback 'Incorrect' was shown. There was no time limit for sorting a card on each trial, but participants were told to answer as quickly and as accurately as possible.
After 10 cards were sorted correctly in a row, one set was completed and the sorting rule changed. The process continued until the participant either sorted all 128 cards or completed 9 sets. The task took approximately 10 minutes to complete.
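To make the task structure concrete, here is a toy R sketch of the trial loop, under stated assumptions: simulate_sort() is a hypothetical stand-in for a participant's response, and the rule-switching scheme is illustrative rather than the PEBL implementation.

```r
# Toy sketch of the WCST trial structure: the sorting rule changes after
# 10 consecutive correct sorts, until 128 cards are used or 9 sets are done.
rules <- c("number", "colour", "shape")
rule <- sample(rules, 1); streak <- 0; sets_done <- 0

for (card in 1:128) {
  correct <- simulate_sort(card, rule)  # hypothetical participant response
  streak <- if (correct) streak + 1 else 0
  if (streak == 10) {                   # one set completed: switch the rule
    sets_done <- sets_done + 1; streak <- 0
    rule <- sample(setdiff(rules, rule), 1)  # illustrative switching scheme
    if (sets_done == 9) break
  }
}
```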
Standard Statistical Analysis for PRL
p(switching following SNF) and p(staying following VPF) were calculated by counting the number of times each participant switched following SNF or stayed following VPF and dividing these counts by the number of times the participant had the opportunity to carry out the corresponding switching or staying behaviour. p(correct) was calculated by dividing the number of correct responses per subject by the number of trials in the Acquisition and Reversal phases.
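As a minimal sketch, these summary measures could be computed as follows, assuming a trial-level data frame with hypothetical column names (subject, phase, correct, switched, prev_feedback, where "SNF" and "VPF" label the feedback type on the preceding trial):

```r
library(dplyr)

summary_measures <- trials %>%
  group_by(subject, phase) %>%
  summarise(
    p_correct    = mean(correct),
    # switches after SNF, out of all opportunities to switch following SNF
    p_switch_snf = mean(switched[prev_feedback == "SNF"]),
    # stays after VPF, out of all opportunities to stay following VPF
    p_stay_vpf   = mean(1 - switched[prev_feedback == "VPF"]),
    .groups = "drop"
  )
```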
Separate mixed binomial regressions were conducted using the glmer function from the lme4 R package 9 to assess the effects of Group (OCD vs CTL) and Phase (Acquisition vs Reversal) on the p(correct), p(switching following SNF), and p(staying following VPF) measures. A univariate binomial regression was used to analyse the effect of Group on p(perseverative errors); Phase was not included as an independent variable because perseverative responses could only be studied in the Reversal phase. RT was analysed using a mixed linear model via the 'lmer' function. We verified that the residuals of the reaction-time model followed a normal distribution, using a Shapiro-Wilk test, and displayed homoscedasticity, using a Breusch-Pagan test.
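A sketch of these models, assuming the same hypothetical trial-level data frame (columns Group, Phase, subject, correct, rt); the exact model formulas used in the study may differ:

```r
library(lme4)

# Mixed binomial regression: Group x Phase on trial-level accuracy,
# with a random intercept per participant.
fit_correct <- glmer(correct ~ Group * Phase + (1 | subject),
                     data = trials, family = binomial)

# Reaction times: mixed linear model with the same fixed effects.
fit_rt <- lmer(rt ~ Group * Phase + (1 | subject), data = trials)

# Residual normality check for the reaction-time model (Shapiro-Wilk).
shapiro.test(residuals(fit_rt))
```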
Homoscedasticity of the residuals obtained from each regression model was assessed using the Breusch-Pagan test. When the assumption of homoscedasticity was violated, a sandwich variance estimator from the 'sandwich' R package 16 was applied to the regression model(s). This enabled the extraction of standard errors that were robust to non-constant variance; p-values and 95% confidence intervals were then calculated using these new standard errors.
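One way to implement this, shown here for a non-mixed regression fit (the application to the study's exact models is assumed):

```r
library(sandwich)
library(lmtest)

bptest(fit)                               # Breusch-Pagan test
robust_vcov <- vcovHC(fit, type = "HC3")  # sandwich variance estimate
coeftest(fit, vcov. = robust_vcov)        # robust SEs and p-values
coefci(fit, vcov. = robust_vcov, level = 0.95)  # robust 95% CIs
```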

Analysis for both tasks
For both tasks, Wald 95% confidence intervals for all regression models were obtained using the 'confint' function in R.
Also for both tasks, overdispersion in the binomial models was assessed using the 'check_overdispersion' function in the 'performance' R package 17 . Quasibinomial models were implemented instead of binomial models when models showed overdispersion.
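A brief sketch of the overdispersion check, with a quasibinomial refit shown for a non-mixed model as one possible fallback (how the study handled the mixed case is not specified here):

```r
library(performance)

check_overdispersion(fit_correct)  # test for overdispersion

# Quasibinomial fallback, illustrated with a non-mixed glm.
fit_quasi <- glm(correct ~ Group * Phase, data = trials,
                 family = quasibinomial)
```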

Computational Models for PRL
To better understand the processes underlying learning and decision-making on the PRL task, a family of model-free reinforcement learning (RL) models was fitted to the data. Modelling first involves formulating a mathematical function equipped with different parameters of interest to analyse trial-by-trial data. The values of these parameters are then estimated when the model is fit to participants' data, enabling quantification and classification of behaviour.
RL models are a special class of computational models in which an agent learns from past experience with the goal of identifying which actions in specific states maximise rewarding outcomes 18. Model-free RL models do not account for specialised task structures (such as a possible reversal in a PRL task) and instead learn from the history of previous outcomes. Fitting an RL model to data thus better approximates 'true' human behaviour than averaging scores to obtain a single measure of behaviour (e.g., average proportion of perseverative errors) and enables improved insight into how learning evolves over time. In addition, compared with other classes of models, RL has been found to best capture PRL behaviour in marmosets 19 and has been widely used to model behaviour in studies investigating adolescent 20-22 and OCD samples 20,23.
In total, five RL models were fitted to data and compared to determine the best-fitting model. Model code was adapted from a prior study 23 .

Model 1
Model 1 served to discern whether a simple RL model with two parameters best described the data and accounted for differences between groups. The model comprised a learning rate parameter (α) and a reinforcement sensitivity parameter (τ). A value function (Qt) was assigned to each task stimulus, representing the expected reward associated with it: a high Qt denoted a higher chance of reward, while a low Qt indicated a reduced chance of reward. Qt for each stimulus was updated on a trial-by-trial basis via prediction errors, which represent the difference between the expected outcome, Qt, and the actual outcome on the current trial, Rt.
For example, if the expected outcome for a stimulus is 0 and selecting that stimulus on a given trial results in a reward, Qt for the stimulus would increase in value; larger prediction errors thus led to greater updating of Qt. On every trial (t), the value of the free learning rate parameter α determined the extent to which Qt was adjusted according to the prediction error. Concretely, this was done according to the Rescorla-Wagner rule:

$$Q_{k,t+1} = Q_{k,t} + \alpha (R_t - Q_{k,t}) \quad \text{(Equation 1.1)}$$

where k represents a specific stimulus (stimulus 1 or 2) and t represents the current trial. R equalled 1 following a rewarded outcome and 0 following an unrewarded outcome; the term Rt − Qk,t represents the prediction error. α was able to vary between 0 and 1, with values closer to 1 indicating increased sensitivity to prediction errors and lower values indicating low sensitivity to prediction errors. τ (reinforcement sensitivity) is an inverse temperature parameter associated with the value functions assigned to each stimulus. It was plugged into a softmax rule used to determine the probability (p) of choosing stimulus k on trial t:

$$p_{k,t} = \frac{\exp(\tau Q_{k,t})}{\sum_{i=1}^{2} \exp(\tau Q_{i,t})} \quad \text{(Equation 1.2)}$$

τ determined the extent to which participants' actions were driven by the Qt associated with the chosen stimulus. A high τ leads to more exploitative behaviour, whereby a participant chooses mostly to maximise reward (i.e., selects the choice with the higher Q value more often); a low τ enables more exploratory behaviour (reduced selection of the choice associated with the higher Q value).
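The following is a minimal R sketch of Model 1, simulating choices from the Rescorla-Wagner update (Equation 1.1) and the softmax rule (Equation 1.2); the reward probabilities are illustrative placeholders:

```r
# Softmax choice (Equation 1.2): pick stimulus 1 or 2 given values Q and tau.
softmax_choice <- function(Q, tau) {
  p <- exp(tau * Q) / sum(exp(tau * Q))
  sample(seq_along(Q), 1, prob = p)
}

# Simulate one participant on a 2-choice probabilistic task.
simulate_model1 <- function(n_trials, alpha, tau, p_reward = c(0.8, 0.2)) {
  Q <- c(0, 0)                          # initial value functions
  choices <- rewards <- integer(n_trials)
  for (t in seq_len(n_trials)) {
    k <- softmax_choice(Q, tau)
    R <- rbinom(1, 1, p_reward[k])      # probabilistic outcome
    Q[k] <- Q[k] + alpha * (R - Q[k])   # prediction-error update (Eq. 1.1)
    choices[t] <- k; rewards[t] <- R
  }
  data.frame(trial = seq_len(n_trials), choice = choices, reward = rewards)
}
```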

Model 2
Model 2 was as Model 1 except that α was fractionated to account separately for rewarding (αrew) and punishing outcomes (αpun), as there is evidence for different neural systems subserving learning from reward and punishment 24. Hence, this model contained 3 free parameters: αrew, αpun, and τ.
Higher values of αrew lead to greater sensitivity to, and quicker learning from, positive prediction errors (i.e., rewarding trials), while higher values of αpun result in expedited learning from negative prediction errors (punishing outcomes).
The model was fit to data to investigate valence-specific learning. The softmax function including the τ parameter was implemented similarly to Model 1.
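In code, the only change from Model 1 is the valence-dependent learning rate; a minimal sketch:

```r
# Model 2 update: the learning rate depends on the sign of the
# prediction error (positive -> alpha_rew, negative -> alpha_pun).
update_q_model2 <- function(Q, k, R, alpha_rew, alpha_pun) {
  pe <- R - Q[k]
  alpha <- if (pe >= 0) alpha_rew else alpha_pun
  Q[k] <- Q[k] + alpha * pe
  Q
}
```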

Model 3
Model 3 was identical to Model 2 but with the addition of τstim (stimulus stickiness), an inverse temperature parameter that reflects the tendency for a participant to respond to the same stimulus chosen on the previous trial regardless of the feedback received. Greater values of τstim denote an increased tendency to 'stick' to a choice, while low values represent a tendency to switch away from it; τstim thus enabled us to account for perseverative behaviour. This parameter was added to the softmax function as follows:

$$p_{k,t} = \frac{\exp(\tau Q_{k,t} + \tau_{stim}\, s_{k,t})}{\sum_{i=1}^{2} \exp(\tau Q_{i,t} + \tau_{stim}\, s_{i,t})} \quad \text{(Equation 1.5)}$$

τstim was multiplied by s, which represents whether the stimulus chosen on the current trial was the same as the one chosen on the previous trial (1 if the choice is repeated, 0 if not). If s = 1, τstim influenced the extent to which the choice was likely to be repeated; if s = 0, the function reduced to the softmax function from Model 1. Thus, this model contained 4 parameters in total: αrew, αpun, τ, and τstim.
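A minimal sketch of the sticky softmax (Equation 1.5):

```r
# s is 1 for the stimulus chosen on the previous trial and 0 otherwise,
# so tau_stim shifts choice probability toward (or away from) repetition.
sticky_softmax <- function(Q, tau, tau_stim, prev_choice) {
  s <- as.numeric(seq_along(Q) == prev_choice)
  w <- exp(tau * Q + tau_stim * s)
  w / sum(w)
}
```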

Model 4
Model 4 was as Model 3 but with 3 parameters (α, τ, τstim) as it contained a single learning rate controlling the adaptation of the value functions.

Model 5
The fifth model served to decouple acquisition (pre-reversal) and reversal via the experience decay factor parameter ρ, which enables the balance between previous experience and new information to tip increasingly in favour of past experiences.
The 'experience weight' (nc,t) of the current choice, c, reflects how often a stimulus has been chosen and is updated according to ρ:

$$n_{c,t} = n_{c,t-1}\,\rho + 1$$

The intuition behind ρ is that, over time, experience accumulated during acquisition could make reversal more difficult, leading to perseveration. ρ was allowed to range between 0 and 1: when ρ = 0, predictions are always driven by the most recent outcomes, whereas when ρ = 1 all trials are weighted equally, leading to perseveration of responses post-reversal.
The value function of a choice on every trial (similar to Qt), vc,t, is updated according to the outcome (rewarded or unrewarded), λ, and the pay-off decay factor φ, which is equivalent to the learning rate in Model 1.
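A sketch of the experience-weighted update, assuming the standard experience-weighted attraction form used in the study from which the model code was adapted 23; the exact implementation may differ:

```r
# n: experience weights; v: value functions; lambda: outcome (1/0);
# rho: experience decay factor; phi: pay-off decay factor.
update_ewa <- function(n, v, choice, lambda, rho, phi) {
  n_new <- n[choice] * rho + 1                                 # experience weight
  v[choice] <- (v[choice] * phi * n[choice] + lambda) / n_new  # value update
  n[choice] <- n_new
  list(n = n, v = v)
}
```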

Model-Fitting and Parameter Estimation
Models were fit to trial-by-trial data using a hierarchical Bayesian approach, estimating posterior distributions over group-level parameters, inter-subject variability parameters, and subject-level parameters. Inter-subject variability, σ, was sampled from half-normal prior distributions, constraining estimates to be positive:

$$\sigma_{\alpha}, \sigma_{\alpha_{rew}}, \sigma_{\alpha_{pun}}, \sigma_{\tau_{stim}}, \sigma_{\varphi}, \sigma_{\rho} \sim \text{half-Normal}(0, 0.05)$$
$$\sigma_{\tau}, \sigma_{\beta} \sim \text{half-Normal}(0, 1)$$

Subject-level parameters were sampled from normal distributions whose means were the group-level parameter values and whose variances were the inter-subject variability parameter values.
For example, in the case of αrew:

$$\alpha_{rew,\,subject} \sim \text{Normal}(\alpha_{rew,\,group},\ \sigma_{\alpha_{rew}}) \quad \text{(Equation 1.9)}$$

All priors were obtained from an earlier study which used identical models 23.
All models were fitted to data using Markov Chain Monte Carlo (MCMC) sampling implemented in RStan v. 2.21.1.
Eight randomly initialised MCMC chains were used. Convergence of the chains was confirmed using the potential scale reduction statistic R̂; a cut-off value of 1.2 23 was used to check that the chains were well mixed for each parameter.
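An illustrative fitting call, assuming the hierarchical model is written in a hypothetical Stan file "prl_model.stan" and the task data are packaged in stan_data:

```r
library(rstan)

fit <- stan(file = "prl_model.stan", data = stan_data,
            chains = 8, iter = 2000, seed = 1)

# Convergence: every parameter's potential scale reduction statistic
# should fall below the 1.2 cut-off.
rhats <- summary(fit)$summary[, "Rhat"]
stopifnot(all(rhats < 1.2, na.rm = TRUE))
```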
In addition, model-fitting was repeated to explore the effects of medication on behaviour, specifically by specifying separate prior distributions for OCD, MED- (unmedicated patients), and MED+ (medicated patients).

Parameter Recovery
As these models had previously been fitted to data obtained from a PRL task with several reversals, we conducted parameter recovery to verify the validity of the winning model (see the Model Comparison section below for how models were compared) and to confirm that the parameter values were meaningful and not occurring by chance 26. The winning model was first used to simulate synthetic data from 100 'participants', with the free parameters set to the mean fitted parameter values per group estimated from the human data. We then ascertained whether the true parameter values could be recovered by fitting the winning model to the simulated data and checking whether the true and generated parameter values fell within their corresponding 95% highest density intervals (HDIs; see the Group Differences section below for more details about HDIs).
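A compact sketch of the recovery logic, reusing the Model 1 simulator above as a stand-in for the winning model; the parameter values are illustrative placeholders:

```r
set.seed(1)
true_alpha <- 0.4; true_tau <- 3   # stand-ins for group-mean fitted values

# Simulate 100 synthetic 'participants' from the generative model.
sim_data <- lapply(1:100, function(i)
  simulate_model1(n_trials = 80, alpha = true_alpha, tau = true_tau))

# The model is then refitted to sim_data with the same MCMC pipeline, and
# recovery is confirmed if true_alpha and true_tau fall within the
# corresponding 95% HDIs of the recovered posteriors.
```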

Computational Models for WCST
The main reinforcement learning model fitted to WCST data was originally conceptualised as a sequential learning model by Bishara et al. (2010) 27. This modelling approach has successfully revealed implicit cognitive strategies employed by various neuropsychiatric patient groups, among them individuals with substance dependence 27, prefrontal cortex lesions 12, and schizophrenia 28. However, the model has yet to be implemented in research involving patients with OCD. The model contained 4 free parameters: reward rate (r, how quickly attention weights change in response to rewarding feedback), punishment rate (p, how quickly attention weights change in response to punishing feedback), decision consistency (d, how strongly deck choice is influenced by attention weights), and attentional focusing (f, which matters only on trials with ambiguous feedback and represents the degree to which the update is focused on the category/rule with the largest attention weight).
The variables fed into the model are 1) an outcome variable representing whether a trial was rewarded or not (1 or 0) and 2) a matching matrix quantifying which categories (colour, number, or shape) associated with a chosen deck match the test card. For instance, if the chosen deck matched the test card on colour but not number or shape, the matching matrix, m, for that trial would be defined as mt = (1, 0, 0), ordered as (colour, number, shape), where t refers to the current trial. Parameters r and p determine how rapidly attention weights change in response to rewarding and punishing feedback signals, respectively.
In the example above, the feedback is unambiguous as the chosen deck matches the test card on only one category.
However, in some cases where more than one category is matched (for example, both colour and shape), the feedback signal relies on the free parameter f to modulate how focused or wide the attention is for each category's feedback.
When f approaches 0, attention is split evenly between the matching categories; as f increases, the feedback signal is divided in proportion to the current attention weights, so that if the attention weight for colour is higher than that for shape, colour receives the larger share of the signal. The feedback signal is therefore jointly modulated by the attention weights and the matching matrix.
When the outcome on the current trial is correct, the feedback signal is computed using only the matching attention weights; when the outcome is incorrect, only the non-matching attention weights contribute to the feedback signal.
Finally, the probability of choosing a specific deck is defined as

$$p_{k,t} = \frac{(m'_{k,t}\,a_t)^d}{\sum_{j=1}^{4} (m'_{j,t}\,a_t)^d}$$

where the d parameter influences the predicted probability of choosing each deck on each trial. As d becomes lower, choices become more random and less dependent on attention weights (more exploratory); as d becomes higher, choices are heavily constrained by attention weights (more exploitative). m′ is simply the matching matrix, m, transposed to enable matrix multiplication (dot product) with the attention weight vector, a.
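A minimal R sketch of the model's two core computations, assuming the sequential learning formulation of Bishara et al. (2010) 27 (variable names are illustrative):

```r
# Choice rule: a = attention weights over (colour, number, shape);
# m = 4 x 3 matching matrix (decks x dimensions); d = decision consistency.
wcst_choice_probs <- function(a, m, d) {
  match_strength <- as.vector(m %*% a)      # m'a for each deck
  match_strength^d / sum(match_strength^d)
}

# Attention update after feedback: the signal is focused by f on the
# matching (correct) or non-matching (incorrect) dimensions, then the
# weights move toward it at rate r (reward) or p (punishment).
wcst_update_attention <- function(a, m_chosen, correct, r, p, f) {
  rel <- if (correct) m_chosen else 1 - m_chosen  # relevant dimensions
  s <- (a * rel)^f * rel                          # f = 0: even split
  s <- s / sum(s)
  rate <- if (correct) r else p
  (1 - rate) * a + rate * s
}
```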
The full model described above with 4 free parameters was compared to 3 degenerate models. Each degenerate model had one parameter fixed to assess the contribution of each parameter to capturing behaviour on the task.
The first alternative model (RPD0) fixed the f parameter at 0, the second alternative model (RP1F) fixed d at 1, and the final alternative model (RRDF) assumed a single common learning rate for both reward and punishment.
Model code was adapted from a prior study 27 .

Model-Fitting and Parameter Estimation
As with the PRL models, the WCST models were fit to trial-by-trial behavioural data using a hierarchical Bayesian approach. Individual subject parameters were sampled from normal distributions, and the mean and variance of these distributions were sampled from the group-level and inter-subject variability distributions.
Computation of the posteriors was conducted using Markov chain Monte Carlo (MCMC) sampling in the JAGS software 29. Four randomly initialised MCMC chains were run during model-fitting. Convergence was checked using the potential scale reduction statistic R̂.
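An illustrative JAGS fit, assuming the model specification lives in a hypothetical file "wcst_model.txt" and the data are packaged in jags_data:

```r
library(rjags)
library(coda)

jm <- jags.model("wcst_model.txt", data = jags_data,
                 n.chains = 4, n.adapt = 1000)
samples <- coda.samples(jm, variable.names = c("r", "p", "d", "f"),
                        n.iter = 5000)
gelman.diag(samples)   # potential scale reduction statistic (R-hat)
```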
Lastly, model-fitting was repeated to explore the effects of medication on behaviour, and group differences between CTL, MED-, and MED+ were analysed. Parameter recovery was not run here because the winning model had already been fully validated in a previous study 12.

Model Comparison
Models for both tasks were compared using bridge sampling via the "bridgesampling" R package 30. This method enables selection of the best-fitting model by accounting for the prior probability and the marginal likelihood of each model (the likelihood of the data given a specific model). The marginal likelihood is calculated by integrating, over the parameter space, the product of 1) the likelihood of the data given the model's parameters (how well the model fits the data) and 2) the prior probability of the parameters given the model, which penalises over-complex models and guards against overfitting. We assumed equal prior probability for each model per task.
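A sketch of the comparison, assuming fit1 and fit2 are stanfit objects for two competing models:

```r
library(bridgesampling)

ml1 <- bridge_sampler(fit1)   # log marginal likelihood, model 1
ml2 <- bridge_sampler(fit2)   # log marginal likelihood, model 2
bf(ml1, ml2)                  # Bayes factor for model 1 over model 2
post_prob(ml1, ml2)           # posterior model probabilities (equal priors)
```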

Group Differences
Posterior distributions of parameters from both tasks' winning models were interpreted using the 95% and 90% highest posterior density intervals (HDIs), also known as Bayesian credible intervals. All values within the interval have a higher probability density (i.e., higher credibility) than any value outside it. Parameter comparisons between groups were calculated by subtracting the posterior distributions of the CTL group-level parameters from the posterior distributions of the OCD group-level parameters, generating the group mean difference per parameter. This procedure was also done for MED- vs MED+ vs CTL. The 95% and 90% HDIs of the posterior distribution of each group mean difference were calculated and inspected to check whether they reliably included zero (indicating no difference between groups). Multiple-comparisons corrections were not applied, since a strength of Bayesian hierarchical models is that they make comparisons more conservative through "shrinkage" of estimates drawn from a higher-level distribution; this pulls estimates closer together, making it more likely that the HDIs of group differences include 0 31.
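For a single parameter, the comparison could look as follows, assuming hypothetical vectors of MCMC draws for each group's group-level parameter:

```r
library(HDInterval)

diff_draws <- draws_ocd - draws_ctl   # posterior of the group difference
hdi(diff_draws, credMass = 0.95)      # credible if the interval excludes 0
hdi(diff_draws, credMass = 0.90)
```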

Standard PRL Results
Visualisations for the PRL group results are shown in eFigures 1 and 2. The mixed regression analyses were repeated controlling for age, gender, and IQ. These results are summarised in eTable 1.

eTable 4. Standard Regression Results for PRL Without Covariates (CTL vs MED- vs MED+)

Modelling Results For Participants that Completed Both Tasks
We also modelled data obtained from the 20 OCD and 17 CTL participants who completed both the WCST and the PRL task. These exploratory analyses revealed that, on the PRL task, the OCD group showed credibly higher reward learning rates (within a 90% HDI) and lower punishment learning rates (within a 95% HDI) than CTL, but the groups showed no differences on the reinforcement sensitivity and stickiness parameters. On the WCST, there were no differences between groups on any of the 3 parameters. See eFigure 5 for visualisation.

eFigure 7. Modelling Results From 20 OCD and 17 CTL
Note: Error bars in red indicate differences in posterior distributions between OCD and CTL are credible (not including 0) within a 95% HDI (highest density interval), while error bars in yellow indicate credible differences within a 90% HDI.