AI in Surgical Curriculum Design and Unintended Outcomes for Technical Competencies in Simulation Training

Key Points Question Is use of artificial intelligence (AI) in a simulated surgical skills curriculum associated with unintended performance outcomes? Findings This cohort study of 46 medical students and 14 experts found that the AI-enhanced simulation curriculum demonstrated significant unintended changes in 52 performance metrics outside the curriculum that were not observed in the control cohort. These unintended changes included significantly improved procedural safety (eg, healthy tissue damage) but significantly worsened movement (eg, dominant hand velocity) and efficiency (eg, rate of tumor removal) metrics. Meaning This cohort study found that use of AI in designing a simulated surgical skills curriculum was associated with unintended learning outcomes, with both positive and negative consequences for learner competency, suggesting that intervention from human experts may be required to optimize educational goals.


Introduction
In surgical education, technical competency is one of the key factors of efficacy in the curriculum, and it is an independent factor that is directly associated with postoperative patient outcomes.1-5 An intelligent tutoring system is a pedagogical tool powered by an AI model that can provide learners with tailored performance assessment and feedback.6 Developing such systems requires large amounts of data from users with varying levels of skill on standardized procedures, which can be obtained from virtual reality simulators.4,7 Simulation is an important component of competency-based medical education.8 Technical skills acquired in simulation training have been demonstrated to improve operating room performance and lead to better health outcomes for patients.9,10 Curriculum competencies are determined by a committee of subject matter experts, academics, educators, and researchers through a transparent and evidence-based approach.11 With the vast amount of performance and neurophysiologic data collected during virtual reality simulation and the ability to apply AI technology, AI tutors using machine learning algorithms may be useful tools to provide novel educational insights and define quantifiable competency metrics.3,12 It is therefore warranted to apply the same standards of transparency and academic rigor in evaluating the competencies selected and taught by AI systems.
In our previous work,4 an AI-enhanced curriculum was developed to teach a neurosurgical technique to medical students in virtual reality simulation. The learning objectives of this curriculum were chosen by an algorithm, a support vector machine, that analyzed a pool of 270 metrics and selected the 4 that were most significantly associated with expertise.3,4 These included 2 metrics related to safety (bleeding rate and maximum force applied by the nondominant hand) and 2 metrics related to movement (maximum acceleration of the nondominant hand and the instrument tip separation distance).6 Feedback on these metrics was delivered by the Virtual Operative Assistant (VOA), an intelligent tutoring system that evaluates learners' competency level in safety and movement and provides personalized post hoc audiovisual feedback.4 In a randomized clinical trial, the effectiveness of this AI-enhanced curriculum was compared with remote instruction by experts on the technical skills of medical students in simulation training.13 The findings of this study suggested that feedback on AI-selected learning objectives could have an effect beyond the performance criteria taught by the AI tutor.14 However, the scale and mechanism of this extended effect have not been previously reported in simulation training, and it remains unclear whether such effects are beneficial or detrimental to students' skill.
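The metric-selection step described above can be illustrated with a small sketch: a linear support vector machine is fit to labeled procedures, and candidate metrics are ranked by the magnitude of their weights in the separating hyperplane. This is a simplified, hypothetical stand-in for the study's actual pipeline; all names, dimensions, and data below are invented for illustration.

```python
# A minimal sketch of SVM-based metric selection, assuming each row is one
# simulated procedure described by candidate metrics and labeled expert (1)
# or novice (0). All names and data here are illustrative, not the study's.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)
n_experts, n_novices, n_metrics = 14, 46, 20  # scaled down from the 270-metric pool

# Simulate metric scores: pretend the first 4 metrics separate the groups
X_experts = rng.normal(0.0, 1.0, (n_experts, n_metrics))
X_experts[:, :4] += 2.0
X_novices = rng.normal(0.0, 1.0, (n_novices, n_metrics))
X = np.vstack([X_experts, X_novices])
y = np.array([1] * n_experts + [0] * n_novices)

# Fit a linear SVM and rank metrics by the magnitude of their hyperplane weights
model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10_000))
model.fit(X, y)
weights = np.abs(model.named_steps["linearsvc"].coef_.ravel())
top4 = np.argsort(weights)[::-1][:4]
print("selected metric indices:", sorted(top4.tolist()))
```

In this toy setup, the highest-weight metrics tend to be the ones constructed to differ between groups, mirroring how the study's algorithm surfaced the 4 learning objectives from the larger pool.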
To assess the pedagogical value of AI-selected competencies, we explored their extended associations with other performance criteria and investigated how these changed students' competency compared with the level of skilled surgeons. To do so, we used a cohort study design, following up the medical students exposed to the VOA and comparing their performance outcomes with those of a control cohort and a skilled cohort. We hypothesized that medical students exposed to the VOA would demonstrate significant changes in several performance criteria related to the AI-selected competencies, and that the observed extended effects were partially responsible for achieving expert performance benchmarks.

Methods

Study Participants
We conducted a planned secondary analysis using retrospective data from 2 previously conducted studies: a cohort study of 14 experts and a randomized clinical trial involving 46 medical students. In the first study, skilled consultants performed a simulated neurosurgical procedure with no feedback to establish expert benchmarks.3 In the second study, medical students were randomized into 2 groups and learned to perform the same task with or without instruction from the AI-enhanced curriculum.14

Study Procedure and Simulation
All participants performed 5 simulated neurosurgical tumor resection procedures within a fixed time and either received an intervention (VOA group) or no intervention (control and skilled groups) after the completion of each attempt.3,14 Participants received standardized verbal and written instructions on the simulated task, the goals of the procedure, and the instruments used.
Additionally, they performed an orientation module to learn to navigate the 3-dimensional virtual reality space and test each instrument's functions. The intervention for the VOA group involved receiving post hoc audiovisual instructions on 4 learning objectives, tailored to the competencies each learner had not yet achieved. This curriculum follows a stepwise competency assessment in which learners must first achieve expert classification for safety metrics in step 1 before moving to step 2 to learn instrument movement metrics.14 Procedures were performed on the NeuroVR (CAE Healthcare) virtual reality simulator, which records the state of 54 variables in the operation, such as tumor size or the 3-dimensional position of each instrument, at a 50 Hz rate (t = 20 ms).16,17 The raw data were collected and used to generate assessment metrics.
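Because the simulator samples at 50 Hz, one tick t equals 20 ms, which is why kinematic metrics later in this article carry units of mm/t. A small sketch shows how an instrument-tip speed in mm/t could be derived from the sampled positions; this is illustrative only, not the simulator's actual code.

```python
# A minimal sketch of deriving a kinematic metric from raw 50 Hz samples,
# where one simulator tick t = 20 ms. Illustrative only.
import numpy as np

def speed_per_tick(positions_mm):
    """Instrument tip speed in mm per tick (mm/t) from 3-D positions sampled at 50 Hz."""
    pos = np.asarray(positions_mm, dtype=float)  # shape (n_samples, 3)
    steps = np.diff(pos, axis=0)                 # displacement between consecutive ticks
    return np.linalg.norm(steps, axis=1)         # mm traveled per 20 ms tick

# Hypothetical trajectory: the tip advances 0.3 mm along x at every tick
track = np.column_stack([np.arange(10) * 0.3, np.zeros(10), np.zeros(10)])
print(speed_per_tick(track))  # nine values of 0.3 mm/t
```

Summary statistics of such per-tick series (means, maxima, accelerations from a second difference) are the kind of quantities that populate the 270-metric pool.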

Performance Metric Extraction and Expertise Benchmarks
Raw data from the initial and final attempts were used to extract 270 assessment metrics for each procedure. These metrics were selected based on their ability to differentiate experts from novices and are equally divided into 3 groups depending on the state of the operation: during tumor resection, while suctioning blood, or over the entire scenario.3 Because of large variability in the duration and amount of blood loss, 90 metrics from the suctioning blood state were excluded from analysis. Data from the skilled group were used to determine an expertise benchmark for each metric, defined as the mean score ± 1 SD, following previously published protocols.18,19 The benchmark provides a reference to evaluate whether the extended changes in medical student performance metrics had a positive or negative impact on their competency. All metrics with a significant within-participant difference from baseline in the VOA group that did not significantly change in the control group were the primary outcome measures.
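The mean ± 1 SD benchmark rule can be sketched in a few lines. The scores below are hypothetical, not the study's data; the function names are ours.

```python
# A minimal sketch of the benchmark rule described above (mean ± 1 SD of the
# skilled group's scores on one metric), using hypothetical data.
import numpy as np

def expertise_benchmark(expert_scores):
    """Return the (lower, upper) benchmark band: mean ± 1 sample SD."""
    scores = np.asarray(expert_scores, dtype=float)
    mean, sd = scores.mean(), scores.std(ddof=1)
    return mean - sd, mean + sd

def within_benchmark(score, band):
    """True if a learner's score falls inside the expert band."""
    lower, upper = band
    return lower <= score <= upper

# Hypothetical dominant-hand velocity scores (mm/t) from 14 skilled surgeons
experts = [0.24, 0.31, 0.27, 0.29, 0.22, 0.30, 0.26, 0.28,
           0.25, 0.33, 0.27, 0.29, 0.23, 0.28]
band = expertise_benchmark(experts)
print(within_benchmark(0.27, band))  # a score near the expert mean falls inside
```

A learner's metric is then judged against this band, which is how extended changes can be labeled as converging toward or diverging from expert performance.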

Statistical Analysis
Collected data were examined for normality with the Shapiro-Wilk test, and the Wilcoxon signed rank test was used when t test assumptions were not satisfied. Outliers were identified using boxplot analysis, and the Levene test was used to assess variance. In procedures in which there was no blood loss, metrics related to bleeding were not applicable. Any metric in any group with more than half the data missing was excluded from the statistical analysis. Within-participant comparison of medical students' performance on 180 metrics in both the VOA and the control group was performed by 2-sided paired samples t test (α = .05) to identify metrics that demonstrated a significant change between baseline (first attempt) and after the intervention (fifth attempt). Between-participants comparison of performance metrics at baseline and after the intervention was conducted with independent samples t tests. P values were not adjusted for multiple testing to avoid being overly conservative and subsequently missing important treatment outcomes.20,21 Data were analyzed using MATLAB release 2022a (The MathWorks) and SPSS Statistics version 28 (IBM) statistical software from June to September 2022.
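The within-participant procedure above can be sketched for a single metric: test the paired differences for normality, then apply the paired t test or fall back to the nonparametric alternative. This is a hedged illustration using SciPy, with hypothetical data, not the study's SPSS/MATLAB code.

```python
# A minimal sketch of the within-participant testing described above: check
# normality of the paired differences with Shapiro-Wilk, then apply a paired
# t test or fall back to the Wilcoxon signed rank test. Data are hypothetical.
import numpy as np
from scipy import stats

def paired_change_test(baseline, final, alpha=0.05):
    diffs = np.asarray(final, dtype=float) - np.asarray(baseline, dtype=float)
    if stats.shapiro(diffs).pvalue > alpha:     # differences look normal
        res = stats.ttest_rel(baseline, final)  # 2-sided paired samples t test
        name = "paired t"
    else:
        res = stats.wilcoxon(baseline, final)   # nonparametric fallback
        name = "Wilcoxon signed rank"
    return name, res.pvalue, res.pvalue < alpha

rng = np.random.default_rng(0)
first = rng.normal(0.40, 0.05, size=23)          # hypothetical first-attempt scores
fifth = first - rng.normal(0.10, 0.02, size=23)  # hypothetical fifth-attempt scores
print(paired_change_test(first, fifth))
```

Running this per metric, per group, and flagging metrics significant in the VOA group but not in the control group reproduces the logic of the primary outcome definition.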

Results
All 46 medical students (median [range] age, 22 [18-27] years) were in their second year of medical school and had minimal surgical experience.14 The skilled group had a median (range) of 12.5 (1-25) years of practice experience as neurosurgical staff, with most participants (9 participants [64%]) primarily involved in cranial surgery.3 Further participant demographic information is presented in the Table. Participants in the VOA group responded successfully to performance feedback and achieved expert benchmarks in the curriculum's learning objectives (Figure 1). At the end of the AI-enhanced curriculum, learners in this group demonstrated significant performance change in 42 metrics during the entire procedure and 20 metrics during tumor resection. Within these metrics, control participants also demonstrated significant performance change in 10 metrics during the entire procedure and no metrics during tumor resection. Therefore, instruction on 4 AI-selected learning objectives was associated with a significant extended performance change in 32 metrics over the entire procedure and 20 metrics during tumor resection that was not observed in the control group.
Complete analyses of all affected metrics in the 2 conditions are provided in eTable 1 and eTable 2 in Supplement 1.
Damage to healthy brain tissue is an important safety metric in neurosurgical tumor resection surgery.22 In this study, participants in the VOA group demonstrated a significant reduction in the rate of healthy tissue damage. VOA participants also demonstrated significant changes in movement metrics of their dominant hand. Notably, by the end of the AI curriculum, students performed with a significantly lower velocity (mean difference, −0.13 [95% CI, −0.17 to −0.09] mm/t; P < .001) and lower acceleration (mean difference, −2.25 × 10−2 [95% CI, −3.20 × 10−2 to −1.31 × 10−2] mm/t2; P < .001) in their dominant hand compared with the control group. This unintended effect persisted over the whole procedure and during tumor resection, and it diverged performance away from the lower threshold of the expertise benchmark in both metrics (dominant hand velocity: mean [SD], 0.27 [0.04] mm/t; dominant hand acceleration: 6.35 × 10−2 [1.02 × 10−2] mm/t2) (Figure 3A-B).

Discussion
This cohort study is, to our knowledge, the first study to demonstrate unintended surgical skill acquisition with both positive and negative consequences following an AI-enhanced curriculum. In the competency-based framework of postgraduate medical training, AI provides a tool to identify and teach quantifiable metrics of expertise. Harnessing this power in neurosurgical simulation led our team to design and test the first curriculum with AI-selected competencies. The previous randomized trial involving medical students compared the efficacy of AI tutoring with remote expert instruction using performance outcomes that included only the 4 metrics taught by the VOA.14 This study focused only on the cohort exposed to the AI-enhanced curriculum and explored 270 novel aspects of their performance to investigate the extent of this mode of instruction. This study also builds on the previous report that identified metrics of expertise by comparing medical students' performance outcomes with computed expertise benchmarks. Similar to other clinical learning environments that result in formal and informal learning,23,24 this novel curriculum demonstrated both intentional and unintentional learning outcomes.
Metrics affected by this AI-enhanced curriculum fall into 1 of 3 categories. The first group includes the intrinsic metrics taught by the intelligent tutoring system. Because the feedback provided by the tutor was focused on these preselected learning objectives, observing the expected outcome provides evidence for feedback efficacy. The second group involves implicit metrics, such as instrument divergence or changes in bipolar force, that despite receiving no direct feedback demonstrated significant change by virtue of their close association with the intrinsic learning objectives. The last group comprises extrinsic metrics, such as the rate of healthy tissue removal, that have a more complex relationship with the intended learning objectives that cannot be easily inferred.
Among extrinsic metrics, changes in dominant hand movement were an interesting observation because the students were instructed by the tutor to monitor the acceleration of their nondominant hand and to keep their instruments close together.26,27 Our results demonstrate an interesting intermanual skill transfer whereby training the nondominant hand was associated with movement changes of the dominant hand. The underlying mechanism behind this observation is not clearly understood; however, functional asymmetry in the brain may offer an explanation. A study using functional near-infrared spectroscopy in individuals learning a manual skill with their nondominant hand showed differences in participants' cortical activation patterns with nondominant hand training, notably bilateral premotor cortex activation, compared with those who trained with their dominant hand.28 This suggests that perhaps the awareness required to control nondominant hand movements results in a change of movements of the dominant hand due to functional dominance and network asymmetry between hemispheres.28,29 It is notable that all 4 AI-selected learning objectives required expert nondominant hand performance; 2 were direct measurements from the bipolar forceps, while the other 2 required bimanual control.
It is possible that the modality of feedback presentation contributed to the extended effects observed in this study. As part of the VOA feedback, participants were exposed to four 60-second videos of both competent and novice performance covering each learning objective. Although instructions in each video were specific to 1 metric, participants could potentially gain further information to improve their performance by seeing the expert demonstrations. However, this fails to explain the divergence from benchmarks in some efficiency and movement metrics.
The extended effects of this AI curriculum show both productive and counterproductive changes to skill acquisition. Although VOA's unintentional learning outcomes resulted in students scoring within expert safety benchmarks for this procedure, they also diverged some efficiency metrics away from the experts' benchmarks. In general, VOA participants demonstrated a safer approach with more focused and steady control of the instruments, which also resulted in less healthy tissue damage. However, they became significantly slower in movements of their dominant hand and were less efficient in removing the tumor. Such unintended effects must be considered carefully and require a cost-benefit evaluation by experts to determine the acceptability of an intelligent tutoring system. Furthermore, given the extent of unintended outcomes observed in this study, future studies may benefit from measuring the slope of each metric's learning curve to investigate the learning rate elicited by an AI-enhanced curriculum using extrinsic performance outcomes.
One method to evaluate the gains and losses of skill due to unintended effects is to refer to participants' previously reported expert-rated Objective Structured Assessment of Technical Skills. These results demonstrated that the VOA group achieved a significantly higher overall performance score and a higher economy of movement compared with the control group.14 This is interesting because despite the observed loss of efficiency in rate of tumor removal and slow bimanual movements, blinded expert evaluators still rated the VOA participants as having a better economy of movement than the control group. This discrepancy demonstrates 1 limitation of a highly granular assessment of expertise based on individual metrics, because performance in each domain is composed of the interaction among its multiple constituent metrics. Perhaps with increased capacity to collect surgical performance data, future studies can use advanced AI algorithms to redefine subjective areas of performance (eg, economy of movement or flow) as functions consisting of the interaction between multiple metrics and project performance on the Objective Structured Assessment of Technical Skills scale. This would not only provide instructors with a more meaningful understanding of their students' performance but also focus their assessment and feedback on specific measurable criteria.

Future Directions
Rolling out intelligent systems for patient care or formal postgraduate training should occur in phases and requires the same rigor of scientific practice required for pharmacologic therapies or other medical devices.35,36 AI applications will continue to expand in medical education and hold promise to enhance health care by providing timely data-driven analytics.37 However, we need to be cognizant of the potential for their unintended effects and promote transparency to maximize beneficial outcomes for learners and patients.

Limitations
This study has some limitations. One limitation is the number of surgeons in the skilled group and their range of experience in neuro-oncological surgery. However, most participants in the skilled group primarily practiced cranial neurosurgery, and subpial resection is a fundamental technique that is mastered throughout postgraduate training.38 Furthermore, previously validated models trained on this sample could not only distinguish expertise but also project residents' postgraduate training year in neurosurgery.3,5,39 The results of a previous study demonstrated that VOA participants could achieve a greater overall expertise score than a junior resident but not that of a senior resident or a staff surgeon.5,14 Whether this method of training results in more realistic performance goals for junior learners or delays the mastery learning curve warrants longitudinal studies investigating the long-term effects of exposure to AI-enhanced curricula.
Students in the VOA group were limited to only post hoc performance feedback provided by the AI tutor, which lacks important contextual and directional components. For example, unidirectional instructions to reduce force or acceleration at the end of each attempt are likely to result in cumulative rather than time-specific behavioral changes.43,44 Furthermore, pausing the operation interrupts negative momentum and provides an opportunity for immediate reflection on poor performance, which has greater learning value and can prevent intraoperative errors.45 To address this, we developed an Intelligent Continuous Expertise Monitoring System capable of error projection and real-time feedback delivery.5 This system's efficacy is being evaluated and compared with live instruction from expert tutors in a randomized clinical trial.46,47 This trial will be able to determine the value of pausing live performance to deliver real-time instructions and investigate whether similar unintended effects are observed with continuous performance feedback.

Conclusion
This cohort study found that learning bimanual surgical skills in simulation with metric-specific instructions on AI-selected competencies was associated with unintended changes in other competency domains. These extrinsic outcomes had both positive and negative associations with learners' expertise level compared with the skilled consultants' benchmarks. Considering these unintended changes, assessment of AI-enhanced curricula requires a cost-benefit evaluation by subject matter experts to determine their acceptance in formal training. Finally, this AI-enhanced curriculum may benefit from contextual and timely feedback, either from a live instructor or a real-time performance assessment system.

Figure 1. Performance in the Learning Objectives of the Virtual Operative Assistant (VOA) Curriculum. The skilled group's mean and SD in each metric are represented by the solid and dashed lines, respectively. Because the variables are recorded at a frequency of 50 Hz by the simulator, each unit of time (t) is equal to 20 ms. Error bars indicate 95% CIs.

Figure 2. Extended Associations of the Virtual Operative Assistant (VOA) Curriculum With Safety Competencies

Figure 3. Extended Associations of the Virtual Operative Assistant (VOA) Curriculum With Movement and Efficiency Metrics

Both studies were approved by the McGill University Health Centre Research Ethics Board, Neurosciences-Psychiatry, and all participants signed an approved informed consent form before trial participation. This article follows the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline and the Best Practices for Machine Learning to Assess Surgical Expertise.15

Table. Demographic Characteristics of Participants

Abbreviations: AI, artificial intelligence; NA, not applicable.
a Range, 1 to 5, with higher scores indicating greater motivation to pursue a surgical specialty in postgraduate training.