Key Points
Can a machine learning algorithm differentiate participants according to their stage of practice in a complex simulated neurosurgical task?
In this case series study, 50 individuals (14 neurosurgeons, 4 fellows, 10 senior residents, 10 junior residents, and 12 medical students) participated in 250 simulated tumor resections. A K-nearest neighbor algorithm achieved an accuracy of 90% using 6 performance features; 2 neurosurgeons, 1 fellow or senior resident, 1 junior resident, and 1 medical student were misclassified.
The findings suggest that machine learning algorithms may be capable of classifying surgical expertise with greater granularity and precision than has been previously demonstrated in surgery.
Despite advances in the assessment of technical skills in surgery, a clear understanding of the composites of technical expertise is lacking. Surgical simulation allows for the quantitation of psychomotor skills, generating data sets that can be analyzed using machine learning algorithms.
To identify surgical and operative factors selected by a machine learning algorithm to accurately classify participants by level of expertise in a virtual reality surgical procedure.
Design, Setting, and Participants
Fifty participants from a single university were recruited between March 1, 2015, and May 31, 2016, to participate in a case series study at the McGill University Neurosurgical Simulation and Artificial Intelligence Learning Centre. Data were collected at a single time point and no follow-up data were collected. Individuals were classified a priori as expert (neurosurgery staff), seniors (neurosurgical fellows and senior residents), juniors (neurosurgical junior residents), and medical students; together they performed 250 simulated tumor resections.
All individuals participated in a virtual reality neurosurgical tumor resection scenario. Each scenario was repeated 5 times.
Main Outcomes and Measures
Through an iterative process, performance metrics associated with instrument movement and force, resection of tissues, and bleeding generated from the raw simulator data output were selected by K-nearest neighbor, naive Bayes, discriminant analysis, and support vector machine algorithms to most accurately determine group membership.
A total of 50 individuals (9 women and 41 men; mean [SD] age, 33.6 [9.5] years; 14 neurosurgeons, 4 fellows, 10 senior residents, 10 junior residents, and 12 medical students) participated. Neurosurgeons were in practice between 1 and 25 years, with 9 (64%) involving a predominantly cranial practice. The K-nearest neighbor algorithm had an accuracy of 90% (45 of 50), the naive Bayes algorithm had an accuracy of 84% (42 of 50), the discriminant analysis algorithm had an accuracy of 78% (39 of 50), and the support vector machine algorithm had an accuracy of 76% (38 of 50). The K-nearest neighbor algorithm used 6 performance metrics to classify participants, the naive Bayes algorithm used 9 performance metrics, the discriminant analysis algorithm used 8 performance metrics, and the support vector machine algorithm used 8 performance metrics. Two neurosurgeons, 1 fellow or senior resident, 1 junior resident, and 1 medical student were misclassified.
Conclusions and Relevance
In a virtual reality neurosurgical tumor resection study, a machine learning algorithm successfully classified participants into 4 levels of expertise with 90% accuracy. These findings suggest that algorithms may be capable of classifying surgical expertise with greater granularity and precision than has been previously demonstrated in surgery.
Despite technological advances in artificial intelligence and machine learning, delivery of health care is mediated largely by direct interaction between physician and patient. This scenario is particularly true for surgical interventions, which carry substantive patient risks and increased costs to health care systems.1 As a consequence, the burgeoning field of surgical data science represents efforts to improve interventional health care through increased data collection, quantification, and analysis.2 Similarly, the use of virtual reality simulators has been explored as a means of providing objective assessment of technical ability in medicine, with the added benefit of retaining realism, pathology, and active bleeding states in a controlled laboratory setting. These systems generate vast data sets that quickly challenge traditional statistical methods. Artificial intelligence and machine learning systems lend themselves well to the analysis of large data sets generated in surgical procedures in 2 important ways: first, by uncovering previously unrecognized patterns, they can expand the understanding of the composites of technical expertise and surgical error, and second, by grouping participants according to technical ability, they offer novel avenues for training and feedback in health care.
We sought to study the operative factors selected by a series of machine learning algorithms to most accurately classify participants by level of expertise in a virtual reality surgery. An advanced high-fidelity neurosurgical simulator allows participants to conduct a complex open neurosurgical brain tumor resection task in a risk-free environment.3,4 Our group has extensive experience in virtual reality surgical simulation; several studies have demonstrated that performance measures obtained from simulation can differentiate technical skills both between and within groups of expertise.5-9 Given the task complexity and the abundance of data generated during the simulated operation, we hypothesized that machine learning algorithms could identify previously unrecognized performance measures, as well as differentiate participants according to their stage of medical practice.
All neurosurgeons, neurosurgical fellows, and neurosurgical residents from a single Canadian university were invited between March 1, 2015, and May 31, 2016, to participate in the trial. Medical students rotating on a neurosurgical service or having expressed interest in being contacted for trials were invited. Data were collected at a single time point and no follow-up data were collected. Participants were classified a priori as expert (neurosurgery staff), seniors (neurosurgery fellows and residents in years 4-6), juniors (neurosurgery residents in years 1-3), and medical students. All participants signed an approved Montreal Neurological Institute and Hospital Research Ethics Board consent form before trial participation. All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Declaration of Helsinki.10 The study received local ethics board approval at the Montreal Neurological Institute and Hospital. This report is structured according to guidelines for best practices in reporting studies on machine learning to assess surgical expertise in virtual reality simulation.11,12
The NeuroVR (CAE Healthcare) is a high-fidelity neurosurgical simulator designed to recreate the visual and haptic experience of resecting a human brain tumor through an operative microscope. The platform was developed in 2012 by a team from the National Research Council of Canada in collaboration with an advisory network of surgeons from 23 Canadian and international teaching hospitals.4 Care was taken to provide the most realistic sensory feedback for the user by incorporating physical properties of human primary brain tumors.4 As such, the attention to detail and resources used in the creation of the NeuroVR make it one of the most advanced high-fidelity simulators available for neurosurgery.3
The Virtual Reality Tumor Resection Task
The trial was carried out at the McGill Neurosurgical Simulation and Artificial Intelligence Learning Centre in a controlled laboratory environment void of distractions. A human intrinsic subpial brain tumor resection task was designed by neurosurgeons with extensive experience in neuro-oncologic and epilepsy neurosurgery. The subpial technique is a challenging bimanual psychomotor skill acquired in neurosurgery and is primarily used in epilepsy and oncologic surgery, where preservation of adjacent eloquent structures is of paramount importance.13 Participants were given written and verbal instructions that the goal of the scenario was removal of the cortical tumor using the ultrasonic aspirator without damaging adjacent normal brain tissue and vessels. A bipolar instrument could be used to lift and retract the pial membrane to gain access to the tumor and cauterize possible bleeding points. Participants performed the scenario 5 times; however, for the analysis these tasks were averaged and not treated separately. The duration of the resection procedure was limited to 3 minutes.14 Video 1 is a sample video of the task and Video 2 is a 3-dimensional tumoral reconstruction.
Raw Data Obtained From the Simulator
After each trial, the NeuroVR provides a comma separated value (CSV) file containing, in 20-millisecond increments, the activation, force applied, tip position, and angle of each instrument; the volume of tumor and surrounding healthy tissues removed; blood loss; and whether a given instrument was in contact with the tumor, a blood vessel, or healthy tissue. MATLAB, release 2018a (The MathWorks Inc) was used to process the data into operative performance metrics that can be used by a machine learning algorithm. Interpolation was used to render the data regular and fill occasional missing data points (due to slight fluctuations in computer processing). eFigure 1 in the Supplement has further examples.
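The resampling step can be illustrated with a minimal Python sketch. It assumes linear interpolation onto a regular 20-millisecond grid; the source does not state which interpolation method was used, and the function name `resample_linear` and the sample values are hypothetical rather than taken from the study's MATLAB code.

```python
from bisect import bisect_left


def resample_linear(times_ms, values, step_ms=20):
    """Linearly interpolate irregular samples onto a regular time grid.

    times_ms must be strictly increasing integers (milliseconds).
    Linear interpolation is an assumption made for this sketch.
    """
    if len(times_ms) != len(values) or len(times_ms) < 2:
        raise ValueError("need at least two (time, value) samples")
    grid, resampled = [], []
    t = times_ms[0]
    while t <= times_ms[-1]:
        i = bisect_left(times_ms, t)  # first recorded sample at or after t
        if times_ms[i] == t:
            v = values[i]  # a recorded sample exists at this grid point
        else:
            # interpolate between the recorded samples on either side of t
            t0, t1 = times_ms[i - 1], times_ms[i]
            v0, v1 = values[i - 1], values[i]
            v = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        grid.append(t)
        resampled.append(v)
        t += step_ms
    return grid, resampled


# Hypothetical force trace with the 40 ms sample missing.
grid, force = resample_linear([0, 20, 60], [0.0, 1.0, 3.0])
```

Integer millisecond timestamps are used so that the grid does not drift through floating-point accumulation.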
Performance Metric Extraction
To begin, raw data were transformed into performance metrics to be used by the algorithm, with the intention of generating operative measurements that would be easily interpretable by teachers and students of surgery. This process includes transforming instrument movement from the original x, y, and z coordinates into 3-dimensional representations of velocity (first derivative of position), acceleration (first derivative of velocity), and jerk (first derivative of acceleration), as well as the separation between instrument tips. The acceleration and tip distance variables were further refined to reflect the rate of change while the instruments were speeding up and slowing down as well as converging and diverging. The rate of change in volume of tumor and healthy tissue, as well as the rate of change of bleeding, and the number of attempts to stop bleeding were generated. Next, the aforementioned variables were extracted during 3 operative conditions: during the course of the whole scenario, during the tumor resection (ie, only when the ultrasonic aspirator was activated with decreasing tumor volume), and during blood suctioning (ie, when the ultrasonic aspirator was not active and while blood in the operative view was decreasing). Finally, the mean, median, and maximum values of all metrics in all conditions were obtained. Table 1 lists all 270 metrics generated. Examples among the total 270 possible metrics generated include mean aspirator force while resecting the tumor, maximum rate of bleeding during the course of the whole scenario, and median tip distance while suctioning blood. Performance measures of the 5 scenarios were averaged together for each participant.
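As an illustration of this transformation, the following Python sketch derives velocity, acceleration, and jerk from tip positions and then summarizes a series by its mean, median, and maximum. It assumes forward finite differences over the 20-millisecond sampling interval; the source does not specify the differentiation scheme, and all function names here are hypothetical.

```python
import math
from statistics import mean, median

DT = 0.020  # sampling interval in seconds (20 ms, as reported for the simulator output)


def derivative(samples, dt=DT):
    """Forward finite difference of a sequence of 3-D vectors (assumed scheme)."""
    return [
        tuple((b - a) / dt for a, b in zip(p, q))
        for p, q in zip(samples, samples[1:])
    ]


def magnitude(vectors):
    """Euclidean norm of each 3-D vector."""
    return [math.sqrt(sum(c * c for c in v)) for v in vectors]


def tip_distance(tips_a, tips_b):
    """Separation between the two instrument tips at each time step."""
    return [math.dist(a, b) for a, b in zip(tips_a, tips_b)]


def summarize(series):
    """Mean, median, and maximum of one metric series."""
    return {"mean": mean(series), "median": median(series), "max": max(series)}


def kinematic_summary(positions, dt=DT):
    """Summaries of speed, acceleration, and jerk magnitude for one instrument."""
    velocity = derivative(positions, dt)
    acceleration = derivative(velocity, dt)
    jerk = derivative(acceleration, dt)
    return {
        "velocity": summarize(magnitude(velocity)),
        "acceleration": summarize(magnitude(acceleration)),
        "jerk": summarize(magnitude(jerk)),
    }
```

Restricting the input to samples from one operative condition (for example, only while the aspirator is active) before calling `kinematic_summary` would give the condition-specific variants described above.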
Metric Reduction and Normalization
Metrics failing to demonstrate a significant (P < .05) difference on a 2-sided t test between any 2 groups were excluded. No corrections for multiple tests were done, as the t tests were performed for data-reductive purposes. Subsequent inclusion of the metrics in the algorithm corrects for the probability of type I error at this stage. Metrics were normalized via z score transformation to ensure optimal algorithm functioning.
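The normalization step can be sketched in a few lines of Python. This example covers only the z score transformation, not the t test filter, and it assumes the sample standard deviation (n − 1 denominator); the source does not say which variant was used, and the function name `zscore` here is simply illustrative.

```python
from statistics import mean, stdev


def zscore(values):
    """Rescale one metric to zero mean and unit standard deviation.

    Uses the sample standard deviation (an assumption for this sketch).
    A constant metric carries no information and is mapped to zeros.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical values of one metric across three participants.
normalized = zscore([1.0, 2.0, 3.0])
```

Putting every metric on the same scale matters for distance-based classifiers such as K-nearest neighbor, where a metric with a large numeric range would otherwise dominate the distance calculation.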
The following steps involve a repetitive process whereby algorithm optimization and final performance metric selection occur. The process is outlined in Figure 1. Forward (starting with 1 and increasing in number) metric selection was performed by randomly adding metrics and backward (starting with the maximum and decreasing in number) metric selection was performed by randomly removing metrics. Calculation of accuracy was accomplished by leave-1-out cross-validation. Leave-1-out validation involves training the machine learning algorithm on the entire participant data set except for 1 individual, whose group membership is then estimated. The process is repeated with different individuals excluded until all participants have been classified. The total number of correctly classified individuals represents the overall accuracy of a given algorithm. No external data set was used to obtain the algorithm accuracy.
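Leave-1-out cross-validation as described here can be sketched in Python as follows. The `classify` argument stands for any of the 4 algorithms; the 1-nearest-neighbor function included for demonstration is a hypothetical stand-in and is not one of the classifiers configured in the study.

```python
def leave_one_out_accuracy(features, labels, classify):
    """Fraction of participants classified correctly when each is held out in turn."""
    correct = 0
    for i in range(len(features)):
        # train on everyone except participant i
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


def nearest_neighbor(train_x, train_y, query):
    """Hypothetical stand-in classifier: 1 nearest neighbor, squared Euclidean distance."""
    best = min(
        range(len(train_x)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], query)),
    )
    return train_y[best]


# Toy data: one normalized metric for four hypothetical participants.
accuracy = leave_one_out_accuracy(
    [[0.0], [0.1], [1.0], [1.1]], ["a", "a", "b", "b"], nearest_neighbor
)
```

In a metric selection loop, this accuracy would be recomputed for each candidate subset of metrics, keeping the subset that scores highest.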
Four classifier algorithms were used: K-nearest neighbor, naive Bayes, discriminant analysis, and support vector machine. Parameter optimizations were carried out using functions included in MATLAB, release 2018a, as well as code written by us.15-19
A total of 50 individuals (14 neurosurgeons, 4 fellows, 10 senior residents, 10 junior residents, and 12 medical students) participated in 250 simulated tumor resections. Demographic information is presented in Table 2. Consultant neurosurgeon subspecialization covered a wide breadth of practice, with most (9 [64%]) primarily involved in cranial surgery. A total of 7 neurosurgeons (50%), 10 senior residents (69%), 6 junior residents (60%), and 3 medical students (25%) indicated that they had used a surgical simulator previously.
Machine Learning Ability to Classify Participants
The K-nearest neighbor algorithm had an accuracy of 90% (45 of 50), the naive Bayes algorithm had an accuracy of 84% (42 of 50), the discriminant analysis algorithm had an accuracy of 78% (39 of 50), and the support vector machine algorithm had an accuracy of 76% (38 of 50). Figure 2 presents details on individual misclassification. Although beyond the scope of the initial hypothesis, in response to misclassifications between medical students and neurosurgeons, the algorithm optimization process was repeated with an emphasis on preventing misclassification between neurosurgeons and medical students, with resulting accuracies ranging between 88% (44 of 50) and 72% (36 of 50). This was accomplished by allowing the algorithm optimization process to stop if no misclassifications between neurosurgeons and medical students occurred, in addition to attaining a desired accuracy. eFigure 2 in the Supplement has further information regarding the individual misclassifications of these algorithms.
Machine Learning Optimized Parameters
The final K-nearest neighbor algorithm used 2 neighbors with a cosine distance calculation. Novel data points were classified into the more skilled group in cases when 2 neighbors were from differing groups.
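A minimal Python sketch of this decision rule follows: 2 neighbors, cosine distance, and ties resolved toward the more skilled group. It is an illustration under stated assumptions rather than the study's MATLAB implementation; the group labels follow the a priori classification, and the names `SKILL_ORDER`, `cosine_distance`, and `classify_2nn` are hypothetical.

```python
import math

# Groups ordered from least to most skilled (labels as used in the a priori grouping).
SKILL_ORDER = ["medical student", "junior", "senior", "expert"]


def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v (both must be nonzero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def classify_2nn(train_x, train_y, query):
    """Label a query from its 2 nearest neighbors by cosine distance."""
    ranked = sorted(
        range(len(train_x)),
        key=lambda j: cosine_distance(train_x[j], query),
    )
    first, second = train_y[ranked[0]], train_y[ranked[1]]
    # If the neighbors disagree, pick the more skilled group;
    # if they agree, max() simply returns that shared label.
    return max(first, second, key=SKILL_ORDER.index)
```

Because cosine distance compares the direction of two metric vectors rather than their magnitude, two participants with the same relative profile across metrics are treated as close even if one has uniformly larger values.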
The best-performing naive Bayes algorithm used gaussian (normal) kernel smoothing with a width of 0.31408. The final discriminant analysis algorithm used a δ value of 0.00068926 and a γ value of 0.99808 with a pseudo-linear discriminant type. The final support vector machine algorithm used a gaussian kernel function with the formula $G(x_j, x_k) = \exp(-\lVert x_j - x_k \rVert^2)$. Box constraint was 0.12958 and kernel scale was 3.1667 using the 1-vs-all coding method (in which 1 group is compared with all others).
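For readers unfamiliar with the gaussian kernel, the following Python sketch evaluates exp(−‖x_j − x_k‖²) for two feature vectors. Dividing each feature by the kernel scale before computing the distance is an assumption about how the scale parameter enters the calculation, and the function name `gaussian_kernel` is hypothetical.

```python
import math


def gaussian_kernel(x_j, x_k, kernel_scale=1.0):
    """Return exp(-||x_j - x_k||^2) after dividing each feature by kernel_scale.

    The role of kernel_scale is an assumption made for this sketch.
    """
    squared_distance = sum(
        ((a - b) / kernel_scale) ** 2 for a, b in zip(x_j, x_k)
    )
    return math.exp(-squared_distance)


# Identical vectors give the maximum similarity of 1; similarity decays with distance.
same = gaussian_kernel([0.0, 0.0], [0.0, 0.0])
apart = gaussian_kernel([1.0, 0.0], [0.0, 0.0])
```

A larger kernel scale flattens the decay, so more distant participants still contribute to the decision boundary.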
Performance Metrics Selected by Machine Learning Algorithm
Of the 270 performance metrics generated from raw data, 122 were selected after reduction and normalization. The K-nearest neighbor algorithm used 6 performance metrics to classify participants (55, 63, 67, 123, 125, and 157), the naive Bayes algorithm used 9 performance metrics (9, 34, 39, 44, 51, 55, 60, 69, and 129), the discriminant analysis algorithm used 8 performance metrics (60, 63, 159, 168, 189, 194, 235, and 250), and the support vector machine algorithm used 8 performance metrics (9, 45, 48, 60, 106, 168, 176, and 265) (Table 1). Performance metrics selected by the algorithms spanned the following 4 principal domains: movement associated with a single instrument, both instruments used in concert, force applied by the instruments, and tissue removed or bleeding caused (Figure 3).
In this prospective study using a high-fidelity virtual reality simulated neurosurgical brain tumor resection procedure, we sought to assess whether machine learning algorithms could select performance measures to classify participants according to their level of neurosurgical training. This study comes at a time of ever-increasing time pressure facing physician-educators to balance their commitment to patients and learners.20 In parallel, in the United States the search continues for a reliable means of examining Part III of the Maintenance of Certification, namely, the assessment of knowledge, judgment, and skills unique to surgical and procedurally oriented medical specialties.21,22 Both require an objective, consistent, transparent, and defendable means of summative and formative assessments of psychomotor ability.
Simulators, while affording learners the opportunity to safely develop technical skills during the particularly dangerous and error-prone early phases of skill acquisition, do not obviate the need for learner feedback, which is often given by skilled instructors.23 Furthermore, although simulation has been incorporated into the certification process of the American Board of Surgery and the American Board of Anesthesiology, the former relies on human evaluators while the latter is meant only to stimulate self-reflection.22,24 Simulation-based technical skills training informed by artificial intelligence feedback systems may offer a solution.
As innovations in artificial intelligence continue, so do the efforts to maintain human understanding of the algorithm classification process. This field has been termed transparent or explainable artificial intelligence.25 By understanding the performance data used by the algorithm to render its decision, it is possible to design systems to deliver on-demand assessments at the convenience of the examinee and with minimal input from skilled instructors. Such systems may be subject to continuous improvement as increasing participant data are collected and integrated into the algorithm.
We found that the best-performing machine learning algorithm used as few as 6 performance metrics to successfully classify 45 of 50 participants into 1 of 4 groups of expertise. Although we chose to limit the performance measures to those that could be easily interpreted by a user, theoretically higher accuracies may be attained by including more abstruse metrics. Nevertheless, to our knowledge, no previous study using artificial intelligence to evaluate performance has demonstrated the ability to identify 4 groups in open surgery.26-37
Insofar as technical skills measured on a simulator are reflective of operating skill in the real world, our findings outline a novel approach to understanding technical expertise in surgery. Although 4 different machine learning algorithms were used, there still exists the possibility that all algorithms are overfitted to our data set, limiting their performance when faced with novel data.38 As such, these algorithms must be tested on an independent data set before making final conclusions about their accuracy. Furthermore, in 3 of 4 algorithms a single medical student was categorized as a neurosurgeon. In response to this misclassification, we sought to limit misclassifications between these 2 groups in the algorithm optimization process as a proof of concept. Although this modification came at a cost of reduced overall accuracy, explicitly preventing misclassifications between certain groups may be desirable in high-stakes certification examinations.
In addition, it is challenging to define populations of surgeons, fellows, and residents with equivalent skill to allow accurate classification. Neurosurgeon skill level was based on being a certified surgeon and resident skill level was based on their educational year, which does not adequately take into account subspecialization or other construct-validated objective assessments of skill sets. A more comprehensive evaluation of participants with an emphasis on demonstrated skills across assessment domains (eg, visual rating scales and training evaluations or assessment of visuospatial abilities) may result in improved algorithm performance.
Our study demonstrates the ability of machine learning algorithms to classify surgical expertise with greater granularity and precision than has been previously demonstrated. Although the scenario involved a complex neurosurgical tumor resection task, the protocol outlined can be applied to any digitized platform to assess performance in a setting in which technical skill is paramount.
Accepted for Publication: June 10, 2019.
Published: August 2, 2019. doi:10.1001/jamanetworkopen.2019.8363
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2019 Winkler-Schwartz A et al. JAMA Network Open.
Corresponding Author: Alexander Winkler-Schwartz, MD, Neurosurgical Simulation and Artificial Intelligence Learning Centre, Department of Neurology and Neurosurgery, Montreal Neurological Institute and Hospital, 3801 University St, E2.89 Montreal, QC H3A 2B4, Canada (firstname.lastname@example.org).
Author Contributions: Drs Winkler-Schwartz and Yilmaz had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Drs Winkler-Schwartz and Yilmaz are co–first authors.
Concept and design: Winkler-Schwartz, Yilmaz, Mirchi, Bissonnette, Ledwos, Siyar, Azarnoush, Del Maestro.
Acquisition, analysis, or interpretation of data: Winkler-Schwartz, Yilmaz, Karlik.
Drafting of the manuscript: Winkler-Schwartz, Yilmaz, Siyar, Azarnoush.
Critical revision of the manuscript for important intellectual content: Winkler-Schwartz, Yilmaz, Mirchi, Bissonnette, Ledwos, Karlik, Del Maestro.
Statistical analysis: Winkler-Schwartz, Yilmaz, Bissonnette, Karlik.
Obtained funding: Del Maestro.
Administrative, technical, or material support: Del Maestro.
Supervision: Azarnoush, Del Maestro.
Conflict of Interest Disclosures: Dr Winkler-Schwartz reported receiving grants from Fonds de Recherche du Québec–Santé, Robert Maudsley Fellowship for the Royal College of Physicians and Surgeons of Canada, and Di Giovanni Foundation during the conduct of the study and having a patent pending to Method and System for Generating a Training Platform. Dr Yilmaz reported receiving grants from AO Foundation during the conduct of the study and having a patent pending to Method and System for Generating a Training Platform (2019; patent No. 05001770-843USPR). Mr Mirchi reported receiving grants from AO Foundation and Di Giovanni Foundation during the conduct of the study and having a patent pending to Method and System for Generating a Training Platform. Ms Ledwos reported receiving grants from AO Foundation and Di Giovanni Foundation during the conduct of the study and having a patent pending to Method and System for Generating a Training Platform. Dr Del Maestro reported receiving grants from AO Foundation, Di Giovanni Foundation, and Robert Maudsley Fellowship for Studies in Medical Education during the conduct of the study; receiving personal fees from AO Foundation and CAE outside the submitted work; having a patent pending to Methods and System for Generating a Training Platform; being a coauthor and neurosurgeon working with 3 engineers on a manuscript that was the first description of the NeuroTouch Simulator developed by the Medical Research Council of Canada, which was later taken over by CAE in 2016 who renamed the system NeuroVR; and being a visiting researcher at the Medical Research Council of Canada. No other disclosures were reported.
Funding/Support: This work was supported by the Di Giovanni Foundation, the Montreal English School Board, the Montreal Neurological Institute, the McGill Department of Orthopedics, the Fonds de recherche du Québec–Santé, and a Robert Maudsley Fellowship for Studies in Medical Education from the Royal College of Physicians and Surgeons of Canada. The Medical Research Council of Canada has provided a prototype of the NeuroTouch that was used in this study.
Role of the Funder/Sponsor: The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Additional Contributions: We thank all those who participated in this study.
FT. Simulation in neurosurgery and neurosurgical procedures. In: Levine AJ, eds. The Comprehensive Textbook of Healthcare Simulation. New York, NY: Springer New York; 2013:415-423. doi:10.1007/978-1-4614-5993-4_28
R, Del Maestro RF. NeuroTouch: a physics-based virtual simulator for cranial microneurosurgery training. Neurosurgery. 2012;71(1)(suppl operative):32-42.
et al. Assessing bimanual performance in brain tumor resection with NeuroTouch, a virtual reality simulator. Neurosurgery. 2015;11(1)(suppl 2):89-98.
et al. Artificial intelligence in medical education: best practices using machine learning to assess surgical expertise in virtual reality simulation [published online June 13, 2019]. J Surg Educ. doi:10.1016/j.jsurg.2019.05.015
et al; International Network for Simulation-based Pediatric Innovation, Research, and Education (INSPIRE) Reporting Guidelines Investigators. Reporting guidelines for health care simulation research: extensions to the CONSORT and STROBE statements. Simul Healthc. 2016;11(4):238-248. doi:10.1097/SIH.0000000000000150
M. Trends and trajectories for explainable, accountable and intelligible systems: an HCI research agenda. Presented at: 2018 CHI Conference on Human Factors in Computing Systems; April 21-26, 2018; Montreal, QC.
AM. Automatic motion recognition and skill evaluation for dynamic tasks. Eurohaptics. 2003;2003:363-373.
et al. Towards integrating task information in skills assessment for dexterous tasks in surgery and simulation. Presented at: 2011 IEEE International Conference on Robotics and Automation; May 9-13, 2011; Shanghai, China.
I. Fuzzy classification: towards evaluating performance on a surgical simulator. Stud Health Technol Inform. 2005;111:194-200.