Figure 1.
The Process of Generating a Final Optimized Machine Learning Algorithm With a Set of Selected Metrics

For algorithm optimization, each machine learning algorithm functions according to a defined set of parameters, and adjusting these parameters modifies its overall performance. An analogy for these parameters is the statistical methods that underlie P value adjustments (eg, Bonferroni and Benjamini-Hochberg). MATLAB, release 2018a (MathWorks Inc), was used to modify the intrinsic properties of 4 machine learning algorithms (K-nearest neighbor, naive Bayes, discriminant analysis, and support vector machine).

Figure 2.
Individual Misclassifications by Machine Learning Algorithms

Matrix demonstrating actual vs estimated group memberships for 4 different machine learning algorithms. Percentages reflect row totals.

Figure 3.
Number of Performance Metrics Selected for by 4 Different Machine Learning Algorithms

Performance metrics are categorized as those involving movements of 1 or both instruments, force applied to the underlying structures, damage to underlying brain, blood loss, and quantity of tumor removed.

Table 1.
Performance Metrics Generated From Raw Simulator Data

Table 2.
Demographic Information of Participants
Video 1. Virtual Reality Subpial Tumoral Resection

Recording from a single participant performing a virtual subpial resection task. The bipolar forceps are held in the nondominant hand and appear on the left of the image. The ultrasonic aspirator is held in the dominant hand and appears on the right.

Video 2. Three-Dimensional Representation of Virtual Reality Subpial Tumor

A 3-dimensional representation of the virtual reality subpial tumor resection task is presented. The pia (light pink) can be seen overlaying the tumor (dark blue) and adjacent blood vessel (bright red).

1. Anderson O, Davis R, Hanna GB, Vincent CA. Surgical adverse events: a systematic review. Am J Surg. 2013;206(2):253-262. doi:10.1016/j.amjsurg.2012.11.009
2. Maier-Hein L, Vedula SS, Speidel S, et al. Surgical data science for next-generation interventions. Nat Biomed Eng. 2017;1(9):691-696. doi:10.1038/s41551-017-0132-7
3. Alaraj A, Tobin MK, Birk DM, Charbel FT. Simulation in neurosurgery and neurosurgical procedures. In: Levine AI, DeMaria S, Schwartz AD, Sim AJ, eds. The Comprehensive Textbook of Healthcare Simulation. New York, NY: Springer New York; 2013:415-423. doi:10.1007/978-1-4614-5993-4_28
4. Delorme S, Laroche D, DiRaddo R, Del Maestro RF. NeuroTouch: a physics-based virtual simulator for cranial microneurosurgery training. Neurosurgery. 2012;71(1)(suppl operative):32-42.
5. Bugdadi A, Sawaya R, Olwi D, et al. Automaticity of force application during simulated brain tumor resection: testing the Fitts and Posner model. J Surg Educ. 2018;75(1):104-115. doi:10.1016/j.jsurg.2017.06.018
6. Winkler-Schwartz A, Bajunaid K, Mullah MAS, et al. Bimanual psychomotor performance in neurosurgical resident applicants assessed using NeuroTouch, a virtual reality simulator. J Surg Educ. 2016;73(6):942-953. doi:10.1016/j.jsurg.2016.04.013
7. Bajunaid K, Mullah MA, Winkler-Schwartz A, et al. Impact of acute stress on psychomotor bimanual performance during a simulated tumor resection task. J Neurosurg. 2017;126(1):71-80. doi:10.3171/2015.5.JNS15558
8. AlZhrani G, Alotaibi F, Azarnoush H, et al. Proficiency performance benchmarks for removal of simulated brain tumors using a virtual reality simulator NeuroTouch. J Surg Educ. 2015;72(4):685-696. doi:10.1016/j.jsurg.2014.12.014
9. Alotaibi FE, AlZhrani GA, Mullah MA, et al. Assessing bimanual performance in brain tumor resection with NeuroTouch, a virtual reality simulator. Neurosurgery. 2015;11(1)(suppl 2):89-98.
10. World Medical Association. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA. 2013;310(20):2191-2194. doi:10.1001/jama.2013.281053
11. Winkler-Schwartz A, Bissonnette V, Mirchi N, et al. Artificial intelligence in medical education: best practices using machine learning to assess surgical expertise in virtual reality simulation [published online June 13, 2019]. J Surg Educ. doi:10.1016/j.jsurg.2019.05.015
12. Cheng A, Kessler D, Mackinnon R, et al; International Network for Simulation-based Pediatric Innovation, Research, and Education (INSPIRE) Reporting Guidelines Investigators. Reporting guidelines for health care simulation research: extensions to the CONSORT and STROBE statements. Simul Healthc. 2016;11(4):238-248. doi:10.1097/SIH.0000000000000150
13. Hebb AO, Yang T, Silbergeld DL. The sub-pial resection technique for intrinsic tumor surgery. Surg Neurol Int. 2011;2:180. doi:10.4103/2152-7806.90714
14. Bugdadi A, Sawaya R, Bajunaid K, et al. Is virtual reality surgical performance influenced by force feedback device utilized? J Surg Educ. 2019;76(1):262-273. doi:10.1016/j.jsurg.2018.06.012
15. MathWorks. fitcecoc: fit multiclass models for support vector machines or other classifiers. https://www.mathworks.com/help/stats/fitcecoc.html. Accessed January 9, 2019.
16. MathWorks. fitcsvm: train support vector machine (SVM) classifier for one-class and binary classification. https://www.mathworks.com/help/stats/fitcsvm.html. Accessed January 9, 2019.
17. MathWorks. fitcdiscr: fit discriminant analysis classifier. https://www.mathworks.com/help/stats/fitcdiscr.html. Accessed January 9, 2019.
18. MathWorks. fitcnb: train multiclass naive Bayes model. https://www.mathworks.com/help/stats/fitcnb.html. Accessed January 9, 2019.
19. MathWorks. fitcknn: fit k-nearest neighbor classifier. https://www.mathworks.com/help/stats/fitcknn.html. Accessed January 9, 2019.
20. Spencer J. Learning and teaching in the clinical environment. BMJ. 2003;326(7389):591-594. doi:10.1136/bmj.326.7389.591
21. American Board of Medical Specialties. Steps toward initial certification and MOC. https://www.abms.org/board-certification/steps-toward-initial-certification-and-moc/. Accessed January 23, 2019.
22. Ross BK, Metzner J. Simulation for maintenance of certification. Surg Clin North Am. 2015;95(4):893-905. doi:10.1016/j.suc.2015.04.010
23. Issenberg SB, McGaghie WC, Petrusa ER, Lee Gordon D, Scalese RJ. Features and uses of high-fidelity medical simulations that lead to effective learning: a BEME systematic review. Med Teach. 2005;27(1):10-28. doi:10.1080/01421590500046924
24. Epstein RM. Assessment in medical education. N Engl J Med. 2007;356(4):387-396. doi:10.1056/NEJMra054784
25. Abdul A, Vermeulen J, Wang D, Lim B, Kankanhalli M. Trends and trajectories for explainable, accountable and intelligible systems: an HCI research agenda. Presented at: 2018 CHI Conference on Human Factors in Computing Systems; April 21-26, 2018; Montreal, QC.
26. Murphy TE, Vignes CM, Yuh DD, Okamura AM. Automatic motion recognition and skill evaluation for dynamic tasks. Eurohaptics. 2003;2003:363-373.
27. Megali G, Sinigaglia S, Tonet O, Dario P. Modelling and evaluation of surgical performance using hidden Markov models. IEEE Trans Biomed Eng. 2006;53(10):1911-1919. doi:10.1109/TBME.2006.881784
28. Hajshirmohammadi I, Payandeh S. Fuzzy set theory for performance evaluation in a surgical simulator. Presence. 2007;16(6):603-622. doi:10.1162/pres.16.6.603
29. Jog A, Itkowitz B, May L, et al. Towards integrating task information in skills assessment for dexterous tasks in surgery and simulation. Presented at: 2011 IEEE International Conference on Robotics and Automation; May 9-13, 2011; Shanghai, China.
30. Liang H, Shi MY. Surgical skill evaluation model for virtual surgical training. Appl Mech Mater. 2011;40-41:812-819. doi:10.4028/www.scientific.net/AMM.40-41.812
31. Loukas C, Georgiou E. Multivariate autoregressive modeling of hand kinematics for laparoscopic skills assessment of surgical trainees. IEEE Trans Biomed Eng. 2011;58(11):3289-3297. doi:10.1109/TBME.2011.2167324
32. Huang J, Payandeh S, Doris P, Hajshirmohammadi I. Fuzzy classification: towards evaluating performance on a surgical simulator. Stud Health Technol Inform. 2005;111:194-200.
33. Sewell C, Morris D, Blevins NH, et al. Providing metrics and performance feedback in a surgical simulator. Comput Aided Surg. 2008;13(2):63-81. doi:10.3109/10929080801957712
34. Richstone L, Schwartz MJ, Seideman C, Cadeddu J, Marshall S, Kavoussi LR. Eye metrics as an objective assessment of surgical skill. Ann Surg. 2010;252(1):177-182. doi:10.1097/SLA.0b013e3181e464fb
35. Rhienmora P, Haddawy P, Suebnukarn S, Dailey MN. Intelligent dental training simulator with objective skill assessment and feedback. Artif Intell Med. 2011;52(2):115-121. doi:10.1016/j.artmed.2011.04.003
36. Kerwin T, Wiet G, Stredney D, Shen HW. Automatic scoring of virtual mastoidectomies using expert examples. Int J Comput Assist Radiol Surg. 2012;7(1):1-11. doi:10.1007/s11548-011-0566-4
37. Ershad M, Rege R, Fey AM. Meaningful assessment of robotic surgical style using the wisdom of crowds. Int J Comput Assist Radiol Surg. 2018;13(7):1037-1048. doi:10.1007/s11548-018-1738-2
38. Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920-1930. doi:10.1161/CIRCULATIONAHA.115.001593
    Original Investigation
    Medical Education
    August 2, 2019

    Machine Learning Identification of Surgical and Operative Factors Associated With Surgical Expertise in Virtual Reality Simulation

    Author Affiliations
    • 1Neurosurgical Simulation and Artificial Intelligence Learning Centre, Department of Neurology and Neurosurgery, Montreal Neurological Institute and Hospital, McGill University, Montreal, Quebec, Canada
    • 2Department of Orthopedic Surgery, McGill University, Montreal, Quebec, Canada
    • 3Department of Biomedical Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran
    JAMA Netw Open. 2019;2(8):e198363. doi:10.1001/jamanetworkopen.2019.8363
    Key Points

    Question  Can a machine learning algorithm differentiate participants according to their stage of practice in a complex simulated neurosurgical task?

    Findings  In this case series study, 50 individuals (14 neurosurgeons, 4 fellows, 10 senior residents, 10 junior residents, and 12 medical students) participated in 250 simulated tumor resections. An accuracy of 90% was achieved by a K-nearest neighbor algorithm using 6 performance features; 2 neurosurgeons, 1 fellow or senior resident, 1 junior resident, and 1 medical student were misclassified.

    Meaning  The findings suggest that machine learning algorithms may be capable of classifying surgical expertise with greater granularity and precision than has been previously demonstrated in surgery.

    Abstract

    Importance  Despite advances in the assessment of technical skills in surgery, a clear understanding of the composites of technical expertise is lacking. Surgical simulation allows for the quantitation of psychomotor skills, generating data sets that can be analyzed using machine learning algorithms.

    Objective  To identify surgical and operative factors selected by a machine learning algorithm to accurately classify participants by level of expertise in a virtual reality surgical procedure.

    Design, Setting, and Participants  Fifty participants from a single university were recruited between March 1, 2015, and May 31, 2016, to participate in a case series study at McGill University Neurosurgical Simulation and Artificial Intelligence Learning Centre. Data were collected at a single time point and no follow-up data were collected. Individuals were classified a priori as expert (neurosurgery staff), seniors (neurosurgical fellows and senior residents), juniors (neurosurgical junior residents), and medical students, all of whom participated in 250 simulated tumor resections.

    Exposures  All individuals participated in a virtual reality neurosurgical tumor resection scenario. Each scenario was repeated 5 times.

    Main Outcomes and Measures  Through an iterative process, performance metrics associated with instrument movement and force, resection of tissues, and bleeding generated from the raw simulator data output were selected by K-nearest neighbor, naive Bayes, discriminant analysis, and support vector machine algorithms to most accurately determine group membership.

    Results  A total of 50 individuals (9 women and 41 men; mean [SD] age, 33.6 [9.5] years; 14 neurosurgeons, 4 fellows, 10 senior residents, 10 junior residents, and 12 medical students) participated. Neurosurgeons were in practice between 1 and 25 years, with 9 (64%) involving a predominantly cranial practice. The K-nearest neighbor algorithm had an accuracy of 90% (45 of 50), the naive Bayes algorithm had an accuracy of 84% (42 of 50), the discriminant analysis algorithm had an accuracy of 78% (39 of 50), and the support vector machine algorithm had an accuracy of 76% (38 of 50). The K-nearest neighbor algorithm used 6 performance metrics to classify participants, the naive Bayes algorithm used 9 performance metrics, the discriminant analysis algorithm used 8 performance metrics, and the support vector machine algorithm used 8 performance metrics. Two neurosurgeons, 1 fellow or senior resident, 1 junior resident, and 1 medical student were misclassified.

    Conclusions and Relevance  In a virtual reality neurosurgical tumor resection study, a machine learning algorithm successfully classified participants into 4 levels of expertise with 90% accuracy. These findings suggest that algorithms may be capable of classifying surgical expertise with greater granularity and precision than has been previously demonstrated in surgery.

    Introduction

    Despite technological advances in artificial intelligence and machine learning, delivery of health care is mediated largely by direct interaction between physician and patient. This scenario is particularly true for surgical interventions, which carry substantive patient risks and increased costs to health care systems.1 As a consequence, the burgeoning field of surgical data science represents efforts to improve interventional health care through increased data collection, quantification, and analysis.2 Similarly, the use of virtual reality simulators has been explored as a means of providing objective assessment of technical ability in medicine, with the added benefit of retaining realism, pathology, and active bleeding states in a controlled laboratory setting. These systems generate vast data sets that quickly challenge traditional statistical methods. Artificial intelligence and machine learning systems lend themselves well to the analysis of large data sets generated in surgical procedures in 2 important ways: first, by uncovering previously unrecognized patterns, they can expand the understanding of the composites of technical expertise and surgical error, and second, by grouping participants according to technical ability, they offer novel avenues for training and feedback in health care.

    We sought to study the operative factors selected by a series of machine learning algorithms to most accurately classify participants by level of expertise in a virtual reality surgery. An advanced high-fidelity neurosurgical simulator allows participants to conduct a complex open neurosurgical brain tumor resection task in a risk-free environment.3,4 Our group has extensive experience in virtual reality surgical simulation; several studies have demonstrated that performance measures obtained from simulation can differentiate technical skills both between and within groups of expertise.5-9 Given the task complexity and the abundance of data generated during the simulated operation, we hypothesized that machine learning algorithms could identify previously unrecognized performance measures, as well as differentiate participants according to their stage of medical practice.

    Methods
    Study Participants

    All neurosurgeons, neurosurgical fellows, and neurosurgical residents from a single Canadian university were invited between March 1, 2015, and May 31, 2016, to participate in the trial. Medical students rotating on a neurosurgical service or having expressed interest in being contacted for trials were invited. Data were collected at a single time point and no follow-up data were collected. Participants were classified a priori as expert (neurosurgery staff), seniors (neurosurgery fellows and residents in years 4-6), juniors (neurosurgery residents in years 1-3), and medical students. All participants signed an approved Montreal Neurological Institute and Hospital Research Ethics Board consent form before trial participation. All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Declaration of Helsinki.10 The study received local ethics board approval at the Montreal Neurological Institute and Hospital. This report is structured according to guidelines for best practices in reporting studies on machine learning to assess surgical expertise in virtual reality simulation.11,12

    Study Design
    The Simulator

    The NeuroVR (CAE Healthcare) is a high-fidelity neurosurgical simulator designed to recreate the visual and haptic experience of resecting a human brain tumor through an operative microscope. The platform was developed in 2012 by a team from the National Research Council of Canada in collaboration with an advisory network of surgeons from 23 Canadian and international teaching hospitals.4 Care was taken to provide the most realistic sensory feedback for the user by incorporating physical properties of human primary brain tumors.4 As such, the attention to detail and resources used in the creation of the NeuroVR make it one of the most advanced high-fidelity simulators available for neurosurgery.3

    The Virtual Reality Tumor Resection Task

    The trial was carried out at the McGill Neurosurgical Simulation and Artificial Intelligence Learning Centre in a controlled laboratory environment devoid of distractions. A human intrinsic subpial brain tumor resection task was designed by neurosurgeons with extensive experience in neuro-oncologic and epilepsy neurosurgery. The subpial technique is a challenging bimanual psychomotor skill acquired in neurosurgery and is primarily used in epilepsy and oncologic surgery, where preservation of adjacent eloquent structures is of paramount importance.13 Participants were given written and verbal instructions that the goal of the scenario was removal of the cortical tumor using the ultrasonic aspirator without damaging adjacent normal brain tissue and vessels. A bipolar instrument could be used to lift and retract the pial membrane to gain access to the tumor and cauterize possible bleeding points. Participants performed the scenario 5 times; however, for the analysis these tasks were averaged and not treated separately. The duration of the resection procedure was limited to 3 minutes.14 Video 1 is a sample video of the task and Video 2 is a 3-dimensional tumoral reconstruction.

    Statistical Analysis
    Raw Data Obtained From the Simulator

    After each trial, the NeuroVR provides a comma-separated values (CSV) file containing, in 20-millisecond increments, the activation, force applied, tip position, and angle of each instrument; the volume of tumor and surrounding healthy tissues removed; blood loss; and whether a given instrument was in contact with the tumor, a blood vessel, or healthy tissue. MATLAB, release 2018a (MathWorks Inc), was used to process the data into operative performance metrics that can be used by a machine learning algorithm. Interpolation was used to render the data regular and fill occasional missing data points (due to slight fluctuations in computer processing). eFigure 1 in the Supplement has further examples.
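The interpolation step described above can be sketched as follows. This is an illustrative Python equivalent, not the authors' MATLAB code; the function name and 20-ms step are assumptions drawn from the text.

```python
import numpy as np

def regularize(timestamps_ms, values, step_ms=20.0):
    """Resample an irregularly sampled signal onto a regular grid.

    The simulator nominally logs every 20 ms, but processing jitter can
    shift or drop samples; linear interpolation onto a fixed 20-ms grid
    renders the data regular and fills occasional gaps, as the paper
    describes. Illustrative sketch, not the NeuroVR or MATLAB API.
    """
    t = np.asarray(timestamps_ms, dtype=float)
    v = np.asarray(values, dtype=float)
    grid = np.arange(t[0], t[-1] + step_ms / 2, step_ms)
    return grid, np.interp(grid, t, v)
```

The same resampling would be applied per column of the CSV (force, tip position, volumes) so that downstream derivatives are taken over uniform time steps.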

    Performance Metric Extraction

    To begin, raw data were transformed into performance metrics to be used by the algorithm, with the intention of generating operative measurements that would be easily interpretable by teachers and students of surgery. This process includes transforming instrument movement from the original x, y, and z coordinates into 3-dimensional representations of velocity (first derivative of position), acceleration (first derivative of velocity), and jerk (first derivative of acceleration), as well as the separation between instrument tips. The acceleration and tip distance variables were further refined to reflect the rate of change while the instruments were speeding up and slowing down as well as converging and diverging. The rate of change in volume of tumor and healthy tissue, as well as the rate of change of bleeding, and the number of attempts to stop bleeding were generated. Next, the aforementioned variables were extracted during 3 operative conditions: during the course of the whole scenario, during the tumor resection (ie, only when the ultrasonic aspirator was activated with decreasing tumor volume), and during blood suctioning (ie, when the ultrasonic aspirator was not active and while blood in the operative view was decreasing). Finally, the mean, median, and maximum values of all metrics in all conditions were obtained. Table 1 lists all 270 metrics generated. Examples among the total 270 possible metrics generated include mean aspirator force while resecting the tumor, maximum rate of bleeding during the course of the whole scenario, and median tip distance while suctioning blood. Performance measures of the 5 scenarios were averaged together for each participant.
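The derivative chain described above (position to velocity to acceleration to jerk, plus tip separation) can be sketched with finite differences. This is a hedged Python illustration; function names and the 20-ms sampling interval are assumptions, and the authors' actual MATLAB implementation may differ.

```python
import numpy as np

def kinematic_metrics(pos, dt=0.02):
    """Derive speed, acceleration, and jerk magnitudes from an (N, 3)
    array of x, y, z tip positions sampled every dt seconds.

    Successive finite differences approximate the derivatives the paper
    describes: velocity (first derivative of position), acceleration
    (first derivative of velocity), and jerk (first derivative of
    acceleration).
    """
    vel = np.diff(pos, axis=0) / dt    # (N-1, 3) velocity vectors
    acc = np.diff(vel, axis=0) / dt    # (N-2, 3) acceleration vectors
    jerk = np.diff(acc, axis=0) / dt   # (N-3, 3) jerk vectors
    return (np.linalg.norm(vel, axis=1),
            np.linalg.norm(acc, axis=1),
            np.linalg.norm(jerk, axis=1))

def tip_separation(pos_a, pos_b):
    """Euclidean distance between the two instrument tips per sample."""
    return np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b), axis=1)
```

Taking the mean, median, and maximum of each such series, under each of the 3 operative conditions, yields the 270-metric grid summarized in Table 1.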

    Metric Reduction and Normalization

    Metrics failing to demonstrate a significant (P < .05) difference on a 2-sided t test between any 2 groups were excluded. No corrections for multiple tests were done, as the t tests were performed for data-reductive purposes. Subsequent inclusion of the metrics in the algorithm corrects for the probability of type I error at this stage. Metrics were normalized via z score transformation to ensure optimal algorithm functioning.
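The reduction-and-normalization step can be sketched as below, assuming a participants-by-metrics matrix and a group label per participant. This is an illustrative Python version (the authors worked in MATLAB); the function and variable names are assumptions.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def reduce_and_normalize(X, groups, alpha=0.05):
    """Keep metrics (columns of X) that show a significant (P < .05)
    2-sided t test difference between at least one pair of groups,
    then z-score the surviving columns.

    Mirrors the paper's uncorrected, data-reductive t-test screen
    followed by z-score normalization; an illustrative sketch only.
    """
    labels = np.unique(groups)
    keep = []
    for j in range(X.shape[1]):
        for a, b in combinations(labels, 2):
            _, p = stats.ttest_ind(X[groups == a, j], X[groups == b, j])
            if p < alpha:           # significant for at least one pair
                keep.append(j)
                break
    Xk = X[:, keep]
    z = (Xk - Xk.mean(axis=0)) / Xk.std(axis=0, ddof=1)
    return z, keep
```

In the study this screen reduced the 270 generated metrics to 122 before any algorithm saw the data.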

    Iterative Loop

    The following steps involve a repetitive process whereby algorithm optimization and final performance metric selection occur. The process is outlined in Figure 1. Forward (starting with 1 and increasing in number) metric selection was performed by randomly adding metrics and backward (starting with the maximum and decreasing in number) metric selection was performed by randomly removing metrics. Calculation of accuracy was accomplished by leave-1-out cross-validation. Leave-1-out validation involves training the machine learning algorithm on the entire participant data set except for 1 individual, whose group membership is then estimated. The process is repeated with different individuals excluded until all participants have been classified. The total number of correctly classified individuals represents the overall accuracy of a given algorithm. No external data set was used to obtain the algorithm accuracy.
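The leave-1-out accuracy calculation described above can be sketched in Python; scikit-learn here stands in for the authors' MATLAB tooling, and the helper name is an assumption.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def loo_accuracy(model, X, y):
    """Leave-1-out cross-validation as the paper describes: train on
    every participant except one, classify the held-out individual,
    and repeat until everyone has been classified once. The fraction
    classified correctly is the algorithm's accuracy.
    """
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])
        correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)
```

With 50 participants this runs 50 train-test cycles per candidate metric set, which is what makes the forward/backward metric search in Figure 1 iterative.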

    Algorithms Used

    Four classifier algorithms were used: K-nearest neighbor, naive Bayes, discriminant analysis, and support vector machine. Parameter optimizations were carried out using functions included in MATLAB, release 2018a, as well as code written by us.15-19
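For readers outside the MATLAB ecosystem, the four classifier families map roughly onto open-source equivalents as sketched below. These scikit-learn classes are stand-ins for the cited fitcknn, fitcnb, fitcdiscr, and fitcecoc/fitcsvm functions, with default hyperparameters rather than the paper's optimized settings.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Rough open-source analogues of the MATLAB fitc* functions cited in
# references 15-19. Hyperparameters here are library defaults, not the
# optimized values reported in the Results.
classifiers = {
    "k-nearest neighbor": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
    "discriminant analysis": LinearDiscriminantAnalysis(),
    "support vector machine": SVC(),  # multiclass handled internally
}
```

Each candidate metric subset from the iterative loop would be evaluated by fitting all four models and scoring them with leave-1-out cross-validation.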

    Results
    Participant Characteristics

    A total of 50 individuals (14 neurosurgeons, 4 fellows, 10 senior residents, 10 junior residents, and 12 medical students) participated in 250 simulated tumor resections. Demographic information is presented in Table 2. Consultant neurosurgeon subspecialization covered a wide breadth of practice, with most (9 [64%]) primarily involved in cranial surgery. A total of 7 neurosurgeons (50%), 10 senior residents (69%), 6 junior residents (60%), and 3 medical students (25%) indicated that they had used a surgical simulator previously.

    Machine Learning Ability to Classify Participants

    The K-nearest neighbor algorithm had an accuracy of 90% (45 of 50), the naive Bayes algorithm had an accuracy of 84% (42 of 50), the discriminant analysis algorithm had an accuracy of 78% (39 of 50), and the support vector machine algorithm had an accuracy of 76% (38 of 50). Figure 2 presents details on individual misclassification. Although beyond the scope of the initial hypothesis, in response to misclassifications between medical students and neurosurgeons, the algorithm optimization process was repeated with an emphasis on preventing misclassification between neurosurgeons and medical students, with resulting accuracies ranging between 88% (44 of 50) and 72% (36 of 50). This was accomplished by allowing the algorithm optimization process to stop if no misclassifications between neurosurgeons and medical students occurred, in addition to attaining a desired accuracy. eFigure 2 in the Supplement has further information regarding the individual misclassifications of these algorithms.

    Machine Learning Optimized Parameters

    The final K-nearest neighbor algorithm used 2 neighbors with a cosine distance calculation. Novel data points were classified into the more skilled group when the 2 neighbors were from differing groups.

    The best-performing naive Bayes algorithm used gaussian (normal) kernel smoothing with a width of 0.31408. The final discriminant analysis algorithm used a δ value of 0.00068926 and a γ value of 0.99808 with a pseudo-linear discriminant type. The final support vector machine algorithm used a gaussian kernel function with the formula G(xj, xk) = exp(−‖xj − xk‖²). Box constraint was 0.12958 and kernel scale was 3.1667 using the 1-vs-all coding method (in which 1 group is compared with all others).
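The reported K-nearest neighbor configuration (2 neighbors, cosine distance) and the gaussian kernel can be written out as below. This is a sketch: the paper's tie-break toward the more skilled group is a custom rule, not a library default, so only the distance and kernel settings are shown, and the unit-scale kernel omits the reported kernel scale of 3.1667.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# The reported best model: 2 nearest neighbors under cosine distance.
# The paper's tie-break (assign ties to the more skilled group) would
# require a custom voting rule and is not reproduced here.
knn = KNeighborsClassifier(n_neighbors=2, metric="cosine")

def gaussian_kernel(xj, xk):
    """Gaussian (RBF) kernel of the reported form
    G(xj, xk) = exp(-||xj - xk||^2), shown at unit kernel scale."""
    d = np.asarray(xj, dtype=float) - np.asarray(xk, dtype=float)
    return np.exp(-d @ d)
```

Identical points give a kernel value of 1, and the value decays toward 0 as the squared distance between feature vectors grows, which is what lets the support vector machine weight nearby training participants most heavily.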

    Performance Metrics Selected by Machine Learning Algorithm

    Of the 270 performance metrics generated from raw data, 122 were selected after reduction and normalization. The K-nearest neighbor algorithm used 6 performance metrics to classify participants (55, 63, 67, 123, 125, and 157), the naive Bayes algorithm used 9 performance metrics (9, 34, 39, 44, 51, 55, 60, 69, and 129), the discriminant analysis algorithm used 8 performance metrics (60, 63, 159, 168, 189, 194, 235, and 250), and the support vector machine algorithm used 8 performance metrics (9, 45, 48, 60, 106, 168, 176, and 265) (Table 1). Performance metrics selected by the algorithms spanned the following 4 principal domains: movement associated with a single instrument, both instruments used in concert, force applied by the instruments, and tissue removed or bleeding caused (Figure 3).

    Discussion

    In this prospective study using a high-fidelity virtual reality simulated neurosurgical brain tumor resection procedure, we sought to assess whether machine learning algorithms could select performance measures to classify participants according to their level of neurosurgical training. This study comes at a time of ever-increasing time pressure facing physician-educators to balance their commitment to patients and learners.20 In parallel, in the United States the search continues for a reliable means of examining Part III of the Maintenance of Certification, namely, the assessment of knowledge, judgment, and skills unique to surgical and procedurally oriented medical specialties.21,22 Both require an objective, consistent, transparent, and defendable means of summative and formative assessments of psychomotor ability.

    Simulators, while affording learners the opportunity to safely develop technical skills during the particularly dangerous and error-prone early phases of skill acquisition, do not obviate the need for learner feedback, which is often given by skilled instructors.23 Furthermore, although simulation has been incorporated into the certification process of the American Board of Surgery and the American Board of Anesthesiology, the former relies on human evaluators while the latter is meant only to stimulate self-reflection.22,24 Simulation-based technical skills training informed by artificial intelligence feedback systems may offer a solution.

    As innovations in artificial intelligence continue, so do the efforts to maintain human understanding of the algorithm classification process. This field has been termed transparent or explainable artificial intelligence.25 By understanding the performance data used by the algorithm to render its decision, it is possible to design systems to deliver on-demand assessments at the convenience of the examinee and with minimal input from skilled instructors. Such systems may be subject to continuous improvement as increasing participant data are collected and integrated into the algorithm.

    We found that the best-performing machine learning algorithm used as few as 6 performance metrics to successfully classify 45 of 50 participants into 1 of 4 groups of expertise. Although we chose to limit the performance measures to those that could be easily interpreted by a user, theoretically higher accuracies may be attained by including more abstruse metrics. Nevertheless, to our knowledge, no previous study using artificial intelligence to evaluate performance has demonstrated the ability to identify 4 groups in open surgery.26-37

    Limitations

    Insofar as technical skills measured on a simulator are reflective of operating skill in the real world, our findings outline a novel approach to understanding technical expertise in surgery. Although 4 different machine learning algorithms were used, there still exists the possibility that all algorithms are overfitted to our data set, limiting their performance when faced with novel data.38 As such, these algorithms must be tested on an independent data set before making final conclusions about their accuracy. Furthermore, in 3 of 4 algorithms a single medical student was categorized as a neurosurgeon. In response to this misclassification, we sought to limit misclassifications between these 2 groups in the algorithm optimization process as a proof of concept. Although this modification came at a cost of reduced overall accuracy, explicitly preventing misclassifications between certain groups may be desirable in high-stakes certification examinations.

    In addition, it is challenging to define populations of surgeons, fellows, and residents with equivalent skill to allow accurate classification. Neurosurgeon skill level was based on board certification, and resident skill level on educational year, neither of which adequately accounts for subspecialization or other construct-validated objective assessments of skill sets. A more comprehensive evaluation of participants with an emphasis on demonstrated skills across assessment domains (eg, visual rating scales and training evaluations or assessment of visuospatial abilities) may result in improved algorithm performance.

    Conclusions

    Our study demonstrates the ability of machine learning algorithms to classify surgical expertise with greater granularity and precision than previously demonstrated. Although participants performed a complex neurosurgical tumor resection, the protocol outlined can be applied to any digitized platform to assess performance in settings in which technical skill is paramount.

    Article Information

    Accepted for Publication: June 10, 2019.

    Published: August 2, 2019. doi:10.1001/jamanetworkopen.2019.8363

    Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2019 Winkler-Schwartz A et al. JAMA Network Open.

    Corresponding Author: Alexander Winkler-Schwartz, MD, Neurosurgical Simulation and Artificial Intelligence Learning Centre, Department of Neurology and Neurosurgery, Montreal Neurological Institute and Hospital, 3801 University St, E2.89 Montreal, QC H3A 2B4, Canada (manuscriptinquiry@gmail.com).

    Author Contributions: Drs Winkler-Schwartz and Yilmaz had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Drs Winkler-Schwartz and Yilmaz are co–first authors.

    Concept and design: Winkler-Schwartz, Yilmaz, Mirchi, Bissonnette, Ledwos, Siyar, Azarnoush, Del Maestro.

    Acquisition, analysis, or interpretation of data: Winkler-Schwartz, Yilmaz, Karlik.

    Drafting of the manuscript: Winkler-Schwartz, Yilmaz, Siyar, Azarnoush.

    Critical revision of the manuscript for important intellectual content: Winkler-Schwartz, Yilmaz, Mirchi, Bissonnette, Ledwos, Karlik, Del Maestro.

    Statistical analysis: Winkler-Schwartz, Yilmaz, Bissonnette, Karlik.

    Obtained funding: Del Maestro.

    Administrative, technical, or material support: Del Maestro.

    Supervision: Azarnoush, Del Maestro.

    Conflict of Interest Disclosures: Dr Winkler-Schwartz reported receiving grants from Fonds de Recherche du Québec–Santé, Robert Maudsley Fellowship for the Royal College of Physicians and Surgeons of Canada, and Di Giovanni Foundation during the conduct of the study and having a patent pending to Method and System for Generating a Training Platform. Dr Yilmaz reported receiving grants from AO Foundation during the conduct of the study and having a patent pending to Method and System for Generating a Training Platform (2019; patent No. 05001770-843USPR). Mr Mirchi reported receiving grants from AO Foundation and Di Giovanni Foundation during the conduct of the study and having a patent pending to Method and System for Generating a Training Platform. Ms Ledwos reported receiving grants from AO Foundation and Di Giovanni Foundation during the conduct of the study and having a patent pending to Method and System for Generating a Training Platform. Dr Del Maestro reported receiving grants from AO Foundation, Di Giovanni Foundation, and Robert Maudsley Fellowship for Studies in Medical Education during the conduct of the study; receiving personal fees from AO Foundation and CAE outside the submitted work; having a patent pending to Methods and System for Generating a Training Platform; being a coauthor and neurosurgeon working with 3 engineers on a manuscript that was the first description of the NeuroTouch Simulator developed by the Medical Research Council of Canada, which was later taken over by CAE in 2016 who renamed the system NeuroVR; and being a visiting researcher at the Medical Research Council of Canada. No other disclosures were reported.

    Funding/Support: This work was supported by the Di Giovanni Foundation, the Montreal English School Board, the Montreal Neurological Institute, the McGill Department of Orthopedics, the Fonds de recherche du Québec–Santé, and a Robert Maudsley Fellowship for Studies in Medical Education from the Royal College of Physicians and Surgeons of Canada. The Medical Research Council of Canada has provided a prototype of the NeuroTouch that was used in this study.

    Role of the Funder/Sponsor: The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

    Additional Contributions: We thank all those who participated in this study.

    References
    1. Anderson O, Davis R, Hanna GB, Vincent CA. Surgical adverse events: a systematic review. Am J Surg. 2013;206(2):253-262. doi:10.1016/j.amjsurg.2012.11.009
    2. Maier-Hein L, Vedula SS, Speidel S, et al. Surgical data science for next-generation interventions. Nat Biomed Eng. 2017;1(9):691-696. doi:10.1038/s41551-017-0132-7
    3. Alaraj A, Tobin MK, Birk DM, Charbel FT. Simulation in neurosurgery and neurosurgical procedures. In: Levine AI, DeMaria S, Schwartz AD, Sim AJ, eds. The Comprehensive Textbook of Healthcare Simulation. New York, NY: Springer; 2013:415-423. doi:10.1007/978-1-4614-5993-4_28
    4. Delorme S, Laroche D, DiRaddo R, Del Maestro RF. NeuroTouch: a physics-based virtual simulator for cranial microneurosurgery training. Neurosurgery. 2012;71(1)(suppl operative):32-42.
    5. Bugdadi A, Sawaya R, Olwi D, et al. Automaticity of force application during simulated brain tumor resection: testing the Fitts and Posner model. J Surg Educ. 2018;75(1):104-115. doi:10.1016/j.jsurg.2017.06.018
    6. Winkler-Schwartz A, Bajunaid K, Mullah MAS, et al. Bimanual psychomotor performance in neurosurgical resident applicants assessed using NeuroTouch, a virtual reality simulator. J Surg Educ. 2016;73(6):942-953. doi:10.1016/j.jsurg.2016.04.013
    7. Bajunaid K, Mullah MA, Winkler-Schwartz A, et al. Impact of acute stress on psychomotor bimanual performance during a simulated tumor resection task. J Neurosurg. 2017;126(1):71-80. doi:10.3171/2015.5.JNS15558
    8. AlZhrani G, Alotaibi F, Azarnoush H, et al. Proficiency performance benchmarks for removal of simulated brain tumors using a virtual reality simulator NeuroTouch. J Surg Educ. 2015;72(4):685-696. doi:10.1016/j.jsurg.2014.12.014
    9. Alotaibi FE, AlZhrani GA, Mullah MA, et al. Assessing bimanual performance in brain tumor resection with NeuroTouch, a virtual reality simulator. Neurosurgery. 2015;11(1)(suppl 2):89-98.
    10. World Medical Association. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA. 2013;310(20):2191-2194. doi:10.1001/jama.2013.281053
    11. Winkler-Schwartz A, Bissonnette V, Mirchi N, et al. Artificial intelligence in medical education: best practices using machine learning to assess surgical expertise in virtual reality simulation [published online June 13, 2019]. J Surg Educ. doi:10.1016/j.jsurg.2019.05.015
    12. Cheng A, Kessler D, Mackinnon R, et al; International Network for Simulation-based Pediatric Innovation, Research, and Education (INSPIRE) Reporting Guidelines Investigators. Reporting guidelines for health care simulation research: extensions to the CONSORT and STROBE statements. Simul Healthc. 2016;11(4):238-248. doi:10.1097/SIH.0000000000000150
    13. Hebb AO, Yang T, Silbergeld DL. The sub-pial resection technique for intrinsic tumor surgery. Surg Neurol Int. 2011;2:180. doi:10.4103/2152-7806.90714
    14. Bugdadi A, Sawaya R, Bajunaid K, et al. Is virtual reality surgical performance influenced by force feedback device utilized? J Surg Educ. 2019;76(1):262-273. doi:10.1016/j.jsurg.2018.06.012
    15. MathWorks. fitcecoc: fit multiclass models for support vector machines or other classifiers. https://www.mathworks.com/help/stats/fitcecoc.html. Accessed January 9, 2019.
    16. MathWorks. fitcsvm: train support vector machine (SVM) classifier for one-class and binary classification. https://www.mathworks.com/help/stats/fitcsvm.html. Accessed January 9, 2019.
    17. MathWorks. fitcdiscr: fit discriminant analysis classifier. https://www.mathworks.com/help/stats/fitcdiscr.html. Accessed January 9, 2019.
    18. MathWorks. fitcnb: train multiclass naive Bayes model. https://www.mathworks.com/help/stats/fitcnb.html. Accessed January 9, 2019.
    19. MathWorks. fitcknn: fit k-nearest neighbor classifier. https://www.mathworks.com/help/stats/fitcknn.html. Accessed January 9, 2019.
    20. Spencer J. Learning and teaching in the clinical environment. BMJ. 2003;326(7389):591-594. doi:10.1136/bmj.326.7389.591
    21. American Board of Medical Specialties. Steps toward initial certification and MOC. https://www.abms.org/board-certification/steps-toward-initial-certification-and-moc/. Accessed January 23, 2019.
    22. Ross BK, Metzner J. Simulation for maintenance of certification. Surg Clin North Am. 2015;95(4):893-905. doi:10.1016/j.suc.2015.04.010
    23. Issenberg SB, McGaghie WC, Petrusa ER, Lee Gordon D, Scalese RJ. Features and uses of high-fidelity medical simulations that lead to effective learning: a BEME systematic review. Med Teach. 2005;27(1):10-28. doi:10.1080/01421590500046924
    24. Epstein RM. Assessment in medical education. N Engl J Med. 2007;356(4):387-396. doi:10.1056/NEJMra054784
    25. Abdul A, Vermeulen J, Wang D, Lim B, Kankanhalli M. Trends and trajectories for explainable, accountable and intelligible systems: an HCI research agenda. Presented at: 2018 CHI Conference on Human Factors in Computing Systems; April 21-26, 2018; Montreal, QC.
    26. Murphy TE, Vignes CM, Yuh DD, Okamura AM. Automatic motion recognition and skill evaluation for dynamic tasks. Eurohaptics. 2003;2003:363-373.
    27. Megali G, Sinigaglia S, Tonet O, Dario P. Modelling and evaluation of surgical performance using hidden Markov models. IEEE Trans Biomed Eng. 2006;53(10):1911-1919. doi:10.1109/TBME.2006.881784
    28. Hajshirmohammadi I, Payandeh S. Fuzzy set theory for performance evaluation in a surgical simulator. Presence. 2007;16(6):603-622. doi:10.1162/pres.16.6.603
    29. Jog A, Itkowitz B, May L, et al. Towards integrating task information in skills assessment for dexterous tasks in surgery and simulation. Presented at: 2011 IEEE International Conference on Robotics and Automation; May 9-13, 2011; Shanghai, China.
    30. Liang H, Shi MY. Surgical skill evaluation model for virtual surgical training. Appl Mech Mater. 2011;40-41:812-819. doi:10.4028/www.scientific.net/AMM.40-41.812
    31. Loukas C, Georgiou E. Multivariate autoregressive modeling of hand kinematics for laparoscopic skills assessment of surgical trainees. IEEE Trans Biomed Eng. 2011;58(11):3289-3297. doi:10.1109/TBME.2011.2167324
    32. Huang J, Payandeh S, Doris P, Hajshirmohammadi I. Fuzzy classification: towards evaluating performance on a surgical simulator. Stud Health Technol Inform. 2005;111:194-200.
    33. Sewell C, Morris D, Blevins NH, et al. Providing metrics and performance feedback in a surgical simulator. Comput Aided Surg. 2008;13(2):63-81. doi:10.3109/10929080801957712
    34. Richstone L, Schwartz MJ, Seideman C, Cadeddu J, Marshall S, Kavoussi LR. Eye metrics as an objective assessment of surgical skill. Ann Surg. 2010;252(1):177-182. doi:10.1097/SLA.0b013e3181e464fb
    35. Rhienmora P, Haddawy P, Suebnukarn S, Dailey MN. Intelligent dental training simulator with objective skill assessment and feedback. Artif Intell Med. 2011;52(2):115-121. doi:10.1016/j.artmed.2011.04.003
    36. Kerwin T, Wiet G, Stredney D, Shen HW. Automatic scoring of virtual mastoidectomies using expert examples. Int J Comput Assist Radiol Surg. 2012;7(1):1-11. doi:10.1007/s11548-011-0566-4
    37. Ershad M, Rege R, Fey AM. Meaningful assessment of robotic surgical style using the wisdom of crowds. Int J Comput Assist Radiol Surg. 2018;13(7):1037-1048. doi:10.1007/s11548-018-1738-2
    38. Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920-1930. doi:10.1161/CIRCULATIONAHA.115.001593