Utility of the Simulated Outcomes Following Carotid Artery Laceration Video Data Set for Machine Learning Applications

Key Points

Question: What is the utility of a data set that contains videos of surgeons managing hemorrhage?

Findings: In this quality improvement study, the Simulated Outcomes Following Carotid Artery Laceration (SOCAL) data set, a public collection of videos of surgeons managing catastrophic surgical hemorrhage in a cadaveric training exercise, included 65 071 instrument annotations with recorded outcomes. Computer vision–based instrument detection achieved a mean average precision of 0.67 on SOCAL and, on real intraoperative video, a sensitivity of 0.77 and a positive predictive value of 0.96 for detecting surgical instruments.

Meaning: A corpus of videos of surgeons managing catastrophic hemorrhage is a novel, valuable resource for surgical data science.

After the first trial (T1), participants received feedback from course instructors (endoscopic endonasal approach experts) and watched a standardized video of a senior author (G.Z.) explaining the recommended stepwise technique of internal carotid artery injury (ICAI) management. The second trial (T2) was then performed with feedback.

Data Set Development
Trial video was captured with the Karl Storz video neuroendoscope used during each trial. A total of 147 videos from this nationwide educational intervention were recorded and saved. Videos were recorded at a frame rate of 30 frames per second (fps) and a resolution of 1280×720 or 1920×1080 pixels. The videos span multiple cadaveric heads, with varying lighting, anatomy, laceration sites, camera resolutions, and brands of endoscopic instruments.
Trial durations ranged from 46 seconds to 5 minutes. Each trial video was downsampled from 30 fps to 1 fps using ffmpeg. In every frame, each surgical instrument was labeled as suction, grasper, cottonoid, muscle, string, drill, scalpel, or other (nonspecified surgical instruments). For each instance of an instrument in frame, an annotator drew a bounding box that encompassed the entirety of the instrument, following published protocols and using the open-source image annotation software VoTT. Following a first pass of video annotation, members of the research team with significant experience viewing endoscopic endonasal video (G.K., D.J.P.) manually audited the annotations to check for quality. Frames with missing or mislabeled annotations were re-annotated. In conjunction with trial video recordings, outcomes data (e.g., blood loss, task success) and demographic data (e.g., training status, confidence) were recorded for each participant.
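For illustration, the sketch below shows one way the 1-fps downsampling step could be reproduced by invoking ffmpeg from Python; the file names and output directory are hypothetical, not part of the published pipeline.

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> None:
    """Downsample a trial video to `fps` frames per second using ffmpeg.

    Requires ffmpeg on the system PATH. Frames are written as sequentially
    numbered images (frame_000001.png, frame_000002.png, ...).
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,                      # input trial video (30 fps)
            "-vf", f"fps={fps}",                   # keep one frame per second
            str(Path(out_dir) / "frame_%06d.png"), # numbered output frames
        ],
        check=True,
    )

# Hypothetical usage: extract_frames("trial_001.mp4", "frames/trial_001")
```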

Model Development
Using published model weights from pretraining on ImageNet, RetinaNet and YOLOv3 were fine-tuned to perform instrument bounding box detection on the SOCAL data set. When designing our training, validation, and test splits, we used one cohort of surgeons as the test set (7 surgeons, 14 trials), another cohort as the validation set (6 surgeons, 9 trials), and the remaining cohorts for model training (5 cohorts, 63 surgeons, 135 trials). We split the data set this way to replicate a real-world workflow, in which the model must analyze video (split into individual frames) from an entirely new cohort of surgeons. We evaluated these models on their ability to assign bounding box coordinates to all instances of suction, grasper, cottonoid, string, muscle, drill, scalpel, and other instruments in our data set.
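To make the leakage-free, cohort-level split concrete, the following is a minimal sketch of grouping whole trials by surgeon cohort before splitting; the record structure and cohort identifiers are hypothetical and do not reflect the authors' actual pipeline code.

```python
from collections import defaultdict

# Hypothetical records: each trial video is tagged with its surgeon cohort.
trials = [
    {"video": "trial_001.mp4", "cohort": "site_A"},
    {"video": "trial_002.mp4", "cohort": "site_B"},
    # ... one entry per trial video
]

# Group entire trials by cohort so that no frame from a held-out cohort
# of surgeons ever appears in the training data.
by_cohort = defaultdict(list)
for t in trials:
    by_cohort[t["cohort"]].append(t["video"])

test_set = by_cohort.pop("site_A")   # one cohort held out for testing
val_set = by_cohort.pop("site_B")    # a second cohort for validation
train_set = [v for videos in by_cohort.values() for v in videos]  # the rest
```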
We trained two publicly available object detection models, RetinaNet and YOLOv3.37,38 The input to these models is trial video (split into individual frames), and the output is the bounding box coordinates and label of any of the eight instruments we trained on. To implement RetinaNet, we forked the fizyr/keras-retinanet GitHub repository, initialized the model using the preexisting ImageNet weights, and trained it for 45,000 iterations (18 epochs, 2,500 steps per epoch, batch size of 1). To implement YOLOv3, we forked the zzh8829/yolov3-tf2 GitHub repository, initialized the model using the preexisting Darknet weights, and trained it for 26,972 iterations (11 epochs, 2,452 steps per epoch, batch size of 8). The learning rate was initialized to 1e-5 for RetinaNet and 1e-3 for YOLOv3.
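As an illustration of this frame-in, boxes-out contract, the sketch below runs a trained keras-retinanet snapshot on a single frame, following the inference pattern documented in the fizyr/keras-retinanet repository; the snapshot path, frame name, and confidence threshold are hypothetical, and this is not the authors' exact evaluation code.

```python
import numpy as np
from keras_retinanet import models
from keras_retinanet.utils.image import preprocess_image, read_image_bgr, resize_image

# Load a converted inference model (training snapshots must first be passed
# through models.convert_model so box regression and NMS are applied).
model = models.load_model("snapshot_inference.h5", backbone_name="resnet50")

# Prepare one downsampled trial frame the same way the model expects.
image = read_image_bgr("frame_000001.png")
image = preprocess_image(image)
image, scale = resize_image(image)

# Predict: boxes are (x1, y1, x2, y2); labels index the eight instrument classes.
boxes, scores, labels = model.predict_on_batch(np.expand_dims(image, axis=0))
boxes /= scale  # map boxes back to the original frame coordinates

for box, score, label in zip(boxes[0], scores[0], labels[0]):
    if score < 0.5:  # hypothetical confidence threshold
        break        # detections are sorted by descending score
    print(label, score, box)
```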
For both models, the learning rate was decreased by a factor of 10 whenever the loss plateaued for two epochs. Training was stopped when the loss plateaued for five consecutive epochs.
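This schedule maps directly onto standard Keras callbacks; the following is a minimal sketch assuming the training loss is the monitored quantity (the text says only "loss"), with `model` and `train_generator` standing in as hypothetical names for the detector and the SOCAL data pipeline.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Multiply the learning rate by 0.1 once the loss has plateaued
    # for 2 consecutive epochs.
    ReduceLROnPlateau(monitor="loss", factor=0.1, patience=2, verbose=1),
    # Stop training after 5 consecutive epochs without improvement.
    EarlyStopping(monitor="loss", patience=5, verbose=1),
]

# Hypothetical training call using the callbacks above:
# model.fit(train_generator, epochs=50, callbacks=callbacks)
```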