Assessment of Automated Identification of Phases in Videos of Cataract Surgery Using Machine Learning and Deep Learning Techniques

Key Points
Question: Are deep learning techniques sufficiently accurate to classify presegmented phases in videos of cataract surgery for subsequent automated skill assessment and feedback?
Findings: In this cross-sectional study including videos from a convenience sample of 100 cataract procedures, modeling time series of labels of instruments in use appeared to yield greater accuracy in classifying phases of cataract operations than modeling cross-sectional data on instrument labels, spatial video image features, spatiotemporal video image features, or spatiotemporal video image features with appended instrument labels.
Meaning: Time series models of instruments in use may serve to automate the identification of phases in cataract surgery, helping to develop efficient and effective surgical skill training tools in ophthalmology.

eAppendix. Supplemental methods
eTable 1. Metrics to evaluate performance of algorithms to classify phases in cataract surgery
eTable 2. Differences in area under the receiver operating characteristic curve between pairs of algorithms for phase classification
This supplementary material has been provided by the authors to give readers additional information about their work.

eAppendix. Supplemental methods
This section supplements the Methods described in the manuscript to provide sufficient detail to replicate our experiments. We used Python 3.5.2 with the OpenCV and skvideo.io packages for video processing, scikit-learn for nearest neighbors, and PyTorch 0.4.1 as our deep learning library.

Support Vector Machine (SVM)
We used an SVM in Algorithm #1. The input to the SVM is the labels of instruments in use in a given image or frame of a video (i.e., cross-sectional data). We adopted a one-vs-the-rest strategy to implement the SVM, i.e., we fitted one SVM per phase to classify whether a given vector of instrument labels (in use at the time) belongs to that particular phase or to any of the other phases. We fitted 10 such classifiers, one for each phase.
We represented the labels of instruments in use in a given video frame as a d-dimensional indicator vector, in which the i-th dimension is 1 if the corresponding tool is in use (here d = 14, the number of annotated instruments). Let X = {x_1, …, x_n} be the set of such indicator vectors within phase p of video clip c. We selected a subset X* = {x*_1, …, x*_m} from X such that X* contains only the unique feature vectors. For example, if a particular phase contained only two unique combinations of instrument labels, we included those two as part of the training data. The collection of X* over all phases p and clips c constituted the dataset used to fit our multi-class SVMs.
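As an illustration of this preprocessing step, the sketch below builds the indicator vectors and keeps only the unique combinations observed within a phase. It assumes the annotations are available as per-frame lists of instrument indices; the function names are ours.

import numpy as np

NUM_INSTRUMENTS = 14  # dimensionality of the indicator vector (see the RNN section below)

def frames_to_indicators(frame_instrument_indices):
    """Convert per-frame lists of instrument indices into 0/1 indicator vectors."""
    vectors = np.zeros((len(frame_instrument_indices), NUM_INSTRUMENTS), dtype=np.float32)
    for row, indices in enumerate(frame_instrument_indices):
        vectors[row, list(indices)] = 1.0
    return vectors

def unique_combinations(vectors):
    """Keep only the unique instrument-label combinations observed within a phase."""
    return np.unique(vectors, axis=0)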
We used the linear SVM implementation provided by scikit-learn and performed a grid search over the penalty parameter C ∈ {0.1, 0.2, 0.5, 1, 2, 5} of the error term.
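A minimal sketch of the one-vs-the-rest linear SVM with this grid search is shown below. X_train and y_train are our placeholder names for the de-duplicated indicator vectors and their phase labels (0-9), and the 5-fold cross-validation setting is an illustrative default rather than a value reported in the paper.

from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# One linear SVM per phase (one-vs-the-rest), with a grid search over the penalty parameter C.
ovr_svm = OneVsRestClassifier(LinearSVC())
search = GridSearchCV(ovr_svm, {"estimator__C": [0.1, 0.2, 0.5, 1, 2, 5]}, cv=5)
search.fit(X_train, y_train)  # X_train: unique indicator vectors; y_train: phase labels 0-9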

Temporal Recurrent Neural Network (RNN)
We used the RNN in Algorithms #2, #4, and #5. For the RNN, we implemented a standard long short-term memory (LSTM) model. We implemented three separate RNN models with different inputs:
1. Spatial features obtained from the spatial CNN
2. Spatial features obtained from the spatial CNN with appended instrument labels
3. Instrument labels alone
Given a single frame in the video, the output of SqueezeNet (the spatial CNN model) is a vector of size 512, and the dimension of the instrument labels (annotations) is 14. Thus, the inputs for the three separate RNN models itemized above are 512 × T, 526 × T, and 14 × T, respectively, where T is the temporal length of the clip. We structured the RNN as follows (a minimal sketch follows this list):
• An LSTM layer that outputs a 512-dimensional vector; the input at each time step is fed into this layer and the output is saved.
• A global average pooling across time steps.
• A fully connected layer that outputs a feature vector of length 128.
• The resulting feature is trained to classify the phase of the clip using softmax activation and cross-entropy loss.
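The PyTorch sketch below illustrates this architecture. The batch-first tensor layout, the final 128-to-10 projection, and the class name are our assumptions; input_size is 512, 526, or 14 depending on which of the three inputs listed earlier is used.

import torch
import torch.nn as nn

class TemporalRNN(nn.Module):
    """LSTM over per-frame features, pooled over time, then classified into 10 phases."""

    def __init__(self, input_size, num_phases=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size=512, batch_first=True)
        self.fc = nn.Linear(512, 128)
        # Final projection to phase logits; nn.CrossEntropyLoss applies the softmax internally.
        self.classifier = nn.Linear(128, num_phases)

    def forward(self, x):             # x: (batch, T, input_size)
        outputs, _ = self.lstm(x)     # (batch, T, 512), one output saved per time step
        pooled = outputs.mean(dim=1)  # global average pooling across time steps
        feature = self.fc(pooled)     # (batch, 128)
        return self.classifier(feature)

# input_size = 14 shown here (instrument labels alone); use 512 or 526 for the other inputs.
model = TemporalRNN(input_size=14)
loss_fn = nn.CrossEntropyLoss()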
The following are the optimization parameters needed to replicate our study. We optimized the temporal RNN using the ADAM optimizer with the following parameters:

Spatial Convolutional Neural Network (CNN)
We used the CNN in Algorithms #3, #4, and #5. To implement the spatial CNN, we utilized the SqueezeNet architecture. SqueezeNet consists of multiple fire modules; each module contains a series of 1 × 1 convolutional filters to "squeeze" the information, followed by a set of 1 × 1 and 3 × 3 filters to expand it. In the standard architecture, there are 8 fire modules; the number of filters in each module is given in Figure 1. After the fire modules, the final classification layers are a convolution with n filters, where n is the number of classes (n = 10 in our case), followed by a softmax.
Figure 1: The number of filters in each fire module (figure not reproduced here). Note that in the SqueezeNet paper, these modules are labeled "fire2" through "fire9".
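For reference, a fire module of this form can be sketched in PyTorch as follows; the constructor arguments mirror the squeeze and expand filter counts, which vary by module, and this is a simplified illustration rather than the exact implementation used.

import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet fire module: 1x1 squeeze convolutions, then 1x1 and 3x3 expand convolutions."""

    def __init__(self, in_channels, squeeze_channels, expand1x1_channels, expand3x3_channels):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, squeeze_channels, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_channels, expand1x1_channels, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_channels, expand3x3_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # The two expand branches are concatenated along the channel dimension.
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)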
We took PyTorch's implementation of SqueezeNet from the pretrainedmodels package, already pre-trained on ImageNet. We then replaced the last convolutional layer with one that has 10 output channels rather than 1000 in order to allow for classification of the 10 phases (a code sketch of this step appears at the end of this section). The model was then fine-tuned to classify phase on cataract surgery images. To do this, we first extracted training examples from our videos through the following steps.
• One frame per second was extracted from all surgical videos (a sketch of this step follows the list), while still maintaining the split between training and evaluation data.
• The resulting image training data were then balanced so that there was an even distribution of images across the phases.
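The 1-frame-per-second extraction could be done with OpenCV roughly as follows; the function name and the fallback frame rate are ours, and no claim is made that this matches the authors' exact script.

import cv2

def extract_frames_one_per_second(video_path):
    """Yield approximately one frame per second of video, read with OpenCV."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30  # fall back to 30 fps if metadata is missing
    frame_index = 0
    while True:
        success, frame = capture.read()
        if not success:
            break
        if frame_index % int(round(fps)) == 0:
            yield frame
        frame_index += 1
    capture.release()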
We then fine-tuned the model on the above data set using the ADAM optimizer with the following parameters:
• Initial learning rate: 0.001
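The sketch below combines the two modeling steps described above: replacing SqueezeNet's final 1000-class convolution with a 10-class one and fine-tuning with the ADAM optimizer at the stated learning rate. It uses torchvision's pre-trained SqueezeNet as an equivalent source (the authors used the pretrainedmodels package), the classifier attribute layout shown is torchvision's, and train_loader is a placeholder for a DataLoader over the balanced frame data set.

import torch
import torch.nn as nn
from torchvision import models

NUM_PHASES = 10

# Load SqueezeNet pre-trained on ImageNet and swap the final 1000-class
# convolution for a 10-class convolution (torchvision's classifier layout).
model = models.squeezenet1_1(pretrained=True)
model.classifier[1] = nn.Conv2d(512, NUM_PHASES, kernel_size=1)
model.num_classes = NUM_PHASES

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # initial learning rate as stated above
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, phase_labels in train_loader:  # batches of balanced frames and phase labels (assumed)
    optimizer.zero_grad()
    loss = loss_fn(model(images), phase_labels)
    loss.backward()
    optimizer.step()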