Identifying Smoking Environments From Images of Daily Life With Deep Learning

This cross-sectional study assesses the utility of deep learning for identifying environments and objects associated with smoking.

Three approaches to classification, each comprising several specific classifiers, were initially explored. Two were based on the previously described Inception v4 CNN (Approaches 1 and 2), and the third used a Faster-RCNN object detection network [1] based on the ResNet CNN architecture [2] and pre-trained on the Common Objects in Context (COCO) dataset (Approach 3) [3]. All classifiers were trained and evaluated using nested cross-validation (CV) [4], and the same CV partitions were used for each classifier. Numeric and categorical hyperparameters were selected as the median or mode, respectively, of the optimal values found in each inner loop. The final model was selected for its competitive performance (i.e., no statistically significant differences compared with other Approach 1 models) and for the interpretability and familiarity of logistic regression.
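The nested CV partitioning and the median/mode rule for combining hyperparameters can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's code: the function names, fold counts, and example hyperparameter values are all assumptions introduced here.

```python
import random
from statistics import median, mode


def kfold_indices(n_items, n_folds, seed=0):
    """Shuffle item indices once and deal them into n_folds disjoint folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def nested_cv_splits(n_items, n_outer=5, n_inner=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    inner_splits holds (inner_train, inner_val) pairs drawn only from
    outer_train, so the outer test fold never influences tuning.
    Fold counts here are illustrative assumptions.
    """
    outer_folds = kfold_indices(n_items, n_outer, seed)
    for k, test in enumerate(outer_folds):
        train = [i for j, fold in enumerate(outer_folds) if j != k for i in fold]
        inner = []
        for val_positions in kfold_indices(len(train), n_inner, seed + k + 1):
            val = [train[p] for p in val_positions]
            val_set = set(val)
            inner.append(([i for i in train if i not in val_set], val))
        yield train, test, inner


def aggregate_hyperparameters(per_loop_optima):
    """Combine the optimum found in each inner loop into one setting:
    median for numeric hyperparameters, mode for categorical ones."""
    combined = {}
    for name in per_loop_optima[0]:
        values = [optimum[name] for optimum in per_loop_optima]
        if all(isinstance(v, (int, float)) for v in values):
            combined[name] = median(values)
        else:
            combined[name] = mode(values)
    return combined


# Hypothetical optima from three inner loops (values are illustrative only).
example_optima = [
    {"C": 0.1, "activation": "relu"},
    {"C": 1.0, "activation": "tanh"},
    {"C": 10.0, "activation": "relu"},
]
selected = aggregate_hyperparameters(example_optima)
```

Reusing the same `seed` for every classifier reproduces the same partitions, which mirrors the requirement that all classifiers share CV partitions.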
The final classifier (Inception v4 + L2-regularized Logistic Regression) had the highest AUC and accuracy under all three validation schemes (i.e., trained via cross-validation with Durham images, with Pittsburgh images, and with all images). However, several other Approach 1 models had similar performance; for example, a second Approach 1 model (Inception v4 + MLP) had similar AUC when trained on the Durham and combined image sets (0.855 and 0.828, respectively), and a third Approach 1 model (Inception v4 + LDA) had similar accuracy (78.6% and 76.3%, respectively). Detailed performance for all classifiers (mean ± SD of AUC and accuracy across all CV folds for all image sets) may be found in eTable 3.
Differences in AUC between classifiers of the same approach were not statistically significant (p > 10⁻⁴). In contrast, differences in AUC between approaches were statistically significant (p < 10⁻⁴): Approach 1 performed better than Approach 2, which in turn performed better than Approach 3. The one exception was the Pittsburgh image set, where differences between Approaches 1 and 2 were not statistically significant (p > 10⁻⁴).

Approach 1: Inception v4 + Classifier
These classifiers follow the approach described in the main text, in which the output logits from the pre-trained Inception v4 model were used as predictors to train a smoking/nonsmoking classifier in Scikit-learn 0.19.1 [5]. In addition to L2-regularized logistic regression, we explored: (1) L1-regularized logistic regression, (2) a multilayer perceptron (MLP) with a single hidden layer, and (3) linear discriminant analysis (LDA). Hyperparameters tuned by nested CV included the regularization parameters and the number of MLP hidden units.
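As a rough sketch of how the four Approach 1 classifiers could be set up in scikit-learn, the example below fits each one to a feature matrix. Everything specific in it is an assumption made for illustration: random numbers stand in for the Inception v4 logits, the array sizes are arbitrary, and the regularization strength `C` and the number of hidden units are fixed placeholders rather than values tuned by nested CV.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in for the Inception v4 logits: one row per image.
rng = np.random.RandomState(0)
n_images, n_logits = 200, 50                 # illustrative sizes only
labels = rng.randint(0, 2, size=n_images)    # 1 = smoking, 0 = nonsmoking
logits = rng.normal(size=(n_images, n_logits)) + 0.5 * labels[:, None]

# Placeholder hyperparameters; the study tuned these by nested CV.
classifiers = {
    "l2_logistic": LogisticRegression(penalty="l2", C=1.0, solver="liblinear"),
    "l1_logistic": LogisticRegression(penalty="l1", C=1.0, solver="liblinear"),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    "lda": LinearDiscriminantAnalysis(),
}

# A single train/test split keeps the sketch short; it is not nested CV.
train, test = slice(0, 150), slice(150, None)
accuracy = {}
for name, clf in classifiers.items():
    clf.fit(logits[train], labels[train])
    accuracy[name] = clf.score(logits[test], labels[test])
```

The `liblinear` solver is used for both logistic regressions because it supports the L1 and the L2 penalty, so the two models differ only in the regularizer.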

Approach 2: Inception v4 Retraining
The Inception v4 network was modified and fine-tuned to directly classify images as smoking/nonsmoking. Specifically, the final two layers (logit and softmax) were modified for our two-class problem and randomly initialized. The network was then trained in TensorFlow via stochastic gradient descent (Adam optimizer [6], learning rate = 10⁻⁴, dropout pkeep = 0.8) with mini-batches of 60 images to minimize average cross-entropy over the training set for each outer fold. The number of training epochs was chosen by nested CV: training proceeded until average cross-entropy over the inner fold validation set exceeded 105% of its minimum [7].
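The stopping rule can be illustrated with a small helper that scans a sequence of per-epoch validation losses. This is a sketch under stated assumptions: the function name is hypothetical, the losses are made-up numbers, and "its minimum" is interpreted as the running minimum observed up to the current epoch.

```python
def stopping_epoch(val_losses, tolerance=1.05):
    """Return the 1-based epoch at which training would stop.

    Training stops at the first epoch whose validation loss exceeds
    tolerance times the lowest validation loss seen so far (assumed
    interpretation of the 105% rule). If that never happens, the last
    epoch is returned. Losses are assumed to be positive.
    """
    best = float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        best = min(best, loss)
        if loss > tolerance * best:
            return epoch
    return len(val_losses)


# Made-up validation cross-entropy values, one per epoch.
example_losses = [0.70, 0.60, 0.55, 0.56, 0.59, 0.62]
stop_at = stopping_epoch(example_losses)
```

In this example the minimum is 0.55, so the threshold is 0.5775; epoch 4 (0.56) stays below it and epoch 5 (0.59) is the first to exceed it.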

Approach 3: Faster-RCNN-ResNet + Classifier
A COCO-trained Faster-RCNN-ResNet model was directly applied to all images via TensorFlow to detect objects included in the 90 COCO object classes. Object class counts were then used as predictors for a classification model trained on the current dataset. Five classifiers were explored: (1) L1- and (2) L2-regularized logistic regression, (3) multilayer perceptron with a single hidden layer, (4) Bernoulli naïve Bayes, and (5) multinomial naïve Bayes. These classifiers were implemented in Python 3.5 via Scikit-learn 0.19.1.
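Turning detector output into per-class count predictors might look like the sketch below. The helper names, the four-class vocabulary (a small subset of COCO classes), the example detections, and the `min_score` confidence threshold are all assumptions for illustration; the source does not state whether or how detections were thresholded.

```python
from collections import Counter


def count_features(detections, class_names, min_score=0.5):
    """Convert one image's detections into a vector of per-class counts.

    detections: list of (class_name, score) pairs from an object detector.
    min_score is a hypothetical confidence cutoff, not taken from the study.
    """
    counts = Counter(name for name, score in detections if score >= min_score)
    return [counts.get(name, 0) for name in class_names]


def presence_features(count_vector):
    """Binarize counts to presence/absence, the input form that a
    Bernoulli naive Bayes model works with."""
    return [1 if count > 0 else 0 for count in count_vector]


# Illustrative vocabulary and detections for a single image.
class_names = ["person", "car", "cup", "chair"]
detections = [("person", 0.9), ("cup", 0.8), ("person", 0.7), ("car", 0.2)]

counts = count_features(detections, class_names)
presence = presence_features(counts)
```

Count vectors of this form suit the multinomial naïve Bayes and regression models, while the binarized form corresponds to what a Bernoulli model assumes.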

Objects per Image
The number of objects detected per image (via Faster-RCNN-ResNet) was higher for the Durham images than for the Pittsburgh images (p = 0.004), with a greater proportion of images having ≥ 2 objects (77.7% Durham, 68.5% Pittsburgh; p < 0.001).