Development and Validation of Machine Learning Models Using Natural Language Processing to Classify Substances Involved in Overdose Deaths

Key Points
Question: What is the most accurate machine learning and natural language processing model for identifying substances related to overdose deaths in medical examiner data?
Findings: In this diagnostic study of 35 433 death records, machine learning models classified deaths related to any opioids, heroin, fentanyl, prescription opioids, methamphetamine, cocaine, and alcohol with perfect or near-perfect performance. Classification of benzodiazepines was suboptimal.
Meaning: In this study, a natural language processing workflow was able to automate the identification of substances related to overdose deaths in medical examiner data.

eTable 1. Classifications and keywords of substances related to overdoses.
Coefficients were extracted using TF-IDF and logistic regression. Tokens in the Positive (right) plot increase the probability that the text description will be classified to the substance. Tokens in the Negative (left) plot decrease the probability that the text description will be classified to the substance.
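To make the caption concrete, the following is a minimal sketch of how token coefficients can be read off a TF-IDF plus logistic regression classifier. It is not the authors' pipeline: the example texts, the labels, the whitespace tokenizer, the smoothed IDF formula, the L2 penalty, and the gradient-descent settings are all assumptions chosen for illustration, and the helper names `tfidf_matrix` and `fit_logistic` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy cause-of-death texts; label 1 = classified to fentanyl.
texts = [
    "acute fentanyl toxicity",
    "fentanyl and heroin intoxication",
    "combined fentanyl cocaine toxicity",
    "acute cocaine toxicity",
    "chronic ethanol abuse",
    "methamphetamine intoxication",
]
labels = [1, 1, 1, 0, 0, 0]


def tfidf_matrix(docs):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed IDF (an assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


def fit_logistic(X, y, lr=0.5, epochs=2000, l2=0.01):
    """Fit L2-penalised logistic regression with batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


vocab, X = tfidf_matrix(texts)
weights, bias = fit_logistic(X, labels)

# Sort tokens by coefficient: most negative first, most positive last.
ranked = sorted(zip(vocab, weights), key=lambda kv: kv[1])
negative_tokens = ranked[:3]        # tokens that lower the predicted probability
positive_tokens = ranked[-3:][::-1]  # tokens that raise the predicted probability

print("positive:", positive_tokens)
print("negative:", negative_tokens)
```

A positive coefficient means that a higher TF-IDF weight for that token raises the predicted probability of the substance class, and a negative coefficient lowers it, which is the split that the Positive and Negative plots display.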
Bootstrapped diagnostic metrics and 95% confidence intervals of the best-performing models in the test dataset (n = 7,087) using TF-IDF as the feature representation. Values are means across 1,000 bootstrap resamples; values in parentheses are the lower and upper bounds of the 95% percentile interval from the bootstrapping procedure.
eTable 4. Bootstrapped diagnostic metrics and 95% confidence intervals of the best-performing models in the test dataset (n = 7,087) using word embeddings (GloVe) as the feature representation.
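The bootstrapping procedure named in these captions can be sketched as follows: resample the test set with replacement, recompute the metrics on each resample, then report the mean and the 2.5th and 97.5th percentiles. This is an illustrative sketch only. The toy labels and predictions, the choice of sensitivity, specificity, and positive predictive value as metrics, the fixed seed, and the simple index-based percentile rule are assumptions, and `diagnostic_metrics` and `bootstrap_ci` are hypothetical helper names rather than the authors' code.

```python
import random


def diagnostic_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and PPV for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def safe_div(num, den):
        # Return 0.0 when a resample happens to contain no relevant cases.
        return num / den if den else 0.0

    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "ppv": safe_div(tp, tp + fp),
    }


def bootstrap_ci(y_true, y_pred, n_resamples=1000, seed=0):
    """Return {metric: (mean, lower, upper)} from a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    samples = {}
    for _ in range(n_resamples):
        # Draw n record indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        metrics = diagnostic_metrics(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
        for name, value in metrics.items():
            samples.setdefault(name, []).append(value)

    summary = {}
    for name, values in samples.items():
        values.sort()
        mean = sum(values) / len(values)
        # Approximate the 2.5th and 97.5th percentiles by sorted position.
        lower = values[int(0.025 * (len(values) - 1))]
        upper = values[int(0.975 * (len(values) - 1))]
        summary[name] = (mean, lower, upper)
    return summary


# Hypothetical test-set labels and model predictions (100 records).
y_true = [1] * 40 + [0] * 60
y_pred = [1] * 36 + [0] * 4 + [0] * 57 + [1] * 3

point = diagnostic_metrics(y_true, y_pred)
summary = bootstrap_ci(y_true, y_pred)
for name, (mean, lower, upper) in summary.items():
    print(f"{name}: {mean:.3f} ({lower:.3f}, {upper:.3f})")
```

Fixing the seed makes the intervals reproducible between runs. The percentile bounds come straight from the empirical distribution of resampled metrics, so no normality assumption is needed.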