The algorithm is based on skin pathology reports linked to health claims data in Ontario, Canada.
eTable. Predictor Variables Based on Ontario Health Insurance Plan (OHIP) Claims
Chan A, Fung K, Tran JM, et al. Application of Recursive Partitioning to Derive and Validate a Claims-Based Algorithm for Identifying Keratinocyte Carcinoma (Nonmelanoma Skin Cancer). JAMA Dermatol. 2016;152(10):1122–1127. doi:10.1001/jamadermatol.2016.2609
Question
Can a valid algorithm be derived to identify keratinocyte carcinoma at a population level using health insurance claims data?
Findings
By applying recursive partitioning to a data set of 602 371 community laboratory pathology episodes linked to health insurance claims, an algorithm was derived with 82.6% sensitivity, 93.0% specificity, 76.7% positive predictive value, and 95.0% negative predictive value. The derived algorithm also performed well when validated using an independent hospital clinic data set.
Meaning
The derived algorithm can reliably identify keratinocyte carcinoma for epidemiological research in the absence of cancer registry data. Recursive partitioning is an effective tool for deriving valid claims-based algorithms.
Importance
Keratinocyte carcinoma (nonmelanoma skin cancer) accounts for substantial burden in terms of high incidence and health care costs but is excluded by most cancer registries in North America. Administrative health insurance claims databases offer an opportunity to identify these cancers using diagnosis and procedural codes submitted for reimbursement purposes.
Objective
To apply recursive partitioning to derive and validate a claims-based algorithm for identifying keratinocyte carcinoma with high sensitivity and specificity.
Design, Setting, and Participants
Retrospective study using population-based administrative databases linked to 602 371 pathology episodes from a community laboratory for adults residing in Ontario, Canada, from January 1, 1992, to December 31, 2009. The final analysis was completed in January 2016. We used recursive partitioning (classification trees) to derive an algorithm based on health insurance claims. The performance of the derived algorithm was compared with 5 prespecified algorithms and validated using an independent academic hospital clinic data set of 2082 patients seen in May and June 2011.
Main Outcomes and Measures
Sensitivity, specificity, positive predictive value, and negative predictive value using the histopathological diagnosis as the criterion standard. We aimed to achieve maximal specificity, while maintaining greater than 80% sensitivity.
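The four performance measures follow directly from a 2 × 2 table of algorithm prediction vs histopathological diagnosis. A minimal sketch (the counts below are illustrative only, not the study's data):

```python
def performance(tp, fp, fn, tn):
    """Return the four measures from a 2x2 confusion matrix:
    tp/fp = true/false positives, fn/tn = false/true negatives."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all true cases
        "specificity": tn / (tn + fp),  # true negatives among all non-cases
        "ppv": tp / (tp + fp),          # probability a positive call is correct
        "npv": tn / (tn + fn),          # probability a negative call is correct
    }

# Illustrative counts only (not taken from the study).
m = performance(tp=826, fp=251, fn=174, tn=3349)
```

Note that positive and negative predictive values, unlike sensitivity and specificity, depend on the prevalence of keratinocyte carcinoma in the cohort.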
Results
Among 602 371 pathology episodes, 131 562 (21.8%) had a diagnosis of keratinocyte carcinoma. Our final derived algorithm outperformed the 5 simple prespecified algorithms and performed well in both community and hospital data sets in terms of sensitivity (82.6% and 84.9%, respectively), specificity (93.0% and 99.0%, respectively), positive predictive value (76.7% and 69.2%, respectively), and negative predictive value (95.0% and 99.6%, respectively). Algorithm performance did not vary substantially during the 18-year period.
Conclusions and Relevance
This algorithm offers a reliable mechanism for ascertaining keratinocyte carcinoma for epidemiological research in the absence of cancer registry data. Our findings also demonstrate the value of recursive partitioning in deriving valid claims-based algorithms.
Keratinocyte carcinoma (often called nonmelanoma skin cancer) consists of cutaneous basal and squamous cell carcinomas.1 It is the most commonly diagnosed malignant neoplasm in populations of white race worldwide and can lead to substantial disfigurement, functional impairment, and (rarely) death.2,3 The estimated number of patients undergoing procedural treatment for keratinocyte carcinoma in the United States from 2006 to 2012 increased by 14%.4 Keratinocyte carcinoma accounts for 5% of all Medicare cancer expenditures, ranking as the fifth most expensive cancer to treat in terms of total costs in the United States.5,6
Despite such high incidence and health care costs, most cancer registries in the United States and Canada exclude keratinocyte carcinoma. This lack of population-based data impairs epidemiological research and health policy decision making. Administrative health insurance claims databases offer the opportunity to identify these cancers using diagnosis and procedural codes submitted for reimbursement purposes.7 However, previous efforts to validate claims-based algorithms based on 2 predictor variables have found low sensitivity to ascertain keratinocyte carcinoma.8
More complex algorithms have the potential to identify keratinocyte carcinoma with higher sensitivity and specificity but can be difficult to derive and validate owing to the large number of combinations of predictor variables that can be used. Recursive partitioning is a powerful statistical tool that systematically identifies the best predictors for a given outcome. It has been applied to claims databases to develop algorithms for identifying the presence of select medical conditions.9-11
We used recursive partitioning to derive and validate a health insurance claims–based algorithm for identifying keratinocyte carcinoma at a population level using linked data from pathology reports and Ontario Health Insurance Plan (OHIP) claims in Ontario, Canada, from January 1, 1992, to December 31, 2009. We completed our final analyses in January 2016. With 12.9 million residents in 2011, the province accounts for 38% of Canada’s population.12 We aimed to produce an algorithm that would have maximal specificity, while maintaining greater than 80% sensitivity.
Methods
After receiving research ethics approval from Women’s College Hospital (Toronto, Ontario, Canada), we linked skin pathology reports and health claims data for adult residents of Ontario (Figure). No informed consent was required for this epidemiological study because data sets were linked using unique encoded identifiers. We applied binary recursive partitioning (also known as classification trees) to these linked data to derive a claims-based algorithm that was then validated using internal and external data sets.
Ontario has a universal, single-payer health care system. The OHIP provides health insurance coverage of medically necessary care to all Ontario residents. No private payer reimbursement is allowed for any service covered by OHIP. The OHIP claims include diagnosis codes and procedural fee codes related to outpatient and inpatient physician encounters. The OHIP requires that physicians have a pathology report available to be reimbursed for any surgical treatment claim for skin cancer; therefore, all keratinocyte carcinomas should have been sent for pathological analysis unless they were treated topically without biopsy.
We obtained all 905 743 electronic skin pathology reports issued from January 1, 1992, to December 31, 2009, by LifeLabs, a major community laboratory in the province of Ontario. We imported the pathology reports into the Institute for Clinical Evaluative Sciences server in Toronto, which specializes in linking and managing population-based administrative databases for epidemiological research in Ontario. The pathology reports were securely linked to OHIP billing claims using each patient’s unique, encoded OHIP number before undergoing analysis.
We excluded 37 790 pathology reports for 36 134 nonresidents of Ontario and 1656 specimen collections dated outside of the 1992 to 2009 eligibility period. To account for multiple reports for the same or separate tumors in a given patient, we collapsed the remaining 867 953 reports into 627 510 community laboratory pathology episodes by combining all reports for the same patient within a 1-year window. After removing 25 139 episodes starting in patients younger than 18 years, our final sample consisted of 602 371 community laboratory pathology episodes.
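The collapsing step can be sketched as follows. The exact windowing rule is an assumption for illustration: here a report joins the current episode if it falls within 1 year of that episode's first report, and otherwise starts a new episode.

```python
from datetime import date

def collapse_into_episodes(report_dates, window_days=365):
    """Group one patient's pathology report dates into episodes: a report
    joins the current episode if it falls within `window_days` of the
    episode's first report; otherwise it starts a new episode."""
    episodes = []
    for d in sorted(report_dates):
        if episodes and (d - episodes[-1][0]).days <= window_days:
            episodes[-1].append(d)  # same tumor/patient within the window
        else:
            episodes.append([d])    # new episode
    return episodes

# Two reports 142 days apart collapse into one episode;
# a report 2 years later starts a second episode.
eps = collapse_into_episodes([date(1995, 1, 10), date(1995, 6, 1), date(1997, 3, 5)])
```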
We electronically searched the 3 diagnosis fields in each pathology report to identify potential keratinocyte carcinomas. Two stages of searches were conducted. In stage 1, search terms included basal cell, fibroepithelioma of pinkus, squamous cell, basosquamous, and keratoacanthoma. In stage 2, we used the broader term carcinoma to filter the reports remaining after stage 1. Search results from both stages were manually reviewed to confirm keratinocyte carcinoma diagnoses.
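The two-stage search can be sketched with simple case-insensitive substring matching over the diagnosis fields (the matching logic here is a simplification; in the study, hits from both stages were still manually reviewed):

```python
STAGE1_TERMS = ["basal cell", "fibroepithelioma of pinkus", "squamous cell",
                "basosquamous", "keratoacanthoma"]

def flag_report(diagnosis_fields):
    """Return the stage (1 or 2) at which a report is flagged for manual
    review, or None if no search term matches the diagnosis fields."""
    text = " ".join(diagnosis_fields).lower()
    if any(term in text for term in STAGE1_TERMS):
        return 1
    if "carcinoma" in text:  # broader stage 2 filter on the remainder
        return 2
    return None

r1 = flag_report(["Basal cell carcinoma, nodular type"])
r2 = flag_report(["Merkel cell carcinoma"])
r3 = flag_report(["Seborrheic keratosis"])
```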
Patients with a histopathological diagnosis of keratinocyte carcinoma were classified as keratinocyte carcinoma positive, constituting the criterion standard diagnosis. Patients with any other diagnosis were considered keratinocyte carcinoma negative. If multiple skin specimens were collected from the same patient within a 1-year episode, the patient was classified as keratinocyte carcinoma positive if any of the specimens had a diagnosis of keratinocyte carcinoma. The diagnosis date was recorded from the earliest keratinocyte carcinoma pathology report within the 1-year period. Squamous cell carcinomas from anogenital or oral sites (aside from the lip) were not classified as keratinocyte carcinoma for our study because these cancers are already recorded in cancer registries.
We randomly divided the cohort into thirds using a computer random number generator (Figure). We used one-third to derive potential algorithms (derivation cohort [n = 201 158]), one-third to evaluate the potential algorithms and select the best-performing one (selection cohort [n = 200 237]), and the remaining one-third (internal validation cohort [n = 200 976]) to obtain an unbiased estimate of the performance of the final selected algorithm.
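The random three-way split can be sketched as follows (the seed and exact assignment method are illustrative; the study's cohorts were not exactly equal in size):

```python
import random

def split_into_thirds(ids, seed=42):
    """Randomly shuffle episode IDs and split them into derivation,
    selection, and internal validation cohorts of roughly equal size."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    ids = list(ids)
    rng.shuffle(ids)
    third = len(ids) // 3
    return ids[:third], ids[third:2 * third], ids[2 * third:]

derivation, selection, validation = split_into_thirds(range(602_371))
```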
Using the derivation cohort, we identified the OHIP index claim for specimen collection (biopsy or surgery) associated with the pathology report. We developed a list of potential claims-based variables associated with this index claim that could be positively or negatively associated with keratinocyte carcinoma (eTable in the Supplement).
We input these predictors into the rpart function of a software package (rpart, version 4.1-10; The R Project for Statistical Computing) to perform binary recursive partitioning and construct an initial large classification tree. The nonparametric approach develops decision rules to fit a classification tree without making any assumptions about the nature of an underlying statistical model that relates the predictor variables to the outcome variable. The decision tree is built through an iterative process whereby nodes in the tree are split into increasingly homogeneous subgroups based on the observed outcome. A histopathological diagnosis of keratinocyte carcinoma was the outcome variable. The Gini criterion was used as the splitting rule, meaning that the rpart function sequentially partitioned the sample into increasingly homogeneous subgroups (ie, mostly keratinocyte carcinoma or mostly other diagnoses) based on one predictor variable at a time. The most discriminating predictor variable (ie, best able to distinguish keratinocyte carcinoma vs other diagnoses) was selected at each partition until no further partitioning was possible (stopping criterion, P < .0000001 on χ2 test).13,14 We then performed iterative, backward “pruning” of this tree to produce a sequence of smaller trees by increasing the minimum number of observations in a partition to qualify for further partitioning and by decreasing the false-positive α error.15 We varied the ratio of misclassification cost for false-negative vs false-positive predictions to produce multiple trees. Next, we calculated sensitivities and specificities from these multiple trees and selected the 2 best-performing algorithms (derived algorithms 1 and 2) to apply to the selection cohort.
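The Gini splitting rule at the heart of this process can be illustrated in pure Python. The authors used R's rpart; this standalone sketch only shows how the most discriminating binary predictor is chosen at a single node (the predictor names are hypothetical):

```python
def gini(labels):
    """Gini impurity of a set of 0/1 outcome labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, outcomes):
    """Pick the binary predictor whose split gives the largest weighted
    reduction in Gini impurity at this node, as in the Gini criterion."""
    n = len(outcomes)
    parent = gini(outcomes)
    best = (None, 0.0)
    for var in rows[0]:
        left = [y for r, y in zip(rows, outcomes) if r[var]]
        right = [y for r, y in zip(rows, outcomes) if not r[var]]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best[1]:
            best = (var, gain)
    return best

# Toy data: a hypothetical "code_173" claim separates the outcomes
# perfectly, while a "biopsy" claim carries no information.
rows = [{"code_173": 1, "biopsy": 1}, {"code_173": 1, "biopsy": 0},
        {"code_173": 0, "biopsy": 1}, {"code_173": 0, "biopsy": 0}]
outcomes = [1, 1, 0, 0]
var, gain = best_split(rows, outcomes)
```

Recursive partitioning applies this choice repeatedly to each resulting subgroup until a stopping criterion is met, then prunes the tree back.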
Using the selection cohort, we compared the performance (sensitivity, specificity, positive predictive value, and negative predictive value) of derived algorithms 1 and 2 with the following 5 simple, plausible, prespecified algorithms based on claims dated within 180 days after the index date: (1) keratinocyte carcinoma surgical treatment claim or radiation treatment claim with International Classification of Diseases, Ninth Revision (ICD-9) diagnosis code 173 (keratinocyte carcinoma), (2) keratinocyte carcinoma surgical treatment claim or any claim with diagnosis code 173, (3) any claim with diagnosis code 173, (4) keratinocyte carcinoma surgical treatment claim, and (5) keratinocyte carcinoma surgical treatment claim with diagnosis code 173. We then selected the algorithm with the best performance (highest sensitivity and specificity ≥80%) and reasonable face validity (as determined by the investigators). Review for face validity led to the addition of 2 supplemental criteria. First, a patient was classified as keratinocyte carcinoma positive if diagnosis code 173 was recorded as being responsible for an inpatient or outpatient hospital stay based on records from the Canadian Institute for Health Information (Discharge Abstract Database, National Ambulatory Care Reporting System database, and Same Day Surgery Database). Second, the algorithm would predict keratinocyte carcinoma negative if the index claim or diagnosis code was specific to anogenital sites (eg, ICD-9 code 154 for anal carcinoma).
Next, we internally validated the performance of the final selected algorithm using the remaining one-third of our LifeLabs cohort (internal validation cohort). We also evaluated any temporal (5-year interval) variations in performance.
Last, we externally validated the performance of the final selected algorithm using an independent clinic data set obtained by retrospective medical record review. We used clinic records to identify all patients seen from May 1 to June 30, 2011, at the Women’s College Hospital general dermatology clinic (n = 2082). To identify histopathologically confirmed diagnoses of keratinocyte carcinoma in these patients, we manually reviewed their medical records and any pathology reports. These data were then linked to the OHIP claims database. The performance of the final algorithm and the 5 simple prespecified algorithms was evaluated using index OHIP claims associated with these patients in May and June 2011.
Results
Among 602 371 pathology episodes, 131 562 (21.8%) had a diagnosis of keratinocyte carcinoma. Using recursive partitioning, the derivation cohort yielded algorithms with sensitivities ranging from 71.3% to 88.5% and specificities ranging from 90.4% to 95.3%.
When applied to the selection cohort, performance varied among the 2 best derived algorithms and the 5 prespecified algorithms (Table 1). Derived algorithm 1 was the best-performing algorithm in terms of specificity (92.9%) and positive predictive value (76.4%), while maintaining a sensitivity of 82.3%. This final algorithm identifies keratinocyte carcinoma if the index claim (specimen collection) is not for an anogenital site and any one of the following 3 criteria is present: (1) index claim for cutaneous specimen collection (biopsy or surgery), followed by any claim accompanied by ICD-9 diagnosis code 173 (nonmelanoma skin cancer) in the subsequent 180 days; (2) index claim for skin cancer surgical treatment accompanied by diagnosis code 173; or (3) hospital discharge record with diagnosis code 173.
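The final algorithm's logic can be sketched as a single decision function. The record field names below are hypothetical, chosen for illustration; the actual OHIP fee and diagnosis code lists are given in the paper's eTable.

```python
from datetime import date, timedelta

def classify_episode(index_claim, claims, hospital_records, window_days=180):
    """Apply the final derived algorithm to one pathology episode.
    `index_claim` is the specimen-collection claim, `claims` are subsequent
    OHIP claims, and `hospital_records` are CIHI discharge/visit records.
    All field names are illustrative, not the actual OHIP schema."""
    # Exclusion: an anogenital index claim predicts negative.
    if index_claim.get("anogenital_site"):
        return False
    window_end = index_claim["date"] + timedelta(days=window_days)
    # Criterion 1: any claim with diagnosis code 173 within 180 days.
    if any(c["diagnosis"] == "173" and index_claim["date"] <= c["date"] <= window_end
           for c in claims):
        return True
    # Criterion 2: index claim is skin cancer surgery coded 173.
    if index_claim.get("skin_cancer_surgery") and index_claim.get("diagnosis") == "173":
        return True
    # Criterion 3: hospital record with responsible diagnosis 173.
    return any(h["diagnosis"] == "173" for h in hospital_records)

idx = {"date": date(2000, 1, 1), "anogenital_site": False}
pos = classify_episode(idx, [{"diagnosis": "173", "date": date(2000, 2, 1)}], [])
neg = classify_episode(idx, [{"diagnosis": "216", "date": date(2000, 2, 1)}], [])
```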
In the LifeLabs community laboratory internal validation cohort, 21.9% (44 085 of 200 976) of pathology episodes had a keratinocyte carcinoma diagnosis. The final algorithm achieved 82.6% (95% CI, 82.3%-83.0%) sensitivity, 93.0% (95% CI, 92.8%-93.1%) specificity, 76.7% (95% CI, 76.3%-77.1%) positive predictive value, and 95.0% (95% CI, 94.9%-95.1%) negative predictive value. Algorithm performance did not vary substantially across time (Table 2). Compared with the actual incidence in our cohort, the algorithm produced slightly higher incidence estimates overall (23.6% vs 21.9%) and across sex and age groups (Table 3).
In the hospital clinic external validation cohort, 2.5% (53 of 2082) of patients had a keratinocyte carcinoma diagnosis. The final algorithm produced an incidence estimate of 3.1% (65 of 2082) and outperformed the 5 prespecified algorithms, particularly in terms of sensitivity (84.9%) and positive predictive value (69.2%) (Table 4).
Discussion
We derived a claims-based algorithm that provides the first validated mechanism, to our knowledge, for identifying keratinocyte carcinoma with high sensitivity and specificity on a large regional scale in North America. This algorithm enables measurement of incidence trends, evaluation of associations with other diseases, and health services research related to diagnosis and management of keratinocyte carcinoma. Our use of recursive partitioning can also be applied to develop claims-based algorithms for other diseases or health systems.
Existing methods of identifying keratinocyte carcinoma in North America have major limitations. In Canada, only Alberta, Manitoba, Saskatchewan, and New Brunswick have cancer registries that record keratinocyte carcinoma.16-18 These 4 provinces have small populations (0.75-3.65 million) and few visible minorities (2%-18%),19,20 which limits the generalizability of their registry data. Similarly, data sources in the United States have limited generalizability across age groups, races, and socioeconomic status because they are restricted to specific health care organizations,21,22 Medicare patients,4,7 or small counties.23-25 Furthermore, the use of Medicare claims to identify keratinocyte carcinoma has not been validated, to our knowledge.
Our algorithm is more complex and performed substantially better than previously published claims-based algorithms. Validation studies8,26 using data from a US health maintenance organization found that an algorithm requiring both an ICD-9 diagnosis code 173 and a Current Procedural Terminology treatment code had the highest specificity (94%) and positive predictive value (68%) but low sensitivity (48%) compared with either code alone. When applied to our data sets, this algorithm (prespecified algorithm 5) performed worse than our final derived algorithm in terms of sensitivity and positive predictive value.
Our novel use of recursive partitioning to derive the algorithm has several advantages, including relative ease of handling a large number of predictor variables, flexibility in varying the prioritization of misclassifications to emphasize specificity vs sensitivity, and a nonparametric nature that does not require explicit assumptions about the nature of the relationship between continuous predictor variables and the outcome. The limitations of recursive partitioning include the potential for overfitting and low generalizability, which underscores the importance of validating the algorithm in an independent data set.
Our study has limitations. First, the coding system used for health claims may differ between geographic regions but is unlikely to affect the generalizability beyond Ontario because the diagnoses and procedures are the same across regions even if the codes representing them may differ. For example, diagnosis codes C44 (International Statistical Classification of Diseases and Related Health Problems, Tenth Revision) and 173 (ICD-9) represent keratinocyte carcinoma and should be interchangeable in terms of the algorithm performance. Similarly, procedural codes for Mohs surgery, curettage, or other surgical treatments should be interchangeable. Second, as with previous epidemiological investigations of keratinocyte carcinoma, the algorithm is unable to distinguish between basal and squamous cell carcinomas because they have the same OHIP diagnosis code. It also cannot distinguish primary vs recurrent tumors. Third, the algorithm would miss tumors that are diagnosed clinically and treated without any biopsy or surgical treatment claim recorded in the OHIP database. An example would be superficial tumors treated with topical medications. Fourth, because we collapsed multiple pathology specimens for a given patient within a 1-year window, the algorithm cannot identify multiple tumors in the same patient within that time frame.
Conclusions
We have demonstrated the novel use and value of recursive partitioning to derive a robust claims-based algorithm that offers a unique mechanism for population-based identification of keratinocyte carcinoma. The validated algorithm can facilitate efforts to measure the burden of disease and conduct studies of epidemiological associations in the absence of registry data.
Accepted for Publication: June 11, 2016.
Corresponding Author: An-Wen Chan, MD, DPhil, Women’s College Research Institute, Women’s College Hospital, 76 Grenville St, Room 6416, Toronto, ON M5S 1B2, Canada (email@example.com).
Published Online: August 10, 2016. doi:10.1001/jamadermatol.2016.2609.
Author Contributions: Dr Chan had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Chan, Fung.
Acquisition, analysis, or interpretation of data: All authors.
Drafting of the manuscript: Chan.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Fung.
Obtained funding: Chan, Austin, Weinstock, Rochon.
Administrative, technical, or material support: Tran, Kitchen.
Study supervision: Chan.
Conflict of Interest Disclosures: None reported.
Funding/Support: This study was funded by the Canadian Dermatology Foundation and the Biggar-Hedges Foundation and was supported by the Institute for Clinical Evaluative Sciences, which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care. Dr Austin is supported in part by a Career Investigator Award from the Heart and Stroke Foundation of Canada (Ontario office).
Role of the Funder/Sponsor: The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Disclaimer: The opinions, results, and conclusions reported in this article are those of the authors and are independent from the funding sources. No endorsement by the Institute for Clinical Evaluative Sciences or the Ontario Ministry is intended or should be inferred. Parts of this study are based on data and information compiled and provided by the Canadian Institute for Health Information; however, the analyses, conclusions, opinions, and statements expressed herein are those of the authors and not necessarily those of the Canadian Institute for Health Information.