Use of Latent Class Analysis and k-Means Clustering to Identify Complex Patient Profiles

This cohort study uses data clustering methods and clinical stakeholder assessment to identify clinical profiles in a population of medically complex patients.


Introduction
A small number of patients consume a large proportion of the total national health care budget. 1 These medically and socially complex patients have been the target of efforts by health systems and insurers to bend the cost curve of increasing health care expenditures. To date, care management programs designed to address needs and reduce hospitalizations or disease progression in medically complex patients have shown only modest success. 2-8 Recent randomized clinical trials 9,10 have not shown substantial clinical benefit or cost savings despite the intuitive value of care management efforts, such as social work consultation, electronic registries, pharmacist consultations, home visits, and other care management strategies.
One recognized limitation of current care programs involves the initial step of patient identification. 11 Individuals with complex medical and social care needs have traditionally been identified by prior year costs or care utilization, number and type of concurrent comorbid conditions, and/or predicted future hospitalization or costs. These approaches lack specificity and may contribute to the poor results seen for most prior care management interventions. Ideally, population-based care management programs for medically complex patients that include patient surveillance, tracking, and outreach by nurses, social workers, and other health care workers would be tailored to the different needs of distinct patient subgroups. 12 This strategy would allow care resources to be more effectively allocated to address care barriers faced by different patients with otherwise similarly high levels of comorbidity, prior utilization, and predicted risk. There are limited empirical data to guide which patient subgroups exist within the overall medically complex patient population.
We sought to characterize clinical heterogeneity among patients with the highest medical complexity (defined by commonly used thresholds for comorbidity, health care utilization, and predicted hospitalization risk) within a large, integrated care delivery system using available electronic health record data. We tested the hypothesis that a data-driven approach could yield distinct, clinically meaningful patient profiles within this narrowly defined stratum of medically complex patients. Our overarching goals were to provide an empirical basis for conceptual models of patient medical complexity 13 and to inform strategies for tailoring care for the most medically complex patients within a care system or network.

Setting and Participants
This cohort study was conducted within Kaiser Permanente Northern California (KPNC), an integrated care delivery system with 4.2 million members. KPNC provides care to a population insured through employer-based plans, Medicare, Medicaid, and the California health insurance exchange. Members are highly representative of the local populations. 14 KPNC uses a single electronic health record for all inpatient and outpatient care, including all pharmacy orders and prescriptions dispensed. Any out-of-network care is also recorded through the KPNC external billing system. The Kaiser Permanente Institutional Review Board approved the study and granted permission for a waiver of consent for study participants as allowed under the Common Rule. This report followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline. 15
Among the 3.3 million adult members of KPNC (18 years or older), we defined a cohort representing a narrow band of the most medically complex patients within the care system. Using a snapshot of data from July 15, 2018, we identified patients based on high comorbidity (Comorbidity Point Score, version 2 [COPS-2] >14) 16 and either a high likelihood of hospitalization (LOH) score (>0.25) or 2 or more emergency department (ED) admissions in the prior year. Higher COPS-2 scores indicate more comorbidities. The LOH score uses patient historical clinical data and logistic regression to predict the likelihood of admission in the next 6 months; scores range from 0 to 1, with a higher score indicating greater likelihood of admission. Because both of these scores are highly skewed (with most adults having low scores), thresholds (>14 for COPS-2, >0.25 for LOH) are used in clinical practice to identify patients with multiple chronic comorbidities and increased hospitalization risk. On the basis of the goals of care management programs for medically complex patients, we also included 2 or more ED admissions in the prior year as an indicator of potentially preventable high-cost care.
We then linked all available electronic data for these study patients from the preceding 12 months for our primary analyses and for the subsequent 12 months (to July 15, 2019) for our outcome assessments.

Analytic Process
Our goal was to define distinct patient clinical profiles through application of 2 independent, unsupervised grouping methods: latent class analysis (LCA) and k-means clustering with preprocessing by generalized low-rank models (GLRM). 18 We implemented the following 4 steps to achieve this goal.

Clinically Guided Variable Selection and Reduction
From the more than 5000 potential data elements available within the electronic health record identified in the first step, we created a more limited analytic data set by combining similar variables (eg, using medication group category to combine different types of benzodiazepines), selecting variables known to be important for gauging health status (eg, markers of frailty such as requiring a wheelchair), and excluding variables that were rare (<1% of cohort) or not informative (eg, common, benign dermatologic procedures). Decisions about variables were made by research team consensus (all authors). This clinically guided process also included assessment of statistical correlations using the Jaccard similarity metric 20 and divisive clustering techniques (proc varclus in SAS, version 9.4 [SAS Institute]) to combine or remove redundant and highly correlated variables, thereby increasing our ability to extract meaningful information from a smaller set of key variables. Variables were dichotomized using common clinical thresholds (eg, abnormal laboratory result thresholds) and top quartile (eg, for count variables). This process resulted in a final set of 97 informative variables for our cluster analysis (eTable 2 in the Supplement). We excluded basic demographic variables (age, sex, and race/ethnicity) and cohort-defining variables (COPS-2, LOH score, and prior 12-month ED admissions) from the variable reduction and grouping process. These excluded variables were subsequently used to further describe the resulting data clusters.
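As an illustration of the redundancy check described above, the Jaccard similarity between 2 dichotomized variables is the number of patients positive for both divided by the number positive for either. The following is a minimal sketch, not the study code (which used SAS); the function name, the variable names, and the 6-patient data are hypothetical, and returning 0 when neither variable is ever present is a convention chosen here.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between two equal-length binary (0/1) indicator lists."""
    if len(a) != len(b):
        raise ValueError("indicator lists must have the same length")
    # Patients positive for both variables
    both = sum(1 for x, y in zip(a, b) if x and y)
    # Patients positive for at least one of the variables
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention (an assumption): similarity is 0 when neither variable occurs
    return both / either if either else 0.0


# Hypothetical indicators for 6 patients (not study data)
wheelchair = [1, 1, 0, 0, 1, 0]
walker = [1, 1, 0, 0, 0, 0]
opioid_rx = [0, 0, 1, 1, 0, 0]

print(jaccard_similarity(wheelchair, walker))     # high overlap: candidates to combine
print(jaccard_similarity(wheelchair, opioid_rx))  # no overlap: keep separate
```

Pairs of variables with similarity near 1 carry largely the same information and would be candidates for combination or removal; the cutoff for "highly correlated" is not stated in the text and would need to be chosen.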

Identification of Patient Clusters
We implemented 2 independent analytic strategies to identify distinct patient clusters within our narrowly defined band of medical complexity. We first used LCA, a model-based approach that defines patients by shared underlying unobserved characteristics. Latent class analysis is an iterative, maximum likelihood method that estimates how patterns in patient characteristics can be summarized into a finite number of groups, or latent classes, by providing a probability distribution over the cluster assignment for each patient. To investigate the extent to which these groupings remained stable across 2 methods, we separately applied k-means clustering, a non-model-based method that applies optimization algorithms to define patient clusters. This cluster assignment method is based on the minimum distance of a patient from the centroid of the cluster (ie, the sum of the deviation of each variable compared with the centroid values).
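To make the idea of a probability distribution over cluster assignment concrete, the sketch below computes posterior class membership probabilities for 1 patient under a standard latent class model for binary variables. It assumes conditional independence of variables within a class and takes the class proportions and item probabilities as given; in practice these are estimated by maximum likelihood. All names and numbers are hypothetical and are not taken from the study.

```python
import math


def lca_posterior(y, class_weights, item_probs):
    """Posterior probability of each latent class for one patient.

    y             -- list of 0/1 indicators for the patient
    class_weights -- class proportions (sum to 1)
    item_probs    -- item_probs[c][j] = P(variable j = 1 | class c), each in (0, 1)
    """
    log_joint = []
    for weight, probs in zip(class_weights, item_probs):
        # log P(class c) + sum over variables of log P(y_j | class c),
        # assuming variables are independent within a class
        ll = math.log(weight)
        for y_j, p in zip(y, probs):
            ll += math.log(p if y_j else 1.0 - p)
        log_joint.append(ll)
    # Normalize in a numerically stable way (subtract the maximum before exponentiating)
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical 2-class, 2-variable model (not study estimates)
weights = [0.5, 0.5]
probs = [[0.9, 0.8],   # class 0: both variables common
         [0.1, 0.2]]   # class 1: both variables rare
posterior = lca_posterior([1, 1], weights, probs)
print(posterior)
```

A patient is typically assigned to the class with the highest posterior probability, although the full distribution conveys how certain that assignment is.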

Interpretation of Cluster Results
We used a 3-fold strategy to derive clinical meaning from the clustering analysis results. First, we examined 1-year follow-up outcomes (ED admission, inpatient hospitalization, and mortality) by cluster. Second, we shared our analytic results with clinical stakeholders to define clinically meaningful complex patient profiles and suggest potential care strategies tailored to these distinct patient profiles. Third, we compared the patient groupings independently derived from the 2 parallel methods (LCA and k-means clustering) to investigate patterns of commonality and dissimilarity.
The clinical stakeholders were members of the KPNC Complex Needs Advisory Council, an interdisciplinary committee of 25 clinicians (hospital-based physicians, geriatricians, primary care physicians, pharmacy leaders, nurse care coordinators, and social workers) and clinical operational leaders. Membership on the advisory council was based on expertise and commitment to improving care for patients with complex medical and social needs. To define clinically meaningful complex patient profiles, stakeholders first formed small groups of 2 to 3 individuals to review the data during a 2-hour session; then each group presented to the overall committee, and consensus was reached on the patient profiles represented by each data cluster. These profiles reflected the group's clinical experiences with different types of medically complex patients who they had cared for in their practices.

Statistical Analysis

Latent Class Analysis
Latent class analysis is a finite mixture modeling method that assumes that the overall population heterogeneity with respect to a distribution of observable response (ie, manifest) variables is the result of 2 or more unobserved, homogeneous subgroups, known as latent classes. The

GLRM Reduction and k-Means Clustering
We first transformed our data set of binary variables into continuous latent variables using GLRM, with settings selected through a grid search with 5-fold cross-validation to minimize reconstruction error. We then applied k-means clustering with the Euclidean distance metric. k-Means clustering is a non-model-based method: it is not grounded in an underlying statistical model and instead uses a discrete optimization algorithm to optimize an objective criterion. The optimal number of clusters was determined by finding the highest mean Rand index obtained by running k-means clustering on 100 bootstrapped samples for each candidate solution from 2 to 10 clusters.
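The Rand index used for this stability assessment is the proportion of patient pairs on which 2 clusterings agree, ie, pairs placed together in both or apart in both. The sketch below shows the index and a selection step that picks the number of clusters with the highest mean value. It is an illustration only: the function names and the scores are hypothetical, and the exact way the bootstrapped clusterings were compared in the study is not specified in the text.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Proportion of item pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if len(labels_a) < 2:
        raise ValueError("at least 2 items are required")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair agrees if it is together in both clusterings or apart in both
        agree += same_a == same_b
        total += 1
    return agree / total


def select_k(scores_by_k):
    """Return the cluster count whose bootstrap Rand indices have the highest mean."""
    return max(scores_by_k, key=lambda k: sum(scores_by_k[k]) / len(scores_by_k[k]))


# Identical partitions with different label names agree perfectly
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))

# Hypothetical Rand indices from repeated bootstrap runs (not study results)
hypothetical_scores = {2: [0.60, 0.70], 3: [0.90, 0.80], 4: [0.75, 0.70]}
print(select_k(hypothetical_scores))
```

The pairwise loop is quadratic in the number of patients, so for a cohort of this size an equivalent computation from the contingency table of the 2 label sets would be preferable.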

Patient Population
We identified 104 869 adults (3.3% of the adult population in KPNC) with a COPS-2 greater than 14 and either an LOH score greater than 25% or more than 2 ED admissions in the prior year; 86% had a low comorbidity score, 11% had a high comorbidity score only, and 3% had high comorbidity and utilization scores. The mean (SD) age of the sample was 70.7 (14.5) years, 52.4% were women, 39% were non-White race/ethnicity, and the mean (SD) COPS-2 was 72.7 (42.6

LCA Results
The optimal LCA solution yielded 7 patient groups. From the initial set of 97 variables used to define the groups, we identified the variables that had the largest variation in model-estimated prevalence across the 7 clusters. Figure 1 shows the 17 variables with at least 55% prevalence in 1 or more classes and less than 20% in other classes (details given in eTable 2 in the Supplement). 21 The 7 groups had significant differences in 1-year outcomes, with mortality ranging from 3.0% to 23.4%, hospitalization from 18.3% to 51.2%, hospice admission from 1.6% to 17%, and mental health care visits from 5.3% to 59.8% (Table 1 and eTable 3 in the Supplement).

Patient Profiles
The data in Figure 1 and Table 1 were presented to the clinical stakeholders for review and discussion.
The group agreed on the following final descriptive labels for these 7 complex patient profiles: highest acuity (highest inpatient and outpatient utilization with the most comorbid conditions), older patients with cardiovascular disease (older patients with a high prevalence of cardiovascular disease-related conditions and complications), frail elderly (oldest group with the highest 1-year mortality and the most frailty-related needs), chronic pain management (high outpatient utilization and mental health needs complicated by ongoing prescription of opioid-related drugs), active cancer treatment (intensive oncologic therapy with associated medical and pain management issues), psychiatric illness (severe mental illness complicated by low income, social needs, and pain management), and less clinically engaged (prevalent comorbidities but fewer visits).
Figure 1. The 17 variables were chosen from the 97 used in the latent class analysis model because they had the largest variation in prevalence across the 7 classes. Each listed variable had at least 55% prevalence in 1 or more classes and less than 10% in other classes. BNP indicates brain natriuretic peptide; CVD, cardiovascular disease.
Further examination of the clinical utilization and comorbidity patterns in conjunction with the patient clinical profiles also suggested specific strategies for how existing clinical care tools and programs could be tailored to meet the different complex medical and social needs for each patient profile. These optimal care strategies are summarized in Table 2.

Comparison of LCA Results With GLRM and k-Means Clustering Results
The GLRM and k-means clustering approach yielded an 8-class solution. We investigated the extent to which patients assigned to these 8 clusters matched the 7 profiles derived from the LCA. As shown in Figure 2, most patients in 7 of the 8 k-means clusters were primarily in a single LCA-derived patient profile. For example, 54% of patients in the second k-means cluster were in the older cardiovascular disease LCA cluster and 88% of patients in the first k-means cluster were in the psychiatric illness LCA cluster. The overlap was very high for 2 k-means clusters (>75% of patients in each k-means cluster were included in the active cancer treatment or psychiatric illness LCA-derived clinical profiles) and moderately high for 5 k-means clusters (50%-75% of patients in each cluster were in a single LCA-derived profile with <15% of patients in that cluster represented in another profile). The eighth k-means cluster (characterized primarily by obesity and insulin requirement) had patients represented in several different LCA groups with no one profile capturing more than 33% of the patients. Relative risks for 1-year outcomes by complex patient profile defined by k-means clusters were similar to the risks defined by LCA clusters (eTable 4 in the Supplement).
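The overlap percentages described above amount to a row-normalized cross-tabulation: for each k-means cluster, the share of its patients that falls in each LCA-derived profile. The following is a minimal sketch of that calculation; the function names and the 8-patient example are hypothetical and do not reproduce the study data.

```python
from collections import Counter, defaultdict


def overlap_by_cluster(kmeans_labels, lca_labels):
    """For each k-means cluster, the share of its patients in each LCA profile."""
    if len(kmeans_labels) != len(lca_labels):
        raise ValueError("label lists must have the same length")
    counts = defaultdict(Counter)
    for k_label, l_label in zip(kmeans_labels, lca_labels):
        counts[k_label][l_label] += 1
    # Divide each count by the size of its k-means cluster (row percentages)
    return {
        k_label: {l_label: n / sum(row.values()) for l_label, n in row.items()}
        for k_label, row in counts.items()
    }


def dominant_profile(shares):
    """LCA profile holding the largest share of a k-means cluster, with that share."""
    return max(shares.items(), key=lambda item: item[1])


# Hypothetical assignments for 8 patients (not study data)
kmeans = [1, 1, 1, 1, 2, 2, 2, 2]
lca = ["psychiatric", "psychiatric", "psychiatric", "pain",
       "cvd", "cvd", "frail", "pain"]

overlap = overlap_by_cluster(kmeans, lca)
print(dominant_profile(overlap[1]))
print(dominant_profile(overlap[2]))
```

A cluster whose dominant share is high maps cleanly onto 1 profile, whereas a cluster with no dominant share, such as the eighth k-means cluster described above, draws patients from several profiles.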

JAMA Network Open | Health Informatics
We noted that the highest acuity cluster from the LCA did not have a corresponding cluster in k-means clustering. When we investigated the overlap between grouping methods, we found that patients defined within the LCA highest acuity group were redistributed to the frail elderly (25%), older with cardiovascular disease (17%), pain management (17%), and the new, eighth k-means cluster (16%) (which we characterized as primarily patients with obesity and diabetes requiring insulin), suggesting that k-means clustering did not separate by acuity. Within the LCA less engaged profile, k-means clustering identified 2 subtypes: low engagement, defined primarily by lack of online patient portal registration and low income (30% of the LCA less engaged group), and low prevalence of all variables, suggesting low acuity and medical stability (46% of the LCA less engaged group).

Discussion
We identified 7 distinct patient profiles within the top 3% of medical complexity by linking multidimensional, clinically guided clustering methods to stakeholder interpretation. Some profiles were narrowly defined (eg, patients undergoing advanced cancer treatment, patients with severe mental illness), whereas others reflected distinct facets within the general complexity phenotype (eg, high acuity requiring substantial inpatient care, frail elderly, and older patients requiring advanced care for complications of cardiovascular disease). The remaining 2 profiles were defined by need (chronic pain management) or lack thereof (relatively limited health system contact despite high levels of calculated comorbidity and risk).
Our results provide empirical data that may inform conceptual models of complexity in adult populations and further support the diversity of high-need patients. 22,23 One of the most prominent taxonomies, based on expert consensus by the National Academy of Medicine, created 6 categories of medically complex patients: children with complex needs, nonelderly disabled adults, frail elderly individuals, patients with major complex chronic conditions, patients with less severe but multiple chronic conditions, and patients with advancing illness. Such consensus-based models are a useful starting point but lack the empirical basis for redesigning care. 24 Our approach combined data and consensus to better elucidate the heterogeneity within a narrower band of adult patients (top 3%) who are often indistinguishable using typical cost or diagnosis-based segmentation methods. By identifying the key variables that define different patient profiles, our study also offers data to guide efforts for operationalizing patient identification and to outline the types of different services and workforce competencies that may be required for complex care management (Table 2).

Each of these profiles suggested strategies for organizing care. Of note, although some profiles were labeled by a key distinction, such as undergoing chemotherapy, every patient in each profile also had multiple other chronic conditions. Consequently, care programs focused solely on supporting a single issue (eg, cancer care or mental illness) are not likely to fit the full range of needs in this medically complex patient population. Health care systems responsible for population care should implement strategies that go beyond directly supporting the immediate treatment needs to also address the impact of concurrent conditions. 25
Several of the medically complex patient profiles identified in our study have been described in different settings and contexts. For example, chronic pain management is a well-recognized clinical challenge, likely even more so for patients with multiple chronic conditions. Our findings suggest that pain management is a concern for a large segment of the medically complex patient population that may require continued investment and well-designed care teams. Other research has described the unique needs of medically complex patients with cardiovascular disease 26 and the importance of specific communication plans to address the needs of frail elderly adults with limited life expectancy. 27 Several studies have shown the benefit of comprehensive care for frail elderly adults. 28 [...] predict when patients might be on a trajectory toward one of these complex profiles, categorizing the major pathways to becoming a medically complex patient, and identifying preventive interventions that could aim to slow transition into these groups over time.

Strengths and Limitations
This study has strengths. We analyzed highly dimensional clinical data for a large cohort of patients [...] apply automated approaches may return clusters of limited clinical interpretability (eg, creating groups distinguished by 1-year outcome risks that lack unifying clinical themes that could inform care redesign). 34 Prior research has shown the value of asking clinicians to define patient complexity. 41-44 This approach allowed us to extend the value of analytic clustering by developing specific suggestions for tailoring care needs within the medically complex patient population.
An important concern for all health care-related machine learning analyses is the potential for perpetuating disparities because data obtained as part of routine care can be inaccurate or incomplete in a way that may be biased toward different patient racial/ethnic groups. 45 Although this concern is particularly an issue for prediction models used to assign health care resources to different patients, all such modeling should be cognizant of this problem. 46 With these concerns in mind, we incorporated principles of distributive justice into our model design and evaluation to ensure robust results and to minimize biased data and overreliance on automation. 47 These efforts included stakeholder collaboration to mediate the data through the clinical perspective to focus on outcomes that are meaningful to the target population; creating a data structure designed to avoid reinforcing health disparities by addressing nonrandom missingness (eg, specifying variables to represent lack of annual visits, screening, and health care contacts); optimizing completeness (KPNC includes out-of-system claims and has high-quality self-reported race/ethnicity coding); including utilization accumulated through nonvisit interactions to reduce bias against patients less likely to attend in-person visits (eg, telephone calls that often replace in-person visits for patients with transportation barriers) 48 ; and model building that emphasized transparency and error assessment at all stages of variable creation and testing. 49
This study also has limitations. Unlike established predictive modeling, there is currently no single gold standard for how to statistically validate data clustering results. To address this limitation, we examined validity within 3 domains: face validity (clusters corresponded to recognizable medically complex patient profiles by experts in the field), construct validity (2 methodologically unrelated clustering methods resulted in a qualitatively reassuring degree of overlap), and criterion validity (the different profiles had significantly different 1-year outcomes, demonstrating high correlation between profiles with external criteria). Another limitation was that our analysis was conducted within a single integrated care delivery system, which may limit generalizability. However, although the specific variables contributing to the clustering models may vary if replicated elsewhere, the corresponding patient clinical profiles developed by the clinical stakeholders were based on clinical experiences that are generalizable across care systems. In addition, although there are advantages to clinically guided model development, automated machine learning (ie, without clinical guidance) can potentially generate novel insights that are not readily apparent. However, such insights may not be actionable without broader clinical context.

Conclusions
The findings suggest that a single care model may not meet the needs of adults with high comorbidity and care utilization. Highly medically complex patient populations may be categorized into distinct patient profiles that are amenable to varying strategies for resource allocation and coordinated care interventions.