Geographic Variation and Associated Covariates of Diabetes Prevalence in India

Key Points
Question: What factors characterize the geographical distribution, associated socioeconomic and behavioral covariates, and overlap with tuberculosis endemicity of diabetes in India?
Findings: In this cross-sectional study of 803 164 individuals living in India, the observed spatial variation of diabetes revealed spatial clustering of diabetes prevalence, which was associated with behavioral and socioeconomic factors, such as income, consuming alcohol, and smoking tobacco. Despite evidence for diabetes-tuberculosis interaction at the individual level, there was a lack of consistent geographic overlap between both diseases at the national scale.
Meaning: This study adds to known risk factors associated with diabetes and identifies areas where diabetes prevalence is concentrated, including where potential overlap with tuberculosis occurs.

DHS India 2015-16 used a stratified two-stage sampling technique to ensure that the sample was representative of the national target population. This technique gives every individual in the target population a chance of being selected and included in the survey, which prevents selection bias. DHS India 2015-16 used a sampling frame consisting of a list of all primary sampling units (PSUs, also called enumeration areas [EAs] or sampling clusters) in the country and the population of each PSU. The India Census 2011 served as the sampling frame for the selection of PSUs and covered 100% of the target population.
After the PSUs and their populations were identified, PSUs were stratified into homogeneous subgroups. Stratification makes it possible to draw a representative sample for each subgroup. DHS India 2015-16 grouped PSUs by rural or urban setting, and this classification is included in the survey dataset. Based on the stratified sampling frame, the DHS India 2015-16 experts designed the sample size, defining which PSUs were selected for the survey and how many households or individuals were needed in these PSUs to obtain statistically reliable results that are representative of the country as a whole.
PSUs are the areas into which the sampling frame is divided. For the DHS 2015-16, PSUs are defined differently in rural and urban settings. In rural areas, PSUs correspond to villages or parts of villages, and in urban settings PSUs correspond to Census Enumeration Blocks (CEBs). DHS India 2015-16 merged PSUs with fewer than 40 households into the nearest larger PSU. Rural and urban settings had similar criteria for the selection of PSUs. Within each rural stratum, villages were selected from the sampling frame with probability proportional to size (PPS). In urban areas, CEB information was obtained from the Office of the Registrar General and Census Commissioner. CEBs were sorted according to the percentage of the population in each CEB, and sample CEBs were selected with PPS sampling.
In every selected rural and urban PSU, a complete household mapping and listing operation was conducted prior to the main survey. Selected PSUs with an estimated number of at least 300 households were divided into segments of approximately 100-150 households. Two of the segments were randomly selected for the survey using systematic sampling with probability proportional to segment size. Therefore, India DHS 2015-16 clusters are either a PSU or a segment of a PSU. In the second stage, in every selected rural and urban cluster, 22 households were randomly selected with systematic sampling.
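The two selection stages described above can be sketched in a few lines. This is a minimal illustration, not the DHS implementation: the function names, the size measures, and the listing size of 130 households are hypothetical, and the real survey applies the procedure within each stratum.

```python
import random


def pps_systematic(sizes, n, rng):
    """Select n units with probability proportional to size (systematic PPS).

    sizes: positive size measures, e.g. household counts per PSU (hypothetical).
    Returns the indices of the selected units, in frame order.
    """
    total = sum(sizes)
    step = total / n                     # sampling interval on the cumulative size scale
    start = rng.random() * step          # random start inside the first interval
    chosen, cum, i = [], 0.0, 0
    for k in range(n):
        target = start + k * step
        # advance to the unit whose cumulative-size range contains the target
        while i < len(sizes) - 1 and cum + sizes[i] <= target:
            cum += sizes[i]
            i += 1
        chosen.append(i)
    return chosen


def systematic_households(n_listed, n_select, rng):
    """Equal-probability systematic sample of positions 0..n_listed-1 (needs n_listed >= n_select)."""
    step = n_listed / n_select
    start = rng.random() * step
    return [min(int(start + k * step), n_listed - 1) for k in range(n_select)]


rng = random.Random(2015)
psu_sizes = [50, 120, 300, 80, 450]                   # hypothetical households per PSU
first_stage = pps_systematic(psu_sizes, 2, rng)       # stage 1: PSUs by PPS
second_stage = systematic_households(130, 22, rng)    # stage 2: 22 households per cluster
```

Larger PSUs occupy a wider range on the cumulative scale, so they are hit by a selection point more often, which is the property the PPS description relies on.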
Finally, in the stratified two-stage sampling, PSUs were selected with probability proportional to size (PPS), which means that PSUs with larger populations were more likely to be selected for the survey. As a result, the PSU locations (eFigure 1B) are representative of the population density of India estimated from the Census 2011. PPS results in less populated areas having fewer households included in the survey. However, estimating some mortality and fertility rates reliably requires a minimum number of households or individuals, regardless of population size. Because of this constraint, DHS India 2015-16 oversampled small regions and undersampled large regions. To restore the representativeness of the sample and to prevent bias, DHS India 2015-16 includes sampling weights at the household and individual levels. All weights and the survey design were included in our statistical analysis, as recommended by the data analysis guidelines of the DHS Program.
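To illustrate how sampling weights restore representativeness, the sketch below computes a weighted prevalence as the weighted share of positive responses. The data values and the function name are hypothetical. DHS recode files store weights as integers scaled by 1,000,000, so they are rescaled first. Only the point estimate is shown; standard errors additionally require the strata and cluster identifiers, which the analysis here handled with the R survey package.

```python
def weighted_prevalence(outcomes, weights):
    """Weighted proportion: sum(w * y) / sum(w), with y coded 0/1."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * y for y, w in zip(outcomes, weights)) / total_weight


# Hypothetical respondents: 1 = reports diabetes, 0 = does not.
reported = [1, 0, 0, 0]
raw_weights = [1_500_000, 500_000, 1_000_000, 1_000_000]   # as stored in the file
weights = [w / 1_000_000 for w in raw_weights]             # rescaled sampling weights

unweighted = sum(reported) / len(reported)
weighted = weighted_prevalence(reported, weights)
```

In this toy example the respondent who reports diabetes carries an above-average weight, so the weighted estimate is higher than the unweighted one.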

Revised National Tuberculosis Control Program (RNTCP) Reporting Process
This program uses a three-tier system of national reference laboratories, intermediate reference laboratories, and designated microscopy centers equipped to provide internationally standardized tuberculosis (TB) diagnostic services [4]. The data contain annualized totals of detected TB events, comprising smear-positive, smear-negative, and extrapulmonary cases. A smear-positive case is an individual with TB infection who can transmit the infection, and individuals with TB infection at sites other than the lungs are classified as extrapulmonary cases [5]. The data are available in the online repository of the Revised National Tuberculosis Control Program (RNTCP) [4].

Model selection and covariates description
Model selection for the risk estimation of self-reported diabetes status was conducted using a directed acyclic graph (DAG) to represent the structure of conditional independence among the selected covariates. DAGs describe statistical relationships among risk factors using nodes and directed edges that represent potential influence among covariates [6]. Moreover, DAGs are useful for elucidating causal paths and for controlling for different sources of systematic bias in statistical models, especially confounding between exposure and outcome variables. eFigure 1 describes the relational model between self-reported diabetes and TB endemicity exposure, including individual- and PSU-level covariates of interest that have been related to diabetes in the literature [7,8].
DAGs have been shown to be useful in determining the variables needed to control for confounding when estimating the causal relation between exposure factors and the outcome. Statistical causal relations in DAGs are mathematically determined by adjusted and unadjusted paths. Paths are sequences of nodes connected by edges regardless of arrowhead direction [6]. eFigure 1 describes the control variables needed to account for systematic confounding between diabetes and TB endemicity level, given the limited availability of data and the other individual risk factors of interest. Adjusted paths were identified using the ggdag package in R statistical software and are represented with dashed arrows and square nodes. Finally, to account for unmeasured environmental factors, we used the land travel friction covariate from the Malaria Atlas Project, and we adjusted for wealth index, place of residence, and educational level. Individual-level covariates of interest are shown at the bottom of the main relation, with solid directed edges describing the effects among them.
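The definition of a path as a sequence of nodes connected by edges regardless of arrowhead direction can be made concrete with a small sketch. The DAG below is a hypothetical toy graph, not the one in eFigure 1, and the helper names are illustrative. A complete adjustment-set analysis also has to handle colliders (d-separation), which dedicated tools such as ggdag do; this sketch only enumerates paths and flags those that enter the exposure through an incoming arrow.

```python
def undirected_paths(edges, start, end):
    """All simple paths between start and end, ignoring arrow direction."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    paths = []

    def walk(node, path):
        if node == end:
            paths.append(path)
            return
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in path:          # keep paths simple (no repeated nodes)
                walk(nxt, path + [nxt])

    walk(start, [start])
    return paths


def is_backdoor(path, edges):
    """True when the first edge of the path points INTO the exposure node."""
    return (path[1], path[0]) in edges


# Hypothetical toy DAG: (parent, child) pairs.
toy_edges = {
    ("tb_endemicity", "diabetes"),
    ("wealth", "tb_endemicity"),
    ("wealth", "diabetes"),
    ("residence", "tb_endemicity"),
    ("residence", "diabetes"),
}
all_paths = undirected_paths(toy_edges, "tb_endemicity", "diabetes")
backdoor = [p for p in all_paths if is_backdoor(p, toy_edges)]
```

In this toy graph the two backdoor paths run through wealth and residence, which is why adjusting for such covariates closes the confounding routes between exposure and outcome.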
Finally, the individual covariates were classified using the following levels: gender (male or female), age (by quinquennial age group), religion (Hinduism, Islam, Sikh, or others), highest education level (no education, elementary, secondary, or higher), marital status (never married, currently married, or formerly married), alcohol use (yes or no), tobacco smoking (yes or no), weight status (underweight, normal, overweight, or obese), type of place of residence (urban or rural), and household wealth index (by quintiles). Because the DHS sample includes individuals younger than 18 years, weight status for this group was defined slightly differently, using body mass index (BMI, unit: kg/m²) according to WHO standards: 'underweight', 'normal', 'overweight', and 'obesity' were defined as WHO-recommended sex- and age-specific BMI <15th percentile (corresponding to BMI=18.5 at age 18 years), ≥15th but <85th (corresponding to BMI=25 at 18), ≥85th but <97th (corresponding to BMI=30 at 18), and ≥97th, respectively [9,10].
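The weight status rule can be written as a small classifier. This is a sketch under two assumptions: adults are classified with the fixed BMI cutoffs of 18.5, 25, and 30 that the percentiles are stated to correspond to at age 18, and the sex- and age-specific BMI percentile for younger respondents is supplied by the caller (the lookup against the WHO reference tables is not implemented here). The function name and signature are hypothetical.

```python
def weight_status(age_years, bmi=None, bmi_percentile=None):
    """Classify weight status from BMI (adults) or BMI percentile (under 18)."""
    if age_years >= 18:
        if bmi is None:
            raise ValueError("bmi is required for adults")
        if bmi < 18.5:
            return "underweight"
        if bmi < 25:
            return "normal"
        if bmi < 30:
            return "overweight"
        return "obese"
    # Under 18: percentile of the sex- and age-specific reference distribution.
    if bmi_percentile is None:
        raise ValueError("bmi_percentile is required for under-18 respondents")
    if bmi_percentile < 15:
        return "underweight"
    if bmi_percentile < 85:
        return "normal"
    if bmi_percentile < 97:
        return "overweight"
    return "obese"
```

For example, `weight_status(30, bmi=27.0)` falls in the overweight band, and `weight_status(16, bmi_percentile=10)` is classified as underweight.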
The wealth index covariate is computed at the household level from a set of factors included in the India DHS 2015-16 survey. The wealth index is a composite measure of a household's cumulative living standard. It considers the household's ownership of selected assets, such as televisions and bicycles; the materials used for housing construction; and the types of water access and sanitation facilities [11]. Finally, principal component analysis (PCA) is used to identify the contribution of these components to the living standards of the target population, and the resulting index is included at the household and individual levels in the available survey datasets.
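The idea behind a PCA-based wealth index can be sketched as follows: standardize the asset indicators, take the first principal component as the score, and cut the scores into quintiles. This is an illustrative simplification, not the DHS procedure. The indicators and household data are hypothetical, the leading component is found by power iteration to keep the example dependency-free, every indicator must vary across households, and the quintiles here are unweighted.

```python
import math


def first_component_scores(rows, iterations=200):
    """Score each household on the first principal component of its indicators."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    # Correlation matrix of the standardized indicators.
    corr = [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]
    # Power iteration converges to the leading eigenvector (the loadings).
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    if sum(v) < 0:                       # orient so more assets means a higher score
        v = [-x for x in v]
    return [sum(zr[j] * v[j] for j in range(p)) for zr in z]


def quintiles(scores):
    """Assign 1 (poorest) to 5 (richest) by rank, in five near-equal groups."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order):
        result[i] = rank * 5 // len(scores) + 1
    return result


# Hypothetical households: [owns television, owns bicycle, improved sanitation].
households = [
    [0, 0, 0], [0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0],
    [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1],
]
scores = first_component_scores(households)
wealth_quintile = quintiles(scores)
```

Households with none of the assets receive the lowest scores and fall in the first quintile, while those with all three fall at the top of the distribution.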

Spatial Analyses Additional Notes
An advantage of spatial scan statistics is their ability to account for the uneven distribution of the sampled population. The scan statistic moved a circular window across the study region. The radius of the circle was varied continuously, so that the window could contain any share of the sampled population from 0% up to the default maximum of 50%. The SaTScan maximum radius was set to 50 km for the self-reported diabetes clustering analysis, given the high density of PSUs included in the study. Similarly, the maximum radius for the TB clustering analysis was set to 250 km to avoid a high rate of overlap. The numbers of observed and expected self-reported diabetes cases within and outside the circular window were then compared with the likelihood L0 under the null hypothesis of spatial randomness. The circular windows with the highest likelihood ratio values were identified as clusters. An associated p-value was then determined through Monte Carlo simulations and used to evaluate whether self-reported diabetes cases were randomly distributed in space. Spatial data mapping was performed using Mapbox GL JS v1.3.1 [12]. All statistical analyses were conducted using R statistical software [13] and the survey and svydiags packages [14]. We weighted individual- and PSU-level data and adjusted for the two-stage cluster sample design according to DHS recommendations [15].
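The comparison of observed and expected cases inside and outside a window can be illustrated with the log-likelihood ratio of a scan statistic and a rank-based Monte Carlo p-value. The text does not state which probability model was used, so the Poisson form below is an assumption chosen for illustration, and the function names are hypothetical. Expected counts are assumed to be scaled so that they sum to the total number of cases.

```python
import math


def poisson_llr(cases_in, expected_in, total_cases):
    """Log-likelihood ratio for a high-rate window under a Poisson model.

    Returns 0 when the window has no excess of cases over expectation.
    """
    if cases_in <= expected_in:
        return 0.0
    inside = cases_in * math.log(cases_in / expected_in)
    cases_out = total_cases - cases_in
    expected_out = total_cases - expected_in
    outside = cases_out * math.log(cases_out / expected_out) if cases_out > 0 else 0.0
    return inside + outside


def monte_carlo_p(observed_stat, simulated_stats):
    """Rank of the observed maximum among maxima simulated under the null."""
    exceed = sum(1 for s in simulated_stats if s >= observed_stat)
    return (exceed + 1) / (len(simulated_stats) + 1)


# Hypothetical window: 20 cases observed where 10 were expected, 100 cases overall.
llr = poisson_llr(20, 10, 100)
```

The window with the largest ratio over all centers and radii is the most likely cluster, and its p-value is the share of random replications whose maximum ratio is at least as large.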

Model selection Analysis results
To complement the DAG analysis, we performed a variance inflation factor (VIF) stepwise model selection to account for multicollinearity among the selected variables, using the svydiags package in R to account for the stratified two-stage sampling followed by the India DHS 2015-16; the results are shown in eTable 1. The VIF stepwise model selection used the correlation matrix computed over the samples and measured collinearity using multiple regression, taking the ratio of the variance of each coefficient in the full model to the variance of that coefficient if its variable were fitted individually. All variables with a VIF score below five (5.00) were allowed in the adjusted model. For our specific DAG, no variable showed high correlation in the dataset, and all adjusted paths and individual risk factors for diabetes were included in the final model.
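As a rough illustration of the VIF screen, the sketch below computes textbook VIFs as the diagonal of the inverse correlation matrix of the predictors and applies the cutoff of five. This is an unweighted simplification with hypothetical data and function names; the design-adjusted VIFs produced by svydiags account for weights, strata, and clusters and will generally differ.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length numeric columns."""
    n = len(columns[0])
    unit = []
    for col in columns:
        mean = sum(col) / n
        dev = [x - mean for x in col]
        norm = math.sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in unit] for ci in unit]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear predictors")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def vif_scores(columns):
    """VIF_j = 1 / (1 - R_j^2), read off the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical predictors with correlation 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
scores = vif_scores([x1, x2])
kept = [j for j, v in enumerate(scores) if v < 5.0]   # cutoff used in the text
```

With two predictors the VIF reduces to 1 / (1 - r²), so a correlation of 0.8 gives a value well under the cutoff and both variables are retained.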

Overall prevalence and demographical characteristics
The individual-level covariate analysis indicated that the highest prevalence of self-reported diabetes by age group was found in the oldest groups: ages 45-49 for females (5.52%) and 50-54 for males (7.24%). Additionally, the population reported as practicing 'other or any religion' had the highest prevalence for both genders (females 2.24%, males 3.77%), followed by Muslims (females 2.04%, males 2.08%), Sikhs (females 1.71%, males 2.08%), and Hindus (females 1.61%, males 2.09%). In contrast, self-reported diabetes by educational level differed between genders, with the highest prevalence in the population with 'primary' studies for females (2.14%) and 'higher' studies for males (2.56%). The BMI covariate was included mostly for females because of the high proportion of missing values for males (98% of the total male population was missing a BMI measurement). For this covariate, self-reported diabetes prevalence was 5.86% in the obese, 3.40% in the overweight, 1.20% in the normal-weight, and 0.71% in the underweight population. For the PSU-level covariate place of residence, prevalence was higher in urban settings (females: urban 2.59% vs rural 1.21%; males: urban 2.72% vs rural 1.80%). Moreover, by wealth index, prevalence was highest in the richest population for both females and males (2.91% and 3.29%, respectively). Finally, by average land travel friction, prevalence was highest in the areas with the least friction (females 2.41% and males 2.83% in areas with less than 0.001 min per meter) compared with areas with higher friction (females 1.02% and males 1.53%).

Association Analyses Expanded Results
Additional covariates associated with diabetes in the analysis were religion and marital status. Compared with the Hindu population, religion showed an increased risk of diabetes in the 'Muslim' category (OR=1.33; 95% CI=1.20-1.48), a decreased risk in the 'Sikh' category (OR=0.74; 95% CI=0.62-0.88), and no association for the 'Other or Any Religion' category. Marital status was also associated with the risk of developing diabetes, with lower risk for 'Never Married' individuals (OR=0.74; 95% CI=0.61-0.90) compared with the 'Currently Married' population.
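For readers less familiar with how such odds ratios arise, the sketch below shows the usual relation between a logistic regression coefficient, its standard error, and the reported OR with a 95% CI, together with the reverse step of recovering an approximate standard error from a published interval. The function names are hypothetical, and the Wald-type interval with z = 1.96 is an assumption about how the intervals were formed.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald confidence limits from a log-odds coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def se_from_ci(lower, upper, z=1.96):
    """Approximate standard error on the log scale implied by a reported CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Reported estimate for the 'Muslim' category: OR=1.33, 95% CI=1.20-1.48.
beta_muslim = math.log(1.33)
se_muslim = se_from_ci(1.20, 1.48)
or_est, ci_low, ci_high = odds_ratio_ci(beta_muslim, se_muslim)
```

Because published values are rounded, the interval rebuilt this way matches the reported limits only approximately; an interval that excludes 1 corresponds to an association at the 5% level.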

eAppendix 3. Supplementary Discussion
Additional notes on limitations of the study
An important drawback of using plasma glucose levels to define diabetes status is that a diabetes diagnosis requires careful substantiation with retesting to ensure that plasma glucose is unequivocally elevated. Epidemiological diabetes studies often overclassify the population with diabetes, with 25% false-positive cases due to the lack of repeat testing [55]. We assumed that the 'currently has diabetes' variable is a more accurate outcome because it relies on clinical confirmation, which is often based on retesting and includes other clinical aspects associated with diabetes. Self-reported diabetes will likely underestimate diabetes prevalence because a large fraction of the population with diabetes is unaware of their disease status [56]. In our study, the estimated diabetes prevalence was 1.76%, lower than the 7.8% estimated for India [7]. In addition to underdiagnosis, the DHS sample included only individuals aged 15-59 years and excluded older populations with a higher prevalence of diabetes. However, despite the use of this indirect measure of TB endemicity and the potential effects of different areal unit scaling [57], we performed complementary clustering analyses, which were not affected by aggregation and robustly indicated low geographic overlap between both diseases.

Risk Factor
Level VIF score