Current and Future Spatiotemporal Patterns of Lyme Disease Reporting in the Northeastern United States

This cross-sectional study characterizes the spatiotemporal spread of Lyme disease in humans among counties in US endemic regions.


eReferences
This supplementary material has been provided by the authors to give readers additional information about their work.

Statistical modeling
A logistic regression analysis was performed to evaluate the effect of county characteristics on the probability to report the first LD case in a county that has never reported cases before.
Cases can also be linked to outdoor activities performed in habitats that are outside the county in which the cases are resident. To take into account this possibility, the model also included environmental characteristics of a county's neighboring counties (forest type and coverage [DEC_FORESTNB+ NON_DEC_FORESTNB+ANY_FORESTNB] and their mean elevation [ELEVNB]). The model's variable LD_RepNB represents the number of neighboring counties that have already reported LD cases. YEAR_RepNB is the year lag between the first reported LD case in one of the neighboring counties and the first reported LD case in the focal county. The state and year were included in the model as random effects (STATERND, YEARRND) to take into account reporting variation due to state characteristics not captured by the model's variables. The model used a binomial likelihood. The regression model was performed using a structure (STAR) model using a Markov random fields to account for spatial autocorrelation of the counties. The structure of network model used in this project is described in Bisanzio et al. 1

Model selection approach
A multi-model selection approach based on Akaike Information Criteria (AIC) was performed to find the best models to predict occurrence of the first case reporting of LD in any given county. 2 The ΔAIC was calculated among all proposed models as the difference between their AIC and the one with the lowest AIC value. All those models showing a ΔAIC <2 were included in the set of best models. 2 To reduce bias due to reporting system in each county, model selection was performed 1,000 times and 100 counties were randomly selected and excluded by the training dataset in each run. The model with the highest frequency of being identified as the best model among the 1,000 repetitions was selected to be used in the diffusion simulator model.

Diffusion simulation
A diffusion simulator was built in order to simulate the spatio-temporal diffusion of LD reporting in the counties included in the study area from 2000-2017. The model was built to quantify the probability of each county to report their first LD cases by 2017. The simulator used a stochastic approach based on 1,000 simulations. In each simulation run, spread of LD reporting was simulated using the results of the best logistic regression model, selected as described above (eTable 1): First LD case (1,0)= ANY_FORESTCT + ELEVCT + POPCT + WUICT + VECTORCT +ANY_FORESTNB + VECTORNB + LD_RepNB + YEAR_RepNB In order to reduce bias due to different reporting systems, in each simulation 100 counties were randomly selected and taken out from the training dataset. The logistic regression was used to estimate the probability of reporting the first case of LD in a county per each year. The prediction did only take into account the fixed effects of the logistic model. The random effects were included in the logistic regression to take into account state variability to report LD cases and changes in definition of LD cases. Thus, the probability estimation did not include the random effect to avoid the reporting bias at state level.
Each simulation run started with the Lyme disease reporting in the year 2000 and simulated the spread of cases during the following 17 years. The diffusion simulator structure was created to represent the spatial relationship among counties, with spatial proximity of counties represented in a planar network (eFigure 2). The nodes in the network are located at the counties' centroids and each node has connecting links to its first-degree neighbors. Each node was characterized by county forest coverage (combined coverage of deciduous and non-deciduous forests), mean elevation, reported first LD case (dichotomous variable), WUI, population, and time since first case reported. A single run of the simulator had 17 cycles, one per each year from 2001-2017. Per each cycle (year), the simulator calculated the probability (PLD) of each county without initial cases to report a LD case, based on the parameters from the logistic regression, the county's characteristics and those of the contiguous counties. The prediction used the characteristics of the first-degree neighboring counties flagged as reporting during the previous cycles. After a PLD was assigned to a county, a random number (RN) was extracted from a uniform distribution over the interval [0, 1] per each county. If RN was lower than the predicted PLD, the county was set as reporting LD cases. When a county started to report cases, it remained in this status until the end of the permutation. For each county, we recorded how many times it was flagged as reporting LD cases in each of the 1,000 simulator runs. The final probability of a county to report LD from 2000 to 2017 was calculated by dividing the times the county was set as reporting cases during the whole simulation by 1,000 (i.e., the number of simulations). The 95% confidence interval (CI) was calculated using a binomial function.