Development of a Machine Learning Model Using Multiple, Heterogeneous Data Sources to Estimate Weekly US Suicide Fatalities

Key Points Question Can real-time streams of secondary information related to suicide be used to accurately estimate suicide fatalities in the US in real time? Findings In this national cross-sectional study, combining information from 8 data streams encompassing various health services and online data sources enabled accurate, real-time estimation of US suicide fatalities with meaningful correlation to week-to-week epidemiological trends and a less than 1% error compared with actual counts. Meaning These findings advance the first efforts to create a population-level system for enabling real-time epidemiological trend monitoring of suicide fatalities.

to seek guidance after a substance exposure. This service is free of charge. As part of routine data collection, poison control centers record whether the exposure was a result of intentional self-harm. Data from NPDS used in our analyses reflect the daily number of calls for exposures to any substance with self-harm intent (fatal and nonfatal), normalized by dividing by the daily total of all exposures called into the system. The timeframe of this dataset was 2014-2017, and it constitutes a valuable signal capturing self-directed violence.

Datasets from Economics and Meteorology
There have been prior efforts to explore the relationship between suicides and economic indicators or meteorological patterns, such as daylight hours. For example, multi-year ecologic studies have identified relationships of unemployment and economic recessions with suicide mortality rates. 3 Consequently, we also investigated whether such data can be used effectively in our estimation framework.
Economic datasets: We first investigated multiple economic indicators available through Federal Reserve Economic Data (FRED; https://fred.stlouisfed.org/) and selected economic indicators based on three criteria: (i) the temporal granularity of the time-series dataset should be equal to or finer than monthly, since more coarse-grained (e.g., quarterly or annual) time-series data would not be useful for estimating weekly suicide fatalities; (ii) the economic indicator should be available as raw data without transformation (e.g., seasonal adjustment); and (iii) the economic indicator should have a plausible theoretical relationship with suicide. eTable 1 lists the economic indicators meeting these criteria. All of the economic indicators studied are available monthly; hence, it was necessary to transform the datasets to estimate weekly suicide fatalities. Specifically, indicators released monthly were first converted to daily values by linearly interpolating between each pair of consecutive monthly values. We then extracted the value on the first day of each week (i.e., every Sunday) as the weekly indicator. A few economic indicators related to prices or wages (e.g., hourly earnings, home price index) are affected by inflation, which requires adjustment. Thus, we adjusted weekly earnings and the home price index based on the Consumer Price Index (CPI), which is also available from FRED (https://fred.stlouisfed.org/series/CPIAUCNS). The timeframe of this dataset was 2014-2017.
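For concreteness, the following is a minimal sketch (not the authors' released code) of the monthly-to-weekly conversion and CPI adjustment described above, using pandas; the base period chosen for the CPI adjustment is an assumption.

```python
import pandas as pd

def monthly_to_weekly(monthly: pd.Series) -> pd.Series:
    """monthly: indicator values indexed by the first day of each month."""
    # Linearly interpolate between consecutive monthly values to get daily values.
    daily = monthly.resample("D").interpolate(method="linear")
    # Keep the value on the first day of each week (Sunday; Monday=0 ... Sunday=6).
    return daily[daily.index.dayofweek == 6]

def cpi_adjust(indicator: pd.Series, cpi: pd.Series, base: str = "2014-01-05") -> pd.Series:
    """Deflate a price/wage indicator by the CPI relative to a base week (assumed here)."""
    # Assumes both series share the same weekly index (e.g., both from monthly_to_weekly).
    return indicator * (cpi.loc[base] / cpi)
```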

Daylight Hours:
The duration of daylight hours, calculated as the temporal difference between sunrise and sunset, has been considered a predictor of the number of suicide fatalities 4 as both variables show similar seasonality, peaking in summer and reaching lows in winter. To test the number of daylight hours as a predictor, we collected the duration of daylight hours using the times of sunrise and sunset from the application programming interface (API) described in eTable 1. As the U.S. encompasses a large geographic area, we retrieved the duration of daylight hours at the geographic center of the U.S. and used the average duration of daylight hours of each week as our predictor variable. The timeframe of this dataset was 2014-2017.
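As an illustration, a sketch of this computation is below; the endpoint URL is a placeholder (the actual API used is the one listed in eTable 1), and the coordinates approximate the geographic center of the contiguous U.S.

```python
import requests
import pandas as pd

LAT, LNG = 39.83, -98.58  # approximate geographic center of the contiguous U.S.

def daylight_hours(date: str) -> float:
    # Placeholder endpoint standing in for the sunrise/sunset API in eTable 1;
    # assumes it returns ISO-8601 "sunrise" and "sunset" timestamps.
    resp = requests.get("https://api.example.com/sun",
                        params={"lat": LAT, "lng": LNG, "date": date}).json()
    return (pd.Timestamp(resp["sunset"]) - pd.Timestamp(resp["sunrise"])).total_seconds() / 3600.0

dates = pd.date_range("2014-01-01", "2017-12-31", freq="D")
daily = pd.Series([daylight_hours(d.strftime("%Y-%m-%d")) for d in dates], index=dates)
weekly_daylight = daily.resample("W").mean()  # average daylight hours per week
```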

Datasets from Online Sources
Research over the past decade has positioned social media platforms like Twitter as reflections of the social cohesion and mood of millions of individuals during the course of their daily lives. 4,5 Additionally, with the rising rates of social media use across different demographic groups, 6 research has shown that it is possible to gather meaningful cues about people's naturalistic behavior, affect, cognition, and sociality unobtrusively, cheaply, and quickly, at a scale and granularity not possible before. Drawing on relevant prior research in mental health and social media 7-10 that reveals the potential of data from social media platforms to inform risk estimation and prediction models (see Dredze 11 for an overview of the general role of social media in public health, and Chancellor and De Choudhury 12 specifically on the use of this data for mental health), in this work we employed a variety of social media and online signals.
To gather suicide-relevant signals in social media or web services, we conducted keyword-based searching across 5 widely used social media platforms or web services 13: Google Trends, YouTube Trends, Twitter, Tumblr, and Reddit. These platforms have been used in prior research to understand mental health as well as to assess suicidal risk. [7][8][9][10] They have also been found to be extensively used for health-related information seeking and consumption, 14,15 with over 60% of internet users reporting going online to look for health information. 16 We also note that, individually, none of these platforms is perfectly representative of the U.S. general population. 17 However, they target slightly different population segments. For instance, Twitter users' median age is 40 years, with females and Blacks represented more frequently than in the U.S. general population, 18 Tumblr is largely popular among teens and young adults, 19 while Reddit is widely used by urban males. 20 Therefore, a comprehensive approach like ours that integrates these data streams can likely counteract significant skew in the representation of specific subgroups.
To gather data from these 5 sites, keyword lists (see eTable 2) were built for each data source by drawing from prior literature, 21,22 followed by manual review and curation of the assembled lists by domain (public health) experts (SS, KH). These keywords consisted of pro-suicide terms, suicide prevention terms, terms relating to mental health state and risk factors, as well as other terms expressing socio-economic disadvantage. Using guidance from Tran et al, 23 we excluded keywords that were movie, band, or song titles, words used in a lyrical, humorous, or flippant context, and opaque or rare phrases. We describe the collection methodology for each social media or web service below. The descriptive statistics of the online datasets are summarized in eTable 3. Since each service provides different types of datasets (e.g., tweets, trend scores, or posts) in different ways, we implemented automated crawlers for the individual services and calculated time-series data (e.g., number of posts or users) from the collected datasets at week granularity. To extract the distinct patterns of each service, we finally normalized each dataset based on its usage statistics publicly available on Statista (https://www.statista.com/markets/).

Google and YouTube Trends:
Keywords used for both Google and YouTube trends were informed by prior literature as noted above. [24][25][26] For each keyword, Google Trends 27 provides a time series of normalized scores ranging from 0-100 which indicates the popularity of a given word in search data for a particular week in time relative to all other weeks examined. 1 To collect the weekly trend scores of the 42 suicide-relevant keywords in Google, we used the pytrends python library to access data from Google Trends. We obtained scores for all of the keywords in three categories provided by Google: "All categories", "Health", and "Mental Health"; see 28 for all the available categories. In a given category, scores for each of the individual keywords were then summed for each week, yielding a single final numerical score for each week for the category. After computing the weekly scores for these three categories, we evaluated each of these datasets at the first stage of the proposed machine learning framework using only the training and validation data. For Google Trends, we selected "All categories" as it outperformed the models from the other categories. We repeated this procedure using YouTube as a target service and selected the "Mental Health" category using the same procedure. The timeframe of this dataset was 2014-2017.
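A hedged sketch of this collection step with pytrends is shown below. Google Trends accepts at most 5 keywords per request, so keywords are batched and the per-keyword weekly scores summed; the keyword list and category ID are placeholders (category IDs are enumerated in the list cited as 28, with 0 denoting "All categories"), and gprop="youtube" would target YouTube search instead.

```python
import pandas as pd
from pytrends.request import TrendReq

def weekly_trend_score(keywords, category=0, gprop=""):
    """Sum of per-keyword weekly Google Trends scores for one category."""
    pytrends = TrendReq(hl="en-US", tz=0)
    frames = []
    for i in range(0, len(keywords), 5):  # Trends allows <= 5 keywords per payload
        batch = keywords[i:i + 5]
        pytrends.build_payload(batch, cat=category, geo="US",
                               timeframe="2014-01-01 2017-12-31", gprop=gprop)
        frames.append(pytrends.interest_over_time()[batch])
    # Note: scores are normalized within each request, so batches are not on a
    # strictly common scale (a limitation echoed in footnote 1 below).
    return pd.concat(frames, axis=1).sum(axis=1)
```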

Reddit: Although Reddit supports an open policy that allows collection of all posts, comments, and user data on the platform, the official API returns only the 1,000 most recent posts in order to prevent performance degradation. Given growing needs to access historical Reddit data, a Reddit user, Jason Baumgartner, launched PushShift a few years ago. PushShift is a service that not only stores longitudinal historical data from Reddit from 2008 to the present, including posts, comments, and users, but also provides APIs for systematic acquisition of these data. Drawing on prior mental health research on Reddit, 29 we identified 55 suicide-related message boards or communities, known as subreddits. Using the publicly available PushShift API, we calculated the total number of weekly posts made on these subreddits and normalized the data by dividing by the total number of posts across all subreddits on Reddit for each week. The timeframe of this dataset was 2014-2017.
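An illustrative query against the PushShift API is sketched below; the endpoint and parameter names reflect PushShift's public interface as commonly documented and may have changed since, so treat them as assumptions, and the subreddit argument is a placeholder for the 55 identified subreddits.

```python
import requests
import pandas as pd

PUSHSHIFT = "https://api.pushshift.io/reddit/search/submission/"

def weekly_post_counts(subreddit: str, after: int, before: int) -> pd.Series:
    """Weekly submission counts for one subreddit via server-side aggregation."""
    # aggs=created_utc with frequency=week asks PushShift to bucket counts by
    # week rather than returning the posts themselves (parameter names assumed).
    resp = requests.get(PUSHSHIFT, params={
        "subreddit": subreddit, "after": after, "before": before,
        "aggs": "created_utc", "frequency": "week", "size": 0,
    }).json()
    buckets = resp["aggs"]["created_utc"]
    return pd.Series({pd.to_datetime(b["key"], unit="s"): b["doc_count"]
                      for b in buckets})
```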
Twitter: Building on prior literature, 30 we collected tweets for 38 suicide-relevant keywords. Since it is challenging to ascertain whether and how the total weekly volume of Twitter messages should be normalized, we first computed three types of input streams from the collected data: (i) weekly volume of Twitter messages, (ii) weekly number of unique users, and (iii) weekly number of unique users normalized by estimates of the total Twitter user base. 31 We then examined these three streams in the first phase of the proposed machine learning pipeline using only the training and validation data. We selected the first stream, the total weekly volume of Twitter messages, for subsequent analysis. The timeframe of this dataset was 2014-2017.
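For clarity, the three candidate streams can be computed as follows from a DataFrame of collected tweets; the column names and the external user-base series are hypothetical.

```python
# `tweets` has (hypothetical) columns `created_at` (datetime) and `user_id`;
# `weekly_user_base` is a weekly series of estimated total Twitter users (ref. 31).
weekly = tweets.set_index("created_at").resample("W")
volume = weekly.size()                        # (i) weekly message volume
unique_users = weekly["user_id"].nunique()    # (ii) weekly unique users
normalized = unique_users / weekly_user_base  # (iii) normalized unique users
```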
Tumblr: To collect suicide-related posts from Tumblr, we used a list of 42 hashtags, drawn from prior research. 32 Using the official Tumblr API, we collected all posts that contain at least one of these hashtags. We then calculated the weekly number of posts, normalizing the data by dividing each week's value by the estimated total number of weekly posts on Tumblr. This denominator for normalization was calculated by linear interpolation from annual estimates of posts on Tumblr. 33 The timeframe of this dataset was 2014-2017.

1 We note the following limitation with the use of Google Trends and YouTube Trends: these services do not provide raw counts of search volume, instead providing normalized values. This is in contrast to the social media platforms considered in this research, where we can obtain crude posting volume across populations of interest. An artifact of this black-box nature of the Google/YouTube Trends service is that trends requested for earlier years can be slightly asynchronous compared to later years. Moreover, trends obtained on different days may result in slightly varying results for the same week, because of the underlying normalization. To tackle these issues, we adopted the measures outlined by Tran et al. 23

Online Data Sources Not Considered
We note that in this work we have not used data from some of the most popular social media platforms, such as Facebook; YouTube and Facebook are the most widely used online platforms, and their user bases are broadly representative of the U.S. population as a whole. 13 Although Facebook data in recent years has been shown to be useful in various public health studies, 34,35 this data is not publicly accessible at population scale through Facebook's APIs. Further, we did not consider Instagram, Pinterest, Snapchat, LinkedIn, and WhatsApp, the other popular social media sites in the U.S., 13 some of which are particularly widely used among adolescents and young adults, a population at a greater risk of suicide. 36 This was because (a) data from these platforms could not be included due to the unavailability of API-supported acquisition capabilities, or (b) they were not deemed insightful for identifying suicidal risk or suicidal behaviors (particularly Pinterest and LinkedIn). Despite these limitations and logistical and practical bottlenecks, since we intend the outcomes of this research to be translatable for public health surveillance of suicide in the general population, focusing on platforms that allowed automated (programmatic) acquisition of public data or trends without necessitating individual informed consent was an important consideration. In addition, data on many of these social media platforms are audience-controlled; 37 however, the three social media platforms considered here (Twitter, Reddit, and Tumblr) are microblogging, broadcasting, or public-facing platforms that enable candid self-disclosure on stigmatizing topics like mental health and suicide. 38 For these reasons, data from these three platforms were deemed more relevant for this research.

eMethods 2. Machine Learning Model Development
Time Series Forecasting Models: Considerations and Limitations.
To design the framework for estimating suicide fatality trends, we first considered conventional time-series forecasting models including Poisson process-based forecasting, 39 AutoRegressive (AR) models and related variants such as the AutoRegressive Integrated Moving Average (ARIMA), Vector AutoRegressive Moving Average (VARMA), and Seasonal AutoRegressive Moving Average (SARIMA). All of these models are based on the methods of Box and Jenkins 40 and are widely used in time-series forecasting problems due to their ease of implementation 41 and ability to account for seasonality and underlying patterns in the data. 42 Unfortunately, the AR-associated models are not ideal for our particular suicide fatality estimation task given the unique constraints of the data used to make the predictions. In particular, the gold-standard suicide fatality data (outcome data) is released only once per year, as noted in the main article; specifically, data for a given year is released approximately in December of the following year. For example, suicide fatality data for each week of 2017 is released at the end of December 2018. That is, since there is more than a year's worth of lag between when historical data is available and when estimates are being made, a practical estimation framework should predict the weekly numbers of suicide fatalities based on historical data from the year before the target year of estimation. In formal terms, AR-associated models by definition require the actual values at the previous step (t-1) to predict the value at the given step (t). Due to these real-world limitations in the underlying suicide fatality data and its episodic release, we did not employ AR models in our estimation framework. Moreover, AR models are also not considered very robust to outliers and missing data, 43 aspects likely to be true when multiple disparate data sources are combined. In fact, certain time series forecasting methods, more generally, do not adequately account for structural similarities (e.g., autocorrelational patterns, temporal trends) between the explanatory time series (e.g., signals from our various data streams) and the dependent time series (i.e., suicide fatalities). 23 Recent research has therefore advocated machine learning techniques such as LSTMs (long short-term memory networks) and RNNs (recurrent neural networks) as a way to tackle these challenges, especially when the explanatory data streams constitute multivariate time series. 44 Our proposed approach is guided by these recommendations.
Rationale Behind an Ensemble Approach. As noted in the main manuscript, there are significant limitations to using single data sources for a computational approach to estimating public health and other population-level outcomes. For instance, Mislove et al 45 have argued that Twitter data tends to skew towards young, urban, minority individuals, while Gayo-Avello 46 showed that age bias can affect attempts to predict socio-political outcomes from Twitter sentiment. Most relevantly, recent research by Ernala et al 47 showed that much existing research that has leveraged social media data solely as a proxy for mental health indicators suffers from poor construct validity, offering limited clinical value. Further, there could be additional variations in social media usage along a number of dimensions: low-resource or low-income areas may not have as much social media penetration. The challenges and risks of using a single data source are, however, not limited to social media. Data sources such as health services may carry different types of biases: for example, those who do not reach out to emergency services for suicidal thoughts or behaviors may not use Google to search for suicide-related topics, and those who express suicidal thoughts on social media may not call the national suicide prevention helpline during a crisis. Essentially, a variety of gaps like these have been raised by Lazer et al, 48 such as ignoring the foundations of measurement, reliability, and dependencies among various datasets, and most notably, "big data hubris," the often implicit assumption that big data are a substitute for, rather than a supplement to, "small data" like traditional data collection and analysis. Summarily, as boyd and Crawford rightly noted, "bigger data are not always better data". 49 Based on this intuition and observations in the existing literature, we consider social media data in conjunction with other data sources through an ensemble approach. The rationale is that when we consider these different but complementary signals, we can overcome the limitations and mis- or under-representation of populations in any one of them alone. With this consideration, we determined that our model should meet the following criteria: (a) maximally extract the explanatory signals from each dataset; (b) estimate suicide trends by combining such extracted signals in an intelligent and harmonic way, which learns how to weight individual data sources; and (c) be robust to spurious correlations, missingness and sparsity, and outliers. From this consideration, we propose a machine learning pipeline consisting of two stages: (i) an intermediate prediction stage and (ii) an ensemble stage.
Our adoption of an ensemble strategy, based on state-of-the-art advancements in artificial neural networks, extends recent recommendations on harnessing large-scale data and computational techniques in healthcare. 50,51 We note that neural network models belong to a data-driven approach, where training depends on the available data with little prior rationalization regarding relationships between variables. 52 They also do not make any assumptions about the nature of the distributions of the underlying time series. As a result, these models are self-adaptive and considered a good alternative to time series forecasting models like ARIMA, where some of the explanatory data may be non-linear time series. 53 Further, our rationale for adopting a two-phase pipeline is as follows. The standard approach to ensembling is simple averaging, which assigns equal weights to all component models and could therefore be accomplished in a single-step pipeline instead of two. 54 However, this simple averaging method may be sensitive to outlier values and unreliable for skewed distributions, especially given the diversity of the various streams considered in this research (ref. S2). By adopting a two-step approach and neural network models, we are able to find the optimal weights (or contributions of the estimates from each stream) by minimizing the sum of squared errors (SSE).
Proposed Design. For the first phase, or the intermediate prediction stage, we focused on finding the best predictor of the weekly number of suicide fatalities for each given data source, by taking a training-validation approach. To this end, we built multiple machine learning models, each of which predicts the weekly number of suicide fatalities based on the time-series values from a single data source. In particular, a model is trained on the time-series data from a single source (e.g., Twitter) and the weekly numbers of suicide fatalities, and then outputs the intermediate results (a prediction of the weekly number of suicide fatalities in the given week of interest). The second phase involves combining these predictions via an ensemble approach. 55 We acknowledge here that by design, this model is not interpretable and does not include explainability of the underlying decision-making complexities of the algorithms used. In machine learning research, interpretability and explainability are desired, 56 especially in the domain of health. These features allow users of these models to take a peek into specific attributes, patterns, or representations of the explanatory variables (the various data streams) that may lead to good, bad, or unexpected outcomes, or to simply render the model results actionable in the real world. Unfortunately, such model interpretability often comes at a cost in terms of performance: more interpretable and explainable models, such as linear regression models, tend to be less powerful in comparison to techniques like neural networks. 57 The goal in this work was to build the most accurate estimation model of suicide fatalities possible, which drove our decision to trade interpretability for performance. Future work can consider post-hoc techniques applied to the trained models to meet domain experts' interpretability or explainability needs. 58 Finally, by design, our machine learning pipeline also does not consider more sophisticated features derived from the data; recall that we simply derive frequency-based explanatory variables from the various data streams. Our rationale is twofold. First, frequency-based features are typical in prior suicide and other public health surveillance research, 11 and our approach aligns with these existing efforts. Second, while language data and data about the users of the various social media platforms have been harnessed as predictive signals in prior work, 7,9,[59][60][61] because of the diversity of the streams considered and the mutual incompatibility of their linguistic and social norms, these features were not leveraged in this research. That said, future research can consider sophisticated techniques to harmonize these disparate linguistic and social behaviors in suicidal risk/outcome prediction (such as using convolutional neural networks or word embeddings 62 that learn high-dimensional representations in a latent space). This will be especially valuable if the goal is to examine the presence or absence of specific risk factors in social media/online data, or perhaps to even discover previously unknown correlates of suicidal risk and fatalities in specific subpopulations.

Development of the Pipeline.
Considering that statistics on suicide fatalities are released annually (as also noted in the Introduction section of the main manuscript), we assume a situation in which only the suicide fatalities through 2016 are accessible and the suicide fatalities for 2017 have not yet been released; 2017 is therefore the year whose statistics have to be estimated. Accordingly, we set aside weekly suicide fatality data from 2017 as our held-out test data, 2016 fatality data for validation, and data from 2015 and before (if available) for training the machine learning models corresponding to each data source. Note that the validation is a one-time procedure to select the optimal hyperparameters, which can then be used for model testing in subsequent years (here, the year 2017). With this setup, predictive features for training, validation, and testing for each of the health services and online data sources were constructed using a sliding window approach, 63 illustrated in eFigure 1. For Lifeline, Poison, Google, YouTube, Reddit, Tumblr, and all of the economic and meteorological datasets, we used weekly data over a 2-year sliding window as predictors when training, validating, or testing models. For ED Visit and Twitter data, weekly data over a single-year sliding window were used due to data availability. Additionally, we leveraged historical data of weekly suicide fatalities as an additional data source for our estimation task, both to augment the estimates given by the real-time signals and to develop a baseline model for comparison. To preserve the natural data-generation reality of this data source (as noted in the main article as well as above, this data is released in an episodic fashion only once a year), we avoided using data from the current year while estimating fatalities in the next year.
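A minimal sketch of this sliding-window feature construction follows (an illustration of eFigure 1, not the authors' code); window lengths of 104 and 52 weeks correspond to the 2-year and 1-year windows described above.

```python
import numpy as np

def sliding_window_features(stream: np.ndarray, fatalities: np.ndarray, window: int):
    """Build (X, y) pairs: predictors are the previous `window` weekly values
    of one data stream; the target is the fatality count for week t."""
    X, y = [], []
    for t in range(window, len(stream)):
        X.append(stream[t - window:t])  # e.g., window=104 for a 2-year window
        y.append(fatalities[t])
    return np.array(X), np.array(y)
```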
In this same first phase of our machine learning pipeline, we trained and validated a number of leading machine learning models, beginning with simpler, more interpretable models and progressing to more sophisticated, opaque models, to identify the best predictor of weekly suicide fatalities corresponding to each data source. These included Linear Regression, 64 Lasso, 65 Ridge Regression, 66 ElasticNet, 67 Random Forest, 68 and Support Vector Machine 69 models. Choosing this wide range of machine learning models also ensured that our results of estimated suicide fatalities are not an artifact of a priori selection of specific model types, and that the results are robust across different classes of machine learning approaches. As mentioned in the main manuscript, following standard machine learning procedures, we used an exhaustive grid search procedure to tune model parameters for each machine learning model and each data source, such as the Ridge/Lasso/ElasticNet regression penalties on weight magnitudes, the hyperparameters of the Random Forest, and the appropriate kernel for the Support Vector Machine regressor. All the parameters investigated during grid search are indicated in eTable 4. The best model for each source was selected as the one with the best performance among all candidate models, as determined by the Pearson correlation coefficient between the actual and the estimated weekly number of suicide fatalities in the validation period for the same data source.
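The following sketch illustrates this per-source model search with scikit-learn, selecting by Pearson correlation on the validation weeks; the hyperparameter grids shown are placeholders for those enumerated in eTable 4.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import ParameterGrid

CANDIDATES = [
    (LinearRegression, {}),
    (Lasso, {"alpha": [0.01, 0.1, 1.0]}),
    (Ridge, {"alpha": [0.01, 0.1, 1.0]}),
    (ElasticNet, {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}),
    (RandomForestRegressor, {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    (SVR, {"kernel": ["linear", "rbf"], "C": [0.1, 1.0, 10.0]}),
]

def best_model_for_source(X_train, y_train, X_val, y_val):
    """Exhaustive grid search; winner = highest Pearson r on validation data."""
    best, best_r = None, -np.inf
    for model_cls, grid in CANDIDATES:
        for params in ParameterGrid(grid):  # ParameterGrid({}) yields one {}
            model = model_cls(**params).fit(X_train, y_train)
            r, _ = pearsonr(y_val, model.predict(X_val))
            if r > best_r:
                best, best_r = model, r
    return best, best_r
```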
For the baseline models, as noted in the main manuscript, we trained and validated three models that used the historical weekly counts of suicide fatalities: a Linear Regression model; a more complex machine learning model (Support Vector Machine), chosen empirically to be appropriate given the nature of the data and to be consistent with prior literature; 70 and a traditional time series forecasting model, the Holt-Winters method. 71 This method has been employed in prior epidemiologic work on predicting a variety of public health outcomes, including suicide. 72 The Holt-Winters method is highly suitable in our case because it models three aspects of a time series: a typical value (average), a slope (trend) over time, and a cyclical repeating pattern (seasonality), expressed as three types of exponential smoothing, all of which are applicable to suicide fatality data.
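For reference, a Holt-Winters baseline of this kind can be fit with statsmodels' triple exponential smoothing as below; the additive components and 52-week seasonal period are assumptions consistent with weekly data.

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# `train_fatalities` is the weekly suicide fatality series through 2016.
hw = ExponentialSmoothing(
    train_fatalities,
    trend="add",          # slope (trend) component
    seasonal="add",       # cyclical repeating pattern
    seasonal_periods=52,  # one seasonal cycle per year of weekly data
).fit()
baseline_2017 = hw.forecast(52)  # one estimate per week of the test year
```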
In the second (ensemble) phase, we combined the model predictions of weekly suicide fatalities given by each single data source from the first phase in an automatic and harmonic way via a Neural Network (NN) model, specifically a Multilayer Perceptron (MLP), 73 which consists of multiple layers with the same number of hidden units and uses ReLU as the activation function. MLP models are memory-less and use the feed-forward neural network architecture, trained via the supervised backpropagation algorithm. 74 Note that the rationale of this approach is in line with prior literature on the "Super Learner" model. 75 That is, the model learns how to weight the different component models so as to harmonically combine the multiple machine learning models. Inspired by this, we built an NN-based ensemble model and fed it the intermediate values given by the best unit model for each data source to obtain estimates of weekly suicide fatalities in 2017 (our held-out test data). Similar to the first stage, we chose the hyperparameters of the MLP models from the ones listed in eTable 4 via the grid search algorithm at the validation step. Note that we evaluated all possible combinations among three categories of data sources: (i) the model from the history of suicide fatalities, (ii) the models of social media platforms, and (iii) the models of health services, and found that the ensemble combining all the data sources across the three categories showed the highest performance.
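A hedged sketch of this second stage with scikit-learn is below; the hidden-layer sizes are placeholders for the grid-searched values in eTable 4, and MLPRegressor's squared-error objective matches the SSE minimization described earlier.

```python
from sklearn.neural_network import MLPRegressor

# Rows of `intermediate_train`/`intermediate_test` are weeks; columns are the
# per-source predictions produced by the best first-stage model for each source.
ensemble = MLPRegressor(
    hidden_layer_sizes=(32, 32),  # same number of units in each hidden layer
    activation="relu",
    solver="adam",                # gradient-based backpropagation training
    max_iter=2000,
    random_state=0,
).fit(intermediate_train, y_train)

weekly_estimates_2017 = ensemble.predict(intermediate_test)
```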

eMethods 3. Exploration of Alternative Data Sets
We explored the feasibility of datasets from the sources not included in our framework by measuring the estimation performance of their individual models, described in eTable 5. Only the CPI-adjusted Home Price Index and the number of daylight hours showed somewhat satisfactory Pearson correlation coefficients when deployed in an actual prediction model. Thus, we excluded the remaining economic data sources and a single social media data source (Tumblr) from consideration in our machine learning pipeline.
To be thorough, we examined whether combining the Economic (Home Price Index) and Meteorological (daylight hours) data with the Health Services and Online data would improve estimations of suicide fatalities, as shown in eTable 6. Compared to the Baseline + Health Services + Online model, as reported in the results of Table 2 in the main article, we see that adding the Home Price Index (HPI), the number of daylight hours, or both to this model degrades its performance.
There are several hypotheses that can be offered to explain these observations. First, although economic data has been shown in several cross-sectional studies to be associated with suicide, 1 broad economic indicators are still proxies for risk factors associated with suicide, whereas, for example, the health services indicators used represent actual clinical encounters for suicide-related behavior and are thus predictors closer to the outcome being studied. Moreover, the economic data themselves are often lagged indicators of suicide, and although they bear a general relationship with suicide outcomes, they lack the temporal immediacy needed to be useful in an estimation task over a finer temporal granularity, a granularity that is paramount for public health surveillance. Thus, combining a weak predictor with a strong one does not aid model accuracy; as noted previously, the quality of data is a key consideration in machine learning approaches, beyond the quantity of data. 49 With regard to the number of daylight hours, as noted in the main article, while this indicator has a strong Pearson correlation with suicides because both are highly seasonal, the number of daylight hours from year to year in the U.S. is largely constant. Thus, by its nature, this indicator cannot predict year-to-year changes in the number of suicides, or long-term temporal increases or decreases in suicide rates, and will always bias predictions toward no net change in suicide counts.