eFigure 1. Twitter Study Sample
eFigure 2. Tweets by Disease Topic Over Time
eTable. Characteristics of the US Sample: Tweet Data and User Data
Sinnenberg L, DiSilvestro CL, Mancheno C, et al. Twitter as a Potential Data Source for Cardiovascular Disease Research. JAMA Cardiol. 2016;1(9):1032–1036. doi:10.1001/jamacardio.2016.3029
Can Twitter, a social media platform for person-to-person communication, be used as a data source to study cardiovascular disease?
In this descriptive study, we identified 4.9 million Tweets about cardiovascular disease posted on Twitter. User demographics, as well as content (eg, risk factors, awareness, and treatment) and volume of Tweets, varied across cardiovascular diseases.
Twitter has potential as a data source for studying public communication about cardiovascular health.
As society becomes increasingly networked, researchers are beginning to explore how social media can be used to study person-to-person communication about health and health care use. Twitter is an online messaging platform used by more than 300 million people who have generated several billion Tweets, yet little work has focused on the potential applications of these data for studying public attitudes and behaviors associated with cardiovascular health.
To describe the volume and content of Tweets associated with cardiovascular disease as well as the characteristics of Twitter users.
Design, Setting, and Participants
We used Twitter to access a random sample of approximately 10 billion Tweets posted from July 23, 2009, to February 5, 2015. From this sample, we identified English-language Tweets originating from US counties that were associated with cardiovascular disease. We characterized each Tweet relative to estimated user demographics. A random subset of 2500 Tweets was hand-coded for content and modifiers.
Main Outcomes and Measures
The volume of Tweets about cardiovascular disease and the content of these Tweets.
Of 550 338 Tweets associated with cardiovascular disease, the terms diabetes (n = 239 989) and myocardial infarction (n = 269 907) were used more frequently than heart failure (n = 9414). Users who Tweeted about cardiovascular disease were more likely to be older than the general population of Twitter users (mean age, 28.7 vs 25.4 years; P < .01) and less likely to be male (59 082 of 124 896 [47.3%] vs 8433 of 17 270 [48.8%]; P < .01). Most Tweets (2338 of 2500 [93.5%]) were associated with a health topic; common themes of Tweets included risk factors (1048 of 2500 [41.9%]), awareness (585 of 2500 [23.4%]), and management (541 of 2500 [21.6%]) of cardiovascular disease.
Conclusions and Relevance
Twitter offers promise for studying public communication about cardiovascular disease.
Person-to-person communication is one of the most persuasive ways people deliver and receive information.1,2 Until recently, this communication was impossible to collect and study. Now, social media networks allow researchers to systematically witness public communication about health, including cardiovascular disease. Twitter, one such network, is used by more than 300 million people who have generated several billion Tweets.3,4
There are several unknowns when using social media for research on cardiovascular disease. Is it possible to separate signal from noise? Can the data be analyzed to characterize features associated with the person posting and the Tweet itself? Does the Twitter data set reflect real-time changes in conversation? We explored these questions by characterizing a sample of Tweets about cardiovascular disease from the United States.
This was an exploratory mixed-methods study of Twitter data associated with cardiovascular disease.5 This study was approved by the University of Pennsylvania Institutional Review Board.
Twitter is a social media platform that allows users to send and receive 140-character messages known as Tweets. Our data, spanning July 23, 2009, through February 5, 2015, comprised the “Twitter decahose,” a 10% sample of Tweets (covering 52 months of posts), and the “Twitter spritzer,” a 1% sample of Tweets (covering the other 15 months).
From this group of Tweets, we searched for keywords associated with the following 5 cardiovascular diseases: hypertension, diabetes, myocardial infarction, heart failure, and cardiac arrest. To generate a set of search terms, we used the Consumer Health Vocabulary,6 the Unified Medical Language System,7 and the consensus of the study authors. The following keywords were identified from these sources: diabetes (blood glucose and mellitus), heart attack (coronary attack, cardiac infarction, myocardial infarction, heart infarction, myocardial infarct, and myocardial necrosis), cardiac arrest (asystolic, asystole, cardiac arrest, heart arrest, ventricular fibrillation, and pulseless electrical activity), heart failure (cardiac failure), and hypertension (high blood pressure). To ensure that Tweets with these keywords were in English, we applied an English-language classifier to the sample.
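The keyword search described above can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the keyword lists are transcribed from the preceding paragraph, while the matching rule (case-insensitive, whole-phrase), the inclusion of each head term as a keyword, and the names KEYWORDS, PATTERNS, and match_topics are assumptions made for the example.

```python
import re

# Keyword lists transcribed from the paragraph above. Treating each head
# term (e.g. "diabetes") as a keyword is an assumption of this sketch.
KEYWORDS = {
    "diabetes": ["diabetes", "blood glucose", "mellitus"],
    "myocardial infarction": [
        "heart attack", "coronary attack", "cardiac infarction",
        "myocardial infarction", "heart infarction",
        "myocardial infarct", "myocardial necrosis",
    ],
    "cardiac arrest": [
        "asystolic", "asystole", "cardiac arrest", "heart arrest",
        "ventricular fibrillation", "pulseless electrical activity",
    ],
    "heart failure": ["heart failure", "cardiac failure"],
    "hypertension": ["hypertension", "high blood pressure"],
}

# One case-insensitive, whole-phrase pattern per topic (assumed matching rule).
PATTERNS = {
    topic: re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    for topic, terms in KEYWORDS.items()
}


def match_topics(text):
    """Return the disease topics whose keywords appear in the Tweet text."""
    return [topic for topic, pattern in PATTERNS.items() if pattern.search(text)]
```

A Tweet can match more than one topic under this rule; how the study handled such overlaps is not stated in the text.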
Reported coordinates were used to identify Tweets that could be mapped to a county in the United States.8 For Tweets without coordinates but with location information, locations reflecting city or county plus state were mapped. Tweets that could not be mapped to a US county by this process were eliminated.
To characterize Twitter users, we collected information from each user’s account, including the number of friends and followers. Additional data about Twitter users can be estimated based on their behavior on the platform. By applying established language-based algorithms based on users with known demographics to the Tweets in our sample, we imputed the age and sex of each user9; these data were compared with a random sample of Twitter users.
To describe the content of Tweets, 2 of us (C.L.D. and C.M.) used NVivo (QSR International) to code 500 Tweets for each of the 5 cardiovascular diseases, adjudicating differences with a larger group of authors (F.B., D.B., and R.M.M.). After coding the set of 2500 Tweets, total agreement for each category was greater than 90% and the mean κ was 0.77.
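For readers unfamiliar with the agreement statistics reported above, the following Python sketch shows how total agreement and Cohen κ can be computed for one coding category. The two label lists are invented toy data (1 = category applied, 0 = not applied), not study data, and the function names are hypothetical.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of items on which both coders assigned the same label."""
    return sum(x == y for x, y in zip(coder_a, coder_b)) / len(coder_a)


def cohen_kappa(coder_a, coder_b):
    """Cohen kappa: observed agreement corrected for chance agreement."""
    n = len(coder_a)
    p_observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: product of each coder's marginal label frequencies.
    labels = set(coder_a) | set(coder_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for 10 Tweets, invented for illustration only.
coder_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
coder_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
agreement = percent_agreement(coder_a, coder_b)
kappa = cohen_kappa(coder_a, coder_b)
```

In this toy example the coders agree on 8 of 10 Tweets, and κ is lower than the raw agreement because half of that agreement would be expected by chance.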
We measured the number of Tweets per topic per day. To account for variability in the baseline Tweet count, we compared daily counts for each disease topic in the US-based sample against 7-day running means and identified 3 peaks per topic. Two of us (L.E.S. and C.T.) then identified the triggers for the peaks by identifying the common theme in the Tweets for that day.
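One way to find peaks against a 7-day running mean is sketched below in Python. The text does not specify whether the running mean was trailing or centered, or how a peak was scored, so this sketch assumes a trailing window and ranks days by the ratio of the daily count to the mean of the preceding 7 days. The function name top_peaks and the daily counts are hypothetical.

```python
def top_peaks(daily_counts, window=7, n_peaks=3):
    """Return indices of the days that most exceed their trailing running mean.

    Assumption: each day is scored as count / mean of the previous `window`
    days; the study's exact scoring rule is not given in the text.
    """
    scored = []
    for i in range(window, len(daily_counts)):
        baseline = sum(daily_counts[i - window:i]) / window
        if baseline > 0:
            scored.append((daily_counts[i] / baseline, i))
    scored.sort(reverse=True)  # largest ratio first
    return [day for _, day in scored[:n_peaks]]


# Invented daily Tweet counts with spikes on days 7, 14, and 20.
counts = [10] * 7 + [50] + [10] * 6 + [30] + [10] * 5 + [20] + [10] * 3
peaks = top_peaks(counts)
```

Scoring against a local baseline rather than the raw count keeps a modest spike during a quiet period from being masked by routinely busy days.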
We used the χ2 test to compare the sex of individuals Tweeting about cardiovascular disease with the sex of the general population of Twitter users. Paired t tests were used to compare the ages of individuals Tweeting about cardiovascular disease with those of the general population of Twitter users.
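As an illustration of the sex comparison, the Python sketch below applies a Pearson χ2 test to a 2 × 2 table built from the counts reported in this article (59 082 of 124 896 users Tweeting about cardiovascular disease vs 8433 of 17 270 general Twitter users estimated to be male). It assumes no continuity correction, which the text does not specify, and the function name chi_square_2x2 is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its P value for 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Counts as reported in this article: estimated male users / all users.
cvd_male, cvd_total = 59082, 124896
gen_male, gen_total = 8433, 17270

stat, p_value = chi_square_2x2(
    cvd_male, cvd_total - cvd_male,
    gen_male, gen_total - gen_male,
)
```

Under these assumptions the P value falls below .01, in line with the reported result; the small absolute difference (47.3% vs 48.8%) reaches significance largely because of the sample size.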
From an initial sample of 10 billion Tweets, we identified 4.9 million with terms associated with cardiovascular disease; 550 338 were in English and originated from a US county (eFigure 1 in the Supplement). Diabetes and myocardial infarction represented more than 200 000 Tweets each, while the topic of heart failure returned fewer than 10 000 Tweets (Table 1). Similar findings were noted when analyzing data from a sample of Tweets geocoded to the United States (eTable in the Supplement). Peaks in Tweet rate were associated most often with thematically connected events reported in the news (eFigure 2 in the Supplement).
Those Tweeting about cardiovascular disease tended to be older than the general population of Twitter users (mean age, 28.7 vs 25.4 years; P < .01); mean age and sex varied across the different cardiovascular conditions (Table 1). Users Tweeting about cardiovascular disease were less likely to be male compared with the general population of Twitter users (59 082 of 124 896 [47.3%] vs 8433 of 17 270 [48.8%]; P < .01).
Tweet content varied across and within cardiovascular disease terms (Table 2). Most of the hand-coded Tweets in our sample (2338 of 2500 [93.5%]) included health-related information. The most commonly represented theme was risk factors for cardiovascular disease (1048 of 2500 [41.9%]) (Table 2). Approximately one-fourth of all Tweets (585 of 2500 [23.4%]) discussed awareness, frequently in the setting of fundraising for disease. Many Tweets (541 of 2500 [21.6%]) discussed the treatment and management of cardiovascular disease, often focusing on topics such as diet and exercise. Of Tweets that discussed outcomes of cardiovascular disease (247 of 2500 [9.9%]), most (193 [78.1%]) mentioned death.
Tweets could be characterized by tone, style, and perspective. Tweets associated with cardiovascular disease often used metaphor (1106 of 2500 [44.2%]), emotional language with positive or negative sentiment (974 of 2500 [39%]), and first-person accounts (872 of 2500 [34.9%]) (Table 3). Three percent of Tweets included a statement that the individual posting the Tweet identified as having cardiovascular disease.
This study has 3 main findings. First, we identified a large volume of US-based Tweets about cardiovascular disease. Second, we were able to characterize the volume, content, style, and sender of these Tweets, demonstrating the ability to identify signal from noise. Third, we found that the data available on Twitter reflect real-time changes in discussion of a disease topic.
We were able to identify 4.9 million Tweets associated with cardiovascular disease. Of the hand-coded sample, 94% included some form of information associated with health rather than a colloquial but non–health-associated use of the term. Prior work has suggested, however, that the language of Tweets, regardless of whether they arise from patients or other members of the community, can provide insight into the health behaviors of communities that are known to influence risk of disease.10 We also observed that Twitter users respond to events, such as World Diabetes Day or celebrity deaths, within minutes to hours and that these peaks in discussion are easily identifiable in the Twitter data set.
This study has several limitations. Our study focused on Tweets relevant to 5 specific cardiovascular conditions. Broader terms such as heart disease, specific terms such as sudden cardiac death, and slang terms such as DM2 or diabeetus may have captured other themes associated with these diagnoses. This study characterized only US-based English-language Tweets. We did not characterize the impression for each Tweet or the identities of those who received each Tweet. Hand coding was performed to read the text of Tweets and infer content, purpose, and sentiment. The true nature or intent of the user could not be verified.
Twitter may be useful for studying public communication about cardiovascular disease. The use of Twitter for clinical research is still in its infancy. Its value and direct applications remain to be seen and warrant further exploration.
Corresponding Author: Raina M. Merchant, MD, MSHP, Penn Medicine Social Media and Health Innovation Lab, University of Pennsylvania, 423 Guardian Dr, Philadelphia, PA 19104 (email@example.com).
Accepted for Publication: July 13, 2016.
Published Online: September 28, 2016. doi:10.1001/jamacardio.2016.3029
Author Contributions: Dr Merchant and Ms Sinnenberg had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Sinnenberg, DiSilvestro, Dailey, Buttenheim, Barg, Ungar, Schwartz, Brown, Asch, Merchant.
Acquisition, analysis, or interpretation of data: Sinnenberg, DiSilvestro, Mancheno, Dailey, Tufts, Ungar, Schwartz, Brown, Merchant.
Drafting of the manuscript: Sinnenberg, DiSilvestro, Mancheno, Dailey, Tufts, Barg, Merchant.
Critical revision of the manuscript for important intellectual content: Sinnenberg, DiSilvestro, Dailey, Tufts, Buttenheim, Ungar, Schwartz, Brown, Asch, Merchant.
Statistical analysis: Sinnenberg, DiSilvestro, Mancheno, Dailey, Tufts, Ungar, Schwartz, Merchant.
Obtaining funding: Ungar, Merchant.
Administrative, technical, or material support: Dailey, Barg, Schwartz, Brown, Merchant.
Study supervision: Mancheno, Buttenheim, Barg, Brown, Merchant.
Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Drs Buttenheim, Barg, Schwartz, and Asch; Mss Sinnenberg, DiSilvestro, and Mancheno; and Mr Dailey are employed by the US Government. No other conflicts were reported.
Funding/Support: This study was supported by grant R01-HL1422457 from the National Heart, Lung, and Blood Institute, Templeton Religious Trust (Dr Ungar), and grants K23 109083 and R01 122457 from the National Institutes of Health (Dr Merchant).
Role of Funder/Sponsor: The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.