Address correspondence to: Mario A. Navarro, PhD, Office of Health Communication and Education, Center for Tobacco Products, Food and Drug Administration (FDA), 10903 New Hampshire Avenue, Silver Spring MD 20993.
• Machine learning and qualitative coding provide context to social media analysis.
• Predicting Reddit user age groups allows nuanced comparisons on thematic topics by age group.
• Opposition to flavor restrictions was prominent for both age groups.
• Emergent themes for the age group 13–20 years were opposition to minimum age laws and flavored ENDS discussions.
• Posts by the age group 21–54 years commonly mentioned general vaping use behavior.
Introduction
This study analyzes age-differentiated Reddit conversations about ENDS.
Methods
This study combines 2 methods to (1) predict Reddit users’ age into 2 categories (13–20 years [underage] and 21–54 years [of legal age]) using a machine learning algorithm and (2) qualitatively code ENDS-related Reddit posts within the 2 groups. The 25 posts with the highest karma score (number of upvotes minus number of downvotes) for each keyword search (i.e., query) and each predicted age group were qualitatively coded.
Results
Of the 9 topics that emerged, the top 3 were flavor restriction policies, Tobacco 21 policies, and use. Opposition to flavor restriction policies was a prominent subcategory for both groups but was more common in the 21–54 group. The 13–20 group was more likely to discuss opposition to minimum age laws as well as access to flavored ENDS products. The 21–54 group commonly mentioned general vaping use behavior.
Conclusions
Users predicted to be in the underage group posted about different ENDS-related topics on Reddit than users predicted to be in the of-legal-age group.
Electronic nicotine delivery systems (ENDSs) are the most commonly used tobacco product among U.S. youth, with an estimated 4.43 million high school students and 860,000 middle school students having ever used an ENDS product as of 2021.
In addition to the addictive properties of nicotine, reviews have identified several other harms and potential harms associated with ENDS use, including inhalation of toxins and decreases in lung function.
Social media listening provides unique methodologies to obtain rapid insights and surveillance on product discussions.
Recent qualitative studies using social media for tobacco prevention and control research rely heavily on thematic coding and content analysis of posted material. For example, Wang and colleagues posted on the social media site, Reddit, to investigate ENDS flavor mentions.
However, the lack of publicly available demographic information on users is a limitation of social media data and may prevent researchers from understanding at-risk audiences via this route.
Chew and colleagues developed an algorithm that examines users’ posts and metadata to predict and categorize Reddit users’ ages into 1 of 2 groups: 13–20 years (i.e., underage [UA]) or 21–54 years (i.e., of legal age [OLA]). These 2 age groups were used to separate users by whether they could legally use tobacco products and to provide an appropriate model because there were very few age references for those aged >54 years. This exploratory study, using the Chew and colleagues algorithm, investigates ENDS conversations, with a focus on flavor restriction and Tobacco 21 policy discussions for posts originating from predicted UA and OLA groups.
METHODS
Study Population
Figure 1 summarizes the 3 overarching steps of identification undertaken in this study. First, Reddit posts about vaping in general, flavor restriction policies, and Tobacco 21 policies were identified and downloaded from Brandwatch.com, a social media listening platform. Multiple search keywords were used to identify relevant posts about general vaping (e.g., vape, vaping, E-cigarette), flavor restriction policies (e.g., flavor policy), and Tobacco 21 policies (e.g., minimum [min] age laws and tobacco-related words such as cigarettes, vapes, and cigars). These keyword groups formed 3 separate queries to pull the data. Searches were also restricted to English language‒only posts.
A previously developed age prediction model was used to predict the age group for each author as either UA (13–20 years), OLA (21–54 years), or uncertain.
These categories were used to examine the differences in conversations depending on whether the user was of legal age to use tobacco. The lower bound was selected because Reddit users must be aged ≥13 years, and those aged >54 years could not be appropriately classified because of the small number of individuals who fell into this category during the development of the model. The age prediction model uses the gradient-boosted trees algorithm to predict the probability that each user belongs to either the UA or the OLA age group. Analogous to logistic regression, predicted probabilities are generated by multiplying the trained model weights by the input variable values for each new observation, summing them together, and applying an inverse logit transformation. There are 15 input variables required for the model to generate predictions, spanning literary characteristics (e.g., sentences per comment) to subreddit posting frequencies (e.g., “proportion of user's posts or comments in the r/teenagers subreddit”). A full list of the variables used in the model, as well as further background on other variables considered, variable importance, and model performance, can be found in Chew et al.
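The weighted-sum and inverse logit computation described above can be sketched as follows. This is a minimal illustration only, not the published model: the two variable names, the weights, and the intercept are hypothetical placeholders, whereas the model in Chew et al. uses 15 input variables within a gradient-boosted trees framework.

```python
import math


def inverse_logit(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, features, intercept=0.0):
    """Weighted sum of the input values followed by an inverse logit."""
    score = intercept + sum(weights[name] * features[name] for name in weights)
    return inverse_logit(score)


# Hypothetical weights and input values, for illustration only.
weights = {"sentences_per_comment": -0.4, "prop_posts_r_teenagers": 3.0}
features = {"sentences_per_comment": 2.0, "prop_posts_r_teenagers": 0.5}

# Predicted probability that this (hypothetical) user is in the UA group.
p_ua = predict_probability(weights, features)
print(round(p_ua, 3))
```

In this sketch, a larger share of posts in r/teenagers raises the score and therefore the predicted UA probability; the direction and size of the real model's effects are reported in Chew et al., not here.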
Because the model does not produce perfect predictions (test set F1 score, ∼0.79), we reduced the likelihood that the model returned false positives by considering only predictions with a predicted probability >0.6 for either age group. This process of rejecting predictions for which the model is most uncertain is referred to in the literature as classification with a reject option. After applying the age prediction model to the posts from each query, we selected the 25 posts in each predicted age group and query with the highest karma scores (number of upvotes minus number of downvotes). This resulted in 150 total posts across both age groups and 3 queries.
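The reject option and the karma-based selection described above could be implemented along the following lines. This is a sketch under stated assumptions: the record fields (`p_ua`, `upvotes`, `downvotes`, `query`, `id`) are hypothetical names, and the OLA probability is assumed to be the complement of the UA probability because the model predicts 2 groups. Only the 0.6 threshold and the 25-post cutoff come from the text.

```python
from collections import defaultdict

THRESHOLD = 0.6  # probability cutoff reported in the text
TOP_N = 25       # posts kept per query and predicted age group


def assign_age_group(p_ua: float, threshold: float = THRESHOLD) -> str:
    """Return 'UA', 'OLA', or 'uncertain' (the rejected predictions)."""
    p_ola = 1.0 - p_ua  # assumption: the 2 group probabilities sum to 1
    if p_ua > threshold:
        return "UA"
    if p_ola > threshold:
        return "OLA"
    return "uncertain"


def karma(post: dict) -> int:
    """Karma score: number of upvotes minus number of downvotes."""
    return post["upvotes"] - post["downvotes"]


def top_posts(posts, top_n: int = TOP_N) -> dict:
    """Keep the top_n highest-karma posts per (query, age group)."""
    grouped = defaultdict(list)
    for post in posts:
        group = assign_age_group(post["p_ua"])
        if group != "uncertain":  # drop predictions the model is unsure of
            grouped[(post["query"], group)].append(post)
    return {
        key: sorted(items, key=karma, reverse=True)[:top_n]
        for key, items in grouped.items()
    }


# Hypothetical records, for illustration only.
posts = [
    {"id": "a", "query": "general vaping", "p_ua": 0.91, "upvotes": 120, "downvotes": 10},
    {"id": "b", "query": "general vaping", "p_ua": 0.85, "upvotes": 300, "downvotes": 20},
    {"id": "c", "query": "general vaping", "p_ua": 0.55, "upvotes": 900, "downvotes": 5},
    {"id": "d", "query": "general vaping", "p_ua": 0.10, "upvotes": 50, "downvotes": 1},
]
selected = top_posts(posts, top_n=1)
```

Note that post `c` has the highest karma in this toy data but is excluded because neither group's probability exceeds the threshold; that is the trade-off a reject option makes between coverage and false positives.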
Data Analysis
Two coders were trained using a standardized codebook, and after achieving sufficient inter-rater reliability (percentage agreement reached at least 70%), they independently coded the study sample. All themes listed in the Results section were the themes in the codebook. Not all themes were present; more information is provided in the Results. Posts were excluded if they mentioned marijuana/tetrahydrocannabinol/cannabidiol, were not in the English language, or were not relevant to E-cigarettes.
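As an illustration of the reliability criterion, simple percentage agreement between 2 coders can be computed as sketched below. The text does not state the unit over which agreement was calculated (e.g., per post or per code), so this example assumes one code per post, and the code labels are hypothetical.

```python
def percent_agreement(codes_a, codes_b) -> float:
    """Percentage of items to which both coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must rate the same non-empty set of items.")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


# Hypothetical codes assigned to the same 5 training posts.
coder_1 = ["opposition", "skepticism", "other", "opposition", "use"]
coder_2 = ["opposition", "skepticism", "other", "skepticism", "use"]

agreement = percent_agreement(coder_1, coder_2)
meets_threshold = agreement >= 70.0  # threshold reported in the text
```

Percentage agreement does not correct for agreement expected by chance, which is one reason some studies report a chance-corrected statistic alongside it.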
RESULTS
Descriptive Statistics
Eighteen posts were excluded from the predicted UA group, and 24 posts were excluded from the predicted OLA group, leaving 57 UA (general vaping: 18, flavor restriction policies: 18, Tobacco 21 policies: 21) and 51 OLA (general vaping: 13, flavor restriction policies: 17, Tobacco 21 policies: 21) posts. For each query, the range of karma scores for coded posts was large, suggesting that most highly engaged posts (i.e., high karma scores) were captured (general vaping: predicted UA group: mean=3,715, min=1,212, maximum [max]=8,327; predicted OLA group: mean=1,071, min=553, max=2,352; flavor restriction policy: predicted UA group: mean=476, min=42, max=5,837; predicted OLA group: mean=438, min=248, max=1,188; Tobacco 21 policy: predicted UA group: mean=62, min=17, max=376; predicted OLA group: mean=185, min=20, max=1,259).
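The post counts above can be checked with a few lines of arithmetic. All numbers in this sketch are taken from the text (25 posts per query and age group, 3 queries, and the reported exclusions); only the variable names are introduced here.

```python
POSTS_PER_CELL = 25  # top posts selected per query and predicted age group
QUERIES = ("general vaping", "flavor restriction policies", "Tobacco 21 policies")

excluded = {"UA": 18, "OLA": 24}
retained = {
    "UA": {"general vaping": 18, "flavor restriction policies": 18, "Tobacco 21 policies": 21},
    "OLA": {"general vaping": 13, "flavor restriction policies": 17, "Tobacco 21 policies": 21},
}

sampled_per_group = POSTS_PER_CELL * len(QUERIES)
totals = {group: sum(by_query.values()) for group, by_query in retained.items()}

for group, total in totals.items():
    # Retained posts should equal sampled posts minus exclusions.
    assert total == sampled_per_group - excluded[group]
    print(group, total)
```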
Post Categories
Table 1 reports the frequency and percentages of each post code category and subcategory. Coding categories included flavor restriction policies, access, Tobacco 21 policies, use, motivations for vaping, harm perceptions, products, memes/jokes, coronavirus disease 2019 (COVID-19), barriers to vaping, campaigns by the Center for Tobacco Products, and other. Barriers to vaping and campaigns by the Center for Tobacco Products did not emerge as code categories, even though they were originally in the codebook.
Table 1. Post Category Prevalence in Both Predicted Underage and Of-Legal-Age Post Authors
For both UA and OLA groups, the categories of flavor restriction policies and Tobacco 21 policies were the most prominent (>40%). Between the 2 groups, the products and memes/jokes categories were more prominent for UA than for OLA. The categories of use and harm perceptions were more prominent for OLA.
Subcategory differences between the predicted age groups added further nuance. For flavor restriction policies, opposition was a primary subcategory for both predicted age groups, but many flavor restriction posts fell into the other subcategory for the UA group and skepticism for the OLA group. To clarify, opposition was defined as “voicing clear opposition or encouraging work against an ordinance,” and skepticism was defined as “doubt about the motives behind or effectiveness of an ordinance.” Posts coded as other were dominated by news stories in both groups. The OLA group had nearly twice as many opposition codes as the UA group, and the second most common codes for UA were links to news stories. A clear distinction between the groups is that the OLA group showed greater opposition and skepticism to flavor restriction policies.
For the Tobacco 21 policy category, a similar pattern emerged for the UA and OLA groups, with opposition, skepticism, and the other subcategories dominating the conversation, although for this topic, the UA group showed greater opposition and skepticism, whereas the OLA group posted mostly other-category news links. For the UA group, a subcategory code emerged that detailed the desire to allow ENDS users aged 18–20 years, who were previously able to use ENDS products, to continue having the ability to purchase ENDS products (legacy clause, sometimes referred to by posters as grandfather clause). Within the subcategories of use, the vape terms subcategory was the most prominent for both groups. These terms consisted of vapes, vaping, vape master, ripping, and Juuling for the predicted UA posts and fire up your rig, e-liquid, ejuice, nic juice, coils, and pod system for OLA posts.
For motivations for vaping, the primary motivation mentioned for both UA and OLA posts was the desire to avoid cigarettes. Harm perception posts were primarily identified for the OLA group and ranged in topic from vaping-related illnesses to feeling better after quitting. The product category was primarily made up of brand names. For the UA group, this brand was exclusively JUUL, but the OLA group included others such as Lava Pods. Memes/jokes emerged predominantly among UA posts and included visual jokes for various sorts of media and contained jokes mocking vaping. COVID-19 information, in the form of news articles, was discussed mostly within OLA posts. The other-category posts were more prominent for OLA posts and consisted of an individual's personal relationship with vaping, usually with a form of judgment.
DISCUSSION
This mixed methods analysis of Reddit posts provided insight into ENDS online conversations by differentiating conversations by 2 predicted age groups (13–20 and 21–54 years). Differences between predicted age groups emerged for both frequency of code categories and more specific content within categories. Posts were coded into the categories of flavor restriction policies, Tobacco 21 policies, use, motivations for vaping, harm perceptions, products, memes/jokes, COVID-19, and other. Within the flavor restriction policies category, the subcategories told a more nuanced story: most posts for the UA group fell into the other category, and skepticism posts were most prevalent for the OLA group. A similar pattern emerged for the Tobacco 21 policy category. One differentiating subcategory for the Tobacco 21 policy category was the legacy clause code for UA posts. This study aligns with previous research, which found age restriction opposition by UA Reddit users and the use of E-cigarettes to avoid cigarettes.
This study has implications for future research and for public health surveillance. The mixed methodologies (i.e., data science models and qualitative coding) used in this study can be applied to a vast number of public health topics. In addition, because age prediction algorithms have been applied to other platforms in the past, the set of platforms analyzed could be expanded.
Finally, automated data science methodologies (e.g., topic modeling) could provide a way to autocategorize posts, making it easier to provide thematic analyses for large amounts of data and provide a more rapid form of surveillance.
Limitations
This study had several limitations. The keywords used in the query did not reflect all the relevant keywords that could differentiate between posts written by UA and OLA users of Reddit. Relevant posts could have been missed. Although a sample of the top 25 most engaged posts was used, the sample size is still fairly small. This may limit the generalizability of the results. Because the sample was small, it was inappropriate to conduct any statistical analyses.
CONCLUSIONS
Reddit posts provide a robust public access data source that can be used by researchers.
This study used a combination of methodologies to paint a picture of the current ENDS landscape on Reddit. Differences between the predicted age groups were found across all 3 queries (i.e., general vaping, flavor restriction policies, and Tobacco 21 policies). These differences highlight the importance of combining classification tools with qualitative coding, which allows researchers and public health professionals to better understand perceptions, knowledge, attitudes, and beliefs about a product and to develop more targeted messaging.
ACKNOWLEDGMENTS
This publication represents the views of the author(s) and does not represent Food and Drug Administration Center for Tobacco Products position or policy.
This work was funded by contract with the Center for Tobacco Products, U.S. Food and Drug Administration, U.S. Department of Health and Human Services (number HHSF223201510002B-Order #75F40119F19020).
Declarations of interest: none.
CRediT AUTHOR STATEMENT
Mario A. Navarro: Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review and editing. Andrea Malterud: Methodology, Writing – original draft, Writing – review and editing. Zachary P. Cahn: Writing – original draft, Writing – review and editing. Laura Baum: Formal analysis, Project administration, Methodology, Writing – original draft, Writing – review and editing. Thomas Bukowski: Data curation, Formal analysis, Software, Writing – original draft, Writing – review and editing. Caroline Kery: Data curation, Formal analysis, Writing – review and editing. Robert F. Chew: Data curation, Formal analysis, Software, Writing – original draft, Writing – review and editing. Annice E. Kim: Conceptualization, Supervision, Writing – review and editing.
REFERENCES
Gentzke AS, Wang TW, Cornelius M, et al. Tobacco product use and associated factors among middle and high school students – National Youth Tobacco Survey, United States, 2021.
Predicting age groups of Reddit users based on posting behavior and metadata: classification model development and validation [published correction appears in JMIR Public Health Surveill. 2021;7(4):e30017].