Machine Learning Classifiers for Socio-Demographics of Social Media Users: Limitations and Possibilities

Poster Presentation at the APHA 2019 Annual Meeting and Expo


Date
Event
APHA 2019 Annual Meeting and Expo
Location
Philadelphia, PA

Background: Social media analyses of health behaviors, such as vaccination, have shown significant promise for surveillance. However, existing research, both quantitative and qualitative, suggests that health behaviors vary significantly with socio-demographic factors, particularly race, ethnicity, and gender. Often, these factors are not explicitly disclosed on social media platforms and must instead be inferred, raising the possibility of methodological bias. This session examines existing tools and methodologies used to infer socio-demographics of social media users.

Methods: We survey several socio-demographic classifiers that work with Twitter and Reddit data. These classifiers use features including users’ language patterns, follower behaviors, and choice of names. They predict labels such as users’ gender, race, and ethnicity, or identify social media accounts run by organizations so that those accounts can be filtered out.

Results: We explain how the data for these classifiers is collected, how the classification models are trained, and how they could be applied to public health research. In particular, we discuss the limitations of these classifiers, including possible methodological bias introduced by the challenges of collecting social media users’ demographic information at scale.

Discussion: Health behaviors vary with socio-demographic factors, which are challenging to measure on social media platforms. Machine learning classification of socio-demographics is possible, but requires interdisciplinary considerations.