QUANTIFYING POLITICAL LEANING FROM TWEETS, RETWEETS, AND RETWEETERS
• Felix M.F. Wong was with the Department of Electrical Engineering, Princeton University. He is now with Yelp, Inc.
• Chee Wei Tan is with the Department of Computer Science, City University of Hong Kong.
• Soumya Sen is with the Department of Information & Decision Sciences, Carlson School of Management, University of Minnesota.
• Mung Chiang is with the Department of Electrical Engineering, Princeton University.
Preliminary version in [51]. This version has substantial improvements in algorithm, evaluation and quantitative studies.

In recent years, big online social media data have found many applications at the intersection of political and computer science. Examples include answering questions in political and social science (e.g., proving or disproving the existence of media bias [3, 30] and the “echo chamber” effect [1, 5]), using online social media to predict election outcomes [46, 31], and personalizing social media feeds so as to provide a fair and balanced view of people’s opinions on controversial issues [36]. A prerequisite for answering these research questions is the ability to accurately estimate the political leaning of the population involved. If this prerequisite is not met, the conclusions will be invalid, the predictions will perform poorly [35, 37] due to a skew towards highly vocal individuals [33], or the user experience will suffer.

In the context of Twitter, accurate political leaning estimation poses two key challenges: (a) Is it possible to assign tweeters meaningful numerical scores that reflect their position on the political spectrum? (b) How can we devise a method that leverages the scale of Twitter data while respecting the rate limits imposed by the Twitter API?

Focusing on “popular” Twitter users who have been retweeted many times, we propose a new approach that incorporates the following two sets of information to infer their political leaning.
• Tweets and retweets: the target users’ temporal patterns of being retweeted, and the tweets published by their retweeters. The insight is that a user’s tweet contents should be consistent with whom they retweet, e.g., if a user tweets a lot during a political event, she is expected to also retweet a lot at the same time. This is the “time series” aspect of the data.
• Retweeters: the identities of the users who retweeted the target users. The insight is that similar users get followed and retweeted by similar audiences due to the homophily principle. This is the “network” aspect of the data.

Our technical contribution is to frame political leaning inference as a convex optimization problem that jointly maximizes tweet-retweet agreement through an error term and user similarity agreement through a regularization term, which is constructed to also account for heterogeneity in the data. Our technique requires only a steady stream of tweets, not the Twitter social network, and the computed scores have a simple interpretation of “averaging,” i.e., a score is the average number of positive/negative tweets expressed when retweeting the target user. See Figure 1 for an illustration.

Fig. 1. Incorporating tweets and retweets to quantify political leaning: to estimate the leaning of the “sources,” we observe how ordinary users retweet them and match it with what they tweet. The identities of the retweeting users are also used to induce a source similarity measure to be used in the algorithm.

Using a set of 119 million tweets on the U.S. presidential election of 2012 collected over seven months, we extensively evaluate our method to show that it outperforms several standard algorithms and is robust with respect to variations of the algorithm.

The second part of this paper presents a quantitative study of our collected tweets from the 2012 election, by (a) quantifying the political leaning of 1,000 frequently retweeted Twitter users, and then (b) using their political leaning to infer the leaning of 232,000 ordinary Twitter users. We make a number of findings:
• Parody Twitter accounts have a higher tendency to be liberal as compared to other account types. They also tend to be temporally less stable.
• Liberals dominate the population of less vocal Twitter users with less retweet activity, but for highly vocal populations, the liberal-conservative split is balanced. Partisanship also increases with vocalness of the population.
• Hashtag usage patterns change significantly as political events unfold.
• As an event is happening, the influx of Twitter users participating in the discussion makes the active population more liberal and less polarized.

The rest of this paper is organized as follows. Section 2 reviews related work on Twitter and on quantifying political orientation in traditional and online social media. Section 3 details our inference technique by formulating political leaning inference as an optimization problem. Section 4 describes our dataset collected during the U.S. presidential election of 2012, which we use to derive ground truth for evaluation in Section 5. Then in Section 6 we perform a quantitative study on the same dataset, studying the political leaning of Twitter users and hashtags, and how it changes with time. Section 7 concludes the paper with future work.
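As a rough illustration of the “averaging” interpretation and of the retweeter-based source similarity mentioned in this introduction, the following Python sketch computes a naive score per source and an audience-overlap similarity between two sources. It is only a toy under stated assumptions: the data, the names (`observations`, `average_scores`, `retweeter_sets`, `jaccard`) and the +1/-1 sentiment labels are hypothetical, the Jaccard overlap is our illustrative choice of similarity rather than the measure defined later in the paper, and the error and regularization terms of the convex formulation are omitted entirely.

```python
from collections import defaultdict

# Hypothetical toy data: (retweeter, source, sentiment of the retweeter's tweet).
# The +1 / -1 sentiment labels are assumed to be given; how they are obtained
# is not specified in this introduction.
observations = [
    ("u1", "source_a", +1),
    ("u2", "source_a", +1),
    ("u3", "source_a", -1),
    ("u3", "source_b", -1),
    ("u4", "source_b", -1),
]


def average_scores(obs):
    """Naive per-source score: mean signed sentiment over retweet observations."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for _, source, sentiment in obs:
        totals[source] += sentiment
        counts[source] += 1
    return {source: totals[source] / counts[source] for source in totals}


def retweeter_sets(obs):
    """Map each source to the set of users who retweeted it."""
    audiences = defaultdict(set)
    for user, source, _ in obs:
        audiences[source].add(user)
    return dict(audiences)


def jaccard(a, b):
    """Audience overlap between two sources (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


scores = average_scores(observations)
audiences = retweeter_sets(observations)
similarity = jaccard(audiences["source_a"], audiences["source_b"])
```

In this toy data, `source_a` receives mixed sentiment and `source_b` only negative sentiment, so their naive scores differ, while the single shared retweeter `u3` gives them a small but nonzero similarity. A regularized formulation would use such a similarity to pull the scores of sources with similar audiences closer together.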