Detecting Topics in Tweets
Twitter is a widely used short message service. With 313M monthly active users, it is also an important platform for politicians to reach potential voters. It is therefore interesting to investigate how the chairpersons of some of the most popular German parties use their Twitter accounts and whether any differences in the style of their tweets can be detected.
Tweet collection and vectorization
To this end, we used the Twitter API to collect all available tweets of the chairpersons of four political parties (die LINKE, CDU, SPD, AfD). We accessed the API with the Python library Tweepy and excluded retweets and replies:
tweets = api.user_timeline(screen_name=screen_name, include_rts=False, exclude_replies=True)
We preprocessed the tweets’ content by removing URLs and all punctuation except hyphens, and by converting each word to lower case. As German is a strongly inflecting language, stemming the words is a common preprocessing step. We used the stemmer implemented in the Natural Language Toolkit (NLTK). In addition, we extracted the hashtags contained in each tweet and used them as an additional feature. Here is a tweet before and after preprocessing for illustration:
Tweet before preprocessing:
“Besondere Glückwünsche an @Cumhuriyetgzt und @candundaradasi! Ein mutiges Signal für alle, die in der #Türkei für Meinungsfreiheit kämpfen.”
Tweet after preprocessing:
“besond gluckwunsch an @cumhuriyetgzt und @candundaradasi ein mutig signal fur all die in der turkei fur meinungsfrei kampf #Türkei”
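The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the function name `preprocess`, the regular expressions and the `stem` parameter are assumptions made for this example. Mentions (`@`) are kept because the example tweet above keeps them, and the hashtags are appended in their original form.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Everything that is not a word character, whitespace, "@" or a hyphen
# is treated as punctuation here (an assumption of this sketch).
PUNCT_RE = re.compile(r"[^\w\s@-]")


def preprocess(tweet, stem=lambda word: word):
    """Clean a tweet and append its hashtags as extra features.

    `stem` is a hypothetical hook: any callable mapping a word to its stem.
    By default, words are left unchanged.
    """
    text = URL_RE.sub(" ", tweet)          # remove URLs
    hashtags = HASHTAG_RE.findall(text)    # collect hashtags in original form
    text = PUNCT_RE.sub(" ", text)         # drop punctuation except hyphens and @
    tokens = [stem(token) for token in text.lower().split()]
    return " ".join(tokens + hashtags)


example = preprocess("Ein Signal, für die #Türkei! https://t.co/x")
```

To apply stemming, a callable such as the `stem` method of NLTK's `SnowballStemmer("german")` could be passed as `stem`. Whether this is the exact stemmer behind the example above is an assumption.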
Each tweet is then transformed into a numerical vector containing the counts of each word appearing in it, a so-called bag-of-words vector. Given this vector representation, the tweets can be compared mathematically.
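To make the bag-of-words idea concrete, here is a small sketch in plain Python. The helper names (`build_vocabulary`, `bag_of_words`, `cosine_similarity`) and the three toy documents are made up for illustration. Cosine similarity is used as one possible way to compare two tweets; the text does not say which measure was used.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of all distinct words in the corpus."""
    return sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy corpus of already preprocessed tweets (illustrative only).
docs = ["fur meinungsfrei kampf", "kampf fur fur europa", "gluckwunsch an all"]
vocabulary = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocabulary) for doc in docs]
```

Tweets that share words get a similarity above zero, while tweets without any common word get exactly zero.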
Embedding the tweets with t-SNE
t-SNE is a method for dimensionality reduction. Each tweet in our corpus corresponds to a vector with many entries. By reducing their dimensionality, it is possible to visualize them in a 2D space. An additional benefit of the t-SNE method is that similar vectors will be mapped close together. The embedding thus groups together tweets on similar topics.
The Python machine learning library Scikit-learn includes a t-SNE implementation. The model was initialized as follows:
tsne = TSNE(random_state=0, perplexity=8)
The perplexity parameter roughly controls how many neighbours of each tweet are taken into account when calculating the embedding. We applied the model to the tweets of each politician to obtain a 2D representation, which we plotted as a scatter plot.
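The embedding step could look roughly like the sketch below. The matrix `bow_vectors` is random stand-in data generated only for this example, not real tweet vectors, and the plotting lines are left as comments because they assume matplotlib is available.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for a bag-of-words matrix: 40 "tweets", 50 "words".
rng = np.random.default_rng(0)
bow_vectors = rng.poisson(0.3, size=(40, 50)).astype(float)

# Same settings as in the text; the perplexity should be smaller
# than the number of tweets being embedded.
tsne = TSNE(n_components=2, random_state=0, perplexity=8)
embedding = tsne.fit_transform(bow_vectors)  # one 2D point per tweet

# Optional scatter plot (assumes matplotlib is installed):
# import matplotlib.pyplot as plt
# plt.scatter(embedding[:, 0], embedding[:, 1])
# plt.show()
```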
Tweets – very short messages
The results we obtained (see other blog post) already gave interesting insights into the thematic structure of each politician’s tweets. In addition, we validated the t-SNE embedding by applying the K-means clustering algorithm to the bag-of-words vectors. The results were consistent with the t-SNE grouping. However, there were still many tweets that could not be clearly assigned to a cluster. This might be because Twitter messages are restricted to 140 characters and are thus very short. Sometimes they even consist of only a single word or hashtag. Their information content is therefore very limited, and a topic cannot always be clearly derived.
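The K-means check mentioned above can be sketched with Scikit-learn as follows. The data is synthetic: two artificial "topics" that use disjoint sets of words. The number of clusters (`n_clusters=2`) and the other settings are assumptions of this example, since the text does not state which values were used.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic bag-of-words matrix: 20 "tweets" over a 10-word vocabulary.
# The first 10 tweets only use words 0-4, the last 10 only words 5-9.
rng = np.random.default_rng(0)
bow_vectors = np.zeros((20, 10))
bow_vectors[:10, :5] = rng.integers(1, 4, size=(10, 5))
bow_vectors[10:, 5:] = rng.integers(1, 4, size=(10, 5))

# Cluster the count vectors directly, independently of any t-SNE embedding.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(bow_vectors)  # one cluster label per tweet
```

Colouring the points of a t-SNE scatter plot by `labels` is one way to see whether the two groupings agree. With real tweets the separation is much less clean than in this toy data, for the reasons given above.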