Learning a new concept in any field can be a daunting task. A typical approach is to find an application that is relevant, or at least interesting, to you, the learner. In teaching myself unsupervised machine learning, I was intrigued by QAnon drops: anonymous internet posts that have been shaking up the socio-political world. Are there interpretations that ML can make of QAnon drops? Do these drops lean towards a specific sentiment? Let's find out!
With that motivation, I set out to create a Python-based unsupervised approach to describing the sentiment of QAnon drops. In the following write-up we will use word2vec and K-means clustering to implement a quick and dirty method of extracting positive or negative sentiment from these posts.
Tech and the Code
If you’d like to follow along or just skip to the nitty-gritty, all the source code for this project is available on my GitHub here.
There are a few extremely helpful resources I’d like to call out before we begin:
- Curated QAnon Drop dataset in JSON format
- Great resource for an intro on K-means clustering
- Excellent article on unsupervised sentiment analysis
Brief Primer on QAnon Drops
Back in 2017, anonymous, politically motivated posts began popping up on the imageboard 4chan and then on 8chan. These posts, or “drops”, described a clandestine organization of high-ranking individuals associated with pedophilic sex cults, among other nefarious practices. The true author remains a mystery to this day. The drops have garnered international attention and have provided motivation to a sizable swath of our population. Whether or not any of it is true is beyond the scope of this article; however, it’s very interesting to analyze from a data science standpoint. For more info on QAnon, check out the Wikipedia article.
QAnon drops appear (at a cursory glance) fairly random but follow a similar theme throughout. On the surface, they are riddled with misspellings and incorrect grammar:
“Court order to preserve ALL data sent to GOOG?\nThink GOOG+ / Gmail / etc.\nComms Cleanup?\nThe More You Know…”
“There is TRUTH in MEMES.\nTRUTH that DESTROYS the FAKE NEWS narrative. \nHOUSE OF CARDS.\nQ”
In order to glean some sort of meaning, we need to clean up the data a bit by removing this “noise”.
Note, however, that we don’t want to remove too much, as we could lose the overall “QAnon-ness” of the text.
The first step is to import the drops from JSON into a pandas dataframe where each drop is an individual row. Newlines, tabs, and carriage returns are stripped out, as well as “Q” signatures. We also make use of Gensim’s remove_stopwords to remove common words such as “this”, “there” and “from”.
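import pandas as pd
from gensim.parsing.preprocessing import remove_stopwords

# imported_text is the parsed JSON; each entry in 'posts' is one drop
corpus = pd.DataFrame(columns=['text'])
for post in imported_text['posts']:
    if 'text' in post:  # skip posts without a text field
        # Swap newlines/tabs for spaces so adjacent words don't fuse, strip every capital "Q", lowercase
        text = str(post['text']).replace("\n", " ").replace("\r", " ").replace("\t", " ").replace('Q', '').lower()
        corpus.loc[len(corpus.index)] = remove_stopwords(text)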
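The snippet above assumes imported_text already holds the parsed JSON. Here is a minimal sketch of that loading step, assuming the dataset is a single file with a top-level "posts" list (the shape the loop above relies on). The helper name load_drops and the file name posts.json are hypothetical placeholders, not part of the original project.

```python
import json


def load_drops(path):
    """Parse a JSON file shaped like {"posts": [{"text": ...}, ...]}.

    The function name and the expected shape are assumptions based on
    how imported_text is used in the cleaning loop.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical file name; point this at wherever the dataset was saved.
# imported_text = load_drops("posts.json")
```

Reading the file with an explicit UTF-8 encoding avoids platform-dependent defaults, which matters for text scraped from imageboards.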
Next, general special characters need to be removed.
There are a lot of “test” posts, reply posts and hyperlinks that need to be removed as well.
# Drop test posts (substring match, so words like "latest" also match)
corpus = corpus[~corpus['text'].str.contains("test", regex=False)].copy()
# Remove links; http\S+ also covers https
corpus['text'] = corpus['text'].str.replace(r'http\S+', '', regex=True)
# Drop post replies
corpus = corpus[~corpus['text'].str.contains(">>", regex=False)]
corpus = corpus.dropna().reset_index(drop=True)
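The special-character removal mentioned above isn't shown in the snippet, so here is one way it could look. This is a sketch under my own assumptions: the text is already lowercased, "special" means anything that isn't a letter, digit, or whitespace, and the helper name strip_special_characters is hypothetical. It should run after the link and reply filters, since it would otherwise strip the ">>" and URL punctuation those filters look for.

```python
import re


def strip_special_characters(text):
    """Replace anything that is not a-z, 0-9 or whitespace with a space."""
    # Assumption: input is already lowercased, so uppercase letters are not kept.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of spaces left behind and trim the ends.
    return re.sub(r"\s+", " ", cleaned).strip()


example = strip_special_characters("goog+ / gmail / etc.")
# example == "goog gmail etc"
```

On the dataframe, this could be applied with corpus['text'] = corpus['text'].apply(strip_special_characters). A stricter or looser character class may suit the data better; the one here is only a starting point.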
Finally, as we will see in the next post, word2vec expects the corpus as a list of lists, where each inner list holds the individual words of one drop. Let’s go ahead and make each dataframe row a list.
# Assign the result back; apply returns a new Series rather than modifying in place
corpus['text'] = corpus['text'].apply(lambda x: str(x).split())
This gives us a corpus of 3113 drops.
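To make the list-of-lists shape concrete, here is a small sketch in plain Python, working on strings rather than the dataframe. The helper name to_sentences and the sample strings are made up for illustration. Dropping empty token lists is my own addition, on the assumption that some drops end up empty after cleaning (for example, a drop that was only a link).

```python
def to_sentences(drops):
    """Turn cleaned drop strings into a list of token lists, skipping empties."""
    sentences = [drop.split() for drop in drops]
    # A drop that was only whitespace splits to [], which carries no words.
    return [tokens for tokens in sentences if tokens]


# Illustrative input, not real pipeline output.
sentences = to_sentences(["truth memes", "", "house cards"])
# sentences == [["truth", "memes"], ["house", "cards"]]
```

Once each dataframe row holds a list, corpus['text'].tolist() yields this same shape, which is what Gensim's Word2Vec accepts as its sentences argument.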
Now that we have a good grasp on the data, let’s extract some word embeddings with word2vec… See you in Part Two!