To catch up: we are discovering sentiment in QAnon drops using an unsupervised (k-means clustering) approach. In my previous post we imported our data into a pandas DataFrame and did some initial data cleaning.
What happens now
In this article we take the logical next step: translating the QAnon drops into a format the classifier (discussed in the next post) can understand. To do this we will leverage a Python word-embedding algorithm called word2vec.
A little on Word2vec
Machine learning models learn poorly, or in some cases not at all, from raw text data. Because the models rely on numerical calculations, they require numerical input.
To use text data we must first translate it into something the model can understand. There are a few ways to accomplish this, such as one-hot encoding and label encoding, both of which are beyond the scope of this article.
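For context, here is a minimal sketch of what label encoding and one-hot encoding look like on a toy vocabulary (the words and variable names here are illustrative, not from our dataset):

```python
# Toy vocabulary; in practice this would come from the corpus.
words = ["storm", "drop", "anon"]

# Label encoding: map each word to an integer index.
label_encoding = {word: idx for idx, word in enumerate(words)}
# {'storm': 0, 'drop': 1, 'anon': 2}

# One-hot encoding: each word becomes a vector with a single 1
# at its index, and 0s everywhere else.
one_hot = {
    word: [1 if i == idx else 0 for i in range(len(words))]
    for word, idx in label_encoding.items()
}
# 'drop' -> [0, 1, 0]
```

Both schemes produce numbers, but neither captures any notion of meaning, which is where word2vec comes in.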
Word2vec is another, arguably more powerful, approach. It differs from the encodings mentioned above in that it is a text embedding model: the aim is to use a shallow neural network to learn a vector-space representation of an entire corpus of text. This is advantageous because the vector space attempts to preserve relationships between words, grouping similar words closer together.
For a great deep dive, check this out for an illustrated description of word2vec.
Building / Training
We are using Gensim’s implementation of the word2vec algorithm.
from gensim.models import Word2Vec
Word2vec expects a list of lists, where each inner list holds the words of one sentence, like:
[["this", "is", "an", "example"],
 ["this", "is", "another", "example"]]
Gensim’s library provides a built-in method,
build_vocab, to build the model’s vocabulary from tokenized data; however, we already split our sentences into word lists in the data cleaning step.
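As a sanity check, the cleaned data should match the list-of-lists format shown above. A minimal sketch of the kind of split done in the cleaning step (the raw strings are placeholders, and `corpus_list` is the name used in the training call below):

```python
# Raw sentences (placeholders standing in for the cleaned drop text).
sentences = ["the storm is coming", "trust the plan"]

# Split each sentence into a list of lowercase word tokens.
corpus_list = [s.lower().split() for s in sentences]
# [['the', 'storm', 'is', 'coming'], ['trust', 'the', 'plan']]
```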
Now that our data is ready, let’s implement the word2vec model:
v_model = Word2Vec(corpus_list, min_count=3, vector_size=1000, workers=4, window=5, epochs=40, sg=1)
Let’s talk about what this actually does…
When training data is fed into the model, it first builds a vocabulary of unique words. Then, during training, the model generates a vector-space representation of those words based on a “window” of consecutive words in each sentence. The underlying assumption is that words appearing close together in a sentence should also sit close together in the vector space.
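To make the “window” idea concrete, here is a small sketch (not part of Gensim) that lists the context words the model would consider for each center word:

```python
def context_pairs(sentence, window):
    """Yield (center, context) pairs a word2vec-style model would train on."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

pairs = context_pairs(["this", "is", "an", "example"], window=2)
# ('this', 'is'), ('this', 'an'), ('is', 'this'), ('is', 'an'), ...
```

Note that "this" and "example" never form a pair here, because they are three positions apart while the window only reaches two.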
Let’s talk about the hyperparameters…
min_count = defines the minimum threshold for the number of occurrences a word must reach before it is vectorized. In this example, a word needs to occur at least three times before the model uses it.
vector_size = represents the output vector dimensionality.
workers = defines how many threads the model will use during training.
window = represents the ‘window’ of words in each sentence that the model uses to build similarities.
epochs = defines how many times the model will train through the entire corpus.
sg = selects the training algorithm: 1 for skip-gram, 0 for CBOW (continuous bag of words).
You’ll need to try a few different sets of hyperparameters to tune the model to your liking. Although not implemented here, a hyperparameter search might be a good option.
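Although we stick with one configuration here, a simple grid search could be sketched as follows. The candidate values are illustrative, and the actual training/scoring calls are left as comments since the scoring function (`evaluate` below is hypothetical) depends on your task:

```python
from itertools import product

# Candidate values for each hyperparameter (illustrative choices).
param_grid = {
    "min_count": [2, 3, 5],
    "window": [3, 5, 7],
    "vector_size": [100, 300, 1000],
}

# Expand the grid into one dict per combination.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]
# 27 combinations; each would be passed to the model, e.g.:
# for combo in combos:
#     model = Word2Vec(corpus_list, epochs=40, sg=1, workers=4, **combo)
#     score = evaluate(model)  # task-specific, hypothetical helper
```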
We can get a good sense of whether the trained embeddings make sense by choosing some words at random and running
most_similar() on the trained vectors. These vectors can be accessed via the
.wv attribute on the trained model. By default this returns the 10 words with the highest similarity scores. Here are some examples:
print(v_model.wv.most_similar('investigation'))
# [('cops', 0.7476568818092346),
#  ...
#  ('favor', 0.5992740988731384)]
As you can see, the similar words in the examples do hold related meanings in everyday English; the nearest neighbors of ‘investigation’ show this particularly clearly.
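The scores printed by most_similar() are cosine similarities between word vectors; a minimal sketch of the metric itself, on toy vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1; orthogonal vectors score 0.
cosine_similarity([1.0, 2.0], [2.0, 4.0])  # 1.0
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 0.0
```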
One caveat with this model: if it does not train long enough, it can under-fit. If the vectors are not well defined and the words end up roughly equidistant from one another, the classifier we use in the next step will have a hard time splitting them into two groups.
Now that we have our corpus in vector form, we can cluster the word vectors into positive and negative sentiment groups using K-Means. See you in Part Three!