(Part Three)

# Recap

In the previous two posts we imported a corpus of QAnon drops, split it and extracted a vector-space embedding of the individual words.

# What Happens Now

In this part we will take our newly vectorized data and pass it into k-means clustering. Our hope is that the two clusters split the words into positive and negative sentiment. We will then use each word’s “sentiment score” to build the overall sentiment of each drop.

# K-means Implementation

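```python
from sklearn.cluster import KMeans

# random_state=1 fixes the seed so runs are reproducible
# (the original passed True, which Python treats as the integer 1)
trained_k_means = KMeans(n_clusters=2, max_iter=5, random_state=1, n_init=40).fit(corpus_vec.astype('double'))
```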

K-means is a popular unsupervised machine learning algorithm that attempts to partition the data (here, our word vectors) into the requested number of clusters.

`n_clusters` = number of clusters to divide into. (Here 2, positive and negative)

`max_iter` = maximum number of iterations the model runs for each starting point.

`n_init` = number of times the model will train with different center starting points. The run with the tightest clusters is kept.
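To make these parameters a little more concrete, here is a minimal pure-Python sketch of the k-means loop on a handful of made-up one-dimensional points. It is not scikit-learn's implementation; the function name `kmeans_1d` and the toy data are assumptions for illustration only.

```python
import random

def kmeans_1d(points, n_clusters, max_iter, n_init, seed=0):
    """Toy k-means on 1-D points; returns (sorted centers, inertia)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_init):                       # n_init: independent restarts
        centers = rng.sample(points, n_clusters)  # random starting centers
        for _ in range(max_iter):                 # max_iter: cap on refinement steps
            # Assignment step: each point joins its nearest center
            groups = [[] for _ in centers]
            for p in points:
                nearest = min(range(n_clusters), key=lambda c: abs(p - centers[c]))
                groups[nearest].append(p)
            # Update step: move each center to the mean of its group
            new_centers = [sum(g) / len(g) if g else centers[i]
                           for i, g in enumerate(groups)]
            if new_centers == centers:            # nothing moved, so stop early
                break
            centers = new_centers
        # Inertia: sum of squared distances to the nearest center
        inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
        if best is None or inertia < best[1]:     # keep the tightest clustering
            best = (sorted(centers), inertia)
    return best

# Hypothetical toy data: two obvious groups around -2 and +2
points = [-2.1, -1.9, -2.0, 1.8, 2.0, 2.2]
centers, inertia = kmeans_1d(points, n_clusters=2, max_iter=5, n_init=40)
print(centers, inertia)
```

With real word vectors the idea is the same, only the distance is measured in many more dimensions.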

# K-means Interpretation

Let’s take a look at the plot.

Since our vector space has many more dimensions than the two a plot can show, the plot might be a little hard to interpret.

We can take a look at some sample data to extract a little more meaningful information…

Let’s take the vector for each centroid from our k-means results and plug it into a word2vec method called `similar_by_vector`:

```python
# Words most similar to each of the two cluster centers
print(trained_vec.wv.similar_by_vector(trained_k_means.cluster_centers_[0], topn=100, restrict_vocab=None))
print(trained_vec.wv.similar_by_vector(trained_k_means.cluster_centers_[1], topn=100, restrict_vocab=None))
```

What’s happening here is that our trained w2v model returns the 100 words whose vectors are most similar (by cosine similarity) to each center vector. In other words, the 100 words most closely related to each center. We can use this to eyeball some sentiment interpretations between the two. We will dial this in a bit more later…
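As a rough illustration of that ranking, the sketch below scores a few words against a center vector by cosine similarity and keeps the top matches. The helper names and the two-dimensional toy vectors are made up for this example; they are not taken from the trained model, and gensim's own implementation is more involved.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar_to(vector, word_vectors, topn=2):
    """Rank every word by similarity to `vector`, highest first."""
    scored = [(word, cosine(vector, vec)) for word, vec in word_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-D "embeddings", not real model output
toy_vectors = {
    "alpha": [1.0, 0.1],
    "beta": [0.9, 0.3],
    "gamma": [-1.0, 0.2],
}
toy_center = [1.0, 0.0]

ranked = most_similar_to(toy_center, toy_vectors, topn=2)
print(ranked)
```

Words pointing in roughly the same direction as the center score close to 1, while words pointing the opposite way score below 0.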

‘Center 0’ skews fairly negative, with words such as:

`('profiteering', -0.49529905644488126), ('cheat', -0.42033156706675295), ('racists', -0.385416523053752), ('death', -0.27131969847491444), ('corruption', -0.23484911923399268)`

Whereas ‘Center 1’ returns more neutral to positive words:

`('holy', 0.6056230068206787), ('glorious', 0.3231151300025639), ('logically', 0.2628306231335882), ('happiness', 0.21238615344313627), ('cooperating', 0.605171263217926)`

Remember, these words aren’t ALL going to be very positive or very negative, at least at a glance, as the English language runs the entire gamut in between. It’s safe to assume that, taken word for word, this corpus follows a similar distribution.

Next, we need to populate a dataframe with columns for the words, the corresponding vectors, and the cluster that k-means reports for each word. We will also give the negative words a `cluster_signed` value of -1 and the positive words a `cluster_signed` value of 1. With this we can determine how positive or negative a word is by multiplying `cluster_signed` by the inverse of the word’s distance from its centroid, as returned by k-means. This gives us a signed weight: the sign tells us the sentiment and the magnitude tells us how strong it is. Note that the magnitude is not capped at 1, since the inverse of a distance smaller than 1 is larger than 1.

```python
import numpy as np
import pandas as pd

# Weight words in clusters based on distance to center and populate pandas dataframe
sentiment_corpus = pd.DataFrame(trained_vec.wv.index_to_key, columns=['key'])
sentiment_corpus['vectors'] = sentiment_corpus['key'].apply(lambda x: trained_vec.wv[x])
# predict() returns a one-element array per word, so take element [0] to get the label
sentiment_corpus['cluster'] = sentiment_corpus['vectors'].apply(
    lambda x: trained_k_means.predict([np.array(x, dtype='double')])[0])
# Cluster 0 looked negative above, cluster 1 looked positive
sentiment_corpus['cluster_signed'] = [-1 if i == 0 else 1 for i in sentiment_corpus['cluster']]

# Let's determine HOW positive or negative each word is...
# transform() returns the distance to every centroid; min() is the distance to the assigned one
sentiment_corpus['min_value'] = sentiment_corpus.apply(
    lambda x: 1 / trained_k_means.transform([np.array(x.vectors, dtype='double')]).min(), axis=1)
sentiment_corpus['distance'] = sentiment_corpus['min_value'] * sentiment_corpus['cluster_signed']
```
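To see the weighting on its own, here is a small sketch with two made-up centroids in two dimensions. The function name `signed_weight` and all of the numbers are assumptions for illustration; the real centroids live in the full word2vec space.

```python
import math

def signed_weight(vector, centers, signs):
    """Return sign * (1 / distance) using the nearest center."""
    distances = [math.dist(vector, c) for c in centers]
    nearest = distances.index(min(distances))
    return signs[nearest] * (1.0 / distances[nearest])

toy_centers = [[-1.0, 0.0], [1.0, 0.0]]  # hypothetical centroids
signs = [-1, 1]                          # cluster 0 -> negative, cluster 1 -> positive

near_positive = signed_weight([1.0, 0.5], toy_centers, signs)   # 0.5 from center 1
far_negative = signed_weight([-1.0, 4.0], toy_centers, signs)   # 4.0 from center 0
print(near_positive, far_negative)  # prints: 2.0 -0.25
```

A word close to its centroid gets a large magnitude and a word far away gets a small one. The magnitude is 1/distance, so it is larger than 1 whenever a word lies less than one unit from its centroid, and a word sitting exactly on a centroid would divide by zero.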

Let’s see how many words fall into each category…

```python
positive_word_count = len(sentiment_corpus[sentiment_corpus['cluster_signed'] == 1])
negative_word_count = len(sentiment_corpus[sentiment_corpus['cluster_signed'] == -1])
print(f"Number of positive words: {positive_word_count}")
# >> Number of positive words: 1627
print(f"Number of negative words: {negative_word_count}")
# >> Number of negative words: 1467
```

If you came in with assumptions about the sentiment of QAnon drops, you might be surprised by these numbers. One thing to keep in mind, though, is the sentiment of the English language as a whole. I am not a linguist, but I’d bet it’s a good assumption that English has more neutral-to-positive words than negative ones. Also, just because there are more positive than negative words in this corpus DOES NOT mean that each drop will aggregate to more positive than negative sentiment.

To understand the overall sentiment per drop let’s put our generated weights back into full drop form.

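To do this we will generate a dataframe with the individual word weights in one column and the aggregate score for the whole drop in a second column.

```python
# Assumes score_df was not already created in an earlier step
score_df = pd.DataFrame()
# For every token in a drop, look up its signed weight (an empty array if the word is not in the vocabulary)
score_df['split'] = corpus_df['text'].map(
    lambda x: [sentiment_corpus[sentiment_corpus['key'] == j]['distance'].values for j in x])
# Drop the tokens that word2vec never vectorized
score_df['split'] = score_df['split'].map(lambda x: [j for j in x if j.size > 0])
# Unwrap each one-element array into a plain number
score_df['split'] = score_df['split'].map(lambda x: [j[0] for j in x])
# The drop's score is the sum of its word weights
score_df['line_score'] = score_df['split'].map(lambda x: sum(x))
```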

An important gotcha for this step rears its head from way back in Part 2. Our w2v model skipped vectorizing words that appeared fewer than 3 times. Therefore, when we replace the words in our corpus with sentiment values, some words will have no value. We can easily remedy this by removing those empty entries before totaling up the values.

`score_df['split'] = score_df['split'].map(lambda x: [j for j in x if j.size > 0])`
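The same idea can be sketched without pandas: look each token up in a plain dictionary, skip the ones that have no weight, and sum the rest. The function `score_drop` and the toy weights below are hypothetical. In the real pipeline such a dictionary could be built with something like `dict(zip(sentiment_corpus['key'], sentiment_corpus['distance']))`, which is an assumption on my part and not what the post's code does.

```python
def score_drop(tokens, word_weights):
    """Sum the weights of known tokens; return (score, skipped tokens)."""
    known = [word_weights[t] for t in tokens if t in word_weights]
    skipped = [t for t in tokens if t not in word_weights]
    return sum(known), skipped

# Hypothetical weights; "rareword" stands in for a word the model never vectorized
toy_weights = {"alpha": 0.5, "beta": -0.25, "gamma": -0.5}
toy_drop = ["alpha", "beta", "rareword", "gamma"]

score, skipped = score_drop(toy_drop, toy_weights)
print(score, skipped)  # prints: -0.25 ['rareword']
```

Returning the skipped tokens as well makes it easy to check how much of each drop is being ignored.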

# QAnon Drop Insight

Here’s the part where everything comes together. We have a dataframe populated with the sentiment values for each QAnon drop. Let’s print some figures and plot…

```python
# Total
positive_drops_count = len(score_df[score_df['line_score'] > 0])
negative_drops_count = len(score_df[score_df['line_score'] < 0])
print(f"Number of positive drops: {positive_drops_count}")
# >> Number of positive drops: 1283
print(f"Number of negative drops: {negative_drops_count}")
# >> Number of negative drops: 1777
```

As you can see, even though we had more positive words than negative, we had more negative drops than positive (1777 versus 1283). Some of you with preconceptions were right. If we take a look at the histograms, you can see that there are relatively few words with VERY strong sentiment. You can also see a wider spread among the negatively leaning words.

Overall, the sentiment was close. Closer than I assumed at the beginning of this project. Since this was quick and dirty, I’m sure these models could be tuned more.

# Next For You

Go crazy: tune your hyperparameters and validate your results against known labeled data! Dive into some metrics. Use your own corpus of something completely unrelated!

# Next For Me

I’d like to run the model against an uncleaned (or less cleaned) dataset to attempt to capture more of the “QAnon-ness”.

I need to run some sort of validation against the models. Namely comparing the k-means results against a known corpus of positive and negative words.
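A first pass at that comparison could look something like the sketch below: take the sign k-means assigned to each word, look the word up in a labeled lexicon, and report how often the two agree. Everything here is hypothetical, including the function name, the placeholder words, and the labels.

```python
def agreement(predicted_signs, lexicon):
    """Return (fraction of matching signs, number of words compared)."""
    shared = [w for w in lexicon if w in predicted_signs]
    if not shared:
        return 0.0, 0
    hits = sum(1 for w in shared if predicted_signs[w] == lexicon[w])
    return hits / len(shared), len(shared)

# Hypothetical data: 1 = positive, -1 = negative
toy_predicted = {"w1": 1, "w2": 1, "w3": -1, "w4": -1, "w5": 1}
toy_lexicon = {"w1": 1, "w2": -1, "w3": -1, "w4": -1, "w6": 1}

rate, n_compared = agreement(toy_predicted, toy_lexicon)
print(rate, n_compared)  # prints: 0.75 4
```

Because k-means numbers its clusters arbitrarily, an agreement rate well below 0.5 would suggest the positive and negative labels are flipped, not that the clusters are meaningless.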

I’d like to see if there was positive or negative movement in the stock market on positive and negative drop days.