Twitter Sentiment Analysis Using Viterbi Algorithm¶
- Note: the code below is not directly executable. It consists of excerpts from the project, which was built in conjunction with two other classmates.
Overview¶
In the SUTD Machine Learning class we were assigned the task of analyzing the sentiment of tweets in two different languages using the Viterbi Algorithm. The data set consisted of about 200 tweets from each language. In the training data set each word came with a corresponding tag indicating the sentiment of the word. The only packages we were allowed to use were basic math packages such as math and numpy. Below are excerpts of explanations from our report on the project.
The process of "training" and "predicting" involves estimating the probabilities between words and the possible tags, as well as the probabilities of one tag following another. These probabilities are then used to decode a test data set with the Viterbi algorithm. This process is discussed in more depth below, along with some of the ways we chose to improve the accuracy of the algorithm.
Emission and Transmission Parameters¶
Emission parameters are the probabilities of observing a certain word given a sentiment label. Estimating them involves going through all of the training data and counting the number of times a tag emits a word, then dividing by the number of times that tag occurs (plus a smoothing constant k that reserves probability for unseen words). See the formula below.
$$ e(x|y)= \left\{ \begin{array}{ll} \frac{Count(y\rightarrow x)}{Count(y) + k} & \text{if word token } x \text{ appears in the training set} \\ \frac{k}{Count(y) + k} & \text{if word token } x \text{ is the special token \#UNK\#} \\ \end{array} \right. $$
Transmission parameters are the probabilities of transitioning from one sentiment label to the next. This probability is calculated using the formula below.
$$q(y_i | y_{i-1}) = \frac{Count(y_{i-1},y_i)}{Count(y_{i-1})}$$
# Excerpt of a function written to find emission parameters
from collections import defaultdict

def findEmissionParams(tags, k=1):
    """
    Emission parameter estimation using MLE
    Input:
    * tags: a dictionary of tags that provides a count of how often each word occurred under a given tag
    * k: smoothing parameter for the #UNK# token
    Returns:
    * eParams: a dictionary of tags mapping each word to the ratio between its count and the tag's
      total count, e.g. {"O": {"hi": 0.5, "#UNK#": 0.5}}
    """
    # Unseen (tag, word) pairs default to a tiny probability instead of zero
    eParams = defaultdict(lambda: defaultdict(lambda: 1e-10))
    for tag, tagData in tags.items():  # for each tag in the training data
        for word, wordCount in tagData['words'].items():  # emission = wordCount / (totalCount + k)
            eParams[tag][word] = wordCount / (tagData['count'] + k)
        eParams[tag]['#UNK#'] = k / (tagData['count'] + k)  # reserve mass for the #UNK# token under each tag
    return eParams
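The transition probabilities $q(y_i | y_{i-1})$ from the formula above were estimated with a similar counting approach. That excerpt is not reproduced here, but a minimal sketch of the idea could look like the following, where findTransitionParams and its tagSequences input (a list of tag sequences, one per tweet) are hypothetical names rather than the project's actual code:

from collections import defaultdict

def findTransitionParams(tagSequences):
    """
    Hypothetical sketch of transition parameter estimation using MLE.
    tagSequences: list of tag sequences, e.g. [["O", "B-positive", "O"], ...]
    Returns: nested dict tParams[y_prev][y_curr] = Count(y_prev, y_curr) / Count(y_prev)
    """
    transitionCounts = defaultdict(lambda: defaultdict(int))
    tagCounts = defaultdict(int)
    for sequence in tagSequences:
        # Pad each tweet with START and STOP so those transitions are counted too
        padded = ['START'] + sequence + ['STOP']
        for prevTag, currTag in zip(padded, padded[1:]):
            transitionCounts[prevTag][currTag] += 1
            tagCounts[prevTag] += 1
    # Unseen transitions default to a tiny probability instead of zero
    tParams = defaultdict(lambda: defaultdict(lambda: 1e-10))
    for prevTag, nextCounts in transitionCounts.items():
        for currTag, count in nextCounts.items():
            tParams[prevTag][currTag] = count / tagCounts[prevTag]
    return tParams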
The Viterbi Algorithm¶
The key definitions in the algorithm are hidden states and observations. The emission parameters give the probability that a given hidden state outputs a given observation, and the transmission parameters give the probability that hidden state ${y_i}$ follows hidden state ${y_{i-1}}$. In our case the observation is a word in a tweet and the hidden state is the sentiment label associated with it. The Viterbi Algorithm is a dynamic programming algorithm that leverages these two sets of probabilities to find the most likely sequence of hidden states (tags) given a sequence of observations (a tweet). An in-depth analysis of the algorithm is not given here, but a function from the project is provided to show our implementation.
import numpy as np

def viterbi(observed_data, trans_params, emi_params, words):
    # observed_data: list of words in a tweet
    # trans_params / emi_params: transition and emission probability tables
    # words: the set of words seen in training (anything else is treated as #UNK#)
    #
    # Outline:
    # 1. Fill a Viterbi table delta (n x num_labels) with the best log-probability
    #    of reaching each label at each position.
    # 2. Fill a backpointer table psi (n x num_labels) with the previous label on
    #    that best path, used later to trace the maximum-probability path back.
    # 3. For every word t and label s, take the max over all previous labels of
    #    delta[t-1][s_prev] + log(trans[s_prev][s]) + log(emi[s][word_t]).
    # 4. Handle the transition of the last word to STOP, then backtrack to recover
    #    the most likely label sequence.
    labels = list(emi_params.keys())
    n = len(observed_data)
    num_labels = len(labels)
    # Viterbi table: the best (log) probability of each label at each position
    delta = [{state: 0.0 for state in labels} for word in observed_data]
    # Backpointer table: the previous label associated with each entry of delta
    psi = [{state: None for state in labels} for word in observed_data]
    # Initialization: transition out of START plus emission of the first word
    for l in labels:
        if observed_data[0] not in words:
            delta[0][l] = np.log(trans_params['START'][l]) + np.log(emi_params[l]['#UNK#'])
        else:
            delta[0][l] = np.log(trans_params['START'][l]) + np.log(emi_params[l][observed_data[0]])
        psi[0][l] = 'START'
    # Recursion: fill in the rest of the table
    for t in range(1, n):
        for s in range(num_labels):
            probability_lst = []
            for prev_s in range(num_labels):
                # NOTE: missing entries in trans_params / emi_params must be handled;
                # here they are assumed to contain every needed key or to be defaultdicts
                # with a small default probability
                if observed_data[t] not in words:
                    probability_lst.append(float(
                        delta[t - 1][labels[prev_s]] + np.log(trans_params[labels[prev_s]][
                            labels[s]]) + np.log(emi_params[labels[s]]['#UNK#'])))
                else:
                    probability_lst.append(float(
                        delta[t - 1][labels[prev_s]] + np.log(trans_params[labels[prev_s]][
                            labels[s]]) + np.log(emi_params[labels[s]][observed_data[t]])))
            # Keep the max over all previous labels and remember which label achieved it
            delta[t][labels[s]] = max(probability_lst)
            psi[t][labels[s]] = labels[np.argmax(probability_lst)]
    # Termination: transition from the last word to STOP
    probability_lst = []
    for prev_s in range(num_labels):
        probability_lst.append(
            float(delta[n - 1][labels[prev_s]] + np.log(trans_params[labels[prev_s]]['STOP'])))
    # Backtrace with psi to recover the most likely sequence
    sequence = [None for i in range(n)]
    sequence[n - 1] = labels[np.argmax(probability_lst)]
    for t in range(n - 2, -1, -1):
        sequence[t] = psi[t + 1][sequence[t + 1]]
    return sequence
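Below is a small, hypothetical usage example of the function above. The labels, probabilities, and vocabulary are made up for illustration and are not taken from our data set:

from collections import defaultdict

# Toy parameters for a two-label tag set, purely illustrative
trans_params = {
    'START': {'O': 0.7, 'B-positive': 0.3},
    'O': {'O': 0.6, 'B-positive': 0.2, 'STOP': 0.2},
    'B-positive': {'O': 0.5, 'B-positive': 0.1, 'STOP': 0.4},
}
# Emission tables default to a tiny probability for unseen (tag, word) pairs,
# matching the defaultdict produced by findEmissionParams
emi_params = defaultdict(lambda: defaultdict(lambda: 1e-10))
emi_params['O'].update({'the': 0.5, 'movie': 0.4, '#UNK#': 0.1})
emi_params['B-positive'].update({'great': 0.8, '#UNK#': 0.2})
words = {'the', 'movie', 'great'}  # training vocabulary; 'was' will be treated as #UNK#

print(viterbi(['the', 'movie', 'was', 'great'], trans_params, emi_params, words))
# -> ['O', 'O', 'O', 'B-positive'] with these toy numbers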
Improvements¶
Second Order Viterbi Algorithm¶
In addition to implementing first order Viterbi, we were also tasked with implementing second order Viterbi. The difference lies in the transmission parameters and in how they are decoded. The transmission parameters become a three-dimensional matrix covering every possible combination of two preceding tags and the current tag, populated with the probabilities that those combinations occur, i.e. $q(y_i | y_{i-2}, y_{i-1})$. In the decoding stage a whole new dimension is added to keep track of the best path over the previous two states: the best path through the immediately previous state may no longer be the best once you look back two steps. A sketch of the second-order transition estimates is given below.
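The following is a minimal, illustrative sketch of second-order transition estimation; findSecondOrderTransitionParams and its tagSequences input are hypothetical names, not the project's actual code:

from collections import defaultdict

def findSecondOrderTransitionParams(tagSequences):
    """
    Hypothetical sketch of second-order transition estimation:
    q(y_i | y_{i-2}, y_{i-1}) = Count(y_{i-2}, y_{i-1}, y_i) / Count(y_{i-2}, y_{i-1})
    tagSequences: list of tag sequences, one per tweet
    """
    tripleCounts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    pairCounts = defaultdict(lambda: defaultdict(int))
    for sequence in tagSequences:
        # Pad with two STARTs so the first real tag has a full history, and end with STOP
        padded = ['START', 'START'] + sequence + ['STOP']
        for y2, y1, y0 in zip(padded, padded[1:], padded[2:]):
            tripleCounts[y2][y1][y0] += 1
            pairCounts[y2][y1] += 1
    # Unseen triples default to a tiny probability instead of zero
    tParams = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 1e-10)))
    for y2, inner in tripleCounts.items():
        for y1, nextCounts in inner.items():
            for y0, count in nextCounts.items():
                tParams[y2][y1][y0] = count / pairCounts[y2][y1]
    return tParams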
Smoothing¶
Various smoothing methods were tested on our data. Laplace smoothing, Good-Turing smoothing, and Jelinek-Mercer smoothing were all tested on the emission parameters. The best smoother for our data is discussed in the results. Each of these smoothers is a different way to smooth the frequency statistics and reduce noise that may be present in the data. An example is provided below.
"""
Jelinek Mercer Smoothing
"""
def findEmissionParamsJelinekMercer(eParams, lambdaVal):
smoothedParams = defaultdict(lambda: defaultdict(int))
# Compute the total count of all words in the corpus
totalWords = sum([sum(tagData.values()) for tagData in eParams.values()])
# Calculate the Jelinek-Mercer adjusted probabilities
for tag, tagData in eParams.items():
for word, count in tagData.items():
# Compute the background probability of the word
bgProb = sum(eParams[tag2].get(word, 0) for tag2 in eParams.keys()) / totalWords
# Calculate the Jelinek-Mercer adjusted probability
smoothedParams[tag][word] = (1 - lambdaVal) * (count / sum(tagData.values())) + lambdaVal * bgProb
return smoothedParams
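Laplace (add-k) smoothing, which we also tested on the emission parameters, can be sketched along the same lines. The function below is an illustrative assumption that mirrors the tags input format of findEmissionParams, not the exact code we used:

from collections import defaultdict

def findEmissionParamsLaplace(tags, k=1.0, vocabSize=None):
    """
    Hypothetical sketch of Laplace (add-k) smoothed emission parameters:
    e(x|y) = (Count(y -> x) + k) / (Count(y) + k * (V + 1))
    tags: same structure as in findEmissionParams, e.g. {"O": {"count": 10, "words": {"hi": 4}}}
    vocabSize: size of the training vocabulary V; the +1 reserves mass for #UNK#
    """
    if vocabSize is None:
        vocabSize = len({w for tagData in tags.values() for w in tagData['words']})
    eParams = defaultdict(lambda: defaultdict(lambda: 1e-10))
    for tag, tagData in tags.items():
        denom = tagData['count'] + k * (vocabSize + 1)
        for word, wordCount in tagData['words'].items():
            eParams[tag][word] = (wordCount + k) / denom
        eParams[tag]['#UNK#'] = k / denom  # probability reserved for unseen words under this tag
    return eParams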
Stop Word Removal¶
Stop word removal operates under the belief that commonly used words such as 'the', 'a', 'this', 'that' and many others are noise in the data. Upon looking at our data set we realized that Twitter handles are all unique and always emit the 'O' tag, which stands for 'out', i.e. irrelevant to sentiment. This caused a lot of noise in the #UNK# emission parameter because every handle is unique. So we decided to remove them from the training data and, in testing, to automatically classify them as 'O'. These tokens were then not included in the Viterbi decoding process. Similar things were done to punctuation and other Twitter-specific text. Our stop word removal list was therefore custom rather than a standard stop word list.
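A minimal sketch of this kind of filtering, assuming a hypothetical isNoiseToken helper (our actual patterns are not reproduced here):

import string

def isNoiseToken(token):
    """Hypothetical check for tokens treated as noise: handles, URLs, and pure punctuation."""
    if token.startswith('@') or token.startswith('http'):
        return True
    return all(ch in string.punctuation for ch in token)

def stripNoise(tweetTokens):
    """Remove noise tokens before training; at test time such tokens are tagged 'O' directly."""
    return [tok for tok in tweetTokens if not isNoiseToken(tok)]

print(stripNoise(['@user123', 'this', 'movie', 'is', 'great', '!!!']))
# -> ['this', 'movie', 'is', 'great']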
Lemmatization¶
Lemmatization is another method that can help improve the emission parameters in some cases. It is the process of reducing words to their base form, or lemma. For example, 'ran' and 'running' would both become 'run'. It also removes pluralization, so 'words' becomes just 'word'. This is another form of noise reduction. We implemented this with the pre-existing lemmatizer in the spaCy Python library, which provides pretrained pipelines for both English and French.
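A sketch of how the spaCy lemmatizer can be used for this; the model names are spaCy's standard small pretrained pipelines and must be downloaded separately (e.g. with python -m spacy download en_core_web_sm):

import spacy

# Small pretrained pipelines for English and French
nlp_en = spacy.load("en_core_web_sm")
nlp_fr = spacy.load("fr_core_news_sm")

def lemmatize(tweet, nlp):
    """Return the lemma of each token in the tweet."""
    return [token.lemma_ for token in nlp(tweet)]

print(lemmatize("the dogs were running", nlp_en))
# typically -> ['the', 'dog', 'be', 'run']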
Results¶
The final pipeline was fine-tuned to get the best output. For English the best results came from using lemmatization, stop word removal, Jelinek-Mercer smoothing with k = 0.1, and first order Viterbi. For French the best results came from using lemmatization, stop word removal, Jelinek-Mercer smoothing with k = 0.01, and second order Viterbi. With just a small data set we were able to achieve a precision of 71% in English and 64% in French.