EmbedRank.

Can word importance be derived from a word embedding?

Introduction

This exploration was motivated by work with Intuit through Machine Learning at Berkeley. The goal was to investigate novel methods for keyword extraction using a word embedding space. EmbedRank is inspired by TextRank, which is itself inspired by Google's famous PageRank.

Word Embeddings

Representing words as vectors is a key step before performing any further statistical analysis or natural language processing. The simplest representation of a word is simply a one-hot encoded vector over the entire dictionary. While this can be surprisingly useful in a lot of contexts, there is an obvious motivation in NLP to represent words as vectors that encode both syntactic and semantic information. While there are more recent embedding methods like GloVe, I'll briefly explain the most common embedding method, word2vec.

word2vec

At a high level, word2vec can be thought of as a single hidden layer "autoencoder", where instead of mapping a word to itself, we map it to surrounding context words (or vice versa). If we represent our words as one-hot encoded dictionary vectors, we can learn to represent these words as the outputs of the hidden layer in two ways:

  1. Continuous bag of words (CBOW): Using surrounding context words to predict a center target word.
  2. Skip-gram: Using a center word to predict surrounding context words.

In deciding "surrounding" words, we select a window-size hyperparameter. For example, a window size of 2 would use 2 words before and 2 words after the center word for training.
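To make the window concrete, here is a minimal sketch in plain Python (the toy sentence is made up for illustration) that generates skip-gram style (center, context) training pairs with a window size of 2:

```python
# Minimal sketch: generating (center, context) skip-gram training pairs
# for a window size of 2. The toy sentence is made up for illustration.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    # Context words are up to `window` positions before and after the center.
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```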

skip-gram and CBOW
Skip-gram and CBOW models.

In the graphic above, we would have a window size of C/2. The left model represents the skip-gram and the right is CBOW. While word2vec learns the weights of both W and W', we ultimately only care about W. Given a dictionary of size V, W takes in a V-dimensional one-hot word vector and maps it to some lower, N-dimensional space. This N-dimensional space, commonly set at N = 300, is our word vector representation!

There is slightly more nuance to the training of these models: CBOW relies on a naive Bayes assumption of surrounding words, and given that V can run into the millions, only a random subset of word weights are updated at each step using a nice trick called negative sampling. Further details on this can be found here.
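In practice these details are handled by existing libraries. As a rough sketch, assuming gensim (version 4 or later) is available, a skip-gram model with negative sampling could be trained along these lines; the corpus here is a toy placeholder:

```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences (illustrative only).
corpus = [
    ["cash", "flow", "statement", "for", "small", "business"],
    ["track", "business", "expenses", "and", "invoices"],
]

# sg=1 selects skip-gram (sg=0 would be CBOW); negative=5 enables
# negative sampling with 5 noise words per positive example.
model = Word2Vec(
    corpus,
    vector_size=300,  # N, the embedding dimension
    window=2,         # context window size
    sg=1,
    negative=5,
    min_count=1,
)

vector = model.wv["business"]  # the learned 300-dimensional word vector
```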

Keyword Extraction

I'll briefly cover three frequency- and co-occurrence-based keyword extraction methods: sum tf-idf, Rapid Automatic Keyword Extraction, and TextRank.

Sum tf-idf

Term frequency - inverse document frequency (tf-idf) is a common metric for representing the importance of a word to a document. Document-level vectors of tf-idf are commonly used for a lot of text mining and NLP related tasks.

Term frequency

Given a term t and document d containing a total of n words, of which n_t are occurrences of t, the simplest metric for term frequency is the relative frequency:

\[TF(t,d) = \frac{n_t}{n} \]

To reduce issues related to longer documents, it can be more effective to use an "augmented frequency" with a hyperparameter K (often K = 0.5), which scales a word by its count relative to that of the most frequent word in the document:

\[TF(t,d) = K + (1-K) \frac{n_t}{\max_{t'} n_{t'}}\]

Inverse document frequency

For some corpus C containing N documents, we calculate inverse document frequency as follows:

\[ IDF(t,C) = \log \frac{N}{\sum_{d \in C} 1(t \in d)}\]
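To make the two pieces concrete, here is a minimal sketch in plain Python computing the relative-frequency TF and the IDF above; the documents are toy examples:

```python
import math
from collections import Counter

# Toy corpus of tokenized documents (illustrative only).
docs = [
    ["cash", "flow", "cash", "report"],
    ["expense", "report", "summary"],
    ["cash", "expense", "tracking"],
]

def tf(term, doc):
    """Relative frequency TF(t, d) = n_t / n."""
    counts = Counter(doc)
    return counts[term] / len(doc)

def idf(term, corpus):
    """IDF(t, C) = log(N / number of documents containing t)."""
    n_docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_docs_with_term)

print(tf("cash", docs[0]))  # 0.5
print(idf("cash", docs))    # log(3/2) ~= 0.405
```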

From tf-idf to keyword extraction

The metric can be thought of as a way to find the most relevant "rare" words to a document by penalizing commonly occurring words in the entire corpus. While mostly used to identify document keywords, or represent documents as vectors, we can do fairly simple corpus-level keyword extraction by summing the tf-idfs over every document!
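As a rough sketch of this corpus-level extraction, assuming scikit-learn's TfidfVectorizer is available (with a toy placeholder corpus), we can compute the tf-idf matrix and sum each term's column over all documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only).
corpus = [
    "cash flow statement for the small business",
    "tracking business expenses and invoices",
    "quarterly cash flow and expense report",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)  # shape: (n_docs, n_terms)

# Sum tf-idf over every document and rank terms by the total.
scores = np.asarray(tfidf.sum(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
ranking = sorted(zip(terms, scores), key=lambda x: -x[1])

print(ranking[:5])  # top corpus-level keywords by summed tf-idf
```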

RAKE

Rapid Automatic Keyword Extraction (RAKE) is an extremely simple and efficient method for identifying document keywords. From the paper:

The input parameters for RAKE comprise a list of stop words (or stoplist), a set of phrase delimiters, and a set of word delimiters. RAKE uses stop words and phrase delimiters to partition the document text into candidate keywords, which are sequences of content words as they occur in the text. Co-occurrences of words within these candidate keywords are meaningful and allow us to identify word cooccurrence without the application of an arbitrarily sized sliding window.

After every candidate keyword is identified and the graph of word co-occurrences is complete, a score is calculated for each candidate keyword and defined as the sum of its member word scores. We evaluated several metrics for calculating word scores, based on the degree and frequency of word vertices in the graph: (1) word frequency (freq(w)), (2) word degree (deg(w)), and (3) ratio of degree to frequency (deg(w)/freq(w)). The score for each candidate keyword is computed as the sum of its member word scores.
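Below is a minimal, simplified sketch of this scoring scheme (not the reference implementation), using a toy stoplist and document and deg(w)/freq(w) as the word score:

```python
import re
from collections import defaultdict

# Toy inputs (illustrative only).
text = "Rapid keyword extraction finds important keywords. Stop words split candidate phrases."
stopwords = {"finds", "words", "split"}

# 1. Split the text into candidate phrases on stop words and punctuation.
candidates = []
for chunk in re.split(r"[.,;!?]", text.lower()):
    phrase = []
    for word in chunk.split():
        if word in stopwords:
            if phrase:
                candidates.append(phrase)
            phrase = []
        else:
            phrase.append(word)
    if phrase:
        candidates.append(phrase)

# 2. Word scores: degree (co-occurrences within candidates) over frequency.
freq, degree = defaultdict(int), defaultdict(int)
for phrase in candidates:
    for word in phrase:
        freq[word] += 1
        degree[word] += len(phrase)  # counts co-occurrence with itself as well

word_score = {w: degree[w] / freq[w] for w in freq}

# 3. Candidate keyword score = sum of its member word scores.
phrase_scores = {" ".join(p): sum(word_score[w] for w in p) for p in candidates}
print(sorted(phrase_scores.items(), key=lambda x: -x[1]))
```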

TextRank

TextRank is based on running PageRank on a word co-occurrence matrix of candidate words (i.e. after removing stop words or other preprocessing steps). First, we make this co-occurrence matrix S row-stochastic by scaling each row, so that in this "word digraph" the entry S(i,j) gives the probability of following the edge (i,j) when you are at node/word i. We also introduce a damping factor alpha, which represents the proportion of time spent following these edges (vs. randomly jumping to another node). This allows us to create a stochastic, aperiodic and irreducible matrix G, where e is the all-ones vector:

\[ G = \alpha S + (1-\alpha) \frac{1}{n} ee^T\]

The TextRank solution is a row vector v, which represents the proportion of time a "random surfer" of this digraph would spend at each word. It is the stationary distribution of G, found in practice by power iteration (which usually converges quite quickly), and rankings can then be read off from v:

\[ v = vG \]
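As a small illustration with numpy, we can build G from an arbitrary row-stochastic matrix S and run the power iteration until v stops changing:

```python
import numpy as np

# Arbitrary small row-stochastic co-occurrence matrix S (illustrative only).
S = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.3, 0.0, 0.3, 0.4],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])
n = S.shape[0]
alpha = 0.85

# G = alpha * S + (1 - alpha) * (1/n) * e e^T
G = alpha * S + (1 - alpha) * np.ones((n, n)) / n

# Power iteration: v <- v G, starting from the uniform distribution.
v = np.full(n, 1.0 / n)
for _ in range(100):
    v_next = v @ G
    if np.allclose(v_next, v, atol=1e-10):
        break
    v = v_next

print(v)  # stationary distribution: the TextRank score of each word
```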

EmbedRank

This novel method builds on embedding models and PageRank to establish word importance rankings as follows:

  1. Construct a word embedding using a model such as word2vec or GloVe.
  2. Narrow the dictionary to some candidate list (i.e. by removing stop words, or retaining only nouns/adjectives).
  3. Build a similarity matrix between candidate words using either cosine similarity or a kernel, e.g. an RBF kernel with a gamma hyperparameter that controls how quickly edge weights in the digraph decay with distance. Both are convenient because they produce bounded, similarity-like weights with a probabilistic flavor.
  4. Optionally set weights to 0 depending on some similarity threshold, to enforce more sparsity (inspired by the actual PageRank matrix).
  5. Establish row-stochasticity by scaling each row to sum to 1, either by dividing elements by the row sum or by using a generalized softmax with temperature tau (see the formulas below).
  6. Run PageRank on the resulting row-stochastic matrix, as in TextRank, to obtain the word importance ranking.

Given embedded word vectors V and similarity matrix S, we define the cosine similarity, the RBF kernel, and our row-wise softmax below (the softmax excludes zero entries, including the diagonal):

\[ \cos(\theta_{i,j}) = \frac{V_i \cdot V_j}{||V_i||\,||V_j||}\]

\[ K(V_i,V_j) = \exp(-\gamma ||V_i - V_j||^2)\]

\[ S'_{i,j} = \frac{1(S_{i,j}>0)\, \exp(S_{i,j}/\tau)}{\sum_{j'} 1(S_{i,j'}>0)\, \exp(S_{i,j'}/\tau)} \]
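Putting the steps together, here is a hedged end-to-end sketch, assuming the candidate word vectors are already available as rows of a numpy array V; it uses the RBF kernel, thresholding, the temperature softmax, and then PageRank-style power iteration, with arbitrary hyperparameter values:

```python
import numpy as np

def embed_rank(V, gamma=1.0, threshold=0.1, tau=0.5, alpha=0.85, iters=100):
    """Rank candidate words from their embedding vectors (rows of V)."""
    n = V.shape[0]

    # 3. RBF-kernel similarity between every pair of candidate words.
    sq_dists = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    S = np.exp(-gamma * sq_dists)
    np.fill_diagonal(S, 0.0)  # exclude self-similarity

    # 4. Optional sparsification: drop weak edges below the threshold.
    S[S < threshold] = 0.0

    # 5. Row-wise softmax with temperature tau over the nonzero entries.
    mask = S > 0
    expS = np.where(mask, np.exp(S / tau), 0.0)
    row_sums = expS.sum(axis=1, keepdims=True)
    safe = np.where(row_sums > 0, row_sums, 1.0)
    S = np.where(row_sums > 0, expS / safe, 1.0 / n)  # empty rows fall back to uniform

    # 6. PageRank-style power iteration on G = alpha*S + (1-alpha)/n * ee^T.
    G = alpha * S + (1 - alpha) / n
    v = np.full(n, 1.0 / n)
    for _ in range(iters):
        v = v @ G
    return np.argsort(-v)  # word indices, highest-ranked first

# Usage with a dummy 2D embedding (the coordinates are meaningless).
rng = np.random.default_rng(0)
V = rng.normal(size=(10, 2))
print(embed_rank(V))
```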

Overall, this method can be tuned to explore the tradeoff between how closely a word is associated with its nearer neighbors, and how centered it is relative to the entire embedding. This tradeoff will be influenced by the choice of kernel and hyperparameters, the thresholding and the row-wise scaling (e.g. softmax temperature). While I will aim to benchmark this method in the future, for now we are testing it with domain-specific keywords in personal finance and small business. With the dummy embedding generated in 2D space below (the relationships don't actually mean anything in this case), the following ranks were obtained:

Dummy 2D embedding
A dummy 2D embedding with the results of EmbedRank.

From this simple example, a few preliminary observations can be made:

  1. Words on the extremes of the embedded space, and words that are further away from others, are given the lowest rank.
  2. The top-ranked words appear closest to the center of the embedding.
  3. Stronger clusters of neighbors appear to improve rank.

While it's early days, I'm looking forward to exploring this method further, and hopefully arriving at results similar to those of the other methods mentioned earlier.