Introduction

Natural Language Processing (NLP) is a branch of machine learning that analyzes, understands, and generates human language, often to aid decision making. NLP has a wide range of applications, including chatbots, sentiment analysis, machine translation, and speech recognition.

  • Sentiment analysis involves analyzing the emotional tone of a piece of text to determine whether it is positive, negative, or neutral (for example, whether people like or dislike a product).

  • Named Entity Recognition: this involves identifying and classifying named entities in a sentence. For example, a particular word can be a person, organization, date, or location.

  • Part-of-speech tagging: this tags each word in a sentence with its part of speech (noun, verb, adjective, and so on), which helps in understanding the structure and meaning of the whole sentence.

  • Text classification: this covers tasks such as classifying an email as spam or not spam.

  • Machine translation: this translates text from one language to another automatically, or assists a human translator with suggestions.

  • Chatbots: these converse with users, often on narrow topics.

Text preprocessing

Common preprocessing steps include lowercasing all words and removing punctuation.

Tokenization

In general, preprocessing for an NLP task involves breaking text down into individual tokens. A token is a unit of text such as a word, a subword piece, or a punctuation mark. The process of breaking text into tokens is called tokenization. After tokenization, we convert tokens into numbers, a step called vectorization. A word can be represented as a vector of multiple values, where each value is the word’s coordinate along one dimension of the latent space.
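As a minimal sketch of tokenization followed by the simplest possible conversion to numbers, the snippet below splits a sentence with a regular expression and maps each distinct token to an integer id. The function names `tokenize` and `build_vocab` are made up for illustration, and real tokenizers (such as the NLTK one used later) handle many more cases.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


tokens = tokenize("The fox jumps over the cat.")
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]

print(tokens)  # ['the', 'fox', 'jumps', 'over', 'the', 'cat', '.']
print(ids)     # [0, 1, 2, 3, 0, 4, 5]
```

Integer ids like these are only a lookup key. The dense vectors described above are learned separately, one vector per id.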

Stemming

Stemming truncates words to their stem. For example, ‘going’ is reduced to the stem ‘go’. Irregular forms such as ‘gone’ are usually left unchanged by rule-based stemmers; mapping them to ‘go’ is the job of lemmatization. Before stemming a document, we need to break it into sentences and then into words. We can then choose to remove stop words (words that occur often but don’t convey much meaning, such as ‘the’ and ‘and’), lowercase all the words, or remove punctuation. In some cases these preprocessing steps hurt the model. For example, in a question-answering context the question mark signals the end of a question, so removing it can confuse the model.
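The trade-off around punctuation can be sketched as a filter that drops stop words and punctuation but lets the caller keep selected marks such as the question mark. The tiny `STOP_WORDS` set and the `filter_tokens` helper are assumptions made for this example, not a standard list or API.

```python
import string

# Tiny illustrative stop word list (an assumption, not a standard list)
STOP_WORDS = {"the", "and", "a", "an", "of", "is"}
PUNCTUATION = set(string.punctuation)


def filter_tokens(tokens, keep_punct=frozenset()):
    """Lowercase tokens, drop stop words, and drop punctuation not in keep_punct."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in STOP_WORDS:
            continue
        if low in PUNCTUATION and low not in keep_punct:
            continue
        kept.append(low)
    return kept


question = ["Is", "the", "fox", "fast", "?"]
print(filter_tokens(question))                    # ['fox', 'fast']
print(filter_tokens(question, keep_punct={"?"}))  # ['fox', 'fast', '?']
```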

N-gram handling

Sometimes words co-occur as fixed sequences called n-grams. For example, ‘the United States’ can be treated as an n-gram of length three (a trigram).
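A minimal sketch of extracting every n-gram from a token list is shown below. The helper name `ngrams` is chosen for this example, and deciding which n-grams deserve to be merged into a single token (as gensim’s `Phrases` does later) is a separate scoring step that is not shown here.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens as tuples, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["the", "united", "states", "is", "large"]
trigrams = ngrams(tokens, 3)

print(trigrams[0])    # ('the', 'united', 'states')
print(len(trigrams))  # 3
```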

One hot encoding

We collect the vocabulary of unique words from the documents (pooling them all together). Then we encode the sentence of interest as a sparse matrix of 0s and 1s, with each row corresponding to one word in the vocabulary and each column to one position in the sentence. In each row, a 1 marks the positions where that word appears, and all other entries are 0. For example:

The fox jumps over the cat.

The: 1 0 0 0 1 0

fox: 0 1 0 0 0 0

jumps: 0 0 1 0 0 0

...

A value of 1 indicates the presence of a word (row) at a particular position (column) in the sentence.
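The table above can be reproduced with a short sketch. Here the vocabulary is taken from the sentence itself rather than from a larger corpus, which is a simplification, and `one_hot_matrix` is a name chosen for this example.

```python
def one_hot_matrix(sentence_tokens):
    """Return (vocab, matrix): one row per vocabulary word, one column per position."""
    vocab = []
    for tok in sentence_tokens:
        if tok not in vocab:
            vocab.append(tok)
    matrix = {
        word: [1 if tok == word else 0 for tok in sentence_tokens]
        for word in vocab
    }
    return vocab, matrix


tokens = "the fox jumps over the cat".split()
vocab, matrix = one_hot_matrix(tokens)

for word in vocab:
    print(f"{word}: {' '.join(str(v) for v in matrix[word])}")
# the: 1 0 0 0 1 0
# fox: 0 1 0 0 0 0
# jumps: 0 0 1 0 0 0
# over: 0 0 0 1 0 0
# cat: 0 0 0 0 0 1
```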

Word vector

While one-hot encoding only records which word appears where, a word vector (word embedding) also captures meaning. With a vector representation in the latent space, NLP models can learn linguistic features automatically. Since these embeddings capture semantic and syntactic relationships between words, they can be used as input features for NLP models.

The overarching concept is a word space (latent space) in which location carries meaning. Initially, each word has a random location within the space. We then look at how often two words show up together in a big corpus and adjust the word locations gradually, so that words pick up and shift their meaning. For example, for each word we can take the three words before and the three words after as its context. Apart from similarity in meaning, we can also do simple algebra in this space (King - Man + Woman ≈ Queen), which suggests that the word representation space has a multi-cluster structure. In other words, we don’t just find similar words close together; on a global scale, the space also forms clusters of words as part of its inherent structure. Note that this structure can also reflect bias learnt from natural language, for example, ‘man is to coder as woman is to homemaker’.
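The “how often two words show up together” statistic can be sketched by counting pairs inside a symmetric window. This only gathers the counts and does not move any word vectors; the function name and the toy sentence are assumptions for illustration.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=3):
    """Count (word, context_word) pairs within `window` positions on either side."""
    counts = Counter()
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(target, tokens[j])] += 1
    return counts


tokens = "the king rules the land and the queen rules the land".split()
counts = cooccurrence_counts(tokens, window=3)

# 'rules' appears near both 'king' and 'queen', so they share context
print(counts[("king", "rules")], counts[("queen", "rules")])
```

Words that share many context words, such as ‘king’ and ‘queen’ here, are the ones an embedding method would pull close together.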

word2vec

In the 2013 paper “Efficient Estimation of Word Representations in Vector Space”, the authors propose techniques for measuring the quality of the resulting word vectors, under the assumption that not only will similar words tend to be close, but words can also have multiple degrees of similarity. For example, nouns can have multiple endings, and if we search for similar words in a subspace of the original vector space, we can find words with similar endings. The similarity goes beyond simple syntactic regularities: we can perform simple algebraic operations on the vectors as well, for example King - Man + Woman = Queen. This is surprising because nothing in training asks for it explicitly. Words only learn their positions from computations over them and their contexts, yet the resulting space has this structure.

The authors want to maximize the accuracy of these vector operations while keeping computation cheap. Apart from the usual feedforward and recurrent neural nets, they propose new log-linear architectures for Word2Vec. We look at these models in turn.

A probabilistic feedforward neural net language model has input, projection, hidden, and output layers. At the input, the N previous words are encoded using 1-of-V coding, with V being the size of the vocabulary. The input layer is then projected onto a projection layer P of dimensionality N x D, using a shared projection matrix. Computation between the projection and hidden layers is expensive because the values in the projection layer are dense. The output layer has dimensionality V, since it gives a probability for every word in the vocabulary. Hence most of the complexity resides in the hidden and output layers. To reduce this, the authors use hierarchical softmax, with the vocabulary represented as a Huffman binary tree (frequent words are assigned short binary codes), which brings the number of output units to evaluate down to roughly log2(V).
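To illustrate why a Huffman tree helps, the sketch below builds Huffman codes from word frequencies using only the standard library. The frequencies are made up, and this shows the coding idea only, not the softmax computation along the tree.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {word: binary code string}; more frequent words get shorter codes."""
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in left.items()}
        merged.update({w: "1" + code for w, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical word frequencies
freqs = {"the": 50, "fox": 10, "jumps": 8, "cat": 5}
codes = huffman_codes(freqs)

for word in sorted(codes, key=lambda w: len(codes[w])):
    print(word, codes[word])
```

With codes like these, predicting a frequent word requires only a few binary decisions instead of a score for every word in the vocabulary.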

A recurrent neural net (RNN) can in theory represent more complex patterns than a shallow net. It has no projection layer; instead, a recurrent matrix connects the hidden layer to itself across time steps. This forms a kind of short-term memory, since the hidden state is stored and updated at each time step. Hierarchical softmax again reduces the cost of the hidden-to-output computation.

For their contribution, they propose two new architectures, motivated by the observation that a language model can be trained in two steps: first learn the word vectors with a simple model, then train an n-gram model on top of them. To compare different models, they also train in parallel on the CPUs of many machines using a distributed framework called DistBelief, with mini-batch gradient descent and the Adagrad adaptive learning rate.

The first architecture (a log-linear one) is similar to the feedforward model, except that the nonlinear hidden layer is removed and the context word vectors are averaged (projected into the same position). This is called a bag-of-words (BOW) model because it does not take the order of the words into account. In their setup, the context is the 4 words before and the 4 words after the current word. The Continuous BOW (CBOW) model predicts the current word from those context words.

The Continuous Skip-gram model does the reverse: it predicts context words given a target word. Specifically, the current word is fed into a log-linear classifier with a continuous projection layer, which predicts the words before and after it.
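The difference between the two architectures is easiest to see in the training examples each one consumes. The sketch below only generates those examples (no network is trained), and the function names are chosen for this illustration.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding position i itself."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def cbow_examples(tokens, window=2):
    """CBOW: (context words, target word); the model predicts the target."""
    return [(context_of(tokens, i, window), tok) for i, tok in enumerate(tokens)]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: (target word, one context word); the model predicts the context."""
    return [
        (tok, ctx)
        for i, tok in enumerate(tokens)
        for ctx in context_of(tokens, i, window)
    ]


tokens = ["the", "fox", "jumps"]
print(cbow_examples(tokens, window=1))
# [(['fox'], 'the'), (['the', 'jumps'], 'fox'), (['fox'], 'jumps')]
print(skipgram_pairs(tokens, window=1))
# [('the', 'fox'), ('fox', 'the'), ('fox', 'jumps'), ('jumps', 'fox')]
```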


They train on a Google News corpus (6B tokens), with the vocabulary restricted to the 1 million most frequent words, and use 640-dimensional vectors (each word is represented by a vector of 640 values). Because adding dimensions alone or adding data alone gives diminishing improvements after some point, they increase both dimensionality and data together.

The trained space supports simple algebra. For example, let X = vector(‘biggest’) - vector(‘big’) + vector(‘small’). If we search the vector space for the word closest to X by cosine similarity, we might find ‘smallest’. Training high-dimensional word vectors this way can capture other semantic relationships too. For example, France is to Paris as Germany is to Berlin. A word space with such inherent relationships can be very useful for downstream machine learning tasks.
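The analogy search can be sketched with hand-made vectors. The 2-D values below are invented so that one axis loosely means “size” and the other “superlative”; real embeddings are learned and have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest by cosine to vector(b) - vector(a) + vector(c)."""
    x = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(x, vectors[w]))


# Hypothetical toy embeddings: axis 0 ~ size, axis 1 ~ superlative degree
vectors = {
    "big": [1.0, 0.0],
    "biggest": [1.0, 1.0],
    "small": [-1.0, 0.0],
    "smallest": [-1.0, 1.0],
    "cat": [0.2, -0.9],
}

print(analogy("big", "biggest", "small", vectors))  # smallest
```

Excluding the three query words from the candidates is a common convention, since the query words themselves are often among the nearest neighbours of X.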

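Their architectures perform well, better than the RNN and feedforward networks on the word relationship tasks, and when combined with RNN language models they also achieve a strong result on the Microsoft Sentence Completion Challenge.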

GloVe

GloVe (Global Vectors for Word Representation) is a word embedding method that came out in 2014. It is designed to learn word representations that capture the meaning of words in a more global context. It is a global log-bilinear regression model that combines the advantages of the two major model families: global matrix factorization and local context window methods. It results in a vector space with meaningful substructure, scoring high on word analogy and named entity recognition tasks.

How did they do that? They argue that global log-bilinear regression models are well suited to producing linear directions of meaning (the simple algebra part). They then propose a weighted least squares model trained on global word-word co-occurrence counts, which makes use of corpus-wide co-occurrence statistics.

First we build the word-word co-occurrence matrix X, where \(X_{ij}\) is the number of times word j occurs in the context of word i. Let \(X_i = \sum_k X_{ik}\) be the number of times any word appears in the context of word i, and let \(P_{ij} = P(j\mid i) = \frac{X_{ij}}{X_i}\) be the probability that word j appears in the context of word i. Meaning can then be inferred from co-occurrence probabilities. Take i = ice and j = steam, in the context of thermodynamic phases. We can study the relationship between these words through the ratio of their co-occurrence probabilities with various probe words k. For k = solid, which is related to ice but not steam, the ratio \(\frac{P_{ik}}{P_{jk}}\) will be large. For k = gas, which is related to steam but not ice, the ratio is small. For a word k such as water, which is related to both, or a word that is related to neither, the ratio is close to 1. The authors also propose a weighted least squares regression to address the drawback that an unweighted model treats all co-occurrences equally, even those that never or rarely happen.
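The probe-word ratios can be checked numerically with a small sketch. The counts below are invented for illustration and are not taken from the paper. The weighting function uses the form \(f(x) = (x / x_{max})^{\alpha}\) below a cutoff and 1 above it, with the defaults `x_max=100` and `alpha=0.75` as reported in the GloVe paper.

```python
def cooccurrence_probability(X, i, k):
    """P_ik = X_ik / X_i, where X_i is the sum of row i of the count matrix."""
    row_total = sum(X[i].values())
    return X[i].get(k, 0) / row_total


def probability_ratio(X, i, j, k):
    """P_ik / P_jk: large if k goes with i, small if k goes with j, near 1 otherwise."""
    return cooccurrence_probability(X, i, k) / cooccurrence_probability(X, j, k)


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare co-occurrences; counts at or above x_max get weight 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


# Hypothetical co-occurrence counts (rows: word i, columns: context word k)
X = {
    "ice":   {"solid": 80, "gas": 2,  "water": 300, "fashion": 2},
    "steam": {"solid": 2,  "gas": 80, "water": 300, "fashion": 2},
}

for probe in ["solid", "gas", "water", "fashion"]:
    print(probe, probability_ratio(X, "ice", "steam", probe))
```

With these toy counts the ratio is well above 1 for ‘solid’, well below 1 for ‘gas’, and exactly 1 for ‘water’ and ‘fashion’.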

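They train the model on several corpora, including a 6B-token corpus built from Wikipedia and Gigaword, and Common Crawl (42B tokens). They also note that their own word2vec baseline results are better than most previously published ones, which they attribute to several factors: the choice to use negative sampling (which typically works better than hierarchical softmax), the number of negative samples, and the choice of corpus.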


Term frequency-inverse document frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic that is commonly used to evaluate the importance of words in a document or corpus.

The basic idea behind TF-IDF is that words that occur frequently in a document, but rarely in other documents, are more important and informative for that document’s content. Conversely, words that occur frequently in all documents (such as the, and) are less important and may be considered stopwords.

TF-IDF is calculated by multiplying two values: term frequency (TF) and inverse document frequency (IDF). The term frequency is the number of times a word appears in a document, f(w), divided by the total number of words in that document, f(d). The inverse document frequency is the log of the total number of documents in the corpus, \(\mid D \mid\), divided by the number of documents in which the term appears. For the TF, some variants use a log-scaled frequency and some use only the raw count. For the IDF, some variants add 1 to the numerator, the denominator, and the final term to avoid division by zero.

\[tfidf = \frac{f(w)}{f(d)} \cdot \log \frac{\mid D \mid}{\mid \{ d \in D : w \in d \} \mid}\]

Once we have TF-IDF scores, we can measure similarity with cosine similarity (the cosine of the angle between two vectors of scores). Alternatively, with words ranked by importance, we can retrieve documents that match given search terms. Another application is summarization: after finding the most important words, we summarize the document accordingly.
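The TF-IDF formula in this section can be sketched directly in plain Python. This uses the unsmoothed variant (no added 1), so it assumes every scored word appears in at least one document; the three toy documents are made up for illustration.

```python
import math


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: tf-idf score} dict per document."""
    n_docs = len(docs)
    doc_freq = {}  # number of documents containing each word
    for doc in docs:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    scores = []
    for doc in docs:
        row = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)          # f(w) / f(d)
            idf = math.log(n_docs / doc_freq[word])  # log(|D| / |{d : w in d}|)
            row[word] = tf * idf
        scores.append(row)
    return scores


docs = [
    "the fox jumps over the cat".split(),
    "the dog sleeps".split(),
    "the cat sleeps".split(),
]
scores = tf_idf(docs)

# 'the' is in every document, so its score is 0; 'fox' is unique to document 0
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

In the first document, ‘the’ scores 0 despite being the most frequent word, while ‘fox’, ‘jumps’, and ‘over’ score highest because they appear in no other document.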

Code example

In this example we will examine the Word2Vec method. After preprocessing the tokens, we feed them into the model and then do a simple algebraic operation: husband + woman - man = wife

import string
import gensim
import nltk
import pandas as pd
from bokeh.io import output_notebook, output_file
from bokeh.plotting import show, figure
from gensim.models.phrases import Phraser, Phrases
from gensim.models.word2vec import Word2Vec
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.manifold import TSNE
# Download the corpus, tokenizer models, and stop word lists used below
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('stopwords')
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
from nltk.corpus import gutenberg
# Tokenize into sentences
gberg_sent_tokens = sent_tokenize(gutenberg.raw())
gberg_sent_tokens[0]
'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or vex her.'
# Example to tokenize into words
word_tokenize(gberg_sent_tokens[1])

['She',
 'was',
 'the',
 'youngest',
 'of',
 'the',
 'two',
 'daughters',
 'of',
 'a',
 'most',
 'affectionate',
 ',',
 'indulgent',
 'father',
 ';',
 'and',
 'had',
 ',',
 'in',
 'consequence',
 'of',
 'her',
 'sister',
 "'s",
 'marriage',
 ',',
 'been',
 'mistress',
 'of',
 'his',
 'house',
 'from',
 'a',
 'very',
 'early',
 'period',
 '.']
# Use the corpus reader's own split: a list of sentences, each a list of word tokens
gberg_sents = gutenberg.sents()
gberg_sents[0:2]
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I']]
# Example: lowercase, then drop stop words (the, and, etc.) and punctuation
# A set gives fast membership tests
stpwrds = set(stopwords.words('english')) | set(string.punctuation)
[w.lower() for w in gberg_sents[4] if w.lower() not in stpwrds]
['youngest',
 'two',
 'daughters',
 'affectionate',
 'indulgent',
 'father',
 'consequence',
 'sister',
 'marriage',
 'mistress',
 'house',
 'early',
 'period']
# Example: same filtering as above, then reduce each word to its Porter stem
stemmer = PorterStemmer()
[stemmer.stem(w.lower()) for w in gberg_sents[4] if w.lower() not in stpwrds]
['youngest',
 'two',
 'daughter',
 'affection',
 'indulg',
 'father',
 'consequ',
 'sister',
 'marriag',
 'mistress',
 'hous',
 'earli',
 'period']
# Example: detect frequent bigrams (for example New_York) and list them with their scores
phrases = Phrases(gberg_sents)
bigram = Phraser(phrases)
bigram.phrasegrams
{'two_daughters': 11.966987886528118,
 'her_sister': 17.796341912611076,
 "'_s": 31.066694850762417,
 'very_early': 11.01230173457644,
 'Her_mother': 13.529621959045564,
 'long_ago': 63.22435639270114,
 'more_than': 29.024006819814797,
 'had_been': 22.306349272800997,
 'an_excellent': 39.064443355850045,
 'Miss_Taylor': 453.76578390553544,
 'very_fond': 24.134631699685762,
 'passed_away': 12.350716162995981,
 'too_much': 31.376458650431253,
 'did_not': 11.72858690304441,
 'any_means': 14.097169263925728,
 'wedding_-': 17.469774011299435,
 'Her_father': 13.129762639674155,
 'after_dinner': 21.528861425991604,
 'self_-': 47.79087603091109,
 'sixteen_years': 107.04772502472798,
 'five_years': 40.129339674923365,
 'years_old': 54.73622181125361,
 'seven_years': 52.59487691468612,
 'each_other': 79.41799630087341,
 'a_mile': 12.783277635060301,
 'must_be': 10.230138529643797,
 'difference_between': 220.52858240070222,
 'could_not': 10.871141494497287,
 'having_been': 11.538186246573856,
 'miles_off': 34.78731999672721,
 'at_Hartfield': 27.28262410321685,
 'her_husband': 27.544796053941578,
 'in_spite': 13.442110585867532,
 'Emma_could': 11.335276219802779,
 'every_body': 36.973121045951494,
 'no_means': 32.57409228176136,
 'his_own': 10.402539077343869,
 'obliged_to': 10.436780686118585,
 'able_to': 11.446995392578943,
 'very_much': 16.21051090525822,
 'have_been': 17.98145273154076,
 'great_deal': 118.04185550664424,
 '"_Poor': 10.125733768993836,
 'agree_with': 13.61194200678363,
 '-_humoured': 33.94127522195319,
 'for_ever': 12.476295381735138,
 'This_is': 11.381193790408082,
 'three_times': 35.42629642564782,
 'my_dear': 24.47929874292135,
 'How_often': 12.378148857690217,
 'My_dear': 84.8082171116688,
 'so_far': 10.161780363169663,
 '"_No': 15.063925495032132,
 'We_must': 18.765920000462394,
 'last_night': 23.5929422985217,
 'doubt_whether': 22.92446435569112,
 'anywhere_else': 16.100335841295465,
 'I_am': 16.95154402454624,
 'very_glad': 18.284606842044536,
 'am_sure': 65.1455501364289,
 'very_pretty': 20.06847092419522,
 'be_able': 11.34777742133673,
 'immediately_afterwards': 41.0611372267814,
 'sensible_man': 14.541599717835169,
 'intimate_friend': 21.899079320113312,
 'connected_with': 18.3761217091579,
 'than_usual': 28.952390048051893,
 'Brunswick_Square': 10881.466275659825,
 'some_time': 12.92674187618596,
 'poor_Isabella': 41.30301208842584,
 'It_is': 11.70604053201059,
 'am_afraid': 25.627764827764825,
 'moonlight_night': 14.74558893657606,
 'Look_at': 13.630663096064625,
 '"_Well': 21.191639295191656,
 'vast_deal': 61.90490490490491,
 'an_hour': 41.75817958294139,
 'pretty_well': 17.716673032849503,
 'tolerably_well': 18.357847866419295,
 '"_Ah': 17.2797604782697,
 'Ah_!': 37.53350320557592,
 "'_Tis": 23.23968214944021,
 'Miss_Woodhouse': 294.53138332704935,
 'you_please': 13.036170437015532,
 'any_rate': 83.92156482630273,
 ',"_said': 36.033065722366544,
 'My_dearest': 26.665660572611245,
 'so_much': 20.564737038651,
 'much_less': 19.104713600467317,
 'any_body': 21.71675477576872,
 'has_been': 29.261102552816904,
 'been_used': 14.094306941975477,
 'Well_,"': 12.493728094244247,
 'tell_you': 11.61233454195183,
 'Every_body': 72.20115873502328,
 '"_Dear': 20.048952862607795,
 'every_thing': 27.27756547657033,
 'very_sorry': 20.256026727225084,
 'turned_away': 19.344906255300334,
 'divided_between': 35.82858268446422,
 'knows_how': 14.801172739783402,
 'how_much': 15.41788827060771,
 'four_years': 16.257841484533913,
 'years_ago': 163.33385119704198,
 'any_thing': 35.72856040672197,
 'need_not': 13.47902882398845,
 'his_wife': 10.871008962598552,
 'Ever_since': 99.63963480128893,
 'leave_off': 10.507399991635475,
 'you_mean': 10.574149798763324,
 'young_lady': 113.30676689703486,
 'depend_upon': 66.33781993881054,
 'quarrel_with': 10.691561721691869,
 '-_hearted': 49.03796213698087,
 'their_own': 10.1646586470654,
 'You_are': 12.600380088963897,
 'more_likely': 11.17704564648048,
 'have_done': 12.664289823059754,
 ',"_rejoined': 11.956807548045319,
 'any_longer': 16.396440186651585,
 'very_well': 13.844638642769484,
 'young_man': 25.86418892544471,
 'dine_with': 13.884180846919302,
 'much_better': 10.763540796435533,
 'I_dare': 13.676667311275946,
 'dare_say': 128.21273285427895,
 'Depend_upon': 92.29609730617119,
 'take_care': 72.94080901625021,
 'CHAPTER_II': 335.55615843733045,
 'entering_into': 16.437697132934048,
 'never_seen': 14.015410764872522,
 'refrain_from': 12.438191682463382,
 'at_once': 21.418483948514538,
 'three_years': 37.35580371637182,
 'any_other': 10.208550533393703,
 'twenty_years': 85.2916825593849,
 'an_easy': 10.427619035266346,
 'according_to': 12.093503438006989,
 'had_begun': 12.033151826531773,
 'passed_through': 31.462657712657712,
 'its_being': 16.0647072143383,
 'deal_better': 19.993210914263546,
 'fine_young': 10.40328871973337,
 'belonging_to': 10.51254678910311,
 'Frank_Churchill': 1750.7034552293792,
 'Miss_Bates': 400.4319048418427,
 'a_few': 11.554768952918474,
 'few_days': 35.91581912291018,
 'I_suppose': 12.338337969117696,
 'very_handsome': 19.759725217669143,
 'an_irresistible': 11.369243496644911,
 'good_sense': 17.373623742833203,
 'had_already': 11.99089493884663,
 'She_felt': 13.338526859809706,
 'most_fortunate': 11.471739412714017,
 'long_enough': 15.189751032711847,
 'know_how': 12.783055562146046,
 'dear_Emma': 28.3901872294143,
 'at_Randalls': 27.034148473861507,
 'few_weeks': 134.4705370732768,
 'no_longer': 44.45405534922727,
 'CHAPTER_III': 354.19816723940437,
 'Donwell_Abbey': 753.4937557112397,
 'card_-': 15.662556010130528,
 'drawing_-': 20.085009641733485,
 '-_room': 10.86355694820301,
 'thrown_away': 14.820859395595178,
 'After_these': 11.092149558498896,
 'an_invitation': 10.459704016913319,
 'old_lady': 10.886162039504109,
 'those_who': 15.975875581352883,
 'as_possible': 11.709669181717734,
 'young_ladies': 113.63645786708757,
 '-_fashioned': 34.93954802259887,
 "Goddard_'": 15.295062282208422,
 'found_herself': 11.226219866395585,
 's_sake': 28.09240978588405,
 'much_pleased': 13.276377411833092,
 'be_allowed': 10.13342226127823,
 'Miss_Smith': 165.24557352585302,
 'Harriet_Smith': 180.55133848365074,
 'several_years': 17.577623156769786,
 'pretty_girl': 40.456222524597024,
 'blue_eyes': 35.5958547926145,
 'They_were': 10.653526594023209,
 'due_time': 21.041915854217102,
 'its_own': 10.834528109355436,
 'better_than': 42.51429907056041,
 'body_else': 39.470411637491935,
 'apple_-': 28.220404172099087,
 'You_need': 14.652845388359971,
 'half_-': 14.008021557447474,
 'much_more': 10.556098666120453,
 'little_girl': 35.0572859257393,
 'at_last': 22.7569403988895,
 'CHAPTER_IV': 335.55615843733045,
 'every_respect': 12.225158144438586,
 'guided_by': 23.954886635563895,
 'different_sort': 14.49105678356635,
 '-_Mill': 12.705290190035953,
 'good_deal': 39.17741946239356,
 'very_happy': 11.360922087440127,
 'drink_tea': 32.50446757069274,
 'large_enough': 10.829225583329686,
 'had_taken': 10.962706594062695,
 'doing_something': 10.717002712046513,
 'three_miles': 16.651991868276856,
 'thing_else': 12.21037677485836,
 'very_obliging': 25.34964748319397,
 'on_purpose': 10.833519216418129,
 'very_clever': 21.695644242373216,
 '"_You': 11.717232331679766,
 'know_what': 10.688016314679755,
 'Miss_Nash': 337.6861647669101,
 'does_not': 13.23044396496933,
 '"_Oh': 20.296981145444178,
 'Oh_yes': 23.468344823224335,
 'very_entertaining': 16.05477673935618,
 'soon_as': 12.011817412794189,
 'Oh_!': 31.12744137552917,
 'have_seen': 13.438992082992083,
 'on_horseback': 54.889830696518516,
 'their_families': 35.2636280696419,
 'no_doubt': 40.19018109445808,
 'very_respectable': 10.884594399563511,
 'respectable_young': 27.705368476069587,
 'very_odd': 18.20644784875443,
 'perfectly_right': 16.999175371083012,
 'years_hence': 17.99121428987025,
 'young_woman': 30.400597455143597,
 'very_desirable': 14.59525158123289,
 'Dear_Miss': 32.27882457330758,
 'thirty_years': 72.53374931093936,
 'can_afford': 26.391976955083752,
 'good_luck': 51.57827795790286,
 'acquainted_with': 27.73123821563829,
 'your_own': 10.134839838816427,
 '"_Yes': 27.04643052838071,
 'next_day': 33.668880662020904,
 'an_opportunity': 39.4962781888654,
 'few_yards': 127.2018593936402,
 'Robert_Martin': 1963.7493893502685,
 'few_minutes': 316.3684419939749,
 'Only_think': 11.416782816581593,
 'been_able': 15.91791244543283,
 '-_morrow': 31.191707231247435,
 'should_happen': 20.434509648427174,
 'Do_you': 17.543227495018566,
 'compared_with': 15.313434757631585,
 '"_Certainly': 25.246829530691297,
 'You_must': 12.865027201533609,
 'an_old': 10.46791414565501,
 'old_man': 11.807281822173856,
 'more_valuable': 17.666198180904065,
 ',"_replied': 68.63356681944516,
 'very_bad': 15.601820655800676,
 'deal_too': 12.720185939364022,
 'no_more': 17.351013056536,
 'very_agreeable': 21.40636898580824,
 'fixed_on': 10.723013373773426,
 'same_time': 18.367692434617606,
 'pleasing_young': 23.351667715544366,
 'CHAPTER_V': 236.13211149293625,
 'very_differently': 48.164330218068535,
 '"_Perhaps': 10.879276747151517,
 'ever_since': 42.92864084409274,
 'twelve_years': 39.38985552857957,
 'very_neatly': 22.935395341937397,
 'ten_years': 36.45901703775031,
 'being_able': 14.369127392027659,
 'her_mother': 11.543789344775181,
 'have_spoken': 11.590388219544845,
 'Yes_,"': 26.304976605699704,
 '"_Thank': 24.57613576706762,
 'Thank_you': 27.919290974436674,
 '"_Why': 10.5490954241727,
 'could_possibly': 30.82056265729735,
 'How_can': 16.088895681879897,
 'much_mistaken': 10.09912469789013,
 'Very_well': 84.54272043745728,
 'oh_!': 22.811112601435184,
 'look_at': 10.282146776177967,
 'any_harm': 10.323684561965813,
 '"_Very': 19.596720843150475,
 'an_angel': 25.81964752074191,
 'an_end': 18.271533362878365,
 'many_years': 19.71931776771305,
 ',"_cried': 34.72447183030883,
 'much_obliged': 42.98951729507285,
 'John_Knightley': 175.90626358469606,
 'ill_-': 22.2768930345429,
 'cared_for': 11.00409252669039,
 'I_assure': 13.11768264383995,
 'assure_you': 32.47647355375373,
 'soon_afterwards': 80.81087688682625,
 'CHAPTER_VI': 151.79921453117328,
 'most_agreeable': 28.296957218027913,
 'no_scruple': 27.381121048437084,
 'infinitely_superior': 278.5720720720721,
 'am_glad': 17.03178537511871,
 'Exactly_so': 26.019985274008626,
 'Did_you': 15.110106642904366,
 'very_interesting': 17.838640821506864,
 'No_sooner': 65.68793372043619,
 "Don_'": 25.406093197269378,
 "'_t": 30.670409254097855,
 't_pretend': 22.21571621014818,
 'why_should': 22.803466076696164,
 'cannot_imagine': 50.341789024899015,
 'back_again': 19.21576017940612,
 'almost_every': 10.072406158544345,
 'higher_than': 46.27235316124205,
 'ten_times': 32.71691507115478,
 'dear_Isabella': 10.167866890269968,
 'must_allow': 16.270925888340134,
 'sitting_down': 17.45900165672385,
 'fore_-': 15.528688010043942,
 'must_confess': 12.590597413596534,
 'depended_on': 20.583686511194443,
 'no_sooner': 28.750177100858938,
 'after_breakfast': 12.174058544459536,
 'sooner_than': 16.389570366331984,
 'at_home': 15.749758060408496,
 'at_least': 41.37119228766489,
 'Upon_my': 19.147455857896038,
 'Will_you': 14.912362397847469,
 "'_d": 30.547355552456107,
 'She_paused': 28.954070883468326,
 'replied_Emma': 16.331281539133734,
 'can_hardly': 26.42101103314215,
 'am_persuaded': 14.509837439249205,
 'Are_you': 16.465572091753142,
 'I_beg': 10.231792462195163,
 'beg_your': 44.83857997838066,
 'your_pardon': 42.18341802803452,
 'dear_Miss': 19.298196342757286,
 'little_while': 10.255019543477895,
 '`_No': 11.44562481492449,
 'entered_into': 58.47508652207685,
 'older_than': 59.83493943264058,
 'advise_you': 10.800315623690194,
 'run_away': 41.7018298679982,
 'At_last': 78.82934999295968,
 '"_Indeed': 16.7365171722639,
 'Dear_me': 11.639793716121261,
 'have_borne': 11.407140974967062,
 'good_opinion': 14.83472462099405,
 'good_natured': 34.33179126572909,
 'thank_you': 16.887766247951937,
 'merely_because': 11.301263472594304,
 'Emma_felt': 16.552506003089487,
 'no_difficulty': 13.258227033980063,
 'protest_against': 10.157345815882403,
 'Let_us': 32.264510238685276,
 'cried_Emma': 14.204966402031332,
 '"_Has': 11.45654449291874,
 'next_morning': 76.77909286541964,
 'dear_sir': 30.139015802234486,
 'am_going': 10.081024759890738,
 'sat_down': 58.92371592771372,
 'depends_upon': 18.878747176262287,
 'has_happened': 16.88674150485437,
 'presently_added': 24.987070707070707,
 'could_afford': 16.18079539508111,
 'Certainly_,"': 17.049521874064624,
 'stood_up': 10.286479413623711,
 '"_Nonsense': 10.024476431303897,
 'are_mistaken': 10.345444812472815,
 'does_seem': 14.179296113722343,
 'few_moments': 388.7952484944742,
 'nobody_knows': 38.60362047440699,
 'very_likely': 25.18396351271557,
 'all_probability': 13.453835276434582,
 'no_harm': 27.98959040506902,
 'cannot_help': 20.515795346592878,
 'very_different': 17.756435103435404,
 'common_sense': 145.7873644507308,
 '.--_She': 30.441114197863847,
 '-_natured': 48.04187853107345,
 'an_hundred': 30.382299526934904,
 'exactly_what': 14.387714269761208,
 'every_man': 12.828982038889636,
 'be_satisfied': 12.27308721373599,
 'less_than': 36.58696933460825,
 'large_fortune': 38.26326372776489,
 'no_use': 14.694534962661235,
 'these_words': 26.137791068580544,
 'well_acquainted': 11.682266824085007,
 'twenty_thousand': 77.04216497473693,
 'thousand_pounds': 448.57108317214704,
 'Good_morning': 19.263571686664424,
 'walked_off': 10.917066798474792,
 'cast_down': 15.322932013410155,
 'its_effects': 29.72292312498498,
 'deal_more': 10.737653188633582,
 'longer_than': 18.302452061748884,
 'perfectly_satisfied': 93.75833838690116,
 'three_hundred': 49.30380882147769,
 'looking_at': 12.747194191690063,
 'next_moment': 24.667637262918568,
 'ready_wit': 70.13003213003213,
 'very_pleasant': 11.645952038911217,
 'an_idea': 14.230889818929686,
 'Give_me': 25.719795758473396,
 'arrive_at': 11.926830209056545,
 'very_superior': 10.70318449290412,
 'pre_-': 49.326420737786634,
 'have_chosen': 12.418273092369478,
 'without_exception': 55.118538324420676,
 'her_cheeks': 10.316214846771857,
 'sit_down': 36.050105760056724,
 'reason_why': 38.7228669226916,
 'could_hardly': 23.719372787618898,
 'It_seemed': 10.04555801255742,
 'an_offering': 11.109162763061533,
 'let_us': 32.25392476944686,
 'Have_you': 15.500848246912406,
 '"_Aye': 19.5070892717265,
 'Very_true': 128.88708979271206,
 'can_easily': 15.162057467882711,
 'Nobody_could': 15.93563182848897,
 'dear_mother': 13.46066364121149,
 'those_things': 16.270079692293628,
 'next_week': 51.50149900066622,
 'Why_should': 13.317115021941754,
 '.--_Poor': 33.94982433025911,
 'taken_away': 33.88283563223159,
 'stay_longer': 32.079572569768644,
 'three_days': 38.36047093343905,
 'cannot_bear': 21.20561660980335,
 'We_are': 13.053617112780595,
 'four_o': 13.636624231911329,
 "o_'": 29.052217020325397,
 "'_clock": 18.374166774488803,
 'ask_whether': 11.353068061866079,
 're_-': 17.206410583993414,
 'Of_course': 64.5306321807182,
 'ran_away': 15.685737132208159,
 'who_lived': 14.515587325296062,
 'A_few': 27.799197568033183,
 '.--_Emma': 10.09086002610704,
 'thus_began': 10.569989454451607,
 'Never_mind': 58.81966901274492,
 'good_fortune': 19.836146064643472,
 'Those_who': 18.71431093178666,
 'Jane_Fairfax': 897.7114059953713,
 'nothing_else': 34.14858386055199,
 'present_instance': 15.748153806977337,
 'These_are': 23.590397700190255,
 'once_more': 21.46176651397069,
 'still_greater': 12.724373482572732,
 'here_comes': 13.200933526553092,
 'turned_back': 28.86287494639118,
 'will_bring': 12.922132077825832,
 'each_side': 22.901925567260612,
 'waiting_for': 11.228665843561624,
 'still_remained': 13.211098151305023,
 'she_hoped': 14.771949542264764,
 'ten_minutes': 192.59908585456108,
 'most_favourable': 12.483951713835843,
 'ten_days': 17.36327679451949,
 'many_months': 10.650833562965005,
 'little_ones': 64.86319239593576,
 '-_tempered': 31.944729620661825,
 'passed_over': 23.114649934790215,
 'sir_,"': 26.986181179154077,
 'cannot_deny': 34.786741854284145,
 'talking_about': 19.191470943716723,
 'never_forget': 21.69051665992176,
 'cannot_tell': 18.28934324805818,
 'two_years': 12.946058959906635,
 'indeed_!--': 15.372686467521769,
 'most_amiable': 17.207609119071027,
 ',"_observed': 11.71111972171562,
 'our_lives': 19.655161454360538,
 'think_differently': 12.463321241434906,
 'shake_hands': 44.840574981420055,
 'How_long': 18.446894705078492,
 'South_End': 1381.451973194341,
 'perfectly_convinced': 75.6828750917843,
 'tells_me': 12.933104129023622,
 'bad_cold': 14.30739511156867,
 'far_off': 27.665534339247987,
 'am_sorry': 26.67562904385334,
 'Ah_!"': 17.412606445880755,
 'an_interval': 11.127344698843958,
 'perfectly_well': 14.848259303721488,
 'He_paused': 28.882402391182513,
 'can_tell': 12.431003638264086,
 'morrow_morning': 22.316414535277676,
 'own_feelings': 10.723971700076298,
 'sore_throat': 129.1624895572264,
 '&_c': 4365.388235294118,
 'well_satisfied': 19.87189717498996,
 'looked_at': 13.030941652621292,
 'well_pleased': 19.25167566515881,
 'set_forward': 30.104688954112678,
 'eldest_daughter': 86.18032329988851,
 'short_time': 10.749540101684603,
 'Ha_!': 53.03328712107136,
 '"_Quite': 17.433872054441558,
 ',"_continued': 41.18031481403468,
 'dining_-': 27.58385370205174,
 'such_circumstances': 10.213103979019895,
 'enter_into': 69.92082379259851,
 'gone_through': 11.359011093968112,
 'turn_away': 25.693197151088075,
 ',"_repeated': 16.0233360034719,
 'several_times': 100.66348633961886,
 'great_curiosity': 12.762317494711862,
 'upper_end': 50.59973817705776,
 'an_odd': 26.958000043591028,
 'In_short': 17.859418769192267,
 'dearest_Emma': 41.199369337360096,
 'continued_Mrs': 12.8091714520948,
 'go_home': 10.887794606718627,
 'covered_with': 10.080615337595193,
 'hardly_knew': 30.25775488600073,
 'knew_how': 10.771742964262854,
 'set_off': 12.87641808850673,
 'can_get': 10.915918025964645,
 'got_home': 13.023221532639205,
 'most_extraordinary': 46.22770238588718,
 'an_inch': 63.920413436692506,
 'at_ease': 17.60627316575014,
 'tete_-': 10.750630160799654,
 'well_known': 14.650399763103346,
 'Smith_!--': 18.89143450635386,
 'extremely_sorry': 73.4710121970537,
 'Every_thing': 22.468754541491062,
 'many_weeks': 20.758257250268528,
 'Am_I': 15.552324542536647,
 'madam_,"': 15.249261800405625,
 'extremely_well': 29.94818401937046,
 '!--_Such': 19.885720533004065,
 'poor_Harriet': 12.146024108216924,
 '-_headed': 37.26885122410546,
 'an_instant': 43.26164345230693,
 'thirty_thousand': 42.256696240854424,
 'so_easily': 10.348857779435248,
 'worth_having': 21.66509020844281,
 'poor_girl': 16.403616188855242,
 'laugh_at': 11.791298047589994,
 'knowing_what': 15.750760884791218,
 'many_days': 14.230461886034641,
 'whole_party': 21.73407276203329,
 'six_weeks': 18.238468797923794,
 'too_late': 87.81582024724356,
 '-_minded': 20.81504988580358,
 'her_companions': 11.854753785126626,
 'drew_near': 135.3933203484773,
 'three_months': 82.38812730639597,
 'other_side': 28.114218416314966,
 'an_unnatural': 19.227397089914188,
 'get_rid': 302.7068037200196,
 'watering_-': 20.552675307411103,
 'while_ago': 15.477310722473048,
 'at_Weymouth': 43.731710766540665,
 'present_occasion': 32.73631972474029,
 'No_,"': 11.567334989477068,
 'their_hearts': 18.284844184258766,
 'break_through': 11.455134820459898,
 'burst_forth': 49.66572766472688,
 'young_men': 28.058548107027942,
 '-_bred': 27.951638418079096,
 'nobody_else': 96.33066107291948,
 'something_else': 38.925897096435115,
 'walking_together': 11.844821972381299,
 'burst_out': 11.102331509877692,
 '-_sized': 27.1752040175769,
 'how_long': 10.246965742926971,
 'Miss_Fairfax': 273.2315441060061,
 'extremely_happy': 19.519300571284287,
 "don_'": 30.893027225924477,
 "ma_'": 29.826951267640514,
 's_handwriting': 11.483028817587641,
 "Ma_'": 17.287522502879252,
 'without_seeming': 24.2521568627451,
 'Colonel_Campbell': 896.7839354391274,
 'those_days': 24.942982765152095,
 'Miss_Campbell': 75.31725733771769,
 'most_charming': 10.480354525195523,
 'caught_hold': 25.41214661406969,
 'four_months': 21.513976100607053,
 'may_guess': 13.367124175942937,
 'Bless_me': 16.960842272062408,
 'running_away': 11.650622091724552,
 'My_father': 11.189057813492585,
 'five_minutes': 145.59430269856685,
 'nine_years': 22.04328958038157,
 'hundred_pounds': 62.91037943779458,
 'more_honourable': 10.648119451503819,
 'rather_than': 19.03838981947655,
 'few_months': 59.13887494322121,
 'she_wished': 11.81925902403813,
 'without_feeling': 13.66868744277099,
 'twelve_thousand': 59.18646236299162,
 'passed_between': 19.791026625704045,
 ",'_said": 30.38023579892928,
 'Miss_Hawkins': 356.68101153504875,
 'dear_Jane': 28.08747388500318,
 'three_minutes': 10.882525806031556,
 'have_suffered': 12.705290190035953,
 'hour_ago': 35.6591784486934,
 'looked_round': 11.609514648854786,
 'help_thinking': 32.39549502357255,
 'a_series': 10.46445052916564,
 'laughed_at': 12.564141746945063,
 'weeks_ago': 66.07864088043594,
 'She_wished': 10.48200653568184,
 'twenty_miles': 32.079572569768644,
 'elder_sister': 20.892905405405404,
 'alas_!': 57.08877378327094,
 'no_fault': 10.496096401900884,
 'driven_away': 15.10864307317955,
 'setting_off': 17.077411634756995,
 'little_farther': 13.37803343166175,
 'spot_where': 40.48947421434327,
 'front_door': 45.13527518483108,
 'they_parted': 10.44325386818452,
 'without_delay': 17.83246828143022,
 'six_months': 149.7273250007566,
 'months_ago': 33.90422411666347,
 'leaned_back': 12.550583460172502,
 'at_Oxford': 14.908537761320682,
 'turned_round': 11.2590300905922,
 'pass_through': 22.188217566016075,
 'clock_struck': 287.9462433862434,
 'four_hours': 47.34066169603626,
 'faster_than': 39.88995962176039,
 'musical_society': 113.41096644049148,
 'worth_while': 39.41555130656469,
 'mixed_with': 10.67000615370459,
 'extremely_glad': 72.79072504708098,
 'knew_nothing': 12.449045733530072,
 'make_amends': 67.18048992450166,
 'amends_for': 15.10365640918289,
 'oftener_than': 50.683713401766134,
 'old_woman': 19.444732663616787,
 'post_-': 12.228841807909605,
 'just_going': 13.260330720277203,
 'At_least': 28.03569269825919,
 'their_lives': 14.906122976297906,
 'six_days': 14.208028157365117,
 'may_prove': 10.69369934075435,
 'stronger_than': 58.08695243797917,
 'particular_friend': 19.818171330419286,
 'Hum_!': 26.958587619877946,
 'good_tidings': 31.69088424528839,
 'among_themselves': 14.912909361688993,
 'next_summer': 20.5140424590889,
 'breaking_up': 10.278411586632634,
 'perfectly_safe': 16.445856823742155,
 'two_ladies': 10.207136726744569,
 'same_moment': 15.561086589572348,
 'well_worth': 11.682266824085007,
 ',"_added': 22.000525888403388,
 'little_girls': 24.596997116436313,
 'be_ashamed': 15.575445327520244,
 'been_staying': 13.958784759841098,
 'shut_up': 33.02129026711253,
 'too_large': 13.75905031306614,
 'At_first': 13.622549163579302,
 'worse_than': 50.56473754871035,
 'opposite_side': 21.485013505649793,
 'short_pause': 86.62091182855941,
 'large_party': 14.159924899255099,
 'six_years': 24.750919080861966,
 'who_knows': 17.15478502080444,
 'extremely_fond': 34.254458845685164,
 'or_twice': 10.480088120657514,
 'somebody_else': 146.9730657512543,
 'five_couple': 31.782814266625554,
 '"_Don': 12.435426459085846,
 'bad_news': 32.31086729362592,
 'baked_apples': 613.5218253968253,
 'will_send': 16.248127069496444,
 'William_Larkins': 5074.297435897436,
 'low_voice': 55.284472898891764,
 'one_leg': 10.74596003475239,
 'an_immediate': 14.761679056127669,
 'Tell_me': 25.093033554681703,
 ',"_resumed': 28.980058972381027,
 'many_times': 11.523320146772159,
 'Nothing_can': 15.51466345550789,
 'few_words': 11.709874520256639,
 'no_objection': 30.845671058647493,
 'It_seems': 18.2281699492411,
 'astonished_at': 13.983180245100776,
 'four_times': 17.904877713359244,
 'other_end': 10.241169643435896,
 'few_hours': 78.0796666877091,
 'an_extraordinary': 10.356142591003287,
 'look_forward': 11.731760911835844,
 'Alas_!': 24.207711332135293,
 'immediately_followed': 12.604814218453825,
 'wait_till': 29.399581656260896,
 '-_bye': 39.93091202582728,
 'contrast_between': 166.24462365591398,
 'dared_not': 11.783553500216318,
 'three_weeks': 21.40970383064167,
 '-_sighted': 32.251890482398956,
 'Maple_Grove': 16731.716961498438,
 'My_brother': 11.063412365232326,
 'at_Maple': 11.542093750699882,
 'almost_fancy': 16.725625422582826,
 'left_behind': 29.19181841393264,
 'barouche_-': 13.975819209039548,
 '-_landau': 19.96545601291364,
 'whose_name': 31.461202630580967,
 'most_serious': 10.186904598490049,
 'We_cannot': 11.868641936045467,
 'waited_for': 10.711948477309232,
 'E_.,': 566.3278388278388,
 'person_who': 10.40945693009978,
 'greater_part': 16.18384415693171,
 'drew_back': 15.195773696172985,
 'Her_manners': 10.694300338936156,
 'third_time': 13.583823987016237,
 'very_extraordinary': 11.127072987672598,
 'better_acquainted': 13.44997825141366,
 'According_to': 10.652714079624486,
 'have_committed': 16.773728020950244,
 'hardly_less': 13.000147148472808,
 'will_shew': 12.349874144320705,
 'little_boys': 17.749724946185125,
 'easily_believe': 21.824887069452284,
 'my_lord': 36.09264842223215,
 '"_Excuse': 17.18481673937811,
 'Excuse_me': 37.690760604583126,
 'put_forth': 10.696233535526662,
 'drawing_near': 18.70289723583137,
 'great_joy': 10.008185299508712,
 'eight_o': 65.1527602191319,
 'spread_abroad': 194.3118977796397,
 'few_lines': 43.57841479226563,
 'good_news': 24.79518258080434,
 'most_likely': 19.419480443744643,
 'talk_about': 21.824887069452284,
 'tells_us': 30.44567755366135,
 'dear_madam': 68.52258121703674,
 'eleven_years': 16.99170238487746,
 'your_sister': 15.965251961999174,
 'two_hours': 15.078877429107843,
 'two_months': 19.98674940210717,
 'door_opened': 32.96045744045013,
 'Who_can': 10.97976511828241,
 'began_talking': 13.575011249766773,
 'mean_?"': 26.41782366663845,
 'In_spite': 11.90627917946151,
 'many_hours': 13.124575551782684,
 'few_steps': 22.235285657785926,
 'most_excellent': 12.940681654585935,
 'later_than': 11.966987886528118,
 'whole_story': 36.39536252354049,
 'whole_history': 19.2441498630819,
 'lined_with': 13.540300206747927,
 '-_plaister': 17.469774011299435,
 'Lord_bless': 16.701427469135805,
 'these_things': 42.22085680150196,
 'laid_down': 12.408697820671941,
 'forty_years': 161.26670263465516,
 'faint_smile': 24.271193092621665,
 'turned_towards': 11.117376349756116,
 'totally_different': 158.32821300563236,
 'Box_Hill': 8589.305555555555,
 'some_surprise': 22.397609685430467,
 'may_depend': 21.16461327857632,
 ',"_interrupted': 30.235605293907703,
 'whatever_else': 18.73973515954062,
 'mid_-': 25.824883321051338,
 'larger_than': 19.006392525662303,
 'were_assembled': 18.151746404461402,
 'insisted_on': 11.61131033964815,
 'clothed_with': 12.659106066308777,
 'twenty_minutes': 11.181261808550069,
 'quite_alone': 11.341420176217916,
 'etc_.,': 3964.2948717948716,
 'As_soon': 20.83462134191184,
 'without_knowing': 34.56996044031648,
 ',"_whispered': 26.71599186516376,
 "shan_'": 22.473779253743025,
 'looking_round': 17.251931821351857,
 'Pardon_me': 10.14751247046469,
 ',"_answered': 22.57516648996616,
 'An_old': 19.494934210941096,
 'Shall_we': 13.344625941350367,
 'old_age': 48.23121704303023,
 'an_infant': 23.772054583893908,
 'be_forgiven': 18.33341939975366,
 'lie_down': 31.292094007783852,
 'four_miles': 16.306227917523596,
 'great_hurry': 19.653968941856267,
 'without_waiting': 19.24774354186119,
 'comes_back': 15.420512101235838,
 'heightened_by': 11.875072007373555,
 'In_fact': 39.68320704393532,
 'cut_off': 155.50163439778729,
 'never_mind': 11.197398746902147,
 'trembling_voice': 11.95022270316229,
 'More_than': 24.23315047021944,
 'time_past': 13.326716908397632,
 'second_time': 20.945614179827093,
 'five_hundred': 85.8882839842231,
 'turning_away': 10.732346458879267,
 'an_arrow': 14.857534114933692,
 '--_oh': 12.81100676702113,
 'presented_themselves': 11.894714571472534,
 'at_random': 18.668082066349374,
 'far_distant': 26.63579981049186,
 'few_seconds': 96.54294969363463,
 'passing_through': 15.455340630779228,
 'will_heal': 10.577600656791981,
 'rose_early': 50.25143069404622,
 'east_wind': 148.38828510938603,
 'gone_mad': 20.604717798360767,
 'freed_from': 25.54271506220159,
 'sinned_against': 81.41747505543238,
 'locked_up': 13.137819321259759,
 'deep_sigh': 47.76258881680568,
 'ten_thousand': 127.07863651308065,
 'happier_than': 23.413671951902835,
 'contend_with': 10.67000615370459,
 'had_formerly': 11.686041677689511,
 'little_boy': 26.45202064896755,
 'fancying_herself': 29.03972577009767,
 'right_hand': 45.67341533298018,
 'surrounded_by': 30.407381352214102,
 'infinitely_more': 12.605071134482898,
 'such_cases': 14.91498581087056,
 'No_wonder': 19.25381091925171,
 'poor_fellow': 72.63322416713721,
 'Poor_fellow': 45.43103764921947,
 'days_ago': 15.442862018162295,
 'help_laughing': 20.6971218206158,
 'draw_near': 83.0347441697135,
 'at_intervals': 31.386395286990908,
 'into_temptation': 11.801423582619316,
 'stood_before': 10.39258282946439,
 'Sir_Walter': 1001.3265848443275,
 'Walter_Elliot': 158.52745152870992,
 'Kellynch_Hall': 4945.357744107744,
 'arising_from': 10.217086024880635,
 'Charles_Musgrove': 248.92084078711986,
 'first_year': 36.52590150555186,
 'Lady_Elliot': 34.95647609819121,
 'seventeen_years': 50.975107154632376,
 'an_awful': 15.611498532706445,
 'Lady_Russell': 1370.642422350554,
 'Anne_Elliot': 69.51776079136691,
 'Miss_Elliot': 81.92993320516612,
 'everybody_else': 116.64529027877325,
 'her_mistress': 10.118550146240644,
 'Mr_Elliot': 154.42474881796693,
 'Mr_Shepherd': 153.51099290780144,
 'anybody_else': 167.79799555698756,
 'reference_to': 10.087797423886824,
 'an_honest': 22.111506653401317,
 'descend_into': 19.585341264772484,
 'Mrs_Clay': 287.0487212850306,
 'Miss_Anne': 13.817194691451808,
 'their_fathers': 21.038778684865164,
 'an_example': 17.829040937920432,
 'Admiral_Croft': 1020.8859134262656,
 'Mrs_Croft': 207.37606885374169,
 'walked_along': 11.42244112667385,
 'Frederick_Wentworth': 23.25274477365017,
 'either_side': 19.36359410488185,
 'Captain_Wentworth': 976.2801057938673,
 'eldest_son': 39.75891221190009,
 'removed_from': 14.303920434832891,
 'good_humour': 57.21965210954848,
 'The_Crofts': 10.24976796605675,
 'startled_by': 14.177381886354143,
 'most_important': 33.80609933127228,
 'replied_Anne': 13.874646644430818,
 'at_Uppercross': 13.940450893702456,
 'Great_House': 1177.9619047619049,
 'left_alone': 14.135954084898055,
 'Mr_Musgrove': 32.3891325695581,
 'Miss_Musgroves': 227.52634882160712,
 'Mrs_Musgrove': 156.77276316336284,
 'flower_-': 13.524986331328595,
 'grown_up': 12.21909069739544,
 'their_faces': 22.65730692397282,
 'surprised_at': 14.056621317816642,
 'ere_long': 33.882862152092926,
 'anything_else': 74.81176994319226,
 'quite_different': 19.349519727167486,
 'their_sakes': 16.878317708546554,
 'twentieth_year': 185.54755475547557,
 'on_board': 34.25748298789808,
 'eight_years': 61.898344402053596,
 '-_bone': 13.4993708269132,
 'their_heads': 21.621494582846132,
 'Your_sister': 24.014833799316555,
 'dressing_-': 29.64567711008389,
 'up_stairs': 14.512707389763687,
 'waited_till': 12.054695723363611,
 'third_part': 80.67984559777145,
 'Phoo_!': 23.963188995447062,
 'dear_fellow': 20.631526271893243,
 'good_cheer': 58.85449931267844,
 'Mrs_Harville': 84.64015847289754,
 '"_Ay': 18.45776612748019,
 'fifteen_years': 52.059683902603275,
 'Charles_Hayter': 2649.332925336597,
 'came_near': 11.627447632578933,
 'Her_husband': 21.944148747427437,
 'two_hundred': 34.529513816937666,
 'Dr_Shirley': 1086.3943785682916,
 'went_up': 10.893037774183895,
 'within_reach': 18.479352178330245,
 '-_yard': 16.03782532184866,
 'turn_back': 10.596177405398922,
 'walking_along': 19.124729409339242,
 'leaning_against': 23.70047357039227,
 'trodden_under': 72.2464953271028,
 'under_foot': 22.229690869877786,
 'Louisa_Musgrove': 189.5280416794361,
 'provoke_me': 16.48970776450512,
 'Very_good': 10.325350756610252,
 'good_humoured': 26.157555250079305,
 'Captain_Harville': 475.4296696696697,
 'at_Lyme': 20.293412594514123,
 'earnest_desire': 32.79056203605514,
 'Captain_Benwick': 811.83861003861,
 'an_officer': 10.895525017618041,
 'place_where': 29.88379217094472,
 '-_coat': 13.074153453617642,
 'an_introduction': 10.895525017618041,
 'preceding_evening': 48.94191199746755,
 'an_agony': 15.203058164118197,
 'catching_hold': 135.5314486083717,
 'raised_up': 20.917757019882856,
 'could_scarcely': 17.38432563107888,
 'passed_along': 16.544408774745854,
 'leaning_over': 33.66105049605383,
 't_talk': 11.570685526118845,
 'Camden_Place': 11505.67441860465,
 'straight_forward': 11.997865942380443,
 'same_hour': 12.921871463147081,
 '-_glasses': 16.354682053131388,
 'poring_over': 41.31128924515698,
 'thirty_feet': 12.120929017084244,
 'Colonel_Wallis': 967.3885461023725,
 'Mrs_Wallis': 54.179333304130715,
 '-_haired': 60.68447814451383,
 'at_length': 17.89024531358482,
 'carried_away': 68.9387205762538,
 'greater_than': 48.18287227996847,
 'Miss_Carteret': 320.0983436853002,
 'contact_with': 11.02567302549474,
 'Lady_Dalrymple': 1027.2923588039866,
 'Laura_Place': 777.4104336895035,
 'be_established': 13.256300819785361,
 'Mrs_Smith': 112.00140587397476,
 'Westgate_Buildings': 8589.305555555555,
 'buried_him': 10.212164360501543,
 'at_liberty': 14.03156495183123,
 'human_nature': 43.511573911208046,
 'five_thousand': 37.911149464312665,
 'whose_names': 20.291362480518416,
 'her_ladyship': 34.12286449316844,
 '-_maker': 26.620608017218185,
 'old_gentleman': 27.054706126823717,
 'almost_entirely': 36.106112023353404,
 'lower_part': 15.927696983224877,
 'staring_at': 25.046343439018745,
 'an_oath': 44.98797426629384,
 'wiser_than': 29.373515721478107,
 'prejudice_against': 39.17833386126069,
 'both_sides': 99.67218081951572,
 'my_soul': 16.443748679233014,
 'rejoice_over': 10.117050427385383,
 'same_instant': 25.16281097419205,
 'every_one': 14.671605951506953,
 'their_seats': 12.658738281409915,
 'their_mouths': 26.497528436510585,
 'short_silence': 19.427323846323,
 '-_blooded': 17.469774011299435,
 'general_character': 10.833683694205032,
 'fifty_pounds': 35.381314720521765,
 'be_saved': 13.40991848436365,
 'threw_himself': 10.030058440961653,
 'some_moments': 21.006453804347828,
 'exclaimed_Mrs': 12.149305043956582,
 'compassion_on': 13.012675380640166,
 'an_explanation': 16.343287526427062,
 'our_hearts': 14.94442027934851,
 'minutes_afterwards': 22.217312424781305,
 'make_haste': 50.385367443376246,
 "'_n": 22.53339140030468,
 "n_'": 16.0952795716462,
 'rising_sun': 13.065928609910948,
 '-_faced': 42.60920490560838,
 'an_atonement': 89.61263272917309,
 'atonement_for': 24.316159515907604,
 '"_Look': 10.092670148523652,
 'Look_here': 26.312064784218066,
 ...}
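
The scores above span several orders of magnitude ('Maple_Grove' is above 16000, 'go_home' is barely above 10). A rough way to see why is the collocation score from the original word2vec phrase paper, which rewards pairs whose words rarely appear apart. The sketch below is a simplified pure-Python version of that formula, not gensim's implementation: the function name `bigram_scores` and the toy sentences are made up for illustration, and gensim's own counting differs in details, so it will not reproduce the exact numbers above.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1, threshold=0.0, delimiter="_"):
    """Score adjacent word pairs: (count(a,b) - min_count) / (count(a) * count(b)) * vocab_size."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))  # adjacent pairs only
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:  # keep only pairs scoring above the threshold
            scores[a + delimiter + b] = score
    return scores

# hypothetical toy corpus
toy = [["box", "hill", "was", "fine"],
       ["box", "hill", "again"],
       ["it", "was", "fine"],
       ["box", "hill"]]
scores = bigram_scores(toy, min_count=1)
print(scores)
```

Pairs made of rare words that almost always occur together (proper names such as 'Box_Hill' or 'Kellynch_Hall') get a small denominator and therefore a large score, while pairs of common words need many more co-occurrences to pass the threshold.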
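# PREPROCESS THE CORPUS
## lowercase every token and drop single-character punctuation tokens
## (multi-character tokens such as '--' or ',"' are not in string.punctuation,
## so they pass through and still show up in the vocabulary further down)
import string

punctuation = set(string.punctuation)  # build the lookup once, not once per token
lower_sents = []
for s in gberg_sents:
    lower_sents.append([w.lower() for w in s if w.lower()
                        not in punctuation])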
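
The filter above compares whole tokens against single punctuation characters, so tokens such as '--' or '."' survive. If that is not wanted, one option is to test every character of the token instead. This is a minimal sketch of that alternative, not what the run in this post used; the helper names `is_punct` and `lower_and_strip` and the sample sentence are invented for illustration.

```python
import string

PUNCT = set(string.punctuation)

def is_punct(token):
    """True when the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT for ch in token)

def lower_and_strip(sentence):
    """Lowercase the tokens of one sentence and drop pure-punctuation tokens."""
    return [w.lower() for w in sentence if not is_punct(w)]

# hypothetical tokenized sentence
sample = ["Emma", ",", "said", "--", "Oh", "!\"", "don", "'", "t"]
cleaned = lower_and_strip(sample)
print(cleaned)
```

Using this filter would change the bigram statistics and the vocabulary shown later, so the outputs in this post would no longer match exactly.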

# detect bigrams: join word pairs that occur at least 32 times and score above 64
lower_bigram = Phraser(Phrases(lower_sents,
                               min_count=32, threshold=64))
lower_bigram
<gensim.models.phrases.FrozenPhrases at 0x7f7b00b3fb20>
clean_sents = []
for s in lower_sents:
    clean_sents.append(lower_bigram[s])

clean_sents[6]
['sixteen',
 'years',
 'had',
 'miss_taylor',
 'been',
 'in',
 'mr_woodhouse',
 's',
 'family',
 'less',
 'as',
 'a',
 'governess',
 'than',
 'a',
 'friend',
 'very',
 'fond',
 'of',
 'both',
 'daughters',
 'but',
 'particularly',
 'of',
 'emma']
# remove stop words and stem the remaining tokens
preprocessed = []
for s in clean_sents:
    preprocessed.append([stemmer.stem(w.lower()) for w in s if w.lower() not in stpwrds])
preprocessed[6]
['sixteen',
 'year',
 'miss_taylor',
 'mr_woodhous',
 'famili',
 'less',
 'gover',
 'friend',
 'fond',
 'daughter',
 'particularli',
 'emma']
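# train a skip-gram (sg=1) Word2Vec model with a context window of 10;
# tokens seen fewer than 10 times are dropped from the vocabulary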
model = Word2Vec(sentences=preprocessed, 
                 sg=1, window=10, min_count=10, workers=4)
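
With sg=1 the model is trained as skip-gram: each word is used to predict the words around it, up to `window` positions away on each side. The sketch below only illustrates how such (center, context) training pairs can be enumerated. The function name `skipgram_pairs` is made up, and gensim's real training loop does more (for example, it varies the effective window size and subsamples frequent words), so treat this as a picture of the idea and not of the library internals.

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# first four stems of preprocessed[6], with a window of 1 to keep the output short
pairs = skipgram_pairs(["sixteen", "year", "miss_taylor", "mr_woodhous"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

A larger window such as the 10 used here pulls in more distant words, which tends to make the vectors capture topical similarity more than strict syntactic similarity.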
# resulting vocabulary: token -> index, most frequent token first
model.wv.key_to_index
{'shall': 0,
 'said': 1,
 'unto': 2,
 'lord': 3,
 'one': 4,
 '."': 5,
 'god': 6,
 'man': 7,
 'thi': 8,
 '--': 9,
 'thou': 10,
 'ye': 11,
 'thee': 12,
 ',"': 13,
 'say': 14,
 'day': 15,
 'upon': 16,
 'come': 17,
 'would': 18,
 'thing': 19,
 'son': 20,
 'like': 21,
 'go': 22,
 'could': 23,
 'king': 24,
 'hand': 25,
 'know': 26,
 'came': 27,
 'see': 28,
 'time': 29,
 'hous': 30,
 'look': 31,
 'good': 32,
 'even': 33,
 'littl': 34,
 'everi': 35,
 '1': 36,
 'peopl': 37,
 '2': 38,
 'us': 39,
 'make': 40,
 'made': 41,
 'men': 42,
 'great': 43,
 'let': 44,
 'father': 45,
 '3': 46,
 'israel': 47,
 'may': 48,
 'hath': 49,
 'well': 50,
 '4': 51,
 'two': 52,
 '7': 53,
 'word': 54,
 'must': 55,
 '5': 56,
 '6': 57,
 '?"': 58,
 'much': 59,
 'went': 60,
 'land': 61,
 'children': 62,
 'way': 63,
 'yet': 64,
 '9': 65,
 '8': 66,
 'also': 67,
 '10': 68,
 'give': 69,
 '11': 70,
 'think': 71,
 'take': 72,
 'old': 73,
 'mr': 74,
 'away': 75,
 'eye': 76,
 'call': 77,
 '12': 78,
 'might': 79,
 'place': 80,
 'first': 81,
 '13': 82,
 '14': 83,
 'never': 84,
 'name': 85,
 'long': 86,
 '15': 87,
 'head': 88,
 'pass': 89,
 'heart': 90,
 '!"': 91,
 'seem': 92,
 'though': 93,
 'thought': 94,
 'put': 95,
 '16': 96,
 'earth': 97,
 'turn': 98,
 'ever': 99,
 'saw': 100,
 'therefor': 101,
 '18': 102,
 'heard': 103,
 'mani': 104,
 'cri': 105,
 'citi': 106,
 'face': 107,
 'year': 108,
 'without': 109,
 'love': 110,
 'hear': 111,
 '19': 112,
 '17': 113,
 'sea': 114,
 'behold': 115,
 'whale': 116,
 '20': 117,
 'noth': 118,
 '21': 119,
 'life': 120,
 'live': 121,
 '22': 122,
 'work': 123,
 'speak': 124,
 'among': 125,
 'hast': 126,
 'answer': 127,
 'last': 128,
 'heaven': 129,
 'still': 130,
 'voic': 131,
 'night': 132,
 'side': 133,
 'world': 134,
 'done': 135,
 'tell': 136,
 'found': 137,
 'water': 138,
 '23': 139,
 'thou_shalt': 140,
 'set': 141,
 'mother': 142,
 'anoth': 143,
 'took': 144,
 'three': 145,
 'friend': 146,
 'brought': 147,
 'sure': 148,
 'back': 149,
 'quit': 150,
 'forth': 151,
 'round': 152,
 'right': 153,
 'servant': 154,
 '24': 155,
 'walk': 156,
 'command': 157,
 'young': 158,
 'part': 159,
 'end': 160,
 'bring': 161,
 'return': 162,
 'stand': 163,
 'david': 164,
 'neither': 165,
 'ask': 166,
 'soul': 167,
 'spirit': 168,
 'mean': 169,
 'soon': 170,
 'priest': 171,
 '25': 172,
 'oh': 173,
 'fear': 174,
 'open': 175,
 'thine': 176,
 '26': 177,
 'alway': 178,
 'mine': 179,
 'left': 180,
 'till': 181,
 'offer': 182,
 'death': 183,
 'want': 184,
 'feel': 185,
 'eat': 186,
 'woman': 187,
 'seen': 188,
 'ship': 189,
 '27': 190,
 'poor': 191,
 'believ': 192,
 'toward': 193,
 'light': 194,
 'door': 195,
 'far': 196,
 'fire': 197,
 'someth': 198,
 'morn': 199,
 'whole': 200,
 'mind': 201,
 'stood': 202,
 'half': 203,
 'hope': 204,
 'thu': 205,
 'thereof': 206,
 '.--': 207,
 'gave': 208,
 'find': 209,
 'sent': 210,
 'accord': 211,
 'daughter': 212,
 '28': 213,
 'high': 214,
 'get': 215,
 'better': 216,
 'togeth': 217,
 'brother': 218,
 'die': 219,
 'inde': 220,
 'law': 221,
 'emma': 222,
 'mose': 223,
 'keep': 224,
 'talk': 225,
 'bodi': 226,
 'evil': 227,
 'given': 228,
 'leav': 229,
 'present': 230,
 'appear': 231,
 'rest': 232,
 'enter': 233,
 'white': 234,
 'don_t': 235,
 'saith': 236,
 'jerusalem': 237,
 'full': 238,
 'whose': 239,
 'judah': 240,
 'cannot': 241,
 'new': 242,
 'moment': 243,
 'wish': 244,
 'rather': 245,
 'enough': 246,
 'happi': 247,
 '29': 248,
 'follow': 249,
 'power': 250,
 'tree': 251,
 'sight': 252,
 'jesu': 253,
 'told': 254,
 'sister': 255,
 'almost': 256,
 'captain': 257,
 'room': 258,
 'began': 259,
 'hundr': 260,
 'wife': 261,
 'dear': 262,
 'gener': 263,
 'hold': 264,
 'dead': 265,
 'art': 266,
 'knew': 267,
 'receiv': 268,
 'blood': 269,
 'sir': 270,
 '!--': 271,
 'sin': 272,
 'thousand': 273,
 'sword': 274,
 '30': 275,
 'arm': 276,
 'natur': 277,
 'use': 278,
 '31': 279,
 'dark': 280,
 'boy': 281,
 'ladi': 282,
 'mouth': 283,
 'gone': 284,
 'hour': 285,
 'home': 286,
 'kind': 287,
 'famili': 288,
 'continu': 289,
 'bless': 290,
 'none': 291,
 'elinor': 292,
 'understand': 293,
 'howev': 294,
 'pleas': 295,
 'realli': 296,
 'manner': 297,
 'gold': 298,
 'caus': 299,
 'reason': 300,
 'spake': 301,
 '32': 302,
 'holi': 303,
 'person': 304,
 'cast': 305,
 'taken': 306,
 'sort': 307,
 'air': 308,
 'pray': 309,
 'known': 310,
 'fall': 311,
 'perhap': 312,
 'mariann': 313,
 'egypt': 314,
 'bear': 315,
 'four': 316,
 'deliv': 317,
 'sit': 318,
 'miss': 319,
 'carri': 320,
 'field': 321,
 'wonder': 322,
 'within': 323,
 'near': 324,
 'strong': 325,
 'feet': 326,
 'ahab': 327,
 'felt': 328,
 'whether': 329,
 'wait': 330,
 '33': 331,
 'nation': 332,
 'peac': 333,
 'best': 334,
 'stone': 335,
 'matter': 336,
 'rememb': 337,
 'sun': 338,
 'drink': 339,
 'repli': 340,
 'five': 341,
 'gate': 342,
 'brethren': 343,
 'point': 344,
 'ground': 345,
 'small': 346,
 'meet': 347,
 'lay': 348,
 'joy': 349,
 'faith': 350,
 'host': 351,
 'destroy': 352,
 'seven': 353,
 'book': 354,
 'boat': 355,
 'tri': 356,
 'lie': 357,
 'smile': 358,
 'next': 359,
 'sat': 360,
 'ann': 361,
 'help': 362,
 'run': 363,
 'child': 364,
 'judg': 365,
 'glori': 366,
 'els': 367,
 'fell': 368,
 'ear': 369,
 '34': 370,
 'turnbul': 371,
 'suppos': 372,
 'truth': 373,
 'enemi': 374,
 'sinc': 375,
 'war': 376,
 'certain': 377,
 'save': 378,
 'sound': 379,
 'garden': 380,
 'month': 381,
 'shew': 382,
 ",'": 383,
 'prophet': 384,
 'sing': 385,
 'dwell': 386,
 'syme': 387,
 'care': 388,
 'true': 389,
 'honour': 390,
 'least': 391,
 'other': 392,
 'wind': 393,
 'flesh': 394,
 'doubt': 395,
 'remain': 396,
 'judgment': 397,
 'countri': 398,
 'harriet': 399,
 'second': 400,
 'less': 401,
 'gather': 402,
 'black': 403,
 'rise': 404,
 'beauti': 405,
 'wall': 406,
 'laugh': 407,
 'begin': 408,
 'mountain': 409,
 'alon': 410,
 'afraid': 411,
 'comfort': 412,
 'form': 413,
 'princ': 414,
 'got': 415,
 'angel': 416,
 'letter': 417,
 'bow': 418,
 'tabl': 419,
 '--"': 420,
 'prais': 421,
 'fill': 422,
 'send': 423,
 'behind': 424,
 'street': 425,
 'desir': 426,
 'number': 427,
 '36': 428,
 'watch': 429,
 'play': 430,
 'wood': 431,
 'twenti': 432,
 'chang': 433,
 'hors': 434,
 '35': 435,
 'altar': 436,
 'busi': 437,
 'strang': 438,
 'concern': 439,
 'hill': 440,
 'fruit': 441,
 'touch': 442,
 'beast': 443,
 'larg': 444,
 'pleasur': 445,
 'move': 446,
 'along': 447,
 'depart': 448,
 'river': 449,
 'cours': 450,
 'midst': 451,
 'kill': 452,
 'bed': 453,
 'sleep': 454,
 'haue': 455,
 'money': 456,
 'close': 457,
 'silver': 458,
 'seek': 459,
 'master': 460,
 'delight': 461,
 'state': 462,
 'thus_saith': 463,
 'possibl': 464,
 'read': 465,
 'often': 466,
 'serv': 467,
 'kept': 468,
 'burn': 469,
 'lift': 470,
 'observ': 471,
 'readi': 472,
 'mighti': 473,
 'kingdom': 474,
 'red': 475,
 'saul': 476,
 'thank': 477,
 'hair': 478,
 'line': 479,
 'macian': 480,
 'visit': 481,
 'differ': 482,
 'reign': 483,
 'wick': 484,
 'promis': 485,
 'consid': 486,
 'stop': 487,
 '37': 488,
 'born': 489,
 'glad': 490,
 'possess': 491,
 'bread': 492,
 'need': 493,
 'spoken': 494,
 'rose': 495,
 'besid': 496,
 'fish': 497,
 'show': 498,
 'suffer': 499,
 'cut': 500,
 'expect': 501,
 'marri': 502,
 'immedi': 503,
 'sweet': 504,
 'certainli': 505,
 'women': 506,
 'window': 507,
 'equal': 508,
 'short': 509,
 'case': 510,
 'ten': 511,
 'becom': 512,
 'fine': 513,
 'cloth': 514,
 'merci': 515,
 'prepar': 516,
 'alic': 517,
 'stay': 518,
 'order': 519,
 'cloud': 520,
 'abl': 521,
 'song': 522,
 'broken': 523,
 'christ': 524,
 'mount': 525,
 'written': 526,
 'wise': 527,
 'anyth': 528,
 'shalt': 529,
 'cover': 530,
 'build': 531,
 'ought': 532,
 'question': 533,
 '38': 534,
 'congreg': 535,
 'laid': 536,
 'tribe': 537,
 'parti': 538,
 'fight': 539,
 'strength': 540,
 'wherefor': 541,
 'troubl': 542,
 'break': 543,
 'jacob': 544,
 'either': 545,
 'ran': 546,
 'girl': 547,
 'top': 548,
 'seed': 549,
 'sail': 550,
 'fast': 551,
 'wit': 552,
 'yea': 553,
 '?--': 554,
 'admir': 555,
 'mere': 556,
 'rich': 557,
 'pain': 558,
 'deep': 559,
 'tabernacl': 560,
 'ham': 561,
 '40': 562,
 'happen': 563,
 'stranger': 564,
 'husband': 565,
 'except': 566,
 'suddenli': 567,
 'lost': 568,
 'silent': 569,
 'sens': 570,
 'fair': 571,
 'seat': 572,
 'silenc': 573,
 'beyond': 574,
 'dream': 575,
 'idea': 576,
 'hard': 577,
 'perfect': 578,
 'rais': 579,
 ',)': 580,
 'interest': 581,
 'aaron': 582,
 'chapter': 583,
 'engag': 584,
 'green': 585,
 'sake': 586,
 'declar': 587,
 ';--': 588,
 'minut': 589,
 'fellow': 590,
 'piec': 591,
 'draw': 592,
 'ill': 593,
 'rejoic': 594,
 'knowledg': 595,
 'compani': 596,
 'respect': 597,
 'object': 598,
 'subject': 599,
 'grow': 600,
 ',--': 601,
 'acquaint': 602,
 'forward': 603,
 ".'": 604,
 'write': 605,
 'iniqu': 606,
 '39': 607,
 "!'": 608,
 'bad': 609,
 'fanci': 610,
 'wild': 611,
 'sacrific': 612,
 'trust': 613,
 'account': 614,
 'free': 615,
 'jew': 616,
 'step': 617,
 'age': 618,
 'past': 619,
 'grace': 620,
 'chief': 621,
 'third': 622,
 'town': 623,
 'battl': 624,
 'wine': 625,
 'self': 626,
 'low': 627,
 'oblig': 628,
 'gentleman': 629,
 'late': 630,
 'wilder': 631,
 '.)': 632,
 'righteous': 633,
 'inhabit': 634,
 'inherit': 635,
 'cross': 636,
 'star': 637,
 'afterward': 638,
 '41': 639,
 'solomon': 640,
 'iron': 641,
 'breath': 642,
 'sign': 643,
 'tongu': 644,
 'anger': 645,
 'mark': 646,
 'john': 647,
 'measur': 648,
 'vessel': 649,
 'roll': 650,
 'hate': 651,
 'alreadi': 652,
 'common': 653,
 'affect': 654,
 'longer': 655,
 'philistin': 656,
 'lest': 657,
 'coven': 658,
 'mari': 659,
 'charg': 660,
 'hardli': 661,
 'offic': 662,
 'attent': 663,
 'express': 664,
 'oil': 665,
 'multitud': 666,
 'escap': 667,
 'rock': 668,
 'secret': 669,
 'clear': 670,
 'remov': 671,
 'six': 672,
 'human': 673,
 'creatur': 674,
 'plain': 675,
 'cold': 676,
 'join': 677,
 'assur': 678,
 'babylon': 679,
 'curs': 680,
 'sometim': 681,
 'vain': 682,
 'companion': 683,
 'drop': 684,
 'spoke': 685,
 'wave': 686,
 'reach': 687,
 'pretti': 688,
 'thyself': 689,
 'wrong': 690,
 'imagin': 691,
 'act': 692,
 'levit': 693,
 'morrow': 694,
 'throne': 695,
 'susan': 696,
 'grave': 697,
 'ad': 698,
 'stori': 699,
 'surpris': 700,
 'entir': 701,
 'wisdom': 702,
 'earli': 703,
 'blue': 704,
 'foot': 705,
 'length': 706,
 'doth': 707,
 'lamb': 708,
 'templ': 709,
 'wilt': 710,
 'becam': 711,
 'sever': 712,
 'spread': 713,
 'forc': 714,
 'fli': 715,
 'cometh': 716,
 'regard': 717,
 'particular': 718,
 'worship': 719,
 'mad': 720,
 'bright': 721,
 'purpos': 722,
 'pull': 723,
 'learn': 724,
 'sky': 725,
 'tast': 726,
 'bound': 727,
 'danc': 728,
 'pharaoh': 729,
 'discipl': 730,
 'tear': 731,
 'neighbour': 732,
 'mr_knightley': 733,
 'listen': 734,
 'shape': 735,
 'hat': 736,
 'devil': 737,
 'opinion': 738,
 'church': 739,
 'edward': 740,
 'instant': 741,
 'court': 742,
 'south': 743,
 'st': 744,
 'increas': 745,
 'bird': 746,
 'caesar': 747,
 'bare': 748,
 'bone': 749,
 'allow': 750,
 'burnt_off': 751,
 '42': 752,
 'twelv': 753,
 'fact': 754,
 'satisfi': 755,
 'repeat': 756,
 'favour': 757,
 'pay': 758,
 'forget': 759,
 'figur': 760,
 'thirti': 761,
 'wast': 762,
 'lip': 763,
 'real': 764,
 '44': 765,
 'held': 766,
 'start': 767,
 'colour': 768,
 'accept': 769,
 'abraham': 770,
 'dress': 771,
 'dread': 772,
 'stubb': 773,
 'captiv': 774,
 'sorrow': 775,
 'cubit': 776,
 'east': 777,
 'hell': 778,
 'fit': 779,
 'queen': 780,
 'week': 781,
 'harpoon': 782,
 'lead': 783,
 'spoil': 784,
 'crown': 785,
 'pour': 786,
 'wing': 787,
 'louisa': 788,
 'joseph': 789,
 'righteou': 790,
 'complet': 791,
 'fortun': 792,
 'therein': 793,
 'deck': 794,
 'flower': 795,
 'ark': 796,
 'safe': 797,
 'queequeg': 798,
 'imag': 799,
 'chanc': 800,
 'excel': 801,
 'settl': 802,
 'view': 803,
 'mrs_weston': 804,
 'north': 805,
 'blow': 806,
 'circumst': 807,
 'wrath': 808,
 'everyth': 809,
 'servic': 810,
 'paul': 811,
 'attend': 812,
 'danger': 813,
 'smote': 814,
 'charact': 815,
 'fix': 816,
 'fifti': 817,
 'caught': 818,
 'stretch': 819,
 'met': 820,
 'carriag': 821,
 'fool': 822,
 'nois': 823,
 'lion': 824,
 'leg': 825,
 'perfectli': 826,
 'hearken': 827,
 'dwelt': 828,
 'direct': 829,
 'convers': 830,
 'sick': 831,
 'youth': 832,
 'desol': 833,
 'shadow': 834,
 'unclean': 835,
 'cheer': 836,
 'shame': 837,
 'thenc': 838,
 'hurri': 839,
 'valley': 840,
 'corner': 841,
 'shut': 842,
 'fat': 843,
 'across': 844,
 'built': 845,
 'lo': 846,
 'appoint': 847,
 'determin': 848,
 'shoulder': 849,
 'armi': 850,
 'greater': 851,
 'heavi': 852,
 '43': 853,
 'elder': 854,
 'separ': 855,
 'usual': 856,
 'moon': 857,
 'hang': 858,
 'road': 859,
 'big': 860,
 'glass': 861,
 'probabl': 862,
 'lean': 863,
 'arriv': 864,
 'struck': 865,
 'shout': 866,
 'border': 867,
 'luci': 868,
 'spring': 869,
 'buri': 870,
 'villag': 871,
 'blind': 872,
 'begat': 873,
 'mrs_jen': 874,
 'meat': 875,
 'drew': 876,
 'grass': 877,
 'flock': 878,
 'whatev': 879,
 'plant': 880,
 'belong': 881,
 'rain': 882,
 'led': 883,
 'goe': 884,
 'mention': 885,
 'prison': 886,
 'dri': 887,
 'around': 888,
 'nobodi': 889,
 'wors': 890,
 'ceas': 891,
 'compass': 892,
 'neck': 893,
 'exactli': 894,
 '48': 895,
 'wash': 896,
 'archer': 897,
 'attach': 898,
 'sheep': 899,
 'divid': 900,
 'prove': 901,
 'flame': 902,
 'afflict': 903,
 'willoughbi': 904,
 'presenc': 905,
 'moreov': 906,
 'tent': 907,
 'won_t': 908,
 'journey': 909,
 'coast': 910,
 'joshua': 911,
 'clean': 912,
 'astonish': 913,
 'cut_off': 914,
 'feast': 915,
 'temper': 916,
 'lower': 917,
 'slain': 918,
 'ma_am': 919,
 'sorri': 920,
 'counten': 921,
 'refus': 922,
 'bath': 923,
 'fail': 924,
 'mr_elton': 925,
 'sudden': 926,
 'perceiv': 927,
 'teach': 928,
 'fled': 929,
 'aros': 930,
 'rule': 931,
 'aris': 932,
 'woe': 933,
 'asham': 934,
 'slew': 935,
 'dinner': 936,
 'pleasant': 937,
 'finger': 938,
 'wound': 939,
 'finish': 940,
 '46': 941,
 'enjoy': 942,
 'counsel': 943,
 'fallen': 944,
 'habit': 945,
 'garment': 946,
 'speech': 947,
 "?'": 948,
 '45': 949,
 'lot': 950,
 'broad': 951,
 'shore': 952,
 'father_brown': 953,
 'wherein': 954,
 'sought': 955,
 'labour': 956,
 'advanc': 957,
 'chariot': 958,
 'branch': 959,
 'prayer': 960,
 'meant': 961,
 'wheel': 962,
 'asid': 963,
 'board': 964,
 'consider': 965,
 'situat': 966,
 'piti': 967,
 'fresh': 968,
 'minist': 969,
 'jordan': 970,
 'aye': 971,
 'jesus_christ': 972,
 'abomin': 973,
 'hide': 974,
 'smoke': 975,
 'ah': 976,
 'commit': 977,
 'ring': 978,
 'th': 979,
 'invit': 980,
 'ti': 981,
 'crowd': 982,
 'starbuck': 983,
 'notic': 984,
 'instead': 985,
 'brown': 986,
 'import': 987,
 'deal': 988,
 'peter': 989,
 'paper': 990,
 'extrem': 991,
 'drive': 992,
 'gentlemen': 993,
 'imposs': 994,
 'wide': 995,
 'coat': 996,
 'secur': 997,
 'loos': 998,
 'captain_wentworth': 999,
 ...}
# we have 7745 words in the vocabulary
len(model.wv.key_to_index)
7745
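The indices in the listing above are not arbitrary. Assuming the model was trained with gensim's default settings, the vocabulary is sorted by descending frequency, so lower indices belong to more common stems. The sketch below rebuilds that kind of mapping by hand on a tiny made-up corpus; the names `sentences`, `counts`, `index_to_key` and `key_to_index` are local to this example and only mirror the attribute names on `model.wv`.

```python
from collections import Counter

# Hypothetical, already-stemmed toy sentences (not the corpus used above)
sentences = [
    ['whale', 'ship', 'sea'],
    ['whale', 'sea'],
    ['whale', 'captain'],
]

# Count how often each token occurs across all sentences
counts = Counter(token for sentence in sentences for token in sentence)

# most_common() orders by descending count; ties keep first-seen order
index_to_key = [token for token, _ in counts.most_common()]

# Invert the list to get the token -> index lookup
key_to_index = {token: i for i, token in enumerate(index_to_key)}

print(key_to_index)
# {'whale': 0, 'sea': 1, 'ship': 2, 'captain': 3}
```

The two structures are inverses of each other: `index_to_key[i]` gives the token stored at row `i`, and `key_to_index[token]` gives the row for a token.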
# let's see the vector representation of the word men
model.wv['men']
array([-0.18468249,  0.06325892,  0.01308708,  0.18653318, -0.19613682,
       -0.40422338,  0.2299384 ,  0.25596562,  0.10555264, -0.30124772,
        0.37354293, -0.25737157,  0.21009737,  0.13068293,  0.25980756,
        0.25370252, -0.22621223, -0.16719353, -0.13919002, -0.29158577,
        0.24395864, -0.3127991 , -0.04485767, -0.44950372, -0.4746415 ,
        0.31596923, -0.09829171, -0.2788213 ,  0.23204322, -0.07246452,
        0.22872812, -0.15918386, -0.24236207, -0.1425959 ,  0.21417455,
        0.6150764 , -0.04175409, -0.477133  , -0.03276837, -0.29636404,
        0.07169929, -0.3279534 ,  0.16641249, -0.30916795,  0.6396977 ,
       -0.0885635 , -0.05042264,  0.20307241, -0.47921538,  0.0342183 ,
       -0.0859681 ,  0.43662393, -0.11143676,  0.07161714, -0.00264144,
        0.05555701, -0.10695332, -0.25567538, -0.10583918, -0.0602743 ,
        0.37836388, -0.16090545,  0.1339406 ,  0.04201669, -0.4194944 ,
        0.32112533,  0.16019395,  0.3029806 , -0.18761338,  0.05004926,
        0.00800556, -0.0225536 ,  0.15617049,  0.15568808,  0.30200157,
        0.02181411, -0.01589741, -0.29849273, -0.07492296,  0.35400125,
       -0.23482019, -0.02553824,  0.20082664, -0.15909846,  0.10595362,
       -0.25370923, -0.10757717,  0.44330776,  0.16920084,  0.21475472,
       -0.14774562, -0.28972393,  0.4389147 , -0.10880499,  0.32498556,
       -0.0642518 , -0.11173607, -0.35820797, -0.14992775, -0.26721698],
      dtype=float32)
# most similar to men
model.wv.most_similar('men',topn=3)
[('gong', 0.5398657917976379),
 ('valiant', 0.5377486944198608),
 ('barbarian', 0.5352712273597717)]
# similarity between men and women
model.wv.similarity('men','women')
0.4954386
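The number returned by `similarity` is the cosine similarity of the two word vectors: their dot product divided by the product of their lengths. It ranges from -1 (opposite directions) to 1 (same direction). A minimal sketch with NumPy, using made-up 3-dimensional vectors rather than the real 100-dimensional ones; `cosine_similarity`, `a` and `b` are names introduced only for this example.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, invented for illustration (not taken from the model)
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 0.0])

# dot(a, b) = 4 and |a| = |b| = sqrt(5), so the result is about 0.8
print(round(cosine_similarity(a, b), 4))
```

With the trained model, `cosine_similarity(model.wv['men'], model.wv['women'])` should agree with `model.wv.similarity('men', 'women')` up to floating-point rounding.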
# simple algebra: men + woman - man = women
model.wv.most_similar(positive=['men','woman'], negative=['man'])
[('women', 0.6154292821884155),
 ('virgin', 0.4893409013748169),
 ('concubin', 0.4869789183139801),
 ('wive', 0.48526743054389954),
 ('nabal', 0.4788726568222046),
 ('harlot', 0.47875019907951355),
 ('offspr', 0.4723259210586548),
 ('sarah', 0.4715254604816437),
 ('childless', 0.46972396969795227),
 ('parent', 0.4664917588233948)]
# husband + woman - man = wife
model.wv.most_similar(positive=['husband','woman'], negative=['man'])
[('wife', 0.6940380930900574),
 ('sister', 0.6235106587409973),
 ('wive', 0.5963566899299622),
 ('younger', 0.5844662189483643),
 ('wean', 0.5826187133789062),
 ('womb', 0.5700362920761108),
 ('amnon', 0.5695263743400574),
 ('chast', 0.5670580863952637),
 ('daughter', 0.5650200247764587),
 ('begotten', 0.5636038780212402)]
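To see what these analogy queries do, here is a rough sketch of the idea behind `most_similar` with `positive` and `negative` lists: normalise each input vector to unit length, add the positive ones, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The input words themselves are left out of the ranking. This is a simplified illustration, not gensim's actual implementation, and `toy_vectors`, `unit` and `analogy` are hypothetical names with hand-picked values.

```python
import numpy as np

# Hand-made 3-d "embeddings" (axes loosely: male, female, married).
# These values are invented for illustration only.
toy_vectors = {
    'man':     np.array([1.0, 0.0, 0.0]),
    'woman':   np.array([0.0, 1.0, 0.0]),
    'husband': np.array([1.0, 0.0, 1.0]),
    'wife':    np.array([0.0, 1.0, 1.0]),
    'sister':  np.array([0.0, 1.0, 0.2]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, vectors, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive)
    query = query - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    exclude = set(positive) | set(negative)  # do not return the inputs
    scores = [(w, float(np.dot(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# husband + woman - man
result = analogy(['husband', 'woman'], ['man'], toy_vectors)
print(result[0][0])  # 'wife' ranks first for these toy vectors
```

On real embeddings the top result is only as good as the corpus allows, which is why the lists above mix sensible answers such as 'wife' with noisier neighbours.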
# plot the vocabulary in 2D with t-SNE
# (scikit-learn 1.5+ renames n_iter to max_iter)
tsne = TSNE(n_components=2, n_iter=1000)
# model.wv.vectors holds one row per token, in the same order as index_to_key
X_2d = tsne.fit_transform(model.wv.vectors)
coords_df = pd.DataFrame(X_2d, columns=['x','y'])
coords_df['token'] = model.wv.index_to_key
_ = coords_df.plot.scatter('x','y', figsize=(12,12),
                           marker='.', s=10, alpha=0.2)

[Figure: t-SNE scatter plot of the vocabulary in two dimensions, with axes x and y]