NLP, Word Embeddings & Tensorboard
I asked myself after my last project:
What does an NLP pipeline look like that can determine the similarity of words from a text corpus and, at the same time, visualize it clearly? And is it really possible to find relations within a text corpus without having explicitly programmed them? I was a little sceptical, to be honest.
After I took some protocols from the German Bundestag and spent a few hours with Tensorboard, I ended up with these findings:
The following video shows the basic workflow in a nutshell. The main entry point is listed below; we will go through each of its lines in more detail shortly.
# Main
create_checkpoint()
model = load_model(SAVEDVECTORDATA_PATH)
if model is None:
    sentences = get_text_from_files(TEXTSOURCE)
    model = train_new_datafiles(sentences, SAVEDVECTORDATA_PATH)
vocabs, embeddings = create_word_embedding(model)
summary_writer = save_checkpoint()
visualize_embeddings(summary_writer, embeddings.name, METADATA)
But let’s start from scratch. How does this work? The basic workflow of the Python code used for this project is shown above. The details of the workflow, as well as an already trained model based on the text corpus of the German Parliament, can be found in my GitHub space. Feel free to clone the repository, adapt it to new data and play around with your own dataset.
Under the hood, the basic steps are: find a data source with texts, tokenize the words, create the word embedding, train a model on the documents with e.g. Word2Vec, and then visualize the result with Tensorboard.
Generally speaking, word embedding, a.k.a. converting words to vectors, a.k.a. word vectorization, is a natural language processing (NLP) technique. It uses language models to map individual words into a vector space, i.e. it attempts to represent each word by a vector of real numbers. This ultimately means that words with similar vectors also tend to have similar meanings.
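To make the idea more concrete, here is a tiny sketch that computes the cosine similarity between word vectors with NumPy. The numbers are invented purely for illustration; in a trained model, semantically related words end up with a similarity close to 1:
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: ~1.0 = very similar, ~0.0 = unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy 4-dimensional "embeddings" (a real model would use e.g. 100 dimensions)
vec_days = np.array([0.9, 0.1, 0.3, 0.0])
vec_weeks = np.array([0.8, 0.2, 0.4, 0.1])
vec_parliament = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine_similarity(vec_days, vec_weeks))       # high -> similar meaning
print(cosine_similarity(vec_days, vec_parliament))  # low  -> unrelated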
Some constants first
# Constants
CHECKPOINT = 'checkpoints/'
CHECKPOINT_OUT_FILE = 'word_embeddings'
METADATA = 'metadata.tsv'
METADATA_PATH = CHECKPOINT + METADATA
TENSORNAME = 'word_embedding_sample'
MODELPATH = r'/Path/to/your/model/'
MODELNAME = 'word2vec.model'
SAVEDVECTORDATA = r'saved_vectordata.dat'
SAVEDVECTORDATA_PATH = MODELPATH + SAVEDVECTORDATA
TEXTSOURCE = r'/Path/to/your/textfiles/'
Create checkpoint
We’ll start off very relaxed. There is nothing special here: just a method which creates the checkpoints folder in case it does not exist at start-up. The files in this folder are later read by Tensorboard to visualize the result.
def create_checkpoint():
    if not os.path.exists(CHECKPOINT):
        os.mkdir(CHECKPOINT)
Load model
Once you have trained your data it makes sense to store the model in a file so that it can be loaded again in later sessions. On the very first run, however, the program will not find such a file. In that case we have to read the data source and train on its content first.
model = load_model(SAVEDVECTORDATA_PATH)
if model is None:
    sentences = get_text_from_files(TEXTSOURCE)
    model = train_new_datafiles(sentences, SAVEDVECTORDATA_PATH)
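The load_model helper itself is not shown in this post. A minimal sketch of how it could look, assuming the model was saved with Gensim’s Word2Vec.save as in train_new_datafiles further below:
import os
from gensim.models import Word2Vec

def load_model(path):
    # return a previously saved model, or None if no file exists yet
    if not os.path.exists(path):
        print(f'No saved model found at {path}.')
        return None
    print(f'Loading model from {path}')
    return Word2Vec.load(path)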
Get text corpus
I have used some protocols that have been drawn up in the German Parliament over a period of about two years (2016-2018). They are available here.
A small side note: I do not intend to examine the parties in Parliament and their respective contributions – at least not yet 🙂
In order to extract the relevant information we will start with the following code:
def get_text_from_files(inputdir):
    text = []
    for filename in os.listdir(inputdir):
        print(f'Getting file: {filename}')
        path = inputdir + filename
        with open(path, 'rb') as f:
            for i, line in enumerate(f):
                if i % 100 == 0:
                    print(f'read {i} lines')
                # do some pre-processing and return list of words for each line
                text.append(simple_preprocess(line))
        print(f'Processed file: {filename}')
    return text
The special part here is the simple_preprocess method. It is part of the Gensim library and can be imported with:
from gensim.utils import simple_preprocess
This method lowercases, tokenizes and, optionally, de-accents the input. The output is a list of final tokens (unicode strings), stored in the text array, which is not changed any further.
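To see what the pre-processing actually produces, here is a quick example (the sentence is made up and not taken from the corpus):
from gensim.utils import simple_preprocess

line = 'Die Sitzung ist eröffnet. Ich begrüße Sie alle herzlich!'
print(simple_preprocess(line))
# ['die', 'sitzung', 'ist', 'eröffnet', 'ich', 'begrüße', 'sie', 'alle', 'herzlich']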
Train new data
The next step is to run the algorithm on the preprocessed text. According to Gensim’s Word2Vec tutorial, the input is required to yield one sentence after another. In our case we pass a list of tokenized sentences as the input to Word2Vec.
Here comes the main part – the Word2Vec-Algorithm:
def train_new_datafiles(documents, outfile):
    print('Build vocabulary and train model...')
    epoch_logger = epoch.EpochLogger()
    model = Word2Vec(
        documents,
        size=100,
        window=20,
        min_count=4,
        workers=10,
        iter=500,
        callbacks=[epoch_logger])
    print('Build vocabulary and train model...done.')
    print('Save model...')
    model.save(outfile)
    return model
Parameters:
size
This is the dimensionality of the word vectors; in this context one sometimes also speaks of the dimensions of a token. Values of 100-300 are often used for similarity tasks.
window
Can be omitted if you are unsure, because a sensible default value is used. This parameter defines how many words before and after the target word are taken into account. The smaller the window, the stronger the connection or relationship between the individual terms or words.
iter
Number of iterations (epochs) over the corpus during training. The right value depends mostly on the use case; from my point of view, 5 to 10 iterations are a good starting point.
min_count
Minimum frequency of words. Words that do not occur at least as often as the specified value are ignored. It is advisable to adjust this parameter to keep the memory requirements of the final model file small.
workers
Specifies how many worker threads are started in the background. Depending on how many CPU cores are available, this allows faster training.
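The epoch_logger passed to Word2Vec above comes from a small epoch helper module that is not listed in this post. A minimal sketch of what such a callback could look like, based on Gensim’s CallbackAny2Vec base class:
from gensim.models.callbacks import CallbackAny2Vec

class EpochLogger(CallbackAny2Vec):
    # prints a short status line at the start and end of each training epoch
    def __init__(self):
        self.epoch = 0

    def on_epoch_begin(self, model):
        print(f'Epoch #{self.epoch} start')

    def on_epoch_end(self, model):
        print(f'Epoch #{self.epoch} end')
        self.epoch += 1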
Create word embedding
What do we have so far?
We processed some files, extracted their content and trained vector values for each word. In the following step we map the words to their vector values and store the result in a *.tsv file. This file is then used by Tensorboard to link the points within the point cloud to the corresponding words.
def create_word_embedding(model):
    EMBED_SIZE = model.vector_size
    VOCAB_LEN = len(model.wv.index2word)
    modelspace = model.wv.index2word[:100000]
    placeholder = np.zeros((VOCAB_LEN, EMBED_SIZE))
    tsv_row_template = '{}\t{}\n'
    print('Creating meta file...')
    with open(METADATA_PATH, 'w+', encoding='UTF-8') as file_metadata:
        header_row = tsv_row_template.format('Name', 'Index')
        file_metadata.write(header_row)
        for value, word in enumerate(modelspace):
            placeholder[value] = model[word]
            data_row = tsv_row_template.format(word, str(value))
            file_metadata.write(data_row)
    print(f'{METADATA_PATH} has been written.')
    print(f'Vocab length is: {VOCAB_LEN}.')
    print(f'Embedding size is: {EMBED_SIZE}.')
    # initialize tensors with start values (not trained)
    sess = tf.InteractiveSession()
    embedding = tf.Variable(placeholder, trainable=False, name=TENSORNAME)
    tf.global_variables_initializer().run()
    return modelspace, embedding
Save checkpoint
Theoretically it is possible to save each training run or training version in several different files and open the different results later with Tensorboard. This can be done, for example, by adding a timestamp to the CHECKPOINT constant. In this example we do not use this option.
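A minimal sketch of that optional variant, assuming one checkpoint folder per training run is enough to keep the results apart:
from datetime import datetime

# hypothetical variant: one checkpoint folder per training run
CHECKPOINT = 'checkpoints_{}/'.format(datetime.now().strftime('%Y%m%d_%H%M%S'))
The save_checkpoint method actually used in this example looks like this: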
def save_checkpoint():
    # Saver
    saver = tf.train.Saver(tf.global_variables())
    # start session
    session = tf.Session()
    summary_writer = tf.summary.FileWriter(CHECKPOINT, graph=session.graph)
    session.run([tf.global_variables_initializer()])  # init variables
    # ... do stuff with session
    # save checkpoints periodically
    filename = os.path.join(CHECKPOINT, CHECKPOINT_OUT_FILE)
    saver.save(session, filename, global_step=1)
    print(f'Word_embeddings checkpoint saved under: {filename}')
    return summary_writer
Visualize
For the final visualization it is only necessary to configure Tensorboard and to point it to the relevant paths. The rest is done by Tensorboard.
Attention: during my first attempts there were problems displaying the point cloud in the Safari browser. Chrome gives the best results – which also matches the Tensorboard documentation.
def visualize_embeddings(summary_writer, word_embeddings_name, metadata_path=METADATA):
    # Link metadata tsv file to embedding
    config = projector.ProjectorConfig()
    embedding = config.embeddings.add()  # could add more metadata files here
    embedding.tensor_name = word_embeddings_name
    embedding.metadata_path = metadata_path
    projector.visualize_embeddings(summary_writer, config)
    print('Metadata linked to checkpoint')
    print('Run: tensorboard --logdir checkpoints/')
Results
We are done:
You can start the application from your console with:
tensorboard --logdir checkpoints/
In the above example we can see that the system "learned" what time units are. One can now search for "days" and will get results like "years" and "weeks". This is remarkable, because not a single line of code was written to search for exactly this connection. Wow! I wouldn't have thought so, especially since it took no more than half an hour on my MacBook Pro to train the text corpus.
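If you want to check such neighbourhoods without the Tensorboard UI, you can also query the trained model directly. A small sketch (the query token 'tage' and its neighbours are illustrative only and depend on your corpus):
from gensim.models import Word2Vec

model = Word2Vec.load(SAVEDVECTORDATA_PATH)
# nearest neighbours of a token in the embedding space (the token must exist in the vocabulary)
for word, similarity in model.wv.most_similar('tage', topn=5):
    print(f'{word}\t{similarity:.3f}')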
Just try it out for yourself. Of course you can also try your own data sources. The list of possibilities is long: reviews, recipes, log files…
Thank you for reading.
Initially posted on LinkedIn