Larry's Blog

A blog documenting the journey of a computational biophysicist at the intersection of molecular simulation, artificial intelligence, chemistry, gene editing, and beyond.

ORCID | GitHub | LinkedIn | CV

Machine Learning Projects


coffee and code


Current Machine Learning Projects:

52Predict

Description: A card counting predictor and machine learning project

SentimentAnalysis

Description: Working through the sentiment analysis section of Géron's Hands-On Machine Learning book

Updates:

April 04, 2022:

52Predict:

52Predict is one of my more ambitious projects. I started coding it after watching the movie 21. At this point I have a simple function that generates a randomized set of one thousand cards using numpy's choice function.

import csv
import numpy as np

def CardRandomizer():
    # Draw 1,000 cards (with replacement) and write them to a CSV file
    cards = ['ace','king','queen','jack','10','9','8','7','6','5','4','3','2']
    is_this_your_card = np.random.choice(cards, size=1000)
    with open('cards.csv', 'w') as fi_gen_counts:
        writer = csv.writer(fi_gen_counts)
        writer.writerow(is_this_your_card)

Then, using a simple high-low card counting scheme I found online, I wrote a loop that updates a running count as each card is 'flipped' (i.e., looped over) by Python. Here are the first few lines of the loop:

count = 0
list_fi_hilo = []
for row in csv_read:
    for card in row:
        if card == 'ace':
            count = count - 1
            print('card: ', card, ' | count: ', count)
            list_fi_hilo.append(count)
        elif card == 'queen':
            count = count - 1
            print('card: ', card, ' | count: ', count)
            list_fi_hilo.append(count)

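For reference, the whole Hi-Lo scheme can also be written with a dictionary instead of a long if/elif chain. This is my own sketch (the function and variable names are mine, not from the repo), assuming the standard +1/0/−1 Hi-Lo values:

```python
# Hi-Lo values: +1 for 2-6, 0 for 7-9, -1 for 10s, face cards, and aces.
HILO = {'2': 1, '3': 1, '4': 1, '5': 1, '6': 1,
        '7': 0, '8': 0, '9': 0,
        '10': -1, 'jack': -1, 'queen': -1, 'king': -1, 'ace': -1}

def running_counts(cards):
    """Return the running Hi-Lo count after each card is flipped."""
    count = 0
    counts = []
    for card in cards:
        count += HILO[card]
        counts.append(count)
    return counts
```

For example, `running_counts(['ace', '2', 'king'])` returns `[-1, 0, -1]`: the ace drops the count to -1, the 2 brings it back to 0, and the king drops it to -1 again.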
At this point, I still have many details of the game to work out before the program can actually win hands. In the meantime, while I think about how to construct the neural network for this repo, I am working on the following project.

SentimentAnalysis:

I want to start building a neural network, so I will begin with the sentiment analysis section of Aurélien Géron's Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. Working from chapter 16 and the book's public repository, I want to recreate the code in my own repository and then build on it to create a sentiment analyzer for earnings reports, using the stock market as a supervision signal.

Just blindly copying and running the code, I got it to work with only one deprecation warning:


    
WARNING:absl:Dataset is using deprecated text encoder API which will be removed soon.
Please use the plain_text version of the dataset and migrate to `tensorflow_text`.
    

I found a fix in this post, but at the moment it's not working; I will come back to it later.


Now, let's analyze what is happening. I cleaned up the original notebook down to the essentials so I could study what was written. The first thing to notice, according to Géron, is that the IMDB data has already been tokenized, where tokenization is the process of splitting text into smaller pieces called tokens [1]. The term apparently comes from the data security sector, where a token replaces the data assigned to it with some other encoding, saving space and computation without sacrificing the meaning of the thing being tokenized [2][3].

Now let's take a look at Géron's code:

Here we have the boilerplate imports and the dataset import (running this on Google Colab gives me no error, so I think the problem might be on GitHub's side?):

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

Then we have the preprocessing step, which limits each review to its first 300 characters for performance, replaces HTML line-break tags with spaces using regular expressions, splits X_batch into words, and finally returns X_batch.to_tensor(default_value=b"<pad>") together with y_batch:

def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

To limit the review size, tf.strings.substr() is called on X_batch and takes a substring from position 0 of length 300 [4]. The cleaned strings are then split into a 'ragged tensor', which according to TensorFlow stores data (feature sets) of non-uniform length [4][6]. In the last line, the ragged tensor is converted to a regular tf.Tensor, with default_value of <pad> filling the positions needed to pad each row to uniform length [4].
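The same three cleaning steps can be mimicked on a single Python string with the re module. This is my own illustration of the logic, not Géron's code:

```python
import re

def preprocess_one(review, max_chars=300):
    """Mimic the TF preprocess() on one plain Python string:
    truncate, replace <br> tags with spaces, keep only letters
    and apostrophes, then split into word tokens."""
    text = review[:max_chars]
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub(r"[^a-zA-Z']", " ", text)
    return text.split()
```

For example, `preprocess_one("Great!<br />Loved it.")` yields `['Great', 'Loved', 'it']`; the `<br />` tag and the punctuation are gone. The real preprocess() does the same thing, just vectorized over a whole batch of byte strings at once.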

Next, the Counter class from Python's collections module is imported, and a loop runs over X_batch and y_batch in datasets["train"].batch(32).map(preprocess). According to cloudxlab.com, this creates batches of 32 reviews and iteratively applies the preprocess function to each batch using map() [7].

from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

A second loop, nested inside the first, grabs each individual review from X_batch and updates the counter with list(review.numpy()), the list of (byte-string) words in that review.

Following this, Géron asserts that limiting the vocabulary to the 10,000 most common words retains enough information for accurate classification, stating earlier that ". . . you can generally tell whether a review is positive or not in the first sentence or two" (pg. 536). Therefore, a list comprehension is used with the most_common() method of the Counter object defined earlier as vocabulary = Counter(). (List comprehension is defined by w3schools as "a shorter syntax when you want to create a new list based on the values of an existing list" [8].)

vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]
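Here is a toy run of that truncation step with a made-up corpus (my own example), showing how most_common() orders words by frequency before the slice keeps the top entries:

```python
from collections import Counter

# Count word frequencies in a tiny fake corpus.
toy_vocab = Counter("the movie was the worst movie the worst".split())
# most_common() sorts by descending count; the slice keeps the top 2.
top_2 = [word for word, count in toy_vocab.most_common()[:2]]
```

Here `toy_vocab` is `{'the': 3, 'movie': 2, 'worst': 2, 'was': 1}`, so `top_2` is `['the', 'movie']` ('movie' wins the tie with 'worst' because ties keep first-seen order). The real code does exactly this, just with vocab_size = 10000.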

In the following part, the truncated vocabulary is made into a tensor by passing it through tf.constant(); this works because truncated_vocabulary is a tensor-like object (a list of strings) [9]. This method appears in chapter 13, page 432 of Géron's book [5].

words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

After the tensor is made from the truncated vocabulary, a lookup table is built with words as the keys and word_ids as the values, where word_ids is a range tensor constructed from the length of truncated_vocabulary. The table itself is then created by passing vocab_init and the integer num_oov_buckets to tf.lookup.StaticVocabularyTable; "oov" stands for out-of-vocabulary, and the buckets catch words that are not in the vocabulary [10].
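As a plain-Python sketch of what the lookup table does (my own illustration; TensorFlow uses a different hash function for the OOV buckets, so the exact bucket IDs differ), known words map to their IDs and unknown words land deterministically in one of the extra bucket IDs:

```python
def make_lookup(vocab, num_oov_buckets=1000):
    """Return a lookup function: known words -> their index,
    unknown words -> a deterministic OOV bucket ID past the vocab."""
    ids = {word: i for i, word in enumerate(vocab)}
    def lookup(word):
        if word in ids:
            return ids[word]
        # Toy hash: any unseen word gets a stable ID in
        # [len(vocab), len(vocab) + num_oov_buckets).
        return len(vocab) + (sum(ord(c) for c in word) % num_oov_buckets)
    return lookup

lookup = make_lookup(["the", "movie", "was", "great"])
```

With this toy table, `lookup("movie")` is 1, while an unseen word like "zzz" always maps to the same ID at or above 4, so the model can still assign it an embedding.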

In the small function defined after constructing the lookup table, table.lookup() maps each word in X_batch to its ID, and y_batch is returned unchanged. The training set is then defined, undergoes batching and preprocessing, and is finally mapped through encode_words.

def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

After all of the data cleaning and preprocessing is done, the neural network is created. According to Géron, keras.layers.Embedding() creates a trainable dense vector for each word ID passed through it. The advantage of a dense vector over a sparse one is that it can carry more information per dimension, since most of its entries are non-zero, whereas a sparse (e.g. one-hot) feature vector is mostly zeros [11].

embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, epochs=5)

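To make the dense-versus-sparse contrast concrete, here is a toy illustration of my own (the embedding numbers are made up, not learned weights): the one-hot vector for a word is mostly zeros, while the corresponding embedding row is a short, fully dense vector.

```python
vocab = ["the", "movie", "was", "great"]
word_id = vocab.index("movie")

# Sparse representation: one entry per vocabulary word, all zero but one.
one_hot = [1 if i == word_id else 0 for i in range(len(vocab))]

# Dense representation: each row is a short vector of (here, fake) values;
# the Embedding layer would learn these during training.
embedding_matrix = [[0.1, -0.2], [0.4, 0.3], [0.0, 0.5], [0.9, -0.1]]
dense = embedding_matrix[word_id]
```

Here `one_hot` is `[0, 1, 0, 0]` (length grows with the vocabulary), while `dense` is `[0.4, 0.3]` (fixed length embed_size, every entry informative), which is exactly why the model uses an Embedding layer rather than one-hot inputs.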
The actual neural network is composed of two GRU layers [5]. GRU stands for "Gated Recurrent Unit", first described in the 2014 paper by K. Cho and co-workers as a strategy against forgetting [12][13][14][15]. The sigmoid activation function is defined by the following mathematical function:


$ s(h) = \frac{1}{(1 + \exp(-h))} $


This is combined with the linear regression function, which is:


$ y_i = x_{i}^{T}\beta + \epsilon_i $


From what I understand, the error term, $\epsilon_i$, is set to zero and the sigmoid activation function is composed with the linear regression function in the following way:


let

$ s(h) = \frac{1}{(1 + \exp(-h))} $

and

$ h = x_{i}^{T}\beta $

then


$ y_i = s(x_{i}^{T}\beta) = \frac{1}{1 + \exp(-x_{i}^{T}\beta)} = \frac{1}{1 + e^{-x_{i}^{T}\beta}} $


Where $x_{i}^{T}$ is a feature vector (here, a tensor feature set) and $\beta$ is a parameter vector [16][17][18][19][20][21].
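Plugging numbers into the sigmoid makes the squashing behaviour concrete; a minimal sketch:

```python
import math

def sigmoid(h):
    """s(h) = 1 / (1 + exp(-h)): maps any real h into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-h))
```

For instance, `sigmoid(0)` is exactly 0.5, `sigmoid(3)` is about 0.953, and `sigmoid(-3)` is about 0.047, which is why the final Dense(1, activation="sigmoid") layer's output can be read directly as the probability that a review is positive.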

So, at this point, I know quite a bit about the sentiment analysis neural network from Géron's book. I will need to brush up on the specific details of the recurrent neural network (RNN) developed by Cho and co-workers, but for now I think I have enough to begin analyzing earnings sentiment using an approach similar to this example RNN.


References:

[1] Johnson, D., "NLTK Tokenize: Words and Sentences Tokenizer with Example," Guru99.com (accessed Apr. 4, 2022).

[2] Lutkevich, B., "Tokenization," techtarget.com (accessed Apr. 4, 2022).

[3] Wikipedia: Tokenization (accessed Apr. 4, 2022).

[4] CloudxLab: Defining The Preprocess Function (accessed Apr. 4, 2022).

[5] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition, O'Reilly Media, Sebastopol, CA, 2019.

[6] TensorFlow: Ragged Tensors (accessed Apr. 4, 2022).

[7] CloudxLab: Creating the Final Train and Test Sets (accessed Apr. 4, 2022).

[8] W3Schools: Python - List Comprehension (accessed Apr. 4, 2022).

[9] TensorFlow: tf.constant (accessed Apr. 4, 2022).

[10] TensorFlow: tf.lookup.KeyValueTensorInitializer (accessed Apr. 4, 2022).

[11] C. Horan, "Understanding Vectors From a Machine Learning Perspective," neptune.ai (accessed Apr. 4, 2022).

[12] TensorFlow: Gated Recurrent Unit - Cho et al. (accessed Apr. 4, 2022).

[13] Cho, K. et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," Cornell University, 2014.

[14] DeepMind: Improving the Gating Mechanism of Recurrent Neural Networks (accessed Apr. 4, 2022).

[15] Jin, H. et al., "Gating Mechanism in Deep Neural Networks for Resource-Efficient Continual Learning," Institute of Electrical and Electronics Engineers, 10, 18776-18786.

[16] Yale: Linear Regression (accessed Apr. 4, 2022).

[17] Wikipedia: Linear regression (accessed Apr. 4, 2022).

[18] G. Chavez, "Understanding Logistic Regression step by step," towardsdatascience.com (accessed Apr. 4, 2022).

[19] IBM: Logistic Regression: What is logistic regression? (accessed Apr. 4, 2022).

[20] Paperswithcode.com: Sigmoid Activation (accessed Apr. 4, 2022).

[21] M. Saeed, "A Gentle Introduction To Sigmoid Function," machinelearningmastery.com (accessed Apr. 4, 2022).