
A Guide to TensorFlow: Logistic Regression (Part 6)

Series Introduction

This blog is a part of "A Guide To TensorFlow", where we explore the TensorFlow API and use it to build multiple machine learning models for real-life examples. Up to now we have explored much of the TensorFlow API; in this guide we will use that knowledge to build a simple machine learning model. This guide is about logistic regression.
Check out the other parts of the series: Part 1, Part 2, Part 3, Part 4 and Part 5

Logistic Regression

In the previous blog we highlighted a few points on why linear regression doesn't always work. There needs to be a way to add non-linearity to our algorithm. Logistic Regression is one of the ways to do that. It's borrowed from statistics and usually is the go-to algorithm for yes-no type questions, or to put it in more general terms, binary classification.

There is a function used commonly in machine learning called the logistic function. It is also known as the sigmoid function, because its shape is an S (sigma being the Greek letter equivalent of S).

Mathematically sigmoid is expressed as: $$\sigma(x) = \frac{1}{1+e^{-x}}$$

Sigmoid

It is essentially an 'S'-shaped curve that flattens out towards 0 for large negative inputs and rises gradually towards 1 for large positive inputs. In other words, the logistic function is a probability function that, given a specific input value, computes the probability of the output being a success, and thus the probability of the answer to the question being "yes".

In TensorFlow you can simply use tf.sigmoid() to apply the sigmoid function to a particular input.
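As a quick standalone illustration (not part of the model code below), here is how tf.sigmoid squashes a few sample values into the (0, 1) range; the printed numbers are approximate:

import tensorflow as tf

x = tf.constant([-6.0, -2.0, 0.0, 2.0, 6.0])
with tf.Session() as sess:
    print(sess.run(tf.sigmoid(x)))
    # approximately [0.0025, 0.12, 0.5, 0.88, 0.9975]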

Sigmoid for Yes or No

Logistic regression models the probability of the default class (e.g. the first class). For example, if we are modeling people's gender as male or female from the length of their hair, then the first class could be male, and the logistic regression model could be written as the probability of male given a person's hair length, or more formally:
$$P(gender = male \mid hair\text{-}length)$$
This value will range between 0 and 1, and for any given value of hair-length a prediction can be made for gender.

Given \(X\) as the explanatory variable and \(Y\) as the response variable, the linear regression model represents the relationship between \(P(X)\) and X as: $$P(X)=W\cdot X + B$$
Now this has to be transformed into binary values. In principle a linear function can output values greater than 1 and even less than 0, so in order to make an actual probability prediction we need to use the logistic function to squash the linear output to a value between 0 and 1:
$$P(X) = \frac{1}{1+e^{-(W\cdot X + B)}}$$
This can be rewritten as
$$\ln\Bigl(\frac{P(X)}{1-P(X)}\Bigr) = W\cdot X + B$$
The term \(\frac{P(X)}{1-P(X)}\) inside the logarithm is called the odds; it can take any value between 0 and \(\infty\), and \(\ln(odds)\) is called the logit.
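As a sanity check of this relationship (a standalone sketch, not part of the model below), applying the log-odds transform to the sigmoid output recovers the original linear value:

import tensorflow as tf

z = tf.constant([-2.0, 0.0, 3.0])     # hypothetical linear outputs W·X + B
p = tf.sigmoid(z)                     # probabilities between 0 and 1
logit = tf.log(p / (1 - p))           # log-odds, recovers z
with tf.Session() as sess:
    print(sess.run(p))                # approximately [0.12, 0.5, 0.95]
    print(sess.run(logit))            # approximately [-2.0, 0.0, 3.0]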

The Dataset

We are going to use the dataset from the Titanic survivor Kaggle contest. The model will have to infer, based on a passenger's age, sex and ticket class, whether the passenger survived or not.
The following is the data dictionary for our dataset:

Variable    Definition                                    Key
survival    Survival                                      0 = No, 1 = Yes
pclass      Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

The Code

The following template gives the gist of the overall code skeleton of our graph.

import tensorflow as tf

...  # Declare Variables Here

def combine_inputs(X):
    ... # Multiplies the input and weight matrix, adds bias and returns the value
    
def inference(X):
    ... # Returns the sigmoid of the combined inputs

def loss(X, Y):
    ... # Implementation of loss function

def read_csv(batch_size, file_name, record_defaults):
    ... # Function to import data from csv file
    
def inputs():
    ... # Define inputs, convert categorical data to float and stack it all up in a matrix with one example per row
    
def train(total_loss):
    ... # Using gradient descent optimizer to minimize the total loss

def evaluate(sess, X, Y):
    ... # Evaluate the regression model

# Launch the graph in a session, setup boilerplate
with tf.Session() as sess:
    ... # Initialize variables, run the training loop and evaluate the model

Let's start by importing TensorFlow and declaring the variables we are going to use. The first variable is W for storing the weights; it is a matrix of shape [5, 1], initialized with all values equal to zero. The next variable is the bias b. We use tf.Variable to create each of them.

import tensorflow as tf
import os

# same parameter and variable initialization as in linear regression
W = tf.Variable(tf.zeros([5, 1]), name="weights")
b = tf.Variable(0., name="bias")

Defining Input

def inputs():
    passenger_id, survived, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked = \
        read_csv(100, "train.csv", [[0.0], [0.0], [0], [""], [
                 ""], [0.0], [0.0], [0.0], [""], [0.0], [""], [""]])

So first we use a custom function read_csv to read our training samples. Of the columns in the dataset, the only ones relevant to our problem are pclass, sex, age and survived.

Here pclass and sex are categorical data, which we need to represent in some numerical form that can be used for computation. A naive way to do this is to assign a numerical value to each label: for pclass, say, we assign "1" for First Class, "2" for Second Class and "3" for Third Class. There is one major issue with this approach: it assumes a linear relationship amongst the labels which does not really exist.

What this form of organization presupposes is an ordering of the categorical values: First Class > Second Class > Third Class. Suppose our model internally calculates an average; then (1 + 2 + 3 + 2) / 4 = 2, which implies that two 2nd Class tickets, one 1st Class ticket and one 3rd Class ticket average out to a 2nd Class ticket. This is a recipe for disaster, and the model's predictions would have a lot of errors. Intuitively it may seem acceptable to do this with ticket classes, but suppose these labels were shoe brand preferences (say Nike, Adidas, Asics); introducing such a scheme would give the model the false pretext that a linear relationship exists between these shoe brands.

To appropriately represent these classes we use a technique called one-hot encoding, which is the process of "binarization" of the data. What we essentially do is convert each categorical label into its own class. In the case of passenger class, we create three new features, is_first_class, is_second_class and is_third_class, each taking a value of either 1 or 0. For gender, with only two values, it is fine to go with a single variable, because you can express a linear relationship between the values: if the possible values are female = 1 and male = 0, then male = 1 - female, so a single weight can learn to represent both possible states.

    # convert categorical data
    is_first_class = tf.to_float(tf.equal(pclass, [1]))
    is_second_class = tf.to_float(tf.equal(pclass, [2]))
    is_third_class = tf.to_float(tf.equal(pclass, [3]))

    gender = tf.to_float(tf.equal(sex, ["female"]))
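To see what these conversions produce, here is a small standalone example with made-up values (not part of the inputs function): tf.equal compares each element against the given class and tf.to_float turns the resulting booleans into 0.0/1.0 indicators.

import tensorflow as tf

pclass = tf.constant([3, 1, 2, 3])   # hypothetical ticket classes for four passengers
is_first_class = tf.to_float(tf.equal(pclass, [1]))
with tf.Session() as sess:
    print(sess.run(is_first_class))  # [0. 1. 0. 0.]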

Finally we stack all the features into a single matrix and then transpose it to get a matrix with one example per row and one feature per column. tf.stack stacks the desired variables into one tensor, then tf.transpose performs a 2D transpose on the data. We save the result in a variable features, and we also create a variable survived to hold the survival status of the passengers.

    features = tf.transpose(tf.stack([is_first_class, is_second_class, is_third_class, gender, age]))
    survived = tf.reshape(survived, [100, 1])

    return features, survived
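To make the shapes concrete, here is a small standalone sketch with a hypothetical batch of four passengers (the real code uses a batch of 100): tf.stack joins the five feature tensors along a new first dimension, giving shape [5, 4], and tf.transpose flips it to [4, 5], one example per row and one feature per column.

import tensorflow as tf

# hypothetical batch of four passengers
is_first_class  = tf.constant([1.0, 0.0, 0.0, 0.0])
is_second_class = tf.constant([0.0, 1.0, 0.0, 0.0])
is_third_class  = tf.constant([0.0, 0.0, 1.0, 1.0])
gender          = tf.constant([1.0, 0.0, 1.0, 0.0])
age             = tf.constant([29.0, 35.0, 2.0, 54.0])

features = tf.transpose(tf.stack([is_first_class, is_second_class, is_third_class, gender, age]))
print(features.shape)  # (4, 5): one row per passenger, one column per feature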

Importing The Dataset

In the previous tutorial we entered the data in code itself, here we shall write a generic function to input csv data:

def read_csv(batch_size, file_name, record_defaults):
    filename_queue = tf.train.string_input_producer(
        [os.path.join(os.getcwd(), file_name)])

    reader = tf.TextLineReader(skip_header_lines=1)
    key, value = reader.read(filename_queue)

    decoded = tf.decode_csv(value, record_defaults=record_defaults)

    return tf.train.shuffle_batch(decoded,
                                  batch_size=batch_size,
                                  capacity=batch_size * 50,
                                  min_after_dequeue=batch_size)

This function takes the parameters batch_size, file_name and record_defaults; we shall see what they mean shortly, but first let's look at a few new functions used above.

  1. tf.train.string_input_producer: Outputs strings (e.g. filenames) to a queue for an input pipeline.
  2. tf.TextLineReader: Outputs the lines of a file delimited by newlines. We use its read function to read values from the CSV file.
  3. tf.decode_csv: Converts CSV records to tensors, with each column mapping to one tensor. decode_csv converts a tensor of type string (the text line) into a tuple of tensor columns with the specified defaults, which also set the data type for each column.
  4. tf.train.shuffle_batch: This function actually reads the file and loads batch_size rows into a single tensor. It creates batches by randomly shuffling tensors and returns them.

With read_csv defined, we can get access to the data whenever required by simply calling the inputs function. We can store these values using the following statement: X, Y = inputs()

Combine Inputs Method

Before applying the sigmoid function in inference we need to combine the inputs: this function multiplies the input matrix by the weight matrix, adds the bias and returns the result.

def combine_inputs(X):
    return tf.matmul(X, W) + b
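As a shape check (a standalone sketch assuming the batch size of 100 used above): X has shape [100, 5], W has shape [5, 1], so tf.matmul(X, W) has shape [100, 1], and the scalar bias b is broadcast across the batch, giving one combined value (a logit) per passenger.

import tensorflow as tf

X = tf.zeros([100, 5])               # hypothetical batch: 100 passengers, 5 features
W = tf.Variable(tf.zeros([5, 1]), name="weights")
b = tf.Variable(0., name="bias")
combined = tf.matmul(X, W) + b
print(combined.shape)                # (100, 1): one logit per passenger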

Inference Method

This method applies the sigmoid function on the value returned by combine_inputs function.

def inference(X):
    return tf.sigmoid(combine_inputs(X))

The Loss Function

We could have used the L2 loss function just like in our previous tutorial; however, since the output we expect from our model is a probability value between 0 and 1, we will use a much better suited loss function called cross entropy.

Consider two scenarios. Suppose for a particular example the expected answer is "yes", but our model predicts a very low probability for it, close to 0; this means our model is almost 100% sure that the answer is "no". Now consider a scenario where our model predicts 20%, 30% or even 50% for a "no" output. L2 does not penalize the confidently wrong prediction much more heavily than the mildly wrong ones; its penalty stays bounded.
If we plot the cross-entropy loss against the L2 loss we see that cross entropy penalizes the model much more heavily as the output gets further from what is expected.
L2-vs Logistic

With cross entropy, as the predicted probability comes closer to 0 for the "yes" example, the penalty grows towards infinity. This makes it extremely costly for the model to make that kind of misprediction after training, and it is what makes cross entropy better suited as a loss function for this model.
$$Loss = -\sum_i \bigl( y_i \cdot \log(y_{predicted_i}) + (1-y_i)\cdot\log(1-y_{predicted_i})\bigr)$$

In TensorFlow, we can implement this as follows.

def loss(X, Y):
    return tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=combine_inputs(X), labels=Y))
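Note that we pass the raw output of combine_inputs (the logits) to the loss rather than the sigmoid output of inference: tf.nn.sigmoid_cross_entropy_with_logits applies the sigmoid internally in a numerically stable way. The following standalone sketch, with made-up logits and labels, compares a manual computation of the formula above against the built-in op; both print roughly 0.30.

import tensorflow as tf

logits = tf.constant([[2.0], [-1.0], [0.5]])   # hypothetical combined inputs
labels = tf.constant([[1.0], [0.0], [1.0]])    # hypothetical true answers

# manual cross entropy: -(y*log(p) + (1-y)*log(1-p)) with p = sigmoid(logits)
p = tf.sigmoid(logits)
manual = -(labels * tf.log(p) + (1 - labels) * tf.log(1 - p))
builtin = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=labels)

with tf.Session() as sess:
    print(sess.run(tf.reduce_mean(manual)))    # approximately 0.30
    print(sess.run(tf.reduce_mean(builtin)))   # approximately 0.30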

Training and Evaluation

We define the training function just like in the previous tutorial, as follows:

def train(total_loss):
    learning_rate = 0.01
    return tf.train.GradientDescentOptimizer(learning_rate).minimize(total_loss)

To evaluate the results we are going to run the inference against a batch of the training set and count the number of examples that were correctly predicted. We call that measuring the accuracy.

def evaluate(sess, X, Y):
    predicted = tf.cast(inference(X) > 0.5, tf.float32)
    print(sess.run(tf.reduce_mean(tf.cast(tf.equal(predicted, Y), tf.float32))))

As the model computes the probability of the answer being yes, we convert it to a positive answer if the output for an example is greater than 0.5. Then we compare it with the actual value using tf.equal. Finally, we use tf.reduce_mean, which counts all of the correct answers (as each of them adds 1) and divides by the total number of samples in the batch, giving the percentage of right answers.
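For example, with hypothetical values (not from the actual dataset): if the thresholded predictions for a batch of four passengers are [1, 0, 1, 0] and the actual labels are [1, 1, 1, 0], three of the four match, so the accuracy is 0.75.

import tensorflow as tf

predicted = tf.constant([1.0, 0.0, 1.0, 0.0])   # hypothetical thresholded predictions
actual    = tf.constant([1.0, 1.0, 1.0, 0.0])   # hypothetical labels
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, actual), tf.float32))
with tf.Session() as sess:
    print(sess.run(accuracy))                   # 0.75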

Launching the Session

The following piece of code creates a session, initializes all variables, and trains as well as tests our model.

with tf.Session() as sess:

    tf.global_variables_initializer().run()

    X, Y = inputs()
    total_loss = loss(X, Y)
    train_op = train(total_loss)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    # actual training loop
    training_steps = 1000
    for step in range(training_steps):
        sess.run([train_op])
        # for debugging and learning purposes, see how the loss decreases over the training steps
        if step % 10 == 0:
            print("loss: ", sess.run([total_loss]))

    evaluate(sess, X, Y)
    import time
    time.sleep(5)

    coord.request_stop()
    coord.join(threads)
    sess.close()

Wrapping Up

In the next guide we will try to build our own neural network using TensorFlow. Before that, it is very important to understand the math behind neural networks; in my blog Understanding Neural Networks, you will find a comprehensive explanation of what neural networks are and a very intuitive explanation of how they work.