
A Guide to TensorFlow: Linear Regression (Part 5)

Series Introduction

This blog is part of "A Guide To TensorFlow", where we explore the TensorFlow API and use it to build multiple machine learning models for real-life examples. Up to this point we have explored much of the TensorFlow API; in this guide we will put that knowledge to use and build a simple machine learning model. This guide is about linear regression.
Check out the other parts of the series: Part 1, Part 2, Part 3 and Part 4

Linear Regression

If you don't know what linear regression is, I would recommend going through Understanding Gradient Descent; it covers linear regression in detail, as well as gradient descent, the algorithm that makes linear regression work. TL;DR: Linear regression is a way to find an equation that models the relationship between a dependent variable \(Y\) and an explanatory variable \(X\).
The overall idea of regression is to examine two things:

  1. Does a set of explanatory variables do a good job of predicting a dependent variable?
  2. Which explanatory variables in particular have a significant influence on the value of the dependent variable?

A general equation will look like
$$y(x_1, x_2, \cdots, x_k) = w_1x_1 + w_2x_2 + \cdots + w_kx_k + b$$
This expression can be represented in matrix form as \(Y = XW + B\) (the order of the product matches the matrix multiplication we will write in code below).
Here, \(Y\) is the dependent value matrix, \(W\) is the weight matrix, \(X\) is the input matrix and \(B\) is the bias matrix.
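
To make the shapes concrete for the problem we are about to solve, where each of the \(N\) examples has two input features, the matrix product works out as:
$$\underbrace{Y}_{N \times 1} = \underbrace{X}_{N \times 2} \, \underbrace{W}_{2 \times 1} + \underbrace{B}_{N \times 1}$$
Each row of \(X\) holds one example's \((x_1, x_2)\) values and \(W\) stacks the weights \(w_1, w_2\); in the code below, \(B\) is a single scalar bias that is broadcast across all the rows.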

So let's try to model this in TensorFlow.

The Problem

A health organization has asked us to build a model that predicts a patient's ideal blood fat content based on their weight and age. The organization has provided us with data from 25 healthy individuals. (Link to Problem)

Data

The following is a sample from the dataset.

Weight  Age  Blood Fat Content
84      46   354
73      20   190
65      52   405
70      30   263
76      57   451

The first column is the weight of the person and the second is the person's age; these constitute our input variables. The third column is the blood fat content, which we have to predict.

The Code

The following template gives the gist of the overall code skeleton of our graph.

import tensorflow as tf

...  # Declare Variables Here

def inference(X):
    ... # Return the predicted value for given input `X`

def loss(X, Y):
    ... # Return the loss value for a given prediction

def inputs():
    ... # Define Inputs

def train(total_loss):
    ... # Set the learning rate and use an optimizer to minimize the `total_loss`

def evaluates(sess, X, Y):
    ... # Evaluate the model for sample values of weight and age

with tf.Session() as sess:
    ... # Initialize variables and run the session

Let's start by importing TensorFlow and declaring the variables we are going to use. The first variable is W, which stores the weights; it is a matrix of shape [2, 1] (one weight per input feature), initialized with all values equal to zero. The next variable is the bias b, also initialized to zero. We use tf.Variable to create each of them.

import tensorflow as tf

W = tf.Variable(tf.zeros([2, 1]), name="weights")  # one weight per input feature (weight, age)
b = tf.Variable(0., name="bias")  # a single scalar bias
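
If you want to peek at the initial values (purely a sanity check, not part of the final program), you could run:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(W))  # prints a 2x1 matrix of zeros
    print(sess.run(b))  # prints 0.0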

Defining The Input

For this guide we will hard-code the data; however, it is fairly easy to read it from a CSV file instead, which is the approach we will take in future guides.
Weight and age are defined as a 2-D matrix and the corresponding output values as a list (a 1-D matrix). Both are converted to the float data type before being returned.

def inputs():
    weight_age = [[84, 46], [73, 20], [65, 52], [70, 30], [76, 57], [69, 25], [63, 28], [72, 36], [79, 57], [75, 44], [27, 24], [89, 31], [65, 52], [57, 23], [59, 60], [69, 48], [60, 34], [79, 51], [75, 50], [82, 34], [59, 46], [67, 23], [85, 37], [55, 40], [63, 30]]
    blood_fat_content = [354, 190, 405, 263, 451, 302, 288, 385, 402, 365, 209, 290, 346, 254, 395, 434, 220, 374, 308, 220, 311, 181, 274, 303, 244]
    return tf.to_float(weight_age), tf.to_float(blood_fat_content)

By calling this function, we can get access to the data whenever required. We can store these values using the following statement: X, Y = inputs()
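
As a quick sanity check on the tensor shapes (this snippet is just for illustration and is not part of the final program):

X, Y = inputs()
print(X.shape)  # (25, 2) -- 25 examples with 2 features each
print(Y.shape)  # (25,)   -- one target value per example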

Inference Method

This method returns the prediction using the general equation we devised above.

def inference(X):
    # X has shape [N, 2] and W has shape [2, 1], so the result has shape [N, 1]
    return tf.matmul(X, W) + b

Computing the Loss

Now we have to define how to compute the loss. For this simple model we will use the squared error, which sums the squared difference between the predicted value and the expected value for each training example. Algebraically, it is the squared Euclidean distance between the predicted output vector and the expected one. Graphically, in a 2-D dataset, it is the length of the vertical line you can trace from an expected data point to the predicted regression line.

$$\text{Loss} = \sum_i \left( y_i - y_{\text{predicted},\, i} \right)^2$$

Programmatically, this can be modeled as follows:

def loss(X, Y):
    # squeeze the [N, 1] predictions to shape [N] so they line up with Y;
    # without this, broadcasting would silently produce an [N, N] matrix
    Y_predicted = tf.squeeze(inference(X))
    return tf.reduce_sum(tf.squared_difference(Y, Y_predicted))
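
A common variation (not the one we use here) is the mean squared error, which averages instead of sums; averaging keeps the scale of the loss, and hence of the gradients, independent of the dataset size. A minimal sketch, using a hypothetical name loss_mse:

def loss_mse(X, Y):
    Y_predicted = tf.squeeze(inference(X))  # shape [N] to match Y
    # mean instead of sum: gradient scale no longer grows with dataset size
    return tf.reduce_mean(tf.squared_difference(Y, Y_predicted))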

Training Function

We will first define the learning rate for the model and then we will use the gradient descent optimizer for optimizing the model parameters.

def train(total_loss):
    learning_rate = 0.0000001
    return tf.train.GradientDescentOptimizer(learning_rate).minimize(total_loss)

Let's break this down to further understand what's happening inside the train function.

learning_rate

The learning rate, in an intuitive sense, is how quickly a model abandons old beliefs for new ones. Deciding on a learning rate is a complex and often iterative process. Here, since we feed the raw (unnormalized) weight and age values into the model and sum the squared errors rather than averaging them, the gradients are large, so we go for a very small learning rate to keep training from diverging.
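
To see why such a tiny rate is needed here, look at the gradient of the summed squared error with respect to, say, \(w_1\):
$$\frac{\partial\, \text{Loss}}{\partial w_1} = -2 \sum_i \left( y_i - y_{\text{predicted},\, i} \right) x_{i,1}$$
With 25 examples, weight values around 55-90, and initial errors in the hundreds (the predictions start at zero), this sum is on the order of \(10^5\) to \(10^6\), so a step size near \(10^{-7}\) keeps each update to a reasonable magnitude.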

Note: In the article 'Understanding Gradient Descent' we talk about a parameter \(\alpha\) in an equation that looks like this:
$$w_0 \leftarrow w_0 + \alpha \, (y - h_w(x))$$
That \(\alpha\) is the same learning rate we are setting here.

tf.train.GradientDescentOptimizer

This is the optimizer that implements the gradient descent algorithm; you can see its implementation here, which in turn uses optimizer.py.
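As a side note, minimize is essentially a convenience wrapper around two lower-level optimizer calls; a rough sketch of the equivalent, using the same TF 1.x API, would be:

opt = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = opt.compute_gradients(total_loss)  # list of (gradient, variable) pairs
train_op = opt.apply_gradients(grads_and_vars)      # applies one descent step to each variable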
It is important to note that tf.train.GradientDescentOptimizer is designed to use a constant learning rate for all variables in all steps. TensorFlow also provides out-of-the-box adaptive optimizers, including tf.train.AdagradOptimizer and tf.train.AdamOptimizer, which can be used as drop-in replacements, as sketched below.
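
For example, swapping in Adam only changes the optimizer line. The rate shown here (0.01) is just an illustrative guess, not a tuned value; adaptive optimizers usually tolerate much larger rates than plain gradient descent:

def train(total_loss):
    learning_rate = 0.01  # illustrative value, not tuned for this dataset
    return tf.train.AdamOptimizer(learning_rate).minimize(total_loss)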

Evaluate Function

This function will check our model for specified weight and age values.

def evaluates(sess, X, Y):
    # predict the blood fat content for two sample (weight, age) pairs
    print(sess.run(inference([[80., 25.]])))
    print(sess.run(inference([[65., 25.]])))

This function requires a session, which brings us to our next step: we will create our session and call evaluates from there.

Session

The first step is to initialize all the variables:

with tf.Session() as sess:
    tf.global_variables_initializer().run()

In the next step we take the data input using our inputs function and then define our loss function for that data in total_loss.

    X, Y = inputs()
    total_loss = loss(X, Y)

Now we define the training op for our model; we use the train function we defined earlier and pass total_loss to it.

    train_op = train(total_loss)

In TensorFlow, tf.Session objects are designed to run multithreaded. Here's how it works: multiple threads prepare training examples and push them into a queue, and a training thread then executes a training op that dequeues mini-batches from that queue. This is great because it makes maximum use of the available computational resources. However, TensorFlow queues can't run without proper threading, and threading isn't exactly pleasant in Python. To handle this, the TensorFlow library comes with tf.train.Coordinator and tf.train.QueueRunner: QueueRunner creates a number of threads that cooperate to enqueue tensors into the same queue, while Coordinator helps multiple threads stop together and report exceptions to a program that waits for them to stop.
So we first start off by defining our coordinator:

    coord = tf.train.Coordinator()

Now we start the queue runners using tf.train.start_queue_runners; this function starts all the queue runners collected in the graph.

    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

The rest of the code is pretty straightforward: we define the number of training steps and run our training op that many times. After this, we can call our evaluates function to check the model. Finally, we stop all the threads using coord.request_stop(), join them using coord.join(threads), and close the session.

    training_steps = 1000
    for step in range(training_steps):
        sess.run([train_op])

    evaluates(sess, X, Y)

    coord.request_stop()
    coord.join(threads)
    sess.close()
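
If you would like to watch the loss fall as training progresses, a small variation of the training loop above prints it every 100 steps (the interval is just a convenience choice):

    training_steps = 1000
    for step in range(training_steps):
        sess.run([train_op])
        if step % 100 == 0:
            # evaluate the current loss over the full dataset
            print("step:", step, "loss:", sess.run(total_loss))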

Problems with Linear Regression

Linear regression works only if certain underlying assumptions hold. In order for the model to actually be usable in practice, the data should conform to these assumptions of linear regression.

  1. The regression model is linear in its parameters
    This means that the data more or less conforms to the following general equation
    $$Ax + By + C = 0$$
    Even if \(x\) and \(y\) are raised to some power, the equation remains linear in the parameters \(A\), \(B\) and \(C\).

  2. The average residual is zero
    For the line we fit to the data, the mean of the residuals (the differences between the observed and the predicted values) across all data points is zero; a quick way to check this on a trained model is sketched below.
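
A minimal sketch of that check, assuming it runs inside the training session after the loop (X, Y, and inference are the ones defined earlier):

    # the mean residual should be close to zero for a well-fit linear model
    residuals = Y - tf.squeeze(inference(X))
    print(sess.run(tf.reduce_mean(residuals)))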

There are many more of these assumptions; you can read about them here.

The nature of this model also makes it extremely sensitive to outliers: datasets whose points are widely spread around the fitted line do not give good results.

Most important of all, most real-world problems are non-linear, which makes linear regression a good fit for only certain kinds of problems. Linear regression still has many applications and real-world use cases, but it is extremely important to find a way to add non-linearity to the model in order to handle the remaining cases.
This is what we will cover in the next guide.