Friendly Introduction to Gradient Descent with Logistic Regression

Friendly introduction to Gradient Descent with Logistic Regression

Hi Folks! in the last post, we got an abstract idea about what a neural network is and what kind of role each neuron in the network plays. We learned how a set of parameter values(weight matrix) determine what kind of task the neural network will perform. Meaning, when we say a neural network has learned to identify a cat, it means we have figured out a weight matrix which enables the neural network to make the most accurate predictions. The algorithm used by the neural network to figure out the weight matrix is Gradient Descent. In this post we will try to understand this algorithm with Logistic Regression.

It is near to impossible to explain gradient descent without using mathematical terms. So in this post we will use a bit calculus to derive certain concepts but we will try to do it in an interesting way.

What is Logistic Regression?

Logistic Regression is a learning algorithm we use in binary classification problems like cat or not cat, dog or not dog or spam or not spam. The Logisitic Regression concept we need in supervised learning problems can be expressed in simple mathemetical terms:

Given a pic of a dog, what is the probability that it is a dog? This can also be stated as

Given an input \(X\), \(\hat{y} = P(y=1|X)\)

Where \(P(y=1|X)\) is the probability that \(y=1\) given \(X\).

Here it would be helpful to understand what kind of input \(X\) are we talking about. First have a look the pic below:

input vector x of a pic

The left most pic is what we see and the three-dimensional matrix is what the computer sees as the cat and rightmost column matrix is how we feed in the pixel values of the pic as input vector \(X\).

In the last post, we also derived an output equation of a basic (unit) neural network as:

$$y=\sigma(B + w_1x_1 + w_2x_2)$$

where \(B\),\(w_1\) and \(w_2\) are the parameters(weights and biases) of the network and \(x_1\), \(x_2\) are the inputs. Now we may write the above equation as:

$$\hat{y}=\sigma(z)$$

where \(z = B + w_1x_1 + w_2x_2 = W^TX+B\), where \(W=\begin{bmatrix}w_1\\w_2\end{bmatrix}\) and \(X=\begin{bmatrix}x_1\\x_2\end{bmatrix}\) and \(W^TX\) is a dot product. As a quick reference of dot product, see the example below:

$$\begin{bmatrix}1&2\end{bmatrix}\begin{bmatrix}1&2&3\\2&3&-4\end{bmatrix}=\begin{bmatrix}5&8&-5\end{bmatrix}$$

Now, in the case of our Logistic Regression, we may say given \(x\), and parameters \(w\),\(b\) \(\in\) \(R\)(real number space), how can we get the output \(\hat{y}\)? One possible option would be to say,\(\hat{y} = w^Tx+b\), but we have already defined above that \(\hat{y}\) is a probability meaning \(0\le\hat{y}\le1\) . So it best fits to write \(\hat{y}\) as

$$\hat{y} = \sigma(w^Tx+b)$$

And this equation perfectly defines a logistic regression model.

Now considering the above, the logistic regression model works as shown in the following pic:

logistic regression neural network

You can see that logistic regression model is actually a very simple(unit) neural network. The model predicted that it is 73% sure that it is a cat.

$$\sigma(\begin{bmatrix}w_0\\w_1\\w_2\\.\\.\\w_{12287}\end{bmatrix}^T\begin{bmatrix}x_0\\x_1\\x_2\\.\\.\\x_{12287}\end{bmatrix}+\begin{bmatrix}b\\x_1\\b\\.\\.\\b\end{bmatrix})=0.73$$

Now for a labelled example(meaning we knew the pic was that of a cat) in which case the output was \(y=1\), we can caculate that the error is \(1-0.73=0.27\). However this is not how we calculate the prediction error. One way of calculating the prediction error is \(\frac{1}{2}(y-\hat{y})^2\), where \(y\) is the true output and \(\hat{y}\) is the predicted output. But again, this is also not found to be very useful, so we use \(log\) based function as defined below:

$$L(\hat{y},y) = -ylog\hat{y}-(1-y)log(1-\hat{y})$$

This is not complicated as it may seem to be. The idea is that:

if \(y=1\), then \(L(\hat{y},y) = -ylog\hat{y}\).

if \(y=0\), then \(L(\hat{y},y) = -(1-y)log(1-\hat{y})\).

This basically means that if y=1, we want \(\hat{y}\) to be as large as possible and if y=0, we want \(\hat{y}\) to be as small as possible.

Now for \(m\) labelled examples or training examples, we define something called cost function as shown below:

$$L(w,b) = -\frac{1}{m}\sum_{i=1}^m(y^{(i)}log\hat{y}^{(i)}+(1-y^{(i)})log(1-\hat{y}^{(i)})$$

where \((y^{(i)}\), \(\hat{y}^{(i)})\) are the \(i^{th}\) true value, predicted value pair. Notice that the cost function is defined in terms of parameter values \(w\) and \(b\). This means that we need to find parameters \(w\) and \(b\) that will minimise the Cost Function \(L(w,b)\). And this is the job of Gradient Descent.

Gradient Descent

The Cost Function is the main thing in Learning Algorithms because it tells us how good are our parameters in making accurate predictions. For the purpose of illustration, consider the following plot:

Gradient Descent Cost function

This is a plot of cost function \(J(w,b)\) with respect \(w\) and \(b\). Now suppose we are at a random point J(w,b). The question is can we know in which direction should we move in order to reach the minimum? The answer is gradient because the gradient points in the direction of greatest increase. For curious cat and advanced learners, you may refer this post. And consequently, the negative of the gradient gives us the direction of the steepest descent. This means that \(-\frac{\delta}{\delta{w}}J(w,b)\) and \(-\frac{\delta}{\delta{b}}J(w,b)\) will give us the direction in which we need to move to reach the minimum of \(J(w,b)\). So one step of gradient descent would look like:

$$w = w-\alpha{\frac{\delta}{\delta{w}}J(w,b)}$$ $$b = b-\alpha{\frac{\delta}{\delta{b}}J(w,b)}$$

Where \(\alpha\) is usually a small fraction(like 0.01, 0.1, 0.001) that tells us how much should we move at each step of gradient descent. This parameter is called Learning rate. Now how do we decide the best value of \(\alpha\)? Well, there is a whole art of doing that but it is not part of this post.

So this is how we learn the parameter values \(w\) and \(b\). There is a lot of subtle details I have not mentioned for the sake of simplicity and overall idea. Moreover, this is only the second post and the purpose is to inspire you with core idea so that we can confidently go into the details. And again, the equations derived here can be best appreciated when we implement them in actual programming. In the next post I will show you how to implement gradient descent with python/numpy. Till then, have a great time Exploring.






Author:


Ratul Doley
Ratul Doley
Entrepreneur and AI researcher. Currently learning and working on Unsupervised learning and Data Clustering. Professional Android app developer and designer. Updated Nov 15, 2018

Beginner's intro into Neural Networks and Machine Learning

Beginner's intro into Neural Networks and Machine Learning


Folks! Be ready to welcome the amazing AI-powered world in the next decade

Folks! Be ready to welcome the amazing AI-powered world in the next decade