Blog Logo
Research Assistant @ Purdue University
Image Source:
· · ·

Basics of Machine Learning Series


· · ·


The posts Cost Function of Linear Regression and Gradient Descent introduced the linear regression cost function and the gradient descent algorithm individually. If the gradient descent is applied to the linear regression cost function would help reach an optimal solution without much manual intervention.

Gradient Descent

Gradient descent algorithm can be summarized as,

  • Where
    • := is the assignment operator
    • \(\alpha\) is the learning rate which basically defines how big the steps are during the descent
    • \( \frac {\partial} {\partial \theta_j} J(\theta_0, \theta_1)\) is the partial derivative term
    • j = 0, 1 represents the feature index number

Linear Regression

Linear regression hypothesis is given by,

And the corresponding cost function is given by,

Gradient Descent for Linear Regression

Gradient descent is applied to the optimazation problem of the cost function of the linear regression given in (2) in order to find parameters that minimize the cost.

In order to apply gradient descent, the derivative term i.e. \(\frac {\partial} {\partial \theta_j} J(\theta_0, \theta_1)\) needs to be calculated.

Updating the values of the derivative term in the gradient descent algorithm given in (1) where the update must be done simultaneously,

The cost function for a linear regression is a convex function i.e. a bow-shaped function and has only one global minimum and no local minima. So it does not face the issue of getting stuck in local minima. It can be understood better with the below plots.

Convex Function

The gradient descent technique that uses all the training examples in each step is called Batch Gradient Descent. This is basically the calculation of the derivative term over all the training examples as it can be seen it the equation above.

The equation of linear regression can also be solved using Normal Equations method, but it poses a disadvantage that it does not scale very well on larger data while gradient descent does.


import random
import matplotlib.pyplot as plt
import math
import numpy as np

Dummy Data for Linear Regression
data = [(1, 1), (2, 2), (3, 4), (4, 3), (5, 5.5), (6, 8), (7, 6), (8, 8.4), (9, 10), (5, 4)]    

Plot the line using theta_values
def plot_line(formula, x_range):
    x = np.array(x_range)  
    y = formula(x)
    plt.plot(x, y)

Hypothesis Function
def h(x, theta_0, theta_1):
    return theta_0 + theta_1 * x

Partial Derivative w.r.t. theta_1
def j_prime_theta_1(data, theta_0, theta_1):
    result = 0
    m = len(data)
    for x, y in data :
        result += (h(x, theta_0, theta_1) - y) * x
    return (1/m) * result

Partial Derivative w.r.t. theta_0
def j_prime_theta_0(data, theta_0, theta_1):
    result = 0
    m = len(data)
    for x, y in data :
        result += (h(x, theta_0, theta_1) - y)
    return (1/m) * result

Cost Function
def j(data, theta_0, theta_1):
    cost = 0
    m = len(data)
    for x, y in data:
        cost += math.pow(h(x, theta_0, theta_1) - y, 2)
    return (1/(2*m)) * cost

Simultaneous Update
def update_theta(data, alpha, theta_0, theta_1):
    temp_0 = theta_0 - alpha * j_prime_theta_0(data, theta_0, theta_1)
    temp_1 = theta_1 - alpha * j_prime_theta_1(data, theta_0, theta_1)
    theta_0 = temp_0
    theta_1 = temp_1
    return theta_0, theta_1
Gradient Descent For Linear Regression
def gradient_descent(data, alpha, tolerance, theta_0=None, theta_1=None):
    if not theta_0:
        theta_0 = random.random() * 100
    if not theta_1:
        theta_1 = random.random() * 100
    prev_j = 10000
    curr_j = j(data, theta_0, theta_1)
    theta_0_history = []
    theta_1_history = []
    cost_history = []
    while(abs(curr_j - prev_j) > tolerance):
            theta_0, theta_1 = update_theta(data, alpha, theta_0, theta_1)
            prev_j = curr_j
            curr_j = j(data, theta_0, theta_1)
    print("Stopped with Error at %.5f" % prev_j)
    return theta_0, theta_1

theta_0, theta_1 = gradient_descent(data, alpha=0.01, tolerance=0.00001)

Gradient Descent In Action

The plot shows the adjustment of the values of \(\theta_0\) and \(\theta_1\) for fitting the best line through the given data points.


Machine Learning: Coursera - Gradient Descent for Linear Regression

· · ·