Research Assistant @ Purdue University
· · ·

Basics of Machine Learning Series


· · ·


The intuition behind logistic regression and its implementation are covered in Classification and Logistic Regression and Logistic Regression Model.

As with linear regression, logistic regression is prone to overfitting when there are a large number of features. An overfit decision boundary can become highly contorted so that it fits the training data while failing to generalise to unseen data.

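So, the cost function of logistic regression is updated to penalize large values of the parameters and is given by,

\[J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_\theta(x^{(i)}) \right) \right] + {\lambda \over 2m } \sum_{j=1}^n \theta_j^2 \tag{1}\]

  • Where
    • \({\lambda \over 2m } \sum_{j=1}^n \theta_j^2\) is the regularization term
    • \(\lambda\) is the regularization factor
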
import numpy as np
mul = np.matmul

# X is the design matrix
# y is the target vector
# theta is the parameter vector
# lamda is the regularization parameter

def sigmoid(X):
    return np.power(1 + np.exp(-X), -1)

# hypothesis function
def h(X, theta):
    return sigmoid(mul(X, theta))

# regularized cost function
# (y and theta are expected to be column vectors)
def j(theta, X, y, lamda=None):
    m = X.shape[0]
    # unregularized cross-entropy cost
    cost = -(1/m) * (mul(y.T, np.log(h(X, theta))) +
                     mul((1-y).T, np.log(1 - h(X, theta))))[0][0]
    if lamda:
        # penalize every parameter except the bias theta[0];
        # slicing avoids overwriting the caller's theta
        cost += (lamda/(2*m)) * mul(theta[1:].T, theta[1:])[0][0]
    return cost
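
To make the size of the penalty concrete, here is a minimal self-contained sketch. It uses 1-D arrays rather than column vectors, and the names `regularized_cost` and `lam` as well as the toy data are made up for illustration; they are not part of the code above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam=0.0):
    """Cross-entropy cost plus an L2 penalty that skips the bias theta[0]."""
    m = X.shape[0]
    p = sigmoid(X @ theta)
    data_term = -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum() / m
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return data_term + penalty

# Toy data (hypothetical): a column of ones followed by two features.
X = np.array([[1.0, 0.5, 1.5],
              [1.0, 1.0, 1.0],
              [1.0, 1.5, 0.5],
              [1.0, 3.0, 0.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([-1.0, 2.0, -1.0])

cost_plain = regularized_cost(theta, X, y, lam=0.0)
cost_reg = regularized_cost(theta, X, y, lam=1.0)

# The two costs differ only by the penalty:
# (1 / (2 * 4)) * (2**2 + (-1)**2) = 0.625, up to floating-point rounding.
print(cost_reg - cost_plain)
```

Changing `theta[0]` changes the data term but leaves the penalty untouched, which is the convention the regularization term in (1) encodes by starting its sum at \(j = 1\).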

Regularization for Gradient Descent

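Previously, gradient descent for logistic regression without regularization was given by,

\[\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \tag{2}\]

  • Where \(j \in \{0, 1, \cdots, n\} \)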

But since the cost function has changed in (1) to include the regularization term, the derivative of the cost function that is plugged into the gradient descent algorithm changes as well. For \(j \geq 1\),

\[\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac {\lambda} {m} \theta_j\]

Because the first term of the cost function remains the same, so does the first term of the derivative. Taking the derivative of the second term gives \(\frac {\lambda} {m} \theta_j\), as seen above.
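
One way to gain confidence in this derivative is a finite-difference check. The sketch below is self-contained and is not the author's code: the function names, the random toy data and the step size `eps` are all assumptions chosen for illustration. It compares the analytic gradient, including the \(\frac {\lambda} {m} \theta_j\) term, against a central-difference approximation of the regularized cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    m = X.shape[0]
    p = sigmoid(X @ theta)
    data_term = -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum() / m
    return data_term + (lam / (2 * m)) * np.sum(theta[1:] ** 2)

def gradient(theta, X, y, lam):
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += (lam / m) * theta[1:]  # no penalty on the bias theta[0]
    return grad

def numerical_gradient(theta, X, y, lam, eps=1e-6):
    """Central-difference approximation of the gradient, one coordinate at a time."""
    approx = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        approx[k] = (cost(theta + step, X, y, lam)
                     - cost(theta - step, X, y, lam)) / (2 * eps)
    return approx

# Random toy problem (hypothetical data): 20 examples, bias column plus 3 features.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=4)

analytic = gradient(theta, X, y, lam=2.0)
numeric = numerical_gradient(theta, X, y, lam=2.0)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)  # should be tiny if the derivative is right
```

If the \(\frac {\lambda} {m} \theta_j\) term were dropped from `gradient`, or wrongly applied to `theta[0]`, the two gradients would disagree in the affected components.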

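So, (2) can be updated as,

\[\begin{aligned} \theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)} \\ \theta_j &:= \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac {\lambda} {m} \theta_j \right] \end{aligned}\]

  • Where \(j \in \{1, 2, \cdots, n\} \) and \(h_\theta(x)\) is the sigmoid of \(\theta^T x\)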

Note that the update for \(j = 0\) includes no regularization term, which is consistent with the convention of not penalizing the bias parameter \(\theta_0\).
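
A short rearrangement, not part of the derivation above, makes the effect of the penalty easier to see. Grouping the \(\theta_j\) terms of the regularized update for \(j \in \{1, 2, \cdots, n\}\) gives

\[\theta_j := \theta_j \left( 1 - \alpha \frac {\lambda} {m} \right) - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}\]

Assuming \(\alpha \frac {\lambda} {m} < 1\), which holds for typical learning rates and training-set sizes, the factor \(1 - \alpha \frac {\lambda} {m}\) is slightly less than 1, so each step first shrinks \(\theta_j\) a little and then applies the same gradient step as the unregularized rule.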

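# regularized cost gradient
def j_prime(theta, X, y, lamda=None):
    m = X.shape[0]
    grad = (1/m) * mul(X.T, (h(X, theta) - y))
    if lamda:
        # regularize every parameter except the bias theta[0];
        # reg is a new array, so the caller's theta is left unchanged
        reg = (lamda/m) * theta
        reg[0] = 0
        grad = grad + reg
    return grad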

# simultaneous update
# (alpha, the learning rate, must be defined before this is called)
def update_theta(theta, X, y, lamda=None):
    return theta - alpha * j_prime(theta, X, y, lamda)
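
To see how the update rule behaves for different regularization strengths, the following self-contained sketch runs plain batch gradient descent on synthetic data. Everything in it is an assumption made for illustration: the `fit` helper, the learning rate, the iteration count, the chosen values of `lam` and the randomly generated data are not taken from the linked working code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y, lam):
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += (lam / m) * theta[1:]  # the bias theta[0] is not regularized
    return grad

def fit(X, y, lam, alpha=0.1, iterations=5000):
    """Batch gradient descent from a zero start; all parameters updated together."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta = theta - alpha * gradient(theta, X, y, lam)
    return theta

# Synthetic, noisy two-feature problem (hypothetical data).
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 2))
X = np.hstack([np.ones((100, 1)), features])
true_theta = np.array([0.5, 2.0, -1.0])
y = (rng.random(100) < sigmoid(X @ true_theta)).astype(float)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(fit(X, y, lam)[1:]) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:6.1f}   ||theta[1:]|| = {norm:.4f}")
```

The norm of the penalized parameters should fall as `lam` grows, which corresponds to a smoother, less contorted decision boundary.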

Link to Rough Working Code. Change the value of lamda in the code to get different decision boundaries for the data as shown below.

  • Regularization with \(\lambda = 0.01\)
  • Regularization with \(\lambda = 0.1\)
  • Regularization with \(\lambda = 1\)
  • Regularization with \(\lambda = 10\)
  • Regularization with \(\lambda = 100\)


Machine Learning: Coursera - Logistic Regression Model

· · ·