The gradient descent we have learned till now is Batch (complete-sample) Gradient Descent
Earlier, in the gradient descent algorithm, we made one update to our thetas (one step toward the minimum) only after going through all the samples of our data. If we have millions of sample points, then a single update requires a pass over the complete sample set, which is time-consuming.
But why were we doing that? Because it was the right thing to do! We derived the update rule from the error function summed over all samples, so each step uses the exact gradient of the cost.
import numpy as np

def gradient(X, Y, theta):
    # gradient of the squared-error cost for the line hx = theta0 + theta1*x
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)        # d(cost)/d(theta 0)
    grad[1] = np.sum((hx - Y) * X)  # d(cost)/d(theta 1)
    return grad
# code similar to what we wrote in the 3rd notebook for univariate gradient descent
def batch_gradientDescent(X, Y, learning_rate=0.0001):
    theta = np.zeros(2)
    itr = 0
    max_itr = 1000
    while itr < max_itr:
        # calculate the gradient over the complete sample
        grad = gradient(X, Y, theta)
        # update once per pass over the data
        theta[0] = theta[0] - learning_rate * grad[0]  # grad[0] is d(cost)/d(theta 0)
        theta[1] = theta[1] - learning_rate * grad[1]  # grad[1] is d(cost)/d(theta 1)
        itr += 1
    return theta
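As a quick sanity check (not part of the original notebook), we can run batch gradient descent on synthetic data generated from a known line and verify that the recovered thetas are close to the true intercept and slope. The data and learning rate below are illustrative choices, not values from the notebook.

```python
import numpy as np

def gradient(X, Y, theta):
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)
    grad[1] = np.sum((hx - Y) * X)
    return grad

def batch_gradientDescent(X, Y, learning_rate=0.0001, max_itr=1000):
    theta = np.zeros(2)
    for _ in range(max_itr):
        # one exact-gradient update per pass over all samples
        theta -= learning_rate * gradient(X, Y, theta)
    return theta

# synthetic, noise-free data from the line y = 2 + 3x
X = np.linspace(0, 1, 100)
Y = 2 + 3 * X
theta = batch_gradientDescent(X, Y, learning_rate=0.01)
print(theta)  # should be close to [2, 3]
```

Because the gradient here is a sum (not a mean) over samples, the stable learning rate shrinks as the dataset grows; 0.01 works for these 100 points but would diverge on a much larger sample.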
Stochastic (Approximation) Gradient Descent
Now, instead of going through the complete sample and then making an update, we settle for an approximation: we update our theta at each and every sample point.
What will happen?
- There is now no guarantee that each update decreases our loss or moves us toward the minimum. Why? Because every single sample point pushes the parameters (theta) in the direction most favourable to itself, without regard for how that affects the other points.
- So, in place of one clean step toward the minimum, the parameters oscillate around the batch-gradient direction as they descend.
This is called stochastic gradient descent.
def gradient(X, Y, theta):
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)        # d(cost)/d(theta 0)
    grad[1] = np.sum((hx - Y) * X)  # d(cost)/d(theta 1)
    return grad
def stochastic_gradientDescent(X, Y, learning_rate=0.0001):
    theta = np.zeros(2)
    epoch = 0  # one epoch = one full pass over all the samples
    max_epoch = 1000
    while epoch < max_epoch:
        for each_sample in range(X.shape[0]):
            grad = gradient(X[each_sample], Y[each_sample], theta)
            # update theta after every single example
            theta[0] = theta[0] - learning_rate * grad[0]  # grad[0] is d(cost)/d(theta 0)
            theta[1] = theta[1] - learning_rate * grad[1]  # grad[1] is d(cost)/d(theta 1)
        epoch += 1
    return theta
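A common refinement, not shown in the notebook above, is to shuffle the sample order each epoch so the per-sample pushes do not follow the same pattern every pass. The sketch below assumes the same synthetic line as before; the function name and hyperparameters are illustrative.

```python
import numpy as np

def gradient(X, Y, theta):
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)
    grad[1] = np.sum((hx - Y) * X)
    return grad

def stochastic_gd_shuffled(X, Y, learning_rate=0.0001, max_epoch=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(max_epoch):
        # visit the samples in a fresh random order every epoch
        for i in rng.permutation(X.shape[0]):
            theta -= learning_rate * gradient(X[i], Y[i], theta)
    return theta

X = np.linspace(0, 1, 100)
Y = 2 + 3 * X
theta = stochastic_gd_shuffled(X, Y, learning_rate=0.01, max_epoch=200)
print(theta)  # oscillates per step, but ends up near [2, 3]
```

Note that the per-sample learning rate can be much larger than the batch version's, since each step now sees a gradient of one sample rather than a sum over all of them.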
Mini-Batch Gradient Descent
We can do better than stochastic by updating the parameters once per batch (dividing our sample into small batches) instead of at every single sample point. This way we get fewer oscillations than stochastic gradient descent, and the mini-batch trajectory stays contained inside the stochastic one (because of those fewer oscillations).
So we can say that if the batch size is one, mini-batch reduces to stochastic gradient descent; and if the batch size equals the full sample size, it reduces to batch gradient descent.
def gradient(X, Y, theta):
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)        # d(cost)/d(theta 0)
    grad[1] = np.sum((hx - Y) * X)  # d(cost)/d(theta 1)
    return grad
def mini_batch_gradientDescent(X, Y, learning_rate=0.0001, batch_size=4):
    theta = np.zeros(2)
    epoch = 0
    max_epoch = 1000
    m = X.shape[0]
    while epoch < max_epoch:
        grad = np.zeros(2)
        for each_sample in range(m):
            # accumulate the gradient over the current batch
            grad += gradient(X[each_sample], Y[each_sample], theta)
            # update once per completed batch; the final check also flushes
            # any leftover partial batch at the end of the pass
            if (each_sample + 1) % batch_size == 0 or each_sample == m - 1:
                theta[0] = theta[0] - learning_rate * grad[0]  # grad[0] is d(cost)/d(theta 0)
                theta[1] = theta[1] - learning_rate * grad[1]  # grad[1] is d(cost)/d(theta 1)
                grad = np.zeros(2)  # reset the accumulated gradient
        epoch += 1
    return theta
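Since the `gradient` function above is already vectorized (it sums over whatever array it is given), an equivalent and more idiomatic sketch, not in the original notebook, is to slice each batch out of the arrays and call `gradient` once per batch instead of accumulating sample by sample. The function name and hyperparameters here are illustrative.

```python
import numpy as np

def gradient(X, Y, theta):
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)
    grad[1] = np.sum((hx - Y) * X)
    return grad

def mini_batch_gd_vectorized(X, Y, learning_rate=0.0001, batch_size=4, max_epoch=1000):
    theta = np.zeros(2)
    m = X.shape[0]
    for _ in range(max_epoch):
        for start in range(0, m, batch_size):
            # one vectorized gradient call per batch slice;
            # the last slice naturally handles a partial batch
            Xb = X[start:start + batch_size]
            Yb = Y[start:start + batch_size]
            theta -= learning_rate * gradient(Xb, Yb, theta)
    return theta

X = np.linspace(0, 1, 100)
Y = 2 + 3 * X
theta = mini_batch_gd_vectorized(X, Y, learning_rate=0.01, batch_size=10, max_epoch=300)
print(theta)  # should be close to [2, 3]
```

This version computes exactly the same per-batch gradients as the loop-and-accumulate code, but trades the inner Python loop for NumPy array operations, which is both faster and less error-prone.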
