
The gradient descent we have learned so far is Batch (full-sample) Gradient Descent

4 min read
Shaurya Singhal

Source: View original notebook on GitHub

Category: Machine Learning / Learn ML

The NPTEL lectures explain this well.



Earlier, in the gradient descent algorithm, we made one update (took one step toward the minimum) to our thetas only after going through all the samples in our data. If we have millions of sample points, then every single update requires a pass over the entire sample, which is time-consuming.
But why were we doing that? Because it was the right thing to do: we derived the update rule from the error function, so that is exactly what the math prescribes.
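Concretely, for the univariate hypothesis used below, the squared-error cost and its partial derivatives (which the `gradient` function computes term for term) are:

```latex
h_\theta(x) = \theta_0 + \theta_1 x,
\qquad
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
```

```latex
\frac{\partial J}{\partial \theta_0} = \sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr),
\qquad
\frac{\partial J}{\partial \theta_1} = \sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x^{(i)}
```

Both sums run over all m samples, which is why one batch update needs a full pass over the data.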

```python
import numpy as np

def gradient(X, Y, theta):
    # gradient of the squared-error cost for h(x) = theta0 + theta1 * x
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)          # d(cost) / d(theta0)
    grad[1] = np.sum((hx - Y) * X)    # d(cost) / d(theta1)
    return grad
```

```python
# code similar to what we wrote in the 3rd notebook for univariate gradient descent
def batch_gradientDescent(X, Y, learning_rate=0.0001):
    theta = np.zeros(2)
    itr = 0
    max_itr = 1000

    while itr < max_itr:
        # calculate the gradient over the complete sample
        grad = gradient(X, Y, theta)
        # update only after the full pass
        theta[0] = theta[0] - learning_rate * grad[0]  # grad[0] is d(cost) / d(theta0)
        theta[1] = theta[1] - learning_rate * grad[1]  # grad[1] is d(cost) / d(theta1)
        itr += 1

    return theta
```
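As a quick sanity check, here is a sketch of running batch gradient descent on noise-free synthetic data (the dataset, learning rate, and iteration count here are made up for illustration; the learning rate is hand-tuned for this tiny dataset):

```python
import numpy as np

def gradient(X, Y, theta):
    # gradient of the squared-error cost for h(x) = theta0 + theta1 * x
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)
    grad[1] = np.sum((hx - Y) * X)
    return grad

def batch_gradientDescent(X, Y, learning_rate=0.0001):
    theta = np.zeros(2)
    for _ in range(1000):
        grad = gradient(X, Y, theta)          # one full pass per update
        theta[0] = theta[0] - learning_rate * grad[0]
        theta[1] = theta[1] - learning_rate * grad[1]
    return theta

# synthetic, noise-free data: y = 2x + 1
X = np.arange(10, dtype=float)
Y = 2 * X + 1

theta = batch_gradientDescent(X, Y, learning_rate=0.005)
print(theta)  # close to [1., 2.]
```

With exact (noiseless) data, each update moves straight downhill on the true cost, so theta converges cleanly to the generating parameters.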

Stochastic(Approximation) Gradient Descent

Now, instead of going through the complete sample and then making an update, we settle for an approximation: we update our theta at each and every sample point.
What will happen?
- There is no longer any guarantee that each update decreases our loss or moves us toward the minimum. Why? Because every single sample point now pushes the parameters (theta) in the direction most favourable to itself, without regard for how this affects the other points.

- So within what used to be one single step toward the minimum, we now oscillate.
This is called Stochastic Gradient Descent.

```python
def gradient(X, Y, theta):
    # same gradient as above; np.sum also works when X and Y are single samples
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)
    grad[1] = np.sum((hx - Y) * X)
    return grad
```

```python
def stochastic_gradientDescent(X, Y, learning_rate=0.0001):
    theta = np.zeros(2)
    epoch = 0  # one epoch = one full pass over the data
    max_epoch = 1000

    while epoch < max_epoch:
        for each_sample in range(X.shape[0]):
            grad = gradient(X[each_sample], Y[each_sample], theta)
            # update after every single example
            theta[0] = theta[0] - learning_rate * grad[0]  # grad[0] is d(cost) / d(theta0)
            theta[1] = theta[1] - learning_rate * grad[1]  # grad[1] is d(cost) / d(theta1)
        epoch += 1

    return theta
```
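A sketch of stochastic gradient descent on the same made-up line y = 2x + 1 (again, the data and learning rate are illustrative assumptions, with the learning rate chosen small enough that no single per-sample step overshoots):

```python
import numpy as np

def gradient(X, Y, theta):
    # per-sample gradient; X and Y are single scalars here
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)
    grad[1] = np.sum((hx - Y) * X)
    return grad

def stochastic_gradientDescent(X, Y, learning_rate=0.0001):
    theta = np.zeros(2)
    for _ in range(1000):                      # 1000 epochs
        for each_sample in range(X.shape[0]):
            grad = gradient(X[each_sample], Y[each_sample], theta)
            # update after every single example
            theta[0] = theta[0] - learning_rate * grad[0]
            theta[1] = theta[1] - learning_rate * grad[1]
    return theta

# synthetic, noise-free data: y = 2x + 1
X = np.arange(10, dtype=float)
Y = 2 * X + 1

theta = stochastic_gradientDescent(X, Y, learning_rate=0.01)
print(theta)  # close to [1., 2.]
```

Because this toy data is noiseless, every sample agrees on the same minimum, so the oscillations die out and SGD still lands near [1, 2]; on noisy data it would keep jittering around the minimum instead.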

Mini-Batch Gradient Descent

We can do better than stochastic by updating the parameters in batches (dividing our sample into small batches) instead of at every single sample point. This way we get fewer oscillations than stochastic gradient descent, and the mini-batch trajectory stays contained inside the stochastic one (because of the reduced oscillation).


So we can say that if the batch size is one, mini-batch is the same as Stochastic Gradient Descent, and if the batch size equals the full sample size, it is the same as Batch Gradient Descent.
```python
def gradient(X, Y, theta):
    # same per-sample gradient as above
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)
    grad[1] = np.sum((hx - Y) * X)
    return grad
```

```python
def mini_batch_gradientDescent(X, Y, learning_rate=0.0001, batch_size=4):
    theta = np.zeros(2)
    epoch = 0
    max_epoch = 1000

    while epoch < max_epoch:
        grad = np.zeros(2)
        for each_sample in range(X.shape[0]):
            # accumulate the gradient over the current batch
            grad += gradient(X[each_sample], Y[each_sample], theta)

            # update once per batch (and once for any leftover samples at the end)
            if (each_sample + 1) % batch_size == 0 or each_sample == X.shape[0] - 1:
                theta[0] = theta[0] - learning_rate * grad[0]  # grad[0] is d(cost) / d(theta0)
                theta[1] = theta[1] - learning_rate * grad[1]  # grad[1] is d(cost) / d(theta1)
                grad = np.zeros(2)  # reset the accumulator for the next batch

        epoch += 1

    return theta
```
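And a closing sketch of mini-batch gradient descent on the same illustrative data (dataset, learning rate, and batch size are assumptions for the demo, not prescribed values):

```python
import numpy as np

def gradient(X, Y, theta):
    # per-sample gradient; X and Y are single scalars here
    grad = np.zeros(2)
    hx = theta[0] + theta[1] * X
    grad[0] = np.sum(hx - Y)
    grad[1] = np.sum((hx - Y) * X)
    return grad

def mini_batch_gradientDescent(X, Y, learning_rate=0.0001, batch_size=4):
    theta = np.zeros(2)
    for _ in range(1000):                      # 1000 epochs
        grad = np.zeros(2)
        for each_sample in range(X.shape[0]):
            grad += gradient(X[each_sample], Y[each_sample], theta)
            # update once per batch (and once for any leftover samples)
            if (each_sample + 1) % batch_size == 0 or each_sample == X.shape[0] - 1:
                theta[0] = theta[0] - learning_rate * grad[0]
                theta[1] = theta[1] - learning_rate * grad[1]
                grad = np.zeros(2)             # reset for the next batch
    return theta

# synthetic, noise-free data: y = 2x + 1
X = np.arange(10, dtype=float)
Y = 2 * X + 1

theta = mini_batch_gradientDescent(X, Y, learning_rate=0.005, batch_size=4)
print(theta)  # close to [1., 2.]
```

With batch_size=4 on ten samples we get three updates per epoch (two full batches plus one leftover batch of two), sitting between the one update per epoch of batch GD and the ten of SGD.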

Thank you!