Batch-size selection

  1. Full-batch gradient descent: compute the average gradient over the whole dataset, then update the weights once. Very slow and conservative.
  2. Stochastic gradient descent: the weights are updated after every single example. The network is updated very frequently, but the updates are noisy.
  3. Mini-batch gradient descent: the weights are updated after every n examples. A compromise between the two.
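
The three variants differ only in how many examples feed each weight update. A minimal sketch of that idea, where the helper `updates_per_epoch` and the dataset size `N` are made up for illustration and are not part of the original notes:

```python
import math

def updates_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of weight updates in one pass over the dataset (hypothetical helper)."""
    return math.ceil(num_examples / batch_size)

N = 1000  # assumed dataset size, for illustration only

full_batch = updates_per_epoch(N, N)   # full-batch: the whole dataset per update
stochastic = updates_per_epoch(N, 1)   # stochastic: one example per update
mini_batch = updates_per_epoch(N, 32)  # mini-batch: 32 examples per update

print(full_batch, stochastic, mini_batch)
```

The last batch is counted even when it is smaller than `batch_size`, which is why the helper rounds up.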

1. Full-Batch Gradient Descent

I don’t think anyone really uses full-batch gradient descent. Updating the model’s weights only once per pass over the whole training set is too slow. With a limited number of passes, this often leaves the model undertrained, short of converging to a local minimum. Its pro is that it is extremely safe and conservative: when the training data is accurate overall, averaging over the whole set lets the noise in individual examples largely cancel out, so each update points in a reliable direction.
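
As a rough illustration of "average the gradient over everything, then update once", here is a sketch on a toy one-weight model `y_hat = w * x` with squared-error loss. The model, the data, the learning rate, and the helper name `full_batch_step` are all assumptions chosen for the example, not the network from these notes:

```python
def full_batch_step(w, xs, ys, lr):
    """One full-batch update for the toy model y_hat = w * x (hypothetical helper)."""
    # Average the gradient of 0.5 * (w*x - y)**2 over the whole dataset.
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # assumed toy data, generated with w = 2

w = 0.0
for epoch in range(20):    # exactly one weight update per epoch
    w = full_batch_step(w, xs, ys, lr=0.1)

print(w)  # ends up very close to 2
```

Every update here has to touch all examples first, which is the cost that makes this approach slow on large datasets.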

2. Stochastic Gradient Descent

The gradient descent algorithm written in the backprop function in [[(3) Backpropagation + Gradient Descent]] is stochastic gradient descent: it updates the weights after every single example. It is very fast, but it can also go wrong on more complex tasks. This is because the chance that any one sample is erroneous or unrepresentative is pretty high, which makes the updates noisy. We will see how mini-batch gradient descent can mitigate this problem.
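
To make the noise concrete, the sketch below compares single-example gradients with their average on a toy one-weight model `y_hat = w * x`. The model, the noisy labels, and the helper name `sample_gradient` are assumptions for illustration only; this is not the backprop function from the linked note:

```python
def sample_gradient(w, x, y):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w (hypothetical helper)."""
    return (w * x - y) * x

xs = [1.0, 1.0, 1.0, 1.0]
ys = [1.5, 2.5, 1.0, 3.0]  # assumed noisy labels scattered around y = 2 * x

w = 2.0  # already the best weight for this data
per_sample = [sample_gradient(w, x, y) for x, y in zip(xs, ys)]
average = sum(per_sample) / len(per_sample)

print(per_sample)  # every single example still pushes w away from 2
print(average)     # averaged over the examples, the pushes cancel
```

A stochastic update would follow one of the `per_sample` values and move `w` away from the best value, while the averaged gradient leaves it where it is.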

3. Mini-Batch Gradient Descent

Mini-batch gradient descent is a compromise between stochastic and full-batch: random and noisy vs. safe and conservative. Usually we choose BATCH_SIZE = 4, 8, 16, 32, 64, 128, or …
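
A minimal sketch of a mini-batch training loop, again on a toy one-weight model `y_hat = w * x`. Apart from `BATCH_SIZE`, everything here (the helper `minibatch_epoch`, the data, the learning rate, the seed) is an assumption made for the example:

```python
import random

BATCH_SIZE = 4  # one of the usual choices

def minibatch_epoch(w, data, lr, batch_size, rng):
    """One pass over the data, updating w after every batch (hypothetical helper)."""
    data = data[:]     # copy so the caller's list is not reordered
    rng.shuffle(data)  # new random batches every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average the gradient of 0.5 * (w*x - y)**2 over this batch only.
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
    return w

rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed toy data, generated with w = 2.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]

w = 0.0
for epoch in range(50):
    w = minibatch_epoch(w, data, lr=0.05, batch_size=BATCH_SIZE, rng=rng)

print(w)  # ends up very close to 2
```

With 8 examples and a batch size of 4, each epoch makes 2 updates. Setting `batch_size=1` would give the stochastic variant and `batch_size=len(data)` the full-batch one.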