Satin's blog

Now that we have a feedforward pass with activation functions between the layers, we need to correct the model every time it makes a mistake.
We can use a cost function to measure how wrong a model is. When the cost is high, the model is performing poorly. When the cost is low, the model is performing relatively well.

Definition
$$
C(W, b) = \sum{(\text{predicted} - \text{actual})^2}
$$
To minimize $C(W, b)$, we adjust the weights $W$ and biases $b$. In general this only finds a local minimum, not necessarily the global one.
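As a rough illustration of the definition above, here is a minimal Python sketch of the cost as a sum of squared errors. The function name `cost` and the sample numbers are made up for this example.

```python
def cost(predicted, actual):
    """Sum of squared differences between predictions and targets."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

# Hypothetical predictions vs. targets
good = cost([9.5, 4.0], [10.0, 4.0])   # close predictions -> low cost
bad = cost([2.0, 9.0], [10.0, 4.0])    # far-off predictions -> high cost
print(good, bad)  # 0.25 89.0
```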

Prerequisite Knowledge

Because $C(W, b)$ depends on so many weight and bias variables, we need the partial derivative, written $\partial$ (called "partial d"). When we take the rate of change of $C(W, b)$ with respect to $w_1$, we treat all other variables as constants. This is written as:
$$\frac{\partial C} {\partial w_1}$$
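To make "hold every other variable constant" concrete, here is a small sketch that estimates a partial derivative numerically by nudging one variable at a time. The helper `partial` and the toy function `f` are hypothetical and only serve this illustration.

```python
def partial(f, params, i, h=1e-6):
    """Central-difference estimate of df/d(params[i]); other params stay fixed."""
    up = list(params)
    down = list(params)
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2 * h)

def f(p):
    # Toy function of two variables: f(w1, w2) = w1^2 + 3 * w2
    return p[0] ** 2 + 3 * p[1]

d_w1 = partial(f, [2.0, 5.0], 0)  # analytically 2 * w1 = 4
d_w2 = partial(f, [2.0, 5.0], 1)  # analytically 3
print(d_w1, d_w2)
```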

Methodology

[!Intuition] We can get the slope/derivative/gradient of the cost function, and step in the direction where the cost drops most steeply (the opposite of the gradient).
We are given the current weight matrices, the bias vectors, and the predicted value.

Example:
Say we have a network with inputs $a_1$ and $a_2$, weights $w_1$ and $w_2$, and a bias $b$.
In one training example, we have the data point (1, 2) -> 10.
This means:
```
a1 = 1
a2 = 2
y = 10
```

We are trying to figure out how the cost function changes when each weight or bias changes.
Basically, we want to find: $$\frac{\partial C} {\partial w_1}, \frac{\partial C} {\partial w_2}, \frac{\partial C} {\partial b}$$
Right now, the cost function is
$$C(w_1, w_2, b) = (\hat{y} - y)^2 = (1 \cdot w_1 + 2 \cdot w_2 + b - 10)^2$$
Let
$$u = 1 \cdot w_1 + 2 \cdot w_2 + b - 10,$$
then
$$C(w_1, w_2, b) = u^2$$
$$\frac{\partial C} {\partial u} = 2u$$
From this, by the chain rule, we get
$$\frac{\partial C}{\partial w_1} = \frac{\partial C}{\partial u} \cdot \frac{\partial u}{\partial w_1} = 2u \cdot 1 = 2u \cdot a_1,$$
$$\frac{\partial C}{\partial w_2} = \frac{\partial C}{\partial u} \cdot \frac{\partial u}{\partial w_2} = 2u \cdot 2 = 2u \cdot a_2,$$
and
$$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial u} \cdot \frac{\partial u}{\partial b} = 2u \cdot 1 = 2u$$
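Here is a minimal sketch that plugs numbers into the three formulas above. The starting values for `w1`, `w2`, and `b` are assumptions chosen for illustration; they are not from the example itself. A finite-difference estimate is included as a sanity check for $w_1$.

```python
a1, a2, y = 1.0, 2.0, 10.0          # the training example (1, 2) -> 10
w1, w2, b = 1.0, 2.0, 0.5           # assumed starting values, not from the text

def cost(w1, w2, b):
    return (a1 * w1 + a2 * w2 + b - y) ** 2

u = a1 * w1 + a2 * w2 + b - y       # 5.5 - 10 = -4.5
dC_dw1 = 2 * u * a1                 # -9.0
dC_dw2 = 2 * u * a2                 # -18.0
dC_db = 2 * u                       # -9.0

# Sanity check: nudge w1 slightly and measure how the cost changes
h = 1e-6
numeric_dw1 = (cost(w1 + h, w2, b) - cost(w1 - h, w2, b)) / (2 * h)
print(dC_dw1, dC_dw2, dC_db, numeric_dw1)
```

All three gradients are negative here, so increasing any of the parameters slightly would lower the cost. That matches intuition: the prediction 5.5 is below the target 10.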

[!Intuition]
We can simplify a large network into a chain of 4 nodes with a single weight between each pair of neighbors. It looks like $$a_0 \rightarrow w_1 \rightarrow a_1 \rightarrow w_2 \rightarrow a_2 \rightarrow w_3 \rightarrow a_3$$
Now, we need to find $$\frac{\partial C}{\partial w_3}, \frac{\partial C}{\partial w_2}, \text{ and }\frac{\partial C}{\partial w_1}$$
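We can first write out the nested definitions, naming the input to each relu $z_k$:
$$a_3 = a_2 \cdot w_3 + b_3$$
$$a_2 = \text{relu}(z_2), \quad z_2 = a_1 \cdot w_2 + b_2$$
$$a_1 = \text{relu}(z_1), \quad z_1 = a_0 \cdot w_1 + b_1$$
$$error = a_3 - y$$
$$C(w_1, w_2, w_3, b_1, b_2, b_3) = (a_3 - y)^2$$
By taking derivatives with the chain rule, working backward from the output:
$$\frac{\partial C}{\partial w_3} = 2 \cdot error \cdot a_2$$
$$\frac{\partial C}{\partial w_2} = 2 \cdot error \cdot w_3 \cdot \text{relu}'(z_2) \cdot a_1$$
$$\frac{\partial C}{\partial w_1} = 2 \cdot error \cdot w_3 \cdot \text{relu}'(z_2) \cdot w_2 \cdot \text{relu}'(z_1) \cdot a_0$$
Each gradient reuses most of the one after it: moving back one layer drops the trailing activation, then multiplies by that layer's weight, the relu derivative, and the previous activation. Calling the shared part $\delta_k$ (the code's `delta1`, `delta2`, `delta3` play this role, and $\text{relu}'$ is `relu_derivative`) gives
$$\delta_3 = 2 \cdot error, \quad \delta_2 = \delta_3 \cdot w_3 \cdot \text{relu}'(z_2), \quad \delta_1 = \delta_2 \cdot w_2 \cdot \text{relu}'(z_1)$$
$$\frac{\partial C}{\partial w_3} = \delta_3 \cdot a_2, \quad \frac{\partial C}{\partial w_2} = \delta_2 \cdot a_1, \quad \frac{\partial C}{\partial w_1} = \delta_1 \cdot a_0$$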
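The chain-rule gradients for this 4-node chain can be checked numerically. The sketch below uses plain scalars; the parameter values, the data point, and the helper names (`forward`, `gradients`, `numeric`) are all hypothetical and picked so that both relu units are active.

```python
def relu(z):
    return max(0.0, z)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

def forward(p, a0):
    """p = (w1, b1, w2, b2, w3, b3). Returns every intermediate value."""
    w1, b1, w2, b2, w3, b3 = p
    z1 = a0 * w1 + b1
    a1 = relu(z1)
    z2 = a1 * w2 + b2
    a2 = relu(z2)
    a3 = a2 * w3 + b3
    return z1, a1, z2, a2, a3

def cost(p, a0, y):
    return (forward(p, a0)[-1] - y) ** 2

def gradients(p, a0, y):
    """Chain-rule gradients of C = (a3 - y)^2 with respect to w3, w2, w1."""
    w1, b1, w2, b2, w3, b3 = p
    z1, a1, z2, a2, a3 = forward(p, a0)
    error = a3 - y
    dC_dw3 = 2 * error * a2
    dC_dw2 = 2 * error * w3 * relu_derivative(z2) * a1
    dC_dw1 = 2 * error * w3 * relu_derivative(z2) * w2 * relu_derivative(z1) * a0
    return dC_dw3, dC_dw2, dC_dw1

# Hypothetical parameters and data point
params = (0.5, 0.1, -0.3, 0.8, 1.2, 0.0)
a0, y = 2.0, 1.0
dw3, dw2, dw1 = gradients(params, a0, y)

def numeric(i, h=1e-6):
    """Finite-difference estimate of dC/d(params[i]) for comparison."""
    up, down = list(params), list(params)
    up[i] += h
    down[i] -= h
    return (cost(up, a0, y) - cost(down, a0, y)) / (2 * h)

print(dw3, numeric(4))  # w3 is params[4]
print(dw2, numeric(2))  # w2 is params[2]
print(dw1, numeric(0))  # w1 is params[0]
```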

Converting them into code:

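```python
import numpy as np

# Assumed helper definitions so the snippet stands alone
LR = 0.01  # learning rate; an assumed value, tune it for your data

def relu_derivative(z):
    # Slope of relu: 1 where z > 0, otherwise 0
    return np.where(z > 0, 1.0, 0.0)

def backprop(self, input_image, target):
    # Method of the network class. Assumes the forward pass already stored
    # z1, a1, z2, a2, a3 on self as column vectors.
    error = self.a3 - target

    # Output layer (linear): dC/da3 = 2 * error, matching the derivation
    delta3 = 2 * error
    dW3 = delta3 @ self.a2.T
    db3 = delta3

    # Pass the error back through W3, then through relu at layer 2
    delta2 = (self.W3.T @ delta3) * relu_derivative(self.z2)
    dW2 = delta2 @ self.a1.T
    db2 = delta2

    # Same step for layer 1
    delta1 = (self.W2.T @ delta2) * relu_derivative(self.z1)
    dW1 = delta1 @ input_image.T
    db1 = delta1

    # Update weights and biases
    self.W3 -= LR * dW3
    self.b3 -= LR * db3

    self.W2 -= LR * dW2
    self.b2 -= LR * db2

    self.W1 -= LR * dW1
    self.b1 -= LR * db1
```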
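The `backprop` method reads `self.z1`, `self.a1`, `self.z2`, `self.a2`, and `self.a3`, so a forward pass has to store them first. Below is a minimal sketch of what such a forward pass might look like. The class name `TinyNet`, the layer sizes, and the random initialization are all assumptions for illustration, not the network from this post.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

class TinyNet:
    """Hypothetical 3-layer network: relu, relu, then a linear output layer."""

    def __init__(self, n_in, n_h1, n_h2, n_out, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights and zero biases (an assumed initialization)
        self.W1 = rng.normal(0, 0.1, (n_h1, n_in))
        self.b1 = np.zeros((n_h1, 1))
        self.W2 = rng.normal(0, 0.1, (n_h2, n_h1))
        self.b2 = np.zeros((n_h2, 1))
        self.W3 = rng.normal(0, 0.1, (n_out, n_h2))
        self.b3 = np.zeros((n_out, 1))

    def forward(self, input_image):
        # input_image is a column vector of shape (n_in, 1)
        self.z1 = self.W1 @ input_image + self.b1
        self.a1 = relu(self.z1)
        self.z2 = self.W2 @ self.a1 + self.b2
        self.a2 = relu(self.z2)
        self.a3 = self.W3 @ self.a2 + self.b3  # linear output layer
        return self.a3

net = TinyNet(n_in=4, n_h1=5, n_h2=3, n_out=2)
x = np.ones((4, 1))
out = net.forward(x)
print(out.shape)  # (2, 1)
```

With column vectors, a product like `delta3 @ self.a2.T` has the same shape as `W3`, which is why the update lines can subtract the gradient directly.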

Understanding more about gradients

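A gradient is a vector that holds the slope of $C(W, b)$ along every dimension; together these slopes give the direction of steepest increase.
$\nabla C(W, b)$ is the vector of partial derivatives of $C(W, b)$ with respect to each parameter: $[w_1, w_2, \ldots, w_n, b_1, b_2, \ldots, b_n]$.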
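To show the gradient vector in use, here is a small sketch of repeated gradient-descent steps on the earlier (1, 2) -> 10 data point. The starting parameters and the learning rate `lr` are assumed values chosen for illustration.

```python
a1, a2, y = 1.0, 2.0, 10.0    # same data point as the earlier example
params = [0.0, 0.0, 0.0]      # [w1, w2, b], an assumed starting point
lr = 0.05                     # assumed learning rate

def cost(p):
    return (a1 * p[0] + a2 * p[1] + p[2] - y) ** 2

def gradient(p):
    """Vector of partial derivatives [dC/dw1, dC/dw2, dC/db]."""
    u = a1 * p[0] + a2 * p[1] + p[2] - y
    return [2 * u * a1, 2 * u * a2, 2 * u]

history = [cost(params)]
for _ in range(20):
    grad = gradient(params)
    # Step against the gradient: the direction of steepest decrease
    params = [p - lr * g for p, g in zip(params, grad)]
    history.append(cost(params))

print(history[0], history[-1])
```

The cost starts at 100 and shrinks on every step, because each update moves the parameters a little way opposite to the gradient.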

Another intuition for getting the gradient of a weight:

When getting the rate of change of A with respect to Z (aka dA/dZ), we can expand it as: $$\frac{dA}{dZ} = \frac{dA}{dB} \cdot \frac{dB}{dC} \cdot \ldots \cdot \frac{dX}{dY} \cdot \frac{dY}{dZ}$$
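A quick numeric check of this product rule, using a made-up chain Z -> C -> B -> A. The three functions below are hypothetical and only stand in for the intermediate steps.

```python
import math

# Hypothetical chain: Z -> C -> B -> A
def C_of(z):
    return 3 * z

def B_of(c):
    return c ** 2

def A_of(b):
    return math.sin(b)

z = 0.5
c = C_of(z)
b = B_of(c)

dA_dB = math.cos(b)   # derivative of sin(B)
dB_dC = 2 * c         # derivative of C^2
dC_dZ = 3             # derivative of 3Z
dA_dZ = dA_dB * dB_dC * dC_dZ   # product of the local rates of change

# Compare with a finite-difference estimate over the whole chain
h = 1e-6
numeric = (A_of(B_of(C_of(z + h))) - A_of(B_of(C_of(z - h)))) / (2 * h)
print(dA_dZ, numeric)
```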

This is very similar to dimensional analysis: $$\frac{\text{meters}}{\text{seconds}} \cdot \frac{\text{seconds}}{\text{minutes}}$$

[!Tool Idea] Compositional Transformation System: Both dimensional analysis and the chain rule in backpropagation convert an initial value to the desired value by multiplying intermediate values. Composition means chaining simpler, foundational rules together. We can always express compositional systems as graphs.

For example, when we compute the gradient of a network, we can express the relationships between entities in a tree: a3 to z3, z3 to a2, w3, and b3, and a2 to z2, …
We can also express dimensional analysis using a linked list (a connected, acyclic graph): $$mL \rightarrow cm^3 \rightarrow m^3 \rightarrow km^3.$$
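The unit chain above can be sketched as a list of edges, each carrying a multiplication factor. The `chain` layout and the `convert` helper are hypothetical; they are just one simple way to represent a compositional system in code.

```python
# Each edge is (from_unit, to_unit, factor): value_in_to = value_in_from * factor
chain = [
    ("mL", "cm^3", 1.0),     # 1 mL is exactly 1 cm^3
    ("cm^3", "m^3", 1e-6),   # 100 cm per m, cubed
    ("m^3", "km^3", 1e-9),   # 1000 m per km, cubed
]

def convert(value, chain):
    """Walk the chain in order, multiplying by each edge's factor."""
    unit = chain[0][0]
    for src, dst, factor in chain:
        assert src == unit, "edges must connect end to end"
        value *= factor
        unit = dst
    return value, unit

volume, unit = convert(2.5e15, chain)   # 2.5e15 mL expressed in km^3
print(volume, unit)
```

Backpropagation follows the same pattern: each edge holds a local derivative instead of a unit factor, and walking the path multiplies them together.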

![[graph_for_gradient_calculation.png]]