We are given test points $x(1), \dots, x(T)$, together with measurements $d(t) = f(x(t))$, $t = 1, \dots, T$, where $f$ is some unknown function.

Our goal is to approximate the unknown function at the given test points by a function computed by a feed-forward neural network. The approximation is obtained by minimizing the sum of squared errors over the test points.

Without loss of generality, we assume that:

  • there are $n$ input neurons,
  • and two hidden layers, each containing $m$ neurons.

Let $N_j^{1}$ denote the $j$-th neuron in the first hidden layer.

Let $w_{ij}^{1}$ be the weight connecting neuron $j$ in layer 1 with neuron $i$ in layer 2.

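As activation function, we take the logistic sigmoid

$$\varphi(v) = \frac{1}{1 + e^{-v}}$$

for all neurons.

Recall that the derivative of $\varphi$ can be expressed in terms of $\varphi$ itself: $\varphi'(v) = \varphi(v)\,(1 - \varphi(v))$.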
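The derivative identity can be checked numerically. The sketch below assumes the activation is the logistic sigmoid, which is what the factor y(1 - y) in the update rule later in these notes suggests. The function names are illustrative and not part of the notes.

```python
import math

def sigmoid(v):
    """Logistic sigmoid; maps any real v into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(v)
    return s * (1.0 - s)

def central_difference(f, v, h=1e-6):
    """Numerical derivative, used only to cross-check the closed form."""
    return (f(v + h) - f(v - h)) / (2.0 * h)

# The closed form and the numerical derivative agree at a few sample points.
for v in (-2.0, 0.0, 1.5):
    assert abs(sigmoid_prime(v) - central_difference(sigmoid, v)) < 1e-8
```

The practical consequence is that the derivative costs nothing extra once the output of a neuron is known: it is simply the output times one minus the output.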

The problem variables are the weights $w_{ij}^{k}$.

We disregard biases.

In this way, the output of the network always belongs to the interval $(0, 1)$. If the measurements take values outside this interval, one first has to perform the well-known normalization steps.
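The notes do not say which normalization is meant. One common choice is min-max scaling, sketched below under that assumption. The target range [0.1, 0.9] and the function names are illustrative choices, picked so that the scaled measurements stay strictly inside (0, 1).

```python
def minmax_normalize(values, lo=0.1, hi=0.9):
    """Scale values linearly into [lo, hi]; also return the original bounds."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # All measurements are equal: map them to the middle of the range.
        return [(lo + hi) / 2.0 for _ in values], (vmin, vmax)
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values], (vmin, vmax)

def minmax_denormalize(scaled, bounds, lo=0.1, hi=0.9):
    """Map network outputs back to the original measurement scale."""
    vmin, vmax = bounds
    if vmax == vmin:
        return [vmin for _ in scaled]
    return [vmin + (s - lo) * (vmax - vmin) / (hi - lo) for s in scaled]

measurements = [-5.0, 0.0, 15.0]
scaled, bounds = minmax_normalize(measurements)
restored = minmax_denormalize(scaled, bounds)
```

The bounds have to be kept, because the outputs of the trained network live on the scaled range and must be mapped back before they are compared with raw measurements.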

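Let $y_j^{k}(t)$ denote the output signal of neuron $j$ in layer $k$ for test point $t$. Then neuron $i$ in layer $k+1$ receives as input

$$v_i^{k+1}(t) = \sum_j w_{ij}^{k} \, y_j^{k}(t),$$

where $w_{ij}^{k}$ is the weight connecting neuron $j$ in layer $k$ with neuron $i$ in layer $k+1$, and its output is given by

$$y_i^{k+1}(t) = \varphi\big(v_i^{k+1}(t)\big),$$

where $\varphi$ is the activation function.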
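The forward computation just described can be sketched in a few lines. This is a minimal illustration, not the notes' own implementation: it assumes a logistic sigmoid activation, no biases, and a nested-list weight layout in which `weights[i][j]` connects neuron `j` of one layer to neuron `i` of the next. All names are hypothetical.

```python
import math

def sigmoid(v):
    """Assumed activation function (logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-v))

def layer_forward(weights, inputs):
    """Output signals of one layer: weighted sum of the inputs, then activation."""
    return [
        sigmoid(sum(w_ij * y_j for w_ij, y_j in zip(row, inputs)))
        for row in weights
    ]

def forward(all_weights, x):
    """Return the signals of every layer, starting with the input itself."""
    signals = [list(x)]
    for weights in all_weights:
        signals.append(layer_forward(weights, signals[-1]))
    return signals

def zeros(rows, cols):
    """Helper for the usage example: a rows-by-cols weight matrix of zeros."""
    return [[0.0] * cols for _ in range(rows)]

# Three inputs, two hidden layers with two neurons each, one output neuron.
signals = forward([zeros(2, 3), zeros(2, 2), zeros(1, 2)], [0.4, -1.0, 2.5])
```

With all weights equal to zero, every neuron receives input 0 and therefore emits 0.5. Keeping the signals of all layers, rather than only the final output, is what later allows the gradient computation to reuse them.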

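The objective function is

$$E(w) = \frac{1}{2} \sum_{t=1}^{T} \big(e_1^{3}(t)\big)^2, \qquad e_1^{3}(t) = d(t) - y_1^{3}(t)$$

(sum of squared errors), where $d(t)$ is the measurement at test point $t$ and $y_1^{3}(t)$ is the output of the network, i.e. of neuron 1 in layer 3. The factor $\frac{1}{2}$ is a convention that cancels upon differentiation.

Goal: find an unconstrained minimizer $w^{*}$ of $E$ such that $y_1^{3}(t) \approx d(t)$ for all $t = 1, \dots, T$.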

The objective function is a high-dimensional, smooth, non-convex empirical risk functional, defined through the nonlinear parameterization of a feed-forward neural network.

In general, we cannot expect numerical optimization methods such as steepest (gradient) descent to converge to a global minimum of the objective. Gradient descent is a local method: it follows the negative gradient toward stationary points, but in non-convex landscapes there is no general guarantee of reaching the global minimum.
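A one-dimensional toy problem makes the dependence on the starting point concrete. The function below is not taken from the notes; it is an illustrative non-convex polynomial with one global and one merely local minimum, and the step size and iteration count are arbitrary choices.

```python
def f(x):
    """Illustrative non-convex objective with two local minima."""
    return x**4 - 3.0 * x**2 + x

def df(x):
    """Derivative of f."""
    return 4.0 * x**3 - 6.0 * x + 1.0

def gradient_descent_1d(df, x0, eta=0.01, steps=2000):
    """Plain gradient descent: repeatedly step against the derivative."""
    x = x0
    for _ in range(steps):
        x -= eta * df(x)
    return x

x_left = gradient_descent_1d(df, -2.0)   # ends near x = -1.30 (global minimum)
x_right = gradient_descent_1d(df, 2.0)   # ends near x = 1.13 (local minimum only)
```

Both runs stop at a stationary point, but only the run started at -2.0 reaches the lower of the two minima. The same method with the same step size gives a worse result purely because of where it started.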

Remarks:

  • New test points can be added without changing the structure of the model, since the objective is defined as a sum over samples.
  • The structure of the network determines the structure of the objective function and therefore has a fundamental impact on how well different target functions can be approximated. This is a central topic in approximation theory.
  • For the application of steepest descent methods, the activation function must be differentiable, since gradient-based optimization requires the existence of the gradient $\nabla E(w)$, which is computed via the chain rule through the derivative of the activation function.

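Backpropagation (high-level view)

Choose an initial parameter vector $w^{(0)}$.

In Step $k$:

Assume $w^{(k)}$ is already computed. Then compute $w^{(k+1)}$ via:

$$w^{(k+1)} = w^{(k)} - \eta \, \nabla E\big(w^{(k)}\big),$$

where:

  • $\eta > 0$ is the learning rate

  • $\nabla E\big(w^{(k)}\big)$ is the gradient of the loss at $w^{(k)}$

  • $-\nabla E\big(w^{(k)}\big)$ is the steepest descent direction of $E$ at $w^{(k)}$

  • The update moves parameters in the direction of maximal local decrease of the objective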
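The iteration can be sketched independently of any network. The code below is a minimal illustration that assumes the gradient is available as a function; the quadratic test objective, its minimizer, and all names are illustrative and not part of the notes.

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeat the update w <- w - eta * grad(w) for a fixed number of steps."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [w_i - eta * g_i for w_i, g_i in zip(w, g)]
    return w

def grad_quadratic(w):
    """Gradient of E(w) = 0.5 * ((w[0] - 1)**2 + 4 * (w[1] + 2)**2)."""
    return [w[0] - 1.0, 4.0 * (w[1] + 2.0)]

# For this convex test objective the iterates approach the minimizer (1, -2).
w_final = gradient_descent(grad_quadratic, w0=[0.0, 0.0], eta=0.1, steps=300)
```

The loop only ever calls `grad`. For a neural network, supplying that gradient cheaply is exactly the job of backpropagation, while the loop itself stays unchanged.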

Backpropagation is the procedure used to compute the gradient $\nabla E(w)$ efficiently using the chain rule through the network, while the update above is the actual optimization step (gradient descent).

The main point of backpropagation is that the computation of the gradient $\nabla E(w)$ aligns extremely well with the layered structure of the neural network. This structural compatibility allows the gradient to be computed efficiently by reusing intermediate quantities computed during the forward pass. To see this, one must carry out a sequence of explicit (and somewhat tedious) derivative computations, applying the chain rule systematically through the layers of the network.

Backpropagation is efficient because the chain rule factorization matches the computational graph of the network:

  • The network is a compositional map.
  • Gradients propagate naturally in reverse order of this composition.

So instead of recomputing derivatives repeatedly, backpropagation:

  • stores intermediate activations,
  • reuses them during gradient computation.

First rule

The earlier a weight appears in the network, the more indirect and complex its influence on the objective function $E(w)$. This is because its effect propagates through multiple subsequent layers via repeated nonlinear transformations.

For this reason, backpropagation proceeds from the output layer backward toward the input layer.

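We therefore start by computing partial derivatives with respect to the weights in the last layer, i.e.

$$\frac{\partial E}{\partial w_{1j}^{2}},$$

where $w_{1j}^{2}$ is the weight connecting neuron $j$ in layer 2 with the output neuron, and then propagate these derivatives backward through the network using the chain rule.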

Influence of weights

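The weights $w_{1j}^{2}$, one for each neuron $j$ in the second hidden layer, influence the output neuron (neuron 1 in layer 3) through the output signals $y_j^{2}(t)$ of the second hidden layer.

The loss function is:

$$E(w) = \frac{1}{2} \sum_{t=1}^{T} \big(e_1^{3}(t)\big)^2, \qquad e_1^{3}(t) = d(t) - y_1^{3}(t),$$

where $d(t)$ is the measurement at test point $t$ and the output neuron is given by:

$$y_1^{3}(t) = \varphi\Big(\sum_j w_{1j}^{2} \, y_j^{2}(t)\Big).$$

Substituting this into the objective gives:

$$E(w) = \frac{1}{2} \sum_{t=1}^{T} \Big(d(t) - \varphi\Big(\sum_j w_{1j}^{2} \, y_j^{2}(t)\Big)\Big)^2.$$

Differentiating with the chain rule and using $\varphi' = \varphi\,(1 - \varphi)$ yields

$$\frac{\partial E}{\partial w_{1j}^{2}} = -\sum_{t=1}^{T} e_1^{3}(t) \, y_1^{3}(t) \, \big(1 - y_1^{3}(t)\big) \, y_j^{2}(t).$$

Thus, for these weights, the steepest descent update with learning rate $\eta$ is

$$w_{1j}^{2} \leftarrow w_{1j}^{2} + \eta \sum_{t=1}^{T} \delta_1^{3}(t) \, y_j^{2}(t), \qquad \delta_1^{3}(t) = e_1^{3}(t) \, y_1^{3}(t) \, \big(1 - y_1^{3}(t)\big),$$

where $\delta_1^{3}(t)$ is called the local gradient.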

Note that the update is easily computable from:

  • the error of the output, $e_1^{3}(t)$,
  • the output $y_1^{3}(t)$ itself,
  • the output signal $y_j^{2}(t)$ in layer 2.
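The output-layer update can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the activation is taken to be the logistic sigmoid, the objective carries the conventional factor 1/2, the layer-2 signals are treated as given data, and every name (`y2`, `d`, `eta`, the function names) is hypothetical.

```python
import math

def sigmoid(v):
    """Assumed activation function (logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-v))

def output(w, y2_t):
    """Output of the single output neuron for one sample of layer-2 signals."""
    return sigmoid(sum(w_j * y_j for w_j, y_j in zip(w, y2_t)))

def objective(w, y2, d):
    """Half the sum of squared errors over all samples."""
    return 0.5 * sum((d_t - output(w, y2_t)) ** 2 for y2_t, d_t in zip(y2, d))

def output_layer_step(w, y2, d, eta):
    """One steepest descent step for the weights feeding the output neuron."""
    increments = [0.0] * len(w)
    for y2_t, d_t in zip(y2, d):
        y3 = output(w, y2_t)
        # Local gradient: output error times the sigmoid derivative y3 * (1 - y3).
        delta = (d_t - y3) * y3 * (1.0 - y3)
        for j, y_j in enumerate(y2_t):
            increments[j] += delta * y_j
    return [w_j + eta * inc for w_j, inc in zip(w, increments)]

# Three samples, two neurons in layer 2 (made-up numbers for illustration).
y2 = [[0.2, 0.9], [0.7, 0.4], [0.5, 0.5]]
d = [0.8, 0.3, 0.6]
w = [0.1, -0.2]
w_new = output_layer_step(w, y2, d, eta=0.5)
```

Each sample contributes only quantities that the forward pass has already produced. No derivative is recomputed from scratch, which is why the update is cheap.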