Well, remember that the gradient is only correct as the Δx gets smaller and smaller, infinitely small.

This is the cool bit! What happens to the Δx in the expression ∂s/∂t = 3t² + Δx² as Δx gets smaller and smaller? It disappears! If that sounds surprising, think of a very small value for Δx. If you try, you can think of an even smaller one. And an even smaller one ... and you could keep going forever, getting ever closer to zero. So let’s just get straight to zero and avoid all that hassle.

That gives us the mathematically precise answer we were looking for: ∂s/∂t = 3t².

That’s a fantastic result, and this time we used a powerful mathematical tool to do calculus, and it wasn’t that hard at all.
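If you’d like to see this happening numerically, here is a minimal Python sketch (my own illustration, not part of the original text) that shrinks Δx for s = t³ and watches the gradient estimate settle on 3t². The function name s and the chosen point t = 2.0 are arbitrary.

```python
# A small sketch of the "diminishing deltas" idea for s = t**3.
# The symmetric estimate (s(t + dx) - s(t - dx)) / (2 * dx) works out to 3*t**2 + dx**2,
# so as dx shrinks the estimate settles on exactly 3*t**2.

def s(t):
    return t ** 3

t = 2.0                                   # at this point 3*t**2 = 12
for dx in [1.0, 0.1, 0.01, 0.001]:
    gradient = (s(t + dx) - s(t - dx)) / (2 * dx)
    print(f"dx = {dx:<6}  gradient estimate = {gradient}")

# Prints values close to 13.0, 12.01, 12.0001 and 12.000001, closing in on 12.
```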

Patterns

As much fun as it is to work out derivatives using deltas like Δx and to see what happens when we make them smaller and smaller, we can often do it without all that work.

See if you can spot any pattern in the derivatives we’ve worked out so far: for example, s = t² gave us ∂s/∂t = 2t, and s = t³ gave us ∂s/∂t = 3t².

You can see that the derivative of a function of t is the same function but with each power of t reduced by one. So t⁴ becomes t³, and t⁷ would become t⁶, and so on. That’s really easy! And if you remember that t is just t¹, then in the derivative it becomes t⁰, which is 1.

Constant numbers on their own like 3 or 4 or 5 simply disappear. Constant variables on their own, which we might call a, b or c, also disappear, because they too have no rate of change. That’s why they’re called constants.

But hang on, t² became 2t, not just t. And t³ became 3t², not just t². Well, there is an extra step where the power is used as a multiplier before it is reduced. So the 5 in 2t⁵ is used as an additional multiplier before the power is reduced: 5 × 2t⁴ = 10t⁴.

The following summarises this power rule for doing calculus: if s = tⁿ then ∂s/∂t = n tⁿ⁻¹, with any constant multiplier simply carried along.

Let’s try it on some more examples just to practice this new technique.
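Here is a short Python sketch (my own examples, not those from the original text) applying the power rule to a few terms of the form a·tⁿ; the function name power_rule is just something made up for illustration.

```python
# The power rule for a single term a*t**n: multiply by the power, then reduce it by one,
# giving (a*n)*t**(n-1).  A constant (n = 0) simply disappears.

def power_rule(a, n):
    """Return (coefficient, power) of the derivative of a*t**n with respect to t."""
    if n == 0:
        return (0, 0)                     # constants have no rate of change
    return (a * n, n - 1)

for a, n in [(1, 4), (1, 7), (2, 5), (5, 0)]:
    new_a, new_n = power_rule(a, n)
    print(f"d/dt of {a}*t^{n}  ->  {new_a}*t^{new_n}")

# t^4 becomes 4*t^3, t^7 becomes 7*t^6, 2*t^5 becomes 10*t^4, and the constant 5 becomes 0.
```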

So this rule allows us to do quite a lot of calculus, and for many purposes it’s all the calculus we need. Yes, it only works for polynomials, that is, expressions made of variables with powers like y = ax³ + bx² + cx + d, and not with things like sin(x) or cos(x). That’s not a major flaw, because there are a huge number of uses for doing calculus with this power rule.
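As a quick check, a symbolic algebra library will agree with the power rule on a polynomial like this. The sketch below uses sympy, which is my choice here rather than anything the text relies on.

```python
# Differentiate y = a*x**3 + b*x**2 + c*x + d symbolically and compare with the power rule.

import sympy as sp

x, a, b, c, d = sp.symbols('x a b c d')
y = a * x**3 + b * x**2 + c * x + d

print(sp.diff(y, x))   # the power-rule answer: 3*a*x**2 + 2*b*x + c (terms may print in another order)
```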

However for neural networks we do need one extra tool, which we’ll look at next.

Functions of Functions

Imagine a function f = y²

where y is itself a function of x, namely y = x³ + x.

We can write this as f = (x³ + x)² if we wanted to.

How does f change with y? That is, what is ∂f/∂y? This is easy: we simply apply the power rule we just developed, multiplying and reducing the power, so ∂f/∂y = 2y.

What about a more interesting question – how does f change with x? Well, we could expand out the expression f = (x³ + x)² into a polynomial in x and apply this same approach. What we can’t do is apply the power rule naively, turning (x³ + x)² into 2(x³ + x).

If we worked many of these out the long hard way, using the diminishing deltas approach like before, we’d stumble upon another set of patterns. Let’s jump straight to the answer.

The pattern is this: ∂f/∂x = (∂f/∂y) · (∂y/∂x).

This is a very powerful result, and is called the chain rule.

You can see that it allows us to work out derivatives in layers, like onion rings, unpacking each layer of complexity. To work out ∂f/∂x we might find it easier to first work out ∂f/∂y and then ∂y/∂x. If these are easier, we can then do calculus on expressions that otherwise look quite impossible. The chain rule allows us to break a problem into smaller, easier ones.

Let’s look at that example again and apply this chain rule:

We now work out the easier bits. The first bit is ∂f/∂y = 2y. The second bit is ∂y/∂x = 3x² + 1. So recombining these bits using the chain rule we get ∂f/∂x = 2y · (3x² + 1).

We know that y = x³ + x, so we can have an expression with only x: ∂f/∂x = 2(x³ + x)(3x² + 1).

Magic!
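To convince yourself, here is a minimal Python sketch (again my own illustration) comparing the chain rule answer 2(x³ + x)(3x² + 1) against a small-delta numerical gradient of f = (x³ + x)²; the function names and the test point x = 1.5 are arbitrary.

```python
# Chain rule check for f = y**2 with y = x**3 + x:
# the claim is df/dx = 2*y * (3*x**2 + 1) = 2*(x**3 + x)*(3*x**2 + 1).

def f_of_x(x):
    y = x ** 3 + x
    return y ** 2

def chain_rule_gradient(x):
    y = x ** 3 + x
    return 2 * y * (3 * x ** 2 + 1)       # (df/dy) * (dy/dx)

x = 1.5
dx = 1e-6
numerical = (f_of_x(x + dx) - f_of_x(x - dx)) / (2 * dx)
print(chain_rule_gradient(x), numerical)  # both should be about 75.5625
```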

You may challenge this and ask why we didn’t just expand out f in terms of x first and then apply simple power rule calculus to the resulting polynomial. We could have done that, but we wouldn’t have illustrated the chain rule, which allows us to crack much harder problems.