
Activation functions





I would recommend reading up on the basics of neural networks before reading this article, for better understanding.

Activation functions

So what does an artificial neuron do? Simply put, it calculates a weighted sum of its inputs, adds a bias, and produces a value Y. Now, the value of Y can be anything from -inf to +inf. So how do we decide whether the neuron should fire or not, and what should this firing pattern be?

Step function

The first thing that comes to mind is a threshold-based activation function: if the value of Y is above a certain threshold, declare the neuron activated; otherwise, not. This does make an activation function for a neuron. However, there are certain drawbacks. To understand them better, think about the following. Suppose you are creating a binary classifier. A step function could do that for you! Now think about the case where you want multiple such neurons connected to bring in more classes: class1, class2, class3, and so on. What happens if more than one neuron is activated? All of those neurons output a 1 from the step function. Which class is it, then? A network like this is harder to train and harder to get to converge.

Linear function

The next thing that comes to mind is a linear function: a straight line where the activation is proportional to the input. This gives a range of activations, so it is not a binary activation. We can definitely connect a few neurons together, and if more than one fires, we could take the max (or a softmax) and decide based on that. So that is okay too. Then what is the problem with this? If you are familiar with gradient descent, you will notice that the derivative of this function is a constant. That means the gradient has no relationship with X: the descent happens on a constant gradient. If there is an error in prediction, the change made by backpropagation is constant and does not depend on the change in input, delta(x). That is not good!

There is another problem too. Think about connected layers. Each layer is activated by a linear function. That activation in turn goes into the next layer as input, and the second layer calculates a weighted sum of that input and fires based on another linear activation function. No matter how many layers we have, if all of them are linear, the final activation of the last layer is nothing but a linear function of the input to the first layer! Pause for a bit and think about it. That means these two layers (or N layers) can be replaced by a single layer. We just lost the ability to stack layers this way. No matter how we stack, the whole network is still equivalent to a single layer with linear activation: a linear combination of linear functions is still just another linear function.

Sigmoid function

Another option is the sigmoid function, which squashes its input into the range (0, 1). What are the benefits of this? Think about it for a moment. First things first, it is nonlinear in nature, and combinations of it are also nonlinear, so now we can stack layers. What about non-binary activations? It gives an analog activation, unlike the step function, and it has a smooth gradient too. And if you notice, between X values of about -2 and 2, the curve is very steep: any small change in X in that region causes Y to change significantly. That means this function has a tendency to push Y values toward either end of the curve, making clear distinctions in predictions. Another advantage is that, unlike the linear function, the output of the activation function is always in the range (0, 1) rather than (-inf, +inf), so our activations are bounded. Sigmoid functions are one of the most widely used activation functions today.
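To make the contrast concrete, here is a minimal NumPy sketch of the three activations discussed so far and of their gradient behaviour. The function names, the sample inputs, and the two small weight matrices are my own illustration, not something taken from the article.

```python
import numpy as np

def step(y, threshold=0.0):
    # Fires (1) if the weighted sum crosses the threshold, else 0.
    return (y > threshold).astype(float)

def linear(y, c=1.0):
    # Activation proportional to the input; the derivative is the constant c everywhere.
    return c * y

def sigmoid(y):
    # Squashes any input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_grad(y):
    # Derivative of the sigmoid: largest near y = 0, close to zero for large |y|.
    s = sigmoid(y)
    return s * (1.0 - s)

y = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(step(y))          # [0. 0. 0. 1. 1.]  -> purely binary decisions
print(linear(y))        # the same values back; the gradient is 1 regardless of y
print(sigmoid(y))       # bounded in (0, 1), steep between roughly -2 and 2
print(sigmoid_grad(y))  # ~0 at both ends: the region where gradients vanish

# Two stacked layers with linear activation collapse into a single linear map:
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5, -1.0], [1.5, 0.0]])
x = np.array([1.0, -2.0])
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))  # True
```

Note how the step output is purely binary, the linear "gradient" is the same everywhere, the two stacked linear layers collapse into one, and the sigmoid gradient is essentially zero at the extremes.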
Then what are the problems with the sigmoid? If you notice, towards either end of the curve, the Y values respond very little to changes in X. What does that mean? The gradient in that region is going to be small, or may even vanish, so the network cannot make significant weight changes: this is the vanishing gradient problem. There are ways to work around it, and sigmoid is still very popular in classification problems.

Tanh function

Another activation function that is used is the tanh function. It has characteristics similar to the sigmoid discussed above. It is nonlinear in nature, so great, we can stack layers! It is bounded to the range (-1, 1), so no worries about activations blowing up. One point to mention is that the gradient is stronger for tanh than for sigmoid (the derivatives are steeper). Deciding between sigmoid and tanh will depend on your requirement for gradient strength. Like sigmoid, tanh also has the vanishing gradient problem. Tanh is also a very popular and widely used activation function.

ReLu

The ReLu function gives an output of x if x is positive and 0 otherwise. At first glance it looks like it has the same problems as the linear function, since it is linear in the positive axis. But first of all, ReLu is nonlinear in nature, and combinations of ReLu are also nonlinear! In fact, any function can be approximated with combinations of ReLu. Great, so this means we can stack layers. It is not bounded, though, which means the activations can blow up.

Another point I would like to discuss here is the sparsity of activations. Imagine a big neural network with a lot of neurons. Using a sigmoid or tanh will cause almost all neurons to fire in an analog way (remember?). That means almost all activations will be processed to describe the output of the network; in other words, the activation is dense. We would ideally want only a few neurons in the network to activate, making the activations sparse and efficient. ReLu gives us this benefit. Imagine a network with randomly initialized (or normalised) weights: almost 50% of the network yields 0 activation because of the characteristic of ReLu (it outputs 0 for negative values of x). This means fewer neurons are firing (sparse activation) and the network is lighter.

ReLu seems to be awesome! Yes it is, but nothing is flawless. Because of the horizontal line in ReLu (for negative X), the gradient can go to 0. For activations in that region, the gradient is 0, so the weights will not get adjusted during descent. This is called the dying ReLu problem. It can cause several neurons to simply die and stop responding, making a substantial part of the network passive. There are variations of ReLu that mitigate this issue by turning the horizontal line into a non-horizontal component: this is Leaky ReLu. There are other variations too, such as Maxout, which is made from both ReLu and Leaky ReLu. The main idea is to keep the gradient non-zero so that neurons can recover during training. ReLu is also less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations, which is a good point to consider when we are designing deep neural nets.

Ok, now which one do we use? Does that mean we just use ReLu for everything we do? Or sigmoid or tanh? Well, yes and no. When you know that the function you are trying to approximate has certain characteristics, you can choose an activation function that will approximate it faster, leading to faster training and convergence.
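As a rough illustration of the tanh and ReLu discussion, here is a small NumPy sketch. The helper names, the random pre-activations, and the 0.01 leak factor are assumptions of mine for the example, not values from the article.

```python
import numpy as np

def tanh(y):
    # Bounded in (-1, 1); steeper derivative than sigmoid, but it still saturates.
    return np.tanh(y)

def relu(y):
    # max(0, x): passes positive values through and zeroes out negative ones.
    return np.maximum(0.0, y)

def relu_grad(y):
    # Gradient is 0 wherever y < 0 -- the "dying ReLu" region.
    return (y > 0).astype(float)

def leaky_relu(y, alpha=0.01):
    # A small slope alpha on the negative side keeps the gradient non-zero.
    return np.where(y > 0, y, alpha * y)

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=10_000)   # stand-in for a layer's weighted sums

print("fraction of zero ReLu activations:", np.mean(relu(pre_activations) == 0.0))  # ~0.5

sample = np.array([-3.0, -0.5, 2.0])
print(tanh(sample))        # bounded between -1 and 1
print(relu_grad(sample))   # [0. 0. 1.] -> no weight updates for the negative inputs
print(leaky_relu(sample))  # [-0.03  -0.005  2.0] -> small but non-zero response
```

On standard-normal pre-activations roughly half the ReLu outputs are exactly zero, which is the sparsity argument above, while the leaky variant keeps a small non-zero slope (and hence a non-zero gradient) on the negative side.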
You can use your own custom functions too! ReLu works well most of the time as a general approximator. In this article, I tried to describe a few commonly used activation functions. There are other activation functions too, but the general idea remains the same, and research into better activation functions is still ongoing. I hope you got the idea behind activation functions, why they are used, and how we decide which one to use.

