When I first learned about this function, it was called the logistic function. Some time later, I realized that some people call it the sigmoid function. The reason was simple: its graph is shaped like the letter S, hence the name sigmoid. OK, I can accept that.
Nowadays, many people have started calling it the softmax function for some reason. In this session, I will walk from the old-fashioned logistic function to the fancy-sounding softmax function and explain how the two are related.
To me, the original purpose of the logistic function was to transform an arbitrary number into a probability. In some sense, it is simply a mapping from the whole real line (−∞, ∞) to the unit interval (0, 1).
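For concreteness, this is the standard logistic function, written in the notation used throughout this post:

f(x) = \frac{1}{1 + e^{-x}}, \qquad f : (-\infty, \infty) \to (0, 1).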
One notable aspect of the function is that as we move away from its center, pushing the probability higher (or lower, if you are on the left side of the graph) gets harder and harder. In other words, suppose your score is 50 out of 100. With a little more work, you are very likely to improve it. But if you already have 98 out of 100, it takes much more effort to gain even 1 more point.
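To make this saturation effect concrete, here is a small numerical sketch in plain Python (my own illustration, not from the original post):

```python
import math

def logistic(x):
    """Standard logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Equal steps of +1 in x buy less and less probability as we move right.
for x in range(0, 6):
    gain = logistic(x + 1) - logistic(x)
    print(f"x = {x}: f(x) = {logistic(x):.4f}, gain from one more unit = {gain:.4f}")
# Around the center (x = 0) one unit of x is worth about 0.23 in probability;
# by x = 5 the same unit is worth less than 0.005.
```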
Now take a look at a generalization of the function; here f is the logistic function from above. Why is g_x(x, y) a generalization of f? If you plug 0 into x, then g_x(0, y) becomes f(y), so it generalizes f. Similarly, if you plug 0 into y of g_y(x, y), you get f(x).
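The explicit formulas are not reproduced here, so the following is my reconstruction, chosen to be consistent with the three properties described in this section (the subscript naming simply follows the text):

g_x(x, y) = \frac{e^{y}}{e^{x} + e^{y}} = \frac{1}{1 + e^{x - y}}, \qquad g_y(x, y) = \frac{e^{x}}{e^{x} + e^{y}} = \frac{1}{1 + e^{y - x}}.

With these definitions, g_x(0, y) = e^{y} / (1 + e^{y}) = f(y) and g_y(x, 0) = e^{x} / (e^{x} + 1) = f(x), which is exactly the generalization property.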
As you can see, there are two g's, g_x and g_y. If you add these two functions together, you get 1. This implies that if you know one of the g's, the other is automatically determined as (1 − the one you know).
The final thing to note is that these g functions are invariant under translating both coordinates by the same value. This property helps us understand the relationship between f(x) and the g's.
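With the reconstructed definitions above, the translation invariance is a one-line check: adding the same constant c to both coordinates multiplies the numerator and the denominator by e^{c}, which cancels out:

g_y(x + c, y + c) = \frac{e^{x + c}}{e^{x + c} + e^{y + c}} = \frac{e^{c} e^{x}}{e^{c} (e^{x} + e^{y})} = g_y(x, y),

and the same holds for g_x.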
Suppose you have a number, say 2. You can transform this number into a probability using f(x). At the same time, you get another probability, 1 − f(x) = f(−x), and the two add up to 1.
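This identity is easy to verify directly:

1 - f(x) = 1 - \frac{1}{1 + e^{-x}} = \frac{e^{-x}}{1 + e^{-x}} = \frac{1}{e^{x} + 1} = f(-x).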
By the way, having the number 2 in one-dimensional space is equivalent to having (0, 2) in two-dimensional space, or (2, 0) if you prefer. How can this fact be interpreted in terms of the g's?
From the first property, we know that f(2) = g_x(0,2). And from the second property g_x(0,2) + g_y(0,2) should be 1. Right?
Here, from the third property, g_y(0, 2) = g_y(−2, 0), since we have translated both coordinates by −2. When you look at g_y(−2, 0) carefully, it is just f(−2), by the first property. So the two-dimensional view through the g's gives back exactly the pair f(2) and f(−2) we had before.
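Here is a quick numerical sanity check of this chain of equalities, again using the reconstructed g's (a sketch, not code from the original post):

```python
import math

def f(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def g_x(x, y):
    """Two-variable generalization; g_x(0, y) reduces to f(y)."""
    return math.exp(y) / (math.exp(x) + math.exp(y))

def g_y(x, y):
    """Companion function; g_x + g_y = 1 everywhere."""
    return math.exp(x) / (math.exp(x) + math.exp(y))

print(f(2), g_x(0, 2))        # first property:  f(2) == g_x(0, 2)   (~0.8808)
print(g_x(0, 2) + g_y(0, 2))  # second property: the two g's sum to 1
print(g_y(0, 2), g_y(-2, 0))  # third property:  translating by -2 changes nothing
print(g_y(-2, 0), f(-2))      # and g_y(-2, 0) is just f(-2)         (~0.1192)
```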
Another way to explain the softmax function.
The sigmoid function usually refers to the special case of the general logistic function shown below.
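The formula being referred to is, most likely, the general logistic function with a maximum value L, steepness k, and midpoint x₀; the sigmoid is the special case L = 1, k = 1, x₀ = 0:

f(x) = \frac{L}{1 + e^{-k(x - x_0)}} \quad\longrightarrow\quad \sigma(x) = \frac{1}{1 + e^{-x}} \quad (L = 1,\ k = 1,\ x_0 = 0).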
OK. Then what is the formula for the softmax? Let's take a look.
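The standard softmax formula, matching the description in the next paragraph, is

\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K.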
The index j runs from 1 to K, and z is a K-dimensional vector. At first glance, the relationship between the two formulas is not obvious, so let's start with some concrete numbers, say 3 and 5. Since we have two numbers, K is 2, and z is the 2-dimensional vector [3, 5].
Plugging these into the softmax equation, we get the following two numbers, which behave like probabilities.
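As a quick sketch (my own illustration), here is the computation in Python, together with the link back to the logistic function: only the difference 5 − 3 = 2 matters, so the two outputs are exactly f(−2) and f(2).

```python
import math

def softmax(z):
    """Plain softmax: exponentiate every element and normalize to sum to 1."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def f(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

print(softmax([3, 5]))   # -> [0.1192..., 0.8807...]
print(f(-2), f(2))       # -> 0.1192...  0.8807...   the same pair
```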
OK, it looks fine. However, think about the case where one number is much bigger than the other, say [3, 5¹⁰⁰]. In this case, the exponential of 5¹⁰⁰ is so large that the computer will treat it as essentially infinity. Modern libraries handle this for you under the hood, but it is worth knowing the remedy. One simple fix is to normalize so that the maximum value becomes zero: take the max of the vector and subtract it from every element. Thanks to the translation-invariance property we saw earlier, subtracting the same value from every element does not change the softmax output.
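A sketch of that max-subtraction trick, often called the numerically stable softmax (again my own illustration):

```python
import math

def stable_softmax(z):
    """Softmax with the maximum subtracted first, so the largest exponent is e^0 = 1.

    Subtracting the same constant from every element does not change the result
    (translation invariance again), but it prevents overflow when an element is huge.
    """
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(stable_softmax([3, 5]))       # same answer as before: [0.1192..., 0.8807...]
print(stable_softmax([3, 1000.0]))  # the naive version would overflow on exp(1000)
```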
Going further, this connects to entropy, so stay tuned for the next story.