The natural starting point for explaining cross-entropy is the Bernoulli distribution. It is the discrete probability distribution of a random variable that takes the value 1 with probability p and the value 0 with probability 1 - p.
It can be used to represent a (possibly biased) coin toss, where 1 and 0 stand for "heads" and "tails" (or vice versa) and p is the probability of the coin landing on whichever side is labeled 1.
Mathematically, it can be written as follows: P(X = x) = p^x (1 - p)^(1 - x) for x in {0, 1}. This formula is called the probability mass function (pmf).
1. Prerequisite
For those of you who know almost nothing about university-level statistics, the formula might look daunting, so let me separate it into two parts.
- The LHS says that this is the probability of an event. It also implies that this is a discrete distribution, because the random variable X takes a single value x; for continuous distributions, the probability of taking any single value is always 0.
- The RHS is a clever way to say that this random variable takes the value 1 with probability p and the value 0 with probability 1 - p. Think about it: plugging in x = 1 leaves only p, and plugging in x = 0 leaves only 1 - p (the sketch below does exactly this).
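If you would rather see it than think about it, here is a tiny Python sketch (my own, with an illustrative p = 0.7 and a made-up function name) that plugs x = 1 and x = 0 into the formula:

```python
def bernoulli_pmf(x, p):
    """Bernoulli(p) probability mass function: p^x * (1 - p)^(1 - x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)

p = 0.7
print(bernoulli_pmf(1, p))  # 0.7  -> P(X = 1) = p
print(bernoulli_pmf(0, p))  # ~0.3 -> P(X = 0) = 1 - p
```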
2. An example
Sometimes we call this formula the likelihood function, because it shows how likely it is to observe certain events. To illustrate, assume that we don't know the value of p and want to estimate it by tossing the coin three times.
Suppose we observe (1, 1, 0). What is the probability (likelihood) of this event? The answer is p · p · (1 - p) = p^2 (1 - p).
Now the question is: what value of p makes the event (1, 1, 0) most likely? In other words, what value of p maximizes the likelihood of the event?
We can solve it formally, by setting the derivative d/dp [p^2 (1 - p)] = 2p - 3p^2 to zero, but your intuition might already say that 2/3 is the answer. And you are correct.
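If you want to double-check this numerically rather than by calculus, a minimal sketch (my own code, using a simple grid search) evaluates the likelihood p^2 (1 - p) over candidate values of p and picks the maximizer:

```python
import numpy as np

p_grid = np.linspace(0.001, 0.999, 999)   # candidate values of p
likelihood = p_grid ** 2 * (1 - p_grid)   # likelihood of observing (1, 1, 0)
p_hat = p_grid[np.argmax(likelihood)]
print(p_hat)  # ~0.667, i.e. 2/3
```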
3. Change the point of view
Confused? OK. Let's look at the formula again. The same formula can be read from two different points of view:
- We can view x as the variable and treat p as given; then the formula tells you the probability of x.
- We can view the formula as a function of p and assume that x is given (the ground truth, in machine learning terminology). Then the formula is interpreted as a likelihood function: it shows how likely it is to have observed the ground truth x as p varies. The sketch below illustrates both views.
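To make the two readings concrete, here is a small sketch (again my own illustration, with arbitrary numbers) that uses the same pmf first as a probability of x with p fixed, and then as a likelihood over p with x fixed:

```python
def bernoulli_pmf(x, p):
    return p ** x * (1 - p) ** (1 - x)

# View 1: p is given (say 0.7) and x is the variable -> probabilities of the two outcomes.
print([bernoulli_pmf(x, 0.7) for x in (0, 1)])         # ~[0.3, 0.7], sums to 1

# View 2: x is given (the ground truth, say 1) and p is the variable -> a likelihood curve.
print([bernoulli_pmf(1, p) for p in (0.1, 0.5, 0.9)])  # [0.1, 0.5, 0.9], need not sum to 1
```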
4. Loss is a bad thing. Minimize it.
For me, cross-entropy is just a fancy name for something we have already seen.
It is just the negative of the log-likelihood of the Bernoulli distribution, -(x log p + (1 - x) log(1 - p)). That's it. You probably don't like the log function, but the log will save your life at university.
Since the likelihood should be maximized, it is easy to remember the negative of the log-likelihood (in fact, its name is cross-entropy) as a loss that you should try your best to minimize.
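To tie it together, here is a closing sketch (mine, not part of the original post) that writes the negative log-likelihood in the usual cross-entropy form and minimizes it over the same data (1, 1, 0); the minimizer is the same 2/3 that maximized the likelihood:

```python
import numpy as np

def cross_entropy(p, xs):
    """Negative log-likelihood of Bernoulli observations xs under parameter p:
    -sum(x * log(p) + (1 - x) * log(1 - p))."""
    xs = np.asarray(xs)
    return -np.sum(xs * np.log(p) + (1 - xs) * np.log(1 - p))

xs = [1, 1, 0]
p_grid = np.linspace(0.001, 0.999, 999)
losses = [cross_entropy(p, xs) for p in p_grid]
print(p_grid[np.argmin(losses)])  # ~0.667: minimizing the loss = maximizing the likelihood
```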