[Lecture Notes] Loss is a bad thing. Minimize it.

The natural starting point for explaining cross-entropy is the Bernoulli distribution. It is the discrete probability distribution of a random variable that takes the value 1 with probability p and the value 0 with probability 1 − p.

It can be used to represent a (possibly biased) coin toss, where 1 and 0 represent “heads” and “tails” (or vice versa), and p is the probability of the coin landing on heads.

Mathematically, it can be written as follows. The formula is called the probability mass function (pmf):

P(X = x) = p^x (1 − p)^(1 − x), for x ∈ {0, 1}.

1. Prerequisite

For those of you who know almost nothing about university-level statistics, the formula might be daunting. So, let me separate it into two parts.

  • The LHS says that this is the probability of some event. It also implies that this is a discrete distribution, because the random variable X takes the single value x with positive probability. For continuous distributions, the probability of taking any single value is always 0.
  • The RHS is a clever way of saying that this random variable takes the value 1 with probability p and the value 0 with probability 1 − p. Plug in x = 1 and x = 0 and see for yourself.
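The two bullet points above can be checked in a few lines of plain Python (the function name is mine, just for illustration):

```python
# Bernoulli pmf from the formula above: p^x * (1 - p)^(1 - x), x in {0, 1}.
def bernoulli_pmf(x, p):
    return p ** x * (1 - p) ** (1 - x)

# With p = 0.3: plugging in x = 1 leaves only p, and x = 0 leaves only 1 - p.
print(bernoulli_pmf(1, 0.3))  # 0.3
print(bernoulli_pmf(0, 0.3))  # 0.7
```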

2. An example

Sometimes we call it the likelihood function, because it tells us how likely an observed event is. To illustrate, assume that we don’t know the value of p and want to estimate it by tossing the coin three times.

Suppose we observe (1, 1, 0). What is the probability (likelihood) of this event? The answer is

L(p) = p · p · (1 − p) = p²(1 − p).

Now the question is: what value of p makes the event (1, 1, 0) the most likely? In other words, what value of p maximizes the likelihood of the event?

We can solve it by setting the derivative of the likelihood (or of its log) to zero, but your intuition might already say 2/3 is the answer: two heads out of three tosses. And you are correct.
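If you don’t trust the calculus or the intuition, a brute-force grid search over p confirms it (a quick sketch, not how you would do this in practice):

```python
# Likelihood of observing (1, 1, 0) as a function of p: L(p) = p * p * (1 - p).
def likelihood(p):
    return p * p * (1 - p)

# Try every p on a fine grid and keep the maximizer.
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # 0.667, i.e. the analytic answer 2/3
```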

3. Change the views

Confused? OK. Let’s look at the formula again: the same formula, from two different points of view.

  • We can view x as the variable and p as given; then the formula gives the probability of x.
  • We can view the formula as a function of p and assume that x is given (the ground truth, in machine-learning terminology). Then the formula is interpreted as a likelihood function: it shows us how likely it is to observe the ground truth x as the variable p moves.
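The two views use one and the same formula; only which argument is held fixed changes. A small sketch (function name mine):

```python
# One formula, two readings.
def bernoulli(x, p):
    return p ** x * (1 - p) ** (1 - x)

# View 1: p fixed at 0.3, x varies -> a pmf over outcomes; sums to 1.
pmf = [bernoulli(x, 0.3) for x in (0, 1)]
print(pmf)  # [0.7, 0.3]

# View 2: x fixed at the ground truth 1, p varies -> a likelihood function.
# It need not sum to 1; it just says which p explains x = 1 best.
lik = [bernoulli(1, p) for p in (0.1, 0.5, 0.9)]
print(lik)  # [0.1, 0.5, 0.9]
```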

4. Loss is a bad thing. Minimize it.

For me, the cross-entropy is just a fancy name for something else.

It is just the negative log-likelihood of the Bernoulli distribution. That’s it. You probably don’t like the log function, but the log will save your life at university: it turns a product of probabilities into a sum, which is far easier to work with.

Since the likelihood should be maximized, it is easy to remember the negative log-likelihood

−[x log p + (1 − x) log(1 − p)]

(in fact, its name is the cross-entropy) as a loss, and you should try your best to minimize it.

5. TensorFlow examples
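The original examples did not survive here, so below is a minimal sketch of the quantity involved. In TensorFlow this is what `tf.keras.losses.BinaryCrossentropy` computes; to keep the example self-contained, the averaging it performs is written out in plain Python (the function name and the toy numbers are mine):

```python
import math

# In TensorFlow, the equivalent would be roughly:
#   bce = tf.keras.losses.BinaryCrossentropy()
#   bce(y_true, y_pred)
# i.e. the mean of the per-example negative log-likelihoods below.
def binary_cross_entropy(y_true, y_pred):
    losses = [-(x * math.log(p) + (1 - x) * math.log(1 - p))
              for x, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

# The three coin tosses (1, 1, 0) scored against the predicted P(heads) = 2/3,
# which was the maximum-likelihood (minimum cross-entropy) value:
print(binary_cross_entropy([1, 1, 0], [2/3, 2/3, 2/3]))  # ≈ 0.6365
```

Minimizing this loss over the prediction is exactly the likelihood maximization from section 2, just with a sign flip and a log.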




Ydobon is nobody.
