[ML Crash Course] Logistic Regression: Probability and Loss

A Ydobon
5 min read · Sep 1, 2020

--


Calculating a Probability

Many problems require a probability estimate as output. Logistic regression is an extremely efficient mechanism for calculating probabilities. Practically speaking, you can use the returned probability in either of the following two ways:

  • “As is”
  • Converted to a binary category.

Let’s consider how we might use the probability “as is.” Suppose we create a logistic regression model to predict the probability that a dog will bark during the middle of the night. We’ll call that probability:

p(bark | night)

If the logistic regression model predicts a p(bark | night) of 0.05, then over a year, the dog’s owners should be startled awake approximately 18 times:

startled ≈ p(bark | night) · nights = 0.05 · 365 ≈ 18

In many cases, you’ll map the logistic regression output into the solution to a binary classification problem, in which the goal is to correctly predict one of two possible labels (e.g., “spam” or “not spam”).
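To make the second use concrete, here is a minimal Python sketch of converting a predicted probability into a binary label with a decision threshold; the 0.5 cutoff is an assumed example value, not something prescribed above.

```python
def to_binary_label(probability, threshold=0.5):
    """Map a predicted probability to one of two labels (1 or 0).

    The threshold here is an assumed example value; in practice it is
    chosen based on the costs of false positives and false negatives.
    """
    return 1 if probability >= threshold else 0

print(to_binary_label(0.05))  # 0, e.g. "not spam"
print(to_binary_label(0.92))  # 1, e.g. "spam"
```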

You might be wondering how a logistic regression model can ensure output that always falls between 0 and 1. As it happens, a sigmoid function, defined as follows, produces an output having those same characteristics:

y = 1 / (1 + e^(-z))

The sigmoid function yields the following plot:

I took this plot from my other post; in this post, x should be read as z.

If z represents the output of the linear layer of a model trained with logistic regression, then sigmoid(z) will yield a value (a probability) between 0 and 1. In mathematical terms:

y′ = 1 / (1 + e^(-z))

where:

  • y′ is the output of the logistic regression model for a particular example.
  • z = b + w1x1 + w2x2 + … + wNxN
  • The w values are the model’s learned weights, and b is the bias.
  • The x values are the feature values for a particular example.
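As a minimal sketch of these definitions, the snippet below implements the sigmoid and shows that it maps any log-odds value z to a probability strictly between 0 and 1; the sample z values are placeholders chosen for illustration.

```python
import math

def sigmoid(z):
    """Squash a real-valued log-odds z into a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(z, round(sigmoid(z), 3))
# -4.0 -> 0.018, -1.0 -> 0.269, 0.0 -> 0.5, 1.0 -> 0.731, 4.0 -> 0.982
```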

Note that z is also referred to as the log-odds because the inverse of the sigmoid states that z can be defined as the log of the probability of the "1" label (e.g., "dog barks") divided by the probability of the "0" label (e.g., "dog doesn't bark"):

z = log(y / (1 - y))
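A quick numeric check of this inverse relationship (the z value below is an arbitrary placeholder):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: log of the "1" probability over the "0" probability."""
    return math.log(p / (1 - p))

z = 2.0                       # placeholder log-odds value
p = sigmoid(z)                # probability of the "1" label, ~0.881
print(round(log_odds(p), 6))  # recovers 2.0
```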

Here is the sigmoid function with ML labels:

Sample calculation

Suppose we had a logistic regression model with three features that learned the following bias and weights:

  • b = 1
  • w1 = 2
  • w2 = -1
  • w3 = 5

Further, suppose the following feature values for a given example:

  • x1 = 0
  • x2 = 10
  • x3 = 2

Therefore, the log-odds:

z = b + w1x1 + w2x2 + w3x3

will be:

z = (1) + (2)(0) + (-1)(10) + (5)(2) = 1

Consequently, the logistic regression prediction for this particular example will be 0.731:

y′ = 1 / (1 + e^(-1)) ≈ 0.731

That is, a 73.1% probability.
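The same arithmetic can be verified with a short Python sketch using the bias, weights, and feature values listed above:

```python
import math

b = 1
w = [2, -1, 5]
x = [0, 10, 2]

z = b + sum(wi * xi for wi, xi in zip(w, x))  # 1 + 0 - 10 + 10 = 1
probability = 1 / (1 + math.exp(-z))          # sigmoid of the log-odds
print(z, round(probability, 3))               # 1 0.731
```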

Loss and Regularization

The loss function for linear regression is squared loss. The loss function for logistic regression is Log Loss, which is defined as follows:

LogLoss = Σ_{(x,y)∈D} -y·log(y′) - (1 - y)·log(1 - y′)

where:

  • (x,y)∈D is the data set containing many labeled examples, which are (x,y) pairs.
  • y is the label in a labeled example. Since this is logistic regression, every value of y must either be 0 or 1.
  • y′ is the predicted value (somewhere between 0 and 1), given the set of features in x.
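As a minimal sketch, Log Loss can be computed over a small batch of labeled examples as follows; the labels and predicted probabilities are placeholder values, not data from the post.

```python
import math

def log_loss(labels, predictions):
    """Sum of -y*log(y') - (1-y)*log(1-y') over all labeled examples."""
    return sum(-y * math.log(p) - (1 - y) * math.log(1 - p)
               for y, p in zip(labels, predictions))

labels = [1, 0, 1, 0]               # placeholder 0/1 labels
predictions = [0.9, 0.2, 0.7, 0.4]  # placeholder predicted probabilities
print(round(log_loss(labels, predictions), 3))  # 1.196
```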

Regularization in Logistic Regression

Regularization is extremely important in logistic regression modeling. Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions. Consequently, most logistic regression models use one of the following two strategies to dampen model complexity:

  • L2 regularization.
  • Early stopping, that is, limiting the number of training steps or the learning rate.

Imagine that you assign a unique id to each example, and map each id to its own feature. If you don’t specify a regularization function, the model will become completely overfit. That’s because the model would try to drive loss to zero on all examples and never get there, driving the weights for each indicator feature to +infinity or -infinity. This can happen in high dimensional data with feature crosses when there’s a huge mass of rare crosses that happen only on one example each.

Fortunately, using L2 or early stopping will prevent this problem.
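As a rough sketch of the first strategy, L2 regularization adds a penalty on the squared weights to the Log Loss objective, so driving a weight toward +infinity or -infinity is no longer free; the lambda value below is an assumed example, not a recommendation.

```python
import math

def l2_regularized_log_loss(labels, predictions, weights, lam=0.1):
    """Log Loss plus an L2 penalty that discourages extreme weights."""
    loss = sum(-y * math.log(p) - (1 - y) * math.log(1 - p)
               for y, p in zip(labels, predictions))
    penalty = lam * sum(w ** 2 for w in weights)
    return loss + penalty

# Placeholder values: the penalty grows quickly as weights become extreme.
print(l2_regularized_log_loss([1, 0], [0.9, 0.2], weights=[0.5, -1.0]))
print(l2_regularized_log_loss([1, 0], [0.9, 0.2], weights=[50.0, -100.0]))
```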

Portions of this page are modifications based on work created and shared by Google and used according to terms described in the Creative Commons Attribution 4.0 License.
