What is MLE (Maximum Likelihood Estimation)?

Maximum Likelihood Estimation (MLE) is a method of parameter inference: it finds the parameter values that maximize the probability of the observed data. The estimated parameters can then be used to help predict the outcome of future experiments.

A few relevant terms:

  • Parameter inference: the process of inferring the parameter(s) of a model from observed data, in probabilistic terms.
  • Conditional probability: the probability of one event occurring given that another event has occurred.
  • Probability density: the relationship between a sample value and the relative likelihood of drawing that value at random.

Conditional probabilities are values that we use all the time in data science. The idea behind conditional probability is that we often have information about past events that we can use to estimate the probability of future dependent events. These probabilities won’t be 100% accurate, but they give us a more educated and complete picture of the event in question. As the name suggests, these probabilities are conditional: they depend on the occurrence of a separate event.

The formula for calculating conditional probability is as follows:

P(B|A) = P(A and B)/P(A)

The formula can be read as, “the probability of B given A is equal to the probability of A and B over the probability of A.”
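To make the formula concrete, here is a minimal Python sketch that applies it to a small sample space. The two-dice setup and the events A and B are hypothetical choices made for illustration; they do not come from the text above.

```python
from fractions import Fraction

# Hypothetical sample space: the 36 equally likely outcomes of rolling two dice.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

def probability(event):
    """Probability of an event (a predicate over outcomes) when all outcomes are equally likely."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def event_a(o):
    """Event A (assumed for illustration): the first die shows an even number."""
    return o[0] % 2 == 0

def event_b(o):
    """Event B (assumed for illustration): the two dice sum to 8."""
    return o[0] + o[1] == 8

p_a = probability(event_a)
p_a_and_b = probability(lambda o: event_a(o) and event_b(o))

# P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a
print("P(A)       =", p_a)
print("P(A and B) =", p_a_and_b)
print("P(B|A)     =", p_b_given_a)
```

Exact fractions are used here only so that the division in the formula is easy to check by hand.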

You can imagine that if each sample value has a density, the collection of sample values and their densities forms a distribution. The density for a specific value of a random variable is calculated by a probability density function, or PDF for short.
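As a rough sketch of what a PDF computes, the Python snippet below evaluates the density of a normal distribution at a few points. The choice of the normal distribution, the helper name `normal_pdf`, and the parameter values are assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Evaluate a standard normal (mu=0, sigma=1) at a few sample values.
for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  density={normal_pdf(x, mu=0.0, sigma=1.0):.4f}")

# A density is not itself a probability: for a narrow distribution it can exceed 1.
print(normal_pdf(0.0, mu=0.0, sigma=0.1))
```

Values near the mean get a higher density than values in the tails, which is the "relationship between a sample and its likelihood" described above.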

Density estimation is the problem of estimating the probability distribution for a sample of observations from a problem domain. It involves selecting a probability density function (PDF) and parameters for that distribution that best explain the joint probability distribution of the observed data (X).

But how do we choose the proper PDF and parameters?

There are many techniques for solving this problem, but two of the most common approaches are:

  • Maximum a Posteriori (MAP), a Bayesian method.
  • Maximum Likelihood Estimation (MLE), a frequentist method.

The main difference between the two is that MLE assumes all solutions are equally likely beforehand, whereas MAP allows prior information about the form of the solution to be used.

We’re going to focus on MLE.

MLE (Maximum Likelihood Estimation)

To use MLE, we define a likelihood function for calculating the conditional probability of observing the data sample (X) given a probability distribution and distribution parameters. This approach can be used to search a space of possible distributions and parameters.

MLE treats the problem as an optimization or search problem, where we seek the set of parameters that results in the best fit for the joint probability of the data sample (X).

First, we define a parameter called theta that defines both the choice of the probability density function and the parameters of that distribution. It may be a vector of numerical values whose values change smoothly and map to different probability distributions and their parameters.

In Maximum Likelihood Estimation, we are trying to maximize the probability of observing our sample from the joint probability distribution given a specific probability distribution and its parameters. Written formally, we would say:

P(X ; theta)

This conditional probability is often stated using the semicolon (;) notation instead of the bar notation (|) because theta is not a random variable, but an unknown parameter. Writing out the n individual observations in X, the same quantity is:

P(x1, x2, x3, …, xn ; theta)

This resulting conditional probability is referred to as the likelihood of observing this set of data given the model parameters. We write the function for this relationship using the notation L() to denote the likelihood function:

L(X ; theta)

The objective of MLE is to find the set of parameters (theta) that maximizes the likelihood function, i.e. results in the largest likelihood value:

maximize L(X ; theta)

Assuming the observations are independent and identically distributed, the joint probability distribution can be rewritten as the product of the conditional probabilities of observing each example given the distribution parameters:

product from i = 1 to n of P(xi ; theta)
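The product form can be sketched in a few lines of Python. The example below assumes a Bernoulli (coin flip) model where theta is the probability of heads; the data, the helper names, and the grid of candidate values are all hypothetical and chosen only to illustrate the idea of searching for the theta with the largest likelihood.

```python
import math

# Hypothetical data: 10 coin flips, 1 = heads, 0 = tails (7 heads, 3 tails).
flips = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

def bernoulli_prob(x, theta):
    """P(x ; theta) for a single flip, where theta is the probability of heads."""
    return theta if x == 1 else 1.0 - theta

def likelihood(data, theta):
    """L(X ; theta): the product of P(xi ; theta) over all observations."""
    return math.prod(bernoulli_prob(x, theta) for x in data)

# Search a coarse grid of candidate parameters for the largest likelihood.
candidates = [i / 10 for i in range(1, 10)]
best_theta = max(candidates, key=lambda t: likelihood(flips, t))

for t in candidates:
    print(f"theta={t:.1f}  L={likelihood(flips, t):.6f}")
print("best theta on this grid:", best_theta)
```

On this grid the likelihood peaks at the observed fraction of heads, which matches the known closed-form maximum likelihood estimate for a Bernoulli parameter.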

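Multiplying many small probabilities together can be numerically unstable in practice, because the product quickly underflows toward zero. It is therefore common to restate the problem as the sum of the log conditional probabilities of observing each example given the model parameters. Because the logarithm is monotonically increasing, the same theta maximizes both expressions.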

sum from i = 1 to n of log(P(xi ; theta))
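The instability is easy to reproduce. The sketch below uses a made-up set of 2000 per-observation probabilities of 0.01 each (an assumption for illustration only) and compares the raw product with the sum of logs in standard double-precision floats.

```python
import math

# Hypothetical setup: 2000 observations, each with probability 0.01 under the model.
probs = [0.01] * 2000

# The true product is 1e-4000, far below the smallest positive float, so it underflows.
product = math.prod(probs)

# The sum of logs stays a perfectly ordinary, finite number.
log_sum = sum(math.log(p) for p in probs)

print("product of probabilities:", product)
print("sum of log probabilities:", log_sum)
```

Once the product has collapsed to zero, every candidate theta looks equally bad, so the optimizer has nothing to compare. The log-sum keeps the differences between candidates visible.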

In optimization it is conventional to minimize a cost function rather than maximize an objective. Maximizing the log-likelihood is equivalent to minimizing its negative, so the problem is usually restated as minimizing the Negative Log-Likelihood (NLL) function:

minimize -sum from i = 1 to n of log(P(xi ; theta))
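As a minimal sketch of NLL minimization, the Python example below fits a normal distribution to a small data set. The data, the choice of a normal model, the grid ranges, and the helper names are assumptions made for illustration. A crude grid search stands in for a real optimizer, and the result is compared with the known closed-form estimates for a normal distribution (the sample mean and the population standard deviation).

```python
import math
import statistics

# Hypothetical measurements, assumed to be i.i.d. draws from a normal distribution.
data = [4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 4.7, 5.2]

def normal_log_pdf(x, mu, sigma):
    """log P(x ; theta) for a normal distribution with theta = (mu, sigma)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def nll(data, mu, sigma):
    """Negative log-likelihood: -sum of log P(xi ; theta) over all observations."""
    return -sum(normal_log_pdf(x, mu, sigma) for x in data)

# Crude grid search over theta = (mu, sigma); a real application would use an optimizer.
mu_grid = [4.0 + 0.01 * i for i in range(201)]      # 4.00 to 6.00
sigma_grid = [0.05 + 0.01 * i for i in range(96)]   # 0.05 to 1.00
best_mu, best_sigma = min(
    ((m, s) for m in mu_grid for s in sigma_grid),
    key=lambda theta: nll(data, theta[0], theta[1]),
)

# Closed-form maximum likelihood estimates for a normal distribution.
mu_hat = statistics.fmean(data)
sigma_hat = statistics.pstdev(data)  # divides by n, not n - 1

print("grid search :", round(best_mu, 2), round(best_sigma, 2))
print("closed form :", round(mu_hat, 4), round(sigma_hat, 4))
```

The grid search can only land within one grid step of the true minimum, so the two answers agree approximately rather than exactly.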


  1. Maximum Likelihood Estimation is a probabilistic framework for solving the problem of density estimation.
  2. It involves maximizing a likelihood function in order to find the probability distribution and parameters that best explain the observed data.
  3. It provides a framework for predictive modeling in machine learning where finding model parameters can be framed as an optimization problem.
