Confidence Intervals for Cross Validation Error Estimates

We use cross validation to estimate the extra-sample (out-of-sample) accuracy (or “error estimate”) for a given model. This error metric gives us a good estimate of how well the model will likely perform on data it has not yet seen (i.e., the greater population of data).

The cross-validation error estimate (the average of the error estimates across all folds) is a great way to compare different models as well.

Adding a confidence interval to the cross validation error estimate gives us a better idea of how well the model will perform over time as it encounters the population of data in the wild.

What Are Confidence Intervals?

A confidence interval gives us a range of values, bounded above and below a statistic’s mean, that likely contains the unknown population parameter we’re interested in measuring. The associated confidence level is the percentage of probability (“certainty”) that this range would capture the true population parameter if we were to draw random samples from the population many times.

“A confidence interval is a range of values, bounded above and below the statistic’s mean, that likely would contain an unknown population parameter. Confidence level refers to the percentage of probability, or certainty, that the confidence interval would contain the true population parameter when you draw a random sample many times.”

How Are Confidence Intervals Useful in Machine Learning?

A confidence interval for cross validation error estimates gives us a good sense of the range of performance we can expect from a machine learning model.

We may sometimes want to give a measure of the uncertainty around the error estimate. This can help us understand the range of potential outcomes, which may impact how a line of business would use a given model. In these situations we want to compute a “confidence interval” for the cross-validation error estimate.

If “confidence level” refers to the “percentage of probability” of certainty, then with a 95% confidence level we can assume that 95% of the time our accuracy should fall between the lower and upper bounds of the estimate.

Based on statistics we can then calculate how often (e.g., “number of days per year”) a model will perform outside of the confidence interval. This is useful when expressing to non-data science folks how often to expect the model to hold within the confidence interval range. The sidebar below explains how to calculate this “number of days” metric.

Probabilities of Occurrence of Rare Events

  std. deviations    Probability (%)    Events/year    Events/five years
  -1.0                         15.87             40                  200
  -2.0                          2.28              6                   29
  -2.5                          0.62              2                    8
  -3.0                          0.13              0                    2

Table Source: Advanced Portfolio Management: A Quant's Guide for Fundamental Investors 1st Edition

The table above represents a single-tail metric (the "bad tail", as opposed to the tail on the other side of the mean) for how often a "rare event" occurs. It's another way to think about how often our model error would fall outside its confidence interval, in terms of "days per year".
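If you want to sanity-check the table, a short sketch like the following reproduces the single-tail probabilities with scipy (it assumes roughly 252 trading days per year, since the book's exact day count isn't stated here):

    from scipy.stats import norm

    TRADING_DAYS_PER_YEAR = 252  # assumption; the book's exact figure isn't given here

    for sd in (-1.0, -2.0, -2.5, -3.0):
        p = norm.cdf(sd)                      # single-tail probability below `sd` std. deviations
        per_year = p * TRADING_DAYS_PER_YEAR  # expected "bad tail" events per year
        print(f"{sd:5.1f}  {p * 100:6.2f}%  ~{per_year:.0f}/year  ~{per_year * 5:.0f}/five years")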

A two-sided 95% confidence interval corresponds to roughly ±1.96 std. deviations from the mean, so the lower bound sits at about -1.96 std. deviations.

So model performance should fall outside of the 95% confidence interval about 12 days per year (240 working days * 0.95 = ~228 days inside the interval, and 240 - 228 = 12 days outside).

This means that half of those days outside of the confidence interval should be below the lower bound, so roughly 6 days per year we'd fall under the lower end of the confidence interval.
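The same arithmetic as a quick sketch (using the roughly 240 working days per year assumed in the text above):

    # Days per year we expect model performance to land outside a 95% confidence interval.
    working_days = 240            # assumption carried over from the text above
    confidence_level = 0.95

    days_outside = working_days * (1 - confidence_level)  # 240 * 0.05 = 12 days
    days_below_lower_bound = days_outside / 2             # half in the "bad" tail = 6 days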

Methods to Build Confidence Intervals

Below we list 4 ways that we can calculate a confidence interval for a model’s performance.

In the following sections, we give explanations and live notebook code for 3 of the methods.

Note

For more reading on Confidence Intervals and various methods to generate them, check out Sebastian Raschka’s paper:

Using Standard Error to Calculate the Cross Validation Error Estimate Confidence Interval

With K-fold cross-validation it is often useful to understand the quantitative notion of variability for the cross-validation error (evaluation metric) estimate. One method to build this confidence interval is to use standard error.

This method is based on:

We compute the confidence intervals with the following formula:

\[\begin{split}\bar{e} - z_{1-\alpha/2}\cdot\hat{SE}, \\ \bar{e} + z_{1-\alpha/2}\cdot\hat{SE} \\\end{split}\]

where:

  • $\bar{e}$ is the cross validation error estimate (i.e., the average of the cross validation folds’ errors)

  • $z_{1-\alpha/2}$ is the z-score

  • $\hat{SE}$ is the standard error of the cross validation error estimate

Calculating the Cross Validation Error Estimate

The cross validation error estimate is computed by averaging the error scores from each of the K held-out cross validation folds. This is the normal error estimate produced by any cross validation model building process.
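As a minimal sketch of this step (the scikit-learn model and dataset here are placeholder assumptions, not part of the original text), the per-fold errors and their average can be computed like this:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    model = make_pipeline(StandardScaler(), LogisticRegression())

    # Accuracy on each of the K = 10 held-out folds.
    fold_scores = cross_val_score(model, X, y, cv=10)
    fold_errors = 1.0 - fold_scores          # convert per-fold accuracy to per-fold error

    cv_error_estimate = fold_errors.mean()   # the usual cross validation error estimate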

Calculating the Z-Score

The z-score (also called the “standard normal deviate”, or the “normal score”) is calculated based on the confidence level we want. For a 95% confidence interval, the z-score is the approximate value of the 97.5 percentile point of the standard normal distribution. This 97.5 percentile point gives us the approximate value of 1.96 standard deviations.

[Figure: The Normal Distribution, via Wikipedia]

This 1.96 standard deviations value is what we substitute back into the original confidence interval equation for z when calculating a 95% confidence interval.

Note

It’s worth mentioning that 95% of the values for our distribution fall inside the range of (-1.96 standard deviations, +1.96 standard deviations).

To understand how the z term gives us the value 1.96 standard deviations (via the 97.5 percentile point, as seen in the “cumulative %” band in the graph above) for the 95% confidence interval, let’s quickly work through calculating the term $z_{1-\alpha/2}$ by resolving its subscript value (aka the “percentile point”, or “cumulative %”):

\[ = 1 - \alpha / 2\]

And to calculate alpha we use a 0.95 confidence level:

\[\begin{split}\alpha = 1 - 0.95 \\ \alpha = 0.05\end{split}\]

Substituting $\alpha$ back into the percentile point equation, this gives us:

\[\begin{split} = 1 - 0.05 / 2 \\ = 1 - 0.025 \\ = 0.975\end{split}\]

From the 0.975 (the 97.5% percentile point) value we can look up our z-score on the graph above, giving us 1.96 as the z-score to substitute back into the confidence interval equation.
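Rather than reading the value off a table or graph, the same lookup can be done programmatically; a quick sketch with scipy:

    from scipy.stats import norm

    confidence_level = 0.95
    alpha = 1 - confidence_level        # 0.05
    z = norm.ppf(1 - alpha / 2)         # 97.5th percentile of the standard normal
    print(round(z, 2))                  # 1.96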

Calculating the Standard Error of the Cross Validation Error Mean

We can compute the standard error of the cross validation error estimate by dividing the sample standard deviation by the square root of the number of observations (K, for K folds).

\[\hat{SE} = \frac{sd(cverr_1, cverr_2,...,cverr_K)}{\sqrt{K}}\]

where $sd$ denotes the sample standard deviation and $cverr_n$ are the errors from each of the cross validation folds.
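Continuing the earlier sketch (reusing the hypothetical fold_errors array), the standard error is a one-liner:

    import numpy as np

    K = len(fold_errors)                            # number of cross validation folds
    se = np.std(fold_errors, ddof=1) / np.sqrt(K)   # sample std. deviation / sqrt(K)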

Standard error is defined as:

Standard Error of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution or an estimate of that standard deviation. If the statistic is the sample mean, it is called the standard error of the mean (SEM)

(source: wikipedia.org)

Standard error measures how accurately a data sample represents the larger population of data. It is the approximate standard deviation of a sample statistic (such as the sample mean). Put another way, the standard error measures how far a sample mean is likely to deviate from the actual population mean.

Note

In the context of machine learning, we’re taking the “sample standard deviation” here of the sample represented by the training records; we don’t have the full population of data to train on, so our training records, from a statistical viewpoint, are considered a “sample”.

With the standard error of the sample mean (here, the mean of the fold errors), we can then calculate the 95% confidence intervals for our cross validation error estimate (e.g., “95% confidence interval for the population mean”, or to put it another way: “confidence intervals for the error estimate of the full population of data our model might encounter in the wild”).

This is useful for helping us understand whether the difference between two models is statistically significant.

Note

To read more on calculating standard error for cross validation, take a look at:

The paper “Cross-validation: What Does It Estimate and How Well Does It Do It?” by Stephen Bates, Trevor Hastie, and Robert Tibshirani (2021), available at https://arxiv.org/abs/2104.00673

The CMU course lecture notes 1, notes 2 on Data Mining: 36-462/36-662 with Dr. Ryan Tibshirani.

Additionally, take a look at Section 7.10 of the book The Elements of Statistical Learning.

Calculating the Confidence Intervals for the Cross Validation Error Estimate


If confidence level refers to “percentage of probability” of certainty, then for a 95% confidence interval we can assume that 95% of the time our accuracy should be between the lower and upper bound of the estimate.

Returning to our confidence interval equations:

\[\begin{split}\bar{e} - z_{1-\alpha/2}\cdot\hat{SE}, \\ \bar{e} + z_{1-\alpha/2}\cdot\hat{SE} \\\end{split}\]

Substituting 1.96 for the z-score for the 95% confidence interval gives us:

\[\begin{split}\bar{e} - 1.96\cdot\hat{SE}, \\ \bar{e} + 1.96\cdot\hat{SE} \\\end{split}\]

This gives us the lower and upper confidence bounds around the cross validation error estimate.

We can see this in action in the live notebook: [ link ]
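If you don't have the notebook handy, a minimal end-to-end sketch of this method (again reusing the hypothetical fold_errors array from the earlier sketches) looks like:

    import numpy as np
    from scipy.stats import norm

    z = norm.ppf(0.975)                                           # ~1.96 for a 95% CI
    se = np.std(fold_errors, ddof=1) / np.sqrt(len(fold_errors))

    cv_error_estimate = fold_errors.mean()
    ci_low = cv_error_estimate - z * se
    ci_high = cv_error_estimate + z * se
    print(f"95% CI for the CV error estimate: [{ci_low:.4f}, {ci_high:.4f}]")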

Using Bootstrap to Calculate Error Estimate Confidence Intervals

Another way to calculate confidence intervals for our model’s error estimate is to use the bootstrap.

Note

For more information on what the bootstrap is, check out the following resources:

We can see this in action in the live notebook: [ link ]
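As a stand-in for the notebook, here is a minimal sketch of one common variant, a percentile bootstrap over the out-of-fold predictions (it reuses the placeholder model, X, and y from the earlier sketches; the resample count of 2,000 is an arbitrary choice):

    import numpy as np
    from sklearn.model_selection import cross_val_predict

    # Out-of-fold predictions for every record, so each record is scored by a model
    # that never saw it during training.
    oof_pred = cross_val_predict(model, X, y, cv=10)
    per_record_error = (oof_pred != y).astype(float)

    rng = np.random.default_rng(42)
    n_boot = 2000
    boot_errors = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))    # resample records with replacement
        boot_errors[i] = per_record_error[idx].mean()

    # The 2.5th and 97.5th percentiles of the bootstrap distribution give a 95% interval.
    ci_low, ci_high = np.percentile(boot_errors, [2.5, 97.5])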

Using the Student’s t-Distribution to Calculate Cross Validation Error Estimate Confidence Intervals

Another way to compute the 95% Confidence Intervals for 10-Fold Cross Validation is with the Student’s t-Table.

When we estimate the standard error from the sample itself, the standardized sample mean follows a Student’s t-distribution rather than a normal distribution. T-distributions are slightly different from the Gaussian (they have heavier tails), and vary depending on the size of the sample.

Calculating a 95% confidence interval for a population mean

(Step 1) determine if you need to use:

  • the normal distribution

  • the student’s t-distribution

(Step 2) if we only have the sample standard deviation (not the population standard deviation)

  • this is an indication to use the student’s t distribution

(Step 3) sample size as another indicator

  • when n >= 30 then we use the normal distribution

  • when n < 30 we use the student t distribution

(Step 4) Compute confidence intervals for the population mean

The equation to compute the low and high confidence intervals is:

\[CIs = \bar{e} \pm EBM\]

where EBM stands for error bound for a population mean:

\[EBM = t_{v, \alpha/2} \cdot \frac{s}{\sqrt{n}}\]

where:

  • t is the t-score

  • v is the degrees of freedom (df), where df = n - 1

  • $\alpha$ (“significance level”) is the probability that the interval does not contain the unknown population parameter

  • s is the sample standard deviation

  • n is the number of samples

The error bound (EBM) depends on the confidence level (abbreviated CL).

The confidence level is often considered the probability that the calculated confidence interval estimate will contain the true population parameter. Also, the $\alpha$ variable is the probability that the interval does not contain the unknown population parameter:

$cl = 1 - \alpha$

Note

The most commonly used significance level is $\alpha$ = 0.05.

From the section above, we recall that to calculate alpha we use a confidence level. For a 95% confidence level we’d use 0.95:

$\alpha = 1 - cl$

$\alpha = 1 - 0.95$

$\alpha = 0.05$

For a two-sided test, we compute:

$1 - \frac{\alpha}{2}$

or

$1 - \frac{0.05}{2} = 0.975$

And so we use 0.975 as our lookup column in the Student’s t-table, as we’ll see below.
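As with the z-score, this lookup can also be done in code; a quick sketch with scipy (assuming 10-fold cross validation, so 9 degrees of freedom):

    from scipy.stats import t

    k = 10                                    # number of cross validation folds
    alpha = 0.05
    t_score = t.ppf(1 - alpha / 2, df=k - 1)  # 0.975 column, 9 degrees of freedom
    print(round(t_score, 3))                  # 2.262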

If we substitute EBM back into the original confidence intervals equation we get:

\[CIs = \bar{e} \pm t_{v, \alpha/2} \cdot \frac{s}{\sqrt{n}}\]

To calculate the value of t, we need to first calculate the value of v and $\alpha$ where $\alpha$ is the significance level.

The v parameter is the “degrees of freedom”, and is calculated as number of samples minus 1:

\[v = n - 1\]

Now we can look up t in the Student’s t-distribution table. The table gives t-scores that correspond to the cumulative probability (column) and degrees of freedom (row). From above, we have degrees of freedom (v = 9) and lookup column 0.975:

  ν         0.90    0.95   0.975    0.99   0.995   0.999

  1.       3.078   6.314  12.706  31.821  63.657 318.313
  2.       1.886   2.920   4.303   6.965   9.925  22.327
  3.       1.638   2.353   3.182   4.541   5.841  10.215
  4.       1.533   2.132   2.776   3.747   4.604   7.173
  5.       1.476   2.015   2.571   3.365   4.032   5.893
  6.       1.440   1.943   2.447   3.143   3.707   5.208
  7.       1.415   1.895   2.365   2.998   3.499   4.782
  8.       1.397   1.860   2.306   2.896   3.355   4.499
  9.       1.383   1.833   2.262   2.821   3.250   4.296
 10.       1.372   1.812   2.228   2.764   3.169   4.143

In the table above we find the value 2.262 for these parameters. Let’s now substitute this value back into the original confidence intervals equation:

\[CIs = \bar{e} \pm 2.262 \cdot \frac{s}{\sqrt{n}}\]

Let’s also change n to k, because we have k-folds:

\[CIs = \bar{e} \pm 2.262 \cdot \frac{s}{\sqrt{k}}\]
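As code, a minimal sketch of this final equation (reusing the hypothetical fold_errors array from the standard error sketches above):

    import numpy as np
    from scipy.stats import t

    k = len(fold_errors)                       # number of folds
    t_score = t.ppf(0.975, df=k - 1)           # 2.262 for k = 10

    ebm = t_score * np.std(fold_errors, ddof=1) / np.sqrt(k)
    ci_low = fold_errors.mean() - ebm
    ci_high = fold_errors.mean() + ebm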

Note

It’s worth mentioning how similar this equation ends up looking to the “Standard Error” version of calculating confidence intervals. The difference is that the student-t version factors in a wider range with its larger 2.262 value (where 2.262 is larger than the 1.96 z-score from above).

In the example notebook below, you can see the 95% confidence intervals calculated with the student-t method.