
Table Source: Advanced Portfolio Management: A Quant's Guide for Fundamental Investors 1st Edition

The above chart represents a single tail metric (the "bad tail", as opposed to the tail on the other side of the mean) for how often a "rare event" occurs. It's another way to think about how often our model error would be outside its confidence interval in terms of "days per year".

A 95% confidence interval corresponds to roughly ±1.96 standard deviations around the mean.

So model performance should fall outside of the 95% confidence interval on about 12 days per year (240 days × 0.95 = 228 days inside the interval, and 240 − 228 = 12 days outside it).

Because the interval is symmetric around the mean, this means that half of those days should fall below the lower bound, so on about 6 days per year we'd fall under the lower end of the confidence interval.
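The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `days_outside_interval` is made up for illustration, and the 240-day year and the normality assumption are taken from the text above.

```python
from statistics import NormalDist


def days_outside_interval(days_per_year=240, confidence=0.95):
    """Return (days outside the interval, days below the lower bound).

    Hypothetical helper: assumes errors are symmetric around the mean,
    so the two tails split the "outside" days evenly.
    """
    outside = days_per_year * (1.0 - confidence)
    return outside, outside / 2.0


# z-score that leaves 2.5% in each tail of a standard normal distribution
z = NormalDist().inv_cdf(0.975)

both_tails, lower_tail = days_outside_interval()

print(f"z-score for a 95% interval: {z:.2f}")
print(f"days outside the interval:  {both_tails:.0f}")
print(f"days below the lower bound: {lower_tail:.0f}")
```

Changing `confidence` to 0.99, for example, shows how quickly the expected number of "bad tail" days per year shrinks.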

(using-bootstrap-calc-err-estimate-ci)=
## Using Bootstrap to Calculate Error Estimate Confidence Intervals

Another way to calculate the error estimate is to use the bootstrap to compute our model's error estimate confidence intervals.

```{note}
For more information on what the bootstrap is, check out the following resources:

* [https://www.stat.cmu.edu/~ryantibs/advmethods/notes/bootstrap.pdf](https://www.stat.cmu.edu/~ryantibs/advmethods/notes/bootstrap.pdf)
* [https://www.stat.cmu.edu/~ryantibs/advmethods/shalizi/ch06.pdf](https://www.stat.cmu.edu/~ryantibs/advmethods/shalizi/ch06.pdf)
* The book "Regression Modeling Strategies" by Harrell
* The book ["An Introduction to the Bootstrap"](http://www.ru.ac.bd/stat/wp-content/uploads/sites/25/2019/03/501_02_Efron_Introduction-to-the-Bootstrap.pdf) by Efron and Tibshirani
```

We can see this in action in the live notebook: [ link ]
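Independently of the notebook, the idea can be illustrated with a minimal sketch of a percentile bootstrap for the mean error. Everything here is an assumption made for illustration: the function name `bootstrap_ci`, the number of resamples, and the `sample_errors` values are invented, and the percentile method is only one of several bootstrap interval constructions.

```python
import random
import statistics


def bootstrap_ci(errors, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `errors`.

    Hypothetical helper: resamples `errors` with replacement, records the
    mean of each resample, and reads the interval off the sorted means.
    """
    rng = random.Random(seed)
    n = len(errors)
    means = sorted(
        statistics.fmean(rng.choices(errors, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Made-up error values, for illustration only
sample_errors = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.13, 0.14]

low, high = bootstrap_ci(sample_errors)
print(f"mean error: {statistics.fmean(sample_errors):.4f}")
print(f"95% bootstrap CI: [{low:.4f}, {high:.4f}]")
```

The fixed `seed` only makes the sketch reproducible; in practice the number of resamples is usually increased until the interval endpoints stop moving noticeably.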

(using-student-t-calc-err-estimate-ci)=
## Using the Student's t-Distribution to Calculate Cross Validation Error Estimate Confidence Intervals

Another way to compute the 95% confidence intervals for 10-fold cross validation is with the Student's t-table.

The standard error estimates the standard deviation of the sample mean. When it is computed from the sample standard deviation, the standardized mean follows a Student's t-distribution rather than a Gaussian. T-distributions have heavier tails than the Gaussian, and their shape varies with the size of the sample.

### Calculating a 95% confidence interval for a population mean

(Step 1) Determine if you need to use:

* the normal distribution
* the Student's t-distribution

(Step 2) If we only have the sample standard deviation:

* this is an indication to use the Student's t-distribution

(Step 3) Use the sample size as another indicator:

* when n >= 30 we use the normal distribution
* when n < 30 we use the Student's t-distribution

(Step 4) Compute confidence intervals for the population mean

The equation to compute the low and high confidence intervals is:

```{math}
CIs = \bar{e} \pm EBM
```

where EBM stands for error bound for a population mean:

```{math}
EBM = t_{v, \alpha/2} \cdot \frac{s}{\sqrt{n}}
```

where:

* t is the t-score
* v is the degrees of freedom (df), v = n - 1
* $\alpha$ ("significance level") is the probability that the interval does not contain the unknown population parameter
* s is the sample standard deviation
* n is the number of samples

The error bound (EBM) depends on the confidence level (abbreviated CL). The confidence level is often considered the probability that the calculated confidence interval estimate will contain the true population parameter. The two quantities are related by:

$cl = 1 - \alpha$

```{note}
The most commonly used significance level is $\alpha$ = 0.05.

From the section above, we recall that to calculate alpha we use a confidence level. For a 95% confidence level we'd use 0.95:

$\alpha = 1 - cl$

$\alpha = 1 - 0.95$

$\alpha = 0.05$

For a two-sided interval, we compute:

$1 - \frac{\alpha}{2}$ or $1 - \frac{0.05}{2} = 0.975$

And so we use 0.975 as our lookup column in the Student's t-table, as we'll see below.
```

If we substitute EBM back into the original confidence intervals equation we get:

```{math}
CIs = \bar{e} \pm t_{v, \alpha/2} \cdot \frac{s}{\sqrt{n}}
```

To calculate the value of t, we first need the values of v and $\alpha$, where $\alpha$ is the significance level.

The v parameter is the "degrees of freedom", and is calculated as the number of samples minus 1:

```{math}
v = n - 1
```

Now we can look up t in the Student's t-distribution table. The table gives t-scores indexed by cumulative probability (column) and degrees of freedom (row).

With 10-fold cross validation we have n = 10 samples, so the degrees of freedom are v = 9, and from above the lookup column is 0.975:

| ν | 0.90 | 0.95 | 0.975 | 0.99 | 0.995 | 0.999 |
|----|-------|-------|--------|--------|--------|---------|
| 1 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657 | 318.313 |
| 2 | 1.886 | 2.920 | 4.303 | 6.965 | 9.925 | 22.327 |
| 3 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 | 10.215 |
| 4 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 | 7.173 |
| 5 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 | 5.893 |
| 6 | 1.440 | 1.943 | 2.447 | 3.143 | 3.707 | 5.208 |
| 7 | 1.415 | 1.895 | 2.365 | 2.998 | 3.499 | 4.782 |
| 8 | 1.397 | 1.860 | 2.306 | 2.896 | 3.355 | 4.499 |
| 9 | 1.383 | 1.833 | 2.262 | 2.821 | 3.250 | 4.296 |
| 10 | 1.372 | 1.812 | 2.228 | 2.764 | 3.169 | 4.143 |

In the table above we find the value 2.262 for these parameters (row ν = 9, column 0.975).

Let's now substitute this value back into the original confidence intervals equation:

```{math}
CIs = \bar{e} \pm 2.262 \cdot \frac{s}{\sqrt{n}}
```

Let's also change n to k, because we have k folds:

```{math}
CIs = \bar{e} \pm 2.262 \cdot \frac{s}{\sqrt{k}}
```

```{note}
It's worth mentioning how similar this equation ends up looking to the "Standard Error" version of calculating confidence intervals. The difference is that the Student-t version produces a wider interval because of its larger 2.262 multiplier (2.262 is larger than the 1.96 from above).
```

In the example notebook below, you can see the 95% confidence intervals calculated with the Student-t method.
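As a rough, self-contained illustration of the formula (separate from the notebook), the sketch below computes the Student-t interval for a set of per-fold errors. The function name `t_confidence_interval` and the `fold_errors` values are hypothetical; the critical values are copied from the 0.975 column of the table in this section. If SciPy is available, `scipy.stats.t.ppf(0.975, df=9)` should give the same critical value without a lookup table.

```python
import math
import statistics

# 0.975 column of the t-table in this section, keyed by degrees of freedom
T_975 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}


def t_confidence_interval(errors):
    """95% Student-t confidence interval for the mean of k fold errors.

    Hypothetical helper: only supports 2 to 11 folds, because the lookup
    table above stops at 10 degrees of freedom.
    """
    k = len(errors)
    v = k - 1                          # degrees of freedom
    t_score = T_975[v]
    mean_err = statistics.fmean(errors)
    s = statistics.stdev(errors)       # sample standard deviation (n - 1)
    ebm = t_score * s / math.sqrt(k)   # error bound for the mean
    return mean_err - ebm, mean_err + ebm


# Made-up errors from 10 folds, for illustration only
fold_errors = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.13, 0.14]

low, high = t_confidence_interval(fold_errors)
print(f"mean error: {statistics.fmean(fold_errors):.4f}")
print(f"95% t-interval: [{low:.4f}, {high:.4f}]")
```

Replacing the table lookup with 1.96 gives the narrower normal-approximation interval, which makes the effect of the small sample size easy to see.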