**Description:** In this lecture, the professor discussed classical statistics, maximum likelihood (ML) estimation, and confidence intervals.

**Instructor:** John Tsitsiklis

Lecture 23: Classical Stati...

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality, educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So for the last three lectures we're going to talk about classical statistics, the way statistics can be done if you don't want to assume a prior distribution on the unknown parameters.

Today we're going to focus, mostly, on the estimation side and leave hypothesis testing for the next two lectures. There is one generic method that one can use to carry out parameter estimation, and that's the maximum likelihood method. We're going to define what it is.

Then we will look at the most common estimation problem there is, which is to estimate the mean of a given distribution. And we're going to talk about confidence intervals, which refers to providing an interval around your estimates, which has some properties of the kind that the parameter is highly likely to be inside that interval, but we will be careful about how to interpret that particular statement.

Ok. So the big framework first. The picture is almost the same as the one that we had in the case of Bayesian statistics. We have some unknown parameter. And we have a measuring device. There is some noise, some randomness.

And we get an observation, X, whose distribution depends on the value of the parameter. However, the big change from the Bayesian setting is that here, this parameter is just a number. It's not modeled as a random variable. It does not have a probability distribution. There's nothing random about it. It's a constant. It just happens that we don't know what that constant is.

And in particular, this probability distribution here, the distribution of X, depends on Theta. But this is not a conditional distribution in the usual sense of the word.

Conditional distributions were defined when we had two random variables and we conditioned one random variable on the other. And we used the bar to separate the X from the Theta. To make the point that this is not a conditional distribution, we use a different notation. We put a semicolon here.

And what this is meant to say is that X has a distribution. That distribution has a certain parameter. And we don't know what that parameter is.

So for example, this might be a normal distribution, with variance 1 but a mean Theta. We don't know what Theta is. And we want to estimate it. Now once we have this setting, then your job is to design this box, the estimator.

The estimator is some data processing box that takes the measurements and produces an estimate of the unknown parameter. Now the notation that's used here is as if X and Theta were one-dimensional quantities.

But actually, everything we say remains valid if you interpret X and Theta as vectors. So for example, you may obtain several measurements, X1 up to Xn. And there may be several unknown parameters in the background.

Once more, we do not have, and we do not want to assume, a prior distribution on Theta. It's a constant. And if you want to think mathematically about this situation, it's as if you have many different probabilistic models.

So a normal with this mean or a normal with that mean or a normal with that mean, these are alternative candidate probabilistic models. And we want to try to make a decision about which one is the correct model.

In some cases, we have to choose just between a small number of models. For example, you have a coin with an unknown bias. The bias is either 1/2 or 3/4. You're going to flip the coin a few times.

And you try to decide whether the true bias is this one or is that one. So in this case, we have two specific, alternative probabilistic models from which we want to distinguish.

But sometimes things are a little more complicated. For example, you have a coin. And you have one hypothesis that my coin is unbiased. And the other hypothesis is that my coin is biased. And you do your experiments. And you want to come up with a decision that decides whether this is true or this one is true.

In this case, we're not dealing with just two alternative probabilistic models. This one is a specific model for the coin. But this one actually corresponds to lots of possible, alternative coin models.

So this includes the model where Theta is 0.6, the model where Theta is 0.7, Theta is 0.8, and so on. So we're trying to discriminate between one model and lots of alternative models.

How does one go about this? Well, there's some systematic ways that one can approach problems of this kind. And we will start talking about these next time.

So today, we're going to focus on estimation problems. In estimation problems, theta is a quantity, which is a real number, a continuous parameter. We're to design this box, so what we get out of this box is an estimate.

Now notice that this estimate here is a random variable. Even though theta is deterministic, this is random, because it's a function of the data that we observe. The data are random. We're applying a function to the data to construct our estimate.

So, since it's a function of random variables, it's a random variable itself. The distribution of Theta hat depends on the distribution of X. The distribution of X is affected by Theta. So in the end, the distribution of your estimate Theta hat will also be affected by whatever Theta happens to be.

Our general objective, when designing estimators, is that we want to get, in the end, an error, an estimation error, which is not too large. But we'll have to make that specific. Again, what exactly do we mean by that?

So how do we go about this problem? One general approach is to pick a Theta under which the data that we observe, that is, the X's, are most likely to have occurred.

So I observe X. For any given Theta, I can calculate this quantity, which tells me, under this particular Theta, the X that you observed had this probability of occurring. Under that Theta, the X that you observe had that probability of occurring. You just choose that Theta, which makes the data that you observed most likely.
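As a minimal sketch of this recipe, consider the coin from earlier whose bias is either 1/2 or 3/4. The function names (`binomial_likelihood`, `ml_estimate`) and the flip counts are illustrative assumptions, not something taken from the lecture. For each candidate Theta we compute the probability of the observed number of heads, and we keep the candidate under which the data are most likely.

```python
from math import comb

def binomial_likelihood(theta, heads, flips):
    """p_X(x; theta): probability of seeing `heads` heads in `flips` flips."""
    return comb(flips, heads) * theta ** heads * (1 - theta) ** (flips - heads)

def ml_estimate(candidates, heads, flips):
    """Return the candidate theta that makes the observed data most likely."""
    return max(candidates, key=lambda t: binomial_likelihood(t, heads, flips))

candidates = [0.5, 0.75]  # the two alternative models for the coin
print(ml_estimate(candidates, heads=7, flips=10))
print(ml_estimate(candidates, heads=5, flips=10))
```

With a continuous parameter the same idea applies, except that the maximization runs over all real values of Theta instead of a short list.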

It's interesting to compare this maximum likelihood estimate with the estimates that you would have, if you were in a Bayesian setting, and you were using maximum a posteriori probability estimation.

In the Bayesian setting, what we do is, given the data, we use the prior distribution on Theta. And we calculate the posterior distribution of Theta given X. Notice that this is sort of the opposite from what we have here.

This is the probability of X for a particular value of Theta, whereas this is the probability of Theta for a particular X. So it's the opposite type of conditioning. In the Bayesian setting, Theta is a random variable. So we can talk about the probability distribution of Theta.

So how do these two compare, except for this syntactic difference that the order of the X's and Theta's is reversed? Let's write down, in full detail, what this posterior distribution of Theta is. By the Bayes rule, this conditional distribution is obtained from the prior, and the model of the measurement process that we have. And we get to this expression.

So in Bayesian estimation, we want to find the most likely value of Theta. And we need to maximize this quantity over all possible Theta's.

First thing to notice is that the denominator is a constant. It does not involve Theta. So when you maximize this quantity, you don't care about the denominator. You just want to maximize the numerator.

Now, here, things start to look a little more similar. And they would be exactly of the same kind, if that term here was absent, if the prior was absent. The two are going to become the same if that prior was just a constant.

So if that prior is a constant, then maximum likelihood estimation takes exactly the same form as Bayesian maximum a posteriori probability estimation. So you can give this particular interpretation of maximum likelihood estimation.

Maximum likelihood estimation is essentially what you have done, if you were in a Bayesian world, and you had assumed a prior on the Theta's that's uniform, all the Theta's being equally likely.
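In symbols, the comparison can be sketched as follows for the discrete case. The notation is an assumption, since the transcript does not reproduce the slide: the bar marks a conditional PMF in the Bayesian setting, and the semicolon marks a PMF that merely depends on the parameter.

```latex
\begin{align*}
\hat\theta_{\mathrm{MAP}}
  &= \arg\max_{\theta}\, p_{\Theta\mid X}(\theta \mid x)
   = \arg\max_{\theta}\, \frac{p_{X\mid\Theta}(x \mid \theta)\, p_{\Theta}(\theta)}{p_X(x)}
   = \arg\max_{\theta}\, p_{X\mid\Theta}(x \mid \theta)\, p_{\Theta}(\theta), \\
\hat\theta_{\mathrm{ML}}
  &= \arg\max_{\theta}\, p_X(x;\theta).
\end{align*}
```

When the prior p_Theta(theta) is the same constant for every theta, it drops out of the first maximization, and the two expressions have the same maximizer.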

Okay. So let's look at a simple example. Suppose that the Xi's are independent, identically distributed exponential random variables, with a certain parameter Theta. So the distribution of each one of the Xi's is this particular term.

So Theta is one-dimensional. It's a one-dimensional parameter. But we have several data. We write down the formula for the probability of a particular X vector, given a particular value of Theta. But again, when I use the word, given, here it's not in the conditioning sense. It's the value of the density for a particular choice of Theta.

Here, I wrote down, I defined maximum likelihood estimation in terms of PMFs. That's what you would do if the X's were discrete random variables.

Here, the X's are continuous random variables, so I'm using the PDF instead of the PMF. So this definition, here, generalizes to the case of continuous random variables. And you use f's instead of p's, our usual recipe.

So the maximum likelihood estimate is defined. Now, since the Xi's are independent, the joint density of all the X's together is the product of the individual densities. So you look at this quantity. This is the density or sort of probability of observing a particular sequence of X's.

And we ask the question, what's the value of Theta that makes the X's that we observe most likely? So we want to carry out this maximization. Now this maximization is just a calculational problem.

We're going to do this maximization by taking the logarithm of this expression. Maximizing an expression is the same as maximizing the logarithm. So the logarithm of this expression, the logarithm of a product is the sum of the logarithms. You get contributions from this Theta term. There's n of these, so we get an n log Theta.

And then we have the sum of the logarithms of these terms. It gives us minus Theta. And then the sum of the X's. So we need to maximize this expression with respect to Theta.

The way to do this maximization is you take the derivative with respect to Theta and set it to zero. And you get n over Theta equal to the sum of the X's. And then you solve for Theta. And you find that the maximum likelihood estimate is this quantity.

Which sort of makes sense, because this is the reciprocal of the sample mean of X's. Theta, in an exponential distribution, we know that it's 1 over (the mean of the exponential distribution). So it looks like a reasonable estimate.
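Written out, the calculation just described goes as follows. This is a reconstruction from the spoken steps, with the exponential density taken to be theta times e to the minus theta x for x at least 0; the exact notation on the slide is an assumption.

```latex
\begin{align*}
f_X(x;\theta) &= \prod_{i=1}^{n} \theta e^{-\theta x_i}, \\
\log f_X(x;\theta) &= n\log\theta - \theta\sum_{i=1}^{n} x_i, \\
\frac{d}{d\theta}\log f_X(x;\theta) &= \frac{n}{\theta} - \sum_{i=1}^{n} x_i = 0
\quad\Longrightarrow\quad
\hat\theta_{\mathrm{ML}} = \frac{n}{x_1+\cdots+x_n}.
\end{align*}
```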

So in any case, this is the estimate that the maximum likelihood estimation procedure tells us that we should report. This formula here, of course, tells you what to do if you have already observed specific numbers. If you have observed specific numbers, then you report this particular number as your estimate of Theta.

If you want to describe your estimation procedure more abstractly, what you have constructed is an estimator, which is a box that takes in the random variables, capital X1 up to capital Xn, and produces your estimate, which is also a random variable, because it's a function of these random variables. It is denoted by an upper case Theta, to indicate that this is now a random variable.

So this is an equality about numbers. This is a description of the general procedure, which is an equality between two random variables. And this gives you the more abstract view of what we're doing here.
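The "box" view can be sketched in a few lines of Python. The function `exponential_mle` below plays the role of the estimator (a rule applied to whatever data come in), and the value it returns for one particular data set is the estimate (a number). The data values and the brute-force grid check are made-up illustrations, not part of the lecture.

```python
import math

def exponential_mle(samples):
    """Closed-form ML estimate for i.i.d. exponential data: n / (x1 + ... + xn)."""
    return len(samples) / sum(samples)

def log_likelihood(theta, samples):
    """n * log(theta) - theta * (x1 + ... + xn)."""
    return len(samples) * math.log(theta) - theta * sum(samples)

data = [0.5, 1.5, 1.0]  # made-up observations
theta_hat = exponential_mle(data)

# Brute-force check: no point on a coarse grid should beat the closed form.
grid = [0.01 * k for k in range(1, 501)]
best_on_grid = max(grid, key=lambda t: log_likelihood(t, data))
print(theta_hat, best_on_grid)
```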

All right. So what can we tell about our estimate? Is it good or is it bad? So we should look at this particular random variable and talk about the statistical properties that it has.

What we would like is this random variable to be close to the true value of Theta, with high probability, no matter what Theta is, since we don't know what Theta is.

Let's make a little more specific the properties that we want. So we cook up the estimator somehow. So this estimator corresponds, again, to a box that takes data in, the capital X's, and produces an estimate Theta hat.

This estimate is random. Sometimes it will be above the true value of Theta. Sometimes it will be below. Ideally, we would like it to not have a systematic error, on the positive side or the negative side. So a reasonable wish to have, for a good estimator, is that, on the average, it gives you the correct value.

Now here, let's be a little more specific about what that expectation is. This is an expectation, with respect to the probability distribution of Theta hat. The probability distribution of Theta hat is affected by the probability distribution of the X's. Because Theta hat is a function of the X's.

And the probability distribution of the X's is affected by the true value of Theta. So depending on which one is the true value of Theta, this is going to be a different expectation. So if you were to write this expectation out in more detail, it would look something like this.

You need to write down the probability distribution of Theta hat. And this is going to be some function. But this function depends on the true Theta, is affected by the true Theta. And then you integrate this with respect to Theta hat.

What's the point here? Again, Theta hat is a function of the X's. So the density of Theta hat is affected by the density of the X's. The density of the X's is affected by the true value of Theta. So the distribution of Theta hat is affected by the value of Theta.

Another way to put it is, as I've mentioned a few minutes ago, in this business, it's as if we are considering different possible probabilistic models, one probabilistic model for each choice of Theta. And we're trying to guess which one of these probabilistic models is the true one.

One way of emphasizing the fact that this expression depends on the true Theta is to put a little subscript here, expectation, under the particular value of the parameter Theta. So depending on what value the true parameter Theta takes, this expectation will have a different value.

And what we would like is that no matter what the true value is, that our estimate will not have a bias on the positive or the negative sides. So this is a property that's desirable.

Is it always going to be true? Not necessarily, it depends on what estimator we construct. Is it true for our exponential example? Unfortunately not, the estimate that we have in the exponential example turns out to be biased.

And one extreme way of seeing this is to consider the case where our sample size is 1. We're trying to estimate Theta. And the estimator from the previous slide, in that case, is just 1/X1. Now X1 has a fair amount of density in the vicinity of 0, which means that 1/X1 has significant probability of being very large.

And if you do the calculation, this ultimately makes the expected value of 1/X1 to be infinite. Now infinity is definitely not the correct value. So our estimate is biased upwards. And it's actually biased a lot upwards.

So that's how things are. Maximum likelihood estimates, in general, will be biased. But under some conditions, they will turn out to be asymptotically unbiased.

That is, as you get more and more data, as your X vector is longer and longer, with independent data, the estimate that you're going to have, the expected value of your estimator is going to get closer and closer to the true value. So you do have some nice asymptotic properties, but we're not going to prove anything like this.
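A small simulation can illustrate both the bias and the way it fades. It relies on a standard fact that is not derived in the lecture: for n of at least 2, the expected value of n divided by the sum of n independent exponentials with parameter Theta is n times Theta over (n - 1). The sample sizes, the seed, and the function names are illustrative assumptions.

```python
import random

def exponential_mle(samples):
    """ML estimate for i.i.d. exponential data: n / (x1 + ... + xn)."""
    return len(samples) / sum(samples)

def average_estimate(theta, n, trials, rng):
    """Average of the ML estimate over many independent data sets of size n."""
    total = 0.0
    for _ in range(trials):
        samples = [rng.expovariate(theta) for _ in range(n)]
        total += exponential_mle(samples)
    return total / trials

rng = random.Random(0)
theta = 1.0
for n in (5, 20, 100):
    avg = average_estimate(theta, n, trials=10000, rng=rng)
    # Compare the simulated average with n * theta / (n - 1).
    print(n, round(avg, 3), "theory:", round(n * theta / (n - 1), 3))
```

The simulated averages sit above the true Theta for small n and approach it as n grows, which is what asymptotic unbiasedness means here.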

Speaking of asymptotic properties, in general, what we would like to have is that, as you collect more and more data, you get the correct answer, in some sense. And the sense that we're going to use here is the limiting sense of convergence in probability, since this is the only notion of convergence of random variables that we have in our hands.

This is similar to what we had in the pollster problem, for example. If we had a bigger and bigger sample size, we could be more and more confident that the estimate that we obtained is close to the unknown true parameter of the distribution that we have.

So this is a desirable property. If you have an infinitely large amount of data, you should be able to estimate an unknown parameter more or less exactly. So this is a desirable property of estimators.

It turns out that maximum likelihood estimation, given independent data, does have this property, under mild conditions. So maximum likelihood estimation, in this respect, is a good approach.

So let's see, do we have this consistency property in our exponential example? In our exponential example, we used this quantity to estimate the unknown parameter Theta. What properties does this quantity have as n goes to infinity?

Well this quantity is the reciprocal of that quantity up here, which is the sample mean. We know from the weak law of large numbers, that the sample mean converges to the expectation. So this property here comes from the weak law of large numbers.

In probability, this quantity converges to the expected value, which, for exponential distributions, is 1/Theta. Now, if something converges to something, then the reciprocal of that should converge to the reciprocal of that. That's a property that's certainly correct for numbers.

But we're not talking about convergence of numbers. We're talking about convergence in probability, which is a more complicated notion.

Fortunately, it turns out that the same thing is true, when we deal with convergence in probability. One can show, although we will not bother doing this, that indeed, the reciprocal of this, which is our estimate, converges in probability to the reciprocal of that. And that reciprocal is the true parameter Theta.

So for this particular exponential example, we do have the desirable property, that as the number of data becomes larger and larger, the estimate that we have constructed will get closer and closer to the true parameter value.

And this is true no matter what Theta is. No matter what the true parameter Theta is, we're going to get close to it as we collect more data.
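Convergence in probability can be watched in a simulation: for a fixed tolerance, the fraction of experiments in which the estimate misses the true Theta by more than that tolerance should shrink as n grows. The sketch below assumes Theta equal to 2 and a tolerance of 0.2; those numbers, the seed, and the function names are arbitrary choices for illustration.

```python
import random

def exponential_mle(samples):
    """ML estimate for i.i.d. exponential data: n / (x1 + ... + xn)."""
    return len(samples) / sum(samples)

def miss_frequency(theta, n, epsilon, trials, rng):
    """Fraction of trials in which the estimate is more than epsilon away from theta."""
    misses = 0
    for _ in range(trials):
        samples = [rng.expovariate(theta) for _ in range(n)]
        if abs(exponential_mle(samples) - theta) > epsilon:
            misses += 1
    return misses / trials

rng = random.Random(0)
for n in (10, 100, 1000):
    print(n, miss_frequency(theta=2.0, n=n, epsilon=0.2, trials=1000, rng=rng))
```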

Okay. So these are two rough qualitative properties that would be nice to have. If you want to get a little more quantitative, you can start looking at the mean squared error that your estimator gives.

Now, once more, the comment I was making up there applies. Namely, that this expectation here is an expectation with respect to the probability distribution of Theta hat that corresponds to a particular value of little theta.

So fix a little theta. Write down this expression. Look at the probability distribution of Theta hat, under that little theta. And do this calculation. You're going to get some quantity that depends on the little theta.

And so all quantities in this equality here should be interpreted as quantities under that particular value of little theta. So if you wanted to make this more explicit, you could start throwing little subscripts everywhere in those expressions.

And let's see what those expressions tell us. The expected value of the square of a random variable, we know that it's always equal to the variance of this random variable, plus the square of the expectation of that random variable.

This equality here is just our familiar formula, that the expected value of X squared, E[X^2], is the variance of X plus the square of the expected value of X, (E[X])^2. So we apply this formula to X equal to Theta hat minus Theta.

Now, remember that, in this classical setting, theta is just a constant. We have fixed Theta. We want to calculate the variance of this quantity, under that particular Theta. When you add or subtract a constant to a random variable, the variance doesn't change. This is the same as the variance of our estimator.

And what we've got here is the bias of our estimate. It tells us, on the average, whether we fall above or below. And we're taking the square of the bias. If we have an unbiased estimator, the bias term will be 0.

So ideally we want Theta hat to be very close to Theta. And since Theta is a constant, if that happens, the variance of Theta hat would be very small. So Theta is a constant. If Theta hat has a distribution that's concentrated just around our little theta, then Theta hat would have a small variance.

So this is one desire that we have. We want to have a small variance. But we also want to have a small bias at the same time.

So the general form of the mean squared error has two contributions. One is the variance of our estimator. The other is the bias. And one usually wants to design an estimator that simultaneously keeps both of these terms small.
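The decomposition described in words above can be written out as follows. The subscript theta marks that every expectation and variance is computed under that particular value of the parameter; the exact symbols are an assumption about the slide.

```latex
\begin{align*}
\mathbf{E}_\theta\big[(\hat\Theta-\theta)^2\big]
 &= \mathrm{var}_\theta(\hat\Theta-\theta) + \big(\mathbf{E}_\theta[\hat\Theta-\theta]\big)^2 \\
 &= \mathrm{var}_\theta(\hat\Theta) + \big(\mathbf{E}_\theta[\hat\Theta]-\theta\big)^2 .
\end{align*}
```

The second line uses the fact that subtracting the constant theta does not change the variance, and the last term is the square of the bias.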

So here's an estimation method that would do very well with respect to this term, but badly with respect to that term. So suppose that my distribution is, let's say, normal with an unknown mean Theta and variance 1.

And I use as my estimator something very dumb. I always produce an estimate that says my estimate is 100. So I'm just ignoring the data and report 100. What does this do?

The variance of my estimator is 0. There's no randomness in the estimate that I report. But the bias is going to be pretty bad. The bias is going to be Theta hat, which is 100 minus the true value of Theta.

And for some Theta's, my bias is going to be horrible. If my true Theta happens to be 0, my bias squared is a huge term. And I get a large error.

So what's the moral of this example? There are ways of making that variance very small, but, in those cases, you pay a price in the bias. So you want to do something a little more delicate, where you try to keep both terms small at the same time.
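For the "always report 100" estimator the two terms can be computed exactly, with no simulation needed. The helper name and the values of Theta tried below are illustrative assumptions.

```python
def mse_terms_constant_estimator(theta, constant=100.0):
    """Variance, bias and mean squared error of the estimator that always reports `constant`."""
    variance = 0.0           # the reported value never changes with the data
    bias = constant - theta  # E[Theta hat] - theta
    return variance, bias, variance + bias ** 2

for theta in (0.0, 90.0, 100.0):
    print(theta, mse_terms_constant_estimator(theta))
```

The variance term is always zero, but the squared bias is enormous unless the true Theta happens to sit near 100.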

So these types of considerations become important when you start to try to design sophisticated estimators for more complicated problems. But we will not do this in this class. This belongs to further classes on statistics and inference.

For this class, for parameter estimation, we will basically stick to two very simple methods. One is the maximum likelihood method we've just discussed. And the other method is what you would do if you were still in high school and didn't know any probability.

You get data. And these data come from some distribution with an unknown mean. And you want to estimate the unknown mean. What would you do? You would just take those data and average them out.

So let's make this a little more specific. We have X's that come from a given distribution. We know the general form of the distribution, perhaps. We do know, perhaps, the variance of that distribution, or, perhaps, we don't know it. But we do not know the mean.

And we want to estimate the mean of that distribution. Now, we can write this situation. We can represent it in a different form. The Xi's are equal to Theta. This is the mean. Plus a 0 mean random variable, that you can think of as noise.

So this corresponds to the usual situation you would have in a lab, where you go and try to measure an unknown quantity. You get lots of measurements. But each time that you measure them, your measurements have some extra noise in there. And you want to kind of get rid of that noise.

The way to try to get rid of the measurement noise is to collect lots of data and average them out. This is the sample mean. And this is a very, very reasonable way of trying to estimate the unknown mean of the X's.

So this is the sample mean. It's a reasonable, plausible, in general, pretty good estimator of the unknown mean of a certain distribution. We can apply this estimator without really knowing a lot about the distribution of the X's.

Actually, we don't need to know anything about the distribution. We can still apply it, because the variance, for example, does not show up here. We don't need to know the variance to calculate that quantity.

Does this estimator have good properties? Yes, it does. What's the expected value of the sample mean? The expectation of this is the expectation of this sum divided by n. The expected value for each one of the X's is Theta. So the expected value of the sample mean is just Theta itself.

So our estimator is unbiased. No matter what Theta is, our estimator does not have a systematic error in either direction. Furthermore, the weak law of large numbers tells us that this quantity converges to the true parameter in probability. So it's a consistent estimator. This is good.

And suppose you want to calculate the mean squared error corresponding to this estimator. Remember how we defined the mean squared error? It's this quantity. Then it's a calculation that we have done a fair number of times by now.

The mean squared error is the variance of the distribution of the X's divided by n. So as we get more and more data, the mean squared error goes down to 0.
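These two properties of the sample mean, no bias and a mean squared error equal to the variance divided by n, can be checked with a quick simulation. The sketch assumes normal measurement noise and particular values of Theta, sigma and n purely for illustration; the properties themselves do not depend on the noise being normal.

```python
import random

def simulate_sample_mean(theta, sigma, n, trials, rng):
    """Return (average estimate, empirical mean squared error) of the sample mean."""
    total = 0.0
    total_sq_err = 0.0
    for _ in range(trials):
        # Each measurement is theta plus zero-mean noise.
        estimate = sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        total += estimate
        total_sq_err += (estimate - theta) ** 2
    return total / trials, total_sq_err / trials

rng = random.Random(0)
theta, sigma, n = 3.0, 2.0, 25
avg, mse = simulate_sample_mean(theta, sigma, n, trials=20000, rng=rng)
print(avg, mse, "theory:", sigma ** 2 / n)
```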

In some examples, it turns out that the sample mean is also the same as the maximum likelihood estimate. For example, if the X's are coming from a normal distribution, you can write down the likelihood, do the maximization with respect to Theta, you'll find that the maximum likelihood estimate is the same as the sample mean.
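For the normal case, the maximization can be sketched as follows, assuming independent observations with unknown mean theta and variance 1, as in the earlier example. The notation is an assumption rather than a copy of the slide.

```latex
\begin{align*}
\log f_X(x;\theta) &= -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(x_i-\theta)^2, \\
\frac{d}{d\theta}\log f_X(x;\theta) &= \sum_{i=1}^{n}(x_i-\theta) = 0
\quad\Longrightarrow\quad
\hat\theta_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} x_i .
\end{align*}
```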

In other cases, the sample mean will be different from the maximum likelihood. And then you have a choice about which one of the two you would use. Probably, in most reasonable situations, you would just use the sample mean, because it's simple, easy to compute, and has nice properties.

All right. So you go to your boss. And you report and say, OK, I did all my experiments in the lab. And the average value that I got is a certain number, 2.37. So is that informative to your boss?

Well your boss would like to know how much they can trust this number, 2.37. Well, I know that the true value is not going to be exactly that. But how close should it be? So give me a range of what you think are possible values of Theta.

So the situation is like this. So suppose that we observe X's that are coming from a certain distribution. And we're trying to estimate the mean. We get our data. Maybe our data looks something like this.

You calculate the mean. You find the sample mean. So let's suppose that the sample mean is a number which, for some reason, we take to be 2.37. But you want to convey something to your boss about how spread out these data were.

So the boss asks you to give him or her some kind of interval on which Theta, the true parameter, might lie. So the boss asked you for an interval. So what you do is you end up reporting an interval.

And you somehow use the data that you have seen to construct this interval. And you report to your boss also the endpoints of this interval. Let's give names to these endpoints, Theta_n- and Theta_n+. The n's here just play the role of keeping track of how many data we're using.

So what you report to your boss is this interval as well. Are these Theta's here, the endpoints of the interval, lowercase or uppercase? What should they be? Well you construct these intervals after you see your data.

You take the data into account to construct your interval. So these definitely should depend on the data. And therefore they are random variables. Same thing with your estimator, in general, it's going to be a random variable. Although, when you go and report numbers to your boss, you give the specific realizations of the random variables, given the data that you got.

So instead of having just a single box that produces estimates. So our previous picture was that you have your estimator that takes X's and produces Theta hats. Now our box will also be producing Theta hats minus and Theta hats plus. It's going to produce an interval as well.

The X's are random, therefore these quantities are random. Once you go and do the experiment and obtain your data, then your data will be some lowercase x, specific numbers. And then your estimates and estimator become also lower case.

What would we like this interval to do? We would like it to be highly likely to contain the true value of the parameter. So we might impose some specs of the following kind.

I pick a number, alpha. Usually that alpha, think of it as a probability of a large error. A typical value of alpha might be 0.05, in which case this number here is 0.95.

And you're given specs that say something like this. I would like, with probability at least 0.95, this to happen, which says that the true parameter lies inside the confidence interval.

Now let's try to interpret this statement. Suppose that you did the experiment, and that you ended up reporting to your boss a confidence interval from 1.97 to 2.56. That's what you report to your boss.

And suppose that the confidence interval has this property. Can you go to your boss and say, with probability 95%, the true value of Theta is between these two numbers? Is that a meaningful statement?

So the statement is, the tentative statement is, with probability 95%, the true value of Theta is between 1.97 and 2.56. Well, what is random in that statement? There's nothing random. The true value of theta is a constant. 1.97 is a number. 2.56 is a number.

So it doesn't make any sense to talk about the probability that theta is in this interval. Either theta happens to be in that interval, or it happens to not be. But there are no probabilities associated with this. Because theta is not random.

Syntactically, you can see this. Because theta here is a lower case. So what kind of probabilities are we talking about here? Where's the randomness? Well the random thing is the interval. It's not theta.

So the statement that is being made here is that the interval, that's being constructed by our procedure, should have the property that, with probability 95%, it's going to fall on top of the true value of theta.

So the right way of interpreting what the 95% confidence interval is, is something like the following. We have the true value of theta that we don't know. I get data. Based on the data, I construct a confidence interval. I get my confidence interval. I got lucky. And the true value of theta is in here.

Next day, I do the same experiment, take my data, construct a confidence interval. And I get this confidence interval, lucky once more. Next day I get data. I use my data to come up with an estimate of theta and the confidence interval.

That day, I was unlucky. And I got a confidence interval out there. What the requirement here is, is that 95% of the days, where we use this certain procedure for constructing confidence intervals, 95% of those days, we will be lucky. And we will capture the correct value of theta by your confidence interval.

So it's a statement about the distribution of these random confidence intervals, how likely are they to fall on top of the true theta, as opposed to how likely they are to fall outside. So it's a statement about probabilities associated with a confidence interval. They're not probabilities about theta, because theta, itself, is not random.

So this is what the confidence interval is, in general, and how we interpret it. How do we construct a 95% confidence interval? Let's go through this exercise, in a particular example.

The calculations are exactly the same as the ones that you did when we talked about laws of large numbers and the central limit theorem. So there's nothing new calculationally but it's, perhaps, new in terms of the language that we use and the interpretation.

So we got our sample mean from some distribution. And we would like to calculate a 95% confidence interval. We know from the normal tables, that the standard normal has 2.5% on the tail, that's after 1.96.

Yes, by this time, the number 1.96 should be pretty familiar. So if this probability here is 2.5%, this number here is 1.96.

Now look at this random variable here. This is the difference of the sample mean from the true mean, normalized by the usual normalizing factor, sigma over the square root of n. By the central limit theorem, this is approximately normal. So its absolute value has probability approximately 0.95 of being less than 1.96.

Now take this event here and rewrite it. This is the event, well, that Theta hat minus theta is bigger than this number and smaller than that number. This event here is equivalent to that event here.

And so this suggests a way of constructing our 95% confidence interval. I'm going to report the interval, which gives this as the lower end of the confidence interval, and gives this as the upper end of the confidence interval.
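Since the slide is not part of the transcript, here is roughly what the rewriting looks like. The notation is an assumption: $\hat{\Theta}_n$ is the sample mean of $n$ observations, $\theta$ is the true mean, and $\sigma$ is the standard deviation of a single observation.

```latex
\begin{align*}
0.95 &\approx \mathbf{P}\left( \frac{|\hat{\Theta}_n - \theta|}{\sigma/\sqrt{n}} \le 1.96 \right) \\
&= \mathbf{P}\left( -\frac{1.96\,\sigma}{\sqrt{n}} \le \hat{\Theta}_n - \theta \le \frac{1.96\,\sigma}{\sqrt{n}} \right) \\
&= \mathbf{P}\left( \hat{\Theta}_n - \frac{1.96\,\sigma}{\sqrt{n}} \le \theta \le \hat{\Theta}_n + \frac{1.96\,\sigma}{\sqrt{n}} \right).
\end{align*}
```

The two quantities on either side of $\theta$ in the last line are the lower and upper ends of the interval.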

In other words, at the end of the experiment, we report the sample mean, which is our estimate. And we report also, an interval around the sample mean. And this is our 95% confidence interval.

The confidence interval becomes smaller, when n is larger. In some sense, we're more certain that we're doing a good estimation job, so we can have a small interval and still be quite confident that our interval captures the true value of the parameter.

Also, if our data have very little noise, when you have more accurate measurements, you're more confident that your estimate is pretty good. And that results in a smaller confidence interval, smaller length of the confidence interval. And still you have 95% probability of capturing the true value of theta.
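A quick numerical sketch of these two effects, assuming the half-width of the interval is 1.96 sigma over the square root of n. The function name and the particular values of sigma and n are only illustrative.

```python
from math import sqrt

def half_width(sigma, n, z=1.96):
    """Half the length of the interval: z * sigma / sqrt(n)."""
    return z * sigma / sqrt(n)

# More data: quadrupling n halves the length of the interval.
for n in (100, 400, 1600):
    print(f"sigma=1.0, n={n:5d}: half-width = {half_width(1.0, n):.4f}")

# Less noise: halving sigma also halves the length of the interval.
for sigma in (1.0, 0.5):
    print(f"sigma={sigma}, n=  100: half-width = {half_width(sigma, 100):.4f}")
```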

So we did this exercise by taking 95% confidence intervals and the corresponding value from the normal tables, which is 1.96.

Of course, you can do it more generally, if you set your alpha to be some other number. Again, you look at the normal tables. And you find the value here, so that the tail has probability alpha over 2.

And instead of using this 1.96, you use whatever number you get from the normal tables. And this tells you how to construct a confidence interval.
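Instead of reading the normal tables by hand, one can compute the same number. This sketch uses `NormalDist` from Python's standard `statistics` module. The helper name `z_value` is made up for this example.

```python
from statistics import NormalDist

def z_value(alpha):
    """Point beyond which the standard normal has tail probability alpha/2."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: use {z_value(alpha):.3f} in place of 1.96")
```

For alpha equal to 0.05 this gives back the familiar 1.96.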

Well, to be exact, this is not necessarily a 95% confidence interval. It's approximately a 95% confidence interval. Why is this? Because we've done an approximation. We have used the central limit theorem.

So it might turn out to be a 95.5% confidence interval instead of 95%, because our calculations are not entirely accurate. But for reasonable values of n, using the central limit theorem is a good approximation. And that's what people almost always do.

So just take the value from the normal tables. Okay, except for one catch. I used the data. I obtained my estimate. And I want to go to my boss and report this theta hat minus and theta hat plus, the two endpoints of the confidence interval.

What's the difficulty? I know what n is. But I don't know what sigma is, in general. So if I don't know sigma, what am I going to do?

Here, there's a few options for what you can do. And the first option is familiar from what we did when we talked about the pollster problem. We don't know what sigma is, but maybe we have an upper bound on sigma.

For example, if the Xi's are Bernoulli random variables, we have seen that the standard deviation is at most 1/2. So use the most conservative value for sigma. Using the most conservative value means that you take bigger confidence intervals than necessary.
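As a sketch of this first option, here is the conservative interval for Bernoulli data, using the bound of 1/2 on sigma. The poll numbers (540 "yes" answers out of 1000) are hypothetical, and the function name is made up.

```python
from math import sqrt

def conservative_interval(theta_hat, n, z=1.96):
    """Interval that uses the upper bound 1/2 on the Bernoulli sigma."""
    sigma_bound = 0.5  # largest possible standard deviation of a Bernoulli
    half = z * sigma_bound / sqrt(n)
    return theta_hat - half, theta_hat + half

# Hypothetical poll: 540 "yes" answers out of 1000 people asked.
lower, upper = conservative_interval(540 / 1000, 1000)
print(f"conservative 95% interval: [{lower:.3f}, {upper:.3f}]")
```

Because the true sigma can only be smaller than 1/2, this interval is at least as long as the one we would report if we knew sigma.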

So that's one option. Another option is to try to estimate sigma from the data. How do you do this estimation? In special cases, for special types of distributions, you can think of heuristic ways of doing this estimation.

For example, in the case of Bernoulli random variables, we know that the true value of sigma, the standard deviation of a Bernoulli random variable, is the square root of theta times 1 minus theta, where theta is the mean of the Bernoulli.

Try to use this formula. But theta is the thing we're trying to estimate in the first place. We don't know it. What do we do? Well, we have an estimate for theta, the estimate, produced by our estimation procedure, the sample mean.

So I obtain my data. I get my data. I produce the estimate theta hat. It's an estimate of the mean. Use that estimate in this formula to come up with an estimate of my standard deviation. And then use that standard deviation, in the construction of the confidence interval, pretending that this is correct.

Well, if the number of your data points is large, then we know, from the law of large numbers, that theta hat is a pretty good estimate of theta. So sigma hat is going to be a pretty good estimate of sigma. So we're not making large errors by using this approach.
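A sketch of this plug-in approach for Bernoulli data, next to the conservative bound of 1/2 for comparison. The values theta hat = 0.54 and n = 1000 are hypothetical, and the function name is made up.

```python
from math import sqrt

def plug_in_interval(theta_hat, n, z=1.96):
    """Interval that pretends sigma equals sqrt(theta_hat * (1 - theta_hat))."""
    sigma_hat = sqrt(theta_hat * (1 - theta_hat))  # estimate, not the true sigma
    half = z * sigma_hat / sqrt(n)
    return theta_hat - half, theta_hat + half

theta_hat, n = 0.54, 1000  # hypothetical estimate and sample size
lower, upper = plug_in_interval(theta_hat, n)
conservative_half = 1.96 * 0.5 / sqrt(n)

print(f"plug-in half-width:      {(upper - lower) / 2:.4f}")
print(f"conservative half-width: {conservative_half:.4f}")
```

The plug-in interval is never longer than the conservative one, since theta hat times 1 minus theta hat is at most 1/4.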

So in this scenario here, things were simple, because we had an analytical formula. Sigma was determined by theta. So we could come up with a quick and dirty estimate of sigma.

In general, if you do not have any nice formulas of this kind, what could you do? Well, you still need to come up with an estimate of sigma somehow. What is a generic method for estimating a standard deviation? Equivalently, what could be a generic method for estimating a variance?

Well the variance is an expected value of some random variable. The variance is the mean of the random variable inside of those brackets. How does one estimate the mean of some random variable?

You obtain lots of measurements of that random variable and average them out. So this would be a reasonable way of estimating the variance of a distribution. And again, the weak law of large numbers tells us that this average converges to the expected value of this, which is just the variance of the distribution.

So we got a nice and consistent way of estimating variances. But now, we seem to be getting in a vicious circle here, because to estimate the variance, we need to know the mean. And the mean is something we're trying to estimate in the first place.

Okay. But we do have an estimate of the mean. So a reasonable approximation, once more, is to plug in, here, since we don't know the mean, the estimate of the mean. And so you get that expression, but with a theta hat instead of theta itself.

And this is another reasonable way of estimating the variance. It does have the same consistency properties. Why? When n is large, this is going to behave the same as that, because theta hat converges to theta.

And when n is large, this is approximately the same as sigma squared. So for a large n, this quantity also converges to sigma squared. And we have a consistent estimate of the variance as well. And we can take that consistent estimate and use it back in the construction of confidence interval.

One little detail, here, we're dividing by n. Here, we're dividing by n-1. Why do we do this? Well, it turns out that's what you need to do for this estimate to be an unbiased estimate of the variance. One has to do a little bit of a calculation, and one finds that that's the factor that you need to have here in order to be unbiased.
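For readers without the slide, the two expressions being compared are presumably the following. The notation is an assumption: $\hat{\Theta}_n$ is the sample mean and $\hat{S}_n^2$ is the name used here for the variance estimate.

```latex
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n} (X_i - \theta)^2
  &\;\longrightarrow\; \sigma^2 = \mathbf{E}\big[(X_i - \theta)^2\big], \\
\hat{S}_n^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \hat{\Theta}_n)^2
  &\;\longrightarrow\; \sigma^2, \qquad \mathbf{E}\big[\hat{S}_n^2\big] = \sigma^2 .
\end{align*}
```

The first expression needs the unknown $\theta$. The second uses only the data, and the factor $1/(n-1)$ is what makes its expected value exactly $\sigma^2$.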

Of course, if you get 100 data points, whether you divide by 100 or divide by 99, it's going to make only a tiny difference in your estimate of your variance.

So it's going to make only a tiny difference in your estimate of the standard deviation. It's not a big deal. And it doesn't really matter. But if you want to show off about your deeper knowledge of statistics, you throw in the 1 over n-1 factor in there.
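A small sketch comparing the two divisors on a made-up data set, checked against the standard `statistics` module (`pvariance` divides by n, `variance` divides by n-1). The data values are hypothetical.

```python
import statistics

data = [2.1, 1.9, 2.4, 2.0, 1.6]  # hypothetical measurements
n = len(data)

mean_hat = sum(data) / n                            # sample mean
sum_sq = sum((x - mean_hat) ** 2 for x in data)     # squared deviations

var_n = sum_sq / n                   # divide by n
var_n_minus_1 = sum_sq / (n - 1)     # divide by n-1 (unbiased)

print(f"divide by n:   {var_n:.4f}  (pvariance: {statistics.pvariance(data):.4f})")
print(f"divide by n-1: {var_n_minus_1:.4f}  (variance:  {statistics.variance(data):.4f})")

# The two estimates differ by the factor m / (m - 1), which shrinks quickly.
for m in (5, 100):
    print(f"{m} data points: ratio = {m / (m - 1):.4f}")
```

With 100 data points the ratio is about 1.01, which is the tiny difference mentioned above.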

So now one basically needs to put together this story here, how you estimate the variance. You first compute the sample mean. And then you do some extra work to come up with a reasonable estimate of the variance and the standard deviation. And then you use your estimate, of the standard deviation, to come up with a confidence interval, which has these two endpoints.
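Putting the whole story together, a sketch of the procedure might look like this. The function name is made up, and the data are simulated exponential measurements chosen only for illustration. The standard deviation estimate comes from `statistics.stdev`, which divides by n-1.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for the mean."""
    n = len(sample)
    theta_hat = mean(sample)      # step 1: the sample mean
    sigma_hat = stdev(sample)     # step 2: estimated standard deviation (n-1)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # step 3: value from the normal tables
    half = z * sigma_hat / sqrt(n)
    return theta_hat - half, theta_hat, theta_hat + half

# Simulated data for illustration: exponential with true mean 2.0.
rng = random.Random(1)
sample = [rng.expovariate(0.5) for _ in range(200)]

lower, theta_hat, upper = confidence_interval(sample)
print(f"estimate {theta_hat:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

Both approximations mentioned in the lecture are in here: the normal approximation behind the value of z, and sigma hat standing in for the true sigma.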

In doing this procedure, there's basically a number of approximations that are involved. There are two types of approximations. One approximation is that we're pretending that the sample mean has a normal distribution. That's something we're justified to do, by the central limit theorem. But it's not exact. It's an approximation.

And the second approximation that comes in is that, instead of using the correct standard deviation, in general, you will have to use some approximation of the standard deviation.

Okay so you will be getting a little bit of practice with these concepts in recitation and tutorial. And we will move on to new topics next week. But the material that's going to be covered in the final exam is only up to this point. So next week is just general education. Hopefully useful, but it's not in the exam.