\documentclass[14pt]{extarticle}
\usepackage{hyperref}
\usepackage{Sweave}
\begin{document}
\title{Class: standard errors}
\author{Dean Foster}

\section*{Administrivia}

Practice 2 is due today.  The assignment from Berndt is due Monday.

\section{Story: Pair programming}

\begin{itemize}
\item The mythical man-month
\item ``If you double the number of programmers, the amount of time it
  takes doubles.''  Huh?
\item Invention of {\em extreme programming}:
  \begin{itemize}
  \item Always have working code
  \item Always modify working code to make it simpler
  \item Small changes and lots of testing
  \end{itemize}
\item One version is called {\em pair programming}:
  \begin{itemize}
  \item The person who ``knows'' the code doesn't have the typewriter
  \item The person who ``knows'' the code can only make suggestions
  \item By the end of a day working together, the other person knows
    the code
  \end{itemize}
\item There is lots of experiential data on how effective pair
  programming is, but not as many controlled experiments as there
  should be.  Oh well.
\end{itemize}

\section{Status so far}

The model is
\begin{displaymath}
  Y_i = \alpha + \beta x_i + \epsilon_i
\end{displaymath}
where the $\epsilon_i$ are iid and
\begin{displaymath}
  \epsilon_i \sim N(0,\sigma^2).
\end{displaymath}
\begin{itemize}
\item First we discussed fitting ($\alpha + \beta x_i$).
\item Then we discussed the residuals.
\item Now we want to discuss how to estimate the error in $\hat\beta$.
\end{itemize}

\section{Why we care}

If the normal linear model holds, we know that $\hat\beta$ has close
to a normal distribution.  In particular,
$\frac{\hat\beta - \beta}{SE(\hat\beta)}$ has a t-distribution.

We often want to know how accurately we know $\beta$ purely for its
own sake.  For example, if our model has $Y =$ sales and $X =$
advertisements, then $\beta =$ sales per ad.  So if we know that each
sale generates \$10 of profit and each ad costs \$1 (think web-based
advertisements), then we need $\beta > .1$ before we make more money
in sales than we spent on advertisements.
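To spell out the break-even arithmetic in the advertising example: each
ad generates an expected $\beta$ additional sales, each worth \$10 of
profit, and costs \$1.  So the expected net profit per ad is
\begin{displaymath}
  10\beta - 1 > 0
  \quad\Longleftrightarrow\quad
  \beta > \frac{1}{10} = .1,
\end{displaymath}
which is why the cutoff for $\beta$ is $.1$ rather than some other
number.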
We can also use $\hat\beta$ to make predictions:
\begin{displaymath}
  \hat{Y} = \hat\alpha + \hat\beta x_i
\end{displaymath}
So before we can know how accurate a forecast is, we need to know how
accurate $\hat\beta$ is.

Either of these requires knowing that $\hat\beta$ is a good estimate
of $\beta$, and exactly how good an estimate it actually is.  So we
need the standard error of $\hat\beta$.

\section{Standard errors rely on the linear model assumption}

The standard errors generated by R/JMP/any statistics package all
require that the standard linear model holds, i.e.\ that the residuals
are iid normal.  If they aren't:
\begin{itemize}
\item predictions are wrong,
\item confidence intervals are wrong,
\item hypothesis tests are wrong,
\item the accuracy of slopes is unknown.
\end{itemize}
We are left with just guessing.  In previous classes, all we did was
notice problems.  Now we want to fix them (if possible).

\section{Heteroskedasticity}

One problem that we can fix is heteroskedasticity.  Suppose that we
are regressing salary ($Y$) on runs ($X$).  Then we might expect that
\begin{displaymath}
  Y = \alpha + \beta x
\end{displaymath}
will display heteroskedastic errors.  In particular, we might expect
the errors to grow with $x$.

\paragraph{log-log model}  If we use logs, we can consider the model
\begin{displaymath}
  \log(Y) = \alpha + \beta \log(x) + \epsilon
\end{displaymath}
In this model, we now expect the errors to be homoskedastic.
Exponentiating both sides shows what the model says on the original
scale:
\begin{eqnarray*}
  \log(Y) & = & \alpha + \beta \log(x) + \epsilon\\
  e^{\log(Y)} & = & e^{\alpha + \beta \log(x) + \epsilon}\\
  Y & = & e^{\alpha} e^{\beta \log(x)} e^{\epsilon}\\
  Y & = & e^{\alpha} (e^{\log(x)})^\beta e^{\epsilon}\\
  Y & = & k x^\beta e^{\epsilon}
\end{eqnarray*}
where we wrote $k$ for $e^{\alpha}$ so the equation looks prettier.
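One way to see why this model matches errors that grow with $x$: on
the original scale the error $e^{\epsilon}$ is multiplicative, so the
spread of $Y$ is proportional to its level.  Since $\epsilon$ is
independent of $x$,
\begin{displaymath}
  SD(Y \mid x) \;=\; k x^\beta \, SD(e^{\epsilon}) \;\propto\; x^\beta,
\end{displaymath}
so the errors in dollars grow with $x$ even though the errors in
log-dollars are homoskedastic.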
The problem with this analysis is that we can no longer address the
question, ``How much is a hit worth?''  Instead, we can only say,
``How many log(dollars) is a log(hit) worth?''  If this sounds
reasonable to you, then you are definitely an economist: a slope
between $\log(Y)$ and $\log(x)$ is called an elasticity.

\paragraph{Intrinsically linear models}  Cobb-Douglas production
functions are a cool example of a linear model that doesn't look like
a linear model: $\log(\mbox{output}) = \log(\mbox{labor}) +
\log(\mbox{capital})$, etc.

\paragraph{Weighted least squares:}  Alternatively, we can do weighted
least squares.  If we believe that the errors scale with $x$, then we
can write our model precisely as
\begin{displaymath}
  Y = \alpha + \beta x + x \epsilon
\end{displaymath}
So when $x$ is large the errors are large, and when $x$ is small the
errors are small.  This would show up nicely in a plot of
$\epsilon^2$ vs.\ $x$.  We can modify our equation by dividing both
sides by $x$:
\begin{displaymath}
  Y/x = \alpha/x + \beta + \epsilon
\end{displaymath}
Now if we do a regression of $Y/x$ on $1/x$, we have a homoskedastic
regression.  This methodology is called weighted least squares.  Here
each observation is scaled by $1/x$, which corresponds to weights of
$1/x^2$ in R's terminology.  There are two ways of doing a weighted
least squares.
\begin{itemize}
\item First, you can ask R to simply use a column of $1/x^2$ as
  weights.  Sample heteroskedastic data with the evil least squares
  line:
\begin{Schunk}
\begin{Sinput}
> x <- 1:20
> y <- 10 + x * (3 + rnorm(20))
> plot(x, y)
> abline(lm(y ~ x)$coef)
\end{Sinput}
\end{Schunk}
\includegraphics{class_standard_errors-001}
Now with the weighted line:
\begin{Schunk}
\begin{Sinput}
> plot(x, y)
> abline(lm(y ~ x, weights = 1/x^2)$coef)
\end{Sinput}
\end{Schunk}
\includegraphics{class_standard_errors-002}
This is nice since the equation it generates will still use $\alpha$
and $\beta$ as we have been using them in the equations already.
\item Second, you can create the two new variables $Y/x$ and $1/x$
  and do the regression yourself.
This gives you control and will work in any regression package
(e.g.\ Excel), but you then have to interpret the slope and intercept
carefully: in the transformed regression the intercept estimates
$\beta$ and the slope on $1/x$ estimates $\alpha$.  It looks like:
\begin{Schunk}
\begin{Sinput}
> newx <- 1/x
> newy <- y/x
> plot(newx, newy)
> abline(lm(newy ~ newx)$coef)
\end{Sinput}
\end{Schunk}
\includegraphics{class_standard_errors-003}
Compare the coefficients:
\begin{Schunk}
\begin{Sinput}
> summary(lm(y ~ x, weights = 1/x^2))
\end{Sinput}
\begin{Soutput}
Call:
lm(formula = y ~ x, weights = 1/x^2)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6074 -0.9099 -0.1673  0.8402  1.9102 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.3621     1.1956   7.830 3.32e-07 ***
x             3.2447     0.3378   9.606 1.65e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.165 on 18 degrees of freedom
Multiple R-squared: 0.8368,     Adjusted R-squared: 0.8277 
F-statistic: 92.28 on 1 and 18 DF,  p-value: 1.651e-08 
\end{Soutput}
\begin{Sinput}
> summary(lm(newy ~ newx))
\end{Sinput}
\begin{Soutput}
Call:
lm(formula = newy ~ newx)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6074 -0.9099 -0.1673  0.8402  1.9102 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.2447     0.3378   9.606 1.65e-08 ***
newx          9.3621     1.1956   7.830 3.32e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.165 on 18 degrees of freedom
Multiple R-squared: 0.7731,     Adjusted R-squared: 0.7605 
F-statistic: 61.32 on 1 and 18 DF,  p-value: 3.321e-07 
\end{Soutput}
\end{Schunk}
Note that the two fits give the same estimates and standard errors,
with the roles of slope and intercept swapped.  To check that $1/x$ is
the wrong weight:
\begin{Schunk}
\begin{Sinput}
> summary(lm(y ~ x, weights = 1/x))
\end{Sinput}
\begin{Soutput}
Call:
lm(formula = y ~ x, weights = 1/x)

Residuals:
   Min     1Q Median     3Q    Max 
-5.294 -2.722 -0.490  1.772  6.716 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.389      2.957   2.837   0.0109 *  
x              3.420      0.387   8.837 5.78e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.847 on 18 degrees of freedom
Multiple R-squared: 0.8127,     Adjusted R-squared: 0.8023 
F-statistic: 78.09 on 1 and 18 DF,  p-value: 5.777e-08 
\end{Soutput}
\end{Schunk}
\end{itemize}

\end{document}