\documentclass[14pt]{extarticle}
\usepackage{hyperref}
\renewcommand{\baselinestretch}{1.3}
\usepackage{Sweave}
\begin{document}
\title{Class: Taylor}
\maketitle
(\href{class_doglegs.pdf}{pdf version})

\section*{Story time: Dan Willingham, the Cog Psych}
\begin{itemize}
\item Willingham: professor of psychology at the University of Virginia (Ph.D.\ in cognitive psychology from Harvard)
\item \href{http://www.amazon.com/Why-Dont-Students-Like-School/dp/0470279303/ref=sr_1_1?ie=UTF8&s=books&qid=1263385854&sr=1-1}{Why Don't Students Like School?}
\item We know lots about psychology now
\item but amazingly little about education:
  \begin{itemize}
  \item the Perry Preschool Project is still a state-of-the-art education experiment
  \item there are few other good controlled experiments
  \end{itemize}
\item But we can use what we know about psychology to inform education
\item That is the approach of this book
\item For example: he pushes stories
  \begin{itemize}
  \item People relate to humans with special hardware in their brain (cards vs.\ liars):
    \begin{itemize}
    \item With cards: the rule is that red on one side means an even number on the other side. Which cards do you check?
      \begin{itemize}
      \item A red card?
      \item An even card?
      \item An odd card?
        % about 3/4 got these right
      \end{itemize}
    \item With people: the rule is that anyone drinking must be over 21. Whom do you check?
      \begin{itemize}
      \item A drinker?
      \item An old person?
      \item A young person?
        % 100% got these three questions right
      \end{itemize}
    \end{itemize}
    So try to tell stories.
  \item People pay attention at the start of class--new material is always interesting. So the opening has no need to connect to the day's theme.
  \item Hence when I start classes with a short story, blame Dan, the cognitive psychologist.
  \end{itemize}
\end{itemize}

\section*{Administrivia}
\begin{itemize}
\item Homework / ``cases'' project and a final
\item Email statistics.assignments@gmail.com for questions and for turning in assignments
  \begin{itemize}
  \item Both the TA and I get messages sent here, so you reach whoever is currently online sooner.
  \end{itemize}
\item Books
  \begin{itemize}
  \item Introductory Statistics with R by Peter Dalgaard, 2nd edition, ISBN 978-0-387-79053-4, Springer 2008.
  \item Linear Models with R by Julian J. Faraway, ISBN 1-58488-425-8, Chapman \& Hall/CRC Press 2005.
  \end{itemize}
\item Software: R
  \begin{itemize}
  \item It's free and available on OS X, Linux, and Windows. Everywhere! Personally, I use it most often on my Android tablet.
  \item It is what production-level statisticians use
  \item It had its own article in the NYT! (OK, it shared a bit with ``big data.'')
  \item Friday at 11am, Kory will give an intro to using R. (And our computer person will be there a bit before, at 10:30, to help you install R if you are having problems.)
  \end{itemize}
\end{itemize}

\section*{The triangle of statistics}
Statistics has three major pieces
\begin{itemize}
\item mathematics
\item data analysis (i.e.\ science)
\item communication
\end{itemize}
To be good, you need all three (or at least two of the three). Doing only one isn't as powerful:
\begin{itemize}
\item only mathematics: Terence Tao, maybe the smartest guy on the planet. I would have recommended him for the genius award--but he already has one.
\item only data analysis: called a ``master's-level statistician.'' Employable at big pharma. But low pay.
\item only communication: called bloggers. Basically unpaid!
\end{itemize}
My goal is to make sure you can make more money than any of these pure states!
So
\begin{itemize}
\item MBAs are closer to the communication corner
\item math undergrads are closer to the math corner
\item stat concentrators are closer to the data analysis corner
\end{itemize}
but by the end, I want you all to have moved a bit towards the middle. I'll present more mathematics and data analysis, since that is what I know best.
\vspace{2em}

\section*{Today's Topic: Simple linear regression}

\section*{Review of the standard linear model}
The standard linear regression model is:
\begin{displaymath}
Y_i = \alpha + \beta x_i + \epsilon_i \quad \epsilon_i \sim_{iid} N(0,\sigma^2)
\end{displaymath}
You will see this equation written in almost any research paper that uses data. The names are often changed, but it is there somewhere. For example, it is basically equation 2.17 in the Berndt reading. The entire chapter is designed to motivate that one equation.

Let's break it down into pieces.
\begin{itemize}
\item The fit:
\begin{displaymath}
Y_i = \underbrace{\alpha + \beta x_i}_{\hbox{the fit}} + \epsilon_i \quad \epsilon_i \sim_{iid} N(0,\sigma^2)
\end{displaymath}
The most fun part is ``the fit.'' It describes the relationship between $x$ and $Y$. This version describes a linear relationship.
\item Residuals / errors:
\begin{displaymath}
Y_i = \alpha + \beta x_i + \underbrace{\epsilon_i \quad \epsilon_i \sim_{iid} N(0,\sigma^2)}_{\hbox{The residuals}}
\end{displaymath}
The residuals (a.k.a.\ errors) themselves. Describing them, looking at them, and investigating them is the primary activity of a statistician. It is all about error!
  \begin{itemize}
  \item The ``i.d.'': The i.i.d.\ part can be broken into two pieces, ``i.'' and ``i.d.'' The easier is ``identically distributed.'' It means each error looks like any other error.
  \item The ``i.'': The first ``i'' in IID is for independence. We will spend an entire class on this piece. It is the most important assumption in the entire model.
  \item The ``N'': Means normal. Look at a q-q plot to check it. It is easy to check (hence we cover it in intro classes). We won't discuss it here since I assume you already know how to check it.
  \item Style: iid = i.i.d. = IID = I.I.D. = independent and identically distributed. It is often even left off entirely since it is always assumed.
  \end{itemize}
\item $Y$ is upper case, $x$ is lower case: Recall from probability that random variables are often written as upper case letters. This is why $Y$ is written in upper case--it is random. The $x$'s are thought of as inputs, and hence not random.
\item $i$ is the row index. We might even say how many rows we have by this cryptic addition to the equation:
\begin{displaymath}
(i = 1,\ldots,n) \quad Y_i = \alpha + \beta x_i + \epsilon_i \quad \epsilon_i \sim_{iid} N(0,\sigma^2)
\end{displaymath}
\end{itemize}
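To make the notation concrete, here is a minimal simulated sketch (the values $\alpha = 2$, $\beta = 3$, $\sigma = 1$, and $n = 50$ are made up for illustration, not taken from any class dataset): we generate data straight from the equation above and then let \texttt{lm()} estimate the parameters back.
\begin{Schunk}
\begin{Sinput}
> set.seed(1)                            # so the simulation is reproducible
> n <- 50                                # number of rows
> alpha <- 2                             # true intercept (made up)
> beta <- 3                              # true slope (made up)
> sigma <- 1                             # true error standard deviation
> x <- runif(n, 0, 10)                   # the inputs: treated as fixed, lower case
> eps <- rnorm(n, mean = 0, sd = sigma)  # the errors: iid N(0, sigma^2)
> Y <- alpha + beta * x + eps            # the model: the fit plus the error
> lm(Y ~ x)                              # estimates should land near 2 and 3
\end{Sinput}
\end{Schunk}
Try it a few times without the \texttt{set.seed()} line and watch the estimated coefficients bounce around the true values; that bouncing is exactly what the standard errors in a regression table describe.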
\section*{Is linear good enough?}
\subsection*{The triangle answers}
\begin{itemize}
\item Communication: Littlewood's principle: ``Almost all functions are almost continuous almost everywhere.'' And from Stone--Weierstrass, every continuous function is approximately equal to a polynomial. And all polynomials look like lines if you zoom in closely enough.
\item Mathematics: Taylor (\href{http://en.wikipedia.org/wiki/Brook_Taylor}{wiki}) tells us that ``everything'' can be approximated by a linear equation. So if there is a true relationship between $Y$ and $x$ that is non-linear, then we could say
\begin{displaymath}
E(Y|x) = f(x)
\end{displaymath}
(This is yet another cryptic form of our main equation. It could be written as $Y = f(x) + \epsilon$ to make it look more like our previous equation.) So Taylor's theorem says that
\begin{displaymath}
E(Y|x) \approx \alpha + \beta x
\end{displaymath}
and even tells us what $\alpha$ and $\beta$ are: expanding $f$ around a point $x_0$ gives $f(x) \approx f(x_0) + f'(x_0)(x - x_0)$, so $\beta = f'(x_0)$ and $\alpha = f(x_0) - f'(x_0)\,x_0$.
\item Data analysis: Linear is easiest to look at--so start there. Then use residuals to decide if it is good enough.
\end{itemize}
\vspace{4em}

\section*{Practice}
First get the data. For me, I use the command line, just like your grandfather did:
\begin{verbatim}
wget http://www-stat.wharton.upenn.edu/~waterman/fsw/datasets/txt/Cleaning.txt
\end{verbatim}
You of course have this newfangled device called a mouse--so use it!

Now start R. First read in the file:
\begin{Schunk}
\begin{Sinput}
> read.table("Cleaning.txt")
\end{Sinput}
\end{Schunk}
%
Oops, that generates too much output, and doesn't save it anywhere. So let's assign all this mess to a data frame.
\begin{Schunk}
\begin{Sinput}
> clean = read.table("Cleaning.txt")
\end{Sinput}
\end{Schunk}
%
Just look at what we have by typing ``clean'' again. Oops--the variable names show up as the first row of data. So let's try again:
\begin{Schunk}
\begin{Sinput}
> clean = read.table("Cleaning.txt", header = TRUE)
\end{Sinput}
\end{Schunk}
%
Checking with ``clean'' shows we have only numbers. How happy can you get?!? Now for the fun part, let's run a regression.
\begin{Schunk}
\begin{Sinput}
> lm(clean$RoomsClean ~ clean$NumberOfCrews)
\end{Sinput}
\begin{Soutput}
Call:
lm(formula = clean$RoomsClean ~ clean$NumberOfCrews)

Coefficients:
        (Intercept)  clean$NumberOfCrews  
              1.785                3.701  
\end{Soutput}
\end{Schunk}
%
Kind of a different world view than JMP: it gives the minimal amount of output possible. So to see a bit more, try
\begin{Schunk}
\begin{Sinput}
> summary(lm(clean$RoomsClean ~ clean$NumberOfCrews))
\end{Sinput}
\begin{Soutput}
Call:
lm(formula = clean$RoomsClean ~ clean$NumberOfCrews)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.9990  -4.9901   0.8046   4.0010  17.0010 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1.7847     2.0965   0.851    0.399    
clean$NumberOfCrews   3.7009     0.2118  17.472   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.336 on 51 degrees of freedom
Multiple R-squared: 0.8569,     Adjusted R-squared: 0.854 
F-statistic: 305.3 on 1 and 51 DF,  p-value: < 2.2e-16
\end{Soutput}
\end{Schunk}
%
That should look very similar to other tables you have seen. But what of pictures? Well, let's do a plot:
\begin{Schunk}
\begin{Sinput}
> plot(lm(clean$RoomsClean ~ clean$NumberOfCrews))
\end{Sinput}
\end{Schunk}
\includegraphics{class_simple-step5}
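As a last step, here is a minimal sketch to try on your own (it was not run for these notes, so no output is shown): save the fitted model in an object so you can reuse it, draw the scatterplot with the fitted line on top, and make the q-q plot of the residuals promised earlier. The \texttt{data = clean} form of \texttt{lm()} is equivalent to the \texttt{clean\$} version above, just a little tidier.
\begin{Schunk}
\begin{Sinput}
> fit <- lm(RoomsClean ~ NumberOfCrews, data = clean)  # keep the fit around
> plot(clean$NumberOfCrews, clean$RoomsClean)          # the raw scatterplot
> abline(fit)                                          # add the fitted line
> qqnorm(residuals(fit))                               # check the "N" assumption
> qqline(residuals(fit))
\end{Sinput}
\end{Schunk}

\end{document}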