\documentclass[14pt]{extarticle}
\usepackage{hyperref}
\renewcommand{\baselinestretch}{1.3}

\usepackage{Sweave}
\begin{document}

\title{Class: Taylor}
\maketitle

\section*{Story time: Dan Willingham, the Cog Psych}

\begin{itemize}
\item Willingham: Professor of cognitive psychology at the University of Virginia
\item \href{http://www.amazon.com/Why-Dont-Students-Like-School/dp/0470279303/ref=sr_1_1?ie=UTF8&s=books&qid=1263385854&sr=1-1}{Why Don't Students Like School?}
\item We know lots about psychology now
\begin{itemize}
\item Perry preschool project is still a state of the art education experiment
\item There are few other good controlled experiments
\end{itemize}
\item But we can use what we know about psychology to inform education
\item That is the approach of this book
\item For example: He pushes stories
\begin{itemize}
\item People relate to humans with special hardware in their brain
(cards vs liar):
\begin{itemize}
\item with cards: the rule is, red on one side means, even number on
the other side.  Which do you check:
\begin{itemize}
\item A red card?
\item An even card?
\item An odd card?  % about 3/4 got these right
\end{itemize}
\item with people: Drink means over 21.  Which do you check:
\begin{itemize}
\item A drinker?
\item An old person?
\item A young person?  % 100% got these three questions right
\end{itemize}
\end{itemize}
So: try to tell stories.
\item People pay attention at the start of class--new material is always
interesting.  So the opening doesn't need to connect to the day's theme.
\item Hence, when I start classes with a short story, blame Dan, the cognitive
psychologist.
\end{itemize}
\end{itemize}

\begin{itemize}
\item Homework, a ``cases'' project, and a final
\item Email statistics.assignments@gmail.com for questions and turning
in assignments
\begin{itemize}
\item Both the TA and I get messages sent here, so you reach
whoever is currently online sooner.
\end{itemize}
\item Books
\begin{itemize}
\item  Introductory Statistics with R by Peter Dalgaard, 2nd edition, ISBN 978-0-387-79053-4, Springer 2008.
\item  Linear Models with R by Julian J. Faraway, ISBN
1-58488-425-8, Chapman \& Hall/CRC Press 2005.
\end{itemize}
\item Software R
\begin{itemize}
\item It's free and available on OS X/Linux/Windows.  Everywhere!
Personally, I use it most often on my Android tablet.
\item It is what production level statisticians use
\item It had its own article in the NYT!  (Ok, it shared a bit with
``big data.'')
\item Friday at 11am, Kory will give an intro to using R.  (And he
can help if you are having problems.)
\end{itemize}
\end{itemize}

\section*{The triangle of statistics}

Statistics has three major pieces
\begin{itemize}
\item mathematics
\item data analysis (i.e. science)
\item communication
\end{itemize}
To be good, you need all three (or at least two of the three).  Doing
only one isn't as powerful:
\begin{itemize}
\item only mathematics: Terence Tao, maybe the smartest guy on the
planet.  I would have recommended him for the genius award--but he
already won one.
\item only data analysis: called a ``masters level statistician.''
Employable at big pharma.  But low pay.
\item only communication: called ``bloggers.''  Basically unpaid!
\end{itemize}
My goal is to make sure you can make more money than any of these pure
states!

So
\begin{itemize}
\item MBAs are closer to the communication corner
\item math undergrads are closer to the math corner
\item stat concentrators are closer to the data analysis corner
\end{itemize}
but by the end, I want you all to have moved a bit towards the middle.

I'll present more mathematics and data analysis since that is what I
know best.

\vspace{2em}
\section*{Today's Topic: Simple linear regression}

\section*{Review of the standard linear model}

The standard linear regression model is:
\begin{displaymath}
Y_i = \alpha + \beta x_i + \epsilon_i \quad \epsilon_i \sim_{iid}
N(0,\sigma^2)
\end{displaymath}
You will see this equation written in almost any research paper which
uses data.  The names are often changed, but it is there somewhere.
For example, it is basically equation 2.17 in the Berndt reading.
The entire chapter is designed to motivate that one equation.
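One way to get a feel for the model is to simulate from it and watch
the fit recover the truth.  A minimal R sketch (the true $\alpha$,
$\beta$, and $\sigma$ below are made up purely for illustration):

```r
# Simulate one data set from Y_i = alpha + beta * x_i + eps_i,
# then fit a line and compare the estimates to the truth.
set.seed(1)                            # reproducible draws
n     <- 50
alpha <- 2                             # illustrative true intercept
beta  <- 3                             # illustrative true slope
sigma <- 1                             # illustrative error sd
x   <- runif(n, 0, 10)                 # fixed inputs (lower-case x)
eps <- rnorm(n, mean = 0, sd = sigma)  # iid N(0, sigma^2) errors
Y   <- alpha + beta * x + eps          # the random response (upper-case Y)
coef(lm(Y ~ x))                        # estimates should land near (2, 3)
```

Rerunning without \texttt{set.seed} gives a new $\epsilon$ each time,
which is a good way to see what ``random'' buys you.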

Let's break it down into pieces.
\begin{itemize}
\item The fit:
\begin{displaymath}
Y_i = \underbrace{\alpha + \beta x_i}_{\hbox{the fit}} + \epsilon_i \quad \epsilon_i \sim_{iid}
N(0,\sigma^2)
\end{displaymath}
The most fun part is the ``fit.''  It describes the relationship
between $x$ and $Y$.  This version describes a linear relationship.

\item Residuals / errors:
\begin{displaymath}
Y_i = \alpha + \beta x_i + \underbrace{\epsilon_i \quad \epsilon_i
\sim_{iid} N(0,\sigma^2)}_{\hbox{The residuals}}
\end{displaymath}
The residuals (aka errors) themselves.  Describing them,
looking at them, investigating them is the primary activity of a
statistician.  It is all about error!
\begin{itemize}
\item The ``i.d.'': The i.i.d. part can be broken into two pieces, ``i.''
and ``i.d.''  The easier piece is the ``i.d.,'' identically distributed.
It means each error looks like any other error.

\item The ``i.'': The first ``i'' in IID is for independence.  We will
spend an entire class on this piece.  It is the most important
assumption in the entire model.

\item the ``N'': Means normal.  Look at a q-q plot to check it.  It is
easy to check (hence we cover it in intro classes).  We won't discuss
it here since I assume you already know how to check it.

\item Style: iid = i.i.d. = IID = I.I.D. = independent and identically
distributed.  It is often even left off entirely since it is always
assumed.
\end{itemize}

\item $Y$ is upper case, $x$ is lower case: Recall from probability that
random variables are often written as upper case letters.  This is
why $Y$ is written as an upper case--it is random.  The $x$ are
thought of as inputs, and hence not random.

\item $i$ is the row index.  We might even say how many rows we have
by a cryptic addition to the equation:
\begin{displaymath}
Y_i = \alpha + \beta x_i + \epsilon_i \quad \epsilon_i \sim_{iid}
N(0,\sigma^2) \quad i = 1, \ldots, n
\end{displaymath}
\end{itemize}
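Since so much of the statistician's work happens through the
residuals, here is a quick reminder of the mechanics in R.  The data
are made up just so the sketch runs on its own; with real data,
\texttt{fit} would be your fitted model:

```r
# Toy data purely so this sketch is self-contained
set.seed(2)
x   <- 1:40
Y   <- 5 + 0.5 * x + rnorm(40)
fit <- lm(Y ~ x)

r <- residuals(fit)
qqnorm(r); qqline(r)   # checks the "N": points should hug the line
plot(fitted(fit), r)   # checks "i.d.": look for no pattern in the spread
```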
\section*{Is linear good enough?}

\begin{itemize}
\item Communication: Littlewood's principle: ``Almost all functions
are almost continuous almost everywhere.''  And from
Stone-Weierstrass, all continuous functions are approximately equal to
a polynomial.  And all polynomials look like lines if you investigate
them close enough to a zero.
\item Mathematics: Taylor
(\href{http://en.wikipedia.org/wiki/Brook_Taylor}{wiki}) tells us
that ``everything'' can be approximated by a linear equation.  So if
there is a true relationship between $Y$ and $x$ that is non-linear,
then we could say
\begin{displaymath}
E(Y|x) = f(x)
\end{displaymath}
(This is yet another cryptic form of our main equation.  It could be
written as $Y = f(x) + \epsilon$ to make it look more like our
previous equation.)  So Taylor's theorem says that
\begin{displaymath}
E(Y|x) \approx \alpha + \beta x
\end{displaymath}
and even tells us what $\alpha$ and $\beta$ are.
\item Data analysis: Linear is easiest to look at--so start there.
Then use residuals to decide if it is good enough.
\end{itemize}
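Concretely, a first-order Taylor expansion of $f$ around a point $x_0$ gives
\begin{displaymath}
E(Y|x) = f(x) \approx f(x_0) + f'(x_0)(x - x_0)
= \underbrace{f(x_0) - f'(x_0)\,x_0}_{\hbox{$\alpha$}}
+ \underbrace{f'(x_0)}_{\hbox{$\beta$}}\,x
\end{displaymath}
so the intercept and slope of the approximating line come straight
from $f$ and its derivative at the expansion point.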

\vspace{4em}

\section*{Practice}

First get the data.  For me, I use the command line, just like your grandfather did:
\begin{verbatim}
wget http://www-stat.wharton.upenn.edu/~waterman/fsw/datasets/txt/Cleaning.txt
\end{verbatim}
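If the command line isn't your thing, R can fetch the file itself
(same URL; a sketch using base R's \texttt{download.file}):

```r
# Fetch the data file from within R; mode = "wb" avoids Windows
# mangling the line endings.
url <- "http://www-stat.wharton.upenn.edu/~waterman/fsw/datasets/txt/Cleaning.txt"
download.file(url, destfile = "Cleaning.txt", mode = "wb")
```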
You of course have this newfangled device called a mouse--so use it!  Now start R.
\begin{Schunk}
\begin{Sinput}
> read.table("Cleaning.txt")
\end{Sinput}
\end{Schunk}
%
Oops, that generates too much output, and doesn't put it anywhere.
So let's assign all this mess to a data frame.
\begin{Schunk}
\begin{Sinput}
> clean <- read.table("Cleaning.txt")
\end{Sinput}
\end{Schunk}
%
Just look at what we have by typing ``clean'' again.  Oops--we have
the first row with the names of the variables in it.  So let's try again:
\begin{Schunk}
\begin{Sinput}
> clean <- read.table("Cleaning.txt", header=TRUE)
\end{Sinput}
\end{Schunk}
%
Checking with ``clean'' shows we only have numbers.  How happy can
you get?!? Now for the fun part, let's run a regression.
\begin{Schunk}
\begin{Sinput}
> lm(clean$RoomsClean ~ clean$NumberOfCrews)
\end{Sinput}
\begin{Soutput}
Call:
lm(formula = clean$RoomsClean ~ clean$NumberOfCrews)

Coefficients:
        (Intercept)  clean$NumberOfCrews  
              1.785                3.701  
\end{Soutput}
\end{Schunk}
%
Kinda a different world view than JMP.  It just gives the minimal
amount of output possible.  So to see a bit more, try
\begin{Schunk}
\begin{Sinput}
> summary(lm(clean$RoomsClean ~ clean$NumberOfCrews))
\end{Sinput}
\begin{Soutput}
Call:
lm(formula = clean$RoomsClean ~ clean$NumberOfCrews)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.9990  -4.9901   0.8046   4.0010  17.0010 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1.7847     2.0965   0.851    0.399    
clean$NumberOfCrews   3.7009     0.2118  17.472   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.336 on 51 degrees of freedom
Multiple R-squared: 0.8569,	Adjusted R-squared: 0.854
F-statistic: 305.3 on 1 and 51 DF,  p-value: < 2.2e-16
\end{Soutput}
\end{Schunk}
%
That should look very similar to other tables you have seen.  But what
of pictures?  Well, let's do a plot:
\begin{Schunk}
\begin{Sinput}
> plot(lm(clean$RoomsClean ~ clean$NumberOfCrews))
\end{Sinput}
\end{Schunk}
\includegraphics{class_simple-step5}
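The \texttt{plot(lm(...))} call gives diagnostic plots of the
residuals.  For the plain picture of the data with the fitted line on
top, the usual pair is \texttt{plot} followed by \texttt{abline}.  A
sketch, with made-up numbers standing in for the real file:

```r
# Toy stand-in for the real Cleaning.txt data (illustrative numbers only)
clean <- data.frame(NumberOfCrews = c(2, 4, 7, 10, 12, 16),
                    RoomsClean    = c(8, 17, 29, 38, 46, 60))
fit <- lm(RoomsClean ~ NumberOfCrews, data = clean)
plot(clean$NumberOfCrews, clean$RoomsClean)  # scatter of the raw data
abline(fit)                                  # overlay the fitted line
```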
\end{document}