Assignment Three

Due in class (at the beginning of class) on Wednesday, February 17.

Problems one and two can be written and turned in on paper. Problems three through five must be turned in in two forms.

First, I want a PDF document (no Word .doc files!) with discussion, conclusions, and any graphs you are told to (or want to) include. It is okay to include snippets of R output, but you must interpret the output and explicitly state your conclusions. (I know how to produce the correct R output and how to interpret it. Convincing me that you've learned to produce the correct output is part of what you should be doing, but you can't leave the interpretation up to me!) E-mailing me the PDF is fine.

Second, send me the R code that you used to produce your output. Clearly indicate (with comments, etc.) which input produces each output used and discussed in your PDF write-up; marking the code with problem numbers 3, 4, and 5 is necessary but not sufficient. Any analysis is sure to include several false starts or mistakes that you later fixed. Clean those out of the file you turn in to me.
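One possible way to tie each piece of code to the write-up is sketched here. This is only a layout suggestion under stated assumptions: the built-in `cars` data stands in for the assignment data, and the labels "Output 4a", "Output 4b", and "Figure 1" are made-up names, not required ones.

```r
# Hypothetical layout for the R file. `cars` ships with R and is only a
# placeholder for the assignment data; the output labels are invented.

# ---- Problem 4 ------------------------------------------------------
# Output 4a: coefficient table, discussed in the write-up under
# "Problem 4, slope and intercept".
fit_4 <- lm(dist ~ speed, data = cars)
out_4a <- coef(summary(fit_4))   # estimates, standard errors, t, p
print(out_4a)

# Output 4b: scatterplot with fitted line, Figure 1 in the write-up.
plot(dist ~ speed, data = cars, main = "Figure 1 (placeholder data)")
abline(fit_4)
```

The point of the labels is that every number or graph quoted in the PDF can be traced to one commented block of code.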

Here is an example solution to the first part of problem 5, showing what you should be turning in.

  1. Exercise 5.1 from the book.
  2. Use the above to prove that $\sum_{i=0}^{n} (y_i - \bar{y})^2 = \sum_{i=0}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=0}^{n} (\hat{y}_i - \bar{y})^2$, that is, the total sum of squares is the sum of the residual sum of squares and the regression sum of squares.
  3. Use the dataset prestige.data. Consider the women variable, which measures the percentage of women in each occupation. Use a hypothesis test and a confidence interval to determine whether the percentage of women is lower in the professional category (type = "prof") than in the white collar category (type = "wc").
  4. Analyze the data in sahlins.data by regressing Acres/Gardener on Consumers/Gardener. In a society characterized by "primitive communism," the social product of the village would be redistributed according to need, while each household would work in proportion to its capacity, implying a regression slope of 0. In contrast, in a society in which redistribution is purely through the market, each household should have to work in proportion to its own consumption needs, suggesting a positive regression slope and an intercept of 0. Interpret the results of the regression in light of these observations. Examine and interpret the slope, intercept, standard error of the regression, and r². Do the results change if the fourth household is deleted? (Using data.frame[-4,] is probably the most straightforward way to omit the fourth observation for now; we'll see a better way of doing this later.) Plot the regression lines calculated with and without the fourth household on a scatterplot of the data. Does either regression do a good job of summarizing the relationship between Acres/Gardener and Consumers/Gardener?
  5. The file anscombe.data presents data on the U.S. states and Washington, D.C. for state per-capita public school expenditures (in dollars), per-capita annual income (in dollars), the proportion of residents under the age of 18 (per 1000), and the proportion of the population residing in urban areas (per 1000). Regress school expenditures (y) on each of income, proportion under 18, and proportion urban. Plot the least-squares line for each regression on a scatterplot of the data. Does the line adequately capture the relationship between the variables shown in the plot? In each case, examine and interpret the slope, intercept, regression standard error, and r².
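The data.frame[-4,] idiom mentioned in problem 4, and the idea of drawing two fitted lines on one scatterplot, can be sketched as follows. This is a minimal illustration on invented numbers, not the sahlins data; the names `toy`, `toy_drop4`, `fit_all`, and `fit_drop4` are hypothetical, and interpreting real output is still up to you.

```r
# Toy data invented for illustration only (NOT the assignment data).
toy <- data.frame(
  x = c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5),
  y = c(2.1, 2.9, 4.2, 9.0, 6.1, 6.8)
)

# A negative row index drops that row; the empty column slot keeps
# every column.
toy_drop4 <- toy[-4, ]

# Fit the same simple regression with and without the fourth row.
fit_all   <- lm(y ~ x, data = toy)
fit_drop4 <- lm(y ~ x, data = toy_drop4)

# Scatterplot of all points, with both least-squares lines overlaid.
plot(y ~ x, data = toy, main = "Toy data: with and without row 4")
abline(fit_all, lty = 1)      # solid: all rows
abline(fit_drop4, lty = 2)    # dashed: row 4 omitted
legend("topleft", legend = c("all rows", "row 4 omitted"), lty = c(1, 2))

# Quantities the problems ask about, pulled from the model summary.
s_all <- summary(fit_all)
print(coef(s_all))        # slope and intercept with standard errors
print(s_all$sigma)        # standard error of the regression
print(s_all$r.squared)    # r-squared
```

Note that `toy[-4, ]` returns a new data frame and leaves `toy` unchanged, so both fits remain available for comparison.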