Problem 1.

Longley’s (1967) data set is a classic example of collinear data. The data (see the table below) consist of a response variable Y and six predictor variables X_1, …, X_6.

Y X1 X2 X3 X4 X5 X6

60323 830 234289 2356 1590 107608 1947

61122 885 259426 2325 1456 108632 1948

60171 882 258054 3682 1616 109773 1949

61187 895 284599 3351 1650 110929 1950

63221 962 328975 2099 3099 112075 1951

63639 981 346999 1932 3594 113270 1952

64989 990 365385 1870 3547 115094 1953

63761 1000 363112 3578 3350 116219 1954

66019 1012 397469 2904 3048 117388 1955

67857 1046 419180 2822 2857 118734 1956

68169 1084 442769 2936 2798 120445 1957

66513 1108 444546 4681 2637 121950 1958

68655 1126 482704 3813 2552 123366 1959

69564 1142 502601 3931 2514 125368 1960

69331 1157 518173 4806 2572 127852 1961

70551 1169 554894 4007 2827 130081 1962

The initial model

Y=β_0+β_1 X_1+⋯+β_6 X_6+ε, (1)

in terms of the original variables, can be written in terms of the standardized variables as

Ỹ = β_0 + θ_1 X̃_1 + ⋯ + θ_6 X̃_6 + ε′. (2)

Fit the model (2) to the data using least squares. What conclusion can you draw from the data?

From the results you obtained from the model in (2), obtain the least squares estimated regression coefficients in model (1).

Now fit the model in (1) to the data using least squares and verify that the results are consistent with those obtained above.
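The three steps above (fit (2), convert back to (1), check against a direct fit of (1)) can be sketched in Python with NumPy. This is only an illustrative sketch: it assumes the standardization is "subtract the mean and divide by the sample standard deviation (divisor n − 1)", and the variable names are hypothetical. Because both sides of (2) are centered, the least squares intercept in (2) is zero, so the sketch fits (2) without an intercept column.

```python
import numpy as np

# Table from Problem 1; columns are Y, X1, ..., X6.
rows = """
60323 830 234289 2356 1590 107608 1947
61122 885 259426 2325 1456 108632 1948
60171 882 258054 3682 1616 109773 1949
61187 895 284599 3351 1650 110929 1950
63221 962 328975 2099 3099 112075 1951
63639 981 346999 1932 3594 113270 1952
64989 990 365385 1870 3547 115094 1953
63761 1000 363112 3578 3350 116219 1954
66019 1012 397469 2904 3048 117388 1955
67857 1046 419180 2822 2857 118734 1956
68169 1084 442769 2936 2798 120445 1957
66513 1108 444546 4681 2637 121950 1958
68655 1126 482704 3813 2552 123366 1959
69564 1142 502601 3931 2514 125368 1960
69331 1157 518173 4806 2572 127852 1961
70551 1169 554894 4007 2827 130081 1962
"""
data = np.array(rows.split(), dtype=float).reshape(-1, 7)
y, X = data[:, 0], data[:, 1:]

# Standardize (assumed convention: unit sample standard deviation, ddof=1).
y_bar, s_y = y.mean(), y.std(ddof=1)
x_bar, s_x = X.mean(axis=0), X.std(axis=0, ddof=1)
y_std = (y - y_bar) / s_y
X_std = (X - x_bar) / s_x

# Model (2): least squares on the standardized variables (intercept is zero).
theta, *_ = np.linalg.lstsq(X_std, y_std, rcond=None)

# Back-transform to model (1): beta_j = theta_j * s_y / s_j,
# beta_0 = y_bar - sum_j beta_j * x_bar_j.
beta = theta * s_y / s_x
beta0 = y_bar - beta @ x_bar

# Model (1) fitted directly, for comparison.
design = np.column_stack([np.ones(len(y)), X])
beta_direct, *_ = np.linalg.lstsq(design, y, rcond=None)

print("theta (model 2):     ", theta)
print("beta from theta:     ", np.r_[beta0, beta])
print("beta from direct fit:", beta_direct)
```

The raw design matrix is badly conditioned, so the two sets of coefficients should agree only up to rounding error, not digit for digit.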

Compute the correlation matrix of the six predictor variables and the corresponding scatter plot matrix. Do you see any evidence of collinearity?
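One possible way to compute the correlation matrix and list the strongly correlated pairs is sketched below. The helper name `correlated_pairs` and the cutoff 0.9 are illustrative assumptions, not part of the problem; `X` is assumed to be an array with one column per predictor. Pairwise correlations can miss collinearity that involves three or more variables, so this is only a first check.

```python
import numpy as np

def correlated_pairs(X, threshold=0.9):
    """Return the correlation matrix of the columns of X and the
    pairs (i, j, r), numbered from 1, with |r| >= threshold.
    The threshold is an arbitrary illustrative cutoff."""
    R = np.corrcoef(X, rowvar=False)
    p = R.shape[0]
    pairs = [(i + 1, j + 1, R[i, j])
             for i in range(p) for j in range(i + 1, p)
             if abs(R[i, j]) >= threshold]
    return R, pairs
```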

Compute the corresponding principal components (PCs), their sample variances, and the condition number. How many different sets of collinearity exist in the data? Which variables are involved in each set?
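A minimal sketch of the principal component computation follows. It assumes the PCs are taken from the correlation matrix of the predictors and that the condition number is defined as the square root of the ratio of the largest to the smallest eigenvalue; other texts use other definitions. The function name is hypothetical and `X` is assumed to hold one predictor per column.

```python
import numpy as np

def principal_components(X):
    """Eigen-decomposition of the predictor correlation matrix.

    Returns eigenvalues (descending), eigenvectors (as columns),
    PC scores, and the condition number sqrt(lambda_max / lambda_min).
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardized predictors
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)               # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs      # sample variance of column j equals eigvals[j]
    kappa = np.sqrt(eigvals[0] / eigvals[-1])
    return eigvals, eigvecs, scores, kappa
```

Each eigenvalue close to zero points to one collinear set, and the large entries of the matching eigenvector indicate which variables take part in it.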

Based on the number of PCs you choose to retain, obtain the PC estimates of the coefficients in (1) and (2).
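The PC estimates can be sketched as follows, under the same assumed standardization (sample standard deviation with divisor n − 1). The function name and the argument `n_keep` (the number of retained PCs) are hypothetical: standardized Y is regressed on the retained PCs, the result is mapped back to the coefficients of (2), and those are rescaled to the coefficients of (1).

```python
import numpy as np

def pc_regression(X, y, n_keep):
    """PC estimates of the coefficients in (2) and (1), keeping n_keep PCs."""
    x_bar, s_x = X.mean(axis=0), X.std(axis=0, ddof=1)
    y_bar, s_y = y.mean(), y.std(ddof=1)
    Z = (X - x_bar) / s_x
    y_std = (y - y_bar) / s_y

    eigvals, V = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    V = V[:, np.argsort(eigvals)[::-1]]   # largest-variance PC first
    V_keep = V[:, :n_keep]

    # Regress standardized Y on the retained PC scores.
    scores = Z @ V_keep
    alpha, *_ = np.linalg.lstsq(scores, y_std, rcond=None)

    theta = V_keep @ alpha                # coefficients of model (2)
    beta = theta * s_y / s_x              # slopes of model (1)
    beta0 = y_bar - beta @ x_bar          # intercept of model (1)
    return theta, beta0, beta
```

Keeping all PCs reproduces the ordinary least squares estimates, which is a convenient sanity check.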

Using the ridge method, construct the ridge trace. What value of k do you recommend for estimating the parameters in (1) and (2)? Use the chosen value of k to compute the ridge estimates of the regression coefficients in (1) and (2).
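A sketch of the ridge computation in correlation form is given below. It assumes the ridge estimates of (2) are obtained by solving (R + kI)θ = r, where R is the correlation matrix of the predictors and r is the vector of correlations between each predictor and Y. The function names and the grid of k values are hypothetical.

```python
import numpy as np

def ridge_trace(X, y, ks):
    """Ridge estimates of the coefficients in (2), one row per value of k."""
    p = X.shape[1]
    corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    R, r_xy = corr[:p, :p], corr[:p, p]
    return np.array([np.linalg.solve(R + k * np.eye(p), r_xy) for k in ks])

def to_original_scale(theta, X, y):
    """Convert coefficients of (2) into the intercept and slopes of (1)."""
    beta = theta * y.std(ddof=1) / X.std(axis=0, ddof=1)
    beta0 = y.mean() - beta @ X.mean(axis=0)
    return beta0, beta
```

Calling `ridge_trace(X, y, np.linspace(0, 0.1, 51))` and plotting each column of the result against k gives the ridge trace; the row for k = 0 is the ordinary least squares fit of (2).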

Compare the estimates you obtained by the three methods. Which one would you recommend? Explain.

Problem 2.

Table 2.1

X1 X2 X3 X4 X5 X6 X7 X8 X9 Y

4.918 1.000 3.472 0.998 1.0 7 4 42 0 25.90

5.021 1.000 3.531 1.500 2.0 7 4 62 0 29.50

4.543 1.000 2.275 1.175 1.0 6 3 40 0 27.90

4.557 1.000 4.050 1.232 1.0 6 3 54 0 25.90

5.060 1.000 4.455 1.121 1.0 6 3 42 0 29.90

3.891 1.000 4.455 0.988 1.0 6 3 56 0 29.90

5.898 1.000 5.850 1.240 1.0 7 3 51 1 30.90

5.604 1.000 9.520 1.501 0.0 6 3 32 0 28.90

5.828 1.000 6.435 1.225 2.0 6 3 32 0 35.90

5.300 1.000 4.988 1.552 1.0 6 3 30 0 31.50

6.271 1.000 5.520 0.975 1.0 5 2 30 0 31.00

5.959 1.000 6.666 1.121 2.0 6 3 32 0 30.90

5.050 1.000 5.000 1.020 0.0 5 2 46 1 30.00

8.246 1.500 5.150 1.664 2.0 8 4 50 0 36.90

6.697 1.500 6.902 1.488 1.5 7 3 22 1 41.90

7.784 1.500 7.102 1.376 1.0 6 3 17 0 40.50

9.038 1.000 7.800 1.500 1.5 7 3 23 0 43.90

5.989 1.000 5.520 1.256 2.0 6 3 40 1 37.90

7.542 1.500 5.000 1.690 1.0 6 3 22 0 37.90

8.795 1.500 9.890 1.820 2.0 8 4 50 1 44.50

6.083 1.500 6.727 1.652 1.0 6 3 44 0 37.90

8.361 1.500 9.150 1.777 2.0 8 4 48 1 38.90

8.140 1.000 8.000 1.504 2.0 7 3 3 0 36.90

9.142 1.500 7.326 1.831 1.5 8 4 31 0 45.80

Table 2.2: List of Variables for Data in Table 2.1

Variable Definition

Y Sale price of the house in thousands of dollars

X1 Taxes (local, county, school) in thousands of dollars

X2 Number of bathrooms

X3 Lot size (in thousands of square feet)

X4 Living space (in thousands of square feet)

X5 Number of garage stalls

X6 Number of rooms

X7 Number of bedrooms

X8 Age of the home (years)

X9 Number of fireplaces

Property Valuation: Scientific mass appraisal is a technique in which linear regression methods are applied to the problem of property valuation. The objective in scientific mass appraisal is to predict the sale price of a home from selected physical characteristics of the building and the taxes (local, school, county) paid on the building. Twenty-four observations were obtained from Multiple Listing (Vol. 87) for Erie, PA, which is designated as Area 12 in the directory. These data (Table 2.1) were originally presented by Narula and Wellington (1977). The list of variables is given in Table 2.2. Answer the following questions, in each case justifying your answer by appropriate analyses.

In a fitted regression model that relates the sale price to taxes and building characteristics, would you include all the variables?

A veteran real estate agent has suggested that local taxes, number of rooms, and age of the house would adequately describe the sale price. Do you agree?
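One way to examine a claim like this is a partial F-test of the reduced model against the full model. The sketch below is an assumption about how one might proceed, not the required method; the function names are hypothetical, `X_full` is assumed to hold X1 to X9 as columns, and `keep` lists the zero-based column indices of the reduced model.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def partial_f(X_full, y, keep):
    """F statistic for H0: predictors not listed in `keep` have zero coefficients.

    Returns (F, numerator df, denominator df).
    """
    n, p = X_full.shape
    q = len(keep)
    rss_full = rss(X_full, y)
    rss_reduced = rss(X_full[:, keep], y)
    df_num, df_den = p - q, n - p - 1
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den
```

For the agent's model (X1, X6, X8) one would call `partial_f(X_full, y, [0, 5, 7])` and compare the statistic with a critical value of the F distribution on 6 and 14 degrees of freedom.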

A real estate expert who was brought into the project reasoned as follows: The selling price of a home is determined by its desirability, and this is certainly a function of the physical characteristics of the building. This overall assessment is reflected in the local taxes paid by the homeowner; consequently, the best predictor of sale price is the local taxes. The building characteristics are therefore redundant in a regression equation that includes local taxes. An equation that relates sale price solely to local taxes would be adequate. Examine this assertion by examining several models. Do you agree? Present what you consider to be the most adequate model or models for predicting the sale price of homes in Erie, PA.
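As a starting point for comparing candidate models, the sketch below fits the three models named in the questions to Table 2.1 and reports the residual sum of squares, adjusted R², and Mallows' Cp. The choice of criteria and the variable names are assumptions made for illustration; other subsets can be added to `candidates`. The first data row is read with 0.998 and 1.0 as separate entries for X4 and X5.

```python
import numpy as np

# Table 2.1; columns are X1, ..., X9, Y.
rows = """
4.918 1.000 3.472 0.998 1.0 7 4 42 0 25.90
5.021 1.000 3.531 1.500 2.0 7 4 62 0 29.50
4.543 1.000 2.275 1.175 1.0 6 3 40 0 27.90
4.557 1.000 4.050 1.232 1.0 6 3 54 0 25.90
5.060 1.000 4.455 1.121 1.0 6 3 42 0 29.90
3.891 1.000 4.455 0.988 1.0 6 3 56 0 29.90
5.898 1.000 5.850 1.240 1.0 7 3 51 1 30.90
5.604 1.000 9.520 1.501 0.0 6 3 32 0 28.90
5.828 1.000 6.435 1.225 2.0 6 3 32 0 35.90
5.300 1.000 4.988 1.552 1.0 6 3 30 0 31.50
6.271 1.000 5.520 0.975 1.0 5 2 30 0 31.00
5.959 1.000 6.666 1.121 2.0 6 3 32 0 30.90
5.050 1.000 5.000 1.020 0.0 5 2 46 1 30.00
8.246 1.500 5.150 1.664 2.0 8 4 50 0 36.90
6.697 1.500 6.902 1.488 1.5 7 3 22 1 41.90
7.784 1.500 7.102 1.376 1.0 6 3 17 0 40.50
9.038 1.000 7.800 1.500 1.5 7 3 23 0 43.90
5.989 1.000 5.520 1.256 2.0 6 3 40 1 37.90
7.542 1.500 5.000 1.690 1.0 6 3 22 0 37.90
8.795 1.500 9.890 1.820 2.0 8 4 50 1 44.50
6.083 1.500 6.727 1.652 1.0 6 3 44 0 37.90
8.361 1.500 9.150 1.777 2.0 8 4 48 1 38.90
8.140 1.000 8.000 1.504 2.0 7 3 3 0 36.90
9.142 1.500 7.326 1.831 1.5 8 4 31 0 45.80
"""
data = np.array(rows.split(), dtype=float).reshape(-1, 10)
X, y = data[:, :9], data[:, 9]
n, p = X.shape

def fit_rss(cols):
    """Residual sum of squares for the model with an intercept and columns `cols`."""
    design = np.column_stack([np.ones(n), X[:, cols]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

tss = float(((y - y.mean()) ** 2).sum())
sigma2_full = fit_rss(list(range(p))) / (n - p - 1)  # error variance from the full model

candidates = {
    "taxes only (X1)": [0],
    "X1, X6, X8": [0, 5, 7],
    "all nine predictors": list(range(p)),
}
summary = {}
for name, cols in candidates.items():
    k = len(cols) + 1                                 # parameters, including intercept
    r = fit_rss(cols)
    adj_r2 = 1 - (r / (n - k)) / (tss / (n - 1))
    cp = r / sigma2_full - (n - 2 * k)                # Mallows' Cp
    summary[name] = (r, adj_r2, cp)
    print(f"{name:22s} RSS={r:9.3f}  adjR2={adj_r2:.3f}  Cp={cp:6.2f}")
```

By construction Cp for the full model equals its number of parameters, so only the values for the smaller models are informative.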