# 随笔记录 ## Chapter 4 - Beyond OLS

(2014-05-07 08:54:39) 下一个

### Regression with SAS Chapter 4 - Beyond OLS

Chapter Outline
4.1 Robust Regression Methods
4.1.1 Regression with Robust Standard Errors
4.1.2 Using the Proc Genmod for Clustered Data
4.1.3 Robust Regression
4.1.4 Quantile Regression
4.2 Constrained Linear Regression
4.3 Regression with Censored or Truncated Data
4.3.1 Regression with Censored Data
4.3.2 Regression with Truncated Data
4.4 Regression with Measurement Error
4.5 Multiple Equation Regression Models
4.5.1 Seemingly Unrelated Regression
4.5.2 Multivariate Regression
4.6 Summary

In this chapter we will go into various commands that go beyond OLS. This chapter is a bit different from the others in that it covers a number of different concepts, some of which may be new to you. These extensions, beyond OLS, have much of the look and feel of OLS but will provide you with additional tools to work with linear models.

The topics will include robust regression methods, constrained linear regression, regression with censored and truncated data, regression with measurement error, and multiple equation models.

4.1 Robust Regression Methods

It seems to be a rare dataset that meets all of the assumptions underlying multiple regression. We know that failure to meet assumptions can lead to biased estimates of coefficients and especially biased estimates of the standard errors. This fact explains a lot of the activity in the development of robust regression methods.

The idea behind robust regression methods is to make adjustments in the estimates that take into account some of the flaws in the data itself. We are going to look at three robust methods: regression with robust standard errors, regression with clustered data, robust regression, and quantile regression.

Before we look at these approaches, let's look at a standard OLS regression using the elementary school academic performance index (elemapi2.dta) dataset. We will look at a model that predicts the api 2000 scores using the average class size in K through 3 (acs_k3), average class size 4 through 6 (acs_46), the percent of fully credentialed teachers (full), and the size of the school (enroll). First let's look at the descriptive statistics for these variables.  Note the missing values for acs_k3 and acs_k6.

` proc means data = "c:sasregelemapi2" mean std max min;   var api00 acs_k3 acs_46 full enroll; run;`
`The MEANS ProcedureVariable            Mean         Std Dev         Minimum         Maximum------------------------------------------------------------------------api00        647.6225000     142.2489610     369.0000000     940.0000000acs_k3        19.1608040       1.3686933      14.0000000      25.0000000acs_46        29.6851385       3.8407840      20.0000000      50.0000000full          84.5500000      14.9497907      37.0000000     100.0000000enroll       483.4650000     226.4483847     130.0000000         1570.00------------------------------------------------------------------------`

Below we see the regression predicting api00 from acs_k3 acs_46 full and enroll. We see that all of the variables are significant except for acs_k3.

` proc reg data = "c:sasregelemapi2";   model api00 = acs_k3 acs_46 full enroll ;  run;`
`The REG ProcedureModel: MODEL1Dependent Variable: api00                             Analysis of Variance                                    Sum of           MeanSource                   DF        Squares         Square    F Value    Pr > FModel                     4        3071909         767977      61.01    <.0001Error                   390        4909501          12588Corrected Total         394        7981410Root MSE            112.19832    R-Square     0.3849Dependent Mean      648.65063    Adj R-Sq     0.3786Coeff Var            17.29719                        Parameter Estimates                     Parameter       StandardVariable     DF       Estimate          Error    t Value    Pr > |t|Intercept     1       -5.20041       84.95492      -0.06      0.9512acs_k3        1        6.95438        4.37110       1.59      0.1124acs_46        1        5.96601        1.53105       3.90      0.0001full          1        4.66822        0.41425      11.27      <.0001enroll        1       -0.10599        0.02695      -3.93      <.0001`

Since the regression procedure is interactive and we haven't issued the quit command, we can test both of the class size variables, and we find the overall test of these two variables is significant.

`test acs_k3 = acs_46 = 0; run;`
`       Test 1 Results for Dependent Variable api00                                MeanSource             DF         Square    F Value    Pr > FNumerator           2         139437      11.08    <.0001Denominator       390          12588`

Here is the residual versus fitted plot for this regression. Notice that the pattern of the residuals is not exactly as we would hope.  The spread of the residuals is somewhat wider toward the middle right of the graph than at the left, where the variability of the residuals is somewhat smaller, suggesting some heteroscedasticity.

`plot r.*p.; run; `

Here is the index plot of Cook's D for this regression. We see 4 points that are somewhat high in both their leverage and their residuals.

`  plot cookd.*obs.; run; `

None of these results are dramatic problems, but the plot of residual vs. predicted value suggests that there might be some outliers and some possible heteroscedasticity and the index plot of Cook's D shows some points in the upper right quadrant that could be influential. We might wish to use something other than OLS regression to estimate this model. In the next several sections we will look at some robust regression methods.

4.1.1 Regression with Robust Standard Errors

The SAS proc reg  includes an option called acov in the model statement for estimating the asymptotic covariance matrix of the estimates under the hypothesis of heteroscedasticity. The standard error obtained from the asymptotic covariance matrix is considered to be more  robust and can deal with a collection of minor concerns about failure to meet assumptions, such as minor problems about normality, heteroscedasticity, or some observations that exhibit large residuals, leverage or influence. For such minor problems, the standard error based on acov  may effectively deal with these concerns.

With the acov option, the point estimates of the coefficients are exactly the same as in ordinary OLS, but we will calculate the standard errors based on the asymptotic covariance matrix. Here is the same regression as above using the acov option. We also use SAS ODS (Output Delivery System)  to output the parameter estimates along with the asymptotic covariance matrix. We calculated the robust standard error in a data step and merged them with the parameter estimate using proc sql and created the t-values and corresponding probabilities.  Note the changes in the standard errors and t-tests (but no change in the coefficients). In this particular example, using robust standard errors did not change any of the conclusions from the original OLS regression. We should also mention that the robust standard error has been adjusted for the sample size correction.

`proc reg data = "c:sasregelemapi2";       model api00 = acs_k3 acs_46 full enroll /acov;       ods output  ACovEst = estcov;       ods output ParameterEstimates=pest; run; quit; data temp_dm;   set estcov;   drop model dependent;   array a(5) intercept acs_k3 acs_46 full enroll;   array b(5) std1-std5;   b(_n_) = sqrt((395/390)*a(_n_));   std = max(of std1-std5);   keep variable std; run; proc sql;   select pest.variable, estimate, stderr, tvalue, probt, std as robust_stderr,           estimate/robust_stderr as tvalue_rb,           (1 - probt(abs(estimate/robust_stderr), 394))*2 as probt_rb    from pest, temp_dm   where pest.variable=temp_dm.variable; quit;`
`                                                      robust_Variable   Estimate     StdErr  tValue   Probt     stderr  tvalue_rb  probt_rb------------------------------------------------------------------------------Intercept  -5.20041   84.95492   -0.06  0.9512   86.66308   -0.06001   0.95218acs_k3      6.95438    4.37110    1.59  0.1124   4.620599   1.505082  0.133104acs_46      5.96601    1.53105    3.90  0.0001   1.573214   3.792246  0.000173full        4.66822    0.41425   11.27  <.0001   0.414681   11.25737         0enroll     -0.10599    0.02695   -3.93  <.0001   0.028015   -3.78331  0.000179`

4.1.2 Using the Proc Genmod for Clustered Data

As described in Chapter 2, OLS regression assumes that the residuals are independent. The elemapi2 dataset contains data on 400 schools that come from 37 school districts. It is very possible that the scores within each school district may not be independent, and this could lead to residuals that are not independent within districts. SAS proc genmod is used to model correlated data.  We can use the class statement and the repeated statement  to indicate that the observations are clustered into districts (based on dnum) and that the observations may be correlated within districts, but would be independent between districts.

`proc genmod data="c:sasregelemapi2";     class dnum;     model api00 = acs_k3 acs_46 full enroll ;     repeated subject=dnum / type=ind ; run; quit;`
`The GENMOD Procedure`
`             Analysis Of GEE Parameter Estimates              Empirical Standard Error Estimates`
`                   Standard   95% ConfidenceParameter Estimate    Error       Limits            Z Pr > |Z|`
`Intercept  -5.2004 119.5172 -239.450 229.0490   -0.04   0.9653acs_k3      6.9544   6.7726  -6.3196  20.2284    1.03   0.3045acs_46      5.9660   2.4839   1.0976  10.8344    2.40   0.0163full        4.6682   0.6904   3.3151   6.0213    6.76   <.0001enroll     -0.1060   0.0421  -0.1886  -0.0234   -2.51   0.0119`

As with the regression with robust error, the estimate of the coefficients are the same as the OLS estimates, but the standard errors take into account that the observations within districts are non-independent.  Even though the standard errors are larger in this analysis, the three variables that were significant in the OLS analysis are significant in this analysis as well.

We notice that the standard error estimates given here are different from what Stata's result using regress with the cluster option. This is because that Stata further does a finite-sample adjustment. We can do some SAS programming here for the adjustment. The adjusted variance is a constant times the variance obtained from the empirical standard error estimates. This particular constant is
(N-1)/(N-k)*M/(M-1).

`data em;   set 'c:sasregelemapi2'; run;`
`proc genmod data=em;     class dnum;     model api00 = acs_k3 acs_46 full enroll ;     repeated subject=dnum / type = ind covb ; 	ods output geercov = gcov;     ods output GEEEmpPEst = parms; run; quit;`
`proc sql;   select count(dnum),count(distinct dnum)  into :n, :m   from em; quit; proc sql;   select count(prm1) into :k   from gcov; quit; data gcov_ad;   set gcov;   array all(*) _numeric_;   do i = 1 to dim(all);   all(i) = all(i)*((&n-1)/(&n-&k))*(&m/(&m-1));   if i = _n_ then std_ad = sqrt(all(i));   end; drop i; keep std_ad; run; data all;   merge parms gcov_ad; run; proc print data = all noobs; run;`
`Parm         Estimate      Stderr     LowerCL     UpperCL          Z     ProbZ     std_ad`
`Intercept     -5.2004    119.5172    -239.450    229.0490      -0.04    0.9653    121.778acs_k3         6.9544      6.7726     -6.3196     20.2284       1.03    0.3045      6.901acs_46         5.9660      2.4839      1.0976     10.8344       2.40    0.0163      2.531full           4.6682      0.6904      3.3151      6.0213       6.76    <.0001      0.703enroll        -0.1060      0.0421     -0.1886     -0.0234      -2.51    0.0119      0.043`

4.1.3 Robust Regression

In SAS, we can not simply execute some proc to perform a robust regression using iteratively reweighted least squares. In order to perform a robust regression,  we have to write our own macro. To this end, ATS has written a macro called robust_hb.sas. This macro first uses Hubert weight and later switches to biweight. This is why the macro is called robust_hb where h and b stands for Hubert and biweight respectively.  Robust regression assigns a weight to each observation with higher weights given to better behaved observations. In fact, extremely deviant cases, those with Cook's D greater than 1, can have their weights set to missing so that they are not included in the analysis at all.

The macro robust_hb.sas uses another macro called mad.sas to generate MAD (median absolute deviation) during the iteration process. We will include both macros to perform the robust regression analysis as shown below. Note that in this analysis both the coefficients and the standard errors differ from the original OLS regression.

`%include 'c:sasregmad.sas'; %include 'c:sasregrobust_hb.sas'; %robust_hb("c:sasregelemapi2", api00, acs_k3 acs_46 full enroll,             .01, 0.00005, 10);`
`The REG ProcedureModel: MODEL1Dependent Variable: api00Weight: _w2_                             Analysis of Variance                                    Sum of           MeanSource                   DF        Squares         Square    F Value    Pr > FModel                     4        3053248         763312      73.19    <.0001Error                   390        4067269          10429Corrected Total         394        7120516Root MSE            102.12196    R-Square     0.4288Dependent Mean      647.38090    Adj R-Sq     0.4229Coeff Var            15.77463                        Parameter Estimates                     Parameter       StandardVariable     DF       Estimate          Error    t Value    Pr > |t|Intercept     1       -6.79858       80.45406      -0.08      0.9327acs_k3        1        6.11031        4.15015       1.47      0.1417acs_46        1        6.25588        1.45280       4.31      <.0001full          1        4.79575        0.39212      12.23      <.0001enroll        1       -0.10924        0.02558      -4.27      <.0001`

If you compare the robust regression results (directly above) with the OLS results previously presented, you can see that the coefficients and standard errors are quite similar, and the t values and p values are also quite similar. Despite the minor problems that we found in the data when we performed the OLS analysis, the robust regression analysis yielded quite similar results suggesting that indeed these were minor problems.   Had the results been substantially different, we would have wanted to further investigate the reasons why the OLS and robust regression results were different, and among the two results the robust regression results would probably be the more trustworthy.

Let's look at the predicted (fitted) values (p), the residuals (r), and the leverage (hat) values (h).  The macro robust_hb generates a final data set with predicted values, raw residuals and leverage values together with the original data called _tempout_.

Now, let's check on the various predicted values and the weighting. First, we will sort by _w2_, the weight generated at last iteration. Then we will look at the first 15 observations. Notice that the smallest weights are near one-half but quickly get into the .6 range. The first five values are missing due to the missing values of predictors.

` proc sort data = _tempout_;   by descending _w2_; run; proc print data = _tempout_ (obs=10);   var snum api00 p r h _w2_; run;`
`Obs    snum    api00       p           r           h         _w2_  1     116     513        .           .        .           .  2    3072     763        .           .        .           .  3    3055     590        .           .        .           .  4    4488     521        .           .        .           .  5    4534     445        .           .        .           .  6     637     447     733.426    -286.144    0.003657    0.55684  7    5387     892     611.709     280.464    0.002278    0.57183  8    2267     897     622.062     275.512    0.009887    0.58481  9      65     903     631.930     271.731    0.010198    0.59466 10    3759     585     842.664    -257.473    0.040009    0.63128`

Now, let's look at the last 10 observations. The weights for observations with snum 1678, 4486 and 1885 are all very close to one, since the residuals are fairly small. Note that the observations above that have the lowest weights are also those with the largest residuals (residuals over 200) and the observations below with the highest weights have very low residuals (all less than 3).

`proc sort data = _tempout_;   by _w2_; run; proc print data = _tempout_ (obs=10);   var snum api00 p r h _w2_; run;`
`Obs    snum    api00       p           r           h         _w2_  1    1678     497     497.065     0.18424    0.023137    1.00000  2    4486     706     706.221     0.20151    0.013740    1.00000  3    1885     605     606.066    -0.41421    0.013745    1.00000  4    3535     705     703.929     1.16143    0.004634    0.99999  5    3024     727     729.278    -2.01559    0.010113    0.99997  6    3700     717     719.803    -2.43802    0.007317    0.99996  7    1596     536     533.918     2.77985    0.012143    0.99995  8     181     724     728.068    -3.53209    0.013665    0.99992  9    1949     696     691.898     3.96736    0.020426    0.99990 10    5917     735     731.141     4.03336    0.005831    0.99990`

After using macro robust_hb.sas, we can use the dataset _tempout_ to create some graphs for regression diagnostic purposes. For example,  we can create a graph of residuals versus fitted (predicted) with a line at zero. This plot looks much like the OLS plot, except that in the OLS all of the observations would be weighted equally, but as we saw above the observations with the greatest residuals are weighted less and hence have less influence on the results.

`axis1 order = (-300 to 300 by 100) label=(a=90) minor=none; axis2 order = (300 to 900 by 300) minor=none; symbol v=star h=0.8 c=blue; proc gplot data = _tempout_;   plot r*p = 1 /haxis=axis2 vaxis=axis1 vref=0;   label r ="Residual";   label p ="Fitted Value"; run; quit; `

4.1.4 Quantile Regression

Quantile regression, in general, and median regression, in particular, might be considered as an alternative to robust regression. SAS does quantile regression using a little bit of proc iml.  Inside proc iml, a procedure called LAV is called and it does a median regression in which the coefficients will be estimated by minimizing the absolute deviations from the median. Of course, as an estimate of central tendency, the median is a resistant measure that is not as greatly affected by outliers as is the mean. It is not clear that median regression is a resistant estimation procedure, in fact, there is some evidence that it can be affected by high leverage values.

Here is what the quantile regression looks like using SAS proc iml. The first data step is to make sure that the data set that proc iml takes as input does not have any missing values. Inside proc iml we first generate necessary matrices for regression computation and then call the procedure LAV. After calling LAV we can calculate the predicted values and residuals. At last, we create a data set called _temp_ containing the dependent variables and all the predictors plus the predicted values and residuals.

`data elemapi2;    set "c:sasregelemapi2";    cons = 1;    if api00 ~=. & acs_k3 ~= . & acs_46 ~=. & full ~=. & enroll ~=.; run;  proc iml ; /*Least absolute values*/   use elemapi2;   read all;   a = cons || acs_k3 || acs_46 || full || enroll;   b=api00;   opt= { . 3  0 1 };    call lav(rc,xr,a,b,,opt); /*print out the estimates*/      opt= { . 0  . 1 };    call lav(rc,xr,a,b,,opt); /*no print out but to create the xr*/     pred = a*t(xr);   resid = b - pred;   create _temp_ var { api00 cons acs_k3 acs_46 full enroll pred resid};   append; quit;`
`                                 LAV (L1) Estimation                               Start with LS Solution                      Start Iter: gamma=284.75134813 ActEqn=395 Iter  N Huber  Act Eqn     Rank    Gamma         L1(x)      F(Gamma)   31       36        5        5   0.0000    36268.1049    36240.6335Algorithm convergedObjective Function L1(x)= 36268.104941Number Search Directions=       68Number Refactorizations =        2Total Execution Time=       0.0200Necessary and sufficient optimality condition satisfied.                                    L1 Solution with ASEEst  17.1505029  1.2690656888   7.2240793844   5.3238408715 -0.124573396ASE  123.531545  6.3559394265   2.2262729207   0.602359504   0.0391932684`

The coefficient and standard error for acs_k3 are considerably different as compared to OLS (the coefficients are 1.2 vs 6.9 and the standard errors are 6.4 vs 4.3). The coefficients and standard errors for the other variables are also different, but not as dramatically different. Nevertheless, the  quantile regression results indicate that, like the OLS results, all of the variables except acs_k3 are significant. Using the data set _temp_ we created above we obtain a plot of residuals vs. predicted values shown below.

`symbol v=star h=0.8 c=blue; axis1 order = (-300 to 300 by 100) label=(a=90) minor=none; axis2 order = (300 to 900 by 300) minor=none; proc gplot data = _temp_;   plot resid*pred = 1 / vaxis=axis1 haxis=axis2 vref=0;   label resid="Residual";   label pred = "Fitted Value"; run; quit;`
` `

4.2 Constrained Linear Regression

Let's begin this section by looking at a regression model using the hsb2 dataset.  The hsb2 file is a sample of 200 cases from the Highschool and Beyond Study (Rock, Hilton, Pollack, Ekstrom & Goertz, 1985). It includes the following variables: id female race ses schtyp program read write math science socst. The variables read write math science socst are the results of standardized tests on reading, writing, math, science and social studies (respectively), and the variable female is coded 1 if female, 0 if male.

Let's start by doing an OLS regression where we predict socst score from read, write, math, science and female (gender)

`proc reg data="c:sasreghsb2";   model  socst = read write math science female ; run;`
`The REG ProcedureModel: MODEL1Dependent Variable: socst                             Analysis of Variance                                    Sum of           MeanSource                   DF        Squares         Square    F Value    Pr > FModel                     5          10949     2189.85150      35.44    <.0001Error                   194          11987       61.78834Corrected Total         199          22936Root MSE              7.86056    R-Square     0.4774Dependent Mean       52.40500    Adj R-Sq     0.4639Coeff Var            14.99963                        Parameter Estimates                     Parameter       StandardVariable     DF       Estimate          Error    t Value    Pr > |t|Intercept     1        7.33934        3.65024       2.01      0.0457read          1        0.37840        0.08063       4.69      <.0001write         1        0.38587        0.08893       4.34      <.0001math          1        0.13033        0.08938       1.46      0.1464science       1       -0.03339        0.08187      -0.41      0.6838female        1       -0.35326        1.24537      -0.28      0.7770`

Notice that the coefficients for read and write are very similar, which makes sense since they are both measures of language ability.  Also, the coefficients for math and science are similar (in that they are both not significantly different from 0). Suppose that we have a theory that suggests that read and write and math should have equal coefficients. We can test the equality of the coefficients using the test command.

`   test read = write;  run;`
`        Test 1 Results for Dependent Variable socst                                MeanSource             DF         Square    F Value    Pr > FNumerator           1        0.19057       0.00    0.9558Denominator       194       61.78834`

The test result indicates that there is no significant difference in the coefficients for the reading and writing scores. Since it appears that the coefficients for math and science are also equal, let's test the equality of those as well.

`   test math = science;   run;`
`       Test 2 Results for Dependent Variable socst                                MeanSource             DF         Square    F Value    Pr > FNumerator           1       89.63950       1.45    0.2299Denominator       194       61.78834`

Let's now perform both of these tests together, simultaneously testing that the coefficient for read equals write and math equals science using mtest statement.

`  mtest math - science, read - write;   run;`
`Multivariate Test 1                Multivariate Statistics and Exact F Statistics                              S=1    M=0    N=96Statistic                        Value    F Value    Num DF    Den DF    Pr > FWilks' Lambda               0.99257173       0.73         2       194    0.4852Pillai's Trace              0.00742827       0.73         2       194    0.4852Hotelling-Lawley Trace      0.00748386       0.73         2       194    0.4852Roy's Greatest Root         0.00748386       0.73         2       194    0.4852`

Note this second test has 2 df, since it is testing both of the hypotheses listed, and this test is not significant, suggesting these pairs of coefficients are not significantly different from each other.  We can estimate regression models where we constrain coefficients to be equal to each other.  For example, let's begin on a limited scale and constrain read to equal write. Proc reg uses restrict statement to  accomplish this.

`proc reg data="c:sasreghsb2";   model  socst = read write math science female ;   restrict read=write; run;`
`The REG ProcedureModel: MODEL1Dependent Variable: socstNOTE: Restrictions have been applied to parameter estimates.                             Analysis of Variance                                    Sum of           MeanSource                   DF        Squares         Square    F Value    Pr > FModel                     4          10949     2737.26674      44.53    <.0001Error                   195          11987       61.47245Corrected Total         199          22936Root MSE              7.84044    R-Square     0.4774Dependent Mean       52.40500    Adj R-Sq     0.4667Coeff Var            14.96124                        Parameter Estimates                     Parameter       StandardVariable     DF       Estimate          Error    t Value    Pr > |t|Intercept     1        7.35415        3.63118       2.03     0.0442read          1        0.38185        0.05139       7.43     <.0001write         1        0.38185        0.05139       7.43     <.0001math          1        0.13030        0.08915       1.46     0.1454science       1       -0.03328        0.08164      -0.41     0.6840female        1       -0.32962        1.16736      -0.28     0.7780RESTRICT     -1      -25.51302      458.21611      -0.06     0.9558** Probability computed using beta distribution.`

Notice that the coefficients for read and write are identical, along with their standard errors, t-test, etc. Also note that the degrees of freedom for the F test is four, not five, as in the OLS model. This is because only one coefficient is estimated for read and write, estimated like a single variable equal to the sum of their values.  Notice also that the Root MSE is slightly higher for the constrained model, but only slightly higher.  This is because we have forced the model to estimate the coefficients for read and write that are not as good at minimizing the Sum of Squares Error (the coefficients that would minimize the SSE would be the coefficients from the unconstrained model).

Next, we will define a second constraint, setting math equal to science together with the first constraint we set before.

`proc reg data="c:sasreghsb2";   model  socst = read write math science female ;   restrict read = write, math = science; run;`
`The REG ProcedureModel: MODEL1Dependent Variable: socstNOTE: Restrictions have been applied to parameter estimates.                             Analysis of Variance                                    Sum of           MeanSource                   DF        Squares         Square    F Value    Pr > FModel                     3          10860     3619.84965      58.75    <.0001Error                   196          12077       61.61554Corrected Total         199          22936Root MSE              7.84956    R-Square     0.4735Dependent Mean       52.40500    Adj R-Sq     0.4654Coeff Var            14.97864                        Parameter Estimates                     Parameter       StandardVariable     DF       Estimate          Error    t Value    Pr > |t|Intercept     1        7.50566        3.63323       2.07     0.0402read          1        0.38604        0.05133       7.52     <.0001write         1        0.38604        0.05133       7.52     <.0001math          1        0.04281        0.05192       0.82     0.4107science       1        0.04281        0.05192       0.82     0.4107female        1       -0.20087        1.16383      -0.17     0.8631RESTRICT     -1      -15.36285      458.82638      -0.03     0.9734*RESTRICT     -1      547.24384      454.01563       1.21     0.2290** Probability computed using beta distribution.`

Now the coefficients for readwrite and math = science and the degrees of freedom for the model has dropped to three.  Again, the Root MSE is slightly larger than in the prior model, but we should emphasize only very slightly larger.  If indeed the population coefficients for readwrite and math = science, then these combined (constrained) estimates may be more stable and generalize better to other samples.  So although these estimates may lead to slightly higher standard error of prediction in this sample, they may generalize better to the population from which they came.

4.3 Regression with Censored or Truncated Data

Analyzing data that contain censored values or are truncated is common in many research disciplines. According to Hosmer and Lemeshow (1999), a censored value is one whose value is incomplete due to random factors for each subject. A truncated observation, on the other hand, is one which is incomplete due to a selection process in the design of the study.

We will begin by looking at analyzing data with censored values.

4.3.1 Regression with Censored Data

In this example we have a variable called acadindx which is a weighted combination of standardized test scores and academic grades. The maximum possible score on acadindx is 200 but it is clear that the 16 students who scored 200 are not exactly equal in their academic abilities. In other words, there is variability in academic ability that is not being accounted for when students score 200 on acadindx. The variable acadindx is said to be censored, in particular, it is right censored.

Let's look at the example. We will begin by looking at a description of the data, some descriptive statistics, and correlations among the variables.

`proc means data = "c:sasregacadindx"; run;`
`The MEANS ProcedureVariable      N            Mean         Std Dev         Minimum         Maximum-------------------------------------------------------------------------------id          200     100.5000000      57.8791845       1.0000000     200.0000000female      200       0.5450000       0.4992205               0       1.0000000reading     200      52.2300000      10.2529368      28.0000000      76.0000000writing     200      52.7750000       9.4785860      31.0000000      67.0000000acadindx    200     172.1850000      16.8173987     138.0000000     200.0000000-------------------------------------------------------------------------------`
`proc means data = "c:sasregacadindx" N;   where acadindx = 200; run;`
`The MEANS ProcedureVariable      N---------------id           16female       16reading      16writing      16acadindx     16---------------`
`proc corr data = "c:sasregacadindx" nosimple noprob;   var acadindx female reading writing; run;`
`The CORR Procedure   4  Variables:    acadindx female   reading  writing           Pearson Correlation Coefficients, N = 200              acadindx        female       reading       writingacadindx       1.00000      -0.08210       0.71309       0.66256female        -0.08210       1.00000      -0.05308       0.25649reading        0.71309      -0.05308       1.00000       0.59678writing        0.66256       0.25649       0.59678       1.00000`

Now, let's run a standard OLS regression on the data and generate predicted scores in p1.

`proc reg data = "c:sasregacadindx";    model acadindx =female reading writing;    output out = reg1 p = p1; run; quit;`
`The REG ProcedureModel: MODEL1Dependent Variable: acadindx                             Analysis of Variance                                    Sum of           MeanSource                   DF        Squares         Square    F Value    Pr > FModel                     3          34994          11665     107.40    <.0001Error                   196          21288      108.61160Corrected Total         199          56282Root MSE             10.42169    R-Square     0.6218Dependent Mean      172.18500    Adj R-Sq     0.6160Coeff Var             6.05261                        Parameter Estimates                     Parameter       StandardVariable     DF       Estimate          Error    t Value    Pr > |t|Intercept     1       96.11841        4.48956      21.41      <.0001female        1       -5.83250        1.58821      -3.67      0.0003reading       1        0.71842        0.09315       7.71      <.0001writing       1        0.79057        0.10410       7.59      <.0001`

The proc lifereg is one of the procedures in SAS that can be used for regression with censored data. The syntax of the command is similar to proc reg with the addition of the variable indicating if an observation is censored. Therefore, we have to create a data set with the information on censoring.

`data tobit_model;   set "c:sasregacadindx";   censor = ( acadindx >= 200 ); run;   proc lifereg data = tobit_model;    model acadindx*censor(1) = female reading writing /d=normal;    output out = reg2 p = p2; run;`
`The LIFEREG Procedure             Model InformationData Set                    WORK.TOBIT_MODELDependent Variable                  acadindxCensoring Variable                    censorCensoring Value(s)                         1Number of Observations                   200Noncensored Values                       184Right Censored Values                     16Left Censored Values                       0Interval Censored Values                   0Name of Distribution                  NormalLog Likelihood                  -718.0636168Algorithm converged.       Type III Analysis of Effects                         WaldEffect       DF    Chi-Square    Pr > ChiSqfemale        1       14.0654        0.0002reading       1       60.8529        <.0001writing       1       54.1655        <.0001                  Analysis of Parameter Estimates                      Standard   95% Confidence     Chi-Parameter DF Estimate    Error       Limits       Square Pr > ChiSqIntercept  1  92.7378   4.8034  83.3233 102.1524  372.74     <.0001female     1  -6.3473   1.6924  -9.6644  -3.0302   14.07     0.0002reading    1   0.7777   0.0997   0.5823   0.9731   60.85     <.0001writing    1   0.8111   0.1102   0.5951   1.0271   54.17     <.0001Scale      1  10.9897   0.5817   9.9067  12.1912`

Let's merge the two data sets we created together to compare the predicted score p1 and p2.

`data compare;   merge reg1 reg2;   by id; run; proc means data = compare;   var acadindx p1 p2; run;`
`The MEANS ProcedureVariable      N            Mean         Std Dev         Minimum         Maximum-------------------------------------------------------------------------------acadindx    200     172.1850000      16.8173987     138.0000000     200.0000000p1          200     172.1850000      13.2608696     142.3820725     201.5311070p2          200     172.7040275      14.0029221     141.2210940     203.8540576-------------------------------------------------------------------------------`

It shows that the censored regression model predicted values have a larger standard deviation and a greater range of values. When we look at a listing of p1 and p2 for all students who scored the maximum of 200 on acadindx, we see that in every case the censored regression model predicted value is greater than the OLS predicted value. These predictions represent an estimate of what the variability would be if the values of acadindx could exceed 200.

`proc print data = compare;   var acadindx p1 p2;   where acadindx = 200; run;`
`Obs    acadindx       p1         p2 32       200      179.175    179.620 57       200      192.681    194.329 68       200      201.531    203.854 80       200      191.831    193.577 82       200      188.154    189.563 88       200      186.573    187.940 95       200      195.997    198.176100       200      186.933    188.108132       200      197.578    199.798136       200      189.459    191.144143       200      191.185    192.833157       200      191.614    193.477161       200      180.251    181.008169       200      182.275    183.367174       200      191.614    193.477200       200      187.662    189.421`

Proc lifereg  handles right censored, left censored and interval censored data.  There are two other commands in SAS that perform censored regression analysis such as proc qlim

4.3.2 Regression with Truncated Data

Truncated data occurs when some observations are not included in the analysis because of the value of the variable. We will illustrate analysis with truncation using the dataset, acadindx, that was used in the previous section.

Let's imagine that in order to get into a special honors program, students need to score at least 160 on acadindx. So we will drop all observations in which the value of acadindx is less than or equal 160. Now, let's estimate the same model that we used in the section on censored data, only this time we will pretend that a 200 for acadindx is not censored.

`proc reg data = "c:sasregacadindx";   model acadindx = female reading writing;   where acadindx >160; run; quit;`
`The REG ProcedureModel: MODEL1Dependent Variable: acadindx                             Analysis of Variance                                    Sum of           MeanSource                   DF        Squares         Square    F Value    Pr > FModel                     3     8074.79638     2691.59879      33.01    <.0001Error                   140          11416       81.54545Corrected Total         143          19491Root MSE              9.03025    R-Square     0.4143Dependent Mean      180.42361    Adj R-Sq     0.4017Coeff Var             5.00503                        Parameter Estimates                     Parameter       StandardVariable     DF       Estimate          Error    t Value    Pr > |t|Intercept     1      125.63547        5.89156      21.32      <.0001female        1       -5.23849        1.61563      -3.24      0.0015reading       1        0.44111        0.09635       4.58      <.0001writing       1        0.58733        0.11508       5.10      <.0001`

It is clear that the estimates of the coefficients are distorted due to the fact that 53 observations are no longer in the dataset. This amounts to restriction of range on both the response variable and the predictor variables. For example, the coefficient for writing dropped from .79 to .58. What this means is that if our goal is to find the relation between adadindx and the predictor variables in the populations, then the truncation of acadindx in our sample is going to lead to biased estimates. A better approach to analyzing these data is to use truncated regression. In SAS this can be accomplished using proc qlim. Proc qlim is an experimental procedure first available in SAS version 8.1. Proc qlim (Qualitative and LImited dependent variable model) analyzes univariate (and multivariate) limited dependent variable models where dependent variables takes discrete values or dependent variables are observed only in a limited range of values.

`data trunc_model;   set "c:sasregacadindx";   y = .;   if acadindx > 160 & acadindx ~=. then y = acadindx; run; proc qlim data=trunc_model ;   model y  = female reading writing;   endogenous y~ truncated (lb=160); run;`
`The QLIM Procedure                            Summary Statistics of Continuous Responses                                                                                    N Obs    N Obs                            Standard                           Lower       Upper    Lower    UpperVariable        Mean           Error          Type             Bound       Bound    Bound    Bound   y        180.4236       11.674837       Truncated             160             Model Fit SummaryNumber of Endogenous Variables             1Endogenous Variable                        yNumber of Observations                   144Missing Values                            56Log Likelihood                    -510.00768Maximum Absolute Gradient         3.87471E-6Number of Iterations                      14AIC                                     1030Schwarz Criterion                       1045Algorithm converged.                      Parameter Estimates                                 Standard                 ApproxParameter        Estimate           Error    t Value    Pr > |t|Intercept      110.289206        8.673847      12.72      <.0001female          -6.099602        1.925245      -3.17      0.0015reading          0.518179        0.116829       4.44      <.0001writing          0.766164        0.152620       5.02      <.0001_Sigma           9.803571        0.721646      13.59      <.0001`

The coefficients from the proc qlim are closer to the OLS results, for example the coefficient for writing is .77 which is closer to the OLS results of .79.  However, the results are still somewhat different on the other variables, for example the coefficient for reading is .52 in the proc qlim as compared to .72 in the original OLS with the unrestricted data, and better than the OLS estimate of .47 with the restricted data.  While proc qlim may improve the estimates on a restricted data file as compared to OLS, it is certainly no substitute for analyzing the complete unrestricted data file.

4.4 Regression with Measurement Error

As you will most likely recall, one of the assumptions of regression is that the predictor variables are measured without error. The problem is that measurement error in predictor variables leads to under estimation of the regression coefficients.

This section is under development.

4.5 Multiple Equation Regression Models

If a dataset has enough variables we may want to estimate more than one regression model. For example, we may want to predict y1 from x1 and also predict y2 from x2. Even though there are no variables in common these two models are not independent of one another because the data come from the same subjects. This is an example of one type multiple equation regression known as seemly unrelated regression.. We can estimate the coefficients and obtain standard errors taking into account the correlated errors in the two models. An important feature of multiple equation modes is that we can test predictors across equations.

Another example of multiple equation regression is if we wished to predict y1, y2 and y3 from x1 and x2. This is a three equation system, known as multivariate regression, with the same predictor variables for each model. Again, we have the capability of testing coefficients across the different equations.

Multiple equation models are a powerful extension to our data analysis tool kit.

4.5.1 Seemingly Unrelated Regression

Let's continue using the hsb2 data file to illustrate the use of seemingly unrelated regression. This time let's look at two regression models.

`    science = math female     write   = read female    `

It is the case that the errors (residuals) from these two models would be correlated. This would be true even if the predictor female were not found in both models. The errors would be correlated because all of the values of the variables are collected on the same set of observations. This is a situation tailor made for seemingly unrelated regression using the proc syslin with option sur. With the proc syslin we can estimate both models simultaneously while accounting for the correlated errors at the same time, leading to efficient estimates of the coefficients and standard errors. The syntax is as follows.

`proc syslin data="c:sasreghsb2" sur ;   science:  model science = math female ;   write:    model write = read female ; run;`

The first part of the output consists of the OLS estimate for each model. Here is the OLS estimate for the first model.

`               The SYSLIN ProcedureOrdinary Least Squares EstimationModel                  SCIENCEDependent Variable     science                         Analysis of Variance                               Sum of        MeanSource                 DF     Squares      Square    F Value    Pr > FModel                   2    7993.550    3996.775      68.38    <.0001Error                 197    11513.95    58.44645Corrected Total       199    19507.50Root MSE             7.64503    R-Square       0.40977Dependent Mean      51.85000    Adj R-Sq       0.40378Coeff Var           14.74451                        Parameter Estimates                       Parameter    StandardVariable         DF     Estimate       Error    t Value    Pr > |t|Intercept         1     18.11813    3.167133       5.72      <.0001math              1     0.663190    0.057872      11.46      <.0001female            1     -2.16840    1.086043      -2.00      0.0472`

And here is OLS estimate for the second model.

`The SYSLIN ProcedureOrdinary Least Squares EstimationModel                    WRITEDependent Variable       write                         Analysis of Variance                               Sum of        MeanSource                 DF     Squares      Square    F Value    Pr > FModel                   2    7856.321    3928.161      77.21    <.0001Error                 197    10022.55    50.87591Corrected Total       199    17878.88Root MSE             7.13273    R-Square       0.43942Dependent Mean      52.77500    Adj R-Sq       0.43373Coeff Var           13.51537                        Parameter Estimates                       Parameter    StandardVariable         DF     Estimate       Error    t Value    Pr > |t|Intercept         1     20.22837    2.713756       7.45      <.0001read              1     0.565887    0.049385      11.46      <.0001female            1     5.486894    1.014261       5.41      <.0001`

Proc syslin with option sur also gives an estimate of the correlation between the errors of the two models. Here is the corresponding output.

`The SYSLIN ProcedureSeemingly Unrelated Regression Estimation       Cross Model Covariance              SCIENCE       WRITESCIENCE        58.4464        7.8908WRITE           7.8908       50.8759      Cross Model Correlation              SCIENCE       WRITESCIENCE        1.00000       0.14471WRITE          0.14471       1.00000  Cross Model Inverse Correlation              SCIENCE       WRITESCIENCE        1.02139      -0.14780WRITE         -0.14780       1.02139   Cross Model Inverse Covariance              SCIENCE       WRITESCIENCE       0.017476      -.002710WRITE         -.002710      0.020076System Weighted MSE            0.9981Degrees of freedom                394System Weighted R-Square       0.3875`
Finally, we have the seemingly unrelated regression estimation for our models. Note that both the estimates of the coefficients and their standard errors are different from the OLS model estimates shown above.
`The SYSLIN ProcedureSeemingly Unrelated Regression EstimationModel                  SCIENCEDependent Variable     science                        Parameter Estimates                       Parameter    StandardVariable         DF     Estimate       Error    t Value    Pr > |t|Intercept         1     20.13265    3.149485       6.39      <.0001math              1     0.625141    0.057528      10.87      <.0001female            1     -2.18934    1.086038      -2.02      0.0452Model                    WRITEDependent Variable       write                        Parameter Estimates                       Parameter    StandardVariable         DF     Estimate       Error    t Value    Pr > |t|Intercept         1     21.83439    2.698827       8.09      <.0001read              1     0.535484    0.049091      10.91      <.0001female            1     5.453748    1.014244       5.38      <.0001`

Now that we have estimated our models let's test the predictor variables. The test for female combines information from both models. The tests for math and read are actually equivalent to the t-tests above except that the results are displayed as F-tests.

`proc syslin data = "c:sasreghsb2" sur ;   science:  model science = math female ;   write:    model write = read female ;   female: stest science.female = write.female =0;   math: stest science.math = 0;   read: stest write.read = 0; run;`
`      Test Results for Variable FEAMLE  Num DF      Den DF    F Value    Pr > F       2         394      18.48    0.0001      Test Results for Variable MATH  Num DF      Den DF    F Value    Pr > F       1         394     118.31    0.0001      Test Results for Variable READ  Num DF      Den DF    F Value    Pr > F       1         394     119.21    0.0001`

Now, let's estimate 3 models where we use the same predictors in each model as shown below.

`     read  = female prog1 prog3      write = female prog1 prog3      math  = female prog1 prog3`

Here variable prog1 and prog3 are dummy variables for the variable prog. Let's generate these variables before estimating our three models using proc syslin

`data hsb2;   set "c:sasreghsb2";   prog1 = (prog = 1);   prog3 = (prog = 3); run;  proc syslin data = hsb2 sur;   model1: model read  = female prog1 prog3;   model2: model write = female prog1 prog3;   model3: model  math  = female prog1 prog3; run;`

The OLS regression estimate of our three models are as follows.

`<some output omitted>`
`The SYSLIN ProcedureOrdinary Least Squares EstimationModel                   MODEL1Dependent Variable        read                        Parameter Estimates                       Parameter    StandardVariable         DF     Estimate       Error    t Value    Pr > |t|Intercept         1     56.82950    1.170562      48.55      <.0001female            1     -1.20858    1.327672      -0.91      0.3638prog1             1     -6.42937    1.665893      -3.86      0.0002prog3             1     -9.97687    1.606428      -6.21      <.0001Model                   MODEL2Dependent Variable       write                        Parameter Estimates                       Parameter    StandardVariable         DF     Estimate       Error    t Value    Pr > |t|Intercept         1     53.62162    1.042019      51.46      <.0001female            1     4.771211    1.181876       4.04      <.0001prog1             1     -4.83293    1.482956      -3.26      0.0013prog3             1     -9.43807    1.430021      -6.60      <.0001Model                   MODEL3Dependent Variable        math                        Parameter Estimates                       Parameter    StandardVariable         DF     Estimate       Error    t Value    Pr > |t|Intercept         1     57.10551    1.036890      55.07      <.0001female            1     -0.67377    1.176059      -0.57      0.5674prog1             1     -6.72394    1.475657      -4.56      <.0001prog3             1     -10.3217    1.422983      -7.25      <.0001`

These regressions provide fine estimates of the coefficients and standard errors but these results assume the residuals of each analysis are completely independent of the other. Also, if we wish to test female, we would have to do it three times and would not be able to combine the information from all three tests into a single overall test.

Now let's see the output of the estimate using seemingly unrelated regression.

`The SYSLIN ProcedureSeemingly Unrelated Regression Estimation`
`Model                   MODEL1Dependent Variable       read`
`                        Parameter Estimates                       Parameter    StandardVariable         DF     Estimate       Error    t Value    Pr > |t|Intercept         1     56.82950    1.170562      48.55      <.0001female            1     -1.20858    1.327672      -0.91      0.3638prog1             1     -6.42937    1.665893      -3.86      0.0002prog3             1     -9.97687    1.606428      -6.21      <.0001Model                   MODEL2Dependent Variable       write                        Parameter Estimates                       Parameter    StandardVariable         DF     Estimate       Error    t Value    Pr > |t|Intercept         1     53.62162    1.042019      51.46      <.0001female            1     4.771211    1.181876       4.04      <.0001prog1             1     -4.83293    1.482956      -3.26      0.0013prog3             1     -9.43807    1.430021      -6.60      <.0001Model                   MODEL3Dependent Variable        math                        Parameter Estimates                       Parameter    StandardVariable         DF     Estimate       Error    t Value    Pr > |t|Intercept         1     57.10551    1.036890      55.07      <.0001female            1     -0.67377    1.176059      -0.57      0.5674prog1             1     -6.72394    1.475657      -4.56      <.0001prog3             1     -10.3217    1.422983      -7.25      <.0001`

Note that the coefficients are identical in the OLS results and in the seemingly unrelated regression estimate, however the standard errors are different, only slightly, due to the correlation among the residuals in the multiple equations.

In addition to getting more appropriate standard errors, we can test the effects of the predictors across the equations.  We can test the hypothesis that the coefficient for female is 0 for all three outcome variables, as shown below.

`proc syslin data = hsb2 sur;   model1: model read  = female prog1 prog3;   model2: model write = female prog1 prog3;   model3: model  math  = female prog1 prog3;   feamle: stest model1.female = model2.female = model3.female = 0; run;`
`     Test Results for Variable FEAMLE  Num DF      Den DF    F Value    Pr > F       3         588      11.63    0.0001`

We can also test the hypothesis that the coefficient for female is 0 for just read and math

`proc syslin data = hsb2 sur;   model1: model read  = female prog1 prog3;   model2: model write = female prog1 prog3;   model3: model  math  = female prog1 prog3;   f1: stest model1.female = model3.female = 0; run;`
`       Test Results for Variable F1  Num DF      Den DF    F Value    Pr > F       2         588       0.42    0.6599`

We can also test the hypothesis that the coefficients for prog1 and prog3 are 0 for all three outcome variables, as shown below.

`proc syslin data = hsb2 sur;   model1: model read  = female prog1 prog3;   model2: model write = female prog1 prog3;   model3: model  math  = female prog1 prog3;   progs: stest model1.prog1 = model2.prog1 = model3.prog1 =0,                model1.prog3 = model2.prog3 = model3.prog3 =0 ; run;`
`     Test Results for Variable PROGS  Num DF      Den DF    F Value    Pr > F       6         588      11.83    0.0001`

4.5.2 Multivariate Regression

Let's now use multivariate regression using proc reg  to look at the same analysis say that we saw in the proc syslin example above, estimating the following 3 models.

`     read  = female prog1 prog3      write = female prog1 prog3      math  = female prog1 prog3`

Below we use proc reg to predict read write and math from female prog1 and prog3. Note that the top part of the output is similar to the sureg output in that it gives an overall summary of the model for each outcome variable, however the results are somewhat different and the sureg uses a Chi-Square test for the overall fit of the model, and mvreg uses an F-test. The lower part of the output appears similar to the sureg output, however when you compare the standard errors you see that the results are not the same.  These standard errors correspond to the OLS standard errors, so these results below do not take into account the correlations among the residuals (as do the sureg results).

`proc reg data =hsb2;   model read write math = female prog1 prog3 ; run;`
`The REG Procedure [Some output omitted] Dependent Variable: read                        Parameter Estimates                     Parameter       StandardVariable     DF       Estimate          Error    t Value    Pr > |t|Intercept     1       56.82950        1.17056      48.55      <.0001female        1       -1.20858        1.32767      -0.91      0.3638prog1         1       -6.42937        1.66589      -3.86      0.0002prog3         1       -9.97687        1.60643      -6.21      <.0001Dependent Variable: write                        Parameter Estimates                     Parameter       StandardVariable     DF       Estimate          Error    t Value    Pr > |t|Intercept     1       53.62162        1.04202      51.46      <.0001female        1        4.77121        1.18188       4.04      <.0001prog1         1       -4.83293        1.48296      -3.26      0.0013prog3         1       -9.43807        1.43002      -6.60      <.0001Dependent Variable: math                        Parameter Estimates                     Parameter       StandardVariable     DF       Estimate          Error    t Value    Pr > |t|Intercept     1       57.10551        1.03689      55.07      <.0001female        1       -0.67377        1.17606      -0.57      0.5674prog1         1       -6.72394        1.47566      -4.56      <.0001prog3         1      -10.32168        1.42298      -7.25      <.0001`

Now, let's test female. Note, that female was statistically significant in only one of the three equations. Using the mtest statement after proc reg allows us to test female across all three equations simultaneously. And, guess what? It is significant. This is consistent with what we found using seemingly unrelated regression estimation.

`    female: mtest female=0;   run;`
`Multivariate Test: female                Multivariate Statistics and Exact F Statistics                             S=1    M=0.5    N=96Statistic                        Value    F Value    Num DF    Den DF    Pr > FWilks' Lambda               0.84892448      11.51         3       194    <.0001Pillai's Trace              0.15107552      11.51         3       194    <.0001Hotelling-Lawley Trace      0.17796108      11.51         3       194    <.0001Roy's Greatest Root         0.17796108      11.51         3       194    <.0001`

We can also test prog1 and prog3, both separately and combined. Remember these are multivariate tests.

`  prog1: mtest prog1 = 0; run;`
`Multivariate Test: prog1                Multivariate Statistics and Exact F Statistics                             S=1    M=0.5    N=96Statistic                        Value    F Value    Num DF    Den DF    Pr > FWilks' Lambda               0.89429287       7.64         3       194    <.0001Pillai's Trace              0.10570713       7.64         3       194    <.0001Hotelling-Lawley Trace      0.11820192       7.64         3       194    <.0001Roy's Greatest Root         0.11820192       7.64         3       194    <.0001`
`  prog3: mtest prog3 = 0; run;`
`Multivariate Test: prog3                Multivariate Statistics and Exact F Statistics                             S=1    M=0.5    N=96Statistic                        Value    F Value    Num DF    Den DF    Pr > FWilks' Lambda               0.75267026      21.25         3       194    <.0001Pillai's Trace              0.24732974      21.25         3       194    <.0001Hotelling-Lawley Trace      0.32860304      21.25         3       194    <.0001Roy's Greatest Root         0.32860304      21.25         3       194    <.0001`
`   prog: mtest prog1 = prog3 =0; run; quit;`
`Multivariate Test: prog                 Multivariate Statistics and F Approximations                              S=2    M=0    N=96Statistic                        Value    F Value    Num DF    Den DF    Pr > FWilks' Lambda               0.73294667      10.87         6       388    <.0001Pillai's Trace              0.26859190      10.08         6       390    <.0001Hotelling-Lawley Trace      0.36225660      11.68         6     256.9    <.0001Roy's Greatest Root         0.35636617      23.16         3       195    <.0001NOTE: F Statistic for Roy's Greatest Root is an upper bound.NOTE: F Statistic for Wilks' Lambda is exact.`

Proc syslin with sur option and proc reg both allow you to test multi-equation models while taking into account the fact that the equations are not independent.  The proc syslin  with sur option allows you to get estimates for each equation which adjust for the non-independence of the equations, and it allows you to estimate equations which don't necessarily have the same predictors. By contrast, proc reg is restricted to equations that have the same set of predictors, and the estimates it provides for the individual equations are the same as the OLS estimates.  However, proc reg  allows you to perform more traditional multivariate tests of predictors.

4.6 Summary

This chapter has covered a variety of topics that go beyond ordinary least squares regression, but there still remain a variety of topics we wish we could have covered, including the analysis of survey data, dealing with missing data, panel data analysis, and more. And, for the topics we did cover, we wish we could have gone into even more detail. One of our main goals for this chapter was to help you be aware of some of the techniques that are available in SAS for analyzing data that do not fit the assumptions of OLS regression and some of the remedies that are possible. If you are a member of the UCLA research community, and you have further questions, we invite you to use our consulting services to discuss issues specific to your data analysis.

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.

## GIS and Visualization

[ 打印 ] 