Curve Estimation
- Models
are types of linear and nonlinear curves which may
be fitted to the data. PASW/SPSS supports these models: linear,
logarithmic, inverse, quadratic, cubic, power, compound, S-curve,
logistic, growth, and exponential. The PASW/SPSS menu choice
Analyze, Legacy Dialogs, Scatter/Dot allows the researcher to plot
the dependent against the independent variable, which may aid in
selecting a suitable model to fit. However, before selecting a more
complex model the researcher should first consider whether a
transformation of the data might enable a simpler one to be used,
even linear regression.
- Residual models
. The PASW/SPSS Curve Estimation module only
supports one dependent and one independent variable. While this is
suitable for bivariate analysis, for multivariate analysis it is at
best a "quick and dirty" tool for assessing if one of multiple
independent variables is related to the dependent in one of the 10
supported nonlinear manners. An alternative strategy is to use OLS,
ordinal, multinomial, or some other form of multivariate regression to
regress a given independent variable on all the other independents,
then save the residuals. The residuals then represent the variance in
the given independent once all other independents are controlled. One
may then use these residuals as the independent variable in the
PASW/SPSS Curve Estimation module, using it to predict the dependent under
any of the supported linear and nonlinear models.
The choice between a regular (raw data) and a residual model depends on whether the researcher is interested in uncontrolled or in controlled relationships. Put another way, the standardized b coefficients in the uncontrolled, bivariate raw data approach are whole coefficients, equal to the correlation of the independent with the dependent. The standardized b coefficients in the controlled, multivariate residual approach are partial coefficients, partialling out the effect of other independent variables. Generally, partial coefficients are preferred for most multivariate analysis purposes.
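The residualization strategy described above can be sketched in a few lines of Python with numpy (not SPSS); the dataset, with dependent y and independents x1, x2, x3, is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.5 * x2 + 0.3 * x3 + rng.normal(size=n)   # x1 correlates with the others
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

# Regress x1 on the other independents and save the residuals.
X = np.column_stack([np.ones(n), x2, x3])
coef, *_ = np.linalg.lstsq(X, x1, rcond=None)
x1_resid = x1 - X @ coef

# x1_resid is x1 with x2 and x3 partialled out; it would now serve as the
# single independent in the Curve Estimation module. Its correlation with
# x2 and x3 is zero by construction (OLS residuals are orthogonal to the
# regressors).
print(float(np.corrcoef(x1_resid, x2)[0, 1]))
```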
- Time series models . In the Curve Estimation dialog, if the "Time" radio button is turned on, PASW/SPSS assumes time series data with a uniform time interval separating cases in the series. That is, each data row is assumed to represent observations at sequential times which are uniformly spaced. It is assumed, of course, that the dependent variable is also a time series variable. A "Sequence" variable is created automatically and is used as the independent (other predictor variables cannot be used if the "Time" option is selected). If the "Time" option is selected, the time variable, t, replaces the independent variable, x, in the equations given below, and one can specify a forecast period past the end of the time series.
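The mechanics of the "Time" option can be sketched in Python with numpy: the sequence 1..n replaces x, and the fitted curve is extrapolated past the end of the series. The series values here are invented for illustration:

```python
import numpy as np

y = np.array([5.1, 5.6, 6.2, 6.5, 7.1, 7.4, 8.0, 8.3])  # uniform time intervals
t = np.arange(1, len(y) + 1)                             # the auto "Sequence" variable

b1, b0 = np.polyfit(t, y, 1)                             # linear model Y = b0 + b1*t
forecast = b0 + b1 * np.arange(len(y) + 1, len(y) + 4)   # forecast 3 periods ahead
print(forecast.round(2))
```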
- Linear models.
Y = b0 + (b1 * x), where b0 is the
constant, b1 the regression coefficient for x, the independent
variable. Note: in this and the figures below, the exact shape of the curve
(line) is greatly affected by the parameters; each figure represents
only one particular set of parameters. In the figure
below, b0 is 4.818 and b1 is .436 in the "Model Summary and Parameter
Estimates" output table.
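For readers working outside PASW/SPSS, the same linear fit can be reproduced in a few lines of Python with numpy; the data values here are invented for illustration, not the data behind the figure:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([5.2, 5.7, 6.1, 6.6, 7.0, 7.5])

# np.polyfit returns coefficients highest power first: (b1, b0)
b1, b0 = np.polyfit(x, y, 1)
print(round(b0, 3), round(b1, 3))
```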
-
Logarithmic models.
Y = b0 + (b1 * ln(x)) where
ln() is the natural log function. In the figure below, b0 is 5.422 and
b1 is 1.113, in the "Model Summary and Parameter Estimates" output
table.
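Because the logarithmic model is linear in ln(x), it can be fitted by ordinary least squares after transforming x. A Python sketch on made-up data (not the figure's data):

```python
import numpy as np

x = np.array([1., 2., 4., 8., 16.])
y = 5.0 + 1.1 * np.log(x) + np.array([0.05, -0.03, 0.02, -0.04, 0.01])  # small noise

# Regress Y on ln(x): Y = b0 + b1*ln(x)
b1, b0 = np.polyfit(np.log(x), y, 1)
print(round(b0, 2), round(b1, 2))
```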
- Inverse models.
Y = b0 + (b1 / x). In the figure
below, b0 is 7.194 and b1 is -1.384, in the "Model Summary and
Parameter Estimates" output table.
- Quadratic models.
Y = b0 + (b1 * x) + (b2 * x**2)
where ** is the exponentiation operator. If b2 is positive, the curve
opens upward; if negative, downward. In the figure below, b0 is 4.065, b1
is 1.389, and b2 is -.141, in the "Model Summary and Parameter
Estimates" output table.
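A quadratic fit amounts to a degree-2 polynomial regression, which np.polyfit handles directly. A Python sketch on made-up, noise-free data (so the parameters are recovered exactly):

```python
import numpy as np

x = np.linspace(1, 10, 10)
y = 4.0 + 1.4 * x - 0.14 * x**2    # b2 < 0: the curve opens downward

# polyfit returns (b2, b1, b0), highest power first
b2, b1, b0 = np.polyfit(x, y, 2)
print(round(b0, 2), round(b1, 2), round(b2, 2))
```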
- Cubic models.
Y = b0 + (b1 * x) + (b2 * x**2) +
(b3 * x**3). If b3 is positive, the curve eventually rises; if negative,
it eventually falls. In the figure below, b0 is 3.409, b1 is 2.609, b2 is -.598,
and b3 is .043, in the "Model Summary and Parameter Estimates" output
table.
- Power models.
Y = b0 * (x**b1). Given positive b0, if b1 is
positive the slope is upward; if negative, downward. Also, ln(Y) = ln(b0) + (b1
* ln(x)). In the figure below, b0 is 4.84 and b1 is .263, in the "Model
Summary and Parameter Estimates" output table.
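The power model's log-log linearization, ln(Y) = ln(b0) + b1*ln(x), means its parameters can be recovered by OLS on the transformed variables. A Python sketch on made-up, noise-free data:

```python
import numpy as np

x = np.array([1., 2., 4., 8., 16.])
y = 4.8 * x**0.26                   # power model with b0 = 4.8, b1 = 0.26

# Regress ln(Y) on ln(x); the intercept is ln(b0), the slope is b1
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
b0, b1 = np.exp(intercept), slope
print(round(b0, 2), round(b1, 2))
```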
- Compound models.
Y = b0 * (b1**x). Given positive b0, if b1 is
greater than 1 the slope is upward; if b1 is between 0 and 1, downward. Also, ln(Y) =
ln(b0) + (ln(b1) * x). Below b0 is 4.260 and b1 is 1.05, reported in
the "Model Summary and Parameter Estimates" table:
- S-curve models.
Y = e**(b0 + (b1/x)), where e is
the base of the natural logarithm. If b1 is negative, the curve rises
toward the asymptote e**b0; if b1 is positive, it falls. Also, ln(Y) = b0 + (b1/x). Below b0 is
2.009 and b1 is -.331, reported in the "Model Summary and Parameter
Estimates" table:
- Logistic models.
Y = 1 / (1/u + (b0 * (b1**x)))
where u is the upper boundary value. After selecting Logistic, specify
the upper boundary value to use in the regression equation. The value
must be a positive number that is greater than the largest dependent
variable value. Given positive b0 and b1, if b1 is less than 1 the slope is
upward; if greater than 1, downward. Also, ln((1/Y) - (1/u)) = ln(b0) + (ln(b1) * x). Below b0 is .113
and b1 is .822 in the "Model Summary and Parameter Estimates" table
output.
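With a known upper bound u, the logistic model linearizes as ln((1/Y) - (1/u)) = ln(b0) + ln(b1)*x, so it too can be fitted by OLS on a transformed dependent. A Python sketch on made-up, noise-free data (u must exceed the largest Y value):

```python
import numpy as np

u = 10.0                              # known upper boundary value
b0_true, b1_true = 0.11, 0.82
x = np.arange(1., 9.)
y = 1.0 / (1.0 / u + b0_true * b1_true**x)   # logistic model data

# Regress ln(1/Y - 1/u) on x; intercept = ln(b0), slope = ln(b1)
slope, intercept = np.polyfit(x, np.log(1.0 / y - 1.0 / u), 1)
b0, b1 = np.exp(intercept), np.exp(slope)
print(round(b0, 3), round(b1, 3))
```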
- Growth models.
Y = e**(b0 + (b1 * x)). If b1 is
negative, the slope is downward; if positive, upward. Also ln(Y) = b0 +
(b1 * x). Below b0 is 1.449 and b1 is .100 in the "Model Summary and
Parameter Estimates" table in output.
- Exponential models.
Y = b0 * (e**(b1 * x)). Given positive b0, if b1
is negative, the slope is downward; if positive, upward. Also ln(Y) =
ln(b0) + (b1 * x). Below b0 is 4.260 and b1 is .100 in the "Model
Summary and Parameter Estimates" table in output.
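The exponential model linearizes as ln(Y) = ln(b0) + b1*x, a semilog regression. A Python sketch on made-up, noise-free data:

```python
import numpy as np

x = np.arange(0., 10.)
y = 4.26 * np.exp(0.10 * x)         # exponential model with b0 = 4.26, b1 = 0.10

# Regress ln(Y) on x; the slope is b1, the intercept is ln(b0)
b1, log_b0 = np.polyfit(x, np.log(y), 1)
b0 = np.exp(log_b0)
print(round(b0, 2), round(b1, 2))
```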
- Statistics
. PASW/SPSS statistical output for the Curve Estimation module includes these:
- Comparative fit plots
of the type above can be displayed to
compare any of the supported models. For example, for the data in the
foregoing models, the table and plot below compare a linear model with
an inverse model:
Example 2. A second example uses the PASW/SPSS sample dataset, virus.sav, which tracks the spread of a computer virus in messages over time. A comparison of the linear model with the quadratic model shows an even more marked contrast:
- Regression coefficients are the b0 (constant) and other b terms in the model equations listed above.
- R2 measures , including multiple R, R-square, adjusted R-square, and standard error of the estimate. R-square is interpreted as the percent of variance in the dependent explained by the model. The "sig" column gives an F-test of the overall significance of the model. If the significance shown in the "Model Summary and Parameter Estimates" table for a given model (ex., Inverse) is, say, .032, this means there is a 3.2% chance that if a different random sample were taken, one would get an R2 as strong or stronger simply by chance of random sampling. Since this 3.2% chance of Type I error (false positives) is less than the customary 5% level, the researcher concludes that the computed R2 is significant (truly different from 0).
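The F statistic behind the "sig" column can be computed from R-square directly: for a model with k predictors and n cases, F = (R2/k) / ((1 - R2)/(n - k - 1)), with (k, n - k - 1) degrees of freedom. A sketch with hypothetical values:

```python
# Hypothetical sample size, predictor count, and R-square
n, k, r2 = 30, 1, 0.40

# Overall-model F test: explained variance per df over unexplained per df
f = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f, 2))  # compared against F(1, 28); p here is well under .05
```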
- Analysis of variance table
. If the "Display ANOVA table" checkbox is checked (not
the default) in PASW/SPSS, the "Model Summary" table will contain
R-square and the standard error of estimate but not the parameter
estimates nor the comparison across models. Rather, one will get
separate ANOVA output for each model. That for the quadratic model in
the computer virus example is illustrated below.
The "Model Summary" table is followed by the "ANOVA" table, which contains the regression, residual, and total sums of squares used in computing the F test of overall model significance, along with the significance level. Then the "Coefficients" table contains the unstandardized B coefficient for the one independent, its standard error, its standardized (beta weight) value, and a t test of its significance with the significance level. The constant, its standard error, and its significance are also reported. In the case of a quadratic model, illustrated below, the one independent is entered twice on the predictor side, once as hours and again as hours-squared, for this example.
- Save variables . The "Save" button in PASW/SPSS Curve Estimation allows the researcher to save predicted values, residuals, and prediction intervals (upper and lower bounds) back to the dataset for further analysis - for instance, for residual analysis, perhaps using the menu choice Graphs, Legacy Dialogs, Scatter/Dot to plot residuals on the Y axis against the dependent on the X axis (as one example).
- Data dimensions . In PASW/SPSS Curve Estimation, only a single independent (or a time variable) predicting a single dependent can be modeled, meaning that only two-dimensional curves may be fitted. Other curve-fitting software supports three-dimensional curve-fitting.
- Data level . All models require quantitative dependent and independent variables. If both independent and dependent are dichotomous, the fit line will be linear even when a nonlinear (ex., quadratic) fit is requested; in such a case the linear and quadratic solutions will be identical. If one variable is dichotomous and the other is continuous, regardless of causal direction, the linear and nonlinear fit lines will not necessarily be identical.
- Randomly distributed residuals characterize well-fitting models.
- Independence . Observations should be independent.
- Linear models require multivariate normality (normal distribution of the dependent for each value of the independent or combinations of independent values). Also, the dependent must have constant variance across the ranges of the independent variables. The dependent must be related to the independent variables in a linear manner.
- Validation . While selecting the model with the highest R-squared is tempting, it is not the recommended method. For instance, a cubic model will always have a higher R-squared than a quadratic model. The recommended method for selecting which model is best is cross-validation. That is, the formulas for each model based on the estimation dataset are applied to the hold-out dataset, then the R-squares are compared based on output for the hold-out dataset. Alternatively, the determination may be made graphically by overlaying sequence plots of both models for the hold-out dataset.
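The cross-validation procedure just described can be sketched in Python with numpy: fit each candidate model on an estimation half, then compare R-squares computed on the hold-out half. The data and the split are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 80)
y = 4.0 + 1.4 * x - 0.14 * x**2 + rng.normal(scale=0.3, size=x.size)  # truly quadratic

idx = rng.permutation(x.size)
est, hold = idx[:40], idx[40:]          # estimation half / hold-out half

def holdout_r2(degree):
    coefs = np.polyfit(x[est], y[est], degree)   # fit on the estimation half only
    pred = np.polyval(coefs, x[hold])            # apply the formula to the hold-out half
    ss_res = np.sum((y[hold] - pred) ** 2)
    ss_tot = np.sum((y[hold] - y[hold].mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Compare quadratic vs. cubic on hold-out data rather than training data
print(round(holdout_r2(2), 3), round(holdout_r2(3), 3))
```

On training data the cubic's R-square can never be lower than the quadratic's; on hold-out data the extra cubic term earns no such free advantage.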
- Other . Other assumptions are discussed separately according to model, as in the linear regression or logistic regression sections.
- How can I significance test the difference between the R2's of two models in a single sample?
- Unfortunately, PASW/SPSS does not support this obvious need. However, the Steiger and Browne (1984) procedure for comparing interdependent correlations (see the references below) may be adapted for this purpose.
- I want to use, from the Curve Estimation module, the
two best functions of my independent in a regression equation, but will
this introduce multicollinearity?
- Yes, if you have two related terms like x and x**2 in the same equation, multicollinearity is likely, since powers of the same variable tend to be highly correlated.
- What other curve-fitting packages are available?
- There are many, including many with richer input and output options than the PASW/SPSS Curve Estimation module. These include SigmaPlot and TableCurve 2D and 3D, also from SigmaPlot.com. CurveExpert will fit some 30 different models. DataFit supports hundreds of two- and three-dimensional models. LabFit is another package supporting hundreds of functions for two- and three-dimensional curve fitting. And there are many more.
- What is the command structure if I prefer to use the syntax window rather than the menu system in PASW/SPSS?
- The command takes this form:
TSET MXNEWVAR=4.
CURVEFIT
/VARIABLES=accident WITH age
/CONSTANT
/MODEL=LINEAR LOGARITHMIC
/PRINT ANOVA
/PLOT FIT
/SAVE=PRED RESID .
The TSET command sets aside space for the 4 new variables created by the SAVE subcommand (a predicted-values variable and a residuals variable for each of the two models). The VARIABLES subcommand asks that number of accidents be predicted from age. The CONSTANT subcommand requires a constant to be in the equation. MODEL requests that both a linear and a logarithmic model be fitted. PRINT ANOVA puts an ANOVA table in the output. PLOT FIT produces a plot of number of accidents on the Y axis against age on the X axis, with points representing observed values and lines for the linear and logarithmic fit curves. The SAVE subcommand saves variables back to the dataset.
The full general syntax is as follows:
CURVEFIT VARIABLES= varname [WITH varname]
[/MODEL= [LINEAR**] [LOGARITHMIC] [INVERSE]
[QUADRATIC] [CUBIC] [COMPOUND]
[POWER] [S] [GROWTH] [EXPONENTIAL]
[LGSTIC] [ALL]]
[/CIN={95** }]
{value}
[/UPPERBOUND={NO**}]
{n }
[/{CONSTANT† }
{NOCONSTANT}
[/PLOT={FIT**}]
{NONE }
[/ID = varname]
[/PRINT=ANOVA]
[/SAVE=[PRED] [RESID] [CIN]]
[/APPLY [='model name'] [{SPECIFICATIONS}]]
{FIT }
**Default if the subcommand is omitted.
†Default if the subcommand is omitted and there is no corresponding specification on the TSET command.
- Daniel, Cuthbert and Fred S. Wood (1999). Fitting equations to data: Computer analysis of multifactor data, 2nd edition. NY: Wiley-Interscience. A leading text on curve estimation, going beyond the capabilities of the PASW/SPSS Curve Estimation module.
- Steiger, J.H. & Browne, M.W. (1984). The comparison of interdependent correlations between optimal linear composites. Psychometrika , 49, 11-21.