Hypothesis Testing (review)
 Steps in Hypothesis Testing
 State the null hypothesis (H_{0}: the hypothesis to be tested)
 Select a test statistic
 Specify the level of significance
 State the decision rule for the hypothesis
 Collect the sample and calculate statistics
 Make the statistical decision whether to reject H_{0} and conclude the alternative hypothesis (Ha)
 Make the economic or investment decision based on the test results
 Two-tailed test H0: μ = 0 versus Ha: μ ≠ 0
 One-tailed test H0: μ ≤ 0 (or μ ≥ 0) versus Ha: μ > 0 (or μ < 0)
 Significance level (α): probability of making a Type I error (rejecting H_{0} when it is true)
 For α = 5%, a two-tailed test places 2.5% in each tail.
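The two-tailed critical value can be found with only the Python standard library; a minimal sketch assuming a 5% significance level and a large-sample (standard normal) test statistic:

```python
from statistics import NormalDist

# Two-tailed test at alpha = 5%: 2.5% in each tail of the
# standard normal distribution (large-sample z-test assumed).
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # upper critical value
print(round(z_crit, 2))  # 1.96
```

The decision rule is then: reject H0 if the computed test statistic falls outside ±1.96.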
Correlation
 Scatter Plot: a collection of points on a graph where each point represents the values of two variables
 Correlation inaccuracy arises when:
 Outliers: a few extreme values among the sample observations
 Spurious correlation: appearance of a linear relationship when, in fact, there is none
 t = (r × √(n − 2)) / √(1 − r^{2})
 df = n − 2
 if −t_{c} ≤ t ≤ +t_{c}, the null hypothesis cannot be rejected
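The correlation significance test above can be computed by hand; a sketch on hypothetical observations (the data are made up for illustration):

```python
import math

# Hypothetical sample observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
n = len(x)

# Sample correlation coefficient r.
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = sxy / math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))

# t-statistic with df = n - 2; compare against the critical t value.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
```

For this strongly correlated sample, t is far above any conventional critical value, so the null of zero correlation is rejected.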
Linear Regression
A dependent variable is explained in terms of an independent variable
 Assumptions held:
 A linear relationship exists between the dependent and independent variable.
 The independent variable is uncorrelated with the residuals.
 The expected value of the residual term is zero.
 The variance of the residual term is constant for all observations.
 The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation.
 The residual term is normally distributed.
 The regression line is chosen so that the sum of the squared differences (vertical distances) between the actual Y-values and the predicted values Ŷ is minimized.
 Y (dependent variable) = b0 (intercept) + b1 (slope coefficient) × X (independent variable) + ε (error term)
 Standard Error of Estimate (SEE) measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The SEE gauges the "fit" of the regression line. The smaller the standard error, the better the fit. The SEE is the standard deviation of the error terms in the regression. As such, SEE is also referred to as the standard error of the residual, or standard error of the regression.
 The coefficient of determination (R2) is defined as the percentage of the total variation in the dependent variable explained by the independent variable.
 The confidence interval for the regression coefficient, b1, is calculated as: estimated b1 ± (critical t-value × standard error of b1)
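The interval is just b1 ± t_c × s_b1; a sketch with hypothetical numbers (the slope estimate, its standard error, and the critical t of 2.101 for 18 df at the 5% level are illustrative):

```python
# Hypothetical estimates: slope 0.76, standard error 0.33,
# two-tailed critical t of 2.101 (5% significance, 18 df).
b1_hat, s_b1, t_crit = 0.76, 0.33, 2.101
ci_low = b1_hat - t_crit * s_b1
ci_high = b1_hat + t_crit * s_b1
```

Because the interval here excludes zero on neither side check alone, the usual reading is: if zero lies inside (ci_low, ci_high), the slope is not statistically significant at that level.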
 Analysis of variance (ANOVA) is a statistical procedure for analyzing the total variability of a data set. Output of an ANOVA table consists of:
 Total sum of squares (SST) measures the total variation in the dependent variable.
 Regression sum of squares (RSS) measures the variation in the dependent variable explained by the independent variable.
 Sum of squared errors (SSE) measures the unexplained variation in the dependent variable.
 Thus, total variation = explained variation + unexplained variation, or SST = RSS + SSE
 Source of Variation | Degrees of Freedom | Sum of Squares | Mean Square
 Regression (explained) | k = 1 | RSS | MSR = RSS/k
 Error (unexplained) | n − 2 | SSE | MSE = SSE/(n − 2)
 Total | n − 1 | SST |
 F = MSR/MSE
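The decomposition SST = RSS + SSE, along with R², SEE, and F, can be verified numerically; a minimal sketch of simple OLS on hypothetical data:

```python
import math

# Hypothetical data for a simple linear regression.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

# OLS slope and intercept.
mx, my = sum(x) / n, sum(y) / n
b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
      / sum((a - mx) ** 2 for a in x))
b0 = my - b1 * mx
y_hat = [b0 + b1 * a for a in x]

sst = sum((b - my) ** 2 for b in y)                 # total variation
rss = sum((f - my) ** 2 for f in y_hat)             # explained variation
sse = sum((b - f) ** 2 for b, f in zip(y, y_hat))   # unexplained variation

r2 = rss / sst                        # coefficient of determination
see = math.sqrt(sse / (n - 2))        # standard error of estimate
f_stat = (rss / 1) / (sse / (n - 2))  # F = MSR / MSE, with k = 1
```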
 Limitations of regression analysis include the following:
 Parameter instability: an estimation equation based on data from a specific time period may not be relevant for forecasts or predictions in another time period
 Limited usefulness: a relationship's usefulness in investment analysis will be limited if other market participants are also aware of and act on this evidence
 If the assumptions of regression analysis are not valid, the interpretation and tests of hypotheses are not valid.
Multiple Regression
 The multiple regression equation specifies a dependent variable as a linear function of two or more independent variables:
 Yi = b_{0} + b_{1}X_{1i} + b_{2}X_{2i} + … + b_{k}X_{ki} + ε_{i}
 t-statistic with n − (k + 1) degrees of freedom
 t = estimated coefficient / standard error of the coefficient (for H_{0}: coefficient = 0)
 p-value: the smallest level of significance for which the null hypothesis can be rejected. An alternative method of doing hypothesis testing of the coefficients is to compare the p-value to the significance level (α):
 p-value < α, H_{0} rejected.
 p-value > α, H_{0} not rejected.
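A minimal sketch of the p-value decision rule, using a large-sample normal approximation from the Python standard library (the test statistic and α are hypothetical):

```python
from statistics import NormalDist

t_stat = 2.31  # hypothetical coefficient t-statistic (large sample)
alpha = 0.05

# Two-sided p-value from the standard normal approximation.
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
reject_h0 = p_value < alpha
```

Here the p-value (about 0.02) is below α = 0.05, so the null is rejected.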
 Assumptions of the multiple regression model:
 A linear relationship exists between the dependent and independent variables.
 The independent variables are not random, and there is no exact linear relation between any two or more independent variables.
 The expected value of the error term, conditional on the independent variables, is zero [i.e., E(ε) = 0].
 The variance of the error terms is constant for all observations.
 The error term for one observation is not correlated with that of another observation [i.e., E(εiεj) = 0, j ≠ i the errors are not serially correlated].
 The error term is normally distributed.
 Adjusted R^{2} is always less than or equal to R^{2} (it penalizes the addition of independent variables that do not improve the fit)
 Dummy variables: independent variables that are binary in nature (either on or off). Data are assigned a value of "0" or "1." In many cases, the dummy variable concept is applied to quantify the impact of a qualitative variable.
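A sketch of dummy-variable encoding for quarterly seasonality (the data are hypothetical); with four categories only three dummies are used, since including all four alongside an intercept would create perfect multicollinearity:

```python
# Quarter of each hypothetical observation.
quarters = [1, 2, 3, 4, 1, 2]

# Three 0/1 dummies; quarter 1 is the omitted (base) category.
rows = [{"Q2": int(q == 2), "Q3": int(q == 3), "Q4": int(q == 4)}
        for q in quarters]
```

The coefficient on each dummy then measures the average difference of that quarter relative to the omitted base quarter.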
Issues in Multiple Regression
 Heteroskedasticity is the situation in which the variance of the residuals is not constant across all observations.
 Unconditional heteroskedasticity occurs when the heteroskedasticity is not correlated with the independent variables. While this is a violation of the equal variance assumption, it causes no major problems in most cases.
 Conditional heteroskedasticity is heteroskedasticity that is correlated with the values of the independent variables. Conditional heteroskedasticity does create significant problems for statistical inference.
 Effects:
 The standard errors are usually smaller than they would be in the absence of heteroskedasticity.
 The coefficient estimates aren’t affected.
 Because the standard errors are too small, but the coefficient estimates themselves are not affected, the t-tests (coefficient estimate / standard error) are too large and the null hypothesis of no statistical significance is rejected too often. Recall from Level 1 that incorrectly rejecting a true null hypothesis is a Type I error.
 The F-test is also unreliable.
 Autocorrelation (serial correlation): situation in which the residual terms are correlated with one another
 Positive serial correlation exists when a positive regression error in one time period increases the probability of observing a positive regression error for the next time period. Positive serial correlation typically has the same effect as heteroskedasticity. The coefficient standard errors are too small, even though the estimated coefficients are accurate. These small standard error terms will cause the computed t-statistics to be larger than they should be, which will cause too many Type I errors: the rejection of the null hypothesis when it is actually true. The F-test will also be too high.
 Negative serial correlation occurs when a positive error in one period increases the probability of observing a negative error in the next period. Negative serial correlation typically causes the standard errors to be too large, which leads to t-statistics that are too small. This will cause us to fail to reject the null hypothesis when it is actually false, resulting in too many Type II errors.
 Multicollinearity refers to the condition under which a high correlation exists among two or more of the independent variables in a multiple regression. This condition distorts the standard error of estimate, which distorts the coefficient standard errors, leading to problems when conducting t-tests for statistical significance of parameters.
 Detected by a significant F-statistic and a high R2 but insignificant t-statistics on individual coefficients. It is corrected by removing one or more of the correlated independent variable(s), but it is sometimes difficult to identify the source of the collinearity.
 Model misspecification: ways in which the regression model can be specified incorrectly.
 The functional form can be misspecified.
 Important variables are omitted.
 Variables should be transformed.
 Data is improperly pooled.
 Explanatory variables are correlated with the error term in time series models.
 A lagged dependent variable is used as an independent variable.
 A function of the dependent variable is used as an independent variable ("forecasting the past").
 Independent variables are measured with error.
 Other time-series misspecifications that result in nonstationarity.
 With misspecification, the regression coefficients are often biased and/or inconsistent, which means we can’t have any confidence in our hypothesis tests of the coefficients or in the predictions of the model.
 Probit and logit models: A probit model is based on the normal distribution, while a logit model is based on the logistic distribution. Application of these models results in estimates of the probability that the event occurs (e.g., probability of default). The maximum likelihood methodology is used to estimate coefficients for probit and logit models. These coefficients relate the independent variables to the likelihood of an event occurring, such as a merger, bankruptcy, or default.
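A hedged sketch of how a fitted logit model maps a linear score into a probability of the event via the logistic function (the coefficients and input value are hypothetical, not estimated):

```python
import math

# Hypothetical fitted logit coefficients and one observation.
b0, b1 = -2.0, 0.8
leverage = 3.0  # hypothetical independent variable

score = b0 + b1 * leverage          # linear score
prob = 1 / (1 + math.exp(-score))   # P(event) under the logit model
```

A probit model would instead pass the same linear score through the standard normal CDF.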
 Discriminant models: Discriminant models are similar to probit and logit models but make different assumptions regarding the independent variables. Discriminant analysis results in a linear function similar to an ordinary regression which generates an overall score, or ranking, for an observation. The scores can then be used to rank or classify observations. A popular application of a discriminant model makes use of financial ratios as the independent variables to predict the qualitative dependent variable bankruptcy. A linear relationship among the independent variables produces a value for the dependent variable that places a company in a bankrupt or not bankrupt class.
 The economic meaning of the results of a regression estimation focuses primarily on the slope coefficients. The slope coefficients indicate the change in the dependent variable for a one unit change in the independent variable. The individual slope coefficients can then be interpreted as an elasticity measure.
 Critique: there is the possibility of having a relationship that has statistical significance without having any economic significance. For instance, a study of dividend announcements may identify a statistically significant abnormal return associated with the announcement, but these returns may not be sufficient to cover transaction costs.
Time Series Analysis
 A linear trend model may be appropriate if the data points seem to be equally distributed above and below the line and the mean is constant. Inflation rate data can often be modeled with a linear trend model.
 The log-linear model is best for a data series that exhibits a trend or for which the residuals are correlated or predictable or the mean is nonconstant. Financial data (e.g., stock indices and stock prices) and company sales data are often best modeled with a log-linear model. In addition, any data that has seasonality is a candidate for a log-linear model.
 Conditions for Covariance Stationarity:
 Constant and finite expected value: The expected value of the time series is constant over time.
 Constant and finite variance: The time series’ volatility around its mean (i.e., the distribution of the individual observations around the mean) does not change over time.
 Constant and finite covariance with leading or lagged values: The covariance of the time series with leading or lagged values of itself is constant.
 The degree of nonstationarity depends on the length of the series and the underlying economic and market environment and conditions
Autoregressive model AR(p): the dependent variable is regressed against lagged values of itself
 x_{t} = b_{0} + b_{1}x_{t−1} + b_{2}x_{t−2} + ... + b_{p}x_{t−p} + ε_{t}
 order p: the number of lagged periods included
 Procedure for testing serial correlation for AR timeseries model:
 Estimate the AR model being evaluated using linear regression.
 Calculate the autocorrelations of the model’s residuals.
 Test whether the autocorrelations are significant.
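The three steps above can be sketched in Python on a hypothetical series (OLS fit by hand, then the lag-1 residual autocorrelation; the formal t-test of the autocorrelation's significance is omitted):

```python
# Hypothetical time series.
x = [1.0, 1.4, 1.1, 1.6, 1.3, 1.8, 1.5, 2.0, 1.7, 2.2]

# Step 1: estimate AR(1) by OLS, regressing x_t on x_{t-1}.
y = x[1:]      # x_t
lag = x[:-1]   # x_{t-1}
n = len(y)

my, ml = sum(y) / n, sum(lag) / n
b1 = (sum((a - ml) * (b - my) for a, b in zip(lag, y))
      / sum((a - ml) ** 2 for a in lag))
b0 = my - b1 * ml

# Step 2: compute the model's residuals.
resid = [b - (b0 + b1 * a) for a, b in zip(lag, y)]

# Step 3: lag-1 autocorrelation of the residuals (test its
# significance against its standard error in practice).
m = sum(resid) / len(resid)
num = sum((resid[i] - m) * (resid[i - 1] - m)
          for i in range(1, len(resid)))
den = sum((e - m) ** 2 for e in resid)
rho1 = num / den
```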
 Mean reversion: tendency of a time series to move toward its mean
 the series tends to decline when above its mean
 and tends to increase when below its mean
 the model predicts the next value to equal the current value when the series is at its mean-reverting level, x = b_{0} / (1 − b_{1})
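A quick check with hypothetical AR(1) coefficients: at the mean-reverting level b0 / (1 − b1), the one-step forecast equals the current value:

```python
# Hypothetical AR(1): x_t = 1.2 + 0.4 * x_{t-1} + eps_t
b0, b1 = 1.2, 0.4
level = b0 / (1 - b1)       # mean-reverting level
forecast = b0 + b1 * level  # expected next value when at the level
```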
 In-sample forecasts are made within the range of data used in the estimation.
 Out-of-sample forecasts are made outside of the time period for the data used in the estimation.
 Root mean squared error criterion (RMSE): used to compare the accuracy of autoregressive models in forecasting out-of-sample values. The model with the lower RMSE has lower forecast error and is expected to have better predictive power in the future.
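A sketch of the RMSE computation on hypothetical out-of-sample actuals and forecasts:

```python
import math

# Hypothetical out-of-sample actual values and model forecasts.
actual = [2.0, 2.5, 3.0, 3.5]
forecast = [2.2, 2.4, 3.1, 3.3]

# RMSE: square root of the mean squared forecast error.
rmse = math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast))
                 / len(actual))
```

When comparing two candidate models, the one with the lower out-of-sample RMSE is preferred.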
 Nonstationarity (instability)
 longer time periods increase statistical reliability
 shorter time periods increase the likelihood of stability (stationarity)
 the sample period for a time series should consider changes in the economic environment
 Random walk: a time series in which the predicted value of the series in one period is equal to the value of the series in the previous period, plus a random error.
 Simple random walk process x_{t} = x_{t−1} + ε_{t}, where the best forecast of x_{t} is x_{t−1}, subject to the conditions that:
 The expected value of the error terms is zero.
 The variance of the error term is homoskedastic (i.e., constant).
 There is no serial correlation in the error terms.
 Random walk with a drift: x_{t} = b_{0} + b_{1}x_{t−1} + ε_{t}, where b_{0} ≠ 0 and b_{1} = 1
 not covariance stationary, since the mean-reverting level b_{0} / (1 − b_{1}) = b_{0} / 0 is undefined
 First-differencing process: used to transform a time series with a unit root to achieve covariance stationarity
 subtract the time series value in the immediately preceding period from the current value
 The firstdifferenced observation in period t:
 y_{t} = x_{t} − x_{t−1} = ε_{t}
 first-difference of the random walk model: y_{t} = b_{0} + b_{1}y_{t−1} + ε_{t}, where b_{0} = b_{1} = 0
 the transformed time series has a finite mean-reverting level of b_{0} / (1 − b_{1}) = 0 / (1 − 0) = 0, so it is covariance stationary.
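Differencing a simulated random walk illustrates the transformation (a sketch; the simulated series and seed are arbitrary):

```python
import random

random.seed(0)

# Simulate a random walk: x_t = x_{t-1} + eps_t.
x = [0.0]
for _ in range(200):
    x.append(x[-1] + random.gauss(0, 1))

# First difference: y_t = x_t - x_{t-1}, which recovers eps_t,
# a covariance stationary white-noise series.
y = [x[t] - x[t - 1] for t in range(1, len(x))]
```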
 Covariance stationarity test method
 Plot the data.
 Run an AR model and examine autocorrelations.
 Perform the Dickey–Fuller test.
 Seasonality: a pattern within the year that tends to repeat from year to year; examination of the residual autocorrelations is needed to detect it
 Autoregressive conditional heteroskedasticity (ARCH) exists if the variance of the residuals in one time period within a time series is dependent on the variance of the residuals in another period.
 The standard errors of the regression coefficients in AR models, and the hypothesis tests of those coefficients, are invalid.
 ARCH model: used to test for autoregressive conditional heteroskedasticity.
 ARCH(1) time series: the variance of the residuals in one period is dependent on the variance of the residuals in the preceding period
 To test whether the two time series have unit roots, the analyst first runs separate DF tests with five possible results:
 Both time series are covariance stationary (i.e., neither has a unit root). Linear regression can be used.
 Only the dependent variable time series is covariance stationary (i.e., independent variable time series has a unit root). Linear regression cannot be used.
 Only the independent variable time series is covariance stationary (i.e., dependent variable time series has a unit root). Linear regression cannot be used.
 Neither time series is covariance stationary (i.e., both have unit roots) and the two series are not cointegrated. Linear regression cannot be used.
 Neither time series is covariance stationary (i.e., both have unit roots) and the two series are cointegrated. Linear regression can be used.
 Cointegration means that two time series are economically linked or follow the same trend, and that relationship is not expected to change. If two time series are cointegrated, the error term from regressing one on the other is covariance stationary and the t-tests are reliable.
 Choice of a particular timeseries model
 Determine your goal – cointegrated time series, cross-sectional multiple regression, or a trend model?
 Plot the values of the variable over time and look for characteristics that would indicate nonstationarity.
 If there is no seasonality or structural shift, use a trend model. If the data plot on a straight line with an upward or downward slope, use a linear trend model. If the data plot in a curve, use a log-linear trend model.
 Run the trend analysis, compute the residuals, and test for serial correlation using the Durbin–Watson test. If there is no serial correlation, use the model. If there is serial correlation, use another model.
 If the data has serial correlation, reexamine the data for stationarity before running an AR model. If it is not stationary, treat the data for use in an AR model as follows:
 If the data has a linear trend, first-difference the data.
 If the data has an exponential trend, first-difference the natural log of the data.
 If there is a structural shift in the data, run two separate models as discussed above.
 If the data has a seasonal component, incorporate the seasonality in the AR model as discussed below.
 Run an AR(1) model and test for serial correlation and seasonality. If no serial correlation, use the model. If serial correlation, incorporate lagged values of the variable.
 Test for ARCH. If the coefficient is not significantly different from zero, stop. You can use the model. If the coefficient is significantly different from zero, ARCH is present. Correct using generalized least squares.
 If you have developed two statistically reliable models and want to determine which is better at forecasting, calculate their out-of-sample RMSE.
