Hypothesis Testing (review)
 Steps in Hypothesis Testing
 State the null hypothesis (H_{0}: the hypothesis to be tested)
 Select a test statistic
 Specify the level of significance
 State the decision rule for the hypothesis
 Collect the sample and calculate statistics
 Make the statistical decision whether to reject H_{0} and conclude the alternative hypothesis (Ha)
 Make the economic or investment decision based on the test results
 Two-tailed test H0: μ = 0 versus Ha: μ ≠ 0
 One-tailed test H0: μ ≤ 0 (or μ ≥ 0) versus Ha: μ > 0 (or μ < 0)
 Significance level (α): probability of making a Type I error (rejecting H_{0} when it is true)
 For α = 5%, a two-tailed test places 2.5% in each tail.
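The two-tailed critical value can be found with only the Python standard library; a minimal sketch assuming a 5% significance level and a large-sample (standard normal) test statistic:

```python
from statistics import NormalDist

# Two-tailed test at alpha = 5%: 2.5% in each tail of the
# standard normal distribution (large-sample z-test assumed).
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # upper critical value
print(round(z_crit, 2))  # 1.96
```

The decision rule is then: reject H0 if the computed test statistic falls outside ±1.96.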
Correlation
 Scatter Plot: a collection of points on a graph where each point represents the values of two variables
 Correlation inaccuracy arises when:
 Outliers: a few extreme values among the sample observations
 Spurious correlation: appearance of a linear relationship when, in fact, there is none
 t = (r × √(n − 2)) / √(1 − r^{2})
 df = n − 2
 if −t_{c} ≤ t ≤ +t_{c}, the null hypothesis cannot be rejected
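The correlation significance test above can be computed by hand; a sketch on hypothetical observations (the data are made up for illustration):

```python
import math

# Hypothetical sample observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
n = len(x)

# Sample correlation coefficient r.
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = sxy / math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))

# t-statistic with df = n - 2; compare against the critical t value.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
```

For this strongly correlated sample, t is far above any conventional critical value, so the null of zero correlation is rejected.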
Linear Regression
A dependent variable is explained in terms of an independent variable
 Assumptions held:
 A linear relationship exists between the dependent and independent variable.
 The independent variable is uncorrelated with the residuals.
 The expected value of the residual term is zero.
 The variance of the residual term is constant for all observations.
 The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation.
 The residual term is normally distributed.
 The regression line is chosen so that the sum of the squared differences (vertical distances) between the actual Y-values and the predicted values Ŷ is minimized.
 Y (dependent variable) = b0 (intercept) + b1 (slope coefficient) × X (independent variable) + ε (error term)
 Standard Error of Estimate (SEE) measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The SEE gauges the "fit" of the regression line. The smaller the standard error, the better the fit. The SEE is the standard deviation of the error terms in the regression. As such, SEE is also referred to as the standard error of the residual, or standard error of the regression.
 The coefficient of determination (R2) is defined as the percentage of the total variation in the dependent variable explained by the independent variable.
 The confidence interval for the regression coefficient, b1, is calculated as: estimated b1 ± (critical t-value × standard error of b1)
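The interval is just b1 ± t_c × s_b1; a sketch with hypothetical numbers (the slope estimate, its standard error, and the critical t of 2.101 for 18 df at the 5% level are illustrative):

```python
# Hypothetical estimates: slope 0.76, standard error 0.33,
# two-tailed critical t of 2.101 (5% significance, 18 df).
b1_hat, s_b1, t_crit = 0.76, 0.33, 2.101
ci_low = b1_hat - t_crit * s_b1
ci_high = b1_hat + t_crit * s_b1
```

Because the interval here excludes zero on neither side check alone, the usual reading is: if zero lies inside (ci_low, ci_high), the slope is not statistically significant at that level.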
 Analysis of variance (ANOVA) is a statistical procedure for analyzing the total variability of a data set. Output of an ANOVA table consists of:
 Total sum of squares (SST) measures the total variation in the dependent variable.
 Regression sum of squares (RSS) measures the variation in the dependent variable explained by the independent variable.
 Sum of squared errors (SSE) measures the unexplained variation in the dependent variable.
 Thus, total variation = explained variation + unexplained variation, or SST = RSS + SSE
 Source of Variation | Degrees of Freedom | Sum of Squares | Mean Square
 Regression (explained) | k = 1 | RSS | MSR = RSS/k
 Error (unexplained) | n − 2 | SSE | MSE = SSE/(n − 2)
 Total | n − 1 | SST |
 F = MSR/MSE
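The decomposition SST = RSS + SSE, along with R², SEE, and F, can be verified numerically; a minimal sketch of simple OLS on hypothetical data:

```python
import math

# Hypothetical data for a simple linear regression.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

# OLS slope and intercept.
mx, my = sum(x) / n, sum(y) / n
b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
      / sum((a - mx) ** 2 for a in x))
b0 = my - b1 * mx
y_hat = [b0 + b1 * a for a in x]

sst = sum((b - my) ** 2 for b in y)                 # total variation
rss = sum((f - my) ** 2 for f in y_hat)             # explained variation
sse = sum((b - f) ** 2 for b, f in zip(y, y_hat))   # unexplained variation

r2 = rss / sst                        # coefficient of determination
see = math.sqrt(sse / (n - 2))        # standard error of estimate
f_stat = (rss / 1) / (sse / (n - 2))  # F = MSR / MSE, with k = 1
```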
 Limitations of regression analysis include the following:
 Parameter instability: an estimation equation based on data from a specific time period may not be relevant for forecasts or predictions in another time period
 Limited usefulness: a relationship's usefulness in investment analysis will be limited if other market participants are also aware of and act on this evidence
 If the assumptions of regression analysis are not valid, the interpretation and tests of hypotheses are not valid.
Multiple Regression
 The multiple regression equation specifies a dependent variable as a linear function of two or more independent variables:
 Yi = b_{0} + b_{1}X_{1i} + b_{2}X_{2i} + … + b_{k}X_{ki} + ε_{i}
 t-statistic with n − (k + 1) degrees of freedom
 t = estimated coefficient / standard error of the coefficient (for H_{0}: coefficient = 0)
 p-value: the smallest level of significance for which the null hypothesis can be rejected. An alternative method of doing hypothesis testing of the coefficients is to compare the p-value to the significance level (α):
 p-value < α, H_{0} rejected.
 p-value > α, H_{0} not rejected.
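A minimal sketch of the p-value decision rule, using a large-sample normal approximation from the Python standard library (the test statistic and α are hypothetical):

```python
from statistics import NormalDist

t_stat = 2.31  # hypothetical coefficient t-statistic (large sample)
alpha = 0.05

# Two-sided p-value from the standard normal approximation.
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
reject_h0 = p_value < alpha
```

Here the p-value (about 0.02) is below α = 0.05, so the null is rejected.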
 Assumptions of the multiple regression model:
 A linear relationship exists between the dependent and independent variables.
 The independent variables are not random, and there is no exact linear relation between any two or more independent variables.
 The expected value of the error term, conditional on the independent variables, is zero [i.e., E(ε) = 0].
 The variance of the error terms is constant for all observations.
 The error term for one observation is not correlated with that of another observation [i.e., E(εiεj) = 0, j ≠ i the errors are not serially correlated].
 The error term is normally distributed.
 Adjusted R^{2} is always less than or equal to R^{2} (it penalizes the addition of independent variables that do not improve the fit)
 Dummy variables: independent variables that are binary in nature (either on or off). Data are assigned a value of "0" or "1." In many cases, the dummy variable concept is applied to quantify the impact of a qualitative variable.
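A sketch of dummy-variable encoding for quarterly seasonality (the data are hypothetical); with four categories only three dummies are used, since including all four alongside an intercept would create perfect multicollinearity:

```python
# Quarter of each hypothetical observation.
quarters = [1, 2, 3, 4, 1, 2]

# Three 0/1 dummies; quarter 1 is the omitted (base) category.
rows = [{"Q2": int(q == 2), "Q3": int(q == 3), "Q4": int(q == 4)}
        for q in quarters]
```

The coefficient on each dummy then measures the average difference of that quarter relative to the omitted base quarter.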
Issues in Multiple Regression
 Heteroskedasticity is the situation in which the variance of the residuals is not constant across all observations.
 Unconditional heteroskedasticity occurs when the heteroskedasticity is not correlated with the independent variables. While this is a violation of the equal variance assumption, it causes no major problems in most cases.
 Conditional heteroskedasticity is heteroskedasticity that is correlated with the values of the independent variables. Conditional heteroskedasticity does create significant problems for statistical inference.
 Effects:
 The standard errors are usually smaller than they would be in the absence of heteroskedasticity.
 The coefficient estimates aren’t affected.
 Because the standard errors are too small, but the coefficient estimates themselves are not affected, the t-tests (coefficient estimate / standard error) are too large and the null hypothesis of no statistical significance is rejected too often. Recall from Level 1 that incorrectly rejecting a true null hypothesis is a Type I error.
 The F-test is also unreliable.
 Autocorrelation (serial correlation): situation in which the residual terms are correlated with one another
 Positive serial correlation exists when a positive regression error in one time period increases the probability of observing a positive regression error for the next time period. Positive serial correlation typically has the same effect as heteroskedasticity. The coefficient standard errors are too small, even though the estimated coefficients are accurate. These small standard error terms will cause the computed t-statistics to be larger than they should be, which will cause too many Type I errors: the rejection of the null hypothesis when it is actually true. The F-test will also be too high.
 Negative serial correlation occurs when a positive error in one period increases the probability of observing a negative error in the next period. Negative serial correlation typically causes the standard errors to be too large, which leads to t-statistics that are too small. This will cause us to fail to reject the null hypothesis when it is actually false, resulting in too many Type II errors.
 Multicollinearity refers to the condition under which a high correlation exists among two or more of the independent variables in a multiple regression. This condition distorts the standard error of estimate, which distorts the coefficient standard errors, leading to problems when conducting t-tests for statistical significance of parameters.
 Detected by a significant F-statistic and a high R2 but insignificant t-statistics on individual coefficients. It is corrected by removing one or more of the correlated independent variable(s), but it is sometimes difficult to identify the source of the collinearity.
 Model misspecification: ways in which the regression model can be specified incorrectly.
 The functional form can be misspecified.
 Important variables are omitted.
 Variables should be transformed.
 Data is improperly pooled.
 Explanatory variables are correlated with the error term in time series models.
 A lagged dependent variable is used as an independent variable.
 A function of the dependent variable is used as an independent variable ("forecasting the past").
 Independent variables are measured with error.
 Other time-series misspecifications that result in nonstationarity.
 With misspecification, the regression coefficients are often biased and/or inconsistent, which means we can’t have any confidence in our hypothesis tests of the coefficients or in the predictions of the model.
 Probit and logit models: A probit model is based on the normal distribution, while a logit model is based on the logistic distribution. Application of these models results in estimates of the probability that the event occurs (e.g., probability of default). The maximum likelihood methodology is used to estimate coefficients for probit and logit models. These coefficients relate the independent variables to the likelihood of an event occurring, such as a merger, bankruptcy, or default.
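A hedged sketch of how a fitted logit model maps a linear score into a probability of the event via the logistic function (the coefficients and input value are hypothetical, not estimated):

```python
import math

# Hypothetical fitted logit coefficients and one observation.
b0, b1 = -2.0, 0.8
leverage = 3.0  # hypothetical independent variable

score = b0 + b1 * leverage          # linear score
prob = 1 / (1 + math.exp(-score))   # P(event) under the logit model
```

A probit model would instead pass the same linear score through the standard normal CDF.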
 Discriminant models: Discriminant models are similar to probit and logit models but make different assumptions regarding the independent variables. Discriminant analysis results in a linear function similar to an ordinary regression which generates an overall score, or ranking, for an observation. The scores can then be used to rank or classify observations. A popular application of a discriminant model makes use of financial ratios as the independent variables to predict the qualitative dependent variable bankruptcy. A linear relationship among the independent variables produces a value for the dependent variable that places a company in a bankrupt or not bankrupt class.
 The economic meaning of the results of a regression estimation focuses primarily on the slope coefficients. The slope coefficients indicate the change in the dependent variable for a one unit change in the independent variable. The individual slope coefficients can then be interpreted as an elasticity measure.
 Critique: there is the possibility of having a relationship that has statistical significance without having any economic significance. For instance, a study of dividend announcements may identify a statistically significant abnormal return associated with the announcement, but these returns may not be sufficient to cover transaction costs.
Time Series Analysis
 A linear trend model may be appropriate if the data points seem to be equally distributed above and below the line and the mean is constant. Inflation rate data can often be modeled with a linear trend model.
 The log-linear model is best for a data series that exhibits a trend or for which the residuals are correlated or predictable or the mean is nonconstant. Financial data (e.g., stock indices and stock prices) and company sales data are often best modeled with a log-linear model. In addition, any data that has seasonality is a candidate for a log-linear model.
 Conditions for Covariance Stationarity:
 Constant and finite expected value: The expected value of the time series is constant over time.
 Constant and finite variance: The time series’ volatility around its mean (i.e., the distribution of the individual observations around the mean) does not change over time.
 Constant and finite covariance with leading or lagged values: The covariance of the time series with leading or lagged values of itself is constant.
 The degree of nonstationarity depends on the length of the series and the underlying economic and market environment and conditions
Autoregressive model AR(p): the dependent variable is regressed against lagged values of itself
 x_{t} = b_{0} + b_{1}x_{t−1} + b_{2}x_{t−2} + ... + b_{p}x_{t−p} + ε_{t}
 order p: the number of lagged periods included
 Procedure for testing serial correlation for AR timeseries model:
 Estimate the AR model being evaluated using linear regression.
 Calculate the autocorrelations of the model’s residuals.
 Test whether the autocorrelations are significant.
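The three steps above can be sketched in Python on a hypothetical series (OLS fit by hand, then the lag-1 residual autocorrelation; the formal t-test of the autocorrelation's significance is omitted):

```python
# Hypothetical time series.
x = [1.0, 1.4, 1.1, 1.6, 1.3, 1.8, 1.5, 2.0, 1.7, 2.2]

# Step 1: estimate AR(1) by OLS, regressing x_t on x_{t-1}.
y = x[1:]      # x_t
lag = x[:-1]   # x_{t-1}
n = len(y)

my, ml = sum(y) / n, sum(lag) / n
b1 = (sum((a - ml) * (b - my) for a, b in zip(lag, y))
      / sum((a - ml) ** 2 for a in lag))
b0 = my - b1 * ml

# Step 2: compute the model's residuals.
resid = [b - (b0 + b1 * a) for a, b in zip(lag, y)]

# Step 3: lag-1 autocorrelation of the residuals (test its
# significance against its standard error in practice).
m = sum(resid) / len(resid)
num = sum((resid[i] - m) * (resid[i - 1] - m)
          for i in range(1, len(resid)))
den = sum((e - m) ** 2 for e in resid)
rho1 = num / den
```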
 Mean reversion: tendency of a time series to move toward its mean
 the series tends to decline when above its mean
 and tends to increase when below its mean
 the model predicts the next value to equal the current value when the series is at its mean-reverting level, x = b_{0} / (1 − b_{1})
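A quick check with hypothetical AR(1) coefficients: at the mean-reverting level b0 / (1 − b1), the one-step forecast equals the current value:

```python
# Hypothetical AR(1): x_t = 1.2 + 0.4 * x_{t-1} + eps_t
b0, b1 = 1.2, 0.4
level = b0 / (1 - b1)       # mean-reverting level
forecast = b0 + b1 * level  # expected next value when at the level
```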
 In-sample forecasts are made within the range of data used in the estimation.
 Out-of-sample forecasts are made outside of the time period for the data used in the estimation.
 Root mean squared error criterion (RMSE): used to compare the accuracy of autoregressive models in forecasting out-of-sample values. The model with the lower RMSE has lower forecast error and is expected to have better predictive power in the future.
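A sketch of the RMSE computation on hypothetical out-of-sample actuals and forecasts:

```python
import math

# Hypothetical out-of-sample actual values and model forecasts.
actual = [2.0, 2.5, 3.0, 3.5]
forecast = [2.2, 2.4, 3.1, 3.3]

# RMSE: square root of the mean squared forecast error.
rmse = math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast))
                 / len(actual))
```

When comparing two candidate models, the one with the lower out-of-sample RMSE is preferred.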
 Nonstationarity (instability)
 longer time periods increase statistical reliability
 shorter time periods increase the likelihood of stability (stationarity)
 the sample period for a time series should consider changes in the economic environment
 Random walk: a time series in which the predicted value of the series in one period is equal to the value of the series in the previous period, plus a random error.
 Simple random walk process x_{t} = x_{t−1} + ε_{t}, where the best forecast of x_{t} is x_{t−1}, subject to the conditions that:
 The expected value of the error terms is zero.
 The variance of the error term is homoskedastic (i.e., constant).
 There is no serial correlation in the error terms.
 Random walk with a drift: x_{t} = b_{0} + b_{1}x_{t−1} + ε_{t}, where b_{0} ≠ 0 and b_{1} = 1
 not covariance stationary, since the mean-reverting level b_{0} / (1 − b_{1}) = b_{0} / 0 is undefined
 First-differencing process: used to transform a time series with a unit root to achieve covariance stationarity
 subtract the time series value in the immediately preceding period from the current value
 The firstdifferenced observation in period t:
 y_{t} = x_{t} − x_{t−1} = ε_{t}
 first-difference of the random walk model: y_{t} = b_{0} + b_{1}y_{t−1} + ε_{t}, where b_{0} = b_{1} = 0
 the transformed time series has a finite mean-reverting level of b_{0} / (1 − b_{1}) = 0 / (1 − 0) = 0, so it is covariance stationary.
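Differencing a simulated random walk illustrates the transformation (a sketch; the simulated series and seed are arbitrary):

```python
import random

random.seed(0)

# Simulate a random walk: x_t = x_{t-1} + eps_t.
x = [0.0]
for _ in range(200):
    x.append(x[-1] + random.gauss(0, 1))

# First difference: y_t = x_t - x_{t-1}, which recovers eps_t,
# a covariance stationary white-noise series.
y = [x[t] - x[t - 1] for t in range(1, len(x))]
```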
 Covariance stationarity test method
 Plot the data.
 Run an AR model and examine autocorrelations.
 Perform the Dickey–Fuller test.
 Seasonality: a pattern within the year that tends to repeat from year to year; examination of the residual autocorrelations is needed to detect it
 Autoregressive conditional heteroskedasticity (ARCH) exists if the variance of the residuals in one time period within a time series is dependent on the variance of the residuals in another period.
 The standard errors of the regression coefficients in AR models, and the hypothesis tests of those coefficients, are invalid.
 ARCH model: used to test for autoregressive conditional heteroskedasticity.
 ARCH(1) time series: the variance of the residuals in one period is dependent on the variance of the residuals in the preceding period
 To test whether the two time series have unit roots, the analyst first runs separate DF tests with five possible results:
 Both time series are covariance stationary (i.e., neither has a unit root). Linear regression can be used.
 Only the dependent variable time series is covariance stationary (i.e., independent variable time series has a unit root). Linear regression cannot be used.
 Only the independent variable time series is covariance stationary (i.e., dependent variable time series has a unit root). Linear regression cannot be used.
 Neither time series is covariance stationary (i.e., both have unit roots) and the two series are not cointegrated. Linear regression cannot be used.
 Neither time series is covariance stationary (i.e., both have unit roots) and the two series are cointegrated. Linear regression can be used.
 Cointegration means that two time series are economically linked or follow the same trend, and that relationship is not expected to change. If two time series are cointegrated, the error term from regressing one on the other is covariance stationary and the t-tests are reliable.
 Choice of a particular timeseries model
 Determine your goal – cointegrated time series, cross-sectional multiple regression, or a trend model?
 Plot the values of the variable over time and look for characteristics that would indicate nonstationarity.
 If there is no seasonality or structural shift, use a trend model. If the data plot on a straight line with an upward or downward slope, use a linear trend model. If the data plot in a curve, use a log-linear trend model.
 Run the trend analysis, compute the residuals, and test for serial correlation using the Durbin–Watson test. If there is no serial correlation, use the model. If there is serial correlation, use another model.
 If the data has serial correlation, reexamine the data for stationarity before running an AR model. If it is not stationary, treat the data for use in an AR model as follows:
 If the data has a linear trend, first-difference the data.
 If the data has an exponential trend, first-difference the natural log of the data.
 If there is a structural shift in the data, run two separate models as discussed above.
 If the data has a seasonal component, incorporate the seasonality in the AR model as discussed below.
 Run an AR(1) model and test for serial correlation and seasonality. If no serial correlation, use the model. If serial correlation, incorporate lagged values of the variable.
 Test for ARCH. If the coefficient is not significantly different from zero, stop. You can use the model. If the coefficient is significantly different from zero, ARCH is present. Correct using generalized least squares.
 If you have developed two statistically reliable models and want to determine which is better at forecasting, calculate their out-of-sample RMSE.
