Book: Beyond Multiple Linear Regression, Chapter 1
Counts and binary responses are non-normal, so they can be modelled more naturally by fitting generalised linear models (GLMs) than by linear least squares regression. Linear least squares regression relies on assumptions (independence of observations, a normally distributed response, equal variance). Inference in GLMs uses different assumptions but still relies on the independence assumption.
GLMs help extend least squares methods to non-normal responses. Multi-level models help handle situations where observations are not independent.
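A minimal sketch in R of this idea (the data and variable names are made up for illustration): glm() with a family argument generalises lm() to non-normal responses such as counts.

```r
# Simulated count response: a Poisson GLM is more natural than least squares
set.seed(1)
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))  # hypothetical counts

fit_pois <- glm(y ~ x, family = poisson)  # cf. lm(y ~ x) for a normal response
summary(fit_pois)
```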
Linearity is sufficient for fitting a linear least squares regression model. Inference and prediction also require independent observations, a normally distributed response at each level of the predictors, and equal variance of the response at each level of the predictors. These are the LINE assumptions (Linearity, Independence, Normality, Equal variance). Although you could apply transformations, the book posits that other models may be more suitable when certain assumptions have been violated.
Interpreting plot(model):
- Residuals vs fitted helps check linearity; residuals should be patternless around Y=0
- Normal Q-Q checks normality - deviations from straight line indicate lack of normality
- Scale-location checks equal variance - positive or negative trends indicate variability is not constant
- Residuals vs leverage checks for influential points - you don’t want these, as they can unduly affect estimation of model parameters; they are flagged by high Cook’s distance

Graphical evidence/assessment of assumptions is better than numerical tests (the tests are not very reliable).
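A quick R sketch of these diagnostics (the simulated data are hypothetical): fit with lm(), then calling plot() on the fitted model draws the four panels above.

```r
# Simulated data that satisfy the LINE assumptions
set.seed(2)
x <- rnorm(200)
y <- 3 + 2 * x + rnorm(200)  # linear mean, constant variance, normal errors

model <- lm(y ~ x)

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(model)
```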
Bootstrapping is an alternative approach to inference (rather than normal-theory statistical inference/modelling) that is robust, especially when assumptions are shaky. We use the data we’ve collected and computing power to estimate the uncertainty in the parameter estimates. Core assumption: the original sample represents the larger population, so we can learn about uncertainty in the parameter estimates by repeatedly sampling the original sample with replacement.
The bootstrap procedure (sketched in R after the list):
- Take a bootstrap sample of the data with replacement - case resampling, so all information from an observation stays together
- Fit the model to the bootstrap sample, saving model params
- Repeat the sample-fit process a large number of times
- The bootstrap estimates for the parameters form the bootstrap distribution
- The 95% confidence interval for each parameter is the middle 95% of the bootstrap distribution (this is the percentile method)

There are many alternative methods for resampling and for calculating confidence intervals in bootstrapping.
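A minimal case-resampling bootstrap sketch in R, following the steps above (data simulated for illustration; the boot package provides more polished machinery):

```r
# Case-resampling bootstrap for the slope of a simple linear model
set.seed(3)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 0.8 * dat$x + rnorm(n)

B <- 5000  # number of bootstrap samples
boot_slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)       # resample whole rows (cases)
  coef(lm(y ~ x, data = dat[idx, ]))[2]  # refit and save the slope
})

# Percentile method: middle 95% of the bootstrap distribution
quantile(boot_slopes, c(0.025, 0.975))
```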
Choice of a “final model” depends on many factors, such as the primary research questions, the purpose of modelling, the tradeoff between parsimony and the quality of the fitted model, underlying assumptions, etc. … Subject area knowledge should always play a role in the modelling process … most good models will lead to similar conclusions.
Assessing performance: R^2, adjusted R^2, AIC, BIC (BIC has a greater penalty for extra terms than AIC), and the extra sum of squares F-test, which can be used to perform a significance test on nested models (a full model vs a smaller version of it); see the R sketch below.
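A short R sketch of these comparisons on nested models (simulated data; variable names are illustrative):

```r
# Full model vs a reduced (nested) model
set.seed(4)
x1 <- rnorm(100); x2 <- rnorm(100)
y <- 1 + 2 * x1 + rnorm(100)  # x2 has no true effect
full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)

summary(full)$r.squared      # R^2
summary(full)$adj.r.squared  # adjusted R^2
AIC(full); BIC(full)         # BIC penalises extra terms more heavily
anova(reduced, full)         # extra sum of squares F-test for nested models
```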