Multiple linear regression is a statistical technique for predicting the outcome of one variable from the values of two or more other variables. In this experiment, the goal was to predict housing prices in King County, Washington, USA. The process started with obtaining the data, a dataset of houses sold in 2014 and 2015. The data was cleaned and prepared by removing outliers and redundant variables; the dataset had no missing observations. After this came model selection, where it was determined that a multiple linear regression model would be applied. Because regression comes with assumptions such as normality, no autocorrelation, no multicollinearity, and homoscedasticity, the dataset had to be tested against them and the necessary transformations performed. After splitting the dataset into training and test sets, three training iterations were performed, and the final model emerged with an accuracy of 72%, meaning that it explained about 72% of the variation in the target variable.
Predicting House Prices Using Regression
Multiple linear regression is a statistical technique used to predict a variable’s outcome from the values of two or more other variables. It is sometimes referred to simply as multiple regression and is an extension of simple linear regression. The variable to be predicted is called the dependent variable, while the variables used to make the prediction are called the independent or explanatory variables. Regression allows statisticians to predict the value of a variable using available information about other variables. Simple linear regression uses a straight line to establish how two variables relate.
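In standard textbook notation, the model described above can be written as follows. The symbols are illustrative and are not taken from the analysis itself: y is the dependent variable, the x terms are the k explanatory variables, the β terms are the coefficients to be estimated, and ε is the random error.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i,
\qquad i = 1, \dots, n
```

With k = 1 this reduces to simple linear regression, the straight-line case.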
Multiple linear regression is slightly more advanced than simple linear regression because it attempts to explain the relationship between the dependent variable and multiple regressors. Regression in general does not have to be linear: in non-linear regression, the relationship between the variables does not follow a straight line. Both types of regression model a specific response from two or more variables. However, non-linear regression is typically harder to execute, since its assumptions are often arrived at through trial and error.
Overfitting is a phenomenon where a model performs very well on the training dataset but poorly on unseen data in the field; underfitting, on the other hand, is when the model performs poorly even during training, usually an indicator of poor model choice (Lakshmanan et al., 2020). There are different types of models for different types of data. The simplest of all models is simple linear regression, where a target variable is predicted from a single independent variable using the principles of geometry (Montgomery et al., 2012). This paper focuses on predicting house prices in King County, Washington, USA, using multiple linear regression.
The dataset consists of houses sold in the King County area between 2014 and 2015. The size of the dataset is 16383 houses. Price will be used as the dependent (target) variable. The feature variables for the model include the square footage of the house, square footage of the lot, square footage of the home excluding the basement, square footage of the basement, number of bathrooms, and number of bedrooms. During data cleansing, seven observations were removed for various reasons, such as being outliers, as shown in Figure 1. Another 62 observations were removed for having outlier values in the number of bedrooms. Finally, seven observations were removed for having outlier values in the number of bathrooms. The final dataset had a total of 16252 observations.
The dataset had no missing data. A feature selection process based on correlation analysis was carried out to identify the most important variables. The features removed for being insignificant were Zip code, Date, sqft_lot, long, yr_built, condition, sqft_lot15, and yr_renovated. In the dataset, the categorical variables were bedrooms, bathrooms, view, and floors, while the continuous variables were the size of the house and the target variable.
The correlation analysis in Figure 4 shows a high correlation between price and sqft_living, grade, and sqft_above, at 0.6, 0.7, and 0.6, respectively. Among the independent variables, there was a high correlation between bathrooms and sqft_above and between grade and sqft_living; sqft_living was also highly correlated with sqft_above and bathrooms. These correlations between feature variables were grounds to check for multicollinearity (Montgomery et al., 2012). Among the assumptions of linear regression, normality was studied using Q-Q plots. As Figure 8b shows, some variables had to be transformed to the log scale to satisfy the assumption of linearity.
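The correlation screening described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used in the analysis: the rows are invented toy values rather than King County data, and the 0.5 cut-off and the helper names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented toy rows for illustration only (NOT values from the dataset).
houses = {
    "price":       [221900, 538000, 180000, 604000, 510000, 1225000],
    "sqft_living": [1180, 2570, 770, 1960, 1680, 5420],
    "grade":       [7, 7, 6, 7, 8, 11],
    "floors":      [1, 2, 1, 1, 1, 1],
}

def correlation_with_target(data, target):
    """Correlation of every feature column with the target column."""
    return {
        name: pearson(values, data[target])
        for name, values in data.items()
        if name != target
    }

def screen_features(data, target, threshold=0.5):
    """Keep features whose absolute correlation with the target meets the threshold."""
    correlations = correlation_with_target(data, target)
    return [name for name, r in correlations.items() if abs(r) >= threshold]

correlations = correlation_with_target(houses, "price")
selected = screen_features(houses, "price")
for name, r in correlations.items():
    print(f"{name:12s} r = {r:+.2f}")
print("kept:", selected)
```

The same pearson function applied to pairs of feature columns gives the feature-to-feature correlations that motivate the multicollinearity check.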
After the transformation of the data, a multiple regression model was fit to the dataset. The summary of the model was F(10, 4241) = 4241, p < .01, R² = .72. The equation for the model was:
log(price) = -0.6291 - 0.00978 bedrooms + 0.0074 bathrooms + 0.388 log(sqft_living) - 0.0023 floors + 0.0903 view + 0.1322 grade - 0.0000045 sqft_above + 1.51 lat + 0.40 waterfront + 0.000093 sqft_living15
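Because the target is on the log scale, the coefficients are easier to read as percentage effects on price. The sketch below works through three of them, assuming the logarithm is the natural log (the text does not say which base was used); the function names are illustrative.

```python
import math

# Coefficients copied from the fitted equation.
COEF_LOG_SQFT_LIVING = 0.388  # log-log term: acts as an elasticity
COEF_WATERFRONT = 0.40        # log-level term on a 0/1 indicator
COEF_GRADE = 0.1322           # log-level term on an ordinal score

def pct_effect_log_log(coef, pct_increase):
    """Percent change in price when a log-transformed predictor rises by pct_increase percent."""
    return ((1 + pct_increase / 100) ** coef - 1) * 100

def pct_effect_log_level(coef, delta=1.0):
    """Percent change in price when an untransformed predictor rises by delta units."""
    return (math.exp(coef * delta) - 1) * 100

# All effects hold the other predictors fixed and assume natural logs.
living_effect = pct_effect_log_log(COEF_LOG_SQFT_LIVING, 10)  # 10% more living area
waterfront_effect = pct_effect_log_level(COEF_WATERFRONT)      # waterfront vs. not
grade_effect = pct_effect_log_level(COEF_GRADE)                # one grade point higher

print(f"10% more sqft_living : {living_effect:.1f}% higher price")
print(f"waterfront           : {waterfront_effect:.1f}% higher price")
print(f"+1 grade             : {grade_effect:.1f}% higher price")
```

Under these assumptions, a 10% larger living area corresponds to a price roughly 3.8% higher, and a waterfront location to a price roughly 49% higher, all else being equal.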
The Variance Inflation Factor (VIF) was used to check for multicollinearity. From Table 3, a VIF of 5.7 was observed for the predictor sqft_living, which is reasonably large. This shows that the variable is highly correlated with at least one of the other predictors in the model. In Figure 9, Cook’s distance identified observation 15744 as influential, which led to its removal.
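The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it directly with NumPy on synthetic data; the three columns are invented for illustration and are not the Table 3 predictors, and the paper does not state which software produced its VIF values.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ beta
        centered = target - target.mean()
        r_squared = 1.0 - (residuals @ residuals) / (centered @ centered)
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic predictors (assumption): the first two are strongly related,
# the third is independent of both.
rng = np.random.default_rng(0)
size_a = rng.normal(size=500)
size_b = size_a + 0.3 * rng.normal(size=500)
unrelated = rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([size_a, size_b, unrelated]))
for name, value in zip(["size_a", "size_b", "unrelated"], vifs):
    flag = "  <- above the common cut-off of 5" if value > 5 else ""
    print(f"{name:10s} VIF = {value:6.2f}{flag}")
```

A VIF near 1 means the predictor is nearly uncorrelated with the rest, while values above about 5 are commonly treated as a sign of multicollinearity, which is the cut-off this analysis applies.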
Train-Test Split and Model 1
A ratio of 0.7:0.3 was used to split the data into a training set and a test (validation) set. This is a commonly used ratio in supervised machine learning. The resulting training set had 11379 observations and 11 features. For each of the models discussed below, the summary first gives the F-statistic together with its degrees of freedom. The F-test compares the model being tested with an intercept-only model. The null hypothesis is that the current model fits no better than an intercept-only model, while the alternative hypothesis is that it fits better (Montgomery et al., 2012). The p-value of the F-test indicates the overall significance of the model; if it is significant, one can reject the null hypothesis and conclude that the current model is a better fit than an intercept-only model.
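A 0.7:0.3 split of this kind can be sketched as follows. This is a minimal standard-library illustration: the function name, the fixed seed, and the placeholder rows are assumptions, and the paper does not state how its split was randomized.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    cut = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

# Placeholder rows standing in for house records (assumption).
rows = list(range(1000))
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Shuffling before cutting matters when the file is ordered, for example by sale date, because an unshuffled cut would then train and test on systematically different houses.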
After the first fitting, Model 1 had the following results: AIC = 2919.32, BIC = 3007.394, and RMSE = 0.274. There was one variable with a VIF > 5. The model’s adjusted R-square was 72.44%, which is quite high; the model summary was F(11, 11369) = 3280, p < .01, adjusted R² = .72. The variable sqft_above had a p-value of 0.986 and was therefore removed. Following its removal, the model had to be refit.
After refitting, the model had the following metrics: AIC = 2918.768, BIC = 2999.5, and RMSE = 0.274. There was no multicollinearity among the remaining predictors. The summary of the model was F(9, 11369) = 3280, p < .01, R² = .72. The variables bathrooms and floors had large p-values and were thus removed for insignificance. Because of the removal of these variables, the model had to be refit once more.
After refitting, the model had the following summary: F(8, 11370) = 3681, p < .01, R² = .72. The F-statistic and the accompanying p-value indicate that the overall model is significant. The adjusted R-square of 72% shows that the predictors explain 72% of the variation in the target variable. As a general rule of thumb, a model with R² > 0.7 is often regarded as a good model. The other metrics of the model were AIC = 2880, BIC = 2880, and Root Mean Square Error (RMSE) = 0.274. From the VIF table below, the values are small, indicating the absence of multicollinearity. The Durbin-Watson check for autocorrelation gave a result of 2.002. A value of 2 indicates no autocorrelation, with values toward 0 suggesting positive autocorrelation and values toward 4 suggesting negative autocorrelation; since 2.002 is very close to 2, the model is taken to satisfy this assumption.
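Two of the figures in this summary can be checked with short calculations. The sketch below recomputes the F-statistic from R² and the degrees of freedom, and shows how the Durbin-Watson statistic is formed from residuals. It assumes that the training figure of 0.7214 reported in this paper is the unrounded R² of this model; the residual sequences are toy values, since the actual residuals are not available.

```python
def f_statistic(r_squared, num_predictors, residual_df):
    """Overall F-statistic of a regression, computed from its R^2."""
    explained = r_squared / num_predictors
    unexplained = (1 - r_squared) / residual_df
    return explained / unexplained

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Assumption: 0.7214 is the unrounded R^2 behind the reported ".72".
f_check = f_statistic(0.7214, num_predictors=8, residual_df=11370)
print(f"F(8, 11370) recomputed from R^2: {f_check:.0f}")

# Toy residuals (assumption) showing the two extremes of the statistic.
alternating = [1, -1, 1, -1]  # strong negative autocorrelation
constant = [1, 1, 1, 1]       # strong positive autocorrelation
print(durbin_watson(alternating), durbin_watson(constant))
```

Under that assumption the recomputed F-statistic comes out at about 3680, in line with the reported 3681. The toy residuals give Durbin-Watson values of 3 and 0, which bracket the no-autocorrelation value of 2.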
After several iterations of tuning and refitting the models, Model 3 emerged as the final model. The next step was to evaluate the model on the test dataset to see whether it had trained well. The results of the testing are shown below. The test dataset gave an accuracy of 0.726, while the training dataset gave an accuracy of 0.7214. The two values are close, which suggests that the model generalizes to unseen data rather than overfitting the training set.
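A comparison of this kind can be sketched by scoring the same fitted model on both splits. The example below assumes that the reported accuracy is the coefficient of determination (R²), which the text does not state explicitly; the actual and predicted values are invented log-prices used only to show the calculation.

```python
import math

def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total

def rmse(actual, predicted):
    """Root mean square error, in the units of the target (here log-price)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Invented held-out values for illustration (NOT results from the paper).
actual_log_price = [12.0, 12.5, 13.0, 13.5, 14.0]
predicted_log_price = [12.1, 12.4, 13.2, 13.4, 13.9]

test_r2 = r_squared(actual_log_price, predicted_log_price)
test_rmse = rmse(actual_log_price, predicted_log_price)
print(f"R^2 = {test_r2:.3f}, RMSE = {test_rmse:.3f}")
```

A test score far below the training score would point to overfitting; scores that are close on both splits are the pattern reported here.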
The experiment used multiple linear regression to predict the price of houses in King County, Washington, USA, which includes Seattle. The dataset came from house sales in the 2014-2015 period. The target variable was “price,” while some of the predictor variables were the size of the house, size of the lot, number of bathrooms, number of bedrooms, and size of the basement. Some variables were removed for exhibiting multicollinearity or for being insignificant, while some observations were removed for being outliers. The data was split into training and test sets, and after three iterations of training, Model 3 was the best-performing model with an accuracy of 72%.
Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine learning design patterns: Solutions to common challenges in data preparation, model building, and MLOps. O’Reilly Media.
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to linear regression analysis (5th ed.). Wiley.