
Suppose that we have developed a model for predicting outlet sales. We have a data set of 8,523 observations on the following variables: Item_Weight, Item_Fat_Content, Item_Type, Item_Visibility, Item_MRP, Outlet_Size, Outlet_Location_Type, Outlet_Type, and Age Of Years. All of these are treated as independent variables, and the dependent variable is Item_Outlet_Sales.
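Before fitting, it is worth glancing at the data itself. A minimal sketch, assuming the training set sits in a CSV file (the file name Train.csv is an assumption; the BigMart sales data is commonly distributed under that name):

> train_data <- read.csv("Train.csv")  # hypothetical file name

> str(train_data)  # 8523 observations of the variables listed above

> summary(train_data$Item_Outlet_Sales)  # distribution of the response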

> linear_model <- lm(Item_Outlet_Sales ~ ., data = train_data)

> summary(linear_model)
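The summary object also exposes the individual fit statistics directly, which is handy when comparing models programmatically; a small sketch:

> fit <- summary(linear_model)

> fit$r.squared  # proportion of variance explained

> fit$adj.r.squared  # adjusted for the number of predictors

> fit$sigma  # residual standard error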

Plot of the first regression model –

> par(mfrow=c(2,2))

> plot(linear_model)

This is the diagnostic plot of the first regression model. The four panels correspond to the four assumptions of linear regression, and they show that these assumptions are not satisfied for this data set. In particular, the Residuals vs Fitted plot shows clear heteroscedasticity, and the Residuals vs Leverage plot (with Cook's distance) flags influential observations. Had there been constant variance, there would be no pattern visible in this graph.
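The visual impression of non-constant variance can also be checked formally. A minimal sketch using the Breusch–Pagan test from the lmtest package (the package is an assumption; it is not used elsewhere in this post):

> library(lmtest)

> bptest(linear_model)  # a small p-value means we reject constant variance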

So we can go for a log transformation of the response.

> log_linear_model <- lm(log(Item_Outlet_Sales) ~ ., data = train_data)

> summary(log_linear_model)

Now we can look at the diagnostic plots of this log model.

> plot(log_linear_model)

Now the diagnostic plots show that there is no more heteroscedasticity in the model and that the residuals are closer to normally distributed. The log model is therefore more reliable for the further modeling steps.
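To see the improvement side by side, we can draw only the Residuals vs Fitted panel for each model (the which argument of plot.lm selects a single diagnostic panel):

> par(mfrow = c(1, 2))

> plot(linear_model, which = 1)  # residuals fan out as fitted values grow

> plot(log_linear_model, which = 1)  # spread is roughly constant after the log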

There is one more step: calculating the RMSE to check model performance.

For the first model, the RMSE is –

> library(Metrics)

> rmse(train_data$Item_Outlet_Sales, linear_model$fitted.values)

[1] 1128.07

And for the second model, the log model, the RMSE is –

> rmse(train_data$Item_Outlet_Sales, exp(log_linear_model$fitted.values))

[1] 1141.635
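Metrics::rmse() is simply the root-mean-squared-error formula, so both numbers can be reproduced by hand; note that the log model's fitted values are exponentiated back to the original sales scale before comparing:

> sqrt(mean((train_data$Item_Outlet_Sales - linear_model$fitted.values)^2))

> sqrt(mean((train_data$Item_Outlet_Sales - exp(log_linear_model$fitted.values))^2))

The log model's RMSE on the original scale comes out slightly higher, most likely because exponentiating log-scale fitted values systematically underestimates the conditional mean (retransformation bias), even though its residuals satisfy the assumptions much better.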
