A guide to executing Linear Regression in Python

As most of us already know, linear regression is used to model the relationship between two continuous variables. There are various ways of going about it, and various applications as well. In this post, we are going to walk through the steps of executing linear regression in Python.

There are two kinds of supervised machine learning algorithms: classification and regression. Regression tries to predict continuous output values, while classification tries to predict discrete values. Here we will be using Python to execute linear regression, and for this purpose we will use Scikit-Learn, one of the most popular machine learning libraries for Python.

First up – what is the theory behind linear regression?

Linear regression is based on the principle of a linear relationship between two or more variables. Its task is to predict the value of a dependent variable, let’s say y, based on an independent variable, let’s say x. Hence, x becomes the input and y the output. This relationship, when plotted on a graph, gives a straight line. Hence, we use the equation of a straight line, which is:

y=mx+b

Where m is the slope of the line and b is the intercept. The data points x and y are fixed, so all the changes take place in the slope and the intercept; on that basis, there can be many candidate straight lines. What a linear regression algorithm does is fit many such lines to the data points and return the line with the least error.
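
Concretely, “least error” here means ordinary least squares: the algorithm picks the m and b that minimize the sum of squared errors Σ (yi - (m·xi + b))². For simple regression this even has a closed-form solution:

m = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)²

b = ȳ - m·x̄

where x̄ and ȳ are the means of the x and y values.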

A regression model can be represented as:

y = b0 + m1x1 + m2x2 + … + mnxn

Here x1 … xn are the input features, m1 … mn their slopes (coefficients), and b0 the intercept. Because there is now more than one input dimension, the fitted model is referred to as a hyperplane rather than a straight line.

So, now, how can we use the Scikit-Learn library to execute linear regression?

Let’s say there are many flight delays that have taken place due to weather changes. To measure this fluctuation, you can perform linear regression on weather data, such as the minimum and maximum temperatures recorded on particular days. You can download a weather dataset to study this fluctuation. The input x will be the minimum temperature, and using that, we have to predict the maximum temperature y.

First, import all the necessary libraries:

import pandas as pd 

import numpy as np 

import matplotlib.pyplot as plt 

import seaborn as seabornInstance

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn import metrics

%matplotlib inline
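
The snippets below assume the weather data has already been loaded into a pandas DataFrame named dataset. A minimal sketch, assuming the data lives in a local CSV file (the file name Weather.csv is only a placeholder):

dataset = pd.read_csv('Weather.csv') # placeholder file name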

Now check the data by exploring the number of rows and columns in the dataset:

dataset.shape

You will receive output in the form of (n rows, n columns)

For a statistical summary of the dataset, use:

dataset.describe()

Now, plot the data points on a 2-D graph to get a feel for the relationship just by glancing at it. We can do so by using:

dataset.plot(x='MinTemp', y='MaxTemp', style='o')

plt.title('MinTemp vs MaxTemp')

plt.xlabel('MinTemp')

plt.ylabel('MaxTemp')

plt.show()

So, here we have used MinTemp and MaxTemp for the analysis. Let’s also check the distribution of the maximum temperature, which mostly falls between 25 and 35:

plt.figure(figsize=(15,10))

plt.tight_layout()

seabornInstance.distplot(dataset['MaxTemp'])

Once we have done that, we have to divide the data into attributes and labels. Labels are the dependent variables whose values are to be predicted, and attributes are the independent variables. Here we want to predict MaxTemp from the values of MinTemp, so the attribute set consists of 'MinTemp' (the X value) and the label is 'MaxTemp' (the y value).

X = dataset['MinTemp'].values.reshape(-1,1)

y = dataset['MaxTemp'].values.reshape(-1,1)

Now we can assign 80% of this data to the training set and the rest to the test set:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

After this, we can train the algorithm using the following:

regressor = LinearRegression() 

regressor.fit(X_train, y_train) #training the algorithm

Linear regression finds the best values for the slope and intercept so that the line fits the data as closely as possible. We can retrieve them with the following code:

# To retrieve the intercept:

print(regressor.intercept_)

# To retrieve the slope:

print(regressor.coef_)
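
As a quick sanity check (a sketch, not part of the original walkthrough), a single prediction should equal intercept + slope × input:

x_sample = X_test[:1] # one minimum temperature from the test set

manual = regressor.intercept_ + regressor.coef_[0] * x_sample

print(manual, regressor.predict(x_sample)) # the two values should match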

With the algorithm trained, we can now use it to make predictions of MaxTemp on our test data. We use the following:

y_pred = regressor.predict(X_test)

After we find the predicted values, we have to compare them with the actual output values.

We use this script for it:

df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})

df

Now, there is a possibility that you will find some variance between the predicted and actual outcomes.

So, take the first 25 of them and develop a bar graph, using this script:

df1 = df.head(25)

df1.plot(kind='bar', figsize=(16,10))

plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')

plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')

plt.show()

In the bar graph, you can see how close the predictions are to the actual output. Now, plot the regression line over the test data:

plt.scatter(X_test, y_test, color='gray')

plt.plot(X_test, y_pred, color='red', linewidth=2)

plt.show()

The straight line confirms that a linear model is a reasonable fit: the closer the data points cluster around it, the better the algorithm has captured the relationship.

Now, you have to evaluate the performance of the algorithm. This is commonly done with the following metrics:

  1. Mean Absolute Error (MAE): the mean of the absolute values of the errors.
  2. Mean Squared Error (MSE): the mean of the squared errors.
  3. Root Mean Squared Error (RMSE): the square root of the mean of the squared errors.
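
For reference, with n test samples, actual values yi, and predicted values ŷi, these metrics are defined as:

MAE = (1/n) Σ |yi - ŷi|

MSE = (1/n) Σ (yi - ŷi)²

RMSE = √MSE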

The Scikit-Learn library has pre-built functions that you can use to calculate these metrics with the following script:

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Multiple Linear Regression

Now let’s imagine you have multiple input variables to work with. This means you have to use multiple linear regression. An example would be predicting the quality of wine: to judge something like wine quality, you have to take in various factors like acidity, residual sugar, chlorides, pH level, alcohol, density, etc. These are the inputs that will help determine the quality.

So, as we did earlier, we will first import the libraries: 

import pandas as pd 

import numpy as np 

import matplotlib.pyplot as plt 

import seaborn as seabornInstance

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn import metrics

%matplotlib inline
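
As before, the snippets assume the data has been loaded into a DataFrame named dataset. A minimal sketch, assuming a CSV file (the file name winequality.csv is only a placeholder):

dataset = pd.read_csv('winequality.csv') # placeholder file name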

Again, explore the rows and columns using:

dataset.shape

Get the statistical summary by using:

dataset.describe()

Now, we first have to check the data for missing values. We can use the following script:

dataset.isnull().any()

All the columns should return False for this check, but if one of them turns out to be True, handle the missing values with this script:

dataset = dataset.fillna(method='ffill')

Next, we divide them into labels and attributes. 

X = dataset[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']].values

y = dataset['quality'].values

Take a look at the distribution of the quality column:

plt.figure(figsize=(15,10))

plt.tight_layout()

seabornInstance.distplot(dataset['quality'])

Separate 80% of the data for training and 20% for testing:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Train the model:

regressor = LinearRegression() 

regressor.fit(X_train, y_train)
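
With multiple features, regressor.coef_ holds one coefficient per input column. As a sketch, you can pair each coefficient with its feature name (reusing the column list defined above) to see how strongly each factor pushes the predicted quality up or down:

feature_names = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']

coeff_df = pd.DataFrame(regressor.coef_, feature_names, columns=['Coefficient']) # one row per feature

print(coeff_df)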

Now, make predictions on the test data and check the difference between the predicted and actual values:

y_pred = regressor.predict(X_test)

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

df1 = df.head(25)

Plot it on a graph:

df1.plot(kind='bar', figsize=(10,8))

plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')

plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')

plt.show()

Whether or not the predictions look close to the actual values, evaluate the performance of the algorithm with the same script as before:

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

If the error metrics come out high, it can be due to any of these factors:

  • Inadequate data: predictions improve with more input data.
  • Poor assumptions: assuming a linear relationship for data that does not actually have one will lead to error.
  • Poor use of features: if the features used do not have a high correlation with the value being predicted, the errors will be large (a quick correlation check is sketched below).
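
As a quick sketch for the third point, assuming the wine dataset from above (where every column is numeric), you can inspect how strongly each feature correlates with the target:

print(dataset.corr()['quality'].sort_values()) # correlation of each column with quality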

So, this was a sample problem on how to perform linear regression in Python. Let’s hope you can ace your linear regressions using Python! If you’re looking to get your concepts of machine learning and Python crystal clear, CodingNinjas might be able to help you out.