The Talent500 Blog

A guide to executing Linear Regression in Python

Regression and classification are the two most commonly used types of supervised learning. Classification is used for discrete outputs (e.g. email spam detection), while regression is used when the outputs are continuous (e.g. weather forecasting). Breaking down the term linear regression is useful: ‘regression’ means that the model captures a statistical relationship between two or more variables, and ‘linear’ implies that this relationship is linear, that is, if you plot it in a 2D space, you get a straight line.

Linear regression has a wide range of real-life applications from assessing the risks that insurance providers take to vendors adjusting their prices based on income statistics. Modelling such data in Python is fairly easy too. In this guide you’ll learn some basic linear regression theory and how to execute linear regression in Python.

What is simple linear regression?

In linear regression, you map dependent variables with independent variables. The dependent features are also known as outputs or responses and the independent features are called inputs or predictors. Here is some theory based on insights from Mirko Stojiljković’s article on Real Python.

A linear regression model can be defined by the equation:

𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ₙ𝑥ₙ + 𝜀

𝑦 is the dependent feature

x = (𝑥₁, …, 𝑥ₙ) are the independent features

𝛽₀, …, 𝛽ₙ are the regression coefficients or model parameters

𝜀 is the estimation error, which is assumed to average out to zero

In the case of simple linear regression, there is only one independent feature.

So, the regression model narrows down to:

𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + 𝜀

In carrying out linear regression, you arrive at estimates 𝑏₀, …, 𝑏ₙ of the regression coefficients 𝛽₀, …, 𝛽ₙ.

The estimated regression function then is:

𝑓(x) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ₙ𝑥ₙ

At any given point, the difference between the actual and the predicted response is the residual Rᵢ.

For 𝑖 = 1, …, m,

      Rᵢ = 𝑦ᵢ − 𝑓(𝐱ᵢ)

To arrive at the best set of coefficients, the goal is to minimize the sum of squared residuals (SSR).

SSR = Σᵢ(𝑦ᵢ – 𝑓(𝐱ᵢ))²
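Because the SSR is quadratic in the coefficients, the simple (one-feature) case has a closed-form minimum: 𝑏₁ = Σᵢ(𝑥ᵢ − x̄)(𝑦ᵢ − ȳ) / Σᵢ(𝑥ᵢ − x̄)² and 𝑏₀ = ȳ − 𝑏₁x̄. As a quick sketch on a few made-up points (toy values, not the article's dataset), these formulas agree with NumPy's own least-squares fit:

```python
import numpy as np

# Made-up points that roughly follow y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form OLS estimates that minimize the SSR
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

ssr = np.sum((y - (b0 + b1 * x)) ** 2)  # sum of squared residuals at the minimum
print(b0, b1, ssr)  # b0 ≈ 1.04, b1 ≈ 1.99
```

np.polyfit(x, y, 1) returns the same slope and intercept, since it minimizes the same quantity.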

How to execute simple linear regression in Python?

To implement simple linear regression in Python you can use the machine learning package Scikit-Learn. You can try out the code in a Jupyter notebook.

Import libraries

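The original shows the import cell as a screenshot. A typical set for this walkthrough (assuming pandas, NumPy, Matplotlib and Scikit-Learn, which the later steps use) would be:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
```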

Pick the dataset

Consider this simple dataset provided by T McKetterick on Kaggle, which gives average mass as a function of height for a small sample of American women aged 30-39.

Now, importing the CSV using pandas:


To get a glimpse of the dataset, execute:

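Both of these cells appear as screenshots in the original. A combined sketch of loading the data and checking its shape, with the CSV file name assumed; the 15 height-weight pairs are also written inline so the snippet runs without the file:

```python
import pandas as pd

# With the Kaggle file on disk (file name assumed):
# dataset = pd.read_csv('height_weight.csv')
dataset = pd.DataFrame({
    'Height': [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
               1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83],
    'Weight': [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
               63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46],
})

print(dataset.shape)  # → (15, 2)
```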

The output is:

(15, 2)

And to view some of the data run:

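A sketch of the peek at the data (the column names Height and Weight are assumed from the CSV):

```python
import pandas as pd

# The same 15-row sample, inlined in place of pd.read_csv('height_weight.csv'):
dataset = pd.DataFrame({
    'Height': [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
               1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83],
    'Weight': [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
               63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46],
})

print(dataset.head())  # shows the first five rows
```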

The output is the first five rows of the dataset, showing the height and weight columns.

To “see” if the data would be good for linear regression, it makes sense to plot it.

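A sketch of the plotting cell, assuming a plain Matplotlib scatter plot of weight against height:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; unnecessary inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.DataFrame({
    'Height': [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
               1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83],
    'Weight': [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
               63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46],
})

fig, ax = plt.subplots()
ax.scatter(dataset['Height'], dataset['Weight'])
ax.set_xlabel('Height (m)')
ax.set_ylabel('Weight (kg)')
ax.set_title('Average mass vs height')
# plt.show()  # in a notebook, display the figure
```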

The result is a scatter plot of weight against height.

As you can see, this is very linear in form.

Provide the data

Here, divide the data into independent and dependent variables:

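A sketch of the split into inputs and outputs; Scikit-Learn expects the inputs as a 2-D array, hence the reshape:

```python
import numpy as np

# The 15-point sample (heights in m, weights in kg), inlined in place of the CSV:
heights = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                    1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weights = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                    63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

X = heights.reshape(-1, 1)  # independent variable, shape (15, 1)
y = weights                 # dependent variable, shape (15,)
```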

Create a linear regression model:
Now, create a linear regression model with Scikit-Learn and fit it to the dataset.


To train the model use:

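A sketch of creating and fitting the model, with the data setup repeated so the snippet stands alone:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

heights = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                    1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weights = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                    63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
X = heights.reshape(-1, 1)
y = weights

regressor = LinearRegression()  # create the model
regressor.fit(X, y)             # train it on the sample
print(regressor)                # recent scikit-learn versions print LinearRegression()
```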

The output is:

LinearRegression()

To predict features run:

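A sketch of the prediction step, which returns the fitted line's weight estimate for every height in the sample:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

heights = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                    1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weights = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                    63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
X = heights.reshape(-1, 1)
y = weights
regressor = LinearRegression().fit(X, y)

y_pred = regressor.predict(X)  # one predicted weight per height
```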

Evaluate the model

Root mean squared error is a metric we want to minimize. So, let's see how the model has performed:


View the results:

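A sketch of the evaluation. One caveat worth flagging: scikit-learn's mean_squared_error returns the MSE, and 0.499 (the figure printed below) is that MSE; taking its square root gives the actual RMSE, about 0.71.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

heights = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                    1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weights = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                    63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
X = heights.reshape(-1, 1)
y = weights
regressor = LinearRegression().fit(X, y)
y_pred = regressor.predict(X)

mse = mean_squared_error(y, y_pred)  # ≈ 0.4994, the value reported below
rmse = np.sqrt(mse)                  # ≈ 0.7067, the actual root mean squared error
r2 = r2_score(y, y_pred)             # ≈ 0.9892
print('Mean squared error:', mse)
print('Root mean squared error:', rmse)
print('R2 score:', r2)
```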

You get:
Root mean squared error: 0.49937056025884025
R2 score: 0.9891969224457968

Both of these are very good values: the error is small relative to the weights, and the R2 score is close to 1. (Strictly speaking, 0.499 is the mean squared error; the root mean squared error is its square root, about 0.71.)

Also, take a look at the coefficients.

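A sketch of inspecting the fitted parameters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

heights = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                    1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weights = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                    63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
regressor = LinearRegression().fit(heights.reshape(-1, 1), weights)

print('Slope:', regressor.coef_)           # ≈ [61.272]
print('Intercept:', regressor.intercept_)  # ≈ -39.062
```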

You get:

Slope: [61.27218654]
Intercept: -39.06195591884392

Having 𝑏₀ ≈ −39 may look odd, but the intercept is simply where the fitted line crosses zero height, a point far outside the observed data, so it has no direct physical interpretation here.

Take a look at the line predicted by the model:

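A sketch that overlays the fitted line on the scatter plot of the data:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; unnecessary inside a notebook
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

heights = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                    1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weights = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                    63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
X = heights.reshape(-1, 1)
y = weights
regressor = LinearRegression().fit(X, y)

fig, ax = plt.subplots()
ax.scatter(X, y, label='data')
ax.plot(X, regressor.predict(X), color='red', label='fitted line')
ax.set_xlabel('Height (m)')
ax.set_ylabel('Weight (kg)')
ax.legend()
# plt.show()
```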

Now that you have a working model, you can use it to predict dependent variables for a new dataset, say, for a larger group of women of a similar age group and demographic.

So, you could get weight values for arbitrary heights, such as 1.3, 1.5, 1.2, 1.4, 2.4 and 1:

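A sketch of predicting for the new heights; with the full-sample fit this prints the two arrays shown below:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

heights = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                    1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weights = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                    63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
regressor = LinearRegression().fit(heights.reshape(-1, 1), weights)

# New heights to predict for (a 2-D column, as the model expects)
x = np.array([1.3, 1.5, 1.2, 1.4, 2.4, 1.0]).reshape(-1, 1)
y_pred = regressor.predict(x)
print(x)
print(y_pred)
```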

Have a look at x and y_pred:


You get:

[[1.3]
[1.5]
[1.2]
[1.4]
[2.4]
[1. ]]

And similarly,


To get:

[ 40.59188659 52.84632389 34.46466793 46.71910524 107.99129178
22.21023062]

How to execute multiple linear regression in Python?

As an example of multiple linear regression consider data on cars from CarDekho, available at Kaggle.

Import Libraries

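As before, the import cell is a screenshot; a plausible set for this section:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
```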

Provide the dataset


To view the first 5 rows of data:

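A sketch of loading and previewing the data. The file and column names are assumed from the Kaggle listing; a few illustrative stand-in rows are inlined so the snippet runs without the CSV (the values are examples, not the real records):

```python
import pandas as pd

# With the Kaggle file on disk (file name assumed):
# df = pd.read_csv('car data.csv')
df = pd.DataFrame({
    'Year':          [2014, 2013, 2017, 2011, 2014, 2018, 2015, 2016, 2012, 2010],
    'Selling_Price': [3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 5.25, 2.65, 1.95],
    'Present_Price': [5.59, 9.54, 9.85, 4.15, 6.87, 11.60, 8.12, 6.90, 4.43, 3.60],
    'Kms_Driven':    [27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429, 45000, 52000],
})

print(df.head())  # first five rows
```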

You get the first five rows, with columns for the car name, year, selling price, present price, kilometers driven, fuel type, and more.

Prepare the data
Now, for the purpose of evaluation, consider that the selling price depends on only 3 of the above factors: the year, the present price, and the kilometers driven.


This time around, divide the dataset itself into training and testing sets. You could keep a ratio of 80:20.


You can view the dimensions of the matrices:

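A sketch of selecting the three features and splitting 80:20 (the random seed is an assumption; the article does not show one). With the full 301-row Kaggle file this prints the shapes shown below; the stand-in rows here give (8, 3), (2, 3), (8,) and (2,):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in rows with the Kaggle file's column names (values are illustrative)
df = pd.DataFrame({
    'Year':          [2014, 2013, 2017, 2011, 2014, 2018, 2015, 2016, 2012, 2010],
    'Selling_Price': [3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 5.25, 2.65, 1.95],
    'Present_Price': [5.59, 9.54, 9.85, 4.15, 6.87, 11.60, 8.12, 6.90, 4.43, 3.60],
    'Kms_Driven':    [27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429, 45000, 52000],
})

X = df[['Year', 'Present_Price', 'Kms_Driven']]
y = df['Selling_Price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 80:20 split

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```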

You get:
(240, 3)
(61, 3)
(240,)
(61,)

Train and test

Once again, use the Scikit-Learn library to train and test:

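A sketch of the training step, repeated end to end on the stand-in rows so it stands alone:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in rows with the Kaggle file's column names (values are illustrative)
df = pd.DataFrame({
    'Year':          [2014, 2013, 2017, 2011, 2014, 2018, 2015, 2016, 2012, 2010],
    'Selling_Price': [3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 5.25, 2.65, 1.95],
    'Present_Price': [5.59, 9.54, 9.85, 4.15, 6.87, 11.60, 8.12, 6.90, 4.43, 3.60],
    'Kms_Driven':    [27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429, 45000, 52000],
})
X = df[['Year', 'Present_Price', 'Kms_Driven']]
y = df['Selling_Price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
print(regressor)  # recent scikit-learn versions print LinearRegression()
```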

The output is:

LinearRegression()

In multiple linear regression, the model will try to arrive at optimal values for all coefficients.

You can view these now:

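A sketch of viewing the fitted coefficients, one per feature, plus the intercept:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in rows with the Kaggle file's column names (values are illustrative)
df = pd.DataFrame({
    'Year':          [2014, 2013, 2017, 2011, 2014, 2018, 2015, 2016, 2012, 2010],
    'Selling_Price': [3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 5.25, 2.65, 1.95],
    'Present_Price': [5.59, 9.54, 9.85, 4.15, 6.87, 11.60, 8.12, 6.90, 4.43, 3.60],
    'Kms_Driven':    [27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429, 45000, 52000],
})
X = df[['Year', 'Present_Price', 'Kms_Driven']]
y = df['Selling_Price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

coefficients = pd.DataFrame(regressor.coef_, index=X.columns, columns=['Coefficient'])
print(coefficients)
print('Intercept:', regressor.intercept_)
```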

You get the intercept and a coefficient for each of the three features.

That is, if the present price increases by ‘1 unit’, the predicted selling price increases by about 1.87 units. The kilometers-driven coefficient, on the other hand, is negative and tiny on a per-kilometer basis (it is printed in scientific notation), so the predicted selling price falls as the kilometers driven increase.

Prediction and evaluation
You can now make predictions on the dataset:


To view the actual vs predicted values,

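A sketch of predicting on the held-out set and lining the predictions up against the actual selling prices:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in rows with the Kaggle file's column names (values are illustrative)
df = pd.DataFrame({
    'Year':          [2014, 2013, 2017, 2011, 2014, 2018, 2015, 2016, 2012, 2010],
    'Selling_Price': [3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 5.25, 2.65, 1.95],
    'Present_Price': [5.59, 9.54, 9.85, 4.15, 6.87, 11.60, 8.12, 6.90, 4.43, 3.60],
    'Kms_Driven':    [27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429, 45000, 52000],
})
X = df[['Year', 'Present_Price', 'Kms_Driven']]
y = df['Selling_Price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

y_pred = regressor.predict(X_test)
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)
```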

You get a table of the actual selling prices alongside the model's predictions for the test set.

You can now evaluate RMSE and R2 score for training and test data:

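A sketch of scoring the model on the training split. (The figures below come from the real 301-row dataset; the stand-in rows here produce different numbers.)

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in rows with the Kaggle file's column names (values are illustrative)
df = pd.DataFrame({
    'Year':          [2014, 2013, 2017, 2011, 2014, 2018, 2015, 2016, 2012, 2010],
    'Selling_Price': [3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 5.25, 2.65, 1.95],
    'Present_Price': [5.59, 9.54, 9.85, 4.15, 6.87, 11.60, 8.12, 6.90, 4.43, 3.60],
    'Kms_Driven':    [27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429, 45000, 52000],
})
X = df[['Year', 'Present_Price', 'Kms_Driven']]
y = df['Selling_Price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

y_train_pred = regressor.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2 = r2_score(y_train, y_train_pred)
print('RMSE =', rmse)
print('R2 score =', r2)
```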

You get,

RMSE = 2.040579021493682
R2 score = 0.8384238519780597

And for the test data:

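And the same scoring on the test split (again, the figures below are from the real dataset, not the stand-in rows):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in rows with the Kaggle file's column names (values are illustrative)
df = pd.DataFrame({
    'Year':          [2014, 2013, 2017, 2011, 2014, 2018, 2015, 2016, 2012, 2010],
    'Selling_Price': [3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 5.25, 2.65, 1.95],
    'Present_Price': [5.59, 9.54, 9.85, 4.15, 6.87, 11.60, 8.12, 6.90, 4.43, 3.60],
    'Kms_Driven':    [27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429, 45000, 52000],
})
X = df[['Year', 'Present_Price', 'Kms_Driven']]
y = df['Selling_Price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

y_test_pred = regressor.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2 = r2_score(y_test, y_test_pred)
print('RMSE =', rmse)
print('R2 score =', r2)
```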

You get:

RMSE = 1.6867312004140655
R2 score = 0.887446046362064

So, what do you think? 

  1. Are these metrics a good starting point?
  2. Or would it help to carry out some exploratory analysis first, and have some other variables play a factor in deciding the selling price?

In the field of data science, being inquisitive helps you arrive at the right answers.

It also helps to work on projects with some of the best minds in the industry, and for that, you can simply sign up with Talent500. Our dynamic skill assessment algorithms align your skills to premium job openings at Fortune 500 companies. This way you’ll put what you learn in data science to practice and get hands-on experience with top real-world problems!

You can gain complementary insights from the following blogs, which also served as resource material for this post:

https://realpython.com/linear-regression-in-python/#polynomial-regression

https://stackabuse.com/linear-regression-in-python-with-scikit-learn/

https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155

Kartikey Bhardwaj

Backend Developer at Talent500. I love fixing daily-life problems with web apps and mobile apps, starting from scratch: how a problem came into existence and where it is now. I design whiteboard prototypes and transform my solution into something people truly value.
I believe in a wise man's saying that unless you yourself use your own solution, it's not worth it. Keeping this in mind, I always move forward with my work.
