
Simple Linear Regression in Machine Learning
What is Simple Linear Regression?
Simple linear regression is a statistical and supervised learning method in which a single independent variable (also known as a predictor variable) is used to predict the dependent variable. In other words, it models the linear relationship between the dependent variable and a single independent variable.
Simple linear regression in machine learning is a type of linear regression. When the linear regression algorithm deals with a single independent variable, it is known as simple linear regression. When there is more than one independent variable (feature variables), it is known as multiple linear regression.
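For contrast, the two model forms can be written side by side −
$$\mathrm{\text{Simple: } Y = w_0 + w_1 X \qquad \text{Multiple: } Y = w_0 + w_1 X_1 + w_2 X_2 + \dots + w_n X_n}$$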
Independent Variable
The feature inputs in the dataset are termed the independent variables. In simple linear regression, there is only a single independent variable. An independent variable is also known as a predictor variable, as it is used to predict the target value. It is plotted on the horizontal axis.
Dependent Variable
The target value in the dataset is termed the dependent variable. It is also known as a response variable or predicted variable. It is plotted on the vertical axis.
Line of Regression
In simple linear regression, a line of regression is a straight line that best fits the data points and is used to show the relationship between a dependent variable and an independent variable.
Graphical Representation
The following graph depicts the simple linear regression model −
[Figure: Regression line Ŷ = w0 + w1X fitted through the data points (X, Y)]
In the above image, the straight line represents the simple linear regression line, where Ŷ is the predicted value, Y is the dependent variable (target), and X is the independent variable (input).
Simple Linear Regression Model
A simple linear regression model in machine learning can be represented as the following mathematical equation −
$$\mathrm{ Y = w_0 + w_1 X + \epsilon }$$
Where
- Y is the dependent variable (target).
- X is the independent variable (feature).
- w0 is the y-intercept of the line.
- w1 is the slope of the line, representing the effect of X on Y.
- ε is the error term, capturing the variability in Y not explained by X.
How Does Simple Linear Regression Work?
The main aim of simple linear regression is to find the best-fit line (a straight line) through the data points that minimizes the difference between the actual and predicted values.
Defining Hypothesis Function
In simple linear regression, the hypothesis is that there is a linear relationship between the dependent variable (output/target) and the independent variable (input). This linear relationship can be represented using a linear equation −
$$\mathrm{\hat{Y} = w_0 + w_1 X}$$
Different values of the parameters w0 and w1 give different linear equations (straight lines). The set of all such linear equations (all straight lines) is termed the hypothesis space.
Now, the main aim of the simple linear regression model is to find the best-fit line in the hypothesis space (the set of all straight lines).
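The following is a minimal Python sketch of a hypothesis function. The parameter values used here (w0 = 25000, w1 = 9500) are illustrative assumptions, not learned values; each (w0, w1) pair corresponds to one line in the hypothesis space.
```python
def predict(x, w0, w1):
    """Evaluate one candidate hypothesis: the line y_hat = w0 + w1 * x."""
    return w0 + w1 * x

# Illustrative (assumed) parameters; each (w0, w1) pair is one straight
# line in the hypothesis space.
print(predict(5.0, w0=25000, w1=9500))   # 72500.0
```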
Finding the Best Fit Line
Now the task is to find the best-fit line (line of regression). To do this, we define a cost function or loss function that measures the difference between the actual and predicted values.
To find the best fit line, the simple linear regression model initializes (with default values) the parameters of the regression line. This regression line (with initialized parameters) is used to find the predicted values for the given input values.
Loss Function for Simple Linear Regression
Now using the input and predicted values, we compute the loss function. The loss function is used to find the optimal values of the parameters.
The loss function measures the difference between the actual values and the predicted values. Different loss functions, such as mean squared error (MSE) and mean absolute error (MAE), can be used in simple linear regression; R-squared, by contrast, is an evaluation metric rather than a loss function. The most commonly used loss function is mean squared error.
The loss function for simple linear regression in terms of mean squared error is as follows −
$$\mathrm{J(w_0, w_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}$$
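To make the formula concrete, here is a minimal NumPy sketch that evaluates J(w0, w1) for one candidate line; the toy data values are assumptions chosen purely for illustration.
```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])                   # toy inputs (assumed)
Y = np.array([40000.0, 50000.0, 61000.0, 69000.0])   # toy targets (assumed)

def loss(w0, w1):
    # Mean squared error with the 1/(2n) factor from the formula above
    Y_hat = w0 + w1 * X
    return np.sum((Y - Y_hat) ** 2) / (2 * len(X))

print(loss(30000, 10000))   # 250000.0 for this toy data
```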
Optimization
The optimal values of parameters are those values that minimize the cost function. Finding the optimal values is an iterative process in which the parameters are updated iteratively.
There are many optimization techniques that can be applied in simple linear regression. Gradient descent is the simplest and most commonly used optimization technique for simple linear regression.
A linear equation with the optimal parameter values is the best-fit line (regression line), and it is the final solution of a simple linear regression problem. This line is used to make predictions on new, unseen data.
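As a minimal from-scratch sketch of gradient descent for the loss defined above (the learning rate, epoch count, and toy data are assumptions for illustration) −
```python
import numpy as np

def gradient_descent(X, Y, lr=0.05, epochs=5000):
    """Iteratively update w0 and w1 to minimize J(w0, w1)."""
    w0, w1 = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        error = (w0 + w1 * X) - Y
        # Partial derivatives of J(w0, w1) = (1/2n) * sum((Y - Y_hat)^2)
        w0 -= lr * np.sum(error) / n
        w1 -= lr * np.sum(error * X) / n
    return w0, w1

# Toy data (assumed), roughly following Y = 1 + 2X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])
w0, w1 = gradient_descent(X, Y)
print(f"w0 = {w0:.3f}, w1 = {w1:.3f}")   # close to 1 and 2
```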
Assumptions of Simple Linear Regression
The simple linear regression model makes the following assumptions about the dataset −
- Linearity − The relationship between the dependent and independent variables is linear; that is, the dependent variable changes linearly as the independent variable changes. A scatter plot of the data will show whether the dataset is linear.
- Homoskedasticity − For all observations, the variance of the residuals is the same. This assumption relates to the squared residuals.
- Independence − The examples (observations, or X and Y pairs) are independent of one another, so the residuals are not correlated. To check this, we examine the scatter plot of residuals vs. fitted values.
- Normality − The model residuals are normally distributed. Residuals are the differences between the actual and predicted values. To check for normality, we examine the histogram of residuals, which should be approximately bell-shaped. (A minimal sketch of these residual checks follows this list.)
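As a minimal sketch of these diagnostic checks, assuming you already have arrays of actual targets and model predictions (the values below are assumed for illustration) −
```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed inputs: actual targets and model predictions
y = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0])
y_pred = np.array([38000.0, 45500.0, 40000.0, 44000.0, 41000.0])

residuals = y - y_pred

# Residuals vs. fitted values: no visible pattern supports the
# independence and homoskedasticity assumptions
plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

# Histogram of residuals: a roughly bell shape supports normality
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=5)
plt.xlabel("Residual")

plt.tight_layout()
plt.show()
```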
Implementation of Simple Linear Regression Algorithm using Python
To implement the simple linear regression algorithm, we are taking a dataset with two variables: YearsExperience (independent variable) and Salary (dependent variable).
Here, we are using the following dataset. The dataset contains 30 data points. You can create a CSV file and store these data points in it.
Salary_Data.csv
| Years of Experience | Salary |
| --- | --- |
| 1.1 | 39343 |
| 1.3 | 46205 |
| 1.5 | 37731 |
| 2 | 43525 |
| 2.2 | 39891 |
| 2.9 | 56642 |
| 3 | 60150 |
| 3.2 | 54445 |
| 3.2 | 64445 |
| 3.7 | 57189 |
| 3.9 | 63218 |
| 4 | 55794 |
| 4 | 56957 |
| 4.1 | 57081 |
| 4.5 | 61111 |
| 4.9 | 67938 |
| 5.1 | 66029 |
| 5.3 | 83088 |
| 5.9 | 81363 |
| 6 | 93940 |
| 6.8 | 91738 |
| 7.1 | 98273 |
| 7.9 | 101302 |
| 8.2 | 113812 |
| 8.7 | 109431 |
| 9 | 105582 |
| 9.5 | 116969 |
| 9.6 | 112635 |
| 10.3 | 122391 |
| 10.5 | 121872 |
What is the purpose of this implementation?
The purpose of building this simple linear regression model is to determine which line best represents the relationship between the two variables.
The following are the steps to implement the simple linear regression model in Python −
Step 1: Data Preparation
Data preparation or pre-processing is the initial step. We have our dataset as a CSV file named "Salary_Data.csv," as discussed above.
We need to import the required Python libraries before loading the dataset and building the simple linear regression model.
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```
Load the dataset
```python
dataset = pd.read_csv('Salary_Data.csv')
```
The independent variable (X) and the dependent variable (y) must then be extracted from the loaded dataset. Years of experience (YearsExperience) is the independent variable, and Salary is the dependent variable.
```python
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
```
Let's check the first five examples of the dataset.
```python
print(dataset.head())
```
Output
```
   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
```
Let's check whether the dataset is linear or not −
```python
plt.scatter(X, y, color="green")
plt.title("Salary vs Experience")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (INR)")
plt.show()
```
Output
[Figure: Scatter plot of Salary vs Years of Experience]
The above graph shows that the dependent and independent variables are approximately linearly related. So we can apply simple linear regression to the dataset to find the best relationship between these variables.
Split the dataset into training and testing sets
The dataset will then be divided into two groups: a training set and a test set. Out of the total 30 observations, we will use 80% for the training set and 20% for the test set, so there will be 24 observations in the training set and 6 observations in the test set. We divide our dataset into training and test sets so that we can use one set to train and the other to test our model.
```python
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# Note: without a fixed random_state, the split (and hence the outputs
# shown below) will vary from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
Here, X_train represents the input feature of the training data and y_train represents the output variable (target variable).
Step 2: Model Training (Fitting the Simple Linear Regression to Training Set)
The next step is fitting our model with the training dataset. We will use scikit-learn's LinearRegression class to train a simple linear regression model on the training data. The code for this is as follows −
```python
from sklearn.linear_model import LinearRegression

# Create a linear regression object
regressor = LinearRegression()
regressor.fit(X_train, y_train)
```
The fit() method is used to fit the linear regression object (regressor) to the training data. The model learns the relationship between the predictor variable (X_train) and the target variable (y_train).
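As a quick check, the learned intercept (w0) and slope (w1) can be inspected through scikit-learn's standard intercept_ and coef_ attributes; the exact values will vary with the random train/test split.
```python
# Inspect the parameters of the fitted regression line
print("Intercept (w0):", regressor.intercept_)
print("Slope (w1):", regressor.coef_[0])
```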
Step 3: Model Testing
Once the model is trained, we can use it to make predictions on the test data. The code for this is as follows −
```python
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual Values': y_test, 'Predicted Values': y_pred})
print(df)
```
Output
```
   Actual Values  Predicted Values
0        60150.0      54093.648425
1        93940.0      82416.119864
2        57081.0      64478.554619
3       116969.0     115459.003211
4        56957.0      63534.472238
5       121872.0     124899.827024
```
The above output shows actual values and predicted values of Salary in the test set.
Here, X_test represents the input feature of the test data and y_pred represents the predicted output variable (target variable).
Similarly, you can test the model with training data.
```python
y_pred = regressor.predict(X_train)
df = pd.DataFrame({'Real Values': y_train, 'Predicted Values': y_pred})
print(df)
```
Output
```
    Real Values  Predicted Values
0       57189.0      60702.225094
1       64445.0      55981.813188
2       63218.0      62590.389857
3      122391.0     123011.662261
4       91738.0      89968.778915
5       43525.0      44652.824612
6       61111.0      68254.884145
7       56642.0      53149.566044
8       66029.0      73919.378433
9       83088.0      75807.543195
10      46205.0      38044.247943
11     109431.0     107906.344160
12      98273.0      92801.026059
13      37731.0      39932.412705
14      54445.0      55981.813188
15      39891.0      46540.989374
16     101302.0     100353.685109
17      55794.0      63534.472238
18      81363.0      81472.037483
19      39343.0      36156.083180
20     113812.0     103185.932253
21      67938.0      72031.213670
22     112635.0     116403.085592
23     105582.0     110738.591304
```
Step 4: Model Evaluation
We need to evaluate the performance of the model to determine its accuracy. We will use the mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²) as evaluation metrics. The code for this is as follows −
```python
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

# get the predicted values for the test data
y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("mse:", mse)
rmse = mean_squared_error(y_test, y_pred, squared=False)
print("rmse:", rmse)
mae = mean_absolute_error(y_test, y_pred)
print("mae:", mae)
r2 = r2_score(y_test, y_pred)
print("r2:", r2)
```
Output
```
mse: 46485664.99327367
rmse: 6818.0396737826095
mae: 6015.513730219523
r2: 0.9399326805390613
```
Here, y_test represents the actual output variable of the test data.
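To connect these metrics back to their formulas, the following minimal NumPy sketch recomputes them directly from their definitions; it assumes the y_test and y_pred arrays from the step above.
```python
import numpy as np

# Recompute each metric from its definition
mse_manual = np.mean((y_test - y_pred) ** 2)
rmse_manual = np.sqrt(mse_manual)
mae_manual = np.mean(np.abs(y_test - y_pred))

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, mae_manual, r2_manual)
```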
Step 5: Visualize Training Set Results (with Regression Line)
Now, let's visualize the results on the training set and the regression line.
We use a scatter plot to plot the actual values (input and target values) in the training set. We also plot the regression line, i.e., the predicted values against the input values of the training set.
```python
y_pred = regressor.predict(X_train)

plt.scatter(X_train, y_train, color="green", label="training data points (actual)")
plt.scatter(X_train, y_pred, color="blue", label="training data points (predicted)")
plt.plot(X_train, y_pred, color="red")
plt.title("Salary vs Experience (Training Dataset)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (In Rupees)")
plt.legend()
plt.show()
```
Output
[Figure: Training set with regression line (red), actual values (green), and predicted values (blue)]
The above graph shows the line of regression (straight line in red color), actual values (in green color), and predicted values (in blue color) for the training set.
Step 6: Visualize the Test Set Results (with Regression Line)
Now, let's visualize the results on the test set and the regression line.
We use a scatter plot to plot the actual values (input and target values) in the test set. We also plot the regression line, i.e., the predicted values against the input values of the test set.
```python
y_pred = regressor.predict(X_test)

plt.scatter(X_test, y_test, color="green", label="test data points (actual)")
plt.scatter(X_test, y_pred, color="blue", label="test data points (predicted)")
plt.plot(X_test, y_pred, color="red")
plt.title("Salary vs Experience (Test Dataset)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (In Rupees)")
plt.legend()
plt.show()
```
Output
[Figure: Test set with regression line (red), actual values (green), and predicted values (blue)]
The above graph shows the line of regression (straight line in red color), actual values (in green color), and predicted values (in blue color) for the test set.