SARSA Reinforcement Learning



SARSA stands for State-Action-Reward-State-Action. It is a variant of the Q-learning algorithm in which the target policy is the same as the behavior policy. The updated Q-value is determined by two consecutive state-action pairs and the immediate reward the agent receives while transitioning from the first state to the next, which is why the method is called SARSA.

What is SARSA?

State-Action-Reward-State-Action (SARSA) is a reinforcement learning algorithm that learns from the sequence of events the agent experiences: the current state, the action taken, the reward received, the next state, and the next action. It is an effective on-policy learning technique that helps agents make the right choices in various situations. The main idea behind SARSA is trial and error: the agent takes an action in a situation, observes the consequence, and modifies its plan based on the result.

For example, assume you are teaching a robot to navigate a maze. The robot starts at a particular position, which is its 'state', and the goal is to find the best route to the end of the maze. At each step the robot can move in one of several directions, each a possible 'action'. The robot receives feedback in the form of rewards, positive or negative, that indicate how well it is performing.
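To make the example concrete, here is a minimal sketch of such a maze environment in Python. The class name, grid size, and reward values are illustrative assumptions, not part of SARSA itself.

```python
class GridMaze:
    """A tiny illustrative maze: the robot starts at (0, 0) and must reach
    the goal cell, receiving -1 per step and +10 on reaching the goal."""

    ACTIONS = ["up", "down", "left", "right"]  # the moves available at every step

    def __init__(self, rows=4, cols=4, goal=(3, 3)):
        self.rows, self.cols, self.goal = rows, cols, goal
        self.state = (0, 0)

    def reset(self):
        """Put the robot back at the start and return the initial state."""
        self.state = (0, 0)
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        r, c = self.state
        if action == "up":
            r = max(r - 1, 0)
        elif action == "down":
            r = min(r + 1, self.rows - 1)
        elif action == "left":
            c = max(c - 1, 0)
        elif action == "right":
            c = min(c + 1, self.cols - 1)
        self.state = (r, c)
        done = self.state == self.goal
        reward = 10.0 if done else -1.0  # positive feedback at the goal, small penalty per step
        return self.state, reward, done
```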

The SARSA update rule for the Q-value is as follows −

Q(S, A) ← Q(S, A) + α [R + γ Q(S', A') − Q(S, A)]

where α is the learning rate and γ is the discount factor that weighs future rewards.
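As a minimal sketch, the same update can be written as a small Python function. The function name, the dictionary-style Q-table, and the default hyperparameter values are assumptions for illustration.

```python
from collections import defaultdict

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA update: Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A)).
    `q` is a defaultdict(float) mapping (state, action) pairs to value estimates."""
    td_target = r + gamma * q[(s_next, a_next)]  # bootstrap from the action actually chosen next
    td_error = td_target - q[(s, a)]             # temporal-difference error
    q[(s, a)] += alpha * td_error
    return q

# Example usage with a Q-table that defaults missing entries to 0.0:
q_table = defaultdict(float)
sarsa_update(q_table, s=(0, 0), a="right", r=-1.0, s_next=(0, 1), a_next="down")
```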

Components of SARSA

Some of the core components of the SARSA algorithm include −

  • State(S) − A state is a reflection of the environment, containing all details about the agent's present situation.
  • Action(A) − An action represents the decision made by the agent depending on its present state. The action, chosen from the set of available actions, causes a transition from the current state to the next state. This transition is how the agent engages with its environment to produce desired results.
  • Reward(R) − Reward is a variable provided by the environment in response to the agent's action within a specific state. This feedback signal shows the instant outcome of the agent's choice. Rewards help the agent learn by showing which actions are desirable in certain situations.
  • Next State(S') − When the agent acts in a specific state, it causes a shift to a different situation called the "next state." This new state (S') is the agent's updated environment.
  • Next Action(A') − In the next state, the agent chooses another action using the same policy. The value of this next state-action pair, Q(S',A'), is what SARSA uses in its update, as illustrated in the sketch below.
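A hypothetical container bundling these components into the quintuple (S, A, R, S', A') might look like this; the class and field names are purely illustrative.

```python
from dataclasses import dataclass
from typing import Hashable

@dataclass
class SarsaTransition:
    """One experience tuple (S, A, R, S', A'): the quintuple SARSA is named after."""
    state: Hashable        # S  : the agent's current situation
    action: Hashable       # A  : the decision taken in that state
    reward: float          # R  : immediate feedback from the environment
    next_state: Hashable   # S' : the situation reached after acting
    next_action: Hashable  # A' : the action chosen in the next state under the same policy
```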

Working of SARSA Algorithm

The SARSA reinforcement learning algorithm allows agents to learn and make decisions in an environment by maximizing cumulative rewards over time using the State-Action-Reward-State-Action sequence. It involves an iterative cycle of engaging with the environment, gaining insights from past events, and enhancing the decision-making strategy. Let's analyze the working of the SARSA algorithm −

  • Q-Table Initialization − SARSA begins by initializing Q(S,A), the value estimate for each state-action pair, to arbitrary values. The starting state (S) is then determined, and the initial action (A) is chosen using an epsilon-greedy policy based on the current Q-values.
  • Exploration Vs. Exploitation − Exploitation involves using the values already estimated from previous experience to improve the chance of receiving rewards. Exploration, on the other hand, involves selecting actions that may yield lower short-term rewards but could help discover better actions and higher rewards in the future. The epsilon-greedy policy balances these two behaviors.
  • Action execution and Feedback − Once the chosen action (A) is executed, it results in a reward (R) and a transition to the next state (S').
  • Q-Value Update − The next action (A') is selected in the new state (S') using the same epsilon-greedy policy, and the Q-value of the current state-action pair is updated based on the received reward and the value Q(S',A') of the next state-action pair.
  • Iteration and Learning − The above steps are repeated until the episode reaches a terminal state. Throughout the process, SARSA continuously updates its Q-values based on the observed state-action-reward transitions. These updates improve the algorithm's ability to anticipate future rewards for state-action pairs, guiding the agent toward better decisions in the long run. The training-loop sketch below puts these steps together.
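The sketch below shows one possible SARSA training loop in Python. It assumes the hypothetical GridMaze environment from the earlier sketch; the episode count and hyperparameter values are illustrative.

```python
import random
from collections import defaultdict

def epsilon_greedy(q, state, actions, epsilon):
    """Explore with probability epsilon, otherwise exploit the current Q-values."""
    if random.random() < epsilon:
        return random.choice(actions)                  # exploration
    return max(actions, key=lambda a: q[(state, a)])   # exploitation

def train_sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(float)                             # Q-table initialized to arbitrary (zero) values
    for _ in range(episodes):
        s = env.reset()                                # starting state
        a = epsilon_greedy(q, s, env.ACTIONS, epsilon) # initial action from the behavior policy
        done = False
        while not done:                                # repeat until a terminal state is reached
            s_next, r, done = env.step(a)              # execute the action, observe reward and next state
            a_next = epsilon_greedy(q, s_next, env.ACTIONS, epsilon)  # next action from the same policy
            # On-policy update: the target uses the action actually selected next,
            # and bootstraps nothing beyond the reward once the episode ends.
            q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] * (not done) - q[(s, a)])
            s, a = s_next, a_next
    return q

# Example usage with the maze sketched earlier:
# learned_q = train_sarsa(GridMaze())
```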

SARSA Vs Q-Learning

SARSA and Q-learning are two value-based reinforcement learning algorithms. SARSA is on-policy: it learns the value of the policy it actually follows, including its exploratory actions. Q-learning is off-policy: it learns the value of the greedy policy regardless of the actions actually taken. This difference affects how each algorithm updates its action-value function. Some differences are tabulated below, followed by a short code comparison of the two update rules −

| Feature | SARSA | Q-Learning |
| --- | --- | --- |
| Policy Type | On-policy | Off-policy |
| Update Rule | Q(s,a) ← Q(s,a) + α(r + γQ(s',a') − Q(s,a)) | Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') − Q(s,a)) |
| Convergence | Slower convergence to the optimal policy. | Typically faster convergence to the optimal policy. |
| Exploration Vs Exploitation | Exploration directly influences learning updates. | Exploration policy can differ from the learning policy. |
| Policy Update | Updates the action-value function based on the action actually taken. | Updates the action-value function assuming the best possible action is always taken. |
| Use Case | Suitable for environments where stability is important. | Suitable for environments where efficiency is important. |
| Example | Healthcare, traffic management, personalized learning. | Gaming, robotics, financial trading. |
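In code, the difference between the two update rules comes down to how the bootstrap target is formed. The sketch below uses the same dictionary-style Q-table assumed in the earlier examples; the function names are illustrative.

```python
def sarsa_target(q, r, s_next, a_next, gamma):
    """SARSA (on-policy): bootstrap from the action A' the behavior policy actually chose."""
    return r + gamma * q[(s_next, a_next)]

def q_learning_target(q, r, s_next, actions, gamma):
    """Q-learning (off-policy): bootstrap from the greedy (highest-value) action in S'."""
    return r + gamma * max(q[(s_next, a)] for a in actions)

# Either target then feeds the same update:  q[(s, a)] += alpha * (target - q[(s, a)])
```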