A Comparative Study of On-Policy and Off-Policy Tabular RL in the Taxi-v3 Path-Planning Task
Keywords: Reinforcement Learning, On-Policy, Off-Policy, Path Planning

Abstract
Mobile robots are increasingly relied upon for navigation and exploration in unknown environments, where path planning is crucial. Reinforcement Learning (RL) algorithms, particularly Q-learning (off-policy) and SARSA (on-policy), have proven effective for autonomous decision-making during path planning. This study presents a comparative analysis of the two algorithms in the Taxi-v3 environment of OpenAI Gym, focusing on differences in policy behaviour and learning dynamics. Experiments were run with learning rates from 0.1 to 0.5 over 10,000 to 50,000 episodes, and performance was evaluated in terms of convergence rate, cumulative reward, and path efficiency. Both algorithms converged within the first 10-20% of training episodes, with Q-learning converging 5-10% faster and accumulating fewer penalties owing to its greedy update rule. Post-convergence analysis showed that Q-learning required on average 9 steps per trip by following a direct, shortest-path route, whereas SARSA required 13 steps along a more exploratory route. Q-learning is therefore better suited to tasks that demand fast, shortest paths, while SARSA is preferable for exploratory tasks that call for more cautious behaviour.
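To make the on-policy/off-policy distinction in the setup above concrete, the sketch below contrasts the two tabular update rules on Taxi-v3. It is a minimal illustration rather than the authors' implementation: it assumes the Gym >= 0.26 reset/step API, and the hyperparameter defaults (alpha, gamma, epsilon, episode count) are placeholders chosen within the ranges reported in the abstract.

import numpy as np
import gym


def epsilon_greedy(Q, state, epsilon, n_actions, rng):
    # Explore with probability epsilon, otherwise act greedily on Q.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))


def train(algorithm="q_learning", episodes=10_000, alpha=0.1,
          gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning / SARSA on Taxi-v3 (illustrative sketch)."""
    env = gym.make("Taxi-v3")
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(episodes):
        state, _ = env.reset()
        action = epsilon_greedy(Q, state, epsilon, env.action_space.n, rng)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = epsilon_greedy(Q, next_state, epsilon,
                                         env.action_space.n, rng)
            if algorithm == "q_learning":
                # Off-policy target: bootstrap on the greedy action's value.
                target = reward + gamma * np.max(Q[next_state]) * (not done)
            else:
                # On-policy (SARSA) target: bootstrap on the action actually taken.
                target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    env.close()
    return Q

The only difference between the two methods is the bootstrap term in the target: Q-learning maximises over next-state actions regardless of the behaviour policy, while SARSA uses the value of the action the epsilon-greedy policy actually selects, which is what produces the more cautious, exploratory paths reported for SARSA.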
License
Copyright (c) 2025 International Journal of Autonomous Robotics and Intelligent Systems (IJARIS)

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

