A Comparative Study of On-Policy and Off-Policy Tabular RL in the Taxi-v3 Path-Planning Task

Authors

Zainal, M. M. M., Kamarudin, K., Abdalrahman, N., Rahiman, W., Abu Bakar, M. A., & Manan, M. R.

Keywords

Reinforcement Learning, On-Policy, Off-Policy, Path Planning

Abstract

Mobile robots are increasingly relied upon for navigation and exploration in unknown environments, where path planning is crucial. Reinforcement Learning (RL) algorithms, particularly Q-learning (off-policy) and SARSA (on-policy), have proven effective for autonomous decision-making in path planning. This study presents a comparative analysis of the two algorithms in the Taxi-v3 environment from OpenAI Gym, focusing on differences in policy behaviour and learning dynamics. Experiments were conducted with learning rates from 0.1 to 0.5, over 10,000 to 50,000 training episodes, and performance was evaluated on convergence rate, cumulative reward, and path efficiency. Both algorithms converged within the first 10-20% of training episodes, with Q-learning converging 5-10% faster and accumulating fewer penalties owing to its greedy update. Post-convergence analysis showed that Q-learning required 9 steps on average, following a direct, shortest-path route, while SARSA required 13 steps along a more exploratory path. Q-learning is therefore better suited to tasks requiring fast, shortest paths, while SARSA is preferable for exploration tasks under precautionary conditions.
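
To make the off-policy/on-policy distinction in the abstract concrete, the sketch below contrasts the two update rules on Taxi-v3. It is a minimal illustration, not the authors' implementation: the hyperparameter values (alpha, gamma, epsilon), the epsilon-greedy behaviour policy, and the classic pre-0.26 OpenAI Gym step()/reset() signatures are all assumptions for this sketch; newer Gymnasium releases return different tuples from those calls.

```python
# Minimal sketch (not the paper's code): tabular Q-learning vs. SARSA on
# Taxi-v3. Hyperparameters below are illustrative assumptions, not values
# reported in the study.
import numpy as np
import gym  # classic OpenAI Gym API: step() -> (obs, reward, done, info)

env = gym.make("Taxi-v3")
n_states, n_actions = env.observation_space.n, env.action_space.n

alpha, gamma, epsilon = 0.1, 0.99, 0.1  # assumed values for illustration

def epsilon_greedy(Q, s):
    # Behaviour policy used by both methods: random action with
    # probability epsilon, otherwise the current greedy action.
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[s]))

def train(method="q_learning", episodes=10_000):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = epsilon_greedy(Q, s2)
            if method == "q_learning":
                # Off-policy: bootstrap from the greedy action in s2,
                # regardless of the action the behaviour policy takes.
                target = r + gamma * np.max(Q[s2]) * (not done)
            else:
                # On-policy (SARSA): bootstrap from the action a2 that
                # will actually be executed next.
                target = r + gamma * Q[s2, a2] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```

The only difference between the two methods is the bootstrap target: Q-learning backs up the greedy value max_a Q(s', a) whatever the behaviour policy then does (off-policy), while SARSA backs up Q(s', a') for the action it actually executes (on-policy), which is consistent with the more cautious, exploration-aware paths reported in the abstract.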

Published

2025-12-29

How to Cite

Zainal, M. M. M., Kamarudin, K., Abdalrahman, N., Rahiman, W., Abu Bakar, M. A., & Manan, M. R. (2025). A Comparative Study of On-Policy and Off-Policy Tabular RL in the Taxi-v3 Path-Planning Task. International Journal of Autonomous Robotics and Intelligent Systems (IJARIS), 1(2), 143–156. Retrieved from https://ejournal.unimap.edu.my/index.php/ijaris/article/view/2639

