Reinforcement Learning 101

A friendly overview of RL and algorithms like Q-learning, PPO and SAC.
🤖What is Reinforcement Learning?

Core idea

In Reinforcement Learning (RL), an agent interacts with an environment. At each time step it:

- observes the current state of the environment,
- chooses an action,
- receives a reward, and
- moves to a new state.

The goal: learn a policy (a way of choosing actions) that maximizes the long-term reward.
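
To make the loop concrete, here is a minimal sketch in Python. The env and agent objects are hypothetical placeholders (an environment with reset()/step() methods, an agent with act()/learn()), not code from the demo:

```python
# Minimal sketch of the agent-environment loop (illustrative only).
# Assumes a hypothetical env with reset()/step() and an agent with act()/learn().

def run_episode(env, agent):
    state = env.reset()                                  # start a new episode
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                        # agent chooses an action
        next_state, reward, done = env.step(action)      # environment responds
        agent.learn(state, action, reward, next_state)   # agent updates from experience
        state = next_state
        total_reward += reward
    return total_reward
```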

Key pieces of the RL puzzle:

- State: what the agent observes about the environment.
- Action: what the agent can do.
- Reward: a scalar signal telling the agent how good the last step was.
- Policy: the agent’s rule for choosing actions.
- Value function: an estimate of how much reward to expect from a state (or state–action pair) onwards.

🧮Value-based RL

Q-learning (what the demo uses)

Q-learning learns a table or function Q(s,a) = “expected future reward if I take action a in state s and act well afterwards”.

The classic update rule is:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') − Q(s, a) ]
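
As a sketch, the same update written in Python. The dictionary-based Q-table and the default α and γ values are illustrative assumptions, not the demo's actual code:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-table: maps (state, action) pairs to value estimates

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    """One Q-learning update for the transition (state, action, reward, next_state)."""
    best_next = max(Q[(next_state, a)] for a in actions)       # max_a' Q(s', a')
    td_target = reward + gamma * best_next                     # r + gamma * max_a' Q(s', a')
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```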

In the Human-Reward Q-Learning demo, the reward signal comes from human feedback rather than from a hand-coded reward function.

The agent explores with an ε-greedy policy: most of the time it chooses the action with the highest Q-value, but with probability ε it picks a random action to keep exploring.
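
A minimal ε-greedy selection rule might look like this; again a sketch, reusing the same illustrative dictionary-style Q-table as above:

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # (state, action) -> value estimate, as in the sketch above

def epsilon_greedy(state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: Q[(state, a)])        # exploit: highest Q-value
```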

🎯Policy-gradient and Actor-Critic

From tables to neural networks

For large or continuous state spaces (e.g. images, robots), we can’t store a Q-value for every state–action pair. Instead we use neural networks.

Policy-gradient

A policy network π_θ(a | s) outputs a distribution over actions. We directly adjust the parameters θ to increase expected reward:

θ ← θ + α · ∇_θ E[ return ]

This approach naturally handles continuous actions (e.g. torques for a robot arm).
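
A bare-bones REINFORCE-style update for a linear softmax policy over discrete actions is sketched below. The linear parameterization, the episode format, and the step sizes are assumptions made purely for illustration; real implementations use neural networks and automatic differentiation:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update for a linear softmax policy pi_theta(a|s).

    theta:   (n_actions, n_features) parameter matrix (numpy array)
    episode: list of (features, action, reward) tuples from one rollout
    """
    # Discounted return G_t from each step onwards
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    grad = np.zeros_like(theta)
    for (x, a, _), G_t in zip(episode, returns):
        probs = softmax(theta @ x)           # pi_theta(. | s)
        grad_log = -np.outer(probs, x)       # grad_theta log pi_theta(a|s) for softmax
        grad_log[a] += x
        grad += G_t * grad_log               # REINFORCE: G_t * grad log pi_theta(a|s)

    return theta + alpha * grad              # gradient ascent on expected return
```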

Actor-Critic

Actor-Critic methods combine:

- an actor: a policy network that chooses actions, and
- a critic: a value network that estimates how good states (or state–action pairs) are.

The critic helps the actor make more stable updates by providing a better estimate of “how good was that trajectory?” than raw returns.
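
A tiny sketch of what the critic contributes: turning raw rewards into advantage estimates that the actor can learn from. The one-step advantage formula is standard; the function name and array layout are assumptions for illustration:

```python
import numpy as np

def one_step_advantages(rewards, values, last_value=0.0, gamma=0.99):
    """One-step advantages A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards:    rewards r_0 .. r_{T-1} from a rollout
    values:     critic's estimates V(s_0) .. V(s_{T-1})
    last_value: critic's estimate of the state after the rollout (0 if terminal)
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], last_value)
    return rewards + gamma * next_values - values
```

The actor is then updated in the direction of ∇ log π(a | s) weighted by these advantages rather than by raw returns, which usually reduces the variance of the updates.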

🚀Popular modern algorithms

PPO – Proximal Policy Optimization

PPO is a widely used on-policy actor-critic method. It tries to improve the policy while keeping each update close to the old policy so learning stays stable.
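
The heart of PPO is its clipped surrogate objective. Here is a sketch of that objective; the function name and batch format are illustrative, and a real implementation also includes value and entropy losses:

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective (to be maximized over policy parameters).

    new_logp:   log pi_theta(a|s) under the current policy
    old_logp:   log pi_theta_old(a|s) under the policy that collected the data
    advantages: advantage estimates for the same (state, action) pairs
    """
    ratio = np.exp(np.asarray(new_logp) - np.asarray(old_logp))   # pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    advantages = np.asarray(advantages)
    # Pessimistic minimum of clipped and unclipped terms, averaged over the batch
    return np.minimum(ratio * advantages, clipped * advantages).mean()
```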

SAC – Soft Actor-Critic

SAC is an off-policy actor-critic algorithm that adds an entropy bonus to the objective: besides reward, the agent is rewarded for keeping its policy random (high-entropy), which encourages exploration and makes training more robust.
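
The entropy bonus shows up most clearly in the “soft” Bellman target that SAC's critics regress toward. A simplified sketch, ignoring SAC's twin critics and automatic temperature tuning; names and defaults are illustrative:

```python
import numpy as np

def soft_q_target(rewards, next_q, next_logp, dones, alpha=0.2, gamma=0.99):
    """Soft Bellman target: y = r + gamma * (1 - done) * (Q(s', a') - alpha * log pi(a'|s')).

    next_q:    critic's estimate Q(s', a') for an action a' sampled from the policy
    next_logp: log pi(a'|s') for that sampled action
    alpha:     entropy temperature; larger values reward more random policies
    """
    rewards, next_q = np.asarray(rewards), np.asarray(next_q)
    next_logp, dones = np.asarray(next_logp), np.asarray(dones)
    soft_value = next_q - alpha * next_logp             # value plus entropy bonus
    return rewards + gamma * (1.0 - dones) * soft_value
```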

In many modern robotic control setups you’ll see SAC or PPO as the main baseline.

🧑‍💻Where this demo fits

Human-reward Q-learning

The “Human-Reward Q-Learning Demo” is intentionally tiny, so you can watch every part of the learning loop as it happens.

But the underlying structure is the same as in bigger systems: the agent observes a state, chooses an action, receives a reward, and updates its estimates.

You can think of complex algorithms like PPO or SAC as “much smarter, GPU-powered versions” of this little loop, usually with neural networks in place of a lookup table, continuous states and actions, and many environments running in parallel.

This demo is a great way to make that learning loop visible and interactive.

📄Read My Research Papers

Recent Work & Publications

This paper explores hyperparameter optimization for SAC and PPO using the Tree-structured Parzen Estimator (TPE) on a 7-DOF robotic arm. TPE significantly improves success rates and speeds up convergence (up to ~80% faster training for SAC and ~76% faster for PPO to reach near-optimal reward), highlighting how advanced hyperparameter search boosts deep RL performance in complex robotic tasks.

Jonaid Shianifar, Michael Schukat, Karl Mason
International Conference on Practical Applications of Agents and Multi-Agent Systems (Springer Nature Switzerland), pp. 293–304, 2024
