Explores hyperparameter optimization for SAC and PPO using the Tree-structured Parzen Estimator (TPE) on a 7-DOF robotic arm. TPE significantly improves success rates and speeds up convergence (up to ~80% faster training for SAC and ~76% faster for PPO to reach near-optimal reward), highlighting how advanced hyperparameter search boosts deep RL performance in complex robotic tasks.
Core idea
In Reinforcement Learning (RL), an agent interacts with an environment. At each time step it:
- observes the state s
- chooses an action a
- receives a reward r and a new state s'
The goal: learn a policy (a way of choosing actions) that maximizes the long-term reward.
Key pieces of the RL puzzle:
- State – what the agent currently sees.
- Action – what the agent can do.
- Reward – scalar signal of “how good was that?”.
- Policy – mapping from states to actions.
- Value – how good it is to be in a state (or take an action) in the long run.
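In code, that interaction loop looks roughly like this. A minimal sketch assuming the Gymnasium API, with CartPole-v1 standing in for whatever environment you care about and a random policy standing in for a learned one:

```python
import gymnasium as gym

# CartPole-v1 is just a stand-in; any Gym-style environment exposes this loop.
env = gym.make("CartPole-v1")

obs, info = env.reset()
total_reward = 0.0

for t in range(500):
    action = env.action_space.sample()            # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                        # the scalar feedback signal
    if terminated or truncated:                   # episode over: start again
        obs, info = env.reset()

print("collected reward:", total_reward)
```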
Q-learning (what the demo uses)
Q-learning learns a table or function Q(s, a) = “expected future reward if I take action a in state s and act well afterwards”.
The classic update rule is:
Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') − Q(s, a) ]
In the Human-Reward Q-Learning demo:
- Each cell of the grid is a state.
- Actions are up, right, down, left.
- You choose the reward (-1, 0, +1).
- The Q-table is updated based on your feedback.
The agent explores with an ε-greedy policy: most of the time it chooses
the action with the highest Q-value, but with probability ε it picks a random action
to keep exploring.
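Here is a rough sketch of those two pieces in Python. The grid itself and the way your feedback is collected are left out; choose_action and update correspond to the ε-greedy policy and the update rule above:

```python
import random
from collections import defaultdict

ACTIONS = ["up", "right", "down", "left"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1        # learning rate, discount, exploration rate

Q = defaultdict(float)                        # Q[(state, action)], everything starts at 0

def choose_action(state):
    """ε-greedy: usually the best-known action, occasionally a random one."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """The classic Q-learning update from the rule above."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```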
From tables to neural networks
For large or continuous state spaces (e.g. images, robots), we can’t store a Q-value for every state–action pair. Instead we use neural networks.
Policy-gradient
A policy network πθ(a | s) outputs a distribution over actions.
We directly adjust the parameters θ to increase expected reward:
θ ← θ + α · ∇_θ E[ return ]
This approach naturally handles continuous actions (e.g. torques for a robot arm).
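A minimal REINFORCE-style sketch in PyTorch, with a Gaussian policy for continuous actions. The state and action sizes are placeholders, and the per-step returns are assumed to be computed elsewhere:

```python
import torch
import torch.nn as nn

# Placeholder sizes: an 8-dimensional state, a 2-dimensional continuous action.
mean_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
log_std = nn.Parameter(torch.zeros(2))        # learned, state-independent std
optimizer = torch.optim.Adam(list(mean_net.parameters()) + [log_std], lr=3e-4)

def update_policy(states, actions, returns):
    """One REINFORCE step: raise the log-probability of each action
    in proportion to the return that followed it."""
    dist = torch.distributions.Normal(mean_net(states), log_std.exp())
    log_probs = dist.log_prob(actions).sum(dim=-1)    # sum over action dimensions
    loss = -(log_probs * returns).mean()              # gradient ascent on E[return]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```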
Actor-Critic
Actor-Critic methods combine:
- an actor (policy network) – decides actions
- a critic (value network) – estimates how good states/actions are
The critic helps the actor learn more stable updates by providing a better estimate of “how good was that trajectory?” than raw returns.
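Sketched in the same style, the critic supplies a baseline: the advantage (return minus predicted value) tells the actor whether an action turned out better or worse than expected. Again, the network shapes are placeholder assumptions:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))    # action means
critic = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))   # state values
log_std = nn.Parameter(torch.zeros(2))
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()) + [log_std], lr=3e-4
)

def update(states, actions, returns):
    values = critic(states).squeeze(-1)
    advantages = returns - values.detach()            # better or worse than the critic expected?

    dist = torch.distributions.Normal(actor(states), log_std.exp())
    actor_loss = -(dist.log_prob(actions).sum(-1) * advantages).mean()
    critic_loss = (returns - values).pow(2).mean()    # regress values toward observed returns

    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()
```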
PPO – Proximal Policy Optimization
PPO is a widely used on-policy actor-critic method. It tries to improve the policy while keeping each update close to the old policy so learning stays stable.
- Uses a clipped loss to prevent too-big policy updates.
- Good default choice for many continuous-control tasks.
- Often used in robotics and game AI.
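The heart of PPO is the clipped surrogate loss. A sketch, assuming the old policy's log-probabilities and the advantage estimates have already been computed:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate: the probability ratio is clipped so that
    a single update cannot move the policy too far from the old one."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # π_new(a|s) / π_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # minimize the negative surrogate
```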
SAC – Soft Actor-Critic
SAC is an off-policy actor-critic algorithm that adds an entropy bonus to the objective:
- Maximize reward and keep the policy “high-entropy” (more random).
- This encourages exploration and often gives very stable learning.
- Works very well for continuous actions (e.g. MuJoCo robots).
In many modern robotic control setups you’ll see SAC or PPO as the main baseline.
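A sketch of the entropy-regularized critic target SAC uses. Here alpha is the entropy temperature, and the twin Q-values and next-step log-probabilities are assumed to come from the usual target networks and the current policy:

```python
import torch

def soft_q_target(rewards, dones, next_q1, next_q2, next_log_probs,
                  alpha=0.2, gamma=0.99):
    """Bellman target with an entropy bonus: the -alpha * log π term rewards
    the policy for staying random, which is what makes the actor-critic 'soft'."""
    next_v = torch.min(next_q1, next_q2) - alpha * next_log_probs
    return rewards + gamma * (1.0 - dones) * next_v
```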
Human-reward Q-learning
The “Human-Reward Q-Learning Demo” is intentionally tiny:
- Small discrete gridworld.
- Tabular Q-learning instead of neural networks.
- The reward is not hard-coded — you provide it interactively.
But the structure is the same idea as in bigger systems:
- state → action → reward → update the policy/value function
- repeat many times → the agent’s behavior improves
You can think of complex algorithms like PPO or SAC as “much smarter, GPU-powered versions” of this little loop, usually with:
- vector states (images, sensor data, joint positions)
- continuous actions (forces, torques)
- neural networks instead of a simple Q-table
Your demo is a great way to make that learning loop visible and interactive.