Explores hyperparameter optimization for SAC and PPO using the Tree-structured Parzen Estimator (TPE) on a 7-DOF robotic arm. TPE significantly improves success rates and speeds up convergence (up to ~80% faster training for SAC and ~76% faster for PPO to reach near-optimal reward), highlighting how advanced hyperparameter search boosts deep RL performance in complex robotic tasks.
Core idea
In Reinforcement Learning (RL), an agent interacts with an environment. At each time step it:
- observes the state s
- chooses an action a
- receives a reward r and a new state s'
The goal: learn a policy (a way of choosing actions) that maximizes the long-term reward.
Key pieces of the RL puzzle:
- State – what the agent currently sees.
- Action – what the agent can do.
- Reward – scalar signal of “how good was that?”.
- Policy – mapping from states to actions.
- Value – how good it is to be in a state (or take an action) in the long run.
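In code, that interaction loop looks roughly like this. A minimal sketch assuming the Gymnasium API, with CartPole-v1 standing in for whatever environment you care about and a random policy standing in for a learned one:

```python
import gymnasium as gym

# CartPole-v1 is just a stand-in; any Gym-style environment exposes this loop.
env = gym.make("CartPole-v1")

obs, info = env.reset()
total_reward = 0.0

for t in range(500):
    action = env.action_space.sample()            # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                        # the scalar feedback signal
    if terminated or truncated:                   # episode over: start again
        obs, info = env.reset()

print("collected reward:", total_reward)
```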
Q-learning (what the demo uses)
Q-learning learns a table or function Q(s, a) = “expected future reward if I take action a in state s and act well afterwards”.
The classic update rule is:
Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') − Q(s, a) ]
In the Human-Reward Q-Learning demo:
- Each cell of the grid is a state.
- Actions are up, right, down, left.
- You choose the reward (-1, 0, +1).
- The Q-table is updated based on your feedback.
The agent explores with an ε-greedy policy: most of the time it chooses
the action with the highest Q-value, but with probability ε it picks a random action
to keep exploring.
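Here is a rough sketch of those two pieces in Python. The grid itself and the way your feedback is collected are left out; choose_action and update correspond to the ε-greedy policy and the update rule above:

```python
import random
from collections import defaultdict

ACTIONS = ["up", "right", "down", "left"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1        # learning rate, discount, exploration rate

Q = defaultdict(float)                        # Q[(state, action)], everything starts at 0

def choose_action(state):
    """ε-greedy: usually the best-known action, occasionally a random one."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """The classic Q-learning update from the rule above."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```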
From tables to neural networks
For large or continuous state spaces (e.g. images, robots), we can’t store a Q-value for every state–action pair. Instead we use neural networks.
Policy-gradient
A policy network πθ(a | s) outputs a distribution over actions.
We directly adjust the parameters θ to increase expected reward:
θ ← θ + α · ∇_θ E[ return ]
This approach naturally handles continuous actions (e.g. torques for a robot arm).
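A minimal REINFORCE-style sketch in PyTorch, with a Gaussian policy for continuous actions. The state and action sizes are placeholders, and the per-step returns are assumed to be computed elsewhere:

```python
import torch
import torch.nn as nn

# Placeholder sizes: an 8-dimensional state, a 2-dimensional continuous action.
mean_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
log_std = nn.Parameter(torch.zeros(2))        # learned, state-independent std
optimizer = torch.optim.Adam(list(mean_net.parameters()) + [log_std], lr=3e-4)

def update_policy(states, actions, returns):
    """One REINFORCE step: raise the log-probability of each action
    in proportion to the return that followed it."""
    dist = torch.distributions.Normal(mean_net(states), log_std.exp())
    log_probs = dist.log_prob(actions).sum(dim=-1)    # sum over action dimensions
    loss = -(log_probs * returns).mean()              # gradient ascent on E[return]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```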
Actor-Critic
Actor-Critic methods combine:
- an actor (policy network) – decides actions
- a critic (value network) – estimates how good states/actions are
The critic helps the actor learn more stable updates by providing a better estimate of “how good was that trajectory?” than raw returns.
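Sketched in the same style, the critic supplies a baseline: the advantage (return minus predicted value) tells the actor whether an action turned out better or worse than expected. Again, the network shapes are placeholder assumptions:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))    # action means
critic = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))   # state values
log_std = nn.Parameter(torch.zeros(2))
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()) + [log_std], lr=3e-4
)

def update(states, actions, returns):
    values = critic(states).squeeze(-1)
    advantages = returns - values.detach()            # better or worse than the critic expected?

    dist = torch.distributions.Normal(actor(states), log_std.exp())
    actor_loss = -(dist.log_prob(actions).sum(-1) * advantages).mean()
    critic_loss = (returns - values).pow(2).mean()    # regress values toward observed returns

    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()
```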
PPO – Proximal Policy Optimization
PPO is a widely used on-policy actor-critic method. It tries to improve the policy while keeping each update close to the old policy so learning stays stable.
- Uses a clipped loss to prevent too-big policy updates.
- Good default choice for many continuous-control tasks.
- Often used in robotics and game AI.
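The heart of PPO is the clipped surrogate loss. A sketch, assuming the old policy's log-probabilities and the advantage estimates have already been computed:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate: the probability ratio is clipped so that
    a single update cannot move the policy too far from the old one."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # π_new(a|s) / π_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # minimize the negative surrogate
```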
SAC – Soft Actor-Critic
SAC is an off-policy actor-critic algorithm that adds an entropy bonus to the objective:
- Maximize reward and keep the policy “high-entropy” (more random).
- This encourages exploration and often gives very stable learning.
- Works very well for continuous actions (e.g. MuJoCo robots).
In many modern robotic control setups you’ll see SAC or PPO as the main baseline.
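A sketch of the entropy-regularized critic target SAC uses. Here alpha is the entropy temperature, and the twin Q-values and next-step log-probabilities are assumed to come from the usual target networks and the current policy:

```python
import torch

def soft_q_target(rewards, dones, next_q1, next_q2, next_log_probs,
                  alpha=0.2, gamma=0.99):
    """Bellman target with an entropy bonus: the -alpha * log π term rewards
    the policy for staying random, which is what makes the actor-critic 'soft'."""
    next_v = torch.min(next_q1, next_q2) - alpha * next_log_probs
    return rewards + gamma * (1.0 - dones) * next_v
```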
Human-reward Q-learning
The “Human-Reward Q-Learning Demo” is intentionally tiny:
- Small discrete gridworld.
- Tabular Q-learning instead of neural networks.
- The reward is not hard-coded — you provide it interactively.
But the structure is the same idea as in bigger systems:
- state → action → reward → update the policy/value function
- repeat many times → the agent’s behavior improves
You can think of complex algorithms like PPO or SAC as “much smarter, GPU-powered versions” of this little loop, usually with:
- vector states (images, sensor data, joint positions)
- continuous actions (forces, torques)
- neural networks instead of a simple Q-table
Your demo is a great way to make that learning loop visible and interactive.