Proximal Policy Optimization

class olympus.tasks.reinforcement.ppo.PPO(model: olympus.reinforcement.utils.AbstractActorCritic, dataloader, optimizer, lr_scheduler, device, ppo_epoch=5, ppo_batch_size=32, ppo_clip_param=10, ppo_max_grad_norm=1000, criterion=None, storage=None, logger=None)

Bases: olympus.tasks.task.Task

Parameters:

actor_critic: Module
    Torch module that takes a state and returns an action and a value

env: Env
    Gym-like environment

num_steps: int
    Number of simulation/environment steps to accumulate before doing a gradient step

Notes

RL has two batch sizes: the data loader batch size (lbs), which is equivalent to the number of simulations run in parallel, and the gradient batch size.

num_steps simulation steps are accumulated to perform one gradient update.
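As a rough illustration of how the two batch sizes interact, the sketch below counts the transitions collected in one rollout and the mini-batch gradient steps that could follow. It assumes that every environment step yields one transition per parallel simulation, and that the rollout is then split into mini-batches of `ppo_batch_size` for `ppo_epoch` passes. The helper `rollout_budget` is hypothetical and not part of olympus; the actual accounting inside `PPO` may differ.

```python
import math


def rollout_budget(loader_batch_size, num_steps, ppo_batch_size=32, ppo_epoch=5):
    """Hypothetical helper: count samples and gradient steps for one rollout."""
    # Assumption: one transition per parallel simulation per environment step.
    samples = loader_batch_size * num_steps
    # Mini-batches per pass over the rollout; the last one may be smaller.
    minibatches = math.ceil(samples / ppo_batch_size)
    # Assumption: every pass (ppo_epoch of them) visits every mini-batch once.
    gradient_steps = minibatches * ppo_epoch
    return samples, minibatches, gradient_steps


samples, minibatches, gradient_steps = rollout_budget(loader_batch_size=8, num_steps=32)
print(samples, minibatches, gradient_steps)  # prints "256 8 40"
```

With 8 parallel simulations and 32 accumulated steps, the rollout holds 256 transitions, which the default `ppo_batch_size=32` and `ppo_epoch=5` would turn into 40 mini-batch steps under these assumptions.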

Attributes:	device, events, metrics, model

Methods

eval_loss(batch) This is used to compute validation and test loss

fit(epochs[, context]) Execute a single batch

get_space(**fidelities) Return hyper parameter space

init([gamma, optimizer, lr_schedule, model, uid])

load_state_dict(state[, strict]) Try to load a previous unfinished state to resume

ppo(current_state, replay_vector) New policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent.

state_dict([destination, prefix, keep_vars]) Save a state the task can go back to if an error occurs

compute_returns
finish
parameters
report
resumed
set_device
summary

compute_returns(value, actions)

finish()

fit(epochs, context=None)

Execute a single batch

Parameters:

epoch: int
    Current step in the training process

context: dict
    Optional context

Notes

You should wrap whatever code you have here inside a BadResumeGuard to prevent users from resuming a failed task that may be in a bad state.

To resume a task, you need to create a clean one with the same hyperparameters. It will automatically pick up from its last checkpoint.

get_space(**fidelities)

Return the hyperparameter space

init(gamma=0.99, optimizer=None, lr_schedule=None, model=None, uid=None)

Parameters:

optimizer: Dict
    Optimizer hyperparameters

lr_schedule: Dict
    Learning rate schedule hyperparameters

model: Dict
    Model hyperparameters

gamma: float
    Reward discount factor

trial: Optional[str]
    Trial id to use for logging. When using Orion, a trial has usually already been created for us, so we only need to append to it.
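To make the role of `gamma` concrete, here is a minimal sketch of discounted returns for a single trajectory, using the standard recursion G_t = r_t + gamma * G_{t+1}. The function `discounted_returns` and its `bootstrap_value` argument are hypothetical names chosen for illustration. This page does not describe how `compute_returns` is implemented, so this should not be read as the library's method.

```python
def discounted_returns(rewards, gamma=0.99, bootstrap_value=0.0):
    """Hypothetical helper: G_t = r_t + gamma * G_{t+1} for one trajectory."""
    returns = []
    # Start from the value estimate of the state after the last step
    # (0.0 if the episode terminated there).
    running = bootstrap_value
    # Walk the trajectory backwards so each return reuses the next one.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # prints [1.75, 1.5, 1.0]
```

A `gamma` close to 1, such as the default 0.99, weights distant rewards almost as much as immediate ones, while smaller values make the agent more short-sighted.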

load_state_dict(state, strict=True)

Try to load a previous unfinished state to resume

Notes

You should wrap whatever code you have here inside a BadResumeGuard to prevent users from resuming a failed task that may be in a bad state.

To resume a task, you need to create a clean one with the same hyperparameters. It will automatically pick up from its last checkpoint.

model

parameters()

ppo(current_state, replay_vector)

New policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, PPO uses an objective function that enables multiple epochs of mini-batch updates.
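The surrogate objective mentioned above can be sketched in plain Python. This follows the clipped form from the referenced paper: the mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio between the new and old policy and A is the advantage. The function name `clipped_surrogate` and its arguments are hypothetical; this is an illustration of the technique, not the body of `PPO.ppo`. The `clip_param` default of 0.2 used here is the value from the paper's experiments, not this class's `ppo_clip_param` default.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_param=0.2):
    """Hypothetical helper: mean clipped surrogate objective (to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_param, 1 + clip_param].
        clipped = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
        # Pessimistic bound: take the smaller of the two candidate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# The ratio is 2 for both samples: the positive-advantage term is clipped to
# 1.2, while the negative-advantage term keeps the unclipped penalty of -2.
value = clipped_surrogate([math.log(2.0)] * 2, [0.0, 0.0], [1.0, -1.0])
print(round(value, 6))  # prints -0.4
```

Because the objective takes the minimum of the clipped and unclipped terms, moving the policy far from the old one earns no extra credit when the advantage is positive but is still fully penalized when it is negative. This is what makes several passes over the same rollout reasonably safe.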

References

Original Paper https://arxiv.org/pdf/1707.06347.pdf

state_dict(destination=None, prefix='', keep_vars=False)

Save a state the task can go back to if an error occurs