Proximal Policy Optimization

class olympus.tasks.reinforcement.ppo.PPO(model: olympus.reinforcement.utils.AbstractActorCritic, dataloader, optimizer, lr_scheduler, device, ppo_epoch=5, ppo_batch_size=32, ppo_clip_param=10, ppo_max_grad_norm=1000, criterion=None, storage=None, logger=None)[source]

Bases: olympus.tasks.task.Task

Parameters:
actor_critic: Module

Torch module that takes a state and returns an action and a value

env: Env

Gym-like environment

num_steps: int

number of simulation/environment steps to accumulate before doing a gradient step

Notes

RL has two batch sizes: the data loader batch size (lbs), which is equal to the number of simulations run in parallel, and the gradient batch size.

num_steps simulation steps are accumulated before performing one gradient update.
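
To make the two batch sizes concrete, here is a minimal sketch in plain Python. It assumes that one rollout holds num_steps × lbs transitions and that each of the ppo_epoch passes reshuffles the rollout and cuts it into minibatches of ppo_batch_size. The helper name minibatch_indices is hypothetical and not part of olympus; the library may slice its rollouts differently.

```python
import random


def minibatch_indices(num_steps, lbs, ppo_batch_size, ppo_epoch, seed=0):
    """Yield lists of flat transition indices, one list per gradient step.

    Hypothetical helper for illustration only, not an olympus function.
    """
    rng = random.Random(seed)
    total = num_steps * lbs  # transitions collected before any gradient step
    for _ in range(ppo_epoch):
        order = list(range(total))
        rng.shuffle(order)  # reshuffle the rollout on every pass
        for start in range(0, total, ppo_batch_size):
            yield order[start:start + ppo_batch_size]


# 8 environment steps across 4 parallel simulations give 32 transitions.
batches = list(minibatch_indices(num_steps=8, lbs=4, ppo_batch_size=16, ppo_epoch=5))
```

With these illustrative numbers, each pass over the 32 transitions produces two minibatches of 16, so five passes give ten gradient steps per rollout.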

Attributes:
device
events
metrics
model

Methods

eval_loss(batch) Compute the validation and test loss
fit(epochs[, context]) Execute a single batch
get_space(**fidelities) Return the hyperparameter space
init([gamma, optimizer, lr_schedule, model, uid])
load_state_dict(state[, strict]) Try to load a previous unfinished state to resume
ppo(current_state, replay_vector) A policy gradient method for reinforcement learning that alternates between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent.
state_dict([destination, prefix, keep_vars]) Save a state the task can go back to if an error occurs
compute_returns  
finish  
parameters  
report  
resumed  
set_device  
summary  
compute_returns(value, actions)[source]
finish()[source]
fit(epochs, context=None)[source]

Execute a single batch

Parameters:
epoch: int

current step in the training process

context: dict

Optional Context

Notes

You should wrap whatever code you have here inside a BadResumeGuard to prevent users from resuming a failed task that may be in a bad state.

To resume a task, create a clean one with the same hyperparameters. It will automatically pick up from its last checkpoint.

get_space(**fidelities)[source]

Return the hyperparameter space

init(gamma=0.99, optimizer=None, lr_schedule=None, model=None, uid=None)[source]
Parameters:
optimizer: Dict

Optimizer hyper parameters

lr_schedule: Dict

lr schedule hyper parameters

model: Dict

model hyper parameters

gamma: float

reward discount factor

trial: Optional[str]

Trial id to use for logging. When using Orion, a trial has usually already been created for us, so we only need to append to it.
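
As an illustration of what the reward discount factor gamma does, the sketch below computes discounted returns for a single trajectory in plain Python. The function discounted_returns and its bootstrap_value argument are hypothetical. It is not the library's compute_returns, whose exact behavior is not documented here, and it ignores episode boundaries.

```python
def discounted_returns(rewards, gamma=0.99, bootstrap_value=0.0):
    """Return R_t = r_t + gamma * R_{t+1} for every step of one trajectory.

    Hypothetical helper; it ignores episode boundaries and is not the
    library's compute_returns.
    """
    returns = []
    running = bootstrap_value  # value estimate for the state after the last step
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
# returns == [1.75, 1.5, 1.0]
```

A gamma close to 1 makes distant rewards count almost as much as immediate ones, while a smaller gamma shortens the effective horizon.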

load_state_dict(state, strict=True)[source]

Try to load a previous unfinished state to resume

Notes

You should wrap whatever code you have here inside a BadResumeGuard to prevent users from resuming a failed task that may be in a bad state.

To resume a task, create a clean one with the same hyperparameters. It will automatically pick up from its last checkpoint.

model
parameters()[source]
ppo(current_state, replay_vector)[source]

A policy gradient method for reinforcement learning that alternates between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, PPO uses an objective function that enables multiple epochs of mini-batch updates.
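
The “surrogate” objective described above can be sketched for a single sample as follows. This is the clipped form from the PPO paper, min(r·A, clip(r, 1 − ε, 1 + ε)·A), written in plain Python. The function name clipped_surrogate is hypothetical, and the code is not taken from this class. The default clip_param=0.2 is the value used in the paper's experiments, not the ppo_clip_param default shown in the class signature.

```python
import math


def clipped_surrogate(new_log_prob, old_log_prob, advantage, clip_param=0.2):
    """Per-sample clipped surrogate objective from the PPO paper (to be maximized)."""
    ratio = math.exp(new_log_prob - old_log_prob)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage is capped at 1 + clip_param.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=2.0)
# Ratio 0.5 with a negative advantage: the pessimistic bound is the clipped term.
penalized = clipped_surrogate(math.log(0.5), 0.0, advantage=-2.0)
```

Because the objective stops improving once the ratio leaves the clip range, the same rollout can be reused for several epochs of mini-batch updates without the policy drifting too far from the one that collected the data.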

References

Original Paper https://arxiv.org/pdf/1707.06347.pdf

state_dict(destination=None, prefix='', keep_vars=False)[source]

Save a state the task can go back to if an error occurs