Proximal Policy Optimization
-
class olympus.tasks.reinforcement.ppo.PPO(model: olympus.reinforcement.utils.AbstractActorCritic, dataloader, optimizer, lr_scheduler, device, ppo_epoch=5, ppo_batch_size=32, ppo_clip_param=10, ppo_max_grad_norm=1000, criterion=None, storage=None, logger=None)
Bases: olympus.tasks.task.Task
Parameters:
- actor_critic: Module
Torch Module that takes a state and returns an action and a value
- env: Env
Gym like environment
- num_steps: int
number of simulation/environment steps to accumulate before doing a gradient step
Notes
RL has two batch sizes: the data loader batch size (lbs), which is the number of simulations run in parallel, and the gradient batch size.
num_steps simulation steps are accumulated together to perform one gradient update.
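The relationship between the two batch sizes can be sketched with simple arithmetic. This is a minimal illustration, not the library's implementation: the helper `rollout_sizes` is hypothetical, and it assumes that the accumulated transitions are split into mini-batches of `ppo_batch_size` and revisited `ppo_epoch` times, which the constructor arguments suggest but the notes above do not state.

```python
import math


def rollout_sizes(loader_batch_size, num_steps, ppo_batch_size=32, ppo_epoch=5):
    """Hypothetical helper: sizes implied by one rollout.

    Assumes the rollout is cut into mini-batches of ppo_batch_size
    and that every mini-batch is visited once per PPO epoch.
    """
    # lbs parallel simulations, each advanced num_steps times
    transitions = loader_batch_size * num_steps
    # mini-batches needed to cover the rollout once
    minibatches = math.ceil(transitions / ppo_batch_size)
    # total optimizer steps under the assumption above
    optimizer_steps = minibatches * ppo_epoch
    return transitions, minibatches, optimizer_steps


print(rollout_sizes(loader_batch_size=8, num_steps=128))
# (1024, 32, 160)
```

With 8 parallel simulations and 128 accumulated steps, one rollout holds 1024 transitions.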
Attributes:
- device
- events
- metrics
- model
Methods
- eval_loss(batch)
This is used to compute validation and test loss
- fit(epochs[, context])
Execute a single batch
- get_space(**fidelities)
Return hyper parameter space
- init([gamma, optimizer, lr_schedule, model, uid])
- load_state_dict(state[, strict])
Try to load a previous unfinished state to resume
- ppo(current_state, replay_vector)
New policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent.
- state_dict([destination, prefix, keep_vars])
Save a state the task can go back to if an error occurs
- compute_returns
- finish
- parameters
- report
- resumed
- set_device
- summary
-
fit(epochs, context=None)
Execute a single batch
Parameters:
- epoch: int
current step in the training process
- context: dict
Optional Context
Notes
You should wrap whatever code you have here inside a BadResumeGuard to prevent users from resuming a failed task that may be in a bad state.
To resume a task, you need to create a clean one with the same hyper parameters. It will automatically pick up from its last checkpoint.
-
init(gamma=0.99, optimizer=None, lr_schedule=None, model=None, uid=None)
Parameters:
- optimizer: Dict
Optimizer hyper parameters
- lr_schedule: Dict
lr schedule hyper parameters
- model: Dict
model hyper parameters
- gamma: float
reward discount factor
- trial: Optional[str]
trial id to use for logging. When using Orion, a trial has usually already been created for us, so we only need to append to it
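To illustrate what the reward discount factor gamma does, here is a minimal sketch of discounted returns computed backwards over one trajectory. The function `discounted_returns` and its `bootstrap_value` argument are hypothetical names for this example; the library's own `compute_returns` may differ (for instance by handling episode ends or using advantage estimation).

```python
def discounted_returns(rewards, gamma=0.99, bootstrap_value=0.0):
    """Hypothetical helper: G_t = r_t + gamma * G_{t+1}.

    bootstrap_value stands in for the value estimate of the state
    that follows the last reward (0.0 if the episode ended).
    """
    returns = []
    running = bootstrap_value
    # walk the trajectory from the last step to the first
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
# [1.75, 1.5, 1.0]
```

A gamma close to 1 weights distant rewards almost as much as immediate ones; a smaller gamma makes the return more short-sighted.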
-
load_state_dict(state, strict=True)
Try to load a previous unfinished state to resume
Notes
You should wrap whatever code you have here inside a BadResumeGuard to prevent users from resuming a failed task that may be in a bad state.
To resume a task, you need to create a clean one with the same hyper parameters. It will automatically pick up from its last checkpoint.
-
model
-
ppo(current_state, replay_vector)
New policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of mini-batch updates.
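The “surrogate” objective from the referenced paper can be sketched in plain Python as follows. This is an illustration of the clipped objective only, not this method's implementation: the function name `clipped_surrogate` and its arguments are hypothetical, and the default `clip_param=0.2` is the value used in the paper, whereas the constructor above lists `ppo_clip_param=10`.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_param=0.2):
    """Hypothetical helper: mean clipped surrogate objective.

    For each sample, ratio = pi_new(a|s) / pi_old(a|s), and the
    objective is min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    The objective is maximized (gradient ascent); negate it to get a loss.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # probability ratio computed from log-probabilities
        ratio = math.exp(new_lp - old_lp)
        # keep the ratio inside [1 - eps, 1 + eps]
        clipped = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
        # pessimistic bound: take the smaller of the two terms
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A ratio of 1.5 with a positive advantage is capped at 1 + eps
value = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
print(round(value, 6))
```

Because the clipped term removes the incentive to push the ratio far from 1, the same batch of data can be reused for several epochs of mini-batch updates.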
References
Original Paper https://arxiv.org/pdf/1707.06347.pdf