RL environment & stable-baselines


 2020/07/19   Share

Customizing Environments

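To build a custom RL environment that fits our own needs, we first need to understand the core interaction loop between an RL agent and the env. At the start of each episode, the environment resets its internal variables and returns the initial observation obs; this corresponds to the reset() method of the parent class gym.Env. The agent maps the obs it receives to an action. The environment then updates its internal (physical) variables according to that action and returns a 4-tuple (obs, reward, done, info) to the agent, which corresponds to the step() method. Here done is a flag that indicates whether the current episode has ended.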
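
The loop described above can be sketched without any RL library. The toy environment below is purely illustrative: the `CountdownEnv` class, its reward values, and the `run_episode` helper are hypothetical names made up for this sketch and are not part of gym. It only shows the order of the `reset()` / `step()` calls and the shape of the `(obs, reward, done, info)` tuple.

```python
import random


class CountdownEnv:
    """Hypothetical toy env: the state is a counter that must reach zero."""

    def __init__(self, start=5):
        self.start = start
        self.counter = start

    def reset(self):
        # Restore the internal state and return the initial observation.
        self.counter = self.start
        return self.counter

    def step(self, action):
        # Action 1 decrements the counter, action 0 leaves it unchanged.
        if action == 1:
            self.counter -= 1
        done = self.counter == 0
        reward = 1.0 if done else -0.1  # assumed reward scheme
        info = {}
        return self.counter, reward, done, info


def random_policy(obs):
    # A stand-in "agent" that ignores obs and picks an action at random.
    return random.choice([0, 1])


def always_decrement(obs):
    # A deterministic stand-in "agent" that always picks action 1.
    return 1


def run_episode(env, policy, max_steps=100):
    """Run one episode and return (steps taken, total reward)."""
    obs = env.reset()
    total_reward = 0.0
    for t in range(1, max_steps + 1):
        action = policy(obs)                        # agent: obs -> action
        obs, reward, done, info = env.step(action)  # env: action -> 4-tuple
        total_reward += reward
        if done:
            return t, total_reward
    return max_steps, total_reward


steps, total = run_episode(CountdownEnv(start=5), always_decrement)
print(steps)  # 5
```

Swapping `always_decrement` for `random_policy` leaves the loop unchanged, which is the point: the environment only ever sees actions, and the agent only ever sees the returned tuple.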

Following this logic, a generic customized environment only needs to subclass gym.Env and implement four methods, as shown below:

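import gym


class CustomEnv(gym.Env):
    def __init__(self, *args):
        super(CustomEnv, self).__init__()
        # TODO: store the constructor arguments and set up internal state.
        self.reset()  # reset() takes no arguments

    def reset(self):
        # TODO: reset the state of the env and build the initial observation.
        obs = None  # placeholder
        return obs

    def step(self, action):
        # TODO: internal state change according to action.
        obs, reward, done, info = None, 0.0, False, {}  # placeholders
        return obs, reward, done, info

    def render(self, mode='human'):
        # TODO: render the internal state of the env.
        pass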

Action & Obs Space

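If the project is algorithm-oriented, the skeleton above already covers most needs, because we develop the model part ourselves. Sometimes, however, we want to apply an RL algorithm that someone else has already implemented to a practical problem, without handling the algorithmic details ourselves. In that case the implementation above is missing two important components: observation_space and action_space, which constrain the input shape and the output shape of the agent's network, respectively.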

gym.spaces already implements many space types; the most common are Discrete and Box. A Discrete(n) space contains the n integers 0, 1, ..., n-1. A Box space represents an n-dimensional box, so every valid observation is bounded by the low/high limits of each dimension.
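
As a small illustration of these two space types, the sketch below builds one of each and checks membership with `contains()`. The concrete bounds and the variable names (`inside`, `outside`) are arbitrary choices for this example. The import falls back to `gymnasium`, on the assumption that it exposes the same spaces API, in case `gym` is not installed.

```python
import numpy as np

try:
    from gym import spaces
except ImportError:
    # Assumption: the gymnasium fork exposes the same spaces API.
    from gymnasium import spaces

# Discrete(3) contains exactly the integers 0, 1 and 2.
action_space = spaces.Discrete(3)

# A 2-dimensional Box with per-dimension bounds:
# dimension 0 lies in [-1, 1], dimension 1 lies in [0, 10].
observation_space = spaces.Box(
    low=np.array([-1.0, 0.0], dtype=np.float32),
    high=np.array([1.0, 10.0], dtype=np.float32),
    dtype=np.float32,
)

inside = np.array([0.5, 3.0], dtype=np.float32)
outside = np.array([0.5, 11.0], dtype=np.float32)  # violates the second bound

print(action_space.contains(2))             # 2 is a valid action
print(action_space.contains(3))             # 3 is out of range
print(observation_space.contains(inside))
print(observation_space.contains(outside))
print(observation_space.shape)              # shape is inferred from low/high
```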

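Besides these two common types, which are supported by most agent algorithms, there are also MultiDiscrete and MultiBinary. Our project uses a MultiBinary space, because we need to make a 0/1 decision for each object to determine whether a certain operation should be applied to it. These two types are less common, and some RL algorithms do not support them. Tuple combines several heterogeneous spaces into one, but it is rarely supported by off-the-shelf RL libraries.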
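
The per-object 0/1 decision can be sketched as follows. The object names are hypothetical placeholders, and the fixed `action` vector stands in for what a trained agent would output. The import falls back to `gymnasium`, on the assumption that it exposes the same spaces API, in case `gym` is not installed.

```python
import numpy as np

try:
    from gym import spaces
except ImportError:
    # Assumption: the gymnasium fork exposes the same spaces API.
    from gymnasium import spaces

objects = ["obj_a", "obj_b", "obj_c", "obj_d"]  # hypothetical object names

# One binary decision per object.
action_space = spaces.MultiBinary(len(objects))

# A hand-written action: operate on the first and the last object only.
action = np.array([1, 0, 0, 1], dtype=np.int8)
assert action_space.contains(action)

# Decode the binary vector into the list of objects to operate on.
selected = [name for name, flag in zip(objects, action) if flag == 1]
print(selected)  # ['obj_a', 'obj_d']

# sample() draws a random valid action of the same shape.
random_action = action_space.sample()
```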

Putting this together, the full skeleton that a CustomEnv needs to implement is:

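import gym


class CustomEnv(gym.Env):
    def __init__(self, *args):
        super(CustomEnv, self).__init__()
        # TODO: store the constructor arguments and set up internal state.

        # TODO: replace None with concrete gym.spaces objects.
        self.action_space = None
        self.observation_space = None
        self.reset()  # reset() takes no arguments

    def reset(self):
        # TODO: reset the state of the env and build the initial observation.
        obs = None  # placeholder
        return obs

    def step(self, action):
        # TODO: internal state change according to action.
        obs, reward, done, info = None, 0.0, False, {}  # placeholders
        return obs, reward, done, info

    def render(self, mode='human'):
        # TODO: render the internal state of the env.
        pass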
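
To make the skeleton concrete, here is a minimal sketch of a filled-in environment. The `TargetEnv` class, its dynamics, and its reward values are hypothetical and chosen only for illustration. It follows the older gym API used in this post, where `reset()` returns only `obs` and `step()` returns a 4-tuple. The import falls back to `gymnasium` in case `gym` is not installed, on the assumption that subclassing its `Env` works the same way for this purpose.

```python
import numpy as np

try:
    import gym
    from gym import spaces
except ImportError:
    # Assumption: gymnasium's Env and spaces can be used the same way here.
    import gymnasium as gym
    from gymnasium import spaces


class TargetEnv(gym.Env):
    """Hypothetical 1-D env: move a point right until it reaches `size`."""

    def __init__(self, size=5):
        super().__init__()
        self.size = size
        # Two actions: 0 = move left, 1 = move right.
        self.action_space = spaces.Discrete(2)
        # Observation: the current position, a single float in [0, size].
        self.observation_space = spaces.Box(
            low=0.0, high=float(size), shape=(1,), dtype=np.float32
        )
        self.position = 0

    def _get_obs(self):
        return np.array([self.position], dtype=np.float32)

    def reset(self):
        self.position = 0
        return self._get_obs()

    def step(self, action):
        delta = 1 if action == 1 else -1
        # Clip so the observation always stays inside observation_space.
        self.position = int(np.clip(self.position + delta, 0, self.size))
        done = self.position == self.size
        reward = 1.0 if done else -0.01  # assumed reward scheme
        return self._get_obs(), reward, done, {}

    def render(self, mode="human"):
        print("position:", self.position)


env = TargetEnv(size=3)
obs = env.reset()
done = False
steps = 0
while not done:
    obs, reward, done, info = env.step(1)  # always move right
    steps += 1
print(steps)  # 3
```

Because both spaces are declared in `__init__`, a library algorithm can read `env.observation_space` and `env.action_space` to size its network before any interaction takes place. If stable-baselines is installed, its `check_env` helper in `stable_baselines.common.env_checker` can be used to validate an environment like this one against the declared spaces.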