
Part III Applications


7.3 Output and Exploration

PPO, like any Policy Gradient algorithm, makes use of a stochastic policy: given the state, the policy outputs a probability distribution over actions rather than a single action.

In the case of continuous actions we can use the tanh output layer described in 3.1.1.

The output in (−1, 1) can be scaled to the acceptable range of the physical problem, and the scaled outputs are then used as the means of a multivariate Gaussian distribution. The variances of the distribution are optimizable parameters that determine the balance between exploration and exploitation. At the beginning of the training the variances are relatively large; as the training goes on the optimizer reduces them, since taking actions closer to the means becomes, on average, more rewarding as the policy improves.
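The scaling and sampling step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names scale_action and sample_action, the linear mapping onto (low, high) and the steering range of ±30 are hypothetical and do not come from the implementation described in this chapter.

```python
import math
import random


def scale_action(a, low, high):
    """Map a value in (-1, 1) linearly onto the interval (low, high)."""
    return low + 0.5 * (a + 1.0) * (high - low)


def sample_action(raw_output, low, high, std, rng=random):
    """Squash with tanh, scale to the physical range, sample around that mean."""
    mean = scale_action(math.tanh(raw_output), low, high)
    return rng.gauss(mean, std)


# Hypothetical steering command in degrees: a large std explores widely,
# a small std keeps the sampled action close to the mean.
exploring = sample_action(0.4, -30.0, 30.0, std=10.0)
exploiting = sample_action(0.4, -30.0, 30.0, std=0.5)
```

Note that a sampled action can fall outside (low, high); whether and how it is clipped is not covered by this sketch.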

Fig. 7.1: The action is sampled from a Multivariate Gaussian whose means are the NN output

7.4 NN architectures

The algorithm explained in 7.2 imports a PyTorch Model about which very few assumptions are made:

• The output of the model must be a 2-element tuple whose first element is a probability distribution and whose second element is a real number.

Obviously, the actor-critic algorithm expects an actor-critic model.

96 7 Implementation of tuple input capable a PPO Algorithm

• The model must be able to process the state of the environment as its input.
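The two requirements can be illustrated with a minimal stand-in model and the way a rollout step might consume its output. TinyActorCritic is a hypothetical name introduced only for this sketch, and summing the per-dimension log-probabilities is an assumption about how the algorithm of 7.2 uses the distribution, not something stated in the text.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class TinyActorCritic(nn.Module):
    """Hypothetical stand-in that returns (distribution, value)."""

    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(num_inputs, num_outputs), nn.Tanh())
        self.critic = nn.Linear(num_inputs, 1)
        self.log_std = nn.Parameter(torch.zeros(1, num_outputs))

    def forward(self, x):
        mu = self.actor(x)
        std = self.log_std.exp().expand_as(mu)
        return Normal(mu, std), self.critic(x)


model = TinyActorCritic(num_inputs=4, num_outputs=2)
state = torch.randn(8, 4)                     # batch of 8 states, 4 features each

dist, value = model(state)                    # the 2-element tuple
action = dist.sample()                        # stochastic action, shape [8, 2]
log_prob = dist.log_prob(action).sum(dim=-1)  # one log-probability per sample
```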

To address the first requirement, all the models we provide contain 2 separate NNs with identical input: the actor, whose output has as many elements as the environment action, and the critic, which has a single output (the expected VF).

To match the input shape we implemented several NN architectures to accommodate vector inputs, RGB image inputs and tuple inputs.

To output a distribution in the form of a multivariate Gaussian, each model also creates a tensor of parameters with one entry per action dimension. This tensor is fed to the PyTorch Normal distribution constructor so that the distribution can be passed in the return value. In the listings of this section the tensor is log_std, the logarithm of the standard deviation. Since it is a Torch parameter, it can be optimized to balance exploration and exploitation as mentioned in 7.3.

Parameter initialization, done in init_weights, is also the same for every class in Model.py, although it differs slightly between Fully-Connected and Convolutional layers.

In all NNs we use a tanh activation for the last layer of each Actor to constrain the action to the (−1, 1) range, while all other activations are Rectified Linear Units (ReLU).
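The following sketch shows why a learnable spread can be optimized like any other weight. It assumes, as the listings in this section do, that the parameter stores the logarithm of the standard deviation; the toy loss is only there to produce a gradient and is not the PPO objective.

```python
import torch
from torch.distributions import Normal

log_std = torch.nn.Parameter(torch.full((1, 2), -0.5))  # learnable, unconstrained
mu = torch.zeros(1, 2)                                   # stand-in for the actor output

std = log_std.exp().expand_as(mu)    # strictly positive for any real log_std
dist = Normal(mu, std)

action = torch.tensor([[0.3, -0.2]])
loss = -dist.log_prob(action).sum()  # toy objective, only to obtain a gradient
loss.backward()                      # fills log_std.grad, so an optimizer can update it
```

Storing the logarithm keeps the standard deviation positive without any explicit constraint on the parameter.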

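# Imports needed by the listings in this section
import copy
import torch
import torch.nn as nn
from torch.distributions import Normal

def init_weights(m):
    # Fully-connected layers: small Gaussian weights, constant bias of 0.1
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0., std=0.1)
        nn.init.constant_(m.bias, 0.1)
    # Convolutional layers: Xavier normal weights, zero bias
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_normal_(m.weight)
        nn.init.constant_(m.bias, 0.0)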

7.4.0.1 FF FCL Architecture

This is the simplest NN implemented. Please note:

• The sizes of the inputs and outputs are not constrained and are passed to the constructor.

• The Actor and the Critic are two separate NNs with the same input and different output sizes.

• The forward method, implicitly called when calling the model, outputs a normal distribution and a number.

class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_outputs, hidden_size, std=0.0):
        super(ActorCritic, self).__init__()
        self.critic = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)
        )
        self.actor = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_outputs),
            nn.Tanh()
        )
        self.log_std = nn.Parameter(torch.ones(1, num_outputs) * std)
        self.apply(init_weights)

    def forward(self, x):
        value = self.critic(x)
        mu = self.actor(x)
        std = self.log_std.exp().expand_as(mu)
        dist = Normal(mu, std)
        return dist, value

7.4.0.2 CNN Architecture

This architecture, taken from [30], is used to process RGB image observations, as in vision-only driving tasks. While it shares some aspects with the previous NN, we can observe that:

• After 3 convolutional layers, the output of the last convolution is flattened and processed by an FCL into the final output. The flattening is a reshaping of the last convolutional feature map, done in Flatten.

• To flatten the output of the last convolutional layer we must know its size. The size of the output feature of a convolution operation is evaluated according to equation 3.17. The outputSize function evaluates the size of the final output feature after an arbitrary number of convolution operations.

• The forward operation has to permute the input dimensions, since RGB images come with the channel (R, G and B) as the last dimension, while PyTorch expects it before the spatial dimensions.
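As a worked instance of the size calculation, the sketch below chains the three convolutions used in this architecture (kernels 8, 4, 3 and strides 4, 2, 1, no padding). The helper name conv_output_size and the 84×84 input are assumptions made for illustration; the text does not fix the image size.

```python
def conv_output_size(in_size, kernel_size, stride, padding=0):
    """Output length along one spatial axis of a convolution without dilation."""
    return (in_size - kernel_size + 2 * padding) // stride + 1


size = 84  # hypothetical input side in pixels
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)

# 84 -> 20 -> 9 -> 7, so the flattened feature has 7 * 7 * 64 entries
flat_features = size * size * 64
```

With these assumptions the FCL that follows the flattening would need 3136 inputs.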

def outputSize(in_size, kernel_size, stride, padding):
    conv_size = copy.deepcopy(in_size)
    for i in range(len(kernel_size)):
        conv_size[0] = int((conv_size[0] - kernel_size[i] + 2*(padding[i])) / stride[i]) + 1
        conv_size[1] = int((conv_size[1] - kernel_size[i] + 2*(padding[i])) / stride[i]) + 1
    return conv_size

class Flatten(torch.nn.Module):
    def forward(self, x):
        batch_size = x.shape[0]
        return x.view(batch_size, -1)

class ActorCritic_nature_cnn(ActorCritic):
    # CNN from Nature paper.
    def __init__(self, image_shape, num_outputs, std=-.5):
        # super(ActorCritic, self) skips ActorCritic.__init__ and runs nn.Module.__init__
        super(ActorCritic, self).__init__()
        self.input_shape = image_shape
        # Spatial size of the last convolutional feature, needed by the FCL
        fc_size = outputSize(image_shape, [8,4,3], [4,2,1], [0,0,0])
        self.actor = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            Flatten(),
            nn.Linear(fc_size[0] * fc_size[1] * 64, num_outputs),
            nn.Tanh()
        )
        self.critic = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            Flatten(),
            nn.Linear(fc_size[0] * fc_size[1] * 64, 1)
        )
        self.log_std = nn.Parameter(torch.ones(1, num_outputs) * std)
        self.apply(init_weights)

    def forward(self, x):
        # Input shape is [batch, Height, Width, RGB]
        # Torch wants [batch, RGB, Height, Width]. Must Permute
        x = ((x-127)/255).permute(0,3,1,2)
        value = self.critic(x)
        mu = self.actor(x)#.mul_(2)).add_(-1)
        std = self.log_std.exp().expand_as(mu)
        dist = Normal(mu, std)
        return dist, value
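The preprocessing done at the start of the forward method above can be checked in isolation. The batch below is a hypothetical one (2 frames of 84×84 pixels with values in 0..255), chosen only to make the shapes visible.

```python
import torch

# Hypothetical batch of channel-last RGB frames with integer pixel values 0..255
frames = torch.randint(0, 256, (2, 84, 84, 3)).float()

# Shift and rescale the pixels, then move the channel axis next to the batch axis
x = ((frames - 127) / 255).permute(0, 3, 1, 2)  # [batch, H, W, RGB] -> [batch, RGB, H, W]
```

After this step the pixel values lie roughly between −0.5 and 0.5 and the tensor has the layout expected by nn.Conv2d.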

7.4.0.3 Tuple-Input architecture

In this NN the image is processed by a CNN as in the previous one, while the vector is passed to 1 hidden FC layer. Their outputs are concatenated (torch.cat) and processed by 3 FC hidden layers.

The input is assumed to be a tuple of 2 elements, the first one being a 3D tensor and the second one being a 1D tensor. There is no constraint on the size of the image or the vector, but a different tuple structure would require a different NN architecture.
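The fusion step can be sketched on its own with random stand-ins for the two branch outputs. The batch size and num_outputs value are arbitrary choices for illustration; the branch width of num_outputs * 5 follows the listing of this architecture.

```python
import torch

num_outputs = 2
batch = 4
x1 = torch.randn(batch, num_outputs * 5)  # stand-in for the CNN branch output
x2 = torch.randn(batch, num_outputs * 5)  # stand-in for the vector branch output

# Concatenate along the feature axis: each sample keeps both feature sets
fused = torch.cat((x1, x2), dim=1)        # shape [batch, num_outputs * 10]
```

The fused width of num_outputs * 10 is what the first FC layer after the concatenation must accept as input.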

class MultiSensorEarlyFusion(nn.Module):
    def __init__(self, image_shape, sens2_shape, num_outputs, std=-0.5):
        super(MultiSensorEarlyFusion, self).__init__()
        self.input_shape = image_shape
        self.sens2_shape = sens2_shape
        self.num_outputs = num_outputs
        fc_size = outputSize(image_shape, [8, 4, 3], [4, 2, 1], [0, 0, 0])
        self.actor_cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            Flatten(),
            nn.ReLU(),
            nn.Linear(fc_size[0] * fc_size[1] * 64, num_outputs * 5),
        )
        self.actor_fc0 = nn.Linear(sens2_shape, num_outputs * 5)
        self.actor_fc1 = nn.Linear(num_outputs * 10, num_outputs * 20)
        self.actor_fc2 = nn.Linear(num_outputs * 20, num_outputs * 10)
        self.actor_fc3 = nn.Linear(num_outputs * 10, num_outputs * 5)
        self.actor_fc4 = nn.Linear(num_outputs * 5, num_outputs)
        self.critic_cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            Flatten(),
            nn.ReLU(),
            nn.Linear(fc_size[0] * fc_size[1] * 64, num_outputs * 5),
        )
        self.critic_fc0 = nn.Linear(sens2_shape, num_outputs * 5)
        self.critic_fc1 = nn.Linear(num_outputs * 10, num_outputs * 20)
        self.critic_fc2 = nn.Linear(num_outputs * 20, num_outputs * 10)
        self.critic_fc3 = nn.Linear(num_outputs * 10, num_outputs * 5)
        self.critic_fc4 = nn.Linear(num_outputs * 5, 1)
        self.log_std = nn.Parameter(torch.ones(1, num_outputs) * std)
        self.apply(init_weights)

    def forward(self, data):
        x0 = ((data[0]-127)/255.).permute(0, 3, 1, 2)
        x1 = self.actor_cnn(x0)
        x2 = nn.functional.relu(self.actor_fc0(data[1]))
        x = torch.cat((x1, x2), dim=1)
        x = nn.functional.relu(self.actor_fc1(x))
        x = nn.functional.relu(self.actor_fc2(x))
        x = nn.functional.relu(self.actor_fc3(x))
        mu = torch.tanh(self.actor_fc4(x))
        std = self.log_std.exp().expand_as(mu)
        dist = Normal(mu, std)
        y1 = self.critic_cnn(x0)#.view(-1)
        y2 = nn.functional.relu(self.critic_fc0(data[1]))
        y = torch.cat((y1, y2), dim=1)
        y = nn.functional.relu(self.critic_fc1(y))
        y = nn.functional.relu(self.critic_fc2(y))
        y = nn.functional.relu(self.critic_fc3(y))
        value = self.critic_fc4(y)
        return dist, value

Chapter 8 Replicating and Solving Benchmark Environments

Here we present the environments that we provide as an open-source alternative for benchmarking continuous control algorithms, and show how we solved them. This is the first application of PyChrono to DRL.

From the document Leveraging non-smooth multibody dynamics and deep reinforcement learning to infer control policies for autonomous robots and vehicles (pages 104-110)