The standard approach with policy gradients for continuous action spaces is to output a vector of parameters for a probability distribution. To resolve the policy into an action for the agent, you then sample from that distribution.

In policy gradients with discrete action spaces, this is already the case: the softmax layer provides a discrete distribution for you, which you sample from to choose the action.
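As a minimal sketch of that discrete case (the logits here are made-up stand-ins for a network's final-layer outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

logits = np.array([1.2, -0.4, 0.3])            # hypothetical network outputs
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> discrete distribution
action = rng.choice(len(probs), p=probs)       # sample an action; don't just argmax
```

The key point is the final line: the action is drawn from the distribution, not chosen greedily.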

The general rule is that your probability distribution function needs to be differentiable. A common choice is the Normal distribution, with the output vector providing the mean and standard deviation. This adds an extra "interpretation layer" to the agent's model on top of the NN, and that layer needs to be included in the gradient calculation.
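Here is a rough sketch of what that interpretation layer looks like, assuming a 1-D action and made-up values for the network's mean/log-std outputs. The log-probability gradients at the end are the terms a REINFORCE-style update would scale by the return:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical network outputs parameterising the action distribution
mean, log_std = 0.2, -0.5
std = np.exp(log_std)

action = rng.normal(mean, std)  # stochastic sample used as the agent's action

# log-probability of the sampled action under N(mean, std^2)
log_prob = -0.5 * ((action - mean) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi)

# Gradients of log_prob wrt the distribution parameters; these pass
# back through the NN that produced mean and log_std
dlogp_dmean = (action - mean) / std**2
dlogp_dlogstd = ((action - mean) / std) ** 2 - 1.0
```

In practice an autodiff library handles these gradients for you; the point is that the sampling distribution, not just the NN, is part of the differentiated computation.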

Your idea:

> e.g. if I have values of 1.0/0.0 for left/right, then make the hardest left turn possible, but make a much more gradual turn if my values are 0.6/0.4

. . . is almost there. However, you need to interpret the output values *stochastically*, not deterministically, in order to use policy gradients. A deterministic output based on your parameters provides no gradient with respect to improvements to the policy, so the policy cannot be adjusted*. Another way to think of this is that policy gradient methods must have exploration built into the policy function.

It would be quite difficult to turn the left/right outputs you have into a PDF that could be made progressively tighter around the optimal value as the agent homed in on the best actions. I would instead suggest the common mean/standard-deviation split for this, and have the environment cut the action off at the min/max steer if the sampled value ends up as, e.g., 1.7 times hard left.
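That sample-then-clip step might look like this (the mean and standard deviation here are placeholder values standing in for the network's outputs):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical network outputs for the steering action
mean, std = 0.9, 0.5

raw = rng.normal(mean, std)             # sampled action may fall outside [-1, 1],
steer = float(np.clip(raw, -1.0, 1.0))  # so the environment clamps it to the range
```

The clipping lives in the environment (or a wrapper), not in the policy, so the log-probability used in the gradient is still computed from the unclipped sample.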

* Actually this is incorrect, as pointed out in Constantinos' answer. There are deterministic policy gradient solvers, and sometimes they are better. They work by learning off-policy. Your network could simply output a steering direction in the range -1.0 to 1.0, but you would *also* need a behaviour policy that adds some randomness to this output in order to learn.
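A minimal sketch of that split, with a stand-in deterministic policy (the `tanh` of summed observations is purely illustrative, not a real learned network):

```python
import numpy as np

rng = np.random.default_rng(42)

def target_policy(obs):
    # Hypothetical deterministic policy: outputs a steering value in [-1, 1]
    return float(np.tanh(obs.sum()))

def behaviour_policy(obs, noise_std=0.2):
    # Off-policy exploration: deterministic output plus Gaussian noise,
    # clipped back to the valid steering range
    a = target_policy(obs) + rng.normal(0.0, noise_std)
    return float(np.clip(a, -1.0, 1.0))

obs = np.array([0.1, -0.3])
a_det = target_policy(obs)        # what the learned policy would do
a_explore = behaviour_policy(obs) # what the agent actually executes while learning
```

DDPG-style methods follow this pattern: act with the noisy behaviour policy, but update the deterministic target policy from the collected off-policy experience.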

I also *think* you would need to switch from A3C to A2C in order to take advantage of deterministic policy gradient solvers.

Thanks for the answer! Could you perhaps elaborate a bit on using deterministic or stochastic PGs? Would a deterministic one function as I suggested in my question? Are there differences in the use cases for each variant? – Levi Botelho – 2018-01-28T13:02:57.913

Levi, in my answer I give you a link of a blog with a detailed implementation of a Deep Deterministic Policy Gradient algorithm for controlling a car. I think this will help you a lot! – Constantinos – 2018-01-28T18:19:29.853