There has been a great deal of interest in Deep Reinforcement Learning lately, with several examples of its applications appearing online. Most Deep Reinforcement Learning algorithms use convolutional networks, but the prevalent opinion is that only shallow convolutional networks are easy to train for Reinforcement Learning. While this may be true in some cases, there is a way around it: use a pre-trained deep network and freeze most of it. Only the upper part is left to train, while the frozen part works as a fixed feature detector with no trainable parameters.
We conducted an experiment on Torch using a modified Gym-Torch framework. The input state was purely image-based. We modified Gym-Torch to produce higher-resolution images that are better suited to ResNet input. ResNet expects a three-channel RGB image, so we used those three input channels to stack three consecutive grayscale images instead. Rather than the original images, we used the differences between successive images produced by Torch.
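The input preparation described above can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the code used in the experiment. The function name `stack_frame_differences`, the 224x224 frame size, and the choice to take four raw frames in order to obtain three difference images are all assumptions made for the example.

```python
import numpy as np


def stack_frame_differences(frames):
    """Turn 4 consecutive grayscale frames into a 3-channel difference image.

    frames: array-like of shape (4, H, W), oldest frame first (assumed layout).
    Returns an array of shape (3, H, W) where channel i is frames[i+1] - frames[i].
    """
    frames = np.asarray(frames, dtype=np.float32)
    if frames.ndim != 3 or frames.shape[0] != 4:
        raise ValueError("expected 4 grayscale frames of shape (4, H, W)")
    # Differences between successive frames fill the three "RGB" channels.
    return frames[1:] - frames[:-1]


# Hypothetical usage: random data stands in for simulator frames.
rng = np.random.default_rng(0)
raw_frames = rng.random((4, 224, 224), dtype=np.float32)
state = stack_frame_differences(raw_frames)
print(state.shape)  # (3, 224, 224)
```

The resulting array has the same channel-first layout as an RGB image, so it can be fed to a network whose first convolution expects three input channels.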
Most of ResNet-18 was frozen; only the last block and the fully-connected layers on top were trained. We trained ResNet-18 to drive the car with a discrete-action Q-learning algorithm. Following standard practice, we used a replay buffer along with two networks: a current network for driving the car and a target network that changes slowly over time. The parameters of the target network were updated as a running average of the current network's parameters.
When the discount factor gamma is zero, the total reward is merely the reward for the current action. Discrete-action reinforcement learning then becomes regression, and simple regression is much easier than Reinforcement Learning. From this we concluded that the choice of gamma is critical to training. A gamma close to one can produce erratic behavior in the middle of training, and such an erratic trajectory can persist for a long time if it is stable. This kind of behavior encourages choosing a "minimal gamma": one that still yields a stable, predictive policy.
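To make the role of gamma concrete, here is a small sketch of the standard one-step Q-learning target, r + gamma * max Q_target(s', a'). The article does not spell out its exact target, so this form, the function name `q_targets`, and the sample numbers are assumptions. With gamma set to zero the target reduces to the immediate reward, which is the regression case described above.

```python
import numpy as np


def q_targets(rewards, next_q_values, dones, gamma):
    """One-step Q-learning targets for a batch of transitions.

    rewards:       shape (B,)   immediate rewards
    next_q_values: shape (B, A) target-network Q-values for the next states
    dones:         shape (B,)   True where the episode ended at this step
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_values = np.asarray(next_q_values, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    # No bootstrapping from the next state once the episode has terminated.
    return rewards + gamma * not_done * next_q_values.max(axis=1)


# Hypothetical batch of two transitions with two actions each.
rewards = [1.0, 0.5]
next_q = [[2.0, 3.0], [1.0, 0.0]]
dones = [False, True]

# gamma = 0: the targets are just the immediate rewards (pure regression).
print(q_targets(rewards, next_q, dones, gamma=0.0))
# gamma = 0.9: the first target also includes the discounted next-state value.
print(q_targets(rewards, next_q, dones, gamma=0.9))
```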
When training in an environment with a termination condition, it is important for the reward function to be nonnegative on the whole reachable state space. If the reward is negative on an undesirable part of the state space, the network can choose to terminate the episode rather than work its way out of that region while accumulating more negative rewards. Termination is equivalent to zero reward for an infinite number of steps, which the network prefers over a multi-step negative reward. The network would train to stop rather than continue to suffer.
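The arithmetic behind this can be checked with a toy comparison. The reward values, the trajectory (three steps in an undesirable region followed by five good steps), and the discount factor below are all made up for illustration; only the comparison against a return of zero for termination comes from the argument above.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


GAMMA = 0.9            # hypothetical discount factor
TERMINATE_RETURN = 0.0  # ending the episode: no further rewards

# Hypothetical trajectory: 3 steps in an undesirable region, then 5 good steps.
# Design A penalizes the undesirable region with a negative reward.
continue_negative = discounted_return([-1.0] * 3 + [0.1] * 5, GAMMA)
# Design B keeps the reward nonnegative everywhere.
continue_nonnegative = discounted_return([0.0] * 3 + [0.1] * 5, GAMMA)

# With design A, recovering is worth less than terminating immediately.
print(continue_negative < TERMINATE_RETURN)
# With design B, recovering is worth more than terminating immediately.
print(continue_nonnegative > TERMINATE_RETURN)
```

Under the negative-reward design the agent is better off ending the episode at once, while under the nonnegative design continuing is never worse than stopping.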