Unity Machine Learning: Teaching ML Agents to Jump Over Walls

There have been major breakthroughs in reinforcement learning (RL) over the past few years: from the first agents successfully trained from raw pixels to OpenAI's work on robotics. Further progress requires increasingly sophisticated training environments, and this is where Unity comes in.

The Unity ML-Agents Toolkit is a plugin for the Unity game engine that lets you use Unity as a constructor of environments for training ML agents.

From playing football to walking, jumping over walls, and training AI dogs to fetch sticks, the Unity ML-Agents Toolkit provides a wide range of environments for training agents.

In this article, we will look at how Unity ML-Agents works, and then we will teach one of these agents to jump over walls.

image


What is Unity ML-Agents?


Unity ML-Agents is a new plugin for the Unity game engine, which allows you to create or use ready-made environments for training our agents.

The plugin consists of three components:



The first is the Learning Environment, which contains the Unity scenes and the elements of the environment.

The second is the Python API, which contains the RL algorithms (such as PPO, Proximal Policy Optimization, and SAC, Soft Actor-Critic). We use this API to launch training, run tests, and so on. It is connected to the learning environment through the third component, the External Communicator.


What the learning environment consists of


The Learning Environment consists of several elements:



The first is the Agent, the actor of the scene. It is the agent we will train.

The second is the Brain, the component we optimize during training: it records which action should be performed in each of the possible states.

The third element, the Academy, manages the agents and their decision making and handles requests from the Python API. To better understand its role, let's recall the RL process. It can be represented as a cycle that works as follows:



Suppose an agent needs to learn how to play a platformer. The RL process in this case will look like this:

  • The agent receives state S0 from the environment: the first frame of our game.
  • Based on state S0, the agent performs action A0 and moves to the right.
  • The environment transitions to a new state S1.
  • The agent receives reward R1 for not being dead (a positive reward of +1).

This RL cycle forms a sequence of state, action and reward. The agent's goal is to maximize the expected total reward.
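To make this cycle concrete, below is a minimal sketch of such a loop in Python. It assumes a generic, gym-style environment with reset() and step() methods; this is only an illustration of the cycle above, not the ML-Agents Python API.

def run_episode(env, policy, max_steps=1000):
    """One episode of the state -> action -> reward cycle described above."""
    state = env.reset()                    # the agent receives S0 (e.g. the first frame)
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)             # choose the action based on the current state
        state, reward, done, info = env.step(action)  # environment returns the next state and reward
        total_reward += reward             # the quantity the agent tries to maximize
        if done:                           # episode over (agent died or reached the goal)
            break
    return total_reward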



Thus, the Academy sends instructions to the agents and keeps their execution synchronized, namely:

  • Collecting observations;
  • Choosing an action according to the current policy;
  • Executing the action;
  • Resetting if the maximum number of steps or the goal has been reached.


We teach the agent to jump over walls


Now that we know how Unity agents work, we will train one to jump over walls.

Pre-trained models can also be downloaded from GitHub.

Wall Jumping Learning Environment


The goal of this environment is to teach the agent to get to the green tile.

Consider three cases:

1. There are no walls, and our agent just needs to get to the tile.

image

2. The agent needs to learn how to jump in order to reach the green tile.

image

3. The most difficult case: the wall is too high for the agent to jump over, so it needs to jump onto the white block first.

image

We will teach the agent two behaviors, depending on the height of the wall:

  • SmallWallJump for the no-wall and low-wall cases;
  • BigWallJump for the high-wall case.

This is what the reward system will look like:



For our observations we do not use a regular frame, but 14 raycasts, each of which can detect 4 possible objects. Raycasts can be thought of as laser beams that determine which object, if any, they pass through.

We will also use the agent's global position in our observations.
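To make the observation layout concrete, here is a small sketch of one way such an observation could be flattened into a single vector: 14 raycasts, each encoding one of 4 detectable object types, plus the agent's global position. The tag names and the per-ray encoding are illustrative assumptions, not the toolkit's exact internal format.

import numpy as np

NUM_RAYS = 14
DETECTABLE_TAGS = ["wall", "block", "goal", "agent"]  # hypothetical tag names

def encode_ray(hit_tag, hit_fraction):
    """One plausible per-ray encoding: a one-hot over detectable tags,
    a 'nothing hit' flag, and the normalized hit distance."""
    one_hot = [0.0] * len(DETECTABLE_TAGS)
    if hit_tag is None:
        return one_hot + [1.0, 1.0]        # nothing hit, maximum distance
    one_hot[DETECTABLE_TAGS.index(hit_tag)] = 1.0
    return one_hot + [0.0, hit_fraction]

def build_observation(ray_hits, agent_position):
    """Concatenate all ray encodings with the agent's global (x, y, z) position."""
    obs = []
    for hit_tag, hit_fraction in ray_hits:
        obs.extend(encode_ray(hit_tag, hit_fraction))
    obs.extend(agent_position)
    return np.asarray(obs, dtype=np.float32)

# Example: every ray misses and the agent stands at the origin.
empty_hits = [(None, 1.0)] * NUM_RAYS
print(build_observation(empty_hits, (0.0, 0.0, 0.0)).shape)  # (14 * 6 + 3,) = (87,)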

image

Four kinds of actions are possible in our action space:



The goal is to reach the green tile with an average reward of 0.8.

So let's get started!


First of all, open the UnitySDK project.

Among the examples, find and open the WallJump scene.

As you can see, there are many agents in the scene, each instantiated from the same prefab, and they all share the same Brain.

image

Just as in classic deep reinforcement learning we would launch several instances of the game (for example, 128 parallel environments), here we simply copy and paste the agents to collect more varied states. Since we want to train our agent from scratch, we first need to remove the Brain from the agent. To do this, go to the prefabs folder and open the prefab.

Next, in the prefab hierarchy, select the agent and open its settings.

In Behavior Parameters, delete the Model. If you have GPUs at your disposal, you can change the Inference Device from CPU to GPU.

image

In the Wall Jump Agent component, remove the Brains for the no-wall case, as well as for the low and high walls.

image

After that, you can start training your agent from scratch.

For our first training run, we simply change the total number of training steps for the two behaviors, SmallWallJump and BigWallJump, so that the goal can be reached in just 300 thousand steps. To do this, change max_steps to 3e5 for the SmallWallJump and BigWallJump cases in config/trainer_config.yaml.
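The change can also be scripted. The sketch below assumes it is run from the ml-agents-master root and that SmallWallJump and BigWallJump appear as top-level entries in config/trainer_config.yaml (the exact layout may differ between ML-Agents versions); editing the file by hand works just as well.

import yaml  # PyYAML

CONFIG_PATH = "config/trainer_config.yaml"

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)

# 3e5 = 300 thousand training steps for both behaviors.
for behavior in ("SmallWallJump", "BigWallJump"):
    config.setdefault(behavior, {})["max_steps"] = 3.0e5

with open(CONFIG_PATH, "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False)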

image

To train our agent, we will use PPO (Proximal Policy Optimization). This algorithm accumulates experience by interacting with the environment and uses it to update the decision-making policy. After the update, previous experience is discarded, and subsequent data collection is carried out under the updated policy.
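At the heart of PPO is a clipped surrogate objective that keeps the updated policy close to the policy that collected the data. Below is a minimal numpy sketch of that objective; the actual trainer implements this (plus value and entropy terms) internally, so this is only an illustration of the idea.

import numpy as np

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, averaged over collected (state, action) samples."""
    ratio = np.exp(new_log_probs - old_log_probs)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum discourages moving the policy too far from the old one.
    return np.mean(np.minimum(unclipped, clipped))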

So, first, using the Python API, we need to call the External Communicator so that it instructs the Academy to launch the agents. To do this, open a terminal in the ml-agents-master folder and type:

mlagents-learn config/trainer_config.yaml --run-id="WallJump_FirstTrain" --train

This command will ask you to start the Unity scene. To do this, press ► at the top of the editor.

image

You can monitor your agents' training in TensorBoard with the following command:

tensorboard --logdir=summaries

When training is over, move the saved model files from ml-agents-master/models to UnitySDK/Assets/ML-Agents/Examples/WallJump/TFModels. Then open the Unity editor again, select the WallJump scene and open the completed WallJumpArea object.

After that, select the agent and, in its Behavior Parameters, drag the SmallWallJump.nn file into the Model placeholder.

image

Also drag:

  1. SmallWallJump.nn into the No Wall Brain Placeholder.
  2. SmallWallJump.nn into the Small Wall Brain Placeholder.
  3. BigWallJump.nn into the Big Wall Brain Placeholder.

image

After that, press the ► button at the top of the editor and you're done! Configuring the trained agent is now complete.

image

Experiment time


The best way to learn is to constantly try something new. Now that we have achieved good results, we will put forward some hypotheses and test them.


Reducing the discount factor to 0.95


So we know that:

  • The larger the gamma, the less future rewards are discounted. That is, the agent cares more about long-term rewards.
  • Conversely, the smaller the gamma, the more future rewards are discounted. In this case, the agent prioritizes short-term rewards.

The idea of this experiment is that if we increase the discounting by decreasing gamma from 0.99 to 0.95, short-term rewards become a priority for the agent, which may help it approach the optimal policy more quickly.
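A quick calculation shows how much this change matters: the weight of a reward received t steps in the future is gamma ** t, so lowering gamma from 0.99 to 0.95 makes distant rewards almost invisible to the agent.

for gamma in (0.99, 0.95):
    # Weight of a reward received 50 and 100 steps in the future.
    print(gamma, round(gamma ** 50, 3), round(gamma ** 100, 3))
# 0.99 -> 0.605 and 0.366
# 0.95 -> 0.077 and 0.006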



Interestingly, in the case of the jump over a low wall, the agent achieves roughly the same result. This can be explained by the fact that this case is quite simple: the agent only needs to move toward the green tile and jump if there is a wall in front of it.



On the other hand, in the Big Wall Jump case this works worse, because the agent cares more about short-term rewards and therefore does not learn that it needs to climb onto the white block in order to jump over the wall.

Increased neural network complexity


Finally, we test the hypothesis that our agent will become smarter if we increase the complexity of the neural network. To do this, we increase the size of the hidden layers from 256 to 512 units.

We find that the new agent performs worse than our first agent. This means there is no point in increasing the complexity of our network for this task: performance does not improve, while training time does increase.
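A rough parameter count illustrates why the wider network trains more slowly. The sketch below assumes a plain fully connected network with two hidden layers and placeholder input and output sizes; it is not the exact architecture used by the trainer, only an illustration of how the cost grows with layer width.

def mlp_param_count(input_size, hidden_size, output_size, num_hidden=2):
    """Weights and biases of a plain fully connected network."""
    params = input_size * hidden_size + hidden_size                         # input -> hidden
    params += (num_hidden - 1) * (hidden_size * hidden_size + hidden_size)  # hidden -> hidden
    params += hidden_size * output_size + output_size                       # hidden -> output
    return params

# Placeholder sizes: 87 inputs (from the earlier observation sketch) and 4 outputs.
for hidden in (256, 512):
    print(hidden, mlp_param_count(87, hidden, 4))
# The hidden-to-hidden block grows quadratically with width, so doubling the
# width roughly quadruples that part of the network.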



So, we have trained the agent to jump over walls, and that's all for today. Recall that the trained models can be downloaded here to compare results.

image
