Motion Matching is a simple yet powerful way of animating characters in games. Compared to other methods, it doesn’t require much manual work once you have a basic setup: there is no need to structure clips in graphs, to carefully cut or synchronize them, or to explicitly create new transitions between states. However, Motion Matching works best when combined with lots of motion capture data, and all of that data comes at a cost: large memory usage, which only gets worse as the system grows and is applied to more situations. Here we present a solution to this problem called Learned Motion Matching, which leverages Machine Learning to drastically reduce the memory usage of Motion Matching based animation systems.
Motion Matching
First, let’s try to understand how Motion Matching itself works. A good place to start is by examining the data we’re going to be working with – long, unstructured animations – usually from a motion capture dataset. Here’s a typical animation from such a dataset:
The key decision is when to start playing a new animation rather than continuing the one already playing. Picking another clip, which is to say, jumping to a different frame of animation in our dataset, requires us to be able to measure how much better or worse this switch would be for our task. Deciding exactly how to do this is difficult, because a single frame of animation contains a lot of information, and not all of that information is useful for the decision.
A good solution is to hand-pick some of this information, which we will call features, and use these to decide how well a particular frame of animation matches the task. In this case we need features which express two things: the path the character is going to follow, so we can see how similar it is to our desired path, and the current pose of the character, so that if we start playing back the new clip there is not a large change in the character’s posture. With some experimentation, we find that the feet positions, their velocities, the character’s hip velocity, and a couple of snapshots of the future trajectory position and direction are the only features we need for this path following task. This is what they look like:
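A feature vector like this is just a flat array of numbers. Here is a minimal sketch in Python of how one might be assembled for a single frame (the function name, argument layout, and coordinate conventions are assumptions for illustration, not taken from the paper):

```python
import numpy as np

def build_feature_vector(left_foot_pos, right_foot_pos,
                         left_foot_vel, right_foot_vel, hip_vel,
                         future_positions, future_directions):
    """Concatenate the hand-picked matching features for one frame.

    left_foot_pos, right_foot_pos, left_foot_vel, right_foot_vel, hip_vel:
        3D vectors expressed in the character's local space.
    future_positions, future_directions:
        2D trajectory samples taken at a few future times.
    """
    return np.concatenate([
        left_foot_pos, right_foot_pos,   # where the feet are
        left_foot_vel, right_foot_vel,   # how the feet are moving
        hip_vel,                         # how the hips are moving
        np.ravel(future_positions),      # future trajectory positions
        np.ravel(future_directions),     # future trajectory facing directions
    ]).astype(np.float32)
```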
This information is much more compact and manageable, not just for our eyes, but also for the algorithm that must search for a good match at runtime. If you gather all the feature values from each frame of animation into an array you obtain a vector we call the feature vector (the colored array in the video above). This is a compact numerical representation of the ability of this frame to perform the task. We extract these features for all frames in the dataset and stack the resulting vectors into a big matrix. This is what we call the matching features database. When it’s time to decide if we should switch to a new clip, we form a query feature vector from the current pose and the coming portion of the desired path, and we search the matching features database for the entry that matches the query best. Once found, we look up the corresponding full pose in the animation dataset and start playback from there. Here is a visualisation of the search and the best match it finds:
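At its simplest, that search is a brute-force nearest-neighbour query over the matrix. A minimal sketch, assuming the database is a NumPy array with one already normalized and weighted row per frame:

```python
import numpy as np

def motion_matching_search(query, matching_db):
    """Return the frame index whose features best match the query.

    query       : (num_features,) query feature vector
    matching_db : (num_frames, num_features) matching features database
    """
    # Squared Euclidean distance between the query and every frame's features.
    distances = np.sum((matching_db - query) ** 2, axis=1)
    return int(np.argmin(distances))
```

Real implementations typically weight each feature and accelerate the search with spatial data structures, but the principle is the same.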
Finally, we can apply some foot locking and inverse kinematics to produce a nice, polished result:
And that’s it – by choosing different kinds of features to match, we can build animation systems that achieve different kinds of tasks, such as interaction with props, navigation over rough terrain, or even reactions to other characters. This is the secret sauce behind many of the best animation systems in games.
The Scalability Problem
Now that you know how Motion Matching works, have a think about how much data may be required for a AAA production. Well, it depends on the design. There are many factors to consider, such as what types of locomotion you want to support (idle, walk, jog, run, sprint, slow strafe, fast strafe, jump, hop, crouch, …), the available actions and their parameters (opening a door, going to sit on a chair, picking up an object, mounting a horse, …), combinations (like walking, walking while holding a light weapon, or walking while holding a heavy two-handed weapon), and all of that possibly compounded by archetype styles (like flavors of civilians and soldiers) and gameplay states (injured, drunk, etc.). Before you know it, you will have many hundreds of megabytes, perhaps gigabytes, of data going into the system. And let us emphasize one particularly important point about Motion Matching: it does not combine data for you. If you have an animation of walking and another of drinking, Motion Matching will not be able to produce a scene where the character is walking while drinking. Motion Matching will only play back the data you provide it.
Not only do we have to keep all this animation data in memory, but there is also the matching features database, which grows in proportion to both the number of features and the amount of animation data used. Naturally, searching a larger database means poorer runtime performance too. This is why we say Motion Matching scales poorly in terms of data.
Our goal in this research is simple: to replace the Motion Matching machinery with something that produces exactly the same result, but which does not require keeping as much data in memory. The workflow for animators should remain exactly the same: let them craft whatever system they like using Motion Matching and as much data as they want. Then, once they are finished, plug in an alternative system which produces the same result but which has both lower memory usage and constant CPU cost. This is where Learned Motion Matching comes in.
Learned Motion Matching
To start, we need to think a bit more abstractly about our animation system, and consider it as a logical system that takes as input some control query (be it a desired path to follow, controller stick positions and button presses, etc.) and produces as output a continuous, fluid animation. An animation system based on Motion Matching, as we saw, does this using the animation data that is supplied to it plus the features database which it periodically searches.
To be more specific, we can say an animation is described by a sequence of full poses, each referenced by a frame index, which points to where the pose is stored in the animation dataset. To play back a clip means to increment the current index on every frame and look up the corresponding full pose from the animation dataset. The Motion Matching search is done every few frames, and works by comparing the query features to every entry in the matching features database, returning the frame index of the best match. This best frame index then replaces the current index, and playback resumes from there. The following diagram summarizes this logic:
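A hedged sketch of that per-frame logic in Python (the search interval and function names are placeholders, not values from the paper, and `motion_matching_search` is the brute-force search sketched earlier):

```python
def motion_matching_update(frame_index, frames_since_search, query_features,
                           matching_db, animation_db, search_interval=10):
    """One tick of basic Motion Matching playback (illustrative)."""
    # Continue playback by advancing to the next stored frame.
    frame_index += 1
    frames_since_search += 1

    # Every few frames, search the database and jump to the best match.
    if frames_since_search >= search_interval:
        frame_index = motion_matching_search(query_features, matching_db)
        frames_since_search = 0

    # Look up the full pose to display from the animation dataset.
    pose = animation_db[frame_index]
    return frame_index, frames_since_search, pose
```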
Let’s first think about how we might be able to remove the animation dataset in this diagram. One idea is to try and re-use the matching features database computed for use in the search. After all, the features stored in this database capture many important aspects of the animation. Let’s train a neural network – called the Decompressor – to take as input a feature vector from the matching features database and produce as output the corresponding full pose. With this network, the per-frame logic now looks like this:
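The Decompressor can be thought of as a small regression network. A hedged PyTorch sketch, with layer sizes that are placeholders rather than the ones from the paper:

```python
import torch
import torch.nn as nn

class Decompressor(nn.Module):
    """Maps a matching feature vector to a full pose vector (illustrative sizes)."""

    def __init__(self, num_features, pose_size, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_size),
        )

    def forward(self, features):
        return self.net(features)

# Training sketch: regress each frame's pose from its matching features,
# e.g. loss = mse(decompressor(matching_db[i]), animation_db[i]).
```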
Let’s see how well this performs compared to the standard pose lookup. Below you can see a comparison between using the Pose Lookup to generate the pose in grey, and using the Decompressor to generate the pose in red.
We can see that although the animations are similar, there are occasionally some visible errors (look at the hand positions on the wire skeleton to the left). This is because the matching features do not quite carry enough information to allow us to reconstruct the pose in all cases. Nonetheless, the quality of the reconstructed animation is surprisingly good. Instead of relying on the matching features database alone, what if we give the Decompressor a little extra information, in the form of a few additional features?
In fact, rather than picking these extra features by hand, we can use an auto-encoder network to extract them automatically (see the paper for more information). Training this auto-encoder gives us a vector of extra features per frame, automatically selected to improve the Decompressor’s accuracy. We store these extra features for all frames in their own extra features database alongside the matching features database. Let’s see what that means for our logic:
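One way to picture this is as an encoder that squeezes the full pose down to a handful of extra values, trained jointly with the Decompressor so that the pose can be reconstructed from matching plus extra features. A hedged PyTorch sketch, not the exact architecture from the paper:

```python
import torch
import torch.nn as nn

class ExtraFeatureEncoder(nn.Module):
    """Compresses a full pose into a small vector of extra features (illustrative)."""

    def __init__(self, pose_size, num_extra=32, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_size, hidden), nn.ReLU(),
            nn.Linear(hidden, num_extra),
        )

    def forward(self, pose):
        return self.net(pose)

# Joint training sketch: the Decompressor now also receives the extra features.
#   extra         = encoder(pose)
#   reconstructed = decompressor(torch.cat([matching_features, extra], dim=-1))
#   loss          = mse(reconstructed, pose)
# After training, the extra features are precomputed for every frame and stored
# in the extra features database; the encoder itself is not needed at runtime.
```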
And here is the difference in quality with these additional features, where Pose Lookup is shown in grey, and the result using the Decompressor & Extra Features is shown in green:
Nice! The output is now almost completely identical to the original animations. How much memory have we saved in the process?
Now that we don’t have to store the animation dataset in memory, we have significantly decreased the memory usage, with practically imperceptible changes to the output. We have also kept the original Motion Matching behaviour exactly the same.
Unfortunately, the scalability problem is not entirely gone: the remaining matching and extra features databases still grow in proportion to the size of the original animation dataset. First, let’s simplify our diagram by joining the matching features database and the extra features database into a combined features database. A combined feature vector for a given frame is therefore just the concatenation of that frame’s matching and extra feature vectors.
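In code this merge is just a row-wise concatenation, assuming both databases are stored as arrays with one row per frame:

```python
import numpy as np

def build_combined_db(matching_db, extra_db):
    """One row per frame: [matching features | extra features]."""
    return np.concatenate([matching_db, extra_db], axis=1)
```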
The next step is to tackle the reliance on the combined features database at every frame. Here, instead of incrementing the current frame and then doing the feature lookup, we will train another neural network – called the Stepper – and use it to predict the next frame’s combined feature vector from the current frame’s combined feature vector. Because transitions happen frequently (say, 5 times per second), the Stepper only has to learn to predict over a short period of time, which keeps it small in size while still fairly accurate. Using the Stepper, the new logic looks like this:
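A hedged sketch of the Stepper in PyTorch, with placeholder sizes; here it predicts the change in features and adds it back, one possible formulation rather than necessarily the exact one from the paper:

```python
import torch
import torch.nn as nn

class Stepper(nn.Module):
    """Predicts the next frame's combined features from the current ones (illustrative)."""

    def __init__(self, combined_size, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(combined_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, combined_size),
        )

    def forward(self, combined_features):
        # Predict a per-frame delta so that small changes are easy to learn.
        return combined_features + self.net(combined_features)
```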
Great! Now we don’t need to use the combined features database every frame. But during the transition stage there are still two operations that rely on having the combined features database in memory: the actual Motion Matching search and the subsequent feature lookup. Here, instead of searching for the best matching features, looking up the corresponding extra features, and returning the combined vector, we train a third and final neural network – called the Projector – to predict the combined features directly from the query vector, essentially emulating the search and lookup. Now we are able to do away with all the databases completely:
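And a hedged sketch of the Projector, again with placeholder sizes; the key point is that its output lives in the same combined feature space the Stepper and Decompressor consume:

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps a query feature vector to the combined features of the best match,
    emulating the search plus lookup (illustrative sizes)."""

    def __init__(self, query_size, combined_size, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(query_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, combined_size),
        )

    def forward(self, query):
        return self.net(query)

# Training sketch: sample plausible queries, run the real Motion Matching search
# and lookup against the combined features database, and regress the Projector
# onto the result.
```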
With all those networks trained, the full Learned Motion Matching logic looks like this:
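Putting the three networks together, a hedged sketch of the runtime update (hypothetical names and a placeholder search interval) might be:

```python
def learned_motion_matching_update(combined, frames_since_search, query,
                                   stepper, projector, decompressor,
                                   search_interval=12):
    """One tick of Learned Motion Matching (illustrative).

    The runtime state is just the current combined feature vector; no animation
    or feature databases are touched.
    """
    frames_since_search += 1

    if frames_since_search >= search_interval:
        # Emulate the search + lookup: jump straight to the matched features.
        combined = projector(query)
        frames_since_search = 0
    else:
        # Emulate playback: advance the features by one frame.
        combined = stepper(combined)

    # Reconstruct the full pose to display.
    pose = decompressor(combined)
    return combined, frames_since_search, pose
```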
Not only is this logic quite simple, but just as we found with the Decompressor, if each of these individual networks is trained accurately enough, the difference between Learned Motion Matching and Basic Motion Matching is almost imperceptible. Here we show Learned Motion Matching applied to the same path following example we’ve been using:
And here are the memory costs involved:
So for this example Learned Motion Matching provides a ten times improvement in memory usage – not bad!
However, if we really want to see massive gains, we need to substantially increase the content of our animation dataset. In this experiment we add 47 new joints to our character to include hands and fingers, and add around 30 new styles of locomotion to our dataset, all of them controllable via a switch. This is how the interactive Learned Motion Matching locomotion controller looks once we’ve added all this data and trained all the networks involved:
And here are the memory costs involved:
Even though we’ve added a huge amount of data, including many more joints and features, growing the memory usage of Motion Matching to over half a gigabyte, the memory usage of Learned Motion Matching remains relatively small. In the end, all we need to store is around 17 MB of neural network weights. We even found that the network weights could be compressed further by a factor of two, without any visual impact on the result, by quantizing them to 16-bit integers. If we include that, we can achieve a 70 times improvement in memory usage – compressing the whole 590 MB controller into 8.5 MB of memory!
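As an illustration of that last step, one simple way to halve weight storage is to keep each weight matrix as 16-bit integers with a per-matrix scale and dequantize on load; a minimal sketch of this kind of quantization, not necessarily the exact scheme used:

```python
import numpy as np

def quantize_weights(weights):
    """Quantize a float32 weight matrix to int16 plus a per-matrix scale."""
    scale = max(float(np.abs(weights).max()), 1e-8) / 32767.0
    quantized = np.round(weights / scale).astype(np.int16)
    return quantized, scale

def dequantize_weights(quantized, scale):
    """Recover approximate float32 weights for inference."""
    return quantized.astype(np.float32) * scale
```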
So Learned Motion Matching is a powerful, generic, systematic way of compressing Motion Matching based animation systems that scales up to very large datasets. Using it, complex, data-hungry animation controllers can be achieved within production budgets, as in the following video, where characters walk on uneven terrain and seamlessly interact with other characters and props. Every character in this scene is animated with Learned Motion Matching:
We hope that Learned Motion Matching can fundamentally change what is possible with Motion Matching based animation systems, and allow artists, designers, and programmers to completely unleash their creativity, building characters which can react realistically and uniquely to the thousands of different situations that are presented to them in game, without ever having to worry about the impact it might have on memory or performance.
For more details and results, check out the supplementary video, and read the full paper here.
Other Animation Research
Some of the other animation research performed by Ubisoft La Forge:
Automatic In-Betweening for Faster Animation Authoring
Making Machine Learning Work: From Ideas to Production Tools
Ubisoft La Forge Animation Dataset