Link to the paper: https://aclanthology.org/2023.nlp4convai-1.11.pdf
Intro
Improvements to Large Language Models (LLMs) have led to a tremendous number of applications in the video game industry, ranging from offline features like assisting writers in their writing process to online ones like lifelike non-playable characters (NPCs). While modern LLMs can generate fluent dialog lines, those lines often sound monotone and generic. In actual video game scripts, each NPC has its own personality, style, and speech patterns that bring it to life, and incorporating those nuances into LLM output is not trivial. At Ubisoft La Forge, in collaboration with in-house writers and narrative directors, we conducted a research project aimed at solving this challenge: we know we can generate text, but how can we imbue it with the style of a video game NPC to create truly believable characters?
Leveraging one of our main NLP competitive advantages
As a AAA studio with a long track record of narrative-heavy games, Ubisoft has several unique competitive advantages when it comes to creating stylistic dialog. These include our staff of full-time professional scriptwriters, narrative designers, and voice designers, as well as our complete catalog of Ubisoft AAA video games, with text data ranging from NPC barks to cinematic scenes, along with lore and UI text such as quest descriptions, to guide the training of our models.
For this project, we collected the scripts of 23 previously published Ubisoft video games, including well-known titles from the Assassin's Creed, Watch Dogs, Far Cry, and Tom Clancy universes, and composed a first-of-its-kind video game cutscene dataset, which we call the UbiScene dataset. We performed extensive cleanup to polish away the idiosyncrasies of each production, such as the duplication of lines for female and male versions of a character that require different voice-overs. In the end, we arrived at a corpus whose overall statistics, along with a sample scene, can be seen below.
Existing work on stylistic dialog generation
Recent academic work on stylistic dialog generation represents a character's style with attributes such as an explicit outline of the intended style, character descriptions, a small number of previous character utterances, or the local conversation history. While these methods produce interesting results, we argue that the style of a video game NPC is too complex to be summarized in a few sentences, and that these approaches do not synergize with the writing process, in which a character develops gradually and incrementally.
Preferential word choice as a proxy for character style
It was a sunny summer day at the Montreal office of Ubisoft, and our team was headed to the rooftop balcony to eat lunch. One of us, a bilingual Quebecer, mentioned that they hoped we could get a table under a parasol. Another team member, this time an Anglophone, remarked that the choice of the word "parasol", while totally correct, was an odd one in their native USA, where the word "umbrella" would have been the natural choice.
It was this conversation on the stairs, and the much more excited and linguistically nerdy one that immediately followed over lunch on the rooftop, that led us to our guiding hypothesis for this project: that a strong component of a person's speaking style is simply their preferential choice of words that are either optional or semantically exchangeable with other words in context.
To formalize this intuition into a computational model, we set out to build on top of the k-nearest-neighbor language model (kNN-LM) to add character style to LLM generations, as it is designed to mine a reference corpus of text for examples of what should be generated next. While the reference corpus can take a variety of forms, we realized that if it consisted of the full list of every line previously authored for a character, all the preferential word choices that the scriptwriters had bestowed on that character would be readily available to be reused.
After a few months of data cleaning, model building, and experimentation, our final method employed two components that work together to make up the dialog generation process:
- The dialog generation module, for which we used GPT-J, a 6-billion-parameter open-source LLM. This component is responsible for encoding the dialog context and keeping the flow of the dialog consistent with the scene at hand.
- A non-parametric token retrieval module with access to a collection of all the previous contextualized lines from the character. This component is responsible for incorporating the style of the character (in terms of word choice) in the generated lines.
Both components generate text by iteratively selecting the word that comes next, and our primary academic contribution is the design of a learned switching algorithm that chooses which component should supply the continuation at each iteration. This is illustrated below, where the colored text in the generated dialog is drawn from the "Past Utterances" while the black text is generated by the dialog generation module.
How to build a character style index?
These "Past Utterances" are represented in our model by a character-specific style index, used as the datastore in the kNN-LM model. Given a character's list of past utterances U, we can define its style index S as the following set of key-value pairs:

S = { (f(w<i), wi) : wi ∈ s, s ∈ U }

where f(w<i) is a vector encoding of the prefix of s at index i, computed before the decision to produce wi has been made. Put in plain English, we create a lookup entry in the index for every word in every past utterance, where the key for the entry is an encoded representation of the dialog leading up to that word.
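The index construction can be sketched as follows. This is a minimal illustration, not the production implementation: `encode_prefix` here is a hash-based stand-in for the LM encoder f, chosen only so the sketch runs without a real LLM.

```python
import numpy as np

def encode_prefix(tokens):
    # Stand-in encoder: a seeded pseudo-embedding so the sketch runs
    # without a real LLM. A real implementation would return the LM's
    # hidden state for the prefix (the f(w<i) of the paper).
    rng = np.random.default_rng(abs(hash(tuple(tokens))) % (2**32))
    return rng.standard_normal(8)

def build_style_index(past_utterances):
    """One (key, value) entry per token: the key encodes the prefix
    leading up to the token, the value is the token itself."""
    keys, values = [], []
    for utterance in past_utterances:
        tokens = utterance.split()
        for i, token in enumerate(tokens):
            keys.append(encode_prefix(tokens[:i]))
            values.append(token)
    return np.stack(keys), values

keys, values = build_style_index(["bring me a parasol", "what a sunny day"])
print(keys.shape)  # (8, 8): one 8-dim key per token across both lines
print(values[3])   # "parasol"
```

Note that the same word can appear many times in the index with different keys, once per context it occurred in, which is what lets retrieval be context-sensitive.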
At inference time we are given the dialog context c and the style index of the currently speaking character, and we retrieve the k = 10 nearest neighbors of f(c) among the keys of S using L2 distance. Once we have retrieved the tokens, we create a probability distribution over the LLM vocabulary (we denote this pkNN) and combine it with the probability distribution returned by the LLM (pLM) using an interpolation term λ.
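The retrieval-and-interpolation step can be sketched as below. This is a hedged approximation: the softmax-over-negative-distances conversion follows the standard kNN-LM recipe, and the array shapes are illustrative assumptions.

```python
import numpy as np

def knn_lm_step(query, index_keys, index_token_ids, p_lm, k=10, lam=0.5):
    """Combine a kNN distribution over the vocabulary with the LLM's
    distribution. `index_keys` is (N, d); `index_token_ids` maps each
    stored key to a vocabulary id; `p_lm` is the LLM's next-token
    distribution."""
    # L2 distances from the encoded context to every stored key.
    dists = np.linalg.norm(index_keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances -> weights over the k neighbors.
    weights = np.exp(-dists[nearest])
    weights /= weights.sum()
    # Scatter neighbor weights onto the vocabulary to form p_kNN.
    p_knn = np.zeros_like(p_lm)
    for idx, w in zip(nearest, weights):
        p_knn[index_token_ids[idx]] += w
    # Interpolate: p = lam * p_kNN + (1 - lam) * p_LM.
    return lam * p_knn + (1 - lam) * p_lm

rng = np.random.default_rng(0)
index_keys = rng.standard_normal((20, 4))
index_token_ids = rng.integers(0, 5, size=20)
p = knn_lm_step(rng.standard_normal(4), index_keys, index_token_ids,
                np.full(5, 0.2))
print(round(p.sum(), 6))  # 1.0: the result is still a valid distribution
```

Because both inputs are probability distributions and λ is in [0, 1], the interpolated output is guaranteed to remain a valid distribution.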
The parameter λ is important, as it essentially acts as a switch between pkNN and pLM, mathematically formalizing a fuzzy decision to use the character style or the generative LLM to choose the next word. Using a higher λ across the board would result in more stylistic but less fluent outputs: because of the small size of the character style index relative to the training corpus of an LLM, it is frequently the case that none of the retrieved tokens are relevant in our very particular dialog context. On the other hand, a constant but low λ would result in more fluent but less stylistic sentences.
It stands to reason that the choice of λ should depend on several factors, including the actual L2 distances of the nearest neighbors and the dialog context. As such, our primary research focus was to find a way to dynamically set λ at each generation step to avoid using the character style index when the LLM does not need to incorporate style or when the tokens retrieved are not relevant in a particular dialog context.
Creation of a Style Adapter
We identified the following factors that allow us to train a model to interpolate between the two probability distributions:
- The last hidden state of the LLM: giving a representation of the dialog context.
- The raw distances returned by the kNN component: showing how confident the model is in the retrieved tokens.
- The probability of the k retrieved tokens under the LLM: highlighting how appropriate the retrieved tokens are with respect to the current context.
We concatenate the vectorized representations of these factors into a vector that we denote h, and use it to produce a runtime prediction for λ at every time step as follows:

λ = σ(W · h)

where σ is the sigmoid function and W is a parameter vector. With all the pieces in place, our overall probability model can be summarized in the figure below.
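A minimal sketch of this style adapter is shown below, with illustrative feature dimensions (an 8-dim hidden state and k = 10 neighbors) and a randomly initialized W standing in for the learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_lambda(hidden_state, knn_distances, retrieved_token_probs, W):
    """Concatenate the three feature groups (LM hidden state, raw kNN
    distances, LM probabilities of the retrieved tokens) and map them
    through a linear layer + sigmoid to an interpolation weight."""
    features = np.concatenate([hidden_state, knn_distances,
                               retrieved_token_probs])
    return sigmoid(W @ features)

# Toy example: 8-dim hidden state, k = 10 neighbors.
rng = np.random.default_rng(0)
W = rng.standard_normal(8 + 10 + 10) * 0.1
lam = predict_lambda(rng.standard_normal(8), rng.random(10), rng.random(10), W)
print(0.0 < lam < 1.0)  # the sigmoid keeps lambda a valid mixing weight
```

Because the output passes through a sigmoid, the predicted λ is always a valid interpolation weight, and training can push it toward 0 or 1 to realize the switching behavior described above.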
After training our architecture on the UbiScene dataset, we looked at a histogram of the λ values predicted across our development set. We observed a bimodal distribution with one peak near zero, which indicates that the model was indeed able to learn a switching behavior, identifying cases where style was better discarded as well as cases where it was useful to incorporate into the system's predictions.
Results
We compared our model to a version with a fixed λ and to an unoptimized version of a similar technique for style adaptation called Pseudo Dialog Prompting (see our paper for more information).
For evaluation metrics we used perplexity, the accuracy of an auxiliary style classifier and n-gram overlap[*], following previous work on stylized dialog generation.
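As an illustration of the last of these, here is one simple way an n-gram overlap metric can be computed between a generated line and a character's past utterances; the exact formulation used in the paper may differ.

```python
def ngram_overlap(generated, reference_lines, n=2):
    """Fraction of the generated line's n-grams that also appear
    somewhere in the character's reference lines."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    gen = ngrams(generated.split(), n)
    ref = set()
    for line in reference_lines:
        ref |= ngrams(line.split(), n)
    return len(gen & ref) / max(len(gen), 1)

# "a parasol" is the only shared bigram out of three in the generated line.
score = ngram_overlap("bring me a parasol",
                      ["could we get a parasol please"])
print(round(score, 4))  # 0.3333
```

Higher overlap suggests the generated line reuses the character's established phrasing, which is exactly the word-choice effect the method targets.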
From these results, it is worth highlighting that our method outperforms the standard kNN-LM with a fixed interpolation term (where λ was tuned on the validation set of UbiScene). We also note that our method can be combined with Pseudo Dialog Prompting and other state-of-the-art stylized dialog generation methods for further improvements on 4 of the 5 metrics, as shown in the lower half of the table.
Through qualitative analysis of the model's output, we can clearly observe that our model makes use of the character's speech patterns in its predictions. However, it is important to highlight that we also noticed a small decrease in fluency, especially when the model tries to write lines for narrators, for whom word choice is a slightly inappropriate proxy for style; this manifests as a bias toward including named entities from the particular games.
Conclusion
As can be seen, there is still a significant gap between all the models we tested and authentic human-authored scripts. While this indicates that LLMs are far from achieving the stylistic quality of human authors, our model takes a significant step forward in the space of tools for creating drafts for editing, or for use as a source of inspiration to spur the creative process, by adapting generation to arbitrary characters' speaking patterns. In future work, we will continue to develop our techniques and integrate SASS into our deployments of language generation in the AAA video game pipeline.
* More information in our paper.