What exactly is a build?
Have you ever wondered how video games are created? Many roles are involved in creating a video game: developers write code, designers craft stories and characters, and artists provide all kinds of assets. Finally, all of these artifacts are built together to produce a playable game.
But what exactly does a "build" mean in the game development context?
[Image 1]
Building a game is a lot like domino art: a sequence of standing dominoes is set up and then toppled, causing a chain reaction in which each falling domino knocks down the next. In game development, the code, assets, and sound effects are the domino tiles. Just as dominoes follow a specific path that dictates the order in which they fall, game components have dependencies on each other. In this chain reaction, a single misplaced tile can cause the entire run to fail. Setting up thousands of dominoes and building a complex video game that contains millions of source files are both expensive in terms of computation and time.
What is build outcome prediction?
Build outcome prediction is the process of predicting the result of a software build, before the build is actually executed, by analyzing historical builds and the current build. It aims to provide developers with insight into the potential success or failure of a build.
[Image 2]
Picture yourself mushroom picking with a limited-space basket, aiming to gather as many edible mushrooms as possible. To streamline the process, you leverage your mushroom knowledge to pre-filter the poisonous ones. You identify common features like volva, white gills, red cap, and specific odors associated with poisonous mushrooms. While this pre-filtering method doesn't guarantee that all mushrooms in your basket are edible or that all discarded mushrooms are poisonous, it offers a quick and cost-effective approach compared to sending every mushroom to a laboratory for toxicity examination.
This is similar to the process of build outcome prediction, where we aim to determine the result of a build based on early indicators before executing it. Just as with the poisonous mushrooms, passing and failing builds exhibit certain common features. These features may not be readily apparent to humans, but machine learning techniques can effectively capture and analyze these intricate correlations.
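To make this concrete, here is a minimal sketch of build outcome prediction framed as binary classification, written in Python with scikit-learn. The feature names and values are purely illustrative assumptions; real predictors mine far richer features from the CI system and version-control history.

```python
# A minimal sketch of build outcome prediction as binary classification.
# Feature names and data are hypothetical assumptions for illustration.
from sklearn.ensemble import RandomForestClassifier

# Each row describes one historical build: [files_changed, lines_changed,
# previous_build_failed (0/1), author_recent_failure_rate]
historical_features = [
    [3,  120, 0, 0.05],
    [27, 950, 1, 0.40],
    [1,   10, 0, 0.02],
    [14, 480, 1, 0.25],
]
historical_outcomes = [1, 0, 1, 0]  # 1 = pass, 0 = fail

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(historical_features, historical_outcomes)

# Predict the outcome of a not-yet-executed build from its features.
new_build = [[5, 200, 0, 0.10]]
print(model.predict_proba(new_build))  # [[P(fail), P(pass)]]
```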
Benefits
Build outcome prediction contributes to a smoother development experience: by offering an immediate prediction, it gives developers more confidence in the likely outcome of their builds before resources are spent executing them.
In modern software development, the emphasis is on making small, incremental changes and testing them frequently to detect defects early. This approach, however, comes with a significant cost: the computational expense of frequent builds and tests.
We can leverage build outcome prediction in three build execution strategies, and a hybrid of them, that substantially reduce the computational cost, anticipate potential failures, and prevent build breakage from contaminating the build environment.
1. Build Skipping
Skipping passing builds is a practice where developers or build systems decide to skip the execution of certain build tasks or steps when they believe that the outcome will be the same as a previous successful build. The idea behind skipping is to save time and resources by avoiding the repetition of build processes that are expected to produce identical results.
Build outcome prediction can act as an indicator of which builds to skip. Based on the prediction, we decide whether or not to skip the build for a given change: if the prediction indicates a high probability of success and the change contains no critical or high-risk modifications, the build can be skipped, saving time and resources.
[Figure 1: The number of executions reduces to 5 from 8. ]
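A minimal sketch of such a skip decision is shown below, assuming the trained classifier `model` from the earlier sketch. The threshold and the `has_high_risk_changes` helper are illustrative assumptions, not part of any real build system.

```python
SKIP_THRESHOLD = 0.95  # only skip when the predictor is very confident

def has_high_risk_changes(change_set) -> bool:
    # Placeholder risk check: e.g. flag changes touching build configuration.
    # `change_set` is assumed to be an iterable of changed file paths.
    return any(path.endswith((".gradle", ".cmake")) for path in change_set)

def should_skip(build_features, change_set) -> bool:
    p_pass = model.predict_proba([build_features])[0][1]
    return p_pass >= SKIP_THRESHOLD and not has_high_risk_changes(change_set)
```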
2. Build Batching
Batching builds involves grouping together multiple changes or commits and building them collectively instead of building each change individually. The main purpose of batching builds is to improve efficiency and save time by reducing the number of build executions. Instead of triggering a separate build for every individual change, developers wait until a certain number of changes have accumulated before initiating a build. This allows multiple changes to be built and tested together during busy periods, streamlining the build process.
When batching builds, the build outcome predictor can be used to make informed decisions about which changes to include in a batch. Instead of simply batching all changes together, the predictor can identify changes that are more likely to result in a failed build. By building these changes separately, the risk of a problematic change breaking an entire batch is reduced.
[Figure 2: The number of executions reduces to 6 from 8, and no builds are skipped. ]
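Below is a sketch of predictor-guided batching, again assuming the hypothetical `model` from the earlier sketch. Likely-failing changes are built on their own so that they cannot break an entire batch, while likely-passing changes are grouped together; the batch size and threshold are illustrative.

```python
BATCH_SIZE = 4

def plan_batches(pending_builds):
    """pending_builds: list of (build_id, feature_vector) tuples."""
    batches, current_batch = [], []
    for build_id, features in pending_builds:
        p_pass = model.predict_proba([features])[0][1]
        if p_pass < 0.5:
            batches.append([build_id])          # build risky change on its own
        else:
            current_batch.append(build_id)
            if len(current_batch) == BATCH_SIZE:
                batches.append(current_batch)   # batch of likely-passing changes
                current_batch = []
    if current_batch:
        batches.append(current_batch)
    return batches
```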
3. Build Preflight
Build preflight is the practice of building risky changes locally before merging them into the main pipeline. When the main pipeline is broken, all other incoming builds will fail until the culprit is fixed, delaying feedback for developers who committed later and forcing them to perform a mentally taxing context switch to another task or remain effectively idle. Failing builds can also contaminate the build environment. The purpose of preflight is to prevent failing builds from blocking other incoming builds that would otherwise pass.
With build outcome prediction, builds with a higher likelihood of failure can be forced to preflight. This approach increases the stability of the main pipeline, leading to more efficient development processes.
[Figure 3: The number of executions is not reduced, but the number of failures in the main pipeline is reduced from 3 to 1. ]
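The routing decision could look like the following sketch, again assuming the hypothetical `model` from the earlier sketch; the threshold is illustrative.

```python
PREFLIGHT_THRESHOLD = 0.5

def route_build(build_id, features):
    p_fail = model.predict_proba([features])[0][0]
    if p_fail >= PREFLIGHT_THRESHOLD:
        return ("preflight", build_id)      # build locally before merging
    return ("main_pipeline", build_id)      # proceed as usual
```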
4. Hybrid
We also recommend a mix of the above approaches, as shown in the figure. If a build is predicted to fail, it is forced to build locally (preflight) before merging into the main branch. If multiple builds are predicted to pass, we batch them together to save executions. As such, we can reduce both the number of executions and the number of times the main pipeline is broken, without skipping any build outright.
[Figure 4: The number of executions is reduced to 6 from 8, and the number of failures in the main pipeline is reduced from 3 to 1. ]
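Combining the two previous sketches, a hypothetical hybrid planner might look like the following, reusing the illustrative `route_build` and `plan_batches` helpers from above.

```python
def hybrid_plan(pending_builds):
    """Split pending builds into preflight builds and batches of likely passes."""
    to_preflight, to_batch = [], []
    for build_id, features in pending_builds:
        target, _ = route_build(build_id, features)
        if target == "preflight":
            to_preflight.append(build_id)       # likely failure: build locally first
        else:
            to_batch.append((build_id, features))
    return {"preflight": to_preflight, "batches": plan_batches(to_batch)}
```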
How does build outcome prediction work?
Current build outcome prediction research [cite BuildFast] incorporates feature engineering to represent a build using three key aspects: the current build data, the previous build data, and the historical build data.
Although these approaches show plenty of promise, the distinct characteristics of video games present new challenges for build outcome prediction. Prior work on build outcome prediction has largely focused on projects that are code-intensive. As such, many features adopted by these studies are code-specific, e.g., the number of lines changed in source code. In the video game setting, data artifacts and changes therein are more prevalent than source code and code changes. These data artifacts play a crucial part in the game experience. The game engine compiles the data artifacts with source code in an order-sensitive manner that respects the specified dependencies. Therefore, if data artifacts are corrupted, or dependencies are not respected, data changes will incur build failures.
To accommodate the unique data changes in video game settings, we explore the following aspects of builds.
Context
Change sets are typically submitted with the intention of integrating new features or bug fixes. This context transitively applies to the builds that are invoked for them. We infer the context of a build from the metadata of its change set, e.g., commit message, file types, and prior build activity.
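As an illustration, context features could be derived from change-set metadata along the lines of the sketch below; the keywords, file extensions, and field names are assumptions made for the example, not the exact features we use.

```python
def context_features(change_set):
    # `change_set` is assumed to be a dict with "commit_message" and "files" keys.
    message = change_set["commit_message"].lower()
    files = change_set["files"]
    return {
        "is_bug_fix": int(any(k in message for k in ("fix", "bug", "crash"))),
        "is_feature": int(any(k in message for k in ("add", "feature", "implement"))),
        "num_code_files": sum(f.endswith((".cpp", ".h", ".cs")) for f in files),
        "num_data_files": sum(f.endswith((".uasset", ".fbx", ".png")) for f in files),
    }
```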
Relevance
In the development process, a set of builds may relate to a single task: for example, when an initial build fails and subsequent failing builds are attempts to fix it, or when a large task is decomposed into a series of incremental change sets. Features that exploit the status or context of a previous build should therefore only be applied when the prior and current builds are part of the same task. Thus, we propose features that suggest to the model when the previous build's features are relevant to the current prediction, and when they should be ignored.
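For illustration, relevance features might compare the current build with its immediate predecessor as in the sketch below; the field names and specific signals are assumptions, not the exact feature set.

```python
def relevance_features(current, previous):
    # `current` and `previous` are assumed to be dicts describing builds,
    # with "files", "author", "timestamp" (seconds), and "outcome" keys.
    shared_files = set(current["files"]) & set(previous["files"])
    return {
        "same_author": int(current["author"] == previous["author"]),
        "shared_file_ratio": len(shared_files) / max(len(current["files"]), 1),
        "minutes_since_previous": (current["timestamp"] - previous["timestamp"]) / 60,
        "previous_failed": int(previous["outcome"] == "fail"),
    }
```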
Dependency
[Image 3: An example of a multidisciplinary dependency graph, with green nodes representing data files, orange nodes representing the boundary nodes, and pink nodes representing the code files. Dependency graphs of video games may contain millions of nodes. ]
Modifying files that are depended upon by a large number of files is riskier than modifying a file that has no dependents. For example, changing a popular library function will propagate that change to its many use points, and is inherently riskier than changing a short script that calls a function from the library. Therefore, we propose features that represent the scale of impact on dependencies of the current change.
In video game development, the interaction between code and data files defines the game functions, manages the game assets, and determines the player experience. Cross-boundary changes occur when a code change updates a code file that is depended upon by data files.
[Figure 5: An example of a cross-boundary change. The impacted nodes refer to the nodes that are dependent on the changed node.]
We find that cross-boundary changes have a substantially larger impact than other types of changes, and thus incur significantly more build failures. Therefore, we propose features that indicate whether the current change crosses the code-data boundary.
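As a rough illustration, dependency-aware features could be computed over a dependency graph as in the following sketch, where the graph is represented as a mapping from each file to its direct dependents. This is a simplified stand-in: real game dependency graphs can contain millions of nodes, so such traversals are typically precomputed or bounded.

```python
from collections import deque

def impacted_nodes(dependents, changed_file):
    """All files reachable by following dependency edges away from the change."""
    seen, queue = set(), deque([changed_file])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def dependency_features(dependents, changed_file, is_code_file):
    # `is_code_file` is an assumed predicate distinguishing code from data files.
    impacted = impacted_nodes(dependents, changed_file)
    crosses_boundary = is_code_file(changed_file) and any(
        not is_code_file(f) for f in impacted
    )
    return {"num_impacted": len(impacted), "crosses_boundary": int(crosses_boundary)}
```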
RavenBuild
To complement the existing work, we propose RavenBuild, which describes builds from three file type-agnostic aspects: (1) context-aware features that characterize the intention and context of the change being submitted; (2) relevance-aware features that compare the current build and its immediate predecessor to provide hints of when the model should consider the previous build information; and (3) dependency-aware features that assess the impact of the change on other files in the dependency graph.
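Purely as an illustration of how such feature groups fit together, the sketch below assembles the hypothetical context, relevance, and dependency features from the earlier examples into a single feature vector for a change set; it is not RavenBuild's actual implementation.

```python
def build_feature_vector(change_set, previous_build, dependents, is_code_file):
    # Combine the three illustrative feature groups defined earlier.
    features = {}
    features.update(context_features(change_set))
    features.update(relevance_features(change_set, previous_build))
    features["num_impacted"] = 0
    features["crosses_boundary"] = 0
    for changed_file in change_set["files"]:
        dep = dependency_features(dependents, changed_file, is_code_file)
        features["num_impacted"] += dep["num_impacted"]
        features["crosses_boundary"] = max(
            features["crosses_boundary"], dep["crosses_boundary"]
        )
    return features
```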
Results
[Figure 6]
Compared to the state-of-the-art (BuildFast), RavenBuild improves the F1-score of the failure class by 19.8 percentage points, the recall of the failure class by 24.4 percentage points, and the AUC by 19.4 percentage points.
Takeaways
Build outcome prediction holds significant potential for reducing the cost of delivering games through strategies such as build skipping, build batching, and build preflighting. When employing machine learning techniques to identify what passing and failing builds have in common, it is crucial to account for the inherent characteristics of the build system. Leveraging context, relevance, and dependency knowledge makes build outcome prediction more accurate and efficient, leading to cost savings in game delivery.