Gradient Update #22: Turing Award for High Performance Computing, Finding Bugs in Games with CLIP
In which we discuss Turing awardee and HPC pioneer Jack Dongarra and CLIP's zero-shot capabilities applied to game physics.
News Highlight: 2021 Turing Award Winner: Jack Dongarra
Professor Jack Dongarra received the 2021 Turing Award for his contributions to high performance computing (HPC) algorithms and software. He has made especially large contributions to numerical algorithms and parallel computing, in large part through leading the design and implementation of numerous ubiquitous open source libraries. Dongarra’s work in HPC is fundamental to compute-intensive applications like AI, climate science, energy, genomics, and beyond.
Professor Jack Dongarra, currently a professor of computer science at the University of Tennessee with joint appointments at Oak Ridge National Laboratory and the University of Manchester, has been awarded the “Nobel Prize of Computing” after a prolific career spanning multiple decades. Before joining the University of Tennessee in 1989, he received his PhD in applied mathematics from the University of New Mexico and worked as a scientist at Argonne National Laboratory.
Dongarra has helped lead the development of libraries including LINPACK, LAPACK, BLAS, MAGMA, and MPI, which are widely used throughout scientific software. For instance, if you use a numerical library like numpy or scipy to multiply two matrices, the library will call a BLAS routine. If you then compute eigenvalues and eigenvectors, this will be done with a call to a LAPACK routine. If you use PyTorch on a GPU, many linear algebra operations are handled with calls to MAGMA.
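As a quick illustration, both operations below bottom out in these libraries (assuming NumPy is built against a BLAS/LAPACK backend, as standard distributions are):

```python
import numpy as np

# Matrix multiplication: NumPy dispatches this to a BLAS routine
# (dgemm, double-precision general matrix multiply).
a = np.array([[2.0, 0.0], [0.0, 3.0]])
b = np.array([[1.0, 1.0], [1.0, 1.0]])
c = a @ b  # [[2. 2.] [3. 3.]]

# Symmetric eigendecomposition: np.linalg.eigh calls a LAPACK
# symmetric eigensolver under the hood.
eigenvalues, eigenvectors = np.linalg.eigh(a)  # eigenvalues: [2. 3.]
```

Neither call mentions BLAS or LAPACK by name, which is exactly the point: the routines are so foundational that most users never see them.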
One extremely useful advancement for machine learning pioneered in part by Jack Dongarra is mixed precision arithmetic. While certain application areas like “computational astrophysics, computational fluid dynamics, nuclear engineering, and quantum computing” often require higher numerical precision, for many tasks in machine learning it suffices to use lower precision arithmetic. While supercomputer performance has historically been benchmarked in double precision, Dongarra recently developed new benchmarks for mixed precision computing.
Why does it matter?
Machine learning and other computationally intensive tasks in diverse application areas heavily rely on HPC innovations. When working in these application areas, people often abstract away or take for granted the low-level hardware and software paradigms that enable them in the background. While this is done with good intentions, we must keep in mind that low-level technology directly influences what high-level applications succeed (cf. the hardware and software lottery). Conversely, high-level applications influence developments in low-level technology. For instance, much effort has been put into developing better and better AI accelerators, and Dongarra has recently helped develop the HPL-AI supercomputer benchmark for the mixed precision computations that are widely used in modern machine learning. All in all, the amazingly impactful work of Jack Dongarra and the high performance computing community is certainly worthy of the recognition that it has received through this Turing Award.
Derek: I’m rather ashamed to admit that I did not know of Jack Dongarra before this year, even though my code calls his libraries every day, I recently dug into the LAPACK source code for symmetric eigensolvers, and I even took five different numerical computing courses in undergrad! This just goes to show how underappreciated HPC work is; it’s great to see the recognition.
Daniel: I also have to admit I did not know of Jack Dongarra before, and my experience with libraries like LAPACK and BLAS has mostly been limited to installation woes. Having recently become more familiar with some of the low-level details of the ML libraries I use, I’m also happy to see this work being recognized.
Paper Highlight: CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning
A frame from a gameplay video identified using the query “A horse in the air” from the GamePhysics dataset
Building on successes in text-prompted image retrieval, researchers at the University of Alberta developed a framework to find bugs of a certain category in a large dataset of gameplay videos. Such videos are publicly available on the internet and can serve as a trove of information for game developers, particularly for identifying bugs. While a huge amount of such data exists, there are no reliable automated methods to filter and analyze it. In this paper, the authors use CLIP, OpenAI’s language and image embedding model, to develop a new methodology for automatically retrieving gameplay videos from text queries.
Game testing as a field has gained significant attention in recent years as the scale of video games has rapidly increased. However, developments in automated testing have not seen the same growth, which creates a practical limit to how extensively games can currently be tested. There is, however, a large body of publicly available gameplay videos that could be used to find and explore bugs, yet it remains largely untapped. Many individuals in the gaming community post videos of bugs on platforms such as Reddit, YouTube, and Twitch, and these videos often contain more information on reproducing a bug than a typical bug report.
In this paper, the authors use gameplay videos from the r/GamePhysics subreddit to develop a framework that video game developers can use to find real gameplay videos by searching for specific bugs, such as “horse in the air” or “person stuck in a barrel”. Perhaps the most significant contribution of the paper is a new dataset, called “GamePhysics,” that consists of ~27,000 videos from ~1,900 games, with each video lasting between 2 and 60 seconds. Their methodology uses a pre-trained CLIP (Contrastive Language-Image Pre-training) model, which generates embeddings for images and their text descriptions in the same embedding space.
The authors first embed each frame of each video in the GamePhysics dataset in a latent space, with no fine-tuning or further training of the CLIP model. Subsequently, a text query from the user is embedded in the same latent space using CLIP’s text embedder. Video frames nearest to the embedded text are then returned using cosine similarity and an efficient search strategy built on the Faiss library. This pipeline allows a large dataset of gameplay videos to be easily queried using text inputs from a user. Results show this approach’s promise: the pipeline retrieves videos for text queries like “Person flying in the air” and “Vehicle on top of building” with high accuracy across multiple games.
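The retrieval step itself is conceptually simple once the embeddings exist. Here is a minimal sketch of the nearest-frame lookup using random stand-ins for CLIP’s image and text embeddings (in the paper, this search is done at scale with Faiss rather than brute force):

```python
import numpy as np

def normalize(x):
    # CLIP-style retrieval compares embeddings by cosine similarity,
    # which reduces to a dot product after L2 normalization.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Stand-ins for per-frame image embeddings (n_frames x dim) and a
# single text-query embedding -- in the real pipeline both come
# from the pre-trained CLIP encoders.
frame_embeddings = normalize(rng.normal(size=(10_000, 512)))
query_embedding = normalize(rng.normal(size=(512,)))

# Top-5 frames by cosine similarity to the text query.
similarities = frame_embeddings @ query_embedding
top_k = np.argsort(-similarities)[:5]
```

Faiss replaces the brute-force `argsort` above with an indexed search, so the same lookup scales to millions of frames.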
Why does it matter?
While game testing is essential for the development of robust games, it remains an open problem at scale. Testing is particularly hard for ‘open-world’ games, which contain a virtual world that players can autonomously explore and play through in many different ways. Currently, game testers play-test games to find bugs, but this significantly limits testing, as all possible scenarios for an open-world game cannot be covered manually. The framework developed by the authors provides a fully open-source method, requiring no training, that can effectively give developers sample gameplays of specific bug instances to help with reproducing the bug. Companies can also use the method to query their own datasets of gameplay videos and bug reports, allowing for curation of large unlabeled datasets with minimal overhead. This paper also highlights the transfer-learning capabilities of large models such as CLIP, which was able to find instances of objects and events in videos with no explicit training on the dataset itself. The work adds to a steadily growing domain of research focused on learning with minimal supervision that will enable AI to scale across industries and use-cases.
Daniel: First off, this is a really cool demonstration of how powerful the zero-shot capabilities of models like CLIP are. Second, this is indicative of many exciting applications we might see for multi-modal models in the future. I hope to see not only image retrieval with text queries, but also advances in work toeing the speech/image domains.
Derek: I like work on image retrieval with text queries (I really hope that I will be able to easily search all the photos I have with text queries in the future). Another direction I would like to see is whether outlier detection can be used to find bugs in games without a query that describes the type of bug; imagine an unsupervised ML model monitoring livestreams or YouTube videos and flagging visual bugs that should not be a part of normal gameplay.
New from the Gradient
Other Things That Caught Our Eyes
Face scanner Clearview AI aims to branch out beyond police “A controversial facial recognition company that's built a massive photographic dossier of the world's people for use by police, national governments and — most recently — the Ukrainian military is now planning to offer its technology to banks and other private businesses.”
AI algorithms could disrupt our ability to think “In other words, we could already be in the process of outsourcing our thinking to machines and, as a result, losing a portion of our agency.”
How AI and Humans Can Best Collaborate at Work “Who decides who does what? And how can humans learn to trust AI? Research offers some answers.”
California suggests taking aim at AI-powered hiring software “A newly proposed amendment to California's hiring discrimination laws would make AI-powered employment decision-making software a source of legal liability.”
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding… We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
PaLM: Scaling Language Modeling with Pathways “To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM… on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.”
The Effects of Regularization and Data Augmentation are Class Dependent “In this study, we demonstrate that techniques such as DA or weight decay produce a model with a reduced complexity that is unfair across classes. The optimal amount of DA or weight decay found from cross-validation leads to disastrous model performances on some classes… Even more surprising, such performance drop also appears when introducing uninformative regularization techniques such as weight decay. Those results demonstrate that our search for ever increasing generalization performance -- averaged over all classes and samples -- has left us with models and regularizers that silently sacrifice performances on some classes… designing novel regularizers without class-dependent bias remains an open research question.”
HybridNets: End-to-End Perception Network This paper systematically studies an end-to-end perception network for multi-tasking and proposes several key optimizations to improve accuracy… the paper proposes efficient segmentation head and box/class prediction networks based on weighted bidirectional feature network … [and] an efficient training loss function and training strategy to balance and optimize network. Based on these optimizations, we have developed an end-to-end perception network to perform multi-tasking, including traffic object detection, drivable area segmentation and lane detection simultaneously, called HybridNets… HybridNets achieves 77.3 mean Average Precision on Berkeley DeepDrive Dataset, outperforms lane detection with 31.6 mean Intersection Over Union with 12.83 million parameters and 15.6 billion floating-point operations.
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language “Large foundation models… store different forms of commonsense knowledge across different domains. In this work, we show that this model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue -- in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional finetuning. In the context of egocentric perception, we present a case study of Socratic Models (SMs) that can provide meaningful results for complex tasks such as generating free-form answers to contextual questions about egocentric video, by formulating video Q&A as short story Q&A, i.e. summarizing the video into a short story, then answering questions about it. Additionally, SMs can generate captions for Internet images, and are competitive with state-of-the-art on zero-shot video-to-text retrieval with 42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models zero-shot to capture new multimodal functionalities, without domain-specific data collection.”
Have something to say about this edition’s topics? Shoot us an email at firstname.lastname@example.org and we will consider sharing the most interesting thoughts from readers in the next newsletter! If you enjoyed this piece, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!