The Gradient Update #16: China's World-leading Surveillance Research and a ConvNet for the 2020s
In which we consider China's intense interest in computer vision surveillance research, and ConvNets make a comeback.
Welcome to the 16th update from the Gradient! If you were referred by a friend, subscribe and follow us on Twitter! And if not, feel free to share this post!
News Highlight: China Leads the World in Computer Vision Surveillance Research
Summary
A recent report from the Center for Security and Emerging Technology (CSET) revealed that China conducts substantial research into core AI-related surveillance technologies. The study states that the country has a ‘disproportionate share’ of research in three key areas: person re-identification (ReID), crowd counting, and spoofing detection (i.e., technologies that aim to expose attempts to subvert identification systems). Moreover, the study found that China also produces a substantial amount of work on human-facing computer vision technologies that pertain to action, emotion, and facial recognition.
Background
CSET is a policy research organization within Georgetown University that focuses on the challenges and opportunities of emerging AI technologies. The center’s new report, Trends in AI Research for the Visual Surveillance of Populations, authored by Ashwin Acharya, Max Langenkamp and James Dunham, used SciBERT to analyze papers from 2015-2019.
SciBERT, a transformer model trained on scientific and technical text (including arXiv preprints), was used to comb through over 100 million individual publications across six academic datasets. The analysis found that facial recognition was the most frequently recurring task, and that crowd counting and face-spoofing detection are up-and-coming areas of interest as well.
Analyzing the data on a country-by-country basis, however, revealed that, compared to the world average, China is particularly invested in this line of computer vision research.
The graphs above show that China’s surveillance research output is competitive with, or exceeds, the combined output of the United States, the European Union, CANZUK, East Asian democracies, and India. In ‘Emerging Tasks,’ China’s share of surveillance papers eclipses that of the rest of the world combined. In addition, the study revealed that researchers with Chinese institutional affiliations were responsible for more than one third of publications in computer vision and visual surveillance research.
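To picture how the report’s SciBERT-based methodology might work mechanically, here is a minimal, hypothetical sketch of using SciBERT embeddings to tag abstracts by surveillance task. The label set, untrained classifier head, and plain argmax are illustrative assumptions, not CSET’s actual pipeline.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# SciBERT encoder pre-trained on scientific text
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

# Hypothetical label set; a real classifier head would be trained on labeled abstracts.
labels = ["person_reid", "crowd_counting", "spoofing_detection", "other"]
classifier = torch.nn.Linear(encoder.config.hidden_size, len(labels))

def predict(abstract: str) -> str:
    """Embed an abstract with SciBERT and pick the highest-scoring task label."""
    inputs = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        cls_embedding = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token embedding
        logits = classifier(cls_embedding)
    return labels[logits.argmax(dim=-1).item()]

print(predict("We propose a novel person re-identification network for multi-camera tracking."))
```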
Why does it matter?
China is infamous for its approach to the privacy and security of its populace. The study argues that these numbers may reflect China’s support of, or reliance on, human-facing computer vision tasks, whether for benign use cases or for repressive purposes. “These algorithms are often applied for benign, commercial uses, such as tagging individuals in social media photos. But progress in computer vision could also empower some governments to use surveillance technology for repressive purposes,” the authors of the study wrote.
Looking deeper, contentious issues such as China’s treatment of the Uyghur population and its COVID-zero policy rely, in one form or another, on exactly these computer vision and surveillance technologies. Overall, the study concluded that the general trend and numbers show that “...China (is) by far the most prolific country in both areas [computer vision and surveillance research].”
There are a few caveats to the study, however. As stated earlier, the researchers used SciBERT, a pre-trained language model trained on English text, so the dataset used to draw these conclusions consisted entirely of papers written in English. As a result, the numbers above may not fully capture China’s contributions to computer vision and visual surveillance research; the true figures could skew in either direction.
Editor Comments
Justin: While we are seeing a drastic increase in the absolute quantity of computer vision research published, I wouldn’t necessarily use that as a benchmark for the quality of the research. Judging by the frequency of emails I receive from obscure Chinese research institutions offering to pay me to republish papers from my undergraduate days, I am highly skeptical that each paper published represents a new contribution, let alone a meaningful one. Given the vast proliferation of surveillance technologies domestically and abroad, I believe the locus of our criticisms should be on specific uses of these technologies, rather than following a shamefully sinophobic media trope of castigating the entire Chinese computer vision research community as a modern American boogeyman.
Daniel: I strongly agree with Justin’s comments above re where the locus of our criticisms should be. It’s tempting to simplify and dismiss China’s motivations and uses of their technology, but I think we need to better understand the contexts in which they operate to offer more useful perspectives. That is not to dismiss things like the repression of the Uyghur population, but to note that China is a nation with a different governance structure and different value system from our own, and that the way in which we think about these problems doesn’t map cleanly onto analyzing China.
Andrey: This study highlights the need to not think simplistically in terms of an ‘AI race’ or phrases like ‘China is rapidly surpassing the US in its AI capabilities’. It’s important to understand that the situation is more nuanced, with countries having different priorities and different applications of AI that they are heavily investing in and excelling at. For more on this, check out our podcast episode Jeffrey Ding on China's AI Dream, the AI 'Arms Race', and AI as a General Purpose Technology.
Paper Highlight: A ConvNet for the 2020s
Summary
Iterating on a standard ResNet to make it more similar to Transformers, the authors introduce ConvNeXt, a vision model constructed entirely from standard ConvNet modules that compares favorably with Transformers on ImageNet top-1 accuracy and with Swin Transformers on COCO detection and ADE20K segmentation. The authors first adopt a Transformer-inspired training recipe that includes training for more epochs, using the AdamW optimizer, and adding a number of data augmentations. They then incorporate a number of Transformer-style design choices, such as changing the stage compute ratio, using depthwise convolutions and increasing the network width to match Swin-T’s, creating an inverted bottleneck, and so on.
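To make these design choices concrete, here is a simplified sketch of a ConvNeXt-style block in PyTorch. Treat it as an illustration of the ideas rather than the authors’ implementation: details such as LayerScale, stochastic depth, and the downsampling stem are omitted.

```python
import torch
from torch import nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt-style block: 7x7 depthwise conv, LayerNorm,
    inverted-bottleneck MLP with GELU, and a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        # Large-kernel depthwise convolution (one filter per channel)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # a single LayerNorm instead of several BatchNorms
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # inverted bottleneck: expand 4x ...
        self.act = nn.GELU()                    # GELU instead of ReLU
        self.pwconv2 = nn.Linear(4 * dim, dim)  # ... then project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # (N, H, W, C) so norm/linear act on channels
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
        return residual + x

block = ConvNeXtBlock(dim=96)                   # 96 channels, matching Swin-T's first-stage width
print(block(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```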
Why does it matter?
Transformers have long been the standard backbone for NLP models. When Vision Transformers were introduced, few adjustments were made to incorporate image-specific inductive biases; the subsequent Swin Transformer, however, reintroduced ConvNet-like priors such as locality, indicating that convolutions still matter for image backbones.
While the authors hope some of their introduced changes prompt discussion about how fundamental convolutions really are, the road to their conclusions offers a much subtler lesson on training extremely large models. By varying the Macro Design, Inverted Bottleneck, and Kernel Size, the authors lay out a detailed roadmap for iteratively improving their models, and in doing so establish a relative ranking of how much each design choice contributes to performance. There should be plenty of opportunities for future work exploring the roles these design choices play in performance, as well as (hopefully) how they generalize across domains (image, text, audio, etc.).
Editor Comments
Justin: While I am normally the first to praise the inclusion of Limitations and Societal Impact sections, I think the authors really dropped the ball here. Given the largesse of the corporation sponsoring the research and the real-world impacts of deploying these models, this warrants more than just a few sentences at the end of a 14-page paper’s appendix. Real consideration of the limitations and harms has to be undertaken, and those results have to be extensively communicated, rather than treated as a box-checking exercise necessary for publication.
Daniel: It’s often surprising to consider how recently it was that Transformers and ViTs weren’t even a thing. The prevalence of ViTs and Swin Transformers has led to some great advances for vision, but these models do seem to miss out on useful inductive biases that ConvNets have, as the authors point out in their introduction. As we develop vision models in the future, I feel that building on architectures with task-aligned inductive biases is a promising direction in terms of reaping returns.
Andrey: This paper is a welcome reality check for the field of AI. The simple truth is that rapid progress with Transformers may have been due to the collective effort of the community to optimize that particular architecture, rather than anything inherent in it. It may well be that such careful analysis and improvement of RNNs could vastly improve their performance. Hopefully the field will take note of this result going into the future.
New from the Gradient
A Science Journalist’s Journey to Understand AI
“I’m a science journalist who also writes books and articles that introduce scientific concepts to kids and teens. Though I’ve written about everything from outer space to dinosaurs, the topics I gravitate towards most often are computers, robots, and artificial intelligence. … I’d like to share some of the new understandings I’ve come to about AI and cognitive science along the way, as well as what changed my mind or shifted my perspective. I hope these pointers help when you are communicating about AI to those who aren’t experts.”
Eric Jang on Robots Learning at Google and Generalization via Language
In episode 20 of The Gradient Podcast, we talk to Eric Jang, a research scientist on the Robotics team at Google.
Other Things That Caught Our Eyes
News
Former Google scientist says the computers that run our lives exploit us — and he has a way to stop them “As artificial intelligence lays claims to growing parts of our social and consumer lives, it’s supposed to eliminate all the creeping flaws humans introduce to the world.”
Microsoft forms new coalition for AI in healthcare “Microsoft has created the Artificial Intelligence Industry Innovation Coalition (AI3C) to drive the use of artificial intelligence (AI) in healthcare by providing recommendations, tools and best practices.”
ArXiv.org Reaches a Milestone and a Reckoning “What started in 1989 as an e-mail list for a few dozen string theorists has now grown to a collection of more than two million papers—and the central hub for physicists, astronomers, computer scientists, mathematicians and other researchers. On January 3 the preprint server arXiv …”
U.S. Chamber Launches Bipartisan Commission on Artificial Intelligence to Advance U.S. Leadership “The U.S. Chamber of Commerce today announced the launch of its Artificial Intelligence (AI) Commission on Competition, Inclusion, and Innovation to advance U.S. leadership in the use and regulation of AI technology.”
Papers
Explaining in Style: Training a GAN to explain a classifier in StyleSpace - “Image classification models can depend on multiple different semantic attributes of the image. An explanation of the decision of the classifier needs to both discover and visualize these properties. Here we present StylEx, a method for doing this, by training a generative model to specifically explain multiple attributes that underlie classifier decisions … We apply StylEx to multiple domains, including animals, leaves, faces and retinal images. For these, we show how an image can be modified in different ways to change its classifier output. Our results show that the method finds attributes that align well with semantic ones, generate meaningful image-specific explanations, and are human-interpretable as measured in user-studies.”
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents - “Can world knowledge learned by large language models (LLMs) be used to act in interactive environments? In this paper, we investigate the possibility of grounding high-level tasks, expressed in natural language (e.g. "make breakfast"), to a chosen set of actionable steps (e.g. "open fridge") … We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions. Our evaluation in the recent VirtualHome environment shows that the resulting method substantially improves executability over the LLM baseline. … Website at this https URL”
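As a toy illustration of the translation step described above (the paper uses learned sentence embeddings rather than word overlap, and the action list here is a made-up stand-in for VirtualHome’s much larger action space), a free-form step generated by an LLM can be snapped to the closest admissible action:

```python
# Hypothetical admissible actions; VirtualHome's real action space is much larger.
ADMISSIBLE_ACTIONS = ["open fridge", "grab milk", "close fridge", "walk to kitchen"]

def translate(step: str) -> str:
    """Map a free-form plan step onto the most similar admissible action,
    using word-overlap (Jaccard) similarity as a stand-in for learned embeddings."""
    step_words = set(step.lower().split())
    def similarity(action: str) -> float:
        action_words = set(action.split())
        return len(step_words & action_words) / len(step_words | action_words)
    return max(ADMISSIBLE_ACTIONS, key=similarity)

print(translate("Open the fridge door"))  # -> "open fridge"
```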
Scaling Vision with Sparse Mixture of Experts - “Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time.”
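To give a flavor of the sparse-routing idea, here is a generic, simplified top-k mixture-of-experts layer; it is not the V-MoE implementation, which adds expert capacity limits, batch-prioritized routing, and load-balancing losses. Each token is processed by only k of the experts:

```python
import torch
from torch import nn

class TopKMoE(nn.Module):
    """Toy sparse MoE layer: a gate picks the top-k experts per token,
    and each token is processed only by those experts."""
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (num_tokens, dim)
        scores = self.gate(tokens).softmax(dim=-1)
        weights, indices = scores.topk(self.k, dim=-1)         # (num_tokens, k)
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out

layer = TopKMoE(dim=64)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```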
Memory-assisted prompt editing to improve GPT-3 after deployment - “Large LMs such as GPT-3, while powerful, are not immune to mistakes, but are prohibitively costly to retrain. One failure mode is misinterpreting a user's instruction (e.g., GPT-3 interpreting "What word is similar to good?" to mean a homonym, while the user intended a synonym). Our goal is to allow users to correct such errors directly through interaction -- without retraining. Our approach pairs GPT-3 with a growing memory of cases where the model misunderstood the user's intent and was provided with feedback, clarifying the instruction. … All the code and data is available at this https URL.”
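The core mechanism is simple enough to sketch in a few lines. The version below is hypothetical and deliberately crude (the paper’s actual system retrieves feedback far more robustly than raw string similarity): store user clarifications and prepend the closest match to future prompts, so the frozen model can be steered without retraining.

```python
from difflib import SequenceMatcher

# Toy memory of past misunderstandings: question -> user-supplied clarification.
memory: dict[str, str] = {}

def remember(question: str, clarification: str) -> None:
    memory[question] = clarification

def build_prompt(question: str, threshold: float = 0.8) -> str:
    """Prepend a stored clarification if a sufficiently similar question was corrected before."""
    for past_question, hint in memory.items():
        if SequenceMatcher(None, question.lower(), past_question.lower()).ratio() >= threshold:
            return f"Clarification: {hint}\nQuestion: {question}"
    return f"Question: {question}"

remember("What word is similar to good?", "The user wants a synonym, not a homonym.")
print(build_prompt("What word is similar to nice?"))  # clarification gets prepended
```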
CM3: A Causal Masked Multimodal Model of the Internet - “We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of their original positions. The causal masking objective provides a type of hybrid of the more common causal and masked language models, by enabling full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3 models can generate rich structured, multi-modal outputs while conditioning on arbitrary masked document contexts, and thereby implicitly learn a wide range of text, image, and cross modal tasks.”
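The span-masking transformation itself is easy to picture on a plain token sequence. Below is a hypothetical, simplified sketch (a single randomly chosen span and sentinel, rather than the paper’s full procedure over HTML documents): the span is cut out, replaced by a sentinel, and regenerated at the end, where the model has already seen context on both sides of it.

```python
import random

def causal_mask(tokens: list[str], sentinel: str = "<mask:0>") -> list[str]:
    """Cut one contiguous span out of the sequence, replace it with a sentinel,
    and move it to the end, so a left-to-right model generates it with
    bidirectional context already in view."""
    span_len = random.randint(2, max(2, len(tokens) // 3))
    start = random.randint(0, len(tokens) - span_len)
    span = tokens[start:start + span_len]
    masked = tokens[:start] + [sentinel] + tokens[start + span_len:]
    return masked + [sentinel] + span

random.seed(0)
print(causal_mask("the cat sat on the mat in the sun".split()))
```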
Closing Thoughts
Have something to say about this edition’s topics? Shoot us an email at gradientpub@gmail.com and we will consider sharing the most interesting thoughts from readers in the next newsletter! If you enjoyed this piece, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient, and consider sharing it!