Gradient Update #7: Deepfake Voice Cloning and How Transformers See the World

In which we discuss the present and future effects of deepfakes and the different ways that transformers and CNNs store representations of the data they process.

Welcome to the seventh update from the Gradient! If you were referred by a friend, subscribe and follow us on Twitter!

Recently From The Gradient

An Introduction to AI Story Generation
Gradient Community Discussion Thread #1
Systems for Machine Learning
Alexander Veysov on Self-Teaching AI and Creating Open Speech-To-Text

News Highlight: Audio Deepfakes

This edition’s news story is AI deepfakes of Anthony Bourdain's voice are only a taste of what's coming.

Summary In his latest documentary, Roadrunner, director Morgan Neville pays homage to the life and works of renowned American chef and journalist Anthony Bourdain. Although the film was well received, Neville faced backlash when he revealed in an interview that artificial intelligence was used to synthesize 45 seconds of Bourdain’s voice for several quotes in the documentary. The generated audio is used as a voice-over, making it easy to mistake for one of the recorded clips used throughout this and other documentaries. While Neville’s choice to synthesize Bourdain’s voice may or may not have been a step too far, it certainly speaks to the fidelity of the technology that almost nobody noticed the deepfake audio until Neville went public with the news.

Background Deepfakes are audio clips, images, or videos that are synthesized using artificial intelligence. The term was coined in 2017 by a Reddit user who was using the technology to create fake pornographic content by editing celebrities' faces into pornographic videos. This remains one of the most common ways in which deepfakes are used today. Other common applications include augmenting facial features and emotions and synthesizing entirely new faces. When deepfakes first emerged, their usage was almost entirely limited to hobbyists generating pornographic content. In fact, as recently as 2019, an estimated 96% of all publicly posted deepfake videos were pornographic. However, as the technology has matured, it has begun to be commercialized for use in sports ads, TV shows, political campaigns, and more. The use of an audio deepfake in Roadrunner represents a continuation of this trend.

Why Does it Matter? Anthony Bourdain’s deepfake voice in Roadrunner is the first prominent use of deepfakes in a high-profile film release. Until now, the technology has largely been used for entertainment and leisure, but the 45-second deepfake in the documentary cements the idea that the technology can have concrete and practical uses too. In terms of advantages, deepfakes like Bourdain’s decrease our reliance on voice actors, since they present a quicker and cheaper alternative in such cases. Sonantic, a British startup that clones voices for actors and studios using AI, is working expressly in this domain. The startup helped Hollywood actor Val Kilmer recreate his voice. Kilmer had lost his natural voice to throat cancer, and it had become increasingly difficult to understand him. Sonantic worked with the actor, dug up old recordings of him, and helped him regain his voice.

While these are tangible benefits arising from the advent of deepfakes, the same applications unfortunately also pose a number of risks and quandaries. These range from the ethical and moral issues that sprang up around the posthumous Bourdain documentary to the legal issues surrounding deepfake use. Tech giants across the world are also beginning to recognize how deepfakes can be exploited to spread disinformation on the internet. In 2019, Facebook launched the Deepfake Detection Challenge to build AI models for flagging deepfakes on the internet. The winner of that challenge achieved only 65% accuracy at correctly flagging deepfakes, highlighting how difficult they are to spot for both humans and AI. Recreating voices isn’t without its fair share of controversy either. Recently, Canadian voice-over performer Bev Standing filed a lawsuit against TikTok in which she accused the app of using her voice without permission. She states that TikTok used her old recordings to create a text-to-speech model that is baked into the app. Copyright issues aside, news anchors, celebrities, politicians, and other prominent media personalities have also expressed concerns over the use of deepfakes for defamation and slander.

What Do Our Editors Think?

Andrey: Deepfakes have been a big topic of discussion for years, but have yet to have any real impact. It appears this is just now starting to change, with stories like this demonstrating that their use will become more frequent in the coming years. Personally, I am excited to see this happen, as it’ll give film and TV directors an expanded toolbox to make art with. Of course, negative uses of deepfakes will likely also become more frequent. Deepfakes are really just a new technology for making special effects, and like any technology they will have both positive and negative impacts. Personally, I am not very worried about that, as ‘shallowfakes’ have turned out to still be far more harmful.

Hugh: In the long run, I do not think deepfakes are going to be a big issue. In an old Gradient perspective, I described how people currently “believe” the picture / video / audio evidence they perceive online because the knowledge that it can be generated is not yet widespread. As these technologies mature, people will slowly start to understand that not everything they see is real. Deepfakes will eventually be the new Photoshop. Not harmless (and also useful for many productive purposes), but highly unlikely to have a dramatic impact in the long run.

Daniel: I somewhat agree with Hugh here, although I think “people will slowly start to understand that not everything they see is real” is doing a lot of work. Deepfakes and synthetic media generally are bound to get more realistic and easier to use, which has clear implications for the information available online. I think the argument that synthetic media alone is not responsible for attacks on our epistemic commons is a fair one, but the incremental impact of synthetic media is probably dependent on our ability to tease fact from fiction. While I and many people I know are at least aware of deepfakes, I don’t pretend to know how many social media / internet users in general have that same level of awareness. We’ve already seen real-life consequences from disinformation, and synthetic media could exacerbate that, although I admit this remains a more theoretical concern. Overall, I’d paint my opinion as a picture of uncertainty. I hope Hugh’s right, but I can see a worse picture as well.

Paper Highlight: How Transformers See the World

This edition’s paper highlight is Do Vision Transformers See Like Convolutional Neural Networks?

Summary Researchers from Google Brain, aware of the tremendous success Vision Transformers (ViT) have had in solving computer vision tasks, attempt to answer a fundamental question: “How are Vision Transformers solving these tasks?” They compare and contrast two variants of ViT with two popular Convolutional Neural Network (CNN) models. By analyzing internal representations and running probing experiments, they show the roles that pretraining, skip connections, and self-attention play in the learning process of ViT, compared to the role of convolutions in CNNs.

Why does it matter? Understanding how Vision Transformers solve tasks compared to CNNs will be crucial for future innovation in computer vision. While it may seem like Transformers are a natural fit for all tasks, understanding the role that each component plays, compared to other methods, is crucial for properly assessing an architecture's strengths and weaknesses on a particular problem. These lessons can also be relevant across domains, not just in computer vision. In particular, the roles that pretraining, skip connections, and self-attention play in ViT closely parallel their roles in Natural Language Processing (NLP), creating opportunities to carry ideas across domains.

Pretraining - Over the last decade, pretraining has rightfully had its moment in the spotlight. From the initial tremendous success of pretraining on ImageNet for transfer learning, through BERT’s state-of-the-art results on 11 natural language processing tasks, to most recently MLP-Mixer, which does away with convolutions and attention in favor of plain MLPs while still relying on large-scale pretraining, pretraining has been a key component of some of the most successful and prominent machine learning models. It should come as no surprise that for ViT, the researchers found that larger models need more and more data during the pretraining phase to learn similar representations.

Skip Connections - Like pretraining, skip connections have also shown a tremendous amount of success and versatility. From the original ResNet to the very recent Perceiver IO, skip connections are a frequent contributor to state-of-the-art results. In the context of ViT, the researchers found that removing a single random skip connection leads, on average, to a 4% loss of accuracy. This is further exemplified by their figure showing the role that skip connections play in the learning of representational structures.
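To make the idea concrete, here is a minimal sketch of what a skip connection does, written in plain Python. It is not the paper's code: the toy layer, the function names, and the input values are all assumptions chosen for illustration. A skip connection simply adds a block's input back onto its output, so the input still passes through even when the layer itself contributes nothing.

```python
import math


def toy_layer(x, weights):
    """Hypothetical layer for illustration: a linear map followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def residual_block(x, weights, use_skip=True):
    """Return toy_layer(x) + x when use_skip is True, else just toy_layer(x)."""
    out = toy_layer(x, weights)
    if use_skip:
        # The skip connection: add the block's input to its output.
        return [o + xi for o, xi in zip(out, x)]
    return out


x = [1.0, -2.0, 0.5]
# A layer with all-zero weights outputs zeros, i.e. it has learned nothing.
zero_weights = [[0.0] * 3 for _ in range(3)]

with_skip = residual_block(x, zero_weights, use_skip=True)
without_skip = residual_block(x, zero_weights, use_skip=False)

print(with_skip)     # the input is carried through unchanged
print(without_skip)  # the input is lost entirely
```

In this toy setting, removing the skip connection discards the input signal completely, which gives some intuition for why dropping one in a deep model can hurt accuracy.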

Self Attention - The self-attention mechanism in ViT leads to learned representations that differ quite a bit from those learned by CNNs. The ViT representations are largely uniform throughout the model, whereas the representations learned by CNNs differ greatly depending on where in the layer stack they occur. Anyone leveraging hidden layers for downstream tasks should pay close attention to the choice of layers and aggregation functions for the model architecture at hand. One could expect comparable performance from a ViT regardless of which layer(s) are chosen, whereas representations from a CNN would differ depending on whether they are taken from an early or a late layer.
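One way to put a number on how similar two layers' representations are is a similarity index such as linear centered kernel alignment (CKA). The summary above does not name a metric, so treat the choice of linear CKA here as an assumption, and the function names and toy activation values as hypothetical. The sketch takes two activation matrices (rows are examples, columns are features) and returns a score between 0 and 1.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered inputs."""
    x, y = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return cross_frobenius_sq(x, y) / denom


# Toy activations for four examples (made-up numbers, not real model outputs).
layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
layer_b = [[2.0 * v for v in row] for row in layer_a]  # same structure, rescaled
layer_c = [[1.0], [-1.0], [0.0], [0.0]]
layer_d = [[0.0], [0.0], [1.0], [-1.0]]                # uncorrelated with layer_c

print(linear_cka(layer_a, layer_b))  # close to 1: rescaling does not change similarity
print(linear_cka(layer_c, layer_d))  # 0: the two features share no linear structure
```

Computing such a score for every pair of layers gives a similarity grid: by the description above, a ViT's grid would look fairly uniform, while a CNN's would show early and late layers diverging.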

What Do Our Editors Think?

Andrey: Vision transformers have been all the rage recently, so it’s nice to see research being done to understand them better, especially with respect to the far more established model of CNNs. None of the results here seem hugely surprising or enlightening, but they are still informative and seem like a good basis for follow up research.

Justin: One thing that took me by surprise was seeing uniform attention representations agnostic to where in the layer stack we are inspecting. Specifically, for BERT in NLP we know that “The final layers are more task specific” and I was preconditioned to believe that would happen here as well. It’s always great to be surprised by the literature and I am looking forward to future work which can hopefully elucidate this area a bit more. 


Elon Musk unveils 'Tesla Bot,' a humanoid robot that would be made from Tesla's self-driving AI “Musk unveiled the "Tesla Bot," a 5-foot-8, 125-pound robot that would have a screen where its face should be that would present information. According to the CEO, the humanoid robot would be capable of deadlifting 150 pounds and carrying about 45 pounds, though it would travel at only about 5 mph.”

Reddit user reverse-engineers what they believe is Apple's solution to flag child sex abuse “Reddit user u/AsuharietYgvar is confident that they have uncovered Apple's NeuralHash algorithm, which will combat child sexual abuse, deep in iOS' source code. A GitHub repo contains their findings.” Six hours after the repository was published, Cory Cornelius (@dxoigmn on GitHub) posted a hash collision using it.

Without Code for DeepMind's Protein AI, This Lab Wrote Its Own. “...a team led by David Baker, director of the Institute for Protein Design at the University of Washington, released their own model for protein structure prediction. For a month, this model, called RoseTTAFold, was the most successful protein prediction algorithm that other scientists could actually use....”

Twitter AI Bias Contest shows beauty filters hoodwink the algorithm. “A researcher at Switzerland's EPFL technical university won a $3,500 prize for determining that a key Twitter algorithm favors faces that look slim and young and with skin that is lighter-colored or with warmer tones. Twitter announced on Sunday it awarded the prize to Bogdan Kulynych, a graduate student examining privacy, security, AI and society....”

Boston Dynamics' robots can parkour better than you - "Don't expect an easy getaway if one of Boston Dynamics' Atlas robots ever chases you down. The Hyundai-owned firm has shared a video (below) of the humanoid bots successfully completing a parkour routine in an obstacle course for the first time."


Improving Contrastive Learning by Visualizing Feature Transformation. In this paper, we attempt to devise a feature-level data manipulation, differing from data augmentation, to enhance the generic contrastive self-supervised learning. To this end, we first design a visualization scheme for pos/neg score, which enables us to analyze, interpret and understand the learning process. We gain some significant observations, which inspire our novel Feature Transformation proposals including the extrapolation of positives ... Besides, we propose the interpolation among negatives, which provides diversified negatives and makes the model more discriminative. Experiment results show that our proposed Feature Transformation can improve at least 6.0% accuracy on ImageNet-100 over MoCo baseline, and about 2.0% accuracy on ImageNet-1K over the MoCoV2 baseline. Transferring to the downstream tasks successfully demonstrate our model is less task-bias. 

Paint Transformer: Feed Forward Neural Painting with Stroke Prediction. Different from previous methods, in this paper, we formulate the task as a set prediction problem and propose a novel Transformer-based framework, dubbed Paint Transformer, to predict the parameters of a stroke set with a feed-forward network. … Experiments demonstrate that our method achieves better painting performance than previous ones with cheaper training and inference costs.

Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications. In this paper, we analyze CLIP and highlight some of the challenges such models pose. CLIP reduces the need for task-specific training data, potentially opening up many niche tasks to automation. CLIP also allows its users to flexibly specify image classification classes in natural language, which we find can shift how biases manifest. Given the wide and unpredictable domain of uses for such models, this raises questions regarding what sufficiently safe behaviour for such systems may look like. These results add evidence to the growing body of work calling for a change in the notion of a 'better' model--to move beyond simply looking at higher accuracy at task-oriented capability evaluations, and towards a broader 'better' that takes into account deployment-critical features such as different use contexts, and people who interact with the model when thinking about model deployment.

Program Synthesis with Large Language Models. This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models on two new benchmarks, MBPP and MathQA-Python. On both datasets, we find that synthesis performance scales log-linearly with model size. … Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution.

On the Opportunities and Risks of Foundation Models. This report provides a thorough account of the opportunities and risks of foundation models (e.g., BERT, DALL-E, GPT-3), ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). … Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.


Closing Thoughts

Have something to say about this edition’s topics? Shoot us an email and we will consider sharing the most interesting thoughts from readers in the next newsletter! If you enjoyed this piece, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!