Tesla AI Day 2022; Meta Universal Speech Translator; Google AI@22; NN Zero-to-hero
Vision Geek AI Newsletter #7
Tesla AI Day 2022
Tesla Bot
Last year, at their AI Day event, Elon Musk announced that Tesla would build a humanoid robot and revealed a concept design for the Tesla Bot. “Tesla AI Day 2022” was held recently, and the team showcased the prototypes they have been working on. There are two variants: Bumblebee and Optimus.
The team started off with Bumblebee, which uses readily available off-the-shelf hardware components, and then moved on to Optimus, whose hardware components are all designed in-house by Tesla, giving them far more control and better performance.
Most of the software built for Autopilot in the cars has been repurposed for the Bot. We have seen impressive humanoid robots from other companies in the past, but they are very expensive and made in small quantities.
What makes the Tesla Bot interesting is that Tesla plans to build these bots in large enough quantities that the cost of a bot becomes lower than that of a car, hopefully less than $20,000. They already have the hardware, the AI software and real-world data from their cars. Tesla has the potential to make humanoid robots mainstream. We just have to wait and see if they can pull it off.
Autopilot
The Autopilot team gave a glimpse of how they are handling complex real world scenarios. They are slowly rolling out FSD (Full Self Driving) Beta to more and more customers.
One interesting thing about Tesla is that they don't use LiDAR at all in their cars; everything is done using the 8 cameras on the vehicle, which is quite hard. Most other self-driving car companies rely on expensive LiDARs.
They showcased the Occupancy Network they use to predict, in 3D, what is present in the scene around the vehicle, and how they are taking inspiration from language modelling to tackle lane prediction at complex intersections.
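To make the idea of 3D occupancy concrete, here is a toy sketch of a sparse voxel occupancy grid, the kind of representation such a network predicts. This is purely illustrative and is not Tesla's actual model, which is a learned neural network operating on camera features; the class and method names are my own.

```python
from collections import defaultdict

class OccupancyGrid:
    """Toy sparse 3D occupancy grid: world-space points are bucketed
    into fixed-size voxels, and each voxel keeps an observation count."""

    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size
        self.counts = defaultdict(int)

    def _voxel(self, point):
        # Map a continuous (x, y, z) point to integer voxel coordinates.
        return tuple(int(c // self.voxel_size) for c in point)

    def add_point(self, point):
        self.counts[self._voxel(point)] += 1

    def is_occupied(self, point, threshold=1):
        # A voxel counts as "occupied" once enough observations land in it.
        return self.counts[self._voxel(point)] >= threshold

grid = OccupancyGrid(voxel_size=0.5)
grid.add_point((1.2, 0.3, 0.1))            # e.g. a point on a parked car
print(grid.is_occupied((1.4, 0.4, 0.2)))   # True: falls in the same voxel
print(grid.is_occupied((5.0, 5.0, 5.0)))   # False: empty space
```

The appeal of an occupancy representation is that downstream planning only needs to know which regions of space are free or blocked, regardless of what object class occupies them.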
They have developed their own compiler, file format (.smol) and neural network accelerator (TRIP Engine) to make the best use of their FSD hardware. They also use simulation to create 3D scenes and generate training data for specific scenarios to improve model accuracy, completely automated with no 3D artists involved. It's fascinating to see how the team optimizes every single aspect to achieve better performance and accuracy.
Dojo Supercomputer
Tesla receives tens of thousands of real-world video clips from the cameras on its fleet every day. Training machine learning models on this ever-growing dataset is no easy task.
Tesla uses a cluster of 14,000 GPUs: 10,000 for model training and 4,000 for auto-labeling. Even with this massive GPU cluster, it takes months to train the models on such huge datasets.
The team has built a supercomputer named "Dojo" to tackle this problem. Designing the entire hardware and software stack from the ground up has given them a huge boost in performance at a fraction of the usual GPU cost. The same models can now be trained in less than a week instead of months.
From building a completely new compiler, to creating a custom protocol (TTP - Tesla Transport Protocol), to building a custom network interface card (DNIC - Dojo Network Interface Card), the team has innovated at every layer. Truly inspiring work.
Meta Universal Speech Translator
Eight months ago, Meta announced its “Universal Speech Translator” project, which aims to develop new AI methods that will allow real-time speech-to-speech translation across many languages. Now the team has come up with an interesting demo.
So far, AI-powered speech translation systems have typically been cascaded: speech is first converted to text, NLP (Natural Language Processing) models translate that text into the destination language, and the translated text is then converted back to speech.
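The cascade above can be sketched as three stages feeding into one another. The stage functions below are hypothetical stubs of my own; a real system would plug in trained ASR, MT, and TTS models in their place.

```python
def speech_to_text(audio):
    # ASR stage: transcribe source-language speech into text.
    return f"transcript({audio})"

def translate_text(text, target_lang):
    # MT stage: translate the transcript into the target language.
    return f"{target_lang}:{text}"

def text_to_speech(text):
    # TTS stage: synthesize speech from the translated text.
    return f"audio({text})"

def cascaded_translation(audio, target_lang="en"):
    # Each stage feeds the next, so errors compound across stages, and
    # the whole cascade fails when the source language has no standard
    # written form for the ASR stage to transcribe into.
    return text_to_speech(translate_text(speech_to_text(audio), target_lang))

print(cascaded_translation("hello.wav"))
# → audio(en:transcript(hello.wav))
```

Seeing the pipeline laid out this way makes clear why an unwritten language like Hokkien breaks it: the intermediate text representation simply does not exist.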
This approach works well for languages that have a standard writing system. But nearly half of the world’s 7,000+ living languages are primarily oral and have no standard or widely used writing system.
This makes it impossible to build machine translation tools using standard techniques, which require large amounts of written text in order to train language models.
To address this challenge, Meta’s AI team has built the first AI-powered speech-to-speech translation system for Hokkien, a primarily oral language that’s widely spoken in China, Taiwan and a few other countries but lacks a standard written form.
The team has developed a variety of novel approaches and systems to achieve this. Meta is open-sourcing the translation models, evaluation datasets and research papers so that others can reproduce and build on their work.
Though the Hokkien translation model is still a work in progress and can only translate one full sentence at a time, it has the potential to let anyone communicate with anyone else, anywhere in the world, in their own native language. Exciting times we live in!
Google AI@22
At the recent Google event “AI@22”, their research team shared recent advancements in generative modelling. There has been a lot of improvement, particularly in image and video generation from text.
“Imagen” and “Parti” are two models the team has built with slightly different approaches to generating images from text. The results from these models are crisp and of very high quality. We have definitely come a long way from generating low-resolution handwritten digits with MNIST.
Now that we can generate realistic, high-definition images from text, the natural next step is generating videos from text. “Imagen Video” and “Phenaki” are two such attempts at creating consistent videos from text.
Taking inspiration from language modelling to solve computer vision problems has become common; “Parti” and “Phenaki” are two such examples. Text-to-video models are still in their infancy, but the revolution has definitely started.
NN Zero-to-hero
In case you didn't notice, the Autopilot session at the Tesla AI Day 2022 event was not presented by Andrej Karpathy, who led the AI team at Tesla until recently.
It seems that after a four-month break from work, Andrej decided to leave Tesla. Replying to his tweet, Elon Musk said, "Thanks for everything you have done for Tesla! It has been an honor working with you."
Andrej has mentioned that he has no concrete plans for what’s next, but looks forward to spending more time on his long-term passions around technical work in AI, open source and education. He is currently creating a course called “Neural Networks: Zero to Hero”, which he is making available to everyone for free.
It is hosted on GitHub: the lectures are available as YouTube videos, and the coding exercises as Google Colab notebooks. The course is still in progress, but some of the videos and notebooks are already available.
PyImageSearch got acquired
PyImageSearch is one of the most widely read blogs in the field of Computer Vision and Deep Learning. If you are a computer vision practitioner, you have most probably read some PyImageSearch posts in your learning journey.
Adrian Rosebrock is the core author and owner of the blog. It seems that PyImageSearch was acquired last year. The details of the acquisition have not yet been publicly disclosed.
In his new venture, Info Product Mastery, Adrian mentions:
“I launched my company, PyImageSearch.com, in 2014. By 2017 I had grown it to 7 figures per year in revenue by selling eBooks and online courses I had created. In 2021 PyImageSearch was acquired for a life changing exit. I’m here to share my experiences so you can learn from the mistakes I’ve made while building and growing your own info product business.”
The podcast aims to help developers, educators, and entrepreneurs launch and grow their online education businesses, whether they are just looking to create a passive income stream or to build a full-time living.
AI fun :)
Support this newsletter ❤️
If you are getting value out of my work, consider supporting me on Patreon and unlock exclusive benefits.