Tesla vs. OpenAI: Dueling Paths to AI’s Physical Understanding

With Sora’s launch, Elon Musk may find himself grappling with mixed emotions. This isn’t only because of his own early entanglement with OpenAI, but also because Sora is a tangible realization of the direction Tesla has been charting over the past few years.

On February 18th, Musk left a comment beneath a video titled “OpenAI’s revelation confirms Tesla’s trajectory” by technology commentator @Dr.KnowItAll, remarking that Tesla can already generate real-world video with accurate physics, and that the video he was referencing was roughly a year old.

Subsequently, he shared a 2023 video on X in which Ashok Elluswamy, Tesla’s director of Autopilot software, explains how Tesla uses AI to simulate real-world driving. The footage showed the AI generating seven driving scenarios from different perspectives simultaneously, steered by simple directional prompts such as “go straight” or “change lanes.”

However, it would be wrong to conclude that Tesla had mastered Sora’s technology a year earlier. Tesla’s generative system was confined to simulating vehicular motion, while Sora handles far richer information: environments, scenes, prompts, and physical laws. The two differ markedly in complexity.

Nevertheless, the training philosophies behind Tesla’s AI and Sora are the same: the goal is not merely to train AI to generate video, but to give it a deep enough understanding to faithfully simulate real-world environments. Video is simply a window for observing those environments across time, space, and perspective. Although the two companies operate in different businesses, they converge in their pursuit of AGI (artificial general intelligence), or more precisely, embodied intelligence and autonomous agents.

The key to this view is recognizing that OpenAI’s mandate for Sora extends beyond video generation: it uses video generation as a “simulator” to help AI understand the tangible world. Where Tesla’s fleet of millions still needs a physical embodiment to interface with the world, Sora relies on data inputs alone to construct a cognitive model of reality.

OpenAI’s official write-up calls this line of research “Video generation models as world simulators,” underscoring the primacy of “world simulation” over mere video generation.

In fact, Tesla, though primarily a consumer automaker, had already demonstrated similar capabilities with the release of FSD V12.

How so? With FSD V12, Tesla’s engineers deleted over 300,000 lines of code spelling out driving rules. Instead, the system learns real-world driving behavior from “fed” driving videos, departing from the conventional rule-based framework. Unlike Sora, which is a generative model, FSD aims at autonomous driving and has no need to output specific videos. Think of it as a driver (or an agent) practicing defensive driving: it extrapolates future traffic dynamics from past experience without needing to visualize them explicitly. Tesla’s FSD therefore does not produce videos of the future; it projects them internally.
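The contrast above can be sketched in code. This is a hypothetical illustration, not Tesla's implementation: the rule-based policy stands in for the hand-written driving logic that V12 removed, and `LearnedPolicy` stands in for an end-to-end network that maps camera pixels directly to a control action; all names, shapes, and the toy linear model are invented for the sketch.

```python
import numpy as np

def rule_based_policy(obs: dict) -> str:
    # Pre-V12 style: explicit hand-coded driving rules
    # (in reality, hundreds of thousands of lines of them).
    if obs["light"] == "red":
        return "stop"
    if obs["lead_car_distance"] < 10.0:
        return "brake"
    return "cruise"

class LearnedPolicy:
    # V12-style idea: a single model maps camera pixels to a control action,
    # trained on human driving video. A toy linear model stands in for the
    # real neural network here.
    def __init__(self, n_pixels: int, n_actions: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.01, size=(n_pixels, n_actions))

    def act(self, frame: np.ndarray) -> int:
        logits = frame.flatten() @ self.w
        return int(np.argmax(logits))  # index of the chosen control action

policy = LearnedPolicy(n_pixels=64 * 64, n_actions=3)
frame = np.zeros((64, 64))            # stand-in camera frame
print(rule_based_policy({"light": "red", "lead_car_distance": 50.0}))  # stop
print(policy.act(frame) in (0, 1, 2))                                  # True
```

The point of the sketch is the interface, not the model: in the first approach engineers author every behavior; in the second, behavior emerges from the training data, which is why data curation matters so much.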

OpenAI and Tesla are thus distinct companies using distinct methods to converge on a shared objective: teaching AI to understand the physical world through video.

A look at Sora’s technical framework reveals its fusion of two of the most influential models of recent years: the Transformer and the diffusion model. Language models such as ChatGPT, Gemini, and LLaMA are built on the Transformer, which tokenizes text and predicts the next token, while diffusion models underpin text-to-image generation.
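A minimal, schematic sketch of that fusion, under the assumptions in OpenAI's public write-up (video cut into spacetime "patches" that serve as tokens, with a Transformer trained to denoise them diffusion-style). Nothing here is Sora's actual code; the patch size, the single-head attention, and the fixed denoising step are all illustrative stand-ins.

```python
import numpy as np

def patchify(video: np.ndarray, p: int = 4) -> np.ndarray:
    """Split a (T, H, W) video into flattened spatial patches (tokens)."""
    T, H, W = video.shape
    patches = video.reshape(T, H // p, p, W // p, p).transpose(0, 1, 3, 2, 4)
    return patches.reshape(-1, p * p)  # (num_tokens, patch_dim)

def attention(tokens: np.ndarray) -> np.ndarray:
    """Single-head self-attention: every patch attends to every other patch,
    across frames as well as within them, which is what lets a Transformer
    keep a scene consistent over time."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

def denoise_step(noisy_tokens: np.ndarray) -> np.ndarray:
    """One schematic reverse-diffusion step: predict noise and subtract a bit.
    The attention call stands in for the full denoising network."""
    predicted_noise = attention(noisy_tokens)
    return noisy_tokens - 0.1 * predicted_noise

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16, 16))   # 8 frames of 16x16 "video" noise
tokens = patchify(video)
print(tokens.shape)                    # (128, 16)
out = denoise_step(tokens)
print(out.shape == tokens.shape)       # True
```

Generation then amounts to starting from pure noise tokens and applying many such denoising steps, with the Transformer supplying the global, cross-frame context at every step.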

Viewed through the lens of “world comprehension,” what matters is not just the visual quality of individual frames but how they relate to one another. Even the 60-second clips showcased on the official website are a side show. What truly distinguishes Sora is its consistency under editing: spatial coherence is maintained across camera perspectives, whether wide-angle, medium shot, close-up, or extreme close-up, lending the footage an unusual sense of realism.

This parallels Tesla’s “pure vision” approach to FSD. Conventional wisdom dictates including lidar to gauge spatial relationships, but Musk deleted those 300,000-plus lines of code and even dropped radar, relying solely on high-definition cameras and neural networks to perceive space.

This poses the same formidable challenge for both Tesla and OpenAI: the input imagery is two-dimensional, but the output, whether driving instructions or video, requires a genuine understanding of three-dimensional space.
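Why is this hard? The pinhole camera model below makes the ambiguity concrete: a single 2D pixel only pins down a 3D viewing ray, not a point, so depth must be inferred by the network rather than read off the image. The intrinsic matrix values and pixel coordinates are made up for illustration.

```python
import numpy as np

def backproject(u: float, v: float, depth: float, K: np.ndarray) -> np.ndarray:
    """Lift pixel (u, v) at an assumed depth back into 3D camera coordinates."""
    pixel_h = np.array([u, v, 1.0])   # homogeneous pixel coordinates
    ray = np.linalg.inv(K) @ pixel_h  # direction of the viewing ray
    return depth * ray                # any positive depth gives a valid 3D point

# Toy intrinsics: focal length 500 px, principal point at the image center.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# The same pixel at two different depths yields two different 3D points:
p_near = backproject(400, 300, depth=5.0, K=K)
p_far = backproject(400, 300, depth=50.0, K=K)
print(np.allclose(p_far, 10 * p_near))  # True: the image alone cannot tell them apart
```

Whether the system outputs steering commands or new video frames, it has to resolve exactly this depth ambiguity from learned priors, which is what “understanding three-dimensional space” means in practice.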

The efficacy of model training hinges on both data scale and quality. Tesla gathers data from sensor-equipped vehicles driving real roads, whereas OpenAI’s data, per current public disclosures, is predominantly sourced from the internet. On quality, Walter Isaacson’s biography “Elon Musk” recounts Tesla’s collaboration with Uber to procure data from “five-star drivers” for FSD training, while Altman’s recent push to secure trillions in funding underscores the colossal investment in computational infrastructure required.

Returning to the initial query: why liken Sora to FSD V12? What vistas of future innovation do Sora and OpenAI herald? And how do they intersect with AGI?

According to Musk, AGI will materialize once artificial intelligence masters domains such as physics, mathematics, and chemistry. But there is another dimension: embodied intelligence. The real world is more than mathematical formulations and written rules; even creatures with modest cognitive faculties, like kittens and puppies, navigate and interact with their surroundings through movement. Such feats were previously out of reach for AI confined to two-dimensional inputs. That is why Musk credits Sora with the potential for profound real-world impact.

Much as Tesla harnesses its generative capabilities to train vehicles, Sora’s value transcends mere video generation or service as a tool for cinematic productivity (important as those applications are). As Zhou Hongyi puts it, “Sora serves as a litmus test, not only showcasing video production prowess but also heralding novel breakthroughs as large-scale models comprehend and simulate reality.”
