Vasu Sharma

Head of AI at Pocket FM · ex-Tesla Optimus, Meta FAIR, Citadel

Building causal, dynamic, real-time world models for humanoid robotics. Passionate about multimodal foundation models, efficient video generation, and self-supervised learning. 100+ papers at NeurIPS, CVPR, ICLR, ACL and others, with 15k+ citations.

EB-1A Green Card recipient · Guest lecturer at Stanford, CMU, MIT, Oxford, and O'Reilly


100+ Publications · 15k+ Citations · 11+ Years Experience

Top AI Labs

Experience

Head of Artificial Intelligence · Pocket FM

2026 — Present · San Francisco Bay Area

Leading AI at Pocket Entertainment, a Lightspeed-backed startup with 300M+ users. Built and scaled a 60-person AI org across GenAI Research, Applications, and Personalization.

Developing custom LLMs and diffusion models for long-form, multimodal storytelling

Creating the world's best creative writing model for long-form content generation

Architecting agentic content generation pipelines ingesting billions of tokens monthly

Building the most performant multilingual narrative TTS model

Senior Staff ML Scientist & World Models Tech Lead · Tesla Optimus

2025 — 2026 · Palo Alto, CA

Built causal, real-time, dynamic world models for closed-loop RL training in humanoid robotics.

Designed the flagship multimodal foundational model (TRON) for Optimus and Tesla FSD

Improved DiT-based video generation throughput by ~100x via efficient attention

Enhanced visual quality on Wan3.x video generation with a high-motion MoE expert

Created a high-fidelity auto-labelling pipeline for automated annotations at scale

Applied Research Scientist Lead · Meta (FAIR / MSL)

2022 — 2025 · Menlo Park, CA

Worked with Facebook AI Research as part of Meta Superintelligence Labs on large-scale multimodal foundational models.

DINOv2: Learning Robust Visual Features without Supervision (TMLR)

Chameleon: Mixed-Modal Early-Fusion Foundation Models

CM3Leon: Scaling Autoregressive Multi-Modal Models (NeurIPS)

MetaCLIP: Demystifying CLIP Data (ICLR 2024)

MaViL: Masked Audio-Video Learners (NeurIPS 2023)

Data, post-training and evals for Llama 3 and Llama 4

Quantitative Researcher · Citadel

2019 — 2024 · Chicago, IL

Leveraged machine learning and statistical methods to model financial markets and time-series data at scale.

Applied Scientist · Amazon Lab126 (Alexa AI)

2021 — 2022 · Sunnyvale, CA

Worked on multimodal models, embodied AI, and the Alexa Prize SimBot Challenge for visual-language navigation.

Alexa Arena: Interactive platform for embodied AI (NeurIPS 2023)

CHMARL: Cooperative heterogeneous multi-agent RL benchmark (RSS 2022)

ε-ViLM: Efficient Video-Language Model (WACV 2024)

Graduate Research Assistant (PhD Dropout) · Carnegie Mellon University

2017 — 2019 · Pittsburgh, PA

Worked with Prof. Louis-Philippe Morency on multimodal machine learning, adversarial attacks on VQA, facial landmark detection, and natural visual perception.

Research Interests

My work spans the intersection of vision, language, speech, and robotics — pushing the boundaries of what AI systems can perceive, generate, and act upon.

World Models

Causal, dynamic, real-time world models for closed-loop RL training in humanoid robotics.

Multimodal Foundation Models

Large-scale models spanning vision, language, speech, and audio, trained on trillion-token datasets.

Video Generation

Efficient DiT-based video generation backbones with ~100x throughput improvements.

Robotics Foundation Models

Building generalization capabilities across diverse real-world robotic use cases.

Self-Supervised Learning

DINOv2, MetaCLIP, MaViL — learning robust representations without supervision.

Embodied AI

Visual-language navigation, multi-agent RL, and robot intelligence.

Selected Publications

A selection from 100+ papers across NeurIPS, CVPR, ICLR, ACL, EMNLP, TMLR, COLM, NAACL, WACV, Interspeech and more.

Education

M.S. Machine Learning & AI (PhD Dropout)

Carnegie Mellon University

MLT at the Language Technologies Institute (LTI)

GPA: 4.19/4.33 (Department Rank 1)

B.Tech. in Computer Science & Engineering

Indian Institute of Technology, Kanpur

GPA: 9.99/10.0

JEE All India Rank: 165 (out of more than 1M candidates)

Musings

Reflections on AI research, building at scale, and lessons from the frontier.

“Really proud to share the first two accepted papers from the team’s research efforts at ICML 2026 (Culture x AI). Both papers sit in a space that has received far less attention from mainstream AI research: narratology, long-form storytelling, cultural adaptation, and how humans sustain emotional engagement across thousands of interactions with content.”

Jun 2025 · Read on LinkedIn →

“The sharpest AI memo I read this week came from someone with a lot to lose. The durable advantage will not be ‘we use the best model.’ Everyone will have access to strong models. The advantage will come from whether your company can turn daily work into a learning system. The scary failure mode is not ‘AI replaces workers.’ It is companies outsourcing their own learning.”

Jun 2025 · Read on LinkedIn →

“There is a type of delusion that collapses when reality pushes back, and another type that keeps recruiting engineers, rewriting constraints, and shipping hardware until reality finally moves. SpaceX is the second kind. How many ‘obviously impossible’ companies are we underestimating right now because they still look delusional from the outside?”

Jun 2025 · Read on LinkedIn →

“A while ago I had delivered a course on Multimodal AI with O’Reilly! From reading O’Reilly books growing up to delivering a course for them, life truly feels like coming full circle! Teaching is something I enjoy a lot. One of the easiest and most impactful ways to give back to the community that gave me so much.”

Jan 2025 · Read on LinkedIn →

Selected Publications

DINOv2: Learning Robust Visual Features without Supervision

Chameleon: Mixed-Modal Early-Fusion Foundation Models

CM3Leon: Scaling Autoregressive Multi-Modal Models

MetaCLIP: Demystifying CLIP Data

MaViL: Masked Audio-Video Learners

Branch-Train-MiX: Mixing Expert LLMs into a MoE LLM

An Introduction to Vision-Language Modeling

Alexa Arena: Interactive Platform for Embodied AI
