Vasu Sharma

Head of AI at Pocket FM · ex-Tesla Optimus, Meta FAIR, Citadel

Building causal, dynamic, real-time world models for humanoid robotics. Passionate about multimodal foundation models, efficient video generation, and self-supervised learning. 100+ papers at NeurIPS, CVPR, ICLR, ACL and others, with 15k+ citations.

EB-1A Green Card Recipient · Guest lecturer at Stanford, CMU, MIT, Oxford, O'Reilly

100+ Publications · 15k+ Citations · 11+ Years Experience · 6+ Top AI Labs

Research Interests

My work spans the intersection of vision, language, speech, and robotics — pushing the boundaries of what AI systems can perceive, generate, and act upon.

World Models

Causal, dynamic, real-time world models for closed-loop RL training in humanoid robotics.

Multimodal Foundation Models

Large-scale models spanning vision, language, speech, and audio on trillion-token datasets.

Video Generation

Efficient DiT-based video generation backbones with 100x throughput improvements.

Robotics Foundation Models

Building generalization capabilities across diverse real-world robotic use cases.

Self-Supervised Learning

DINOv2, MetaCLIP, MaViL — learning robust representations without supervision.

Embodied AI

Visual-language navigation, multi-agent RL, and robot intelligence.

Experience

Head of Artificial Intelligence · Pocket FM

2026 — Present · San Francisco Bay Area

Leading AI at Pocket Entertainment, a Lightspeed-backed startup with 300M+ users. Built and scaled a 60-person AI org across GenAI Research, Applications, and Personalization.

Developing custom LLMs and diffusion models for long-form, multimodal storytelling
Creating the world's best creative writing model for long-form content generation
Architecting agentic content generation pipelines ingesting billions of tokens monthly
Building the most performant multilingual narrative TTS model

Senior Staff ML Scientist & World Models Tech Lead · Tesla Optimus

2025 — 2026 · Palo Alto, CA

Built causal, real-time, dynamic world models for closed-loop RL training in humanoid robotics.

Designed the flagship multimodal foundation model (TRON) for Optimus and Tesla FSD
Improved DiT-based video generation throughput by ~100x via efficient attention
Enhanced visual quality on Wan3.x video generation with high-motion MoE expert
Created high-fidelity auto-labelling pipeline for annotation at scale

Applied Research Scientist Lead · Meta (FAIR / MSL)

2022 — 2025 · Menlo Park, CA

Worked with Facebook AI Research as part of Meta Superintelligence Labs on large-scale multimodal foundation models.

DINOv2: Learning Robust Visual Features without Supervision (TMLR)
Chameleon: Mixed-Modal Early-Fusion Foundation Models
CM3Leon: Scaling Autoregressive Multi-Modal Models (NeurIPS)
MetaCLIP: Demystifying CLIP Data (ICLR 2024)
MaViL: Masked Audio-Video Learners (NeurIPS 2023)
Data, post-training and evals for Llama 3 and Llama 4

Quantitative Researcher · Citadel

2019 — 2024 · Chicago, IL

Leveraged machine learning and statistical methods to model financial markets and time-series data at scale.

Applied Scientist · Amazon Lab126 (Alexa AI)

2021 — 2022 · Sunnyvale, CA

Worked on multimodal models, embodied AI, and the Alexa Prize SimBot Challenge for visual-language navigation.

Alexa Arena: Interactive platform for embodied AI (NeurIPS 2023)
CHMARL: Cooperative heterogeneous multi-agent RL benchmark (RSS 2022)
ε-ViLM: Efficient Video-Language Model (WACV 2024)

Graduate Research Assistant (PhD Dropout) · Carnegie Mellon University

2017 — 2019 · Pittsburgh, PA

Worked with Prof. Louis-Philippe Morency on multimodal machine learning, adversarial attacks on VQA, facial landmark detection, and natural visual perception.

Selected Publications

A selection from 100+ papers across NeurIPS, CVPR, ICLR, ACL, EMNLP, TMLR, COLM, NAACL, WACV, Interspeech and more.

DINOv2: Learning Robust Visual Features without Supervision (TMLR)
Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta AI)
CM3Leon: Scaling Autoregressive Multi-Modal Models (NeurIPS)
MetaCLIP: Demystifying CLIP Data (ICLR 2024)
MaViL: Masked Audio-Video Learners (NeurIPS 2023)
Branch-Train-MiX: Mixing Expert LLMs into a MoE LLM (COLM 2024)
An Introduction to Vision-Language Modeling (arXiv)
Alexa Arena: Interactive Platform for Embodied AI (NeurIPS 2023)

View all publications on Google Scholar

Education

M.S. Machine Learning & AI (PhD Dropout)

Carnegie Mellon University

MLT at Language Technologies Institute (LTI)

GPA: 4.19/4.33 (Department Rank 1)

B.Tech. in Computer Science & Engineering

Indian Institute of Technology, Kanpur

JEE Rank 165

GPA: 9.99/10.0

Advising & Teaching

Startup Advisor

Advising multiple startup founders and VCs on AI strategy, from ideation to Series A/B/C. Domains include multimodal foundation models, world models, RL environments, video generation, robotics, and data annotation.

Guest Lectures

Regular guest lecturer at Stanford, CMU, MIT, Oxford, and O'Reilly on topics spanning multimodal AI, large-scale model training, and applied research.

AI Research Director — Algoverse

Led the development of an AI research program that helps students publish at ICML, NeurIPS, EMNLP, and EACL. Students were admitted to CMU, Stanford, MIT, and UC Berkeley, and received offers from Anthropic, TikTok, and more.

Get in Touch

Interested in collaborating on research, startup advising, or speaking engagements? Reach out on LinkedIn.