Xinlong Wang

About Me

I am a technical lead at Beijing Academy of Artificial Intelligence (BAAI), where I founded and lead the vision and multimodal research center (a.k.a. BAAI Vision). I received my PhD degree from The University of Adelaide, supervised by Prof. Chunhua Shen. Before that, I obtained my Bachelor's degree from Tongji University.

My research interests lie in the area of computer vision and foundation models. I work on visual perception (SOLO, SOLOv2), visual representation (DenseCL, EVA), visual generalist (Painter, SegGPT), multimodal representation (EVA-CLIP, Uni3D) and multimodal generalist (Emu, Emu2, Emu3).

Contact

We are always looking for full-time researchers, engineers and interns at BAAI. Feel free to send an email if you are interested!

Email: xinlong.wang96@gmail.com

News

[Apr.2025] See3D for scalable 3D generation is accepted by CVPR 2025 as Highlight. JudgeLM for scalable LLM Judges is accepted by ICLR 2025 as Spotlight.
[Sept.2024] We have released Emu3, new state-of-the-art multimodal models trained solely with next-token prediction.
[Sept.2024] EVE for encoder-free VLMs is accepted by NeurIPS 2024 as Spotlight.
[Feb.2024] Emu2 and CapsFusion are accepted by CVPR 2024.
[Feb.2024] We have released EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18 billion parameters.
[Jan.2024] Emu and Uni3D are accepted by ICLR 2024. Uni3D is selected for Spotlight presentation.
[Dec.2023] We have released Emu2, the largest open generative multimodal models, which achieve a new state of the art on multimodal understanding and generation tasks.
[Sept.2023] EVA is one of the most influential papers (7th/2359) at CVPR 2023.
[Jul.2023] SegGPT is accepted by ICCV 2023.
[Jul.2023] We have released Emu, a multimodal generalist that can seamlessly generate images and text in a multimodal context.
[Feb.2023] Painter and EVA are accepted by CVPR 2023.
[Dec.2022] We have released Painter, a generalist model using "image" as the general-purpose interface.
[Nov.2022] We have released EVA, the best 1B Vision Foundation Model to date. All the code and models are available.

Selected Publications

Emu3: Next-Token Prediction is All You Need
Emu3 Team, BAAI
arXiv, 2024
[arXiv] [code] [models] [project page]

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
Baorui Ma*, Huachen Gao*, Haoge Deng*, Zhengxiong Luo, Tiejun Huang, Lulu Tang†, Xinlong Wang†
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
Highlight (3% acceptance rate)
[arXiv] [code] [project page]

JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Lianghui Zhu, Xinggang Wang, Xinlong Wang†
International Conference on Learning Representations (ICLR), 2025
Spotlight (3% acceptance rate)
[arXiv] [code] [models]

Autoregressive Video Generation without Vector Quantization
Haoge Deng*, Ting Pan*, Haiwen Diao*, Zhengxiong Luo*, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, Xinlong Wang†
International Conference on Learning Representations (ICLR), 2025
[arXiv] [code] [models]

Unveiling Encoder-Free Vision-Language Models
Haiwen Diao*, Yufeng Cui*, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang†
Advances in Neural Information Processing Systems (NeurIPS), 2024
Spotlight (2% acceptance rate)
[arXiv] [code] [models]

Generative Multimodal Models are In-Context Learners
Quan Sun*, Yufeng Cui*, Xiaosong Zhang*, Fan Zhang*, Qiying Yu*, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang†
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[project page] [arXiv] [code] [demo] [models] [video demo]

Uni3D: Exploring Unified 3D Representation at Scale
Junsheng Zhou*, Jinsheng Wang*, Baorui Ma*†, Yu-Shen Liu, Tiejun Huang, Xinlong Wang†
International Conference on Learning Representations (ICLR), 2024
Spotlight (5% acceptance rate)
[arXiv] [code] [models]

Generative Pretraining in Multimodality
Quan Sun*, Qiying Yu*, Yufeng Cui*, Fan Zhang*, Xiaosong Zhang*, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang†
International Conference on Learning Representations (ICLR), 2024
[arXiv] [code] [demo]

SegGPT: Segmenting Everything In Context
Xinlong Wang*, Xiaosong Zhang*, Yue Cao*, Wen Wang, Chunhua Shen, Tiejun Huang
IEEE International Conference on Computer Vision (ICCV), 2023
[arXiv] [code] [demo]

Images Speak in Images: A Generalist Painter for In-Context Visual Learning
Xinlong Wang*, Wen Wang*, Yue Cao*, Chunhua Shen, Tiejun Huang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
[arXiv] [code]

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Highlight (2.6% acceptance rate)
[arXiv] [code]

FreeSOLO: Learning to Segment Objects without Annotations
Xinlong Wang, Zhiding Yu, Shalini De Mello, Jan Kautz, Anima Anandkumar, Chunhua Shen, Jose M. Alvarez
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[arXiv] [bibtex] [code]

SOLO: A Simple Framework for Instance Segmentation
Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, Lei Li
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
[arXiv] [bibtex] [demo] [code] [code@adet]

BoxInst: High-Performance Instance Segmentation with Box Annotations
Zhi Tian, Chunhua Shen, Xinlong Wang, Hao Chen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021
[arXiv] [demo] [code]

End-to-End Video Instance Segmentation with Transformers
Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, Huaxia Xia
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Oral (4.3% acceptance rate)
[arXiv] [code]

Dense Contrastive Learning for Self-Supervised Visual Pre-Training
Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, Lei Li
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Oral (4.3% acceptance rate)
[arXiv] [bibtex] [code] [usage@adet]

SOLOv2: Dynamic and Fast Instance Segmentation
Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, Chunhua Shen
Advances in Neural Information Processing Systems (NeurIPS), 2020
[arXiv] [bibtex] [demo] [code] [code@adet]

SOLO: Segmenting Objects by Locations
Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, Lei Li
European Conference on Computer Vision (ECCV), 2020
[arXiv] [bibtex] [code]

Associatively Segmenting Instances and Semantics in Point Clouds
Xinlong Wang, Shu Liu, Xiaoyong Shen, Chunhua Shen and Jiaya Jia
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019
[arXiv] [bibtex] [code]

Repulsion Loss: Detecting Pedestrians in a Crowd
Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun and Chunhua Shen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
[arXiv] [bibtex]

Professional Activities

Journal Reviewer
IEEE TPAMI, Nature Communications, IEEE TIP, IEEE TMM, IEEE TRO, PR, TMLR
Conference Reviewer
CVPR, ECCV, ICCV, ICLR, NeurIPS, ICML
Conference Area Chair
ICCV, ICLR, NeurIPS