I'm pursuing a Ph.D. in Computer Science at Nanjing University, advised by Prof. Limin Wang and Prof. Tong Lu.

My research interests are general visual perception and human-computer multimodal interaction systems. I focus on video understanding, egocentric visual perception, and user-centric visual computing.

Currently, I am working on frontier vision-language models at NVIDIA, collaborating with Zhiding Yu, Guilin Liu, and other outstanding researchers on Project Eagle. Eagle2 contributes to NVIDIA Cosmos Nemotron and NVIDIA Isaac GR00T N1, and Eagle-2.5 contributes to NVIDIA Cosmos Reasoner1 and NVIDIA Nemotron-H.

🔥 News

  • 2025-03-17: Eagle2 has been adopted by the NVIDIA GEAR team to develop the robotic foundation model GR00T N1.

  • 2025-01-22: Three papers are accepted to ICLR 2025: CG-Bench, a long video understanding benchmark; EgoHOD, an egocentric foundation model; and X-Gen, for ego-exo cross-view video prediction.

  • 2025-01-20: We present the frontier VLM Eagle2; the model weights have been released on Hugging Face.

  • 2025-01-10: CG-Bench has been integrated into VLMEvalKit.

  • 2024-12-30: We present Vinci, a real-time embodied smart assistant based on an egocentric VLM. The code is available on GitHub.

  • 2024-12-16: We present CG-Bench, a clue-grounded long video understanding benchmark, with basic evaluation code on GitHub.

  • 2024-07-01: Our team wins top-1 rankings in 7 tracks of the 1st EgoVis Challenge at ECCV 2024; the code is available on GitHub.

  • 2024-07-01: InternVideo2 has been accepted by ECCV 2024.

  • 2024-03-22: We present InternVideo2 and release the code on GitHub.

  • 2024-03-15: We present video-mamba-suite, a suite for video modeling with Mamba, and release the code on GitHub.

  • 2024-02-27: Four papers are accepted to CVPR 2024: InternVL for general visual understanding, MVBench for chat-centric video understanding, EgoInstructor for egocentric captioning, and EgoExoLearn, an ego-exo cross-view dataset and model suite.

  • 2023-12-26: We present the generalist vision-language model InternVL and release the code on GitHub.

  • 2023-10-10: In the 1st Perception Test Challenge, we obtain the best performance in Temporal Sound Localisation and runner-up in Temporal Action Localisation. The solution code is here.

  • 2023-08-16: The code of MAT is released here.

  • 2023-07-14: Our paper MAT is accepted by ICCV 2023.

  • 2023-05-22: We present VideoLLM, a novel video sequence understanding framework.

  • 2023-04-03: BasicTAD is accepted by CVIU.

  • 2023-01-17: Our team wins the WSDM Cup 2023 Toloka VQA Challenge.

  • 2022-11-17: 🎂 We provide the final Ego4D report and the code.

  • 2022-09-19: Our team wins top-1 rankings in 7 tracks of the Ego4D Challenge at ECCV 2022.

  • 2022-09-15: We have released the source code of BasicTAD.

  • 2022-06-21: The code of DCAN is released here.

  • 2022-05-05: We present BasicTAD, an end-to-end baseline method for temporal action detection (TAD).

  • 2021-12-01: Our paper DCAN is accepted by AAAI 2022.

šŸ“ Publications

arXiv

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, Nadine Chang, Karan Sapra, Amala Sanjay Deshmukh, Tuomas Rintamaki, Matthieu Le, Ilia Karmanov, Lukas Voegtle, Philipp Fischer, De-An Huang, Timo Roman, Tong Lu, Jose M Alvarez, Bryan Catanzaro, Jan Kautz, Andrew Tao, Guilin Liu, Zhiding Yu

PDF code

  • This work focuses on developing open-source vision-language models by emphasizing data strategy in post-training, resulting in the performant Eagle2 models that achieve state-of-the-art results across various multimodal benchmarks.
ICLR 2025

CG-Bench: Clue-grounded question answering benchmark for long video understanding

Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, Limin Wang

PDF code

  • CG-Bench tests multimodal models on long videos with clue-based QA, featuring 1,219 videos and 12,129 questions. It highlights challenges in video comprehension and the gap between open-source and commercial models.
ECCV 2024

InternVideo2: Scaling foundation models for multimodal video understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Jilan Xu, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang

PDF code

  • InternVideo2 offers advanced video models with a 6B encoder, excelling in video recognition, text alignment, and dialogue, achieving top results in over 60 tasks.
CVPR 2024

EgoExoLearn: A dataset for bridging asynchronous ego- and exo-centric views of procedural activities in the real world

Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, Yu Qiao

PDF code

  • EgoExoLearn is a dataset with 120 hours of egocentric and demonstration videos, gaze data, and multimodal annotations, designed to advance AI learning through observation and cross-view task benchmarks.
CVPR 2024

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai#

PDF code

  • We developed InternVL, a 6 billion parameter vision-language model, aligned with large language models using vast image-text data. It achieves state-of-the-art results on 32 benchmarks, advancing multi-modal AGI.
ICCV 2023

Memory-and-Anticipation Transformer for Online Action Understanding

Jiahao Wang*, Guo Chen, Yifei Huang, Limin Wang, Tong Lu#

Homepage

  • This work presents a memory-anticipation-based method for online action understanding.
CVIU

BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection

Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, Limin Wang#

PDF code

  • This work presents a simple yet effective end-to-end training framework for temporal action detection.
AAAI 2022

DCAN: Improving Temporal Action Detection via Dual Context Aggregation

Guo Chen, Yin-Dong Zheng, Limin Wang, Tong Lu#

PDF code

  • This work explores boundary-based methods for temporal action detection and proposes a novel network, termed DCAN, that improves temporal action detection via temporal-level and proposal-level context aggregation.

šŸ“ Projects

ECCV Workshop

EgoVideo: Exploring egocentric foundation model and downstream adaptation

Guo Chen, Baoqi Pei, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao

PDF code

  • This work presents our champion solutions to seven tracks of the 1st EgoVis Challenge.
Generalist Model

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao#

PDF code

  • This work presents InternVideo, a family of general video foundation models trained via generative and discriminative learning.
ECCV Workshop

InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges

Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei Huang, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, Limin Wang, Yu Qiao#

PDF code

  • This work presents our champion solutions to five tracks of the Ego4D Challenge.

🎖 Honors and Awards

  • 2024-07-01: 1st EgoVis Challenge, ECCV 2024, 7 Top-1 Rankings
  • 2023-10-01: 1st Perception Test Challenge, Top-1 and Top-2 Rankings
  • 2023-01-01: WSDM Cup 2023 Toloka VQA Challenge, WSDM 2023, Top-1 Ranking
  • 2022-10-01: 2nd Ego4D Challenge, ECCV 2022, 7 Top-1 Rankings
  • 2017-12-01: CCPC Final Contest, Bronze Medal
  • 2017-10-01: CCPC Regional Contest, Bronze Medal
  • 2017-10-15: ACM-ICPC Asia Regional Contest, Silver Medal

📖 Educations

  • 2020.09 - present, Nanjing University, Nanjing, China
  • 2015.09 - 2019.06, University of South China, Hengyang, China