Xi Xiao 肖熙
Ph.D. Student @ UAB
Research Intern @ ORNL

Hi, everyone! I am Xi Xiao, a second-year Ph.D. student in the Department of Computer Science at the University of Alabama at Birmingham, advised by Dr. Tianyang Wang and co-advised by Dr. Min Xu from Carnegie Mellon University.

I am currently conducting a long-term research internship at the Computational Sciences and Engineering Division of Oak Ridge National Laboratory, supervised by Dr. Xiao Wang.

I am a finalist for the ACM Gordon Bell Prize in 2025.

My research interests include:

  • Parameter-Efficient Fine-Tuning and Pretraining for Large Foundation Models (SC’25, ACM MM’25, COLM’25, CVPRW’25)
  • Model Efficiency / Acceleration (SC’25, ICASSP’25, ECML-PKDD’25)
  • Image / Video Generation / Understanding (ICCV’25, ICMLW’25)

Education
  • University of Alabama at Birmingham
    Ph.D. in Computer Science
    Jan. 2024 - present
  • Sichuan University Jincheng College
    B.S. in Artificial Intelligence
    Sep. 2019 - Jun. 2023
Experience
  • Oak Ridge National Laboratory
    Research Intern
May 2025 - present
Honors & Awards
  • 2025 ACM Gordon Bell Prize Finalist
    2025
  • SC 2025 Best Paper Finalist
    2025
News
2025
Our work was selected as a finalist for the 2025 ACM Gordon Bell Prize 🏆!
Jul 25
One paper accepted by COLM 2025!
Jul 09
One paper accepted by ACM MM 2025!
Jul 05
One paper accepted by SC 2025 — Best Paper Finalist 🏆! I'm deeply honored to be the only student author on this remarkable team.
Jun 26
One paper accepted by ICCV 2025!
Jun 26
One paper accepted by ECML-PKDD 2025!
May 19
One paper accepted by ICML Workshop on FM4LS 2025!
May 17
One paper accepted by ACL Workshop on MAGMaR 2025 — Oral!
May 17
One paper accepted by CVPR Workshop on FGVC 2025!
May 07
Joined the Computational Sciences and Engineering Division at Oak Ridge National Laboratory for a long-term research internship!
May 01
2024
One paper accepted by ICASSP 2025!
Dec 19
Started serving as a reviewer for IEEE Transactions on Circuits and Systems for Video Technology!
Mar 01
Started Ph.D. in Computer Science at the University of Alabama at Birmingham!
Jan 10
Selected Publications
ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling

Xiao Wang, Jong-Youl Choi, Takuya Kurihaya, Isaac Lyngaas, Hong-Jun Yoon, Xi Xiao, Ming Fan, Nasik Muhammad Nafi, Aristeidis Tsaris, Ashwin M. Aji, Maliha Hossain, Mohamed Wahib, Dali Wang, Peter Thornton, Prasanna Balaprakash, Moetasim Ashfaq, Dan Lu

International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2025 Best Paper Finalist 🏆

Sparse observations and coarse-resolution climate models limit effective regional decision-making, underscoring the need for robust downscaling. However, existing AI methods struggle with generalization across variables and geographies and are constrained by the quadratic complexity of Vision Transformer (ViT) self-attention. We introduce ORBIT-2, a scalable foundation model for global, hyper-resolution climate downscaling. ORBIT-2 incorporates two key innovations: (1) Residual Slim ViT (Reslim), a lightweight architecture with residual learning and Bayesian regularization for efficient, robust prediction; and (2) TILES, a tile-wise sequence scaling algorithm that reduces self-attention complexity from quadratic to linear, enabling long-sequence processing and massive parallelism. ORBIT-2 scales to 10 billion parameters across 32,768 GPUs, achieving up to 1.8 ExaFLOPS sustained throughput and 92-98% strong scaling efficiency. It supports downscaling to 0.9 km global resolution and processes sequences up to 4.2 billion tokens. On 7 km resolution benchmarks, ORBIT-2 achieves high accuracy with R^2 scores in the range of 0.98 to 0.99 against observation data.
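
To give a flavor of the tile-wise idea, here is a minimal, hypothetical PyTorch sketch of attention restricted to fixed-size tiles; the tile size, shapes, and function names are my own illustrative assumptions, not the actual TILES implementation.

```python
# Illustrative sketch of tile-wise self-attention (NOT the ORBIT-2 TILES code).
# Attention is computed independently inside fixed-size tiles, so the cost grows
# linearly with the number of tiles instead of quadratically with sequence length.
import torch
import torch.nn.functional as F

def tiled_self_attention(q, k, v, tile_size=256):
    """q, k, v: (batch, seq_len, dim); assumes seq_len is divisible by tile_size."""
    b, n, d = q.shape
    t = n // tile_size
    # Fold each tile into the batch dimension so tiles attend only within themselves.
    q = q.reshape(b * t, tile_size, d)
    k = k.reshape(b * t, tile_size, d)
    v = v.reshape(b * t, tile_size, d)
    out = F.scaled_dot_product_attention(q, k, v)  # O(tile_size^2) per tile
    return out.reshape(b, n, d)

# Example: 8,192 tokens processed as 32 independent tiles of 256 tokens.
x = torch.randn(1, 8192, 64)
print(tiled_self_attention(x, x, x).shape)  # torch.Size([1, 8192, 64])
```

Because each tile is an independent attention problem, tiles can also be spread across devices, which is what makes this kind of scheme amenable to massive parallelism.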

Visual Instance-aware Prompt Tuning

Xi Xiao*, Yunbei Zhang*, Xingjian Li, Tianyang Wang, Xiao Wang, Yuxiang Wei, Jihun Hamm, Min Xu (* equal contribution)

ACM International Conference on Multimedia (ACM MM) 2025

Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches utilizing dataset-level prompts that remain the same across all input instances. We observe that this strategy results in sub-optimal performance due to high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information.
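
The core idea can be illustrated with a short, hypothetical PyTorch sketch: a shared dataset-level prompt is combined with a prompt generated from each input's embedding, and a PCA-style projection keeps the leading components. All module names and dimensions below are my own assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of instance-aware prompt fusion (NOT the official ViaPT code).
import torch
import torch.nn as nn

class InstanceAwarePrompt(nn.Module):
    def __init__(self, embed_dim=768, num_prompts=10, pca_dim=64):
        super().__init__()
        # Dataset-level prompts shared across all inputs.
        self.dataset_prompt = nn.Parameter(torch.zeros(num_prompts, embed_dim))
        # Lightweight generator: pooled image embedding -> per-instance prompts.
        self.generator = nn.Linear(embed_dim, num_prompts * embed_dim)
        # PCA-style projection (assumed to be fit offline) retains leading components.
        self.pca_down = nn.Linear(embed_dim, pca_dim, bias=False)
        self.pca_up = nn.Linear(pca_dim, embed_dim, bias=False)

    def forward(self, pooled_embedding):  # (batch, embed_dim)
        b, d = pooled_embedding.shape
        instance_prompt = self.generator(pooled_embedding).view(b, -1, d)
        fused = self.dataset_prompt.unsqueeze(0) + instance_prompt  # fuse both sources
        return self.pca_up(self.pca_down(fused))                    # keep leading components

prompts = InstanceAwarePrompt()(torch.randn(2, 768))
print(prompts.shape)  # torch.Size([2, 10, 768])
```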

MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

Hengjia Li*, Lifan Jiang*, Xi Xiao*, Tianyang Wang, Hongwei Yi, Boxi Wu, Deng Cai (* equal contribution)

International Conference on Computer Vision (ICCV) 2025

Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, primarily due to their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce MagicID, a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences.
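
As a rough illustration of the preference-optimization ingredient, here is a generic DPO-style pairwise loss in PyTorch; MagicID's actual hybrid objective is more involved, so treat this purely as a sketch of the general mechanism.

```python
# Generic DPO-style pairwise preference loss (illustrative; not MagicID's objective).
# Given log-probabilities of a preferred ("chosen") and a less-preferred ("rejected")
# sample under the current and a frozen reference model, the loss pushes the current
# model toward the preferred sample.
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

loss = preference_loss(torch.tensor([-1.0]), torch.tensor([-2.0]),
                       torch.tensor([-1.2]), torch.tensor([-1.9]))
print(loss)  # scalar tensor
```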

Describe Anything in Medical Images

Xi Xiao*, Yunbei Zhang*, Thanh-Huy Nguyen*, Ba-Thinh Lam, Janet Wang, Lin Zhao, Jihun Hamm, Tianyang Wang, Xingjian Li, Xiao Wang, Hao Xu, Tianming Liu, Min Xu (* equal contribution)

ICML 2025 Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences

Localized image captioning has made significant progress with models like the Describe Anything Model (DAM), which can generate detailed region-specific descriptions without explicit region-text supervision. However, such capabilities have yet to be widely applied to specialized domains like medical imaging, where diagnostic interpretation relies on subtle regional findings rather than global understanding. To mitigate this gap, we propose MedDAM, the first comprehensive framework leveraging large vision-language models for region-specific captioning in medical images.

TD-RD: A Top-Down Benchmark with Real-Time Framework for Road Damage Detection

Xi Xiao, Zhengji Li, Wentao Wang, Jiacheng Xie, Houjie Lin, Swalpa Kumar Roy, Tianyang Wang, Min Xu

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025

Object detection has witnessed remarkable advancements over the past decade, largely driven by breakthroughs in deep learning and the proliferation of large-scale datasets. However, the domain of road damage detection remains relatively underexplored, despite its critical significance for applications such as infrastructure maintenance and road safety. This paper addresses this gap by introducing a novel top-down benchmark that offers a complementary perspective to existing datasets, specifically tailored for road damage detection. Additionally, we present TDYOLOV10, a real-time object detection framework designed to handle the unique challenges posed by the TD-RD dataset.

All publications
Academic Services
  • Conference Reviewer: ICML 2025, ACM MM 2025, ICANN 2025, IJCNN 2025, ECCV 2024, ICONIP 2024
  • Journal Reviewer: IEEE TCSVT
  • Area Chair: IEEE PRAI 2022
  • IEEE Student Member