Xingjian Diao

About Me



I am currently a Ph.D. student at Dartmouth College. My research focuses on multimodal learning. I have published papers and developed code for temporal modeling, efficient training, and audio-video-language integration, advancing state-of-the-art multimodal large language models (MLLMs) in audio-visual question answering, video captioning, and text-to-talking-face generation.


Prior to Dartmouth, I earned a Master's degree in Computer Science from Northwestern University (2021), advised by Prof. Nabil Alshurafa (Thank you, Nabil!), and a B.S. degree in Computer Science from the University of Pittsburgh (2020).



Interests

  • Multimodal Large Language Models
  • Video Understanding
  • Natural Language and Speech Processing

Education

  • Ph.D. in Computer Science, -Present
    Dartmouth College
  • M.S. in Computer Science, 2021
    Northwestern University
  • B.S. in Computer Science, 2020
    University of Pittsburgh
Publications
Selected Publications
On The Design Choices of Next Level LLMs
Yijun Tian, Xingjian Diao, Ming Cheng, Chunhui Zhang, Jiang Gui, Soroush Vosoughi, Xiangliang Zhang, Nitesh V. Chawla, Shichao Pei
arXiv, 2025
Music’s Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs
Wenhao You, Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Zhongyu Ouyang, Chiyu Ma, Tingxuan Wu, Noah Wei, Zong Ke, Ming Cheng, Soroush Vosoughi, Jiang Gui
arXiv, 2025
Learning Sparsity for Effective and Efficient Music Performance Question Answering
Xingjian Diao, Tianzhen Yang, Chunhui Zhang, Weiyi Wu, Ming Cheng, Jiang Gui
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Temporal Working Memory: Query-Guided Temporal Segment Refinement for Enhanced Multimodal Understanding
Xingjian Diao*, Chunhui Zhang*, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui
Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025
FT2TF: First-Person Statement Text-To-Talking Face Generation
Xingjian Diao, Ming Cheng, Wayner Barrios, SouYoung Jin
Winter Conference on Applications of Computer Vision (WACV), 2025
Learning Musical Representations for Music Performance Question Answering
Xingjian Diao, Chunhui Zhang, Tingxuan Wu, Ming Cheng, Zhongyu Ouyang, Weiyi Wu, Jiang Gui
Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Projects
Selected Projects
Intake Detection Tool with Multiple Classifiers

An Android application for wrist-worn devices that detects feeding patterns with low energy consumption and fast inference. It applies a template-based multi-centroid classifier to provide an end-to-end, battery-efficient approach to feeding detection.

Interactive Active Learning Annotation Tool

An interactive annotation tool that uses active learning to reduce data labeling time and cost. The front end, built with PyQt5 and pyqtgraph, offers features such as time synchronization and frame-by-frame video rewinding. The back end uses cv2, sklearn, and xgboost for data processing, K-means clustering, and clustered entropy active learning.
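
As a rough illustration of what cluster-aware entropy sampling can look like, the sketch below picks the most uncertain sample from each cluster as the next item to label. This is one plausible reading of "clustered entropy active learning", not the tool's actual implementation. The function names and the toy inputs are hypothetical; in practice the cluster ids might come from K-means and the class probabilities from a trained classifier.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(cluster_ids, class_probs):
    """Return, for each cluster, the index of its highest-entropy sample.

    cluster_ids: one cluster id per sample (assumed precomputed).
    class_probs: one predicted class-probability vector per sample
                 (assumed to come from the current model).
    """
    best = {}  # cluster id -> (entropy score, sample index)
    for idx, (cluster, probs) in enumerate(zip(cluster_ids, class_probs)):
        score = entropy(probs)
        if cluster not in best or score > best[cluster][0]:
            best[cluster] = (score, idx)
    return sorted(idx for _, idx in best.values())

# Hypothetical toy data: four samples in two clusters.
cluster_ids = [0, 0, 1, 1]
class_probs = [[0.9, 0.1], [0.5, 0.5], [0.99, 0.01], [0.7, 0.3]]
queries = select_for_labeling(cluster_ids, class_probs)
print(queries)
```

Querying one sample per cluster spreads the labeling budget across the data instead of concentrating it on a single uncertain region.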

iPADshiny

iPADshiny (integrated Protein Array Data management, analysis, and visualization tools) is a desktop application that simplifies protein analysis for biologists. It integrates multiple algorithms, including the auto-antibody Profiling Analysis, and uses state-of-the-art computational methods for efficient and effective analysis.

Online Drawing Management System

An online drawing management system with a browser/server (B/S) architecture, running on Windows. Features include notice announcements, a navigation menu, user and role management, flexible authorization, and online management and preview of large drawing documents. It automatically loads existing document storage structures, eliminating the need for manual entry of basic information. (Copyright: 2018SR071476)

Remote Voting System

A remote voting system that counts unique votes submitted by SMS text and records phone numbers to prevent repeat voting, offering an accessible and transparent solution for remote voting scenarios.
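
The core idea of counting one vote per phone number can be sketched as follows. This is a minimal illustration under assumptions, not the project's actual code: the function name, the `(phone_number, choice)` message format, and the first-vote-wins rule are all hypothetical.

```python
def count_unique_votes(messages):
    """Tally one vote per phone number.

    messages: iterable of (phone_number, choice) pairs in arrival order.
    Assumption: the first text from a number counts; later ones are ignored.
    """
    seen_numbers = set()
    tally = {}
    for phone_number, choice in messages:
        if phone_number in seen_numbers:
            continue  # repeat vote from a number already recorded
        seen_numbers.add(phone_number)
        choice = choice.strip().upper()  # normalize "a ", "A", " a"
        tally[choice] = tally.get(choice, 0) + 1
    return tally

# Hypothetical incoming texts: the second message from +15550001 is dropped.
messages = [
    ("+15550001", "A"),
    ("+15550002", "b"),
    ("+15550001", "B"),
    ("+15550003", "A "),
]
print(count_unique_votes(messages))  # {'A': 2, 'B': 1}
```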

Introvert

An inclusive online chat environment for introverted students, built with JavaScript, Python, and Google Cloud Platform. It implements anonymous chatting and user-friendly direct messaging to promote engagement and improve the chat experience for introverted individuals.

Teaching

TA indicates Teaching Assistant

Services
Contact