Enabling Chatbots with Eyes and Ears:
An Immersive Multimodal Conversation System for Dynamic Interactions

POSTECH, UNIST, UIUC (*Equal contribution)
Proceedings of ACL 2025

Abstract

As chatbots continue to evolve toward human-like, real-world interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, emphasizing the "eyes" of human perception while neglecting the "ears", namely auditory aspects. Moreover, these studies often center on static interactions that discuss the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots with "eyes and ears" capable of more immersive interactions with humans. As part of this effort, we introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation (M3C), and propose a novel multimodal conversation model featuring multimodal memory retrieval. Our model, trained on M3C, can seamlessly engage in long-term conversations with multiple speakers in complex, real-world-like settings, effectively processing visual and auditory inputs to understand and respond appropriately. Human evaluations highlight the model's strong performance in maintaining coherent and dynamic interactions, demonstrating its potential for advanced multimodal conversational agents.

Teaser
A sample episode from our M3C dataset. The main speaker (Alex) engages in conversation with two different partners per session, and all speakers simultaneously experience the provided multimodal inputs in a shared environment. At the end of each session, the main speaker collects each partner's memory from that partner's perspective and uses this information to guide the conversation in subsequent sessions. In later sessions, Alex can encounter new partners and continue the interaction. The memories referenced when generating utterances are marked with symbols; shared or connected memories are indicated by the same symbol.
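To make the memory flow in the teaser concrete, here is a minimal sketch of how per-partner, perspective-specific memories might be collected at the end of a session and recalled in later ones. All names here (`MemoryEntry`, `MainSpeakerMemory`, `link_id`) are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass, field

# Hypothetical structures for illustration; the actual M3C release may use a different schema.
@dataclass
class MemoryEntry:
    owner: str                   # partner whose perspective the memory reflects
    session_id: int              # session in which the memory was formed
    text: str                    # natural-language memory summary
    link_id: str | None = None   # shared/connected memories carry the same link id

@dataclass
class MainSpeakerMemory:
    entries: list[MemoryEntry] = field(default_factory=list)

    def collect(self, partner: str, session_id: int, summary: str,
                link_id: str | None = None) -> None:
        """Store a memory from a partner's perspective at the end of a session."""
        self.entries.append(MemoryEntry(partner, session_id, summary, link_id))

    def recall(self, before_session: int) -> list[MemoryEntry]:
        """Return memories from earlier sessions to ground later utterances."""
        return [m for m in self.entries if m.session_id < before_session]

# Example: Alex collects memories after session 1 and recalls them in session 2.
alex = MainSpeakerMemory()
alex.collect("Partner_A", 1, "We listened to a street performance together.", link_id="m1")
alex.collect("Partner_B", 1, "Partner_B mentioned loving jazz.", link_id="m1")
print([m.text for m in alex.recall(before_session=2)])
```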

Dataset Comparison

We propose M3C, a machine-generated multimodal conversation dataset uniquely designed for multi-session, multi-speaker, and multimodal (image & audio) interactions. Unlike previous datasets that typically support only single-session or two-speaker conversations, M3C features:
  • Three sessions per episode, capturing temporal continuity,
  • Four speakers per episode, enabling diverse partner interactions,
  • Both image and audio modalities, grounding conversations in shared perceptual context.
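To make this structure concrete, the sketch below shows one plausible JSON-like layout for a single M3C-style episode. All field names (`episode_id`, `context`, `memories`, and so on) are our own illustrative assumptions rather than the official schema.

```python
import json

# Hypothetical episode layout for an M3C-style record (field names are illustrative).
episode = {
    "episode_id": "ep-000001",
    "main_speaker": "Alex",
    "sessions": [
        {
            "session_id": 1,
            "partners": ["Partner_A", "Partner_B"],   # two partners join the main speaker
            "context": {
                "image": "images/park_bench.jpg",     # shared visual input
                "audio": "audio/street_music.wav",    # shared auditory input
            },
            "turns": [
                {"speaker": "Alex", "utterance": "That melody fits the sunset, doesn't it?"},
                {"speaker": "Partner_A", "utterance": "It does, the guitarist is really good."},
            ],
            "memories": [   # collected at the end of the session, per partner
                {"owner": "Partner_A", "text": "Listened to a street guitarist at sunset."},
            ],
        },
        # ... sessions 2 and 3 follow the same structure, possibly with new partners
    ],
}

print(json.dumps(episode, indent=2)[:200])
```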
Our dataset contains 54K episodes and 2.5M dialogue turns, significantly expanding the scale and depth of multimodal conversational benchmarks. The comparison below highlights how M3C differs from existing datasets across structure, modality, and scale.
| Dataset | Type | # of Images | # of Audios | # of Sessions | # of Turns |
|---|---|---|---|---|---|
| AMI | Open-Domain | - | - | 279 | - |
| VisDial | Modality-QA | 120K | - | 123K | 2.4M |
| MELD | Open-Domain | - | - | 1.4K | 13K |
| ImageChat | Modality-Centric | 202K | - | 202K | 401K |
| MMConv | Modality-Centric | 114K | - | 5.1K | 39.8K |
| PhotoChat | Open-Domain | 10.9K | - | 12K | 156K |
| MMDD | Modality-Centric | 13K | - | 17K | - |
| MMDialog | Modality-Centric | 1.53M | - | 1.08M | 4.92M |
| MPCHAT | Modality-Centric | 153K | - | 15K | 42.5K |
| Audio Dialogues | Modality-QA | - | - | 163K | - |
| MiSC | Open-Domain | - | - | 51K | - |
| DialogCC | Open-Domain | 129.8K | - | 83K | - |
| LOCOMO | Open-Domain | 2K | - | 1.7K | - |
| Stark | Open-Domain | 900K | - | 500K | - |
| Ours (M3C) | Open-Domain | 24K | 73K | 16K | 2.5M |

Type: Modality-QA = question answering grounded in the modality; Modality-Centric = conversation centered on discussing the modality (e.g., an image or audio clip); Open-Domain = general conversation.
Note: '-' denotes values not reported.

Model Architecture

We propose a multimodal, multi-session, multi-party conversation model that perceives both images and audio—enabling the system to engage in conversations as if it has "eyes and ears." The model is designed to maintain coherence across sessions while interacting with different speakers in a shared environment.
Our architecture consists of two main modules:
  • Dialogue Module: Generates responses grounded in the current multimodal context, constructs session memories, and links past interactions to maintain dialogue consistency.
  • Retriever Module: Retrieves relevant multimodal memories—spanning image, audio, and text modalities—from prior sessions to inform ongoing conversations.
By integrating these modules, the model ensures temporally aware, partner-sensitive, and modality-grounded interactions.
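As a rough illustration of how the two modules could interact, the sketch below implements a toy retrieve-then-generate loop. The encoder, similarity score, and response generator are stand-ins (a real system would use trained multimodal encoders and a language model), and every function name here is an assumption made for illustration, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    modality: str    # "image", "audio", or "text"
    content: str     # reference to the stored item or its textual summary
    session_id: int

def embed(text_or_ref: str) -> list[float]:
    """Stand-in encoder; a real system would use trained text/image/audio encoders."""
    return [float(ord(c) % 7) for c in text_or_ref[:8]]

def score(query_vec: list[float], mem_vec: list[float]) -> float:
    """Toy dot-product similarity between the dialogue context and a stored memory."""
    return sum(q * m for q, m in zip(query_vec, mem_vec))

def retrieve(context: str, memory_bank: list[Memory], top_k: int = 2) -> list[Memory]:
    """Retriever Module (sketch): rank multimodal memories against the current context."""
    q = embed(context)
    ranked = sorted(memory_bank, key=lambda m: score(q, embed(m.content)), reverse=True)
    return ranked[:top_k]

def respond(context: str, retrieved: list[Memory]) -> str:
    """Dialogue Module (sketch): ground the reply in the retrieved memories.
    A real system would condition a multimodal language model on these inputs."""
    grounding = "; ".join(m.content for m in retrieved)
    return f"(response conditioned on: {grounding})"

# Example usage with a tiny hand-made memory bank.
bank = [
    Memory("audio", "street guitarist playing at sunset", 1),
    Memory("image", "photo of the park bench we sat on", 1),
    Memory("text", "Partner_B said they love jazz", 1),
]
print(respond("Do you remember that music in the park?",
              retrieve("music in the park", bank)))
```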

Examples from the dataset and our model's responses

Each episode in the dataset consists of three sessions

Live human chat examples showing the model's responses to various modality inputs

More details coming soon!