Enabling Chatbots with Eyes and Ears:
An Immersive Multimodal Conversation System for Dynamic Interactions

POSTECH, UNIST, UIUC (*Equal contribution)
Proceedings of ACL 2025

Abstract

As chatbots continue to evolve toward human-like, real-world interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, emphasizing the "eyes" of human perception while neglecting the "ears", namely auditory aspects. Moreover, these studies often center on static interactions that discuss the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots with "eyes and ears" capable of more immersive interactions with humans. As part of this effort, we introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation (M3C), and propose a novel multimodal conversation model featuring multimodal memory retrieval. Our model, trained on M3C, can seamlessly engage in long-term conversations with multiple speakers in complex, real-world-like settings, effectively processing visual and auditory inputs to understand and respond appropriately. Human evaluations highlight the model's strong performance in maintaining coherent and dynamic interactions, demonstrating its potential for advanced multimodal conversational agents.

Teaser
A sample episode from our M3C dataset. The main speaker (Alex) engages in conversation with two different partners per session, and all speakers simultaneously experience the provided multimodal inputs in a shared environment. At the end of each session, the main speaker collects each partner's memory from that partner's perspective and uses this information to guide the conversation in subsequent sessions. In later sessions, Alex can encounter new partners and continue the interaction. The memories referenced when generating utterances are marked with symbols; shared or connected memories are indicated by the same symbol.
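To make the memory flow in the teaser concrete, here is a minimal sketch of how per-partner, perspective-specific memories might be collected at the end of a session and recalled in later ones. All names here (`MemoryEntry`, `MainSpeakerMemory`, `link_id`) are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass, field

# Hypothetical structures for illustration; the actual M3C release may use a different schema.
@dataclass
class MemoryEntry:
    owner: str                   # partner whose perspective the memory reflects
    session_id: int              # session in which the memory was formed
    text: str                    # natural-language memory summary
    link_id: str | None = None   # shared/connected memories carry the same link id

@dataclass
class MainSpeakerMemory:
    entries: list[MemoryEntry] = field(default_factory=list)

    def collect(self, partner: str, session_id: int, summary: str,
                link_id: str | None = None) -> None:
        """Store a memory from a partner's perspective at the end of a session."""
        self.entries.append(MemoryEntry(partner, session_id, summary, link_id))

    def recall(self, before_session: int) -> list[MemoryEntry]:
        """Return memories from earlier sessions to ground later utterances."""
        return [m for m in self.entries if m.session_id < before_session]

# Example: Alex collects memories after session 1 and recalls them in session 2.
alex = MainSpeakerMemory()
alex.collect("Partner_A", 1, "We listened to a street performance together.", link_id="m1")
alex.collect("Partner_B", 1, "Partner_B mentioned loving jazz.", link_id="m1")
print([m.text for m in alex.recall(before_session=2)])
```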

Dataset Comparison

We propose M3C, a machine-generated multimodal conversation dataset uniquely designed for multi-session, multi-speaker, and multimodal (image & audio) interactions. Unlike previous datasets that typically support only single-session or two-speaker conversations, M3C features:
  • Three sessions per episode, capturing temporal continuity,
  • Four speakers per episode, enabling diverse partner interactions,
  • Both image and audio modalities, grounding conversations in shared perceptual context.
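To make this structure concrete, the sketch below shows one plausible JSON-like layout for a single M3C-style episode. All field names (`episode_id`, `context`, `memories`, and so on) are our own illustrative assumptions rather than the official schema.

```python
import json

# Hypothetical episode layout for an M3C-style record (field names are illustrative).
episode = {
    "episode_id": "ep-000001",
    "main_speaker": "Alex",
    "sessions": [
        {
            "session_id": 1,
            "partners": ["Partner_A", "Partner_B"],   # two partners join the main speaker
            "context": {
                "image": "images/park_bench.jpg",     # shared visual input
                "audio": "audio/street_music.wav",    # shared auditory input
            },
            "turns": [
                {"speaker": "Alex", "utterance": "That melody fits the sunset, doesn't it?"},
                {"speaker": "Partner_A", "utterance": "It does, the guitarist is really good."},
            ],
            "memories": [   # collected at the end of the session, per partner
                {"owner": "Partner_A", "text": "Listened to a street guitarist at sunset."},
            ],
        },
        # ... sessions 2 and 3 follow the same structure, possibly with new partners
    ],
}

print(json.dumps(episode, indent=2)[:200])
```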
Our dataset contains 54K episodes and 2.5M dialogue turns, significantly expanding the scale and depth of multimodal conversational benchmarks. The comparison below highlights how M3C differs from existing datasets across structure, modality, and scale.
| Dataset | Type | # of Images | # of Audios | # of Sessions | # of Turns |
|---|---|---|---|---|---|
| AMI | Open-Domain | - | - | 279 | - |
| VisDial | Modality-QA | 120K | - | 123K | 2.4M |
| MELD | Open-Domain | - | - | 1.4K | 13K |
| ImageChat | Modality-Centric | 202K | - | 202K | 401K |
| MMConv | Modality-Centric | 114K | - | 5.1K | 39.8K |
| PhotoChat | Open-Domain | 10.9K | - | 12K | 156K |
| MMDD | Modality-Centric | 13K | - | 17K | - |
| MMDialog | Modality-Centric | 1.53M | - | 1.08M | 4.92M |
| MPCHAT | Modality-Centric | 153K | - | 15K | 42.5K |
| Audio Dialogues | Modality-QA | - | - | 163K | - |
| MiSC | Open-Domain | - | - | 51K | - |
| DialogCC | Open-Domain | 129.8K | - | 83K | - |
| LOCOMO | Open-Domain | 2K | - | 1.7K | - |
| Stark | Open-Domain | 900K | - | 500K | - |
| Ours (M3C) | Open-Domain | 24K | 73K | 16K | 2.5M |

Type: Modality-QA = question answering grounded in the modality; Modality-Centric = conversation centered on discussing the modality (e.g., an image or audio clip); Open-Domain = general conversation.
Note: '-' denotes values not reported.

Model Architecture

We propose a multimodal, multi-session, multi-party conversation model that perceives both images and audio—enabling the system to engage in conversations as if it has "eyes and ears." The model is designed to maintain coherence across sessions while interacting with different speakers in a shared environment.
Our architecture consists of two main modules:
  • Dialogue Module: Generates responses grounded in the current multimodal context, constructs session memories, and links past interactions to maintain dialogue consistency.
  • Retriever Module: Retrieves relevant multimodal memories—spanning image, audio, and text modalities—from prior sessions to inform ongoing conversations.
By integrating these modules, the model ensures temporally aware, partner-sensitive, and modality-grounded interactions.
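As a rough illustration of how the two modules could interact, the sketch below implements a toy retrieve-then-generate loop. The encoder, similarity score, and response generator are stand-ins (a real system would use trained multimodal encoders and a language model), and every function name here is an assumption made for illustration, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    modality: str    # "image", "audio", or "text"
    content: str     # reference to the stored item or its textual summary
    session_id: int

def embed(text_or_ref: str) -> list[float]:
    """Stand-in encoder; a real system would use trained text/image/audio encoders."""
    return [float(ord(c) % 7) for c in text_or_ref[:8]]

def score(query_vec: list[float], mem_vec: list[float]) -> float:
    """Toy dot-product similarity between the dialogue context and a stored memory."""
    return sum(q * m for q, m in zip(query_vec, mem_vec))

def retrieve(context: str, memory_bank: list[Memory], top_k: int = 2) -> list[Memory]:
    """Retriever Module (sketch): rank multimodal memories against the current context."""
    q = embed(context)
    ranked = sorted(memory_bank, key=lambda m: score(q, embed(m.content)), reverse=True)
    return ranked[:top_k]

def respond(context: str, retrieved: list[Memory]) -> str:
    """Dialogue Module (sketch): ground the reply in the retrieved memories.
    A real system would condition a multimodal language model on these inputs."""
    grounding = "; ".join(m.content for m in retrieved)
    return f"(response conditioned on: {grounding})"

# Example usage with a tiny hand-made memory bank.
bank = [
    Memory("audio", "street guitarist playing at sunset", 1),
    Memory("image", "photo of the park bench we sat on", 1),
    Memory("text", "Partner_B said they love jazz", 1),
]
print(respond("Do you remember that music in the park?",
              retrieve("music in the park", bank)))
```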

Examples from the dataset and our model's responses

Each episode in the dataset consists of three sessions

Live human chat examples showing the model's responses to various modality inputs

More details coming soon!