RoadTones: Tone Controllable Text Generation from Road Event Videos

What is RoadTones?

RoadTones is a dataset, model, and evaluation stack for tone-controllable text generation from road event videos. Existing models generate neutral, factual descriptions but offer no control over how events are expressed: their tone, urgency, or style. RoadTones bridges this gap by enabling audience-adaptive communication across mobility, ADAS development, and public engagement. RoadTones highlights:

RoadTones-51K: A dataset of 51K tone-aware captions with rich tonal annotations across 215 personality traits, 16 writing styles, and 8 structural controls: informativeness, target word count, viewpoint (first-person vs. external), hashtags, emojis, user mentions, location, and date/time.
RoadTones-VL-CoT: A multi-tone controllable video-to-text model that also generates Chain-of-Thought style intermediate caption drafts for partial interpretability.
RoadTones-Eval: New evaluation metrics and benchmarks to assess tone adherence and factual consistency.
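
To make the control types listed above concrete, here is a minimal Python sketch of how a single tone specification could be represented and turned into a text instruction. It is not the RoadTones schema or prompt format: the class name `ToneControls`, the function `build_prompt`, every field name, the default values, the integer intensity scale, and the example trait and style names ("empathetic", "formal") are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToneControls:
    """Hypothetical container for the control types described above.

    Field names, defaults, and the integer intensity scale are assumptions,
    not the dataset's actual annotation format.
    """
    # Trait or style name -> intensity (scale assumed for illustration).
    personality_traits: Dict[str, int] = field(default_factory=dict)
    writing_styles: Dict[str, int] = field(default_factory=dict)
    # The eight structural controls.
    informativeness: str = "medium"
    target_word_count: int = 40
    first_person: bool = False  # False means an external viewpoint
    hashtags: bool = False
    emojis: bool = False
    user_mentions: bool = False
    location: bool = False
    date_time: bool = False


def _format_intensities(values: Dict[str, int]) -> str:
    """Render 'name (intensity n)' pairs in a stable, sorted order."""
    return ", ".join(
        f"{name} (intensity {level})" for name, level in sorted(values.items())
    )


def build_prompt(controls: ToneControls) -> str:
    """Turn a control specification into a plain-text instruction (illustrative)."""
    lines: List[str] = ["Describe the road event in the video."]
    if controls.personality_traits:
        lines.append(
            f"Personality traits: {_format_intensities(controls.personality_traits)}."
        )
    if controls.writing_styles:
        lines.append(
            f"Writing styles: {_format_intensities(controls.writing_styles)}."
        )
    lines.append(f"Informativeness: {controls.informativeness}.")
    lines.append(f"Target length: about {controls.target_word_count} words.")
    viewpoint = "first-person" if controls.first_person else "external"
    lines.append(f"Viewpoint: {viewpoint}.")
    # Collect the optional elements that are switched on.
    toggles = {
        "hashtags": controls.hashtags,
        "emojis": controls.emojis,
        "user mentions": controls.user_mentions,
        "location": controls.location,
        "date/time": controls.date_time,
    }
    included = [label for label, enabled in toggles.items() if enabled]
    if included:
        lines.append("Include: " + ", ".join(included) + ".")
    return "\n".join(lines)


if __name__ == "__main__":
    example = ToneControls(
        personality_traits={"empathetic": 3},  # hypothetical trait name
        writing_styles={"formal": 2},          # hypothetical style name
        target_word_count=30,
        first_person=True,
        hashtags=True,
        emojis=True,
    )
    print(build_prompt(example))
```

Keeping the traits and styles as name-to-intensity mappings and the structural controls as typed fields is one possible design; it keeps the open-ended tonal vocabulary separate from the fixed set of eight structural switches.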

Dataset Statistics

Intensity distribution of personality traits and writing styles in RoadTones-51K

Word count and informativeness correlation in RoadTones-51K

Results

Zero-shot evaluation of open-source models

Video Presentation

References: GPT-5 [14]; Gemini-2.5-Pro [3]; Dolphins [9]; RoboTron-Drive [5]; RoadSocial [17]; VideoLLaMA3 [29]; InternVL3.5 [25]; Qwen2.5-VL [1]; MiniCPM-V 4.5 [27]; Qwen3-VL [19].

Citation

@misc{parikh2026roadtonestonecontrollabletext,
  title={RoadTones: Tone Controllable Text Generation from Road Event Videos},
  author={Chirag Parikh and Siddhi Pravin Lipare and Ravi Kiran Sarvadevabhatla},
  year={2026},
  eprint={2605.21411},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.21411},
}