Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
Abstract
Understanding time from visual representations is a fundamental cognitive skill, yet it remains a challenge for multimodal large language models (MLLMs). In this work, we investigate the capabilities of MLLMs in interpreting time and date through analogue clocks and yearly calendars. To facilitate this, we curated a structured dataset comprising two subsets: 1) ClockQA, which covers various clock styles (standard, black-dial, no-second-hand, Roman numeral, and arrow-hand) paired with time-related questions; and 2) CalendarQA, which consists of yearly calendar images with questions ranging from commonly known dates (e.g., Christmas, New Year's Day) to computationally derived ones (e.g., the 100th or 153rd day of the year). We aim to analyse how MLLMs perform visual recognition, numerical reasoning, and temporal inference when presented with time-related visual data. Our evaluations show that despite recent advancements, reliably understanding time remains a significant challenge for MLLMs.
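To make the CalendarQA setup concrete, here is a minimal Python sketch of how yearly-calendar questions with verifiable ground-truth answers could be generated. The function names (calendar_qa_items, ordinal), the question templates, and the choice of fixed and derived dates are illustrative assumptions based on the examples in the abstract, not the authors' released code.

```python
# Minimal sketch (illustrative, not the authors' code): generating
# CalendarQA-style (question, answer) pairs for a given year.
from datetime import date, timedelta


def ordinal(n: int) -> str:
    """Return an ordinal string such as '100th' or '153rd'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def calendar_qa_items(year: int):
    """Yield (question, answer) pairs about weekdays in a given year."""
    # Commonly known dates mentioned in the abstract.
    fixed_dates = {
        "New Year's Day": date(year, 1, 1),
        "Christmas": date(year, 12, 25),
    }
    # Computationally derived dates, e.g. the 100th and 153rd day of the year.
    derived_days = [100, 153]

    for name, d in fixed_dates.items():
        yield (f"What day of the week is {name} in {year}?", d.strftime("%A"))

    for n in derived_days:
        d = date(year, 1, 1) + timedelta(days=n - 1)
        yield (
            f"What day of the week is the {ordinal(n)} day of {year}?",
            d.strftime("%A"),
        )


if __name__ == "__main__":
    for question, answer in calendar_qa_items(2025):
        print(question, "->", answer)
```

Because the answers are computed deterministically from the calendar, such pairs can serve as an automatic ground truth when scoring model responses to the corresponding calendar images.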
Community
In this work, we investigate the capabilities of MLLMs in interpreting time and date through analogue clocks and yearly calendars. We aim to analyse how MLLMs perform visual recognition, numerical reasoning, and temporal inference when presented with time-related visual data. We created a small dataset comprising two subsets, ClockQA and CalendarQA, and our evaluations show that despite recent advancements, reliably understanding time remains a significant challenge for MLLMs.
Interesting to know that ALL MLLMs struggle with telling 2, 4 and 5 o'clock. I can see future training directly on this dataset, just so clocks are no longer a blind spot.
Thanks for your comment!
Fine-tuning could help with these weaknesses, but our goal was to highlight the gap between claimed reasoning skills and real performance. Given the breadth of data and tasks typically used in pre-training and SFT, one might expect models to handle simple time-telling tasks in a zero-shot setting.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! (2025)
- Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries (2024)
- DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests (2025)
- Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models (2024)
- Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach (2024)
- ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming (2024)
- Learning Free Token Reduction for Multi-Modal LLM (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend