Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
Abstract
Understanding time from visual representations is a fundamental cognitive skill, yet it remains a challenge for multimodal large language models (MLLMs). In this work, we investigate the capabilities of MLLMs in interpreting time and date through analogue clocks and yearly calendars. To facilitate this, we curated a structured dataset comprising two subsets: 1) ClockQA, which covers various clock styles (standard, black-dial, no-second-hand, Roman numeral, and arrow-hand) paired with time-related questions; and 2) CalendarQA, which consists of yearly calendar images with questions ranging from commonly known dates (e.g., Christmas, New Year's Day) to computationally derived ones (e.g., the 100th or 153rd day of the year). We aim to analyse how MLLMs perform visual recognition, numerical reasoning, and temporal inference when presented with time-related visual data. Our evaluations show that despite recent advancements, reliably understanding time remains a significant challenge for MLLMs.
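To make the CalendarQA setup concrete, here is a minimal Python sketch of how yearly-calendar questions with verifiable ground-truth answers could be generated. The function names (calendar_qa_items, ordinal), the question templates, and the choice of fixed and derived dates are illustrative assumptions based on the examples in the abstract, not the authors' released code.

```python
# Minimal sketch (illustrative, not the authors' code): generating
# CalendarQA-style (question, answer) pairs for a given year.
from datetime import date, timedelta


def ordinal(n: int) -> str:
    """Return an ordinal string such as '100th' or '153rd'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def calendar_qa_items(year: int):
    """Yield (question, answer) pairs about weekdays in a given year."""
    # Commonly known dates mentioned in the abstract.
    fixed_dates = {
        "New Year's Day": date(year, 1, 1),
        "Christmas": date(year, 12, 25),
    }
    # Computationally derived dates, e.g. the 100th and 153rd day of the year.
    derived_days = [100, 153]

    for name, d in fixed_dates.items():
        yield (f"What day of the week is {name} in {year}?", d.strftime("%A"))

    for n in derived_days:
        d = date(year, 1, 1) + timedelta(days=n - 1)
        yield (
            f"What day of the week is the {ordinal(n)} day of {year}?",
            d.strftime("%A"),
        )


if __name__ == "__main__":
    for question, answer in calendar_qa_items(2025):
        print(question, "->", answer)
```

Because the answers are computed deterministically from the calendar, such pairs can serve as an automatic ground truth when scoring model responses to the corresponding calendar images.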
Community
In this work, we investigate the capabilities of MLLMs in interpreting time and date through analogue clocks and yearly calendars. We aim to analyse how MLLMs perform visual recognition, numerical reasoning, and temporal inference when presented with time-related visual data. We created a small dataset comprising two subsets, ClockQA and CalendarQA, and our evaluations show that despite recent advancements, reliably understanding time remains a significant challenge for MLLMs.
Interesting to know that ALL MLLMs struggle with telling 2, 4 and 5 o'clock. I can see future training directly on this dataset, just so clocks are no longer a blind spot.
Thanks for your comment!
Fine-tuning could help with these weaknesses, but our goal was to highlight the gap between claimed reasoning skills and real performance. Given the breadth of data and tasks typically used in pre-training and SFT, one might expect models to handle simple time-telling tasks in a zero-shot setting.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! (2025)
- Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries (2024)
- DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests (2025)
- Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models (2024)
- Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach (2024)
- ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming (2024)
- Learning Free Token Reduction for Multi-Modal LLM (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend