Generate a detailed image caption with highlighted entities
Use Tess-R1's capabilities to produce a Chain-of-Thought (CoT)
Transcribe and translate audio into text