Generate text responses from image and text prompts
A unified multimodal understanding and generation model.