Align-DS-V

🏠 Homepage | 🤗 Align-Anything Dataset | 🤗 T2T_Instruction-tuning Dataset | 🤗 TI2T_Instruction-tuning Dataset | 👍 Our Official Code Repo

Introduction

Align-DS-V is an experimental vision-language model built on DeepSeek-R1-Distill-Llama-8B. Developed by the PKU-Alignment team and HKUST, it focuses on enhancing reasoning capabilities through all-modality alignment.

Performance

VQA Tasks

As a vision-language model, Align-DS-V scores close to GPT-4o on several VQA chat and reasoning benchmarks.

Benchmark	Align-DS-V (8B)	GPT-4o
MathVista	27.0	30.4
MathVision	63.8	62.2
LLaVA-Bench-COCO	105.3	104.9
A-OKVQA	83.7	87.9

Math Tasks

In addition, we were pleasantly surprised to find that Align-DS-V, which extends DeepSeek-R1-Distill-Llama-8B to the visual modality, also achieves a significant improvement in the base model's original text-modality reasoning capabilities.

Benchmark	Align-DS-V (8B)	DeepSeek-R1-Distill-Llama-8B
ARC (5-shot)	34.2	32.7
ARC-Challenge (5-shot)	40.5	21.4
BigBench-Hard (3-shot)	73.4	72.2
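
To make the size of each improvement easier to read, the short sketch below computes the absolute gain per benchmark. The scores are copied from the table above; the `score_gains` helper is hypothetical and is not part of the align-anything codebase.

```python
# Scores copied from the table above, as
# (Align-DS-V, DeepSeek-R1-Distill-Llama-8B) pairs.
scores = {
    "ARC (5-shot)": (34.2, 32.7),
    "ARC-Challenge (5-shot)": (40.5, 21.4),
    "BigBench-Hard (3-shot)": (73.4, 72.2),
}


def score_gains(table):
    """Return the absolute gain (in points) of the first score over the second."""
    # Rounding to one decimal removes floating-point noise from the subtraction.
    return {name: round(ours - base, 1) for name, (ours, base) in table.items()}


gains = score_gains(scores)
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
```

On these numbers the largest gain is on ARC-Challenge, while the other two benchmarks move by under two points.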

Quick Start

The following example uses Align-DS-V to solve a math problem shown in the demo image (./assets/demo.jpg).

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "PKU-Alignment/Align-DS-V"
# Load the weights in half precision and move the model to GPU 0 (requires a CUDA device).
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt.
# Each "content" value must be a list of dicts whose "type" is "text" or "image".
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the result of this problem?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "./assets/demo.jpg" # in this repo
raw_image = Image.open(image_file)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))

# <think>To solve the problem, I will first interpret the image to understand what
# mathematical operation is being represented. Then, I will perform the calculation
# based on the numbers provided in the image and confirm the result. The image shows
# a chalkboard with the equation \(18 + 23 = 41\) written on it. The numbers 18 and
# 23 are in light blue, and the result 41 is in light green. The equation \(18 + 23 = 41\)
# is presented on the chalkboard. To solve this, I will add the two numbers on the
# left side of the equation: 18 and 23. Adding these together, \(18 + 23\), I calculate
# that the sum is 41. This matches the number on the right side of the equation,
# confirming its correctness.</think>41
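
The sample output above wraps the reasoning trace in `<think>...</think>` and places the final answer after the closing tag. A minimal sketch for separating the two is given below. It assumes the decoded text keeps the tags as in the sample output, and `split_reasoning` is a hypothetical helper, not a function provided by this repo or by transformers.

```python
import re

# Hypothetical helper (not part of the Align-DS-V repo): separates the
# reasoning trace from the final answer in a decoded generation.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(decoded: str) -> tuple[str, str]:
    """Return (reasoning, answer) from text shaped like '<think>...</think>answer'.

    If no complete <think>...</think> pair is found, the reasoning is empty
    and the whole stripped string is treated as the answer.
    """
    match = _THINK_RE.search(decoded)
    if match is None:
        return "", decoded.strip()
    reasoning = match.group(1).strip()
    # Everything after the closing tag is treated as the final answer.
    answer = decoded[match.end():].strip()
    return reasoning, answer


sample = "<think>Adding 18 and 23 gives 41.</think>41"
reasoning, answer = split_reasoning(sample)
print(answer)  # prints 41
```

Because the answer is taken from the text after `</think>`, any prompt text that precedes `<think>` in the decoded string does not affect the result.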

Citation

The reproduction script for Align-DS-V will be released in the align-anything repository.

Please cite the repo if you find the model or code in this repo useful 😊

@inproceedings{ji2024align,
  title={Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback},
  author={Jiaming Ji and Jiayi Zhou and Hantao Lou and Boyuan Chen and Donghai Hong and Xuyao Wang and Wenqi Chen and Kaile Wang and Rui Pan and Jiahao Li and Mohan Wang and Josef Dai and Tianyi Qiu and Hua Xu and Dong Li and Weipeng Chen and Jun Song and Bo Zheng and Yaodong Yang},
  year={2024},
  url={https://arxiv.org/abs/2412.15838}
}
