---
license: mit
license_link: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/resolve/main/LICENSE
language:
- multilingual
pipeline_tag: text-generation
tags:
- nlp
- code
- vision
widget:
- messages:
  - role: user
    content: <|image_1|>\nWhat action should the robot take to {lang}?
---

## TraceVLA-7B

The ``TraceVLA-7B`` model is a vision-language-action model obtained by finetuning the base [OpenVLA](https://huggingface.co/openvla/openvla-7b) model with the [visual trace prompting](https://arxiv.org/abs/2412.10345) technique.

### Results on SimplerEnv

Results on the Fractal and Bridge suites of SimplerEnv:

#### Fractal:

| Policy/Settings | Pick up Coke | Move near | Open/Close Drawer | Put in Drawer | Average Success Rate |
|:------:|:------------:|:---------:|:------------:|:-----------:|:-------:|
| (Visual Matching) OpenVLA-7B | 23.7% | **65.0%** | 57.4% | 0.0% | 36.5% |
| (Visual Matching) TraceVLA-7B | **45.0%** | 63.8% | **63.1%** | **11.1%** | **45.8%** |
| (Variant Aggregation) OpenVLA-7B | 61.3% | 55.8% | 24.9% | 1.0% | 35.8% |
| (Variant Aggregation) TraceVLA-7B | **64.3%** | **60.6%** | **61.6%** | **12.5%** | **49.8%** |

#### Bridge:

| Policy/Settings | Put Spoon | Put Carrot | Stack Block | Put Eggplant | Average Success Rate |
|:------:|:------------:|:---------:|:------------:|:-----------:|:-------:|
| OpenVLA-7B | 8.3% | 8.3% | 4.2% | 45.8% | 16.7% |
| TraceVLA-7B | **12.5%** | **16.6%** | **16.6%** | **65.0%** | **27.7%** |

### Sample Inference Code

Here is sample inference code for the TraceVLA-7B model:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "furonghuang-lab/tracevla_7b"

# Load Processor & VLA
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
    num_crops=1,
)
vla = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation='flash_attention_2',
    use_cache=True
).to(device='cuda')

# Load Visual Trace Processor
# cotracker_model_path corresponds to the path to your downloaded scaled_offline.pth checkpoint
from prismatic.eval.trace_processor import TraceProcessor
trace_processor = TraceProcessor(cotracker_model_path)

# Grab image input & format prompt
# In cases where the visual trace returned by Co-Tracker is not valid, we fall back to the default OpenVLA prompt.
openvla_prompt_template = "In: What action should the robot take to {task_description}?\nOut:"
tracevla_prompt_template = "In: You are given two images: one with the original robot observation, and another one marked with historical traces of the robot end effector and moving objects, separated by a special separator token. What action should the robot take to {task_description}?\nOut:"

image: Image.Image = get_from_camera(...)
image_overlaid, has_trace = trace_processor.process_image(image)
if not has_trace:
    prompt = openvla_prompt_template.format(task_description=task_description)
    inputs = processor(prompt, [image, image]).to(device='cuda', dtype=torch.bfloat16)
else:
    prompt = tracevla_prompt_template.format(task_description=task_description)
    inputs = processor(prompt, [image, image_overlaid]).to(device='cuda', dtype=torch.bfloat16)

# Predict the action
with torch.inference_mode():
    action = vla.predict_action(**inputs)

# Execute the action
robot.act(action, ...)
```

For more examples, including scripts for finetuning TraceVLA models on your own robot demonstration datasets, check out our [repository](https://github.com/FrankZheng2022/tracevla).
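A note on the snippet above: this card does not spell out the format of the predicted `action`. The minimal sketch below assumes TraceVLA keeps the base OpenVLA convention of a 7-dimensional continuous action (end-effector translation delta, rotation delta, and a gripper command); `action` and `robot` are the placeholders from the snippet above, and the unpacking is only an illustration to verify against your own controller, not a confirmed API.

```python
import numpy as np

# Assumption (not confirmed by this card): like the base OpenVLA model,
# `predict_action` returns a 7-D continuous action of the form
# [dx, dy, dz, droll, dpitch, dyaw, gripper]. Check the ordering and
# scaling against your robot's action space before running on hardware.
action = np.asarray(action, dtype=np.float32)
assert action.shape == (7,), f"unexpected action shape: {action.shape}"

delta_position = action[:3]   # end-effector translation delta (x, y, z)
delta_rotation = action[3:6]  # end-effector rotation delta (roll, pitch, yaw)
gripper_command = action[6]   # gripper open/close command

# Pass the raw vector to your controller as in the snippet above, or feed it
# the unpacked components, depending on the controller's interface.
robot.act(action, ...)
```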
### Citation

If you find our code or models useful in your work, please cite [our paper](https://arxiv.org/abs/2412.10345):

```bibtex
@misc{zheng2024tracevlavisualtraceprompting,
      title={TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies},
      author={Ruijie Zheng and Yongyuan Liang and Shuaiyi Huang and Jianfeng Gao and Hal Daumé III and Andrey Kolobov and Furong Huang and Jianwei Yang},
      year={2024},
      eprint={2412.10345},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2412.10345},
}
```