---
library_name: transformers
license: apache-2.0
---

# SynthPose (Transformers 🤗 VitPose Huge variant)

The SynthPose model was proposed in [OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics](https://arxiv.org/abs/2406.09788) by Yoni Gozlan, Antoine Falisse, Scott Uhlrich, Anthony Gatti, Michael Black, and Akshay Chaudhari.

This model was contributed by [Yoni Gozlan](https://huggingface.co/yonigozlan).

# Intended use cases

This model uses a VitPose Huge backbone. SynthPose is a new approach that uses synthetic data to finetune pre-trained 2D human pose models to predict an arbitrarily denser set of keypoints for accurate kinematic analysis. More details are available in [OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics](https://arxiv.org/abs/2406.09788). This particular variant was finetuned on a set of keypoints commonly found in motion capture setups, and includes the COCO keypoints as well.

The model predicts the following 52 markers:

```py
{
    0: "Nose",
    1: "L_Eye",
    2: "R_Eye",
    3: "L_Ear",
    4: "R_Ear",
    5: "L_Shoulder",
    6: "R_Shoulder",
    7: "L_Elbow",
    8: "R_Elbow",
    9: "L_Wrist",
    10: "R_Wrist",
    11: "L_Hip",
    12: "R_Hip",
    13: "L_Knee",
    14: "R_Knee",
    15: "L_Ankle",
    16: "R_Ankle",
    17: "sternum",
    18: "rshoulder",
    19: "lshoulder",
    20: "r_lelbow",
    21: "l_lelbow",
    22: "r_melbow",
    23: "l_melbow",
    24: "r_lwrist",
    25: "l_lwrist",
    26: "r_mwrist",
    27: "l_mwrist",
    28: "r_ASIS",
    29: "l_ASIS",
    30: "r_PSIS",
    31: "l_PSIS",
    32: "r_knee",
    33: "l_knee",
    34: "r_mknee",
    35: "l_mknee",
    36: "r_ankle",
    37: "l_ankle",
    38: "r_mankle",
    39: "l_mankle",
    40: "r_5meta",
    41: "l_5meta",
    42: "r_toe",
    43: "l_toe",
    44: "r_big_toe",
    45: "l_big_toe",
    46: "l_calc",
    47: "r_calc",
    48: "C7",
    49: "L2",
    50: "T11",
    51: "T6",
}
```

The first 17 keypoints are the COCO keypoints, and the next 35 are anatomical markers (a short sketch at the end of this card shows how to split the two groups).

# Usage

## Image inference

Here's how to load the model and run inference on an image:

```py
import torch
import requests
import numpy as np

from PIL import Image

from transformers import (
    AutoProcessor,
    RTDetrForObjectDetection,
    VitPoseForPoseEstimation,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

url = "http://farm4.staticflickr.com/3300/3416216247_f9c6dfc939_z.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# ------------------------------------------------------------------------
# Stage 1. Detect humans in the image
# ------------------------------------------------------------------------

# You can use any object detector of your choice
person_image_processor = AutoProcessor.from_pretrained("PekingU/rtdetr_r50vd_coco_o365")
person_model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd_coco_o365", device_map=device)

inputs = person_image_processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = person_model(**inputs)

results = person_image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([(image.height, image.width)]), threshold=0.3
)
result = results[0]  # take results for the first image

# The "person" class has label 0 in the COCO dataset
person_boxes = result["boxes"][result["labels"] == 0]
person_boxes = person_boxes.cpu().numpy()

# Convert boxes from VOC (x1, y1, x2, y2) to COCO (x1, y1, w, h) format
person_boxes[:, 2] = person_boxes[:, 2] - person_boxes[:, 0]
person_boxes[:, 3] = person_boxes[:, 3] - person_boxes[:, 1]
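
# Optional sanity check (not part of the original card): `person_boxes` is now
# an (N, 4) NumPy array of COCO-format (x, y, width, height) boxes, one row per
# detected person, which is exactly the layout the pose processor below expects
print(f"Detected {len(person_boxes)} person box(es)")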

# ------------------------------------------------------------------------
# Stage 2. Detect keypoints for each person found
# ------------------------------------------------------------------------

image_processor = AutoProcessor.from_pretrained("yonigozlan/synthpose-vitpose-huge-hf")
model = VitPoseForPoseEstimation.from_pretrained("yonigozlan/synthpose-vitpose-huge-hf", device_map=device)

inputs = image_processor(image, boxes=[person_boxes], return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

pose_results = image_processor.post_process_pose_estimation(outputs, boxes=[person_boxes])
image_pose_result = pose_results[0]  # results for the first image
```

### Visualization for supervision users

```py
import supervision as sv

xy = torch.stack([pose_result['keypoints'] for pose_result in image_pose_result]).cpu().numpy()
scores = torch.stack([pose_result['scores'] for pose_result in image_pose_result]).cpu().numpy()

key_points = sv.KeyPoints(
    xy=xy, confidence=scores
)

vertex_annotator = sv.VertexAnnotator(
    color=sv.Color.PINK,
    radius=2
)

annotated_frame = vertex_annotator.annotate(
    scene=image.copy(),
    key_points=key_points
)
annotated_frame
```
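
### Visualization with matplotlib

If you would rather avoid the extra dependency, a minimal matplotlib sketch achieves a similar overlay. It assumes the `image_pose_result` computed above; the 0.3 confidence cutoff is an arbitrary choice, not something prescribed by the model.

```py
import matplotlib.pyplot as plt

plt.imshow(image)
for pose_result in image_pose_result:
    keypoints = pose_result["keypoints"].cpu().numpy()
    scores = pose_result["scores"].cpu().numpy()
    keep = scores > 0.3  # arbitrary confidence cutoff, tune to taste
    plt.scatter(keypoints[keep, 0], keypoints[keep, 1], s=8, c="magenta")
plt.axis("off")
plt.show()
```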
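
### Splitting COCO keypoints from anatomical markers

Since the marker table above lists the 17 COCO keypoints first and the 35 anatomical markers after them, the per-person predictions can be split by index. A minimal sketch, again using `image_pose_result` from above (the variable names are illustrative, not part of the API):

```py
for pose_result in image_pose_result:
    coco_keypoints = pose_result["keypoints"][:17]      # Nose ... R_Ankle
    anatomical_markers = pose_result["keypoints"][17:]  # sternum ... T6
    print(coco_keypoints.shape, anatomical_markers.shape)  # (17, 2) (35, 2)
```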