---
title: OmniScience -- Agentic Imaging Analysis
emoji: 🔬🧫
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: openrail
---
# Using Landing AI's Vision Agent to architect an app for brain tumor detection
- a quick overview of the inner workings of LandingAI's Vision Agent and how it breaks down an initial user requirement to identify candidate components in the application architecture.
- the diagram below captures what I had in mind for a multi-agent system implementation -- but LandingAI's Vision Agent starts this much earlier, taking a fresh approach to old-school architecture trade-off analysis.
- the design-time flow in the most recent version of the official [Vision Agent](https://va.landing.ai/agent) app has shifted somewhat, but the number of concepts it helped bring together for me was amazing.
- if you want a deeper understanding of the run-time flow of the application, I encourage you to instrument it with Weave; a minimal sketch follows the diagram below. Additional information on how to instrument the app can be found in [this GitHub repo](https://github.com/donbr/vision-agent).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/653d62fab16f657d28ce2cf2/KPV1Szj6IkY457n3Hqjl6.png)
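The sketch below shows the kind of Weave instrumentation I mean: initialize a project, then wrap the function you want traced with `weave.op`. The project name and the wrapped function are placeholders, not the actual code from the repo.

```python
# Minimal Weave tracing sketch -- project name and wrapped function are
# placeholders; see the linked GitHub repo for the actual instrumentation.
import weave

weave.init("vision-agent-demo")  # hypothetical project name

@weave.op()
def detect_tumors(image_path: str) -> list:
    # Call into the Vision Agent pipeline here; Weave records inputs,
    # outputs, and latency for every traced call.
    return []
```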
# Summary of Vision Agent design-time workflow
MY PROMPT: use detection (bounding box) and segmentation (segmentation mask) techniques to detect brain tumors in the image.
## Creating instructions
- Load the image from the given file path '/home/user/xxxx.jpg'.
- Use the 'owl_v2' tool to detect brain tumors in the image. The prompt should be 'brain tumor'.
- Use the 'grounding_sam' tool to segment brain tumors in the image. The prompt should be 'brain tumor'.
- Overlay the bounding boxes from the detection results on the original image using the 'overlay_bounding_boxes' utility.
- Overlay the segmentation masks from the segmentation results on the original image using the 'overlay_segmentation_masks' utility.
- Save the final image with both bounding boxes and segmentation masks to a specified output path (a sketch of the resulting code follows below).
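Taken together, these instructions map onto a short script. The sketch below is my reconstruction of what the agent generates, assuming the `vision_agent.tools` import path; it is not the agent's verbatim output, and the paths are placeholders.

```python
# Reconstruction of the planned pipeline -- not the agent's verbatim output.
from vision_agent.tools import (
    load_image,
    save_image,
    owl_v2,
    grounding_sam,
    overlay_bounding_boxes,
    overlay_segmentation_masks,
)

def detect_and_segment_brain_tumors(image_path: str, output_path: str):
    image = load_image(image_path)

    # Detection: bounding boxes with normalized coordinates.
    detections = owl_v2("brain tumor", image)

    # Segmentation: bounding boxes plus binary masks.
    segments = grounding_sam("brain tumor", image)

    # Overlay boxes first, then masks, on the same image.
    result = overlay_bounding_boxes(image, detections)
    result = overlay_segmentation_masks(result, segments)

    save_image(result, output_path)
    return result
```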
## Retrieving Vision Agent tools
- 'load_image' is a utility function that loads an image from the given file path string.
- 'save_image' is a utility function that saves an image to a file path.
- 'owl_v2' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.
- 'florencev2_object_detection' is a tool that can detect common objects in an image without any text prompt or thresholding. It returns a list of detected objects as labels and their location as bounding boxes.
- 'grounding_sam' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas or periods. It returns a list of bounding boxes, label names, mask file names and associated probability scores.
- 'detr_segmentation' is a tool that can segment common objects in an image without any text prompt. It returns a list of detected objects as labels, their regions as masks and their scores.
- 'overlay_bounding_boxes' is a utility function that displays bounding boxes on an image.
- 'overlay_heat_map' is a utility function that displays a heat map on an image.
- 'overlay_segmentation_masks' is a utility function that displays segmentation masks.
### Retrieving tools - detailed notes from Vision Agent tool selection
load_image(image_path: str) -> numpy.ndarray:
'load_image' is a utility function that loads an image from the given file path string.
Parameters:
image_path (str): The path to the image.
Returns:
np.ndarray: The image as a NumPy array.
Example
-------
>>> load_image("path/to/image.jpg")
save_image(image: numpy.ndarray, file_path: str) -> None:
'save_image' is a utility function that saves an image to a file path.
Parameters:
image (np.ndarray): The image to save.
file_path (str): The path to save the image file.
Example
-------
>>> save_image(image, "path/to/image.jpg")
owl_v2(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1, iou_threshold: float = 0.1) -> List[Dict[str, Any]]:
'owl_v2' is a tool that can detect and count multiple objects given a text
prompt such as category names or referring expressions. The categories in text prompt
are separated by commas. It returns a list of bounding boxes with
normalized coordinates, label names and associated probability scores.
Parameters:
prompt (str): The prompt to ground to the image.
image (np.ndarray): The image to ground the prompt to.
box_threshold (float, optional): The threshold for the box detection. Defaults
to 0.10.
iou_threshold (float, optional): The threshold for the Intersection over Union
(IoU). Defaults to 0.10.
Returns:
List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
bounding box of the detected objects with normalized coordinates between 0
and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
top-left and xmax and ymax are the coordinates of the bottom-right of the
bounding box.
Example
-------
>>> owl_v2("car. dinosaur", image)
[
{'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]},
{'score': 0.98, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
]
florencev2_object_detection(image: numpy.ndarray) -> List[Dict[str, Any]]:
'florencev2_object_detection' is a tool that can detect common objects in an
image without any text prompt or thresholding. It returns a list of detected objects
as labels and their location as bounding boxes.
Parameters:
image (np.ndarray): The image used to detect objects.
Returns:
List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
bounding box of the detected objects with normalized coordinates between 0
and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
top-left and xmax and ymax are the coordinates of the bottom-right of the
bounding box. The scores are always 1.0 and cannot be thresholded.
Example
-------
>>> florencev2_object_detection(image)
[
{'score': 1.0, 'label': 'window', 'bbox': [0.1, 0.11, 0.35, 0.4]},
{'score': 1.0, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
{'score': 1.0, 'label': 'person', 'bbox': [0.34, 0.21, 0.85, 0.5]},
]
grounding_sam(prompt: str, image: numpy.ndarray, box_threshold: float = 0.2, iou_threshold: float = 0.2) -> List[Dict[str, Any]]:
'grounding_sam' is a tool that can segment multiple objects given a
text prompt such as category names or referring expressions. The categories in text
prompt are separated by commas or periods. It returns a list of bounding boxes,
label names, mask file names and associated probability scores.
Parameters:
prompt (str): The prompt to ground to the image.
image (np.ndarray): The image to ground the prompt to.
box_threshold (float, optional): The threshold for the box detection. Defaults
to 0.20.
iou_threshold (float, optional): The threshold for the Intersection over Union
(IoU). Defaults to 0.20.
Returns:
List[Dict[str, Any]]: A list of dictionaries containing the score, label,
bounding box, and mask of the detected objects with normalized coordinates
(xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
and xmax and ymax are the coordinates of the bottom-right of the bounding box.
The mask is a binary 2D numpy array where 1 indicates the object and 0 indicates
the background.
Example
-------
>>> grounding_sam("car. dinosaur", image)
[
{
'score': 0.99,
'label': 'dinosaur',
'bbox': [0.1, 0.11, 0.35, 0.4],
'mask': array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
},
]
detr_segmentation(image: numpy.ndarray) -> List[Dict[str, Any]]:
'detr_segmentation' is a tool that can segment common objects in an
image without any text prompt. It returns a list of detected objects
as labels, their regions as masks and their scores.
Parameters:
image (np.ndarray): The image used to segment things and objects
Returns:
List[Dict[str, Any]]: A list of dictionaries containing the score, label
and mask of the detected objects. The mask is a binary 2D numpy array where 1
indicates the object and 0 indicates the background.
Example
-------
>>> detr_segmentation(image)
[
{
'score': 0.45,
'label': 'window',
'mask': array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
},
{
'score': 0.70,
'label': 'bird',
'mask': array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
},
]
overlay_bounding_boxes(image: numpy.ndarray, bboxes: List[Dict[str, Any]]) -> numpy.ndarray:
'overlay_bounding_boxes' is a utility function that displays bounding boxes on
an image.
Parameters:
image (np.ndarray): The image to display the bounding boxes on.
bboxes (List[Dict[str, Any]]): A list of dictionaries containing the bounding
boxes.
Returns:
np.ndarray: The image with the bounding boxes, labels and scores displayed.
Example
-------
>>> image_with_bboxes = overlay_bounding_boxes(
image, [{'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]}],
)
overlay_heat_map(image: numpy.ndarray, heat_map: Dict[str, Any], alpha: float = 0.8) -> numpy.ndarray:
'overlay_heat_map' is a utility function that displays a heat map on an image.
Parameters:
image (np.ndarray): The image to display the heat map on.
heat_map (Dict[str, Any]): A dictionary containing the heat map under the key
'heat_map'.
alpha (float, optional): The transparency of the overlay. Defaults to 0.8.
Returns:
np.ndarray: The image with the heat map displayed.
Example
-------
>>> image_with_heat_map = overlay_heat_map(
image,
{
'heat_map': array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 125, 125, 125]], dtype=uint8),
},
)
overlay_segmentation_masks(image: numpy.ndarray, masks: List[Dict[str, Any]]) -> numpy.ndarray:
'overlay_segmentation_masks' is a utility function that displays segmentation
masks.
Parameters:
image (np.ndarray): The image to display the masks on.
masks (List[Dict[str, Any]]): A list of dictionaries containing the masks.
Returns:
np.ndarray: The image with the masks displayed.
Example
-------
>>> image_with_masks = overlay_segmentation_masks(
image,
[{
'score': 0.99,
'label': 'dinosaur',
'mask': array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
}],
)
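One detail worth flagging from the docstrings above: every bounding box uses normalized coordinates in [0, 1]. The overlay utilities handle the scaling for you, but if you post-process detections yourself you need to convert to pixels. A small helper of my own (not part of the toolkit):

```python
import numpy as np

def denormalize_bbox(bbox: list, image: np.ndarray) -> list:
    """Scale a normalized (xmin, ymin, xmax, ymax) box to pixel coordinates."""
    height, width = image.shape[:2]
    xmin, ymin, xmax, ymax = bbox
    return [int(xmin * width), int(ymin * height),
            int(xmax * width), int(ymax * height)]

# Example: keep only confident detections, then convert to pixels.
# boxes_px = [denormalize_bbox(d["bbox"], image)
#             for d in detections if d["score"] > 0.5]
```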
## Vision Agent Tools - model summary
- any mistakes in the following table are mine; it reflects my quick reverse engineering to identify the target models.
| Model Name | Hugging Face Model | Primary Function | Use Cases |
|---------------------|-------------------------------------|-------------------------------|--------------------------------------------------------------|
| OWL-ViT v2 | google/owlv2-base-patch16-ensemble | Object detection and localization | - Open-world object detection<br>- Locating specific objects based on text prompts |
| Florence-2          | microsoft/Florence-2-base           | Multi-purpose vision tasks    | - Image captioning<br>- Visual question answering<br>- Object detection |
| Depth Anything V2 | LiheYoung/depth-anything-v2-small | Depth estimation | - Estimating depth in images<br>- Generating depth maps |
| CLIP | openai/clip-vit-base-patch32 | Image-text similarity | - Zero-shot image classification<br>- Image-text matching |
| BLIP | Salesforce/blip-image-captioning-base | Image captioning | - Generating text descriptions of images |
| LOCA | Custom implementation | Object counting | - Zero-shot object counting<br>- Object counting with visual prompts |
| GIT v2 | microsoft/git-base-vqav2 | Visual question answering and image captioning | - Answering questions about image content<br>- Generating text descriptions of images |
| Grounding DINO | groundingdino/groundingdino-swint-ogc | Object detection and localization | - Detecting objects based on text prompts |
| SAM | facebook/sam-vit-huge | Instance segmentation | - Text-prompted instance segmentation |
| DETR | facebook/detr-resnet-50 | Object detection | - General object detection |
| ViT | google/vit-base-patch16-224 | Image classification | - General image classification<br>- NSFW content detection |
| DPT | Intel/dpt-hybrid-midas | Monocular depth estimation | - Estimating depth from single images |
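If the OWL-ViT v2 row in the table is right, the detection step can be reproduced directly with the `transformers` zero-shot object detection pipeline. Treat this as a sanity check on the mapping, not the agent's actual code path; the image path is a placeholder.

```python
# Sanity check on the (assumed) underlying detector for 'owl_v2'.
from transformers import pipeline

detector = pipeline(
    "zero-shot-object-detection",
    model="google/owlv2-base-patch16-ensemble",
)
results = detector("path/to/mri_scan.jpg", candidate_labels=["brain tumor"])
# Each result looks like:
# {'score': ..., 'label': 'brain tumor',
#  'box': {'xmin': ..., 'ymin': ..., 'xmax': ..., 'ymax': ...}}  # pixel coords
```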