---
library_name: transformers
license: mit
datasets:
- SpursgoZmy/MMTab
- apoidea/pubtabnet-html
language:
- en
base_model: google/pix2struct-base
---

# pix2struct-base-table2html

*Turn table images into HTML!*


## Demo app

Try the [demo app](), which combines both table detection and table recognition!


## About

This model takes an image of a table and outputs HTML: it parses the image, performing both optical character recognition (OCR) and structure recognition, and renders the result as HTML.

The model expects an image containing only a table. If the table is embedded in a document, first use a table detection model to extract it.
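
For example, a table detection model such as [Table Transformer](https://huggingface.co/microsoft/table-transformer-detection) can be used to locate and crop tables before recognition. Below is a minimal sketch (the detection checkpoint, input file name, and threshold are illustrative and not part of this model):

```python
import torch
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from PIL import Image

# Illustrative choice: any table detection model will do
det_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
det_model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")
det_model.eval()

page = Image.open("document_page.png").convert("RGB")  # hypothetical input file
inputs = det_processor(images=page, return_tensors="pt")
with torch.inference_mode():
    outputs = det_model(**inputs)

# Convert raw outputs to boxes in image coordinates; target_sizes is (height, width)
target_sizes = torch.tensor([page.size[::-1]])
results = det_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]

# Crop each detected table; the crops can then be passed to this model
table_crops = [page.crop(box.tolist()) for box in results["boxes"]]
```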

The model is fine-tuned from the [Pix2Struct base model](https://huggingface.co/google/pix2struct-base) with `max_patches` set to 1024 and a maximum generation length of 1024 tokens. For inference, `max_patches` should likely be kept at 1024, but the generation length can be changed.

The model has been trained using two datasets: [MMTab](https://huggingface.co/datasets/SpursgoZmy/MMTab) and [PubTabNet](https://huggingface.co/datasets/apoidea/pubtabnet-html).

## Usage

Below is a complete example of loading the model and performing inference on an example table image (example from the [MMTab dataset](https://huggingface.co/datasets/SpursgoZmy/MMTab)):

```python
import torch
from transformers import AutoProcessor, Pix2StructForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO

# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("pix2struct-base-table2html")
model = Pix2StructForConditionalGeneration.from_pretrained("pix2struct-base-table2html")
model.to(device)
model.eval()

# Load example image from URL
url = "https://example.com/path_to_table_image.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# Run model inference
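# max_patches=1024 matches the setting used during fine-tuning; keep it for inference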
encoding = processor(image, return_tensors="pt", max_patches=1024)
with torch.inference_mode():
    flattened_patches = encoding.pop("flattened_patches").to(device)
    attention_mask = encoding.pop("attention_mask").to(device)
    predictions = model.generate(flattened_patches=flattened_patches, attention_mask=attention_mask, max_new_tokens=1024)

predictions_decoded = processor.tokenizer.batch_decode(predictions, skip_special_tokens=True)

# Show predictions as text
print(predictions_decoded[0])
```
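
The decoded output is plain HTML. To work with the recognized table programmatically, one option is to parse it with pandas (a sketch; assumes the generated HTML contains a well-formed `<table>` element and that pandas with a parser backend such as lxml is installed):

```python
from io import StringIO
import pandas as pd

# read_html returns one DataFrame per <table> element found in the input
tables = pd.read_html(StringIO(predictions_decoded[0]))
print(tables[0])
```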