---
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- mistral
- inferentia2
- neuron
- neuronx
license: apache-2.0
---
# Neuronx for [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) - Updated Mistral 7B Model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) Using AWS Neuron SDK version 2.18

This model has been exported to the `neuron` format using the specific `input_shapes` and compiler parameters detailed in the sections below.

Please refer to the 🤗 `optimum-neuron` [documentation](https://huggingface.co/docs/optimum-neuron/main/en/guides/models#configuring-the-export-of-a-generative-model) for an explanation of these parameters.

Note: to compile mistralai/Mistral-7B-Instruct-v0.2 on Inf2, you need to update `sliding_window` in the model config (either in the `config.json` file or on the loaded config object) from `null` to its default of 4096.
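
One way to apply that change, as a minimal sketch, is to patch `config.json` in a local copy of the checkpoint before exporting (the local path below is a placeholder):

```python
# Sketch: set sliding_window in a local copy of the checkpoint before export.
# "./Mistral-7B-Instruct-v0.2" is a placeholder path to a local copy of the model repo.
import json

config_path = "./Mistral-7B-Instruct-v0.2/config.json"

with open(config_path) as f:
    config = json.load(f)

if config.get("sliding_window") is None:   # v0.2 ships with "sliding_window": null
    config["sliding_window"] = 4096        # default window size expected by the compiler

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```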

## Usage with 🤗 `TGI`
Refer to the [neuronx-tgi](https://gallery.ecr.aws/shtian/neuronx-tgi) container image on the Amazon ECR Public Gallery.
```shell
export HF_TOKEN="hf_xxx"

# Pass all 12 Neuron devices (24 NeuronCores) through to the container
docker run -d -p 8080:80 \
       --name mistral-7b-neuronx-tgi \
       -v $(pwd)/data:/data \
       --device=/dev/neuron0 \
       --device=/dev/neuron1 \
       --device=/dev/neuron2 \
       --device=/dev/neuron3 \
       --device=/dev/neuron4 \
       --device=/dev/neuron5 \
       --device=/dev/neuron6 \
       --device=/dev/neuron7 \
       --device=/dev/neuron8 \
       --device=/dev/neuron9 \
       --device=/dev/neuron10 \
       --device=/dev/neuron11 \
       -e HF_TOKEN=${HF_TOKEN} \
       public.ecr.aws/shtian/neuronx-tgi:latest \
       --model-id davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18 \
       --max-batch-size 4 \
       --max-input-length 16 \
       --max-total-tokens 32
```
There does not appear to be support for sending a list of prompts to the server in a single request (see this [GitHub issue](https://github.com/huggingface/text-generation-inference/issues/1008)), so the example below issues the requests concurrently from the client instead.

```python
from huggingface_hub import InferenceClient
import concurrent.futures

client = InferenceClient(model="http://127.0.0.1:8080")
batch_text = ["1+1=", "2+2=", "3+3=", "4+4="]

bs = 4

def format_text_list(text_list):
    # Wrap each prompt in the Mistral instruction template.
    return ['[INST] ' + text + ' [/INST]' for text in text_list]

def gen_text(text):
    # Send a single generation request to the TGI server.
    return client.text_generation(text, max_new_tokens=16)

with concurrent.futures.ThreadPoolExecutor(max_workers=bs) as executor:
    out = list(executor.map(gen_text, format_text_list(batch_text)))

print(out)
```

## Usage with 🤗 `optimum-neuron pipeline`

```python
from optimum.neuron import pipeline

p = pipeline('text-generation', 'davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18')
p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)

[{'generated_text': "My favorite place on earth is probably Paris, France, and if I were to go there
now I would take my partner on a romantic getaway where we could lay on the grass in the park,
eat delicious French cheeses and wine, and watch the sunset on the Seine river.'"}]
```

## Usage with 🤗 `optimum-neuron NeuronModelForCausalLM`

```python
import torch
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained("davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token_id = tokenizer.eos_token_id

def model_sample(input_prompt):
    # Wrap the prompt in the Mistral instruction template.
    input_prompt = "[INST] " + input_prompt + " [/INST]"

    tokens = tokenizer(input_prompt, return_tensors="pt")

    with torch.inference_mode():
        sample_output = model.generate(
            **tokens,
            do_sample=True,
            min_length=16,
            max_length=32,
            temperature=0.5,
            pad_token_id=tokenizer.eos_token_id
        )
        outputs = [tokenizer.decode(tok, skip_special_tokens=True) for tok in sample_output]

    # skip_special_tokens=True already removes </s>, so only whitespace needs trimming.
    res = outputs[0].split('[/INST]')[1].strip()
    return res + "\n"

print(model_sample("how are you today?"))
```

This repository contains tags specific to versions of `neuronx`. When using it with 🤗 `optimum-neuron`, load the repository revision that matches the version of `neuronx` you are running so that the correct serialized checkpoints are used.
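
As a minimal sketch, a revision can be selected with the standard `revision` argument of `from_pretrained`; the tag name below is a placeholder, not an actual tag in this repository:

```python
from optimum.neuron import NeuronModelForCausalLM

# "neuronx-x.y.z" is a placeholder; replace it with the tag matching your neuronx version.
model = NeuronModelForCausalLM.from_pretrained(
    "davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18",
    revision="neuronx-x.y.z",
)
```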

## Arguments passed during export

**input_shapes**

```json
{
  "batch_size": 4,
  "sequence_length": 2048,
}
```

**compiler_args**

```json
{
  "auto_cast_type": "bf16",
  "num_cores": 24,
}
```
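
For reference, a minimal sketch of an equivalent export call with 🤗 `optimum-neuron`, assuming a locally patched copy of the checkpoint as described in the `sliding_window` note above (paths are placeholders):

```python
from optimum.neuron import NeuronModelForCausalLM

# Sketch: export with the same input shapes and compiler arguments listed above.
compiler_args = {"num_cores": 24, "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 4, "sequence_length": 2048}

model = NeuronModelForCausalLM.from_pretrained(
    "./Mistral-7B-Instruct-v0.2",  # placeholder: local copy with sliding_window patched
    export=True,
    **compiler_args,
    **input_shapes,
)
model.save_pretrained("./Mistral-7B-Instruct-v0.2-neuron")
```

Exporting with `num_cores: 24` requires an instance that exposes 24 NeuronCores, matching the 12 Neuron devices passed to the TGI container above.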