---
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- mistral
- inferentia2
- neuron
- neuronx
license: apache-2.0
---
# Neuronx for [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) - Updated Mistral 7B Model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) Using AWS Neuron SDK version 2.18
This model has been exported to the `neuron` format using the specific `input_shapes` and `compiler` parameters detailed at the end of this card.
Please refer to the 🤗 `optimum-neuron` [documentation](https://huggingface.co/docs/optimum-neuron/main/en/guides/models#configuring-the-export-of-a-generative-model) for an explanation of these parameters.
Note: To compile mistralai/Mistral-7B-Instruct-v0.2 on Inf2, you need to update the model config's `sliding_window` value (either in `config.json` or on the loaded config object) from `null` to the default of 4096.
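For reference, the compilation can be reproduced with the 🤗 `optimum-neuron` Python API. The snippet below is a minimal sketch, assuming the `export=True` path of `NeuronModelForCausalLM.from_pretrained`; the local directory names are illustrative.
```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download
from optimum.neuron import NeuronModelForCausalLM

# Download the original checkpoint so that config.json can be patched locally
# (the local directory name is illustrative).
local_dir = snapshot_download("mistralai/Mistral-7B-Instruct-v0.2", local_dir="mistral-7b-instruct-src")

# The upstream config ships with "sliding_window": null; set it to the default
# 4096 before compiling for Inf2 (see the note above).
config_path = Path(local_dir) / "config.json"
config = json.loads(config_path.read_text())
config["sliding_window"] = 4096
config_path.write_text(json.dumps(config, indent=2))

# Compile with the input_shapes and compiler_args listed at the end of this card.
neuron_model = NeuronModelForCausalLM.from_pretrained(
    local_dir,
    export=True,
    batch_size=4,
    sequence_length=2048,
    num_cores=24,
    auto_cast_type="bf16",
)
neuron_model.save_pretrained("Mistral-7B-Instruct-v0.2-neuron")
```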
## Usage with 🤗 `TGI`
Refer to the [neuronx-tgi](https://gallery.ecr.aws/shtian/neuronx-tgi) container image on the Amazon ECR Public Gallery.
```shell
export HF_TOKEN="hf_xxx"
docker run -d -p 8080:80 \
  --name mistral-7b-neuronx-tgi \
  -v $(pwd)/data:/data \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  --device=/dev/neuron6 \
  --device=/dev/neuron7 \
  --device=/dev/neuron8 \
  --device=/dev/neuron9 \
  --device=/dev/neuron10 \
  --device=/dev/neuron11 \
  -e HF_TOKEN=${HF_TOKEN} \
  public.ecr.aws/shtian/neuronx-tgi:latest \
  --model-id davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18 \
  --max-batch-size 4 \
  --max-input-length 16 \
  --max-total-tokens 32
```
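Once the container is up, you can send a single request to TGI's `/generate` REST endpoint as a quick smoke test. This is a minimal sketch using `requests`; the prompt and generation parameters are illustrative.
```python
import requests

# Simple smoke test against the TGI /generate endpoint exposed on port 8080
response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "[INST] 1+1= [/INST]",
        "parameters": {"max_new_tokens": 16},
    },
)
print(response.json()["generated_text"])
```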
TGI does not currently support sending a list of prompts in a single request; see this [GitHub issue](https://github.com/huggingface/text-generation-inference/issues/1008). The example below instead sends the prompts as concurrent single requests from a thread pool.
```python
from huggingface_hub import InferenceClient
import concurrent.futures

client = InferenceClient(model="http://127.0.0.1:8080")

batch_text = ["1+1=", "2+2=", "3+3=", "4+4="]
bs = 4

def format_text_list(text_list):
    # Wrap each prompt in the Mistral instruction template
    return ['[INST] ' + text + ' [/INST]' for text in text_list]

def gen_text(text):
    return client.text_generation(text, max_new_tokens=16)

# Send the prompts as concurrent single requests
with concurrent.futures.ThreadPoolExecutor(max_workers=bs) as executor:
    out = list(executor.map(gen_text, format_text_list(batch_text)))

print(out)
```
## Usage with 🤗 `optimum-neuron pipeline`
```python
from optimum.neuron import pipeline
p = pipeline('text-generation', 'davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18')
p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)
[{'generated_text': "My favorite place on earth is probably Paris, France, and if I were to go there
now I would take my partner on a romantic getaway where we could lay on the grass in the park,
eat delicious French cheeses and wine, and watch the sunset on the Seine river.'"}]
```
## Usage with 🤗 `optimum-neuron NeuronModelForCausalLM`
```python
import torch
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained("davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token_id = tokenizer.eos_token_id

def model_sample(input_prompt):
    # Wrap the prompt in the Mistral instruction template
    input_prompt = "[INST] " + input_prompt + " [/INST]"
    tokens = tokenizer(input_prompt, return_tensors="pt")

    with torch.inference_mode():
        sample_output = model.generate(
            **tokens,
            do_sample=True,
            min_length=16,
            max_length=32,
            temperature=0.5,
            pad_token_id=tokenizer.eos_token_id
        )

    outputs = [tokenizer.decode(tok, skip_special_tokens=True) for tok in sample_output]
    # Keep only the text generated after the instruction tag
    res = outputs[0].split('[/INST]')[1].strip("</s>").strip()
    return res + "\n"

print(model_sample("how are you today?"))
```
This repository contains tags specific to versions of `neuronx`. When loading the model with 🤗 `optimum-neuron`, use the repository revision that matches your installed `neuronx` version so that the correct serialized checkpoints are loaded.
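A minimal sketch of pinning a revision when loading; the revision value below is a placeholder, not an actual tag name (check this repository's tags for the value matching your `neuronx` version).
```python
from optimum.neuron import NeuronModelForCausalLM

# "<neuronx-version-tag>" is a placeholder; replace it with the repository tag
# that matches your installed neuronx version.
model = NeuronModelForCausalLM.from_pretrained(
    "davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18",
    revision="<neuronx-version-tag>",
)
```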
## Arguments passed during export
**input_shapes**
```json
{
  "batch_size": 4,
  "sequence_length": 2048
}
```
**compiler_args**
```json
{
  "auto_cast_type": "bf16",
  "num_cores": 24
}
```