|
---
base_model: NousResearch/Meta-Llama-3-70B-Instruct
model_type: llama
pipeline_tag: text-generation
quantized_by: Compressa
license: other
license_name: llama3
license_link: https://llama.meta.com/llama3/license
tags:
- llama3
- omniquant
- gptq
- triton
---
|
|
|
|
|
# Llama 3 70B Instruct – OmniQuant |
|
|
|
Based on [Llama 3 70B Instruct](https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct). |
|
|
|
Quantized with [OmniQuant](https://github.com/OpenGVLab/OmniQuant). |
|
|
|
|
|
## Evaluation |
|
|
|
### PPL (↓) |
|
|
|
|               | wiki |
| ------------- | ---- |
| FP            | 5.33 |
| **Quantized** | 5.90 |
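
The exact perplexity setup is not stated in this card. As a rough illustration only, a simple non-overlapping window evaluation on WikiText-2 (assuming that is what the `wiki` column refers to, and reusing the `model`/`tokenizer` loaded as shown in the Examples section below) could look like this:

```python
# Hypothetical sketch, not the measurement script used for the table above.
# The dataset choice and the 2048-token window are assumptions.
import torch
from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

window = 2048
nlls, n_tokens = [], 0
for begin in range(0, encodings.input_ids.size(1), window):
    input_ids = encodings.input_ids[:, begin:begin + window].cuda()
    with torch.no_grad():
        # Mean cross-entropy over the chunk, re-weighted by chunk length (approximate).
        loss = model(input_ids, labels=input_ids).loss
    nlls.append(loss * input_ids.size(1))
    n_tokens += input_ids.size(1)

print(f"PPL: {torch.exp(torch.stack(nlls).sum() / n_tokens).item():.2f}")
```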
|
|
|
|
|
### Accuracy on English Benchmarks, % (↑) |
|
|
|
|               | piqa | arc_easy | arc_challenge | boolq | hellaswag | winogrande | mmlu_humanities | mmlu_social_sciences | mmlu_stem | mmlu_other |
| ------------- | ---- | -------- | ------------- | ----- | --------- | ---------- | --------------- | -------------------- | --------- | ---------- |
| FP            | 81.5 | 86.2     | 61.9          | 87.4  | 63.7      | 75.8       | 78.7            | 84.4                 | 71.1      | 80.2       |
| **Quantized** | 80.7 | 85.8     | 61.4          | 87.0  | 62.7      | 73.0       | 75.5            | 81.0                 | 68.6      | 77.9       |
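
The harness and exact settings behind these accuracies are not given here. One common way to score such tasks is `lm-evaluation-harness`; a sketch (assuming a 0.4.x API, illustrative task names, and the `model`/`tokenizer` loaded as in the Examples section below) might look like:

```python
# Hypothetical sketch; the harness version, task list and settings used for
# the table above are assumptions, not taken from this card.
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)
results = simple_evaluate(
    model=lm,
    tasks=["piqa", "arc_easy", "arc_challenge", "boolq", "hellaswag", "winogrande"],
)
print(results["results"])
```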
|
|
|
|
|
### Accuracy on Russian Benchmarks, % (↑) |
|
|
|
|               | danetqa | terra | rwsd | muserc | rucos | lidirus | parus | rcb  | russe | rucola |
| ------------- | ------- | ----- | ---- | ------ | ----- | ------- | ----- | ---- | ----- | ------ |
| FP            | 88.9    | 88.6  | 75.5 | 81.8   | 82.4  | 70.7    | 77.0  | 35.0 | 63.1  | 34.7   |
| **Quantized** | 86.6    | 81.8  | 71.6 | 75.6   | 69.5  | 60.3    | 64.0  | 26.8 | 63.1  | 32.5   |
|
|
|
|
|
### Summary |
|
|
|
|               | Avg acc diff on Eng, % (↑) | Avg acc diff on Rus, % (↑) | Occupied disk space, % (↓) |
| ------------- | -------------------------- | -------------------------- | -------------------------- |
| FP            | 0                          | 0                          | 100                        |
| **Quantized** | -1.7                       | -6.6                       | 28.2                       |
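
The average differences above are simply the mean per-task gap between the quantized and FP rows of the two accuracy tables, e.g. for the English benchmarks:

```python
# Reproduces the "Avg acc diff on Eng" value from the per-task numbers above.
fp        = [81.5, 86.2, 61.9, 87.4, 63.7, 75.8, 78.7, 84.4, 71.1, 80.2]
quantized = [80.7, 85.8, 61.4, 87.0, 62.7, 73.0, 75.5, 81.0, 68.6, 77.9]

avg_diff = sum(q - f for q, f in zip(quantized, fp)) / len(fp)
print(f"{avg_diff:.1f}")  # -1.7
```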
|
|
|
|
|
## Examples |
|
|
|
### Imports and Model Loading |
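
The snippet below relies on `auto-gptq` (for its `QuantLinear` CUDA/Triton kernels), `accelerate`, `transformers`, and `tqdm`; make sure these are installed before running it.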
|
|
|
<details> |
|
<summary>Expand</summary> |
|
|
|
```python
import gc

import auto_gptq.nn_modules.qlinear.qlinear_cuda as qlinear_cuda
import auto_gptq.nn_modules.qlinear.qlinear_triton as qlinear_triton
import torch

from accelerate import (
    init_empty_weights,
    infer_auto_device_map,
    load_checkpoint_in_model,
)
from tqdm import tqdm
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
)


def get_named_linears(model):
    # Collect all torch.nn.Linear submodules, keyed by their qualified names.
    return {
        name: module for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear)
    }


def set_module(model, name, module):
    # Replace the submodule at the dotted path `name` with `module`.
    parent = model
    levels = name.split('.')

    for i in range(len(levels) - 1):
        cur_name = levels[i]

        if cur_name.isdigit():
            parent = parent[int(cur_name)]
        else:
            parent = getattr(parent, cur_name)

    setattr(parent, levels[-1], module)


def load_model(model_path):
    # Based on: https://github.com/OpenGVLab/OmniQuant/blob/main/runing_quantized_mixtral_7bx8.ipynb

    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

    if not hasattr(config, 'quantization_config'):
        raise AttributeError(
            f'No quantization info found in model config "{model_path}"'
            f' (`quantization_config` section is missing).'
        )

    wbits = config.quantization_config['bits']
    group_size = config.quantization_config['group_size']

    # We are going to init an ordinary model and then manually replace all Linears with QuantLinears.
    del config.quantization_config

    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(
            config=config, torch_dtype=torch.float16, trust_remote_code=True
        )

    layers = model.model.layers

    for i in tqdm(range(len(layers))):
        layer = layers[i]
        named_linears = get_named_linears(layer)

        for name, module in named_linears.items():
            params = (
                wbits, group_size,
                module.in_features, module.out_features,
                module.bias is not None
            )

            # Pick the kernel matching the bit width: Triton for 2/4 bits, CUDA for 3 bits.
            if wbits in [2, 4]:
                q_linear = qlinear_triton.QuantLinear(*params)
            elif wbits == 3:
                q_linear = qlinear_cuda.QuantLinear(*params)
            else:
                raise NotImplementedError("Only 2, 3 and 4 bits are supported.")

            q_linear.to(next(layer.parameters()).device)
            set_module(layer, name, q_linear)

    torch.cuda.empty_cache()
    gc.collect()

    model.tie_weights()
    device_map = infer_auto_device_map(model)

    print("Loading pre-computed quantized weights...")

    load_checkpoint_in_model(
        model, checkpoint=model_path,
        device_map=device_map, offload_state_dict=True,
    )

    print("Model loaded successfully!")

    return model
```
|
</details> |
|
|
|
|
|
### Inference |
|
|
|
```python
model_path = "compressa-ai/Llama-3-70B-Instruct-OmniQuant"

model = load_model(model_path).cuda()
tokenizer = AutoTokenizer.from_pretrained(
    model_path, use_fast=False, trust_remote_code=True
)

# Llama 3 "specifics"
# https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/4
terminators = [
    tokenizer.convert_tokens_to_ids("<|end_of_text|>"),
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

system_message = "You are a friendly chatbot who responds as if you are the Sandy Cheeks squirrel from the SpongeBob SquarePants cartoon."
user_message = "Do squirrels communicate with birds?"
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The chat template already prepends <|begin_of_text|>,
# so don't let the tokenizer add special tokens a second time.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
inputs = {k: v.cuda() for k, v in inputs.items()}

outputs = model.generate(
    **inputs, max_new_tokens=512,
    do_sample=True, temperature=0.7, top_p=0.95,
    eos_token_id=terminators,
)

response = tokenizer.decode(outputs[0])
continuation = response.removeprefix(prompt).removesuffix(tokenizer.eos_token)

print(f'Prompt:\n{prompt}')
print(f'Continuation:\n{continuation}\n')
```
|
|
|
|
|
### Inference Using Pipeline |
|
|
|
```python
pipe = pipeline(
    "text-generation",
    model=model, tokenizer=tokenizer,
    eos_token_id=terminators,
    max_new_tokens=512, do_sample=True,
    temperature=0.7, top_p=0.95,
    device=0,
)

prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = pipe(prompt)

response = outputs[0]["generated_text"]
continuation = response.removeprefix(prompt)

print(f'Prompt:\n{prompt}')
print(f'Continuation:\n{continuation}\n')
```
|
|