---
language:
  - multilingual
license: apache-2.0
---

Model Card for Sindibad-7B

Table of Contents

  1. TL;DR
  2. Model Details
  3. Usage
  4. Training Details
  5. Evaluation

TL;DR

Model Details

Model Description

  • Model type: Language model
  • Language(s) (NLP): English
  • License: Apache 2.0

Usage

Find below some example scripts showing how to use the model with the transformers library (make sure you have the latest version of transformers installed, or build it from source):

Using the PyTorch model

Running the model on a CPU

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/sindibad-7b")
model = AutoModelForCausalLM.from_pretrained("tiiuae/sindibad-7b")

input_text = "Question: How many hours in one day? Answer: "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Running the model on a GPU

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/sindibad-7b")
model = AutoModelForCausalLM.from_pretrained("tiiuae/sindibad-7b", device_map="auto")

input_text = "Question: How many hours in one day? Answer: "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Running the model on a GPU using different precisions

FP16

# pip install accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/sindibad-7b")
model = AutoModelForCausalLM.from_pretrained("tiiuae/sindibad-7b", device_map="auto", torch_dtype=torch.float16)

input_text = "Question: How many hours in one day? Answer: "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

4-bit

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained("tiiuae/sindibad-7b")
model = AutoModelForCausalLM.from_pretrained("tiiuae/sindibad-7b", device_map="auto", quantization_config=BitsAndBytesConfig(load_in_4bit=True))

input_text = "Question: How many hours in one day? Answer: "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Training Details


Training Data


Training Procedure

The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size rampup from $b_{\mathrm{min}} = 128 \times 2048$ to $b_{\mathrm{max}} = 2048 \times 2048$ tokens during the first 50 GT of training. In the stable phase, we used a maximal learning rate $\eta_{\mathrm{max}} = 6.4 \times 10^{-4}$ and decayed it to the minimal value $\eta_{\mathrm{min}} = \eta_{\mathrm{max}} / 256$ with an exponential schedule over 500 GT. We also applied batch scaling during the rampup, rescaling the learning rate $\eta$ so that the Adam noise temperature $T_{\mathrm{noise}} \equiv \eta / \sqrt{b}$ is kept constant.
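As an illustration of the batch-scaling rule above, here is a minimal sketch. The function name and the intermediate batch size are made up for this example; only the constant $\eta / \sqrt{b}$ relation comes from the procedure described above, anchored here at $(\eta_{\mathrm{max}}, b_{\mathrm{max}})$ as an assumption:

# Illustrative sketch of batch scaling during the rampup (not the actual training code).
# The learning rate is rescaled with the batch size so that the Adam noise
# temperature T_noise = eta / sqrt(b) stays constant.
import math

ETA_MAX = 6.4e-4        # maximal learning rate of the stable phase
B_MIN = 128 * 2048      # initial batch size, in tokens
B_MAX = 2048 * 2048     # final batch size, in tokens

def scaled_learning_rate(batch_size_tokens: int) -> float:
    # Keep eta / sqrt(b) constant, anchored (by assumption) at (ETA_MAX, B_MAX).
    return ETA_MAX * math.sqrt(batch_size_tokens / B_MAX)

for b in (B_MIN, 512 * 2048, B_MAX):
    print(f"batch = {b:>9d} tokens -> lr = {scaled_learning_rate(b):.2e}")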

Evaluation

Benchmarks

We evaluate our model on all benchmarks of the leaderboard's version 2 using the lm-evaluation-harness package, and on the version 1 benchmarks using lighteval.

| Model | IFEval | BBH | MATH Lvl5 | GPQA | MuSR | MMLU-PRO | Average L2 | ARC | HellaSwag | MMLU | Winogrande | TruthfulQA | GSM8K | Average L1 |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| meta-llama/Meta-Llama-3-8B | 14.55 | 24.50 | 3.25 | 7.38 | 6.24 | 24.55 | 13.41 | 60.24 | 82.23 | 66.70 | 78.45 | 42.93 | 45.19 | 62.62 |
| tiiuae/falcon2-11B | 32.61 | 21.94 | 2.34 | 2.8 | 7.53 | 15.44 | 13.78 | 59.73 | 82.91 | 58.37 | 78.30 | 52.56 | 53.83 | 64.28 |
| mistralai/Mistral-7B-v0.1 | 23.86 | 22.02 | 2.49 | 5.59 | 10.68 | 22.36 | 14.50 | 59.98 | 83.31 | 64.16 | 78.37 | 42.15 | 37.83 | 60.97 |
| Zyphra/Zamba-7B-v1 | - | - | - | - | - | - | - | 46.48 | 80.24 | 57.72 | 76.4 | - | - | - |
| Ours | 32.16 | 21.07 | 4.08 | 10.18 | 6.97 | 13.43 | 14.65 | 61.69 | 80.63 | 61.05 | 74.03 | 53.60 | 51.86 | 63.81 |
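For illustration, such an evaluation could be launched with a recent version of lm-evaluation-harness roughly as follows; the task names, dtype, and batch size below are placeholders rather than the exact settings used for the table above:

# Hypothetical lm-evaluation-harness run; task names, dtype, and batch size are
# placeholders, not the exact settings used to produce the table above.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=tiiuae/sindibad-7b,dtype=float16",
    tasks=["arc_challenge", "hellaswag"],
    batch_size=8,
)
print(results["results"])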

Throughput

This model can achieve throughput and performance comparable to other transformer-based models that use optimized kernels such as Flash Attention 2. Make sure to install the optimized Mamba kernels with the following command:

pip install "causal-conv1d>=1.4.0" mamba-ssm
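As a rough, illustrative way to check generation throughput on your own hardware (the prompt, dtype, and max_new_tokens below are arbitrary choices, not the setup from the technical report):

# Rough tokens-per-second measurement; settings are illustrative only.
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/sindibad-7b")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/sindibad-7b", device_map="auto", torch_dtype=torch.float16
)

input_ids = tokenizer("Question: How many hours in one day? Answer: ", return_tensors="pt").input_ids.to("cuda")

start = time.time()
outputs = model.generate(input_ids, max_new_tokens=256)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")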

Refer to our technical report for more details about performance evaluation.