Gemma 2 model card
- DatafoundryAI/OpenVino_Ratnamuladve.Q8_K_M-GGUF
Authors
- DatafoundryAI
Model Information
- Summary description and brief definition of inputs and outputs.
Description
- Gemma-2-2b-it builds on the technological advancements of the Gemini models, offering high-quality language generation capabilities. We have enhanced this model by applying INT8 quantization using the Intel OpenVINO Toolkit. This process optimizes the model for deployment in resource-constrained environments.
About OpenVINO
Model Conversion and Quantization with Intel OpenVINO:
Model Optimizer:
- OpenVINO includes a tool called the Model Optimizer that converts pre-trained models from popular frameworks (such as TensorFlow, PyTorch, and ONNX) into an intermediate representation (IR). This IR consists of two files: a .xml file describing the model's structure and a .bin file containing the weights.
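For illustration, here is a minimal conversion sketch using OpenVINO's Python conversion API (openvino.convert_model, the successor to the standalone Model Optimizer CLI); the file names below are placeholders rather than files shipped with this repository:

import openvino as ov

# Convert a pre-trained model (here a hypothetical ONNX file) to OpenVINO IR.
ov_model = ov.convert_model("model.onnx")

# save_model writes the IR pair: model.xml (topology) and model.bin (weights).
ov.save_model(ov_model, "ir_model/model.xml")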
Quantization:
- During the conversion process, you can apply quantization techniques to reduce model size and improve inference speed. OpenVINO supports INT8 quantization, which reduces the model's weights from floating-point precision to 8-bit integers.
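As an illustration only (not necessarily the exact recipe used for this checkpoint), 8-bit weight compression of an IR model can be applied with NNCF, the compression library that ships alongside OpenVINO:

import openvino as ov
import nncf  # pip install nncf

core = ov.Core()
model = core.read_model("ir_model/model.xml")  # hypothetical IR from the previous step

# Compress floating-point weights to 8-bit integers (asymmetric scheme).
compressed_model = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.INT8_ASYM)

ov.save_model(compressed_model, "ir_model_int8/model.xml")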
Benefits:
- INT8 quantization improves computational efficiency and reduces memory footprint, making the model more suitable for deployment on devices with limited hardware resources. The OpenVINO toolkit facilitates this process by providing tools and optimizations that ensure the model's performance remains high while being more resource-efficient.
Resources and Technical Documentation
Installation and Inference
To run the inference example below, install OpenVINO and the OpenVINO GenAI package (the Transformers library is only needed if you also want to work with the original Hugging Face checkpoint):
!pip install transformers openvino openvino-dev openvino-genai
Inference code
Below is an example of how to perform inference with the quantized Gemma-2-2b-it model:
import time

import openvino_genai


def streamer(subword):
    # Print each generated subword as soon as it arrives.
    print(subword, end='', flush=True)
    # Returning False tells the pipeline to continue generating.
    return False


model_dir = "your path"   # path to the directory containing the OpenVINO model files
device = 'CPU'            # GPU can be used as well

pipe = openvino_genai.LLMPipeline(model_dir, device)

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100

pipe.start_chat()

total_tokens = 0
total_time = 0

while True:
    prompt = input('question:\n')
    if prompt == 'Stop!':
        break

    start_time = time.time()
    output = pipe.generate(prompt, config, streamer)
    end_time = time.time()

    elapsed_time = end_time - start_time
    # Rough token count: whitespace-separated words, not true tokenizer tokens.
    num_tokens = len(output.split())
    total_tokens += num_tokens
    total_time += elapsed_time

    print(f'\nGenerated tokens: {num_tokens}')
    print(f'Time taken: {elapsed_time:.2f} seconds')
    print('\n----------')

pipe.finish_chat()

if total_time > 0:
    tok_per_s = total_tokens / total_time
    print(f'Tokens per second: {tok_per_s:.2f}')
else:
    print('No tokens generated.')
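Note that the word count above is only a rough proxy for the true token count. As a sketch (assuming the Tokenizer API exposed by recent openvino_genai releases; verify against your installed version), you can re-encode the generated text with the pipeline's own tokenizer for an exact count, continuing from the example above where pipe and output are already defined:

# Exact token count via the pipeline's tokenizer (continues the example above).
tokenizer = pipe.get_tokenizer()
encoded = tokenizer.encode(output)          # returns TokenizedInputs with an input_ids tensor
num_tokens = encoded.input_ids.data.size    # number of token ids for this single sequence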