
Gemma 2 model card

  • DatafoundryAI/OpenVino_Ratnamuladve.Q8_K_M-GGUF

Authors

  • DatafoundryAI

Model Information

  • Summary description and brief definition of inputs and outputs.

Description

  • Gemma-2-2b-it builds on the technological advancements of the Gemini models, offering high-quality language generation capabilities. We have enhanced this model by applying INT8 quantization using the Intel OpenVINO Toolkit. This process optimizes the model for deployment in resource-constrained environments.

About OpenVINO

Model Conversion and Quantization with Intel OpenVINO:

  • Model Optimizer:

    • OpenVINO includes a tool called the Model Optimizer that converts pre-trained models from popular frameworks (such as TensorFlow, PyTorch, and ONNX) into an intermediate representation (IR). The IR consists of two files: an .xml file describing the model's structure and a .bin file containing the weights. A minimal conversion sketch follows this list.
  • Quantization:

    • During the conversion process, you can apply quantization techniques to reduce model size and improve inference speed. OpenVINO supports INT8 quantization, which maps floating-point weights to 8-bit integers.
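
For illustration, here is a minimal conversion sketch using the openvino.convert_model Python API (the current successor to the standalone Model Optimizer command line). The torchvision model is a stand-in chosen for brevity and is not part of this repository:

# Hedged sketch: convert a PyTorch model to OpenVINO IR.
# Assumes openvino, torch, and torchvision are installed;
# resnet18 is only a placeholder model for demonstration.
import torch
import torchvision
import openvino as ov

pt_model = torchvision.models.resnet18(weights=None).eval()

# Trace the model with an example input to obtain an OpenVINO model object
ov_model = ov.convert_model(pt_model, example_input=torch.randn(1, 3, 224, 224))

# save_model writes the IR pair: model.xml (structure) and model.bin (weights)
ov.save_model(ov_model, "model.xml")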

Benefits:

  • INT8 quantization improves computational efficiency and reduces memory footprint, making the model better suited to devices with limited hardware resources. The OpenVINO toolkit facilitates this process with tools and optimizations that keep the model's quality high while making it more resource-efficient.
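
As one concrete route (an assumption about tooling, not necessarily the exact workflow used for this repository), the optimum-cli exporter from Optimum Intel (pip install optimum[openvino]) can convert a Hugging Face checkpoint and apply INT8 weight quantization in a single step:

optimum-cli export openvino --model google/gemma-2-2b-it --weight-format int8 gemma-2-2b-it-int8-ov

The output directory then contains the OpenVINO IR with INT8 weights, ready to load with openvino_genai as shown in the inference example below.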

Resources and Technical Documentation

Installing the Required Libraries and Inference

To use this model, first install the required libraries (the inference example below relies on the openvino_genai package):

!pip install transformers openvino openvino-dev openvino-genai

Inference code

Below is an example of how to perform inference with the quantized Gemma-2-2b-it model:

import time

import openvino_genai


def streamer(subword):
    # Print each generated subword as it arrives. Returning False tells the
    # pipeline to keep generating; returning True would stop generation early.
    print(subword, end='', flush=True)
    return False


model_dir = "path/to/your/model"  # directory containing the quantized OpenVINO model
device = 'CPU'  # 'GPU' can be used as well

pipe = openvino_genai.LLMPipeline(model_dir, device)

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100

pipe.start_chat()

total_tokens = 0
total_time = 0.0

while True:
    prompt = input('question:\n')
    if prompt == 'Stop!':
        break

    start_time = time.time()
    output = pipe.generate(prompt, config, streamer)
    end_time = time.time()

    elapsed_time = end_time - start_time
    # Whitespace splitting only approximates the token count; see the note below.
    num_tokens = len(output.split())

    total_tokens += num_tokens
    total_time += elapsed_time

    print(f'Generated tokens: {num_tokens}')
    print(f'Time taken: {elapsed_time:.2f} seconds')
    print('\n----------')

pipe.finish_chat()

if total_time > 0:
    tok_per_s = total_tokens / total_time
    print(f'Tokens per second: {tok_per_s:.2f}')
else:
    print('No tokens generated.')
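
The whitespace split above only approximates the real token count. A more accurate count can come from the pipeline's own tokenizer; the following is a minimal sketch assuming the installed openvino_genai build exposes LLMPipeline.get_tokenizer():

# Hedged sketch: count generated tokens with the pipeline's tokenizer.
# Assumes `pipe` and `output` from the example above.
tokenizer = pipe.get_tokenizer()
ids = tokenizer.encode(output).input_ids  # openvino.Tensor of shape [1, seq_len]
num_tokens = ids.data.shape[-1]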