Model Continuously Generating Text After Completing Task

#6
by endrikacupaj - opened

Hi,

Thank you for the great work and for open-sourcing the models!

I’ve noticed that sometimes the model keeps generating text indefinitely, even after it has answered the question or completed the task. It seems to forget to add an EOS token to stop the generation, leading to unnecessary token usage and making it harder to use in real applications.

I’m currently running the model with vLLM, and to prevent this issue, I have to set the max_tokens argument and post-process the responses.

Have you encountered this behavior before? Do you know why it happens, and is there a way to fix it?

Best,
Endri

IBM Granite org

Hi Endri, thanks for opening this issue. Can you provide any examples and/or specifics about this? Things that would help narrow down the behavior:

  • Sample prompts that you see cause this behavior
  • Sampling parameters (temperature, top_k, etc...)
  • Description of whether the behavior is deterministic or random
  • Characteristics of the load in vLLM that coincides with the behavior

Hi, thanks for the quick response!

This is how I run the model:

vllm serve ibm-granite/granite-3.1-8b-instruct \
  --served-model-name granite-3.1-8b-instruct \
  --host '0.0.0.0' \
  --port 5555 \
  --max-model-len 20480 \
  --download-dir '/LLM_Models/ibm-granite/granite-3.1-8b-instruct'

Additionally, I use a low temperature and include a seed value to ensure deterministic results:

seed: 1234
temperature: 0.1
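
For context, these are passed per request through the OpenAI-compatible endpoint. The snippet below is only an illustration of my setup: the port and served model name match the serve command above, and the system/user contents are placeholders for my actual prompts.

from openai import OpenAI

# Illustrative client call against the vLLM server started above;
# the system/user contents are placeholders for the real prompts.
client = OpenAI(base_url="http://localhost:5555/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="granite-3.1-8b-instruct",
    messages=[
        {"role": "system", "content": "<task description + few-shot examples>"},
        {"role": "user", "content": "<the actual input>"},
    ],
    seed=1234,
    temperature=0.1,
)
print(response.choices[0].message.content)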

My system prompts typically include a task description followed by multiple few-shot examples. I provide the input in the user prompt, and the model generates an answer.
However, the model sometimes continues generating examples similar to those in the few-shot section of the system prompt instead of stopping after answering the user prompt.

This behaviour appears to be somewhat inconsistent. For example, increasing the temperature sometimes resolves the issue, but in other cases, it does not.

As mentioned, for now I set max_tokens and post-process the responses (roughly as sketched below), but I’m curious whether you have observed this behaviour as well.
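
The post-processing is roughly the following; the delimiter string is just a placeholder, since the real marker depends on how the few-shot examples are formatted in the system prompt.

# Sketch of the workaround: max_tokens caps runaway generations on the request,
# and any extra few-shot-style "examples" the model appends are trimmed off.
# "\nInput:" is a placeholder for whatever delimiter the few-shot examples use.
EXAMPLE_DELIMITER = "\nInput:"

def trim_runaway_output(text: str) -> str:
    # Keep only the answer to the actual user prompt
    cut = text.find(EXAMPLE_DELIMITER)
    return text[:cut].rstrip() if cut != -1 else text.rstrip()

# e.g. with the client call from the earlier snippet, plus max_tokens=512:
# answer = trim_runaway_output(response.choices[0].message.content)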

IBM Granite org

Ok, thanks for the details. I'm trying to narrow down whether this is a bug in the model or in the vLLM code wrapping the model. Is it possible to replicate the failure using a simple loop and transformers, but keeping the same seed and temperature settings? This would help isolate where the failure occurs.

from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

set_seed(1234)
model_path = "ibm-granite/granite-3.1-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Build the chat prompt the same way the chat template does in vLLM, then tokenize it
user_prompt = "What is the airspeed velocity of an unladen swallow?"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_prompt}],
    tokenize=False,
    add_generation_prompt=True,
)
input_tokens = tokenizer(prompt, return_tensors="pt")

# Generate repeatedly with the same seed/temperature settings and collect the
# distinct outputs; keep special tokens so we can see whether EOS was emitted
outputs = set()
for i in range(100):
    output_ids = model.generate(**input_tokens, max_length=20480, temperature=0.1, do_sample=True)
    output = tokenizer.decode(output_ids[0], skip_special_tokens=False)
    outputs.add(output)

print(f"Generated {len(outputs)} unique responses")
for response in outputs:
    print("---------------------")
    print(response)
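
To make the failure easier to spot, you could also check whether the end-of-text token actually shows up in each decoded output (skip_special_tokens=False keeps it in the text if it was generated), for example:

# Flag any outputs where the model never emitted the EOS token
missing_eos = [o for o in outputs if tokenizer.eos_token not in o]
print(f"{len(missing_eos)} of {len(outputs)} responses never emitted {tokenizer.eos_token!r}")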
