GENERator-eukaryote-3b-base model
About
In this repository, we present GENERator, a generative genomic foundation model with 3B parameters and a context length of 98k base pairs, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. The extensive and diverse pre-training data endow GENERator with enhanced understanding and generation capabilities across various organisms.
For more technical details, please refer to our paper, "GENERator: A Long-Context Generative Genomic Foundation Model" (arXiv:2502.07272).
How to use
Simple example 1: generation
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GENERator-eukaryote-3b-base")
config = model.config
max_length = config.max_position_embeddings
# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]
# Prepend the BOS token manually, because the tokenizer call below uses add_special_tokens=False
sequences = [tokenizer.bos_token + sequence for sequence in sequences]
# Tokenize the sequences, padding on the left so that generation continues directly from each prompt's final token
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)
# Generate the sequences
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32)
# Decode the generated sequences
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Print the decoded sequences
print(decoded_sequences)
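Both prompts in the generation example have lengths (24 and 36 bases) that are multiples of six. A minimal sketch of trimming arbitrary prompts the same way, assuming the 6-mer tokenization described in the GENERator paper: the value k = 6 is not stated in this README, and `trim_to_kmer_multiple` is a hypothetical helper, not part of the released code.

```python
def trim_to_kmer_multiple(sequence: str, k: int = 6) -> str:
    """Drop leading bases so that len(sequence) is a multiple of k.

    Trimming from the left keeps the end of the prompt intact, which is
    the part the model continues from during generation.
    k = 6 is an assumption about the tokenizer, not taken from this README.
    """
    remainder = len(sequence) % k
    return sequence[remainder:]


prompts = [
    "ATGAGGTGGCAAGAAATGGGCTAC",      # 24 bases, already a multiple of 6
    "CCGAATTCCATGAGGCTATAGAATAATC",  # 28 bases, the 4 leading bases are dropped
]
trimmed = [trim_to_kmer_multiple(p) for p in prompts]
for original, result in zip(prompts, trimmed):
    print(len(original), "->", len(result), result)
```

The trimmed strings can then be passed through the same BOS-prefixing and tokenization steps as the sequences above.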
Simple example 2: embedding
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GENERator-eukaryote-3b-base")
config = model.config
max_length = config.max_position_embeddings
# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]
# Tokenize the sequences with add_special_tokens=True so that special tokens,
# such as the BOS and EOS tokens, are added automatically at the appropriate positions.
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)
# Perform a forward pass through the model to obtain the outputs, including hidden states.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)
# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1] # Shape: (batch_size, sequence_length, hidden_size)
# Use the attention_mask to determine the index of the last token in each sequence.
# Since add_special_tokens=True is used, the last token is typically the EOS token.
attention_mask = inputs["attention_mask"]
last_token_indices = attention_mask.sum(dim=1) - 1 # Index of the last token for each sequence
# Extract the embedding corresponding to the EOS token for each sequence.
seq_embeddings = []
for i, token_index in enumerate(last_token_indices):
    # Fetch the embedding for the last token (EOS token).
    seq_embedding = hidden_states[i, token_index, :]
    seq_embeddings.append(seq_embedding)
# Stack the embeddings into a tensor with shape (batch_size, hidden_size)
seq_embeddings = torch.stack(seq_embeddings)
print("Sequence Embeddings:", seq_embeddings)
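The index arithmetic in the embedding example relies on right padding. A minimal, library-free sketch of why, using hand-written attention masks: the masks and the helper name `last_token_index` are illustrative assumptions, not output from the tokenizer.

```python
from typing import Sequence


def last_token_index(mask: Sequence[int], padding_side: str) -> int:
    """Return the position of the last non-padding token in one mask row."""
    if padding_side == "right":
        # Real tokens come first, so the last one sits at (count of ones) - 1.
        return sum(mask) - 1
    # With left padding the real tokens are flush against the end of the row.
    return len(mask) - 1


right_padded = [
    [1, 1, 1, 1, 0, 0],  # 4 real tokens, 2 pads at the end
    [1, 1, 1, 1, 1, 1],  # 6 real tokens, no padding
]
left_padded = [
    [0, 0, 1, 1, 1, 1],  # 2 pads at the start, 4 real tokens
    [1, 1, 1, 1, 1, 1],
]

print([last_token_index(m, "right") for m in right_padded])
print([last_token_index(m, "left") for m in left_padded])
# Applying the right-padding formula to a left-padded row picks index 3,
# which is a real token but not the last one (index 5).
print(sum(left_padded[0]) - 1)
```

If the embedding example were switched to left padding, as in the generation example, the last-token embedding would simply be `hidden_states[:, -1, :]`.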
Citation
@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model},
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272},
}