---
tags:
- dna
- human_genome
---
# GENA-LM (gena-lm-bert-base-t2t)
GENA-LM is a family of open-source foundational models for long DNA sequences.
GENA-LM models are transformer masked language models trained on human DNA sequences.
Differences between GENA-LM (`gena-lm-bert-base-t2t`) and DNABERT:
- BPE tokenization instead of k-mers;
- an input sequence size of about 4500 nucleotides (512 BPE tokens), compared with 512 nucleotides for DNABERT;
- pre-training on the T2T human genome assembly instead of GRCh38.p13.
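
To make the tokenization difference concrete, here is a minimal sketch that is not taken from either codebase. `kmer_tokenize` is a hypothetical helper, and it assumes k-mer tokenization means overlapping k-mers with stride 1, so each token advances the sequence by a single nucleotide. The 4500 and 512 figures are the approximate values quoted in the list above.

```python
from typing import List


def kmer_tokenize(seq: str, k: int = 6) -> List[str]:
    """Split a DNA string into overlapping k-mers with stride 1 (assumed scheme)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq = "ATGCGTACGTTAGC"  # toy 14-nucleotide sequence
kmers = kmer_tokenize(seq, k=6)
print(kmers[:3])   # ['ATGCGT', 'TGCGTA', 'GCGTAC']
print(len(kmers))  # 9 tokens for 14 nucleotides: roughly one token per nucleotide

# With BPE, one token covers several nucleotides on average.
# Using the approximate figures above: 4500 nucleotides in 512 tokens.
nucleotides_per_bpe_token = 4500 / 512
print(round(nucleotides_per_bpe_token, 1))  # 8.8
```

Under these assumptions, the same 512-token budget spans roughly nine times more sequence with BPE than with stride-1 k-mers.
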

Source code and data: https://github.com/AIRI-Institute/GENA_LM

Paper: https://academic.oup.com/nar/article/53/2/gkae1310/7954523

This repository also contains models that are fine-tuned on downstream tasks:
- promoter prediction (branch [promoters_300_run_1](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t/tree/promoters_300_run_1))
- splice site prediction (branch [spliceai_run_1](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t/tree/spliceai_run_1))
- epigenetic features and gene expression (trained on the Enformer dataset, branch [enformer](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t/tree/enformer))

It also contains the models used in our [GENA-Web](https://dnalm.airi.net) web tool for genomic sequence annotation:
- DeepSEA (branch [gena_web_deepsea](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t/tree/gena_web_deepsea))
- DeepSTARR (branch [gena_web_deepstarr](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t/tree/gena_web_deepstarr))
## Examples
### How to load pre-trained model for Masked Language Modeling
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
# trust_remote_code=True allows transformers to run the model code shipped with this repository
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
```
### How to load pre-trained model to fine-tune it on classification task
Get the model class from the GENA-LM repository:
```bash
git clone https://github.com/AIRI-Institute/GENA_LM.git
```
```python
from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
```
Alternatively, download [modeling_bert.py](https://github.com/AIRI-Institute/GENA_LM/tree/main/src/gena_lm) and place it next to your code.

You can also get the model class through HuggingFace AutoModel:
```python
import importlib

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')

# Load once through AutoModel to find the module that holds the GENA-LM model classes
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)

# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)

# Load the pre-trained weights into the classification class; num_labels is the number of classes
model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', num_labels=2)
```
## Model description
The GENA-LM (`gena-lm-bert-base-t2t`) model is trained with a masked language modeling (MLM) objective, following the method proposed in the BigBird paper: 15% of tokens are masked. The model config for `gena-lm-bert-base-t2t` is similar to that of bert-base:
- 512 Maximum sequence length
- 12 Layers, 12 Attention heads
- 768 Hidden size
- 32k Vocabulary size
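
The 15% masking mentioned above can be illustrated with a minimal sketch. This is not the GENA-LM training code: `mask_tokens` is a hypothetical helper, the `[MASK]` token name is assumed to follow BERT vocabularies, every selected position is simply replaced by the mask token, and special tokens are ignored for brevity.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"  # assumed mask token name, as in BERT vocabularies


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace a random `mask_prob` fraction of tokens with MASK_TOKEN.

    Returns the masked sequence and per-position labels: the original token
    at masked positions and None elsewhere (positions that are not predicted).
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


# 20 made-up BPE-like DNA tokens, for illustration only
tokens = [
    "ATGC", "GGTTA", "CCA", "TTTAGG", "AC", "GATTACA", "CG", "TATA", "GGC", "AATT",
    "CCGG", "TGA", "ACGT", "GTCA", "TTG", "CAAT", "AGG", "CTC", "GAGA", "TCT",
]
masked, labels = mask_tokens(tokens)
print(masked)
print(sum(label is not None for label in labels))  # 3 positions, i.e. 15% of 20
```

During pre-training the model is asked to recover the original token at each masked position from the surrounding context.
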
We pre-trained `gena-lm-bert-base-t2t` using the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling mutations from 1000-genome SNPs (gnomAD dataset). Pre-training was performed for 2,100,000 iterations with a batch size of 256 and a sequence length of 512 tokens. We modified the Transformer to use [Pre-Layer normalization](https://arxiv.org/abs/2002.04745), but without the LayerNorm after the final layer.
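
For readers unfamiliar with Pre-Layer normalization, the sketch below contrasts the two orderings on a toy vector. It is an illustration under simplifying assumptions, not the GENA-LM implementation: `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, and `layer_norm` has no learned scale or shift.

```python
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def post_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Original Transformer ordering: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-LN ordering: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 4.0, 9.0]
post = post_ln_block(x, toy_sublayer)
pre = pre_ln_block(x, toy_sublayer)

print(sum(post) / len(post))  # close to 0: the block output is normalized
print(sum(pre) / len(pre))    # close to 4: the residual stream keeps the input mean
```

In the Pre-LN ordering the residual path is never normalized, so when no LayerNorm follows the last block, the final hidden states are this un-normalized residual stream.
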
## Evaluation
For evaluation results, see our paper: https://academic.oup.com/nar/article/53/2/gkae1310/7954523
## Citation
```bibtex
@article{GENA_LM,
author = {Fishman, Veniamin and Kuratov, Yuri and Shmelev, Aleksei and Petrov, Maxim and Penzar, Dmitry and Shepelin, Denis and Chekanov, Nikolay and Kardymon, Olga and Burtsev, Mikhail},
title = {GENA-LM: a family of open-source foundational DNA language models for long sequences},
journal = {Nucleic Acids Research},
volume = {53},
number = {2},
pages = {gkae1310},
year = {2025},
month = {01},
issn = {0305-1048},
doi = {10.1093/nar/gkae1310},
url = {https://doi.org/10.1093/nar/gkae1310},
eprint = {https://academic.oup.com/nar/article-pdf/53/2/gkae1310/61443229/gkae1310.pdf},
}
```