	---
	license: apache-2.0
	---

	## Model Overview

	PlantCaduceus is a DNA language model pre-trained on 16 angiosperm genomes. It builds on the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures and is trained with a masked language modeling objective to learn evolutionary conservation and DNA sequence grammar from species spanning 160 million years of evolutionary history. We have trained a series of PlantCaduceus models of varying sizes:

	- [PlantCaduceus_l20](https://huggingface.co/kuleshov-group/PlantCaduceus_l20): 20 layers, 384 hidden size, 20M parameters
	- [PlantCaduceus_l24](https://huggingface.co/kuleshov-group/PlantCaduceus_l24): 24 layers, 512 hidden size, 40M parameters
	- [PlantCaduceus_l28](https://huggingface.co/kuleshov-group/PlantCaduceus_l28): 28 layers, 768 hidden size, 112M parameters
	- [PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32): 32 layers, 1024 hidden size, 225M parameters

	We recommend using the largest model ([PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)) for zero-shot score estimation.
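
If you need to choose a checkpoint programmatically, the list above can be captured in a small lookup table. This is only a sketch: the figures are copied from the list above, while `PLANTCADUCEUS_VARIANTS` and `pick_variant` are hypothetical names introduced for illustration and are not part of the released code.

```python
# Variants as listed above: layers, hidden size, parameter count in millions.
PLANTCADUCEUS_VARIANTS = {
    "kuleshov-group/PlantCaduceus_l20": {"layers": 20, "hidden_size": 384, "params_m": 20},
    "kuleshov-group/PlantCaduceus_l24": {"layers": 24, "hidden_size": 512, "params_m": 40},
    "kuleshov-group/PlantCaduceus_l28": {"layers": 28, "hidden_size": 768, "params_m": 112},
    "kuleshov-group/PlantCaduceus_l32": {"layers": 32, "hidden_size": 1024, "params_m": 225},
}


def pick_variant(max_params_m=None):
    """Return the largest listed checkpoint that fits the parameter budget.

    Hypothetical helper: `max_params_m` is a budget in millions of parameters;
    None means "no limit", which selects the largest model.
    """
    candidates = [
        (spec["params_m"], name)
        for name, spec in PLANTCADUCEUS_VARIANTS.items()
        if max_params_m is None or spec["params_m"] <= max_params_m
    ]
    if not candidates:
        raise ValueError("no listed variant fits the requested parameter budget")
    # max() compares by parameter count first, so the largest fitting model wins.
    return max(candidates)[1]


model_path = pick_variant()                 # largest model, recommended for zero-shot scoring
small_path = pick_variant(max_params_m=50)  # largest model with at most 50M parameters
```

The resulting `model_path` string can be passed to `from_pretrained` in the same way as the hard-coded path in the usage example.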

	## How to use
	```python
	from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer
	import torch

	# Load the model and tokenizer; the custom model code requires trust_remote_code=True
	model_path = 'kuleshov-group/PlantCaduceus_l32'
	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
	model.eval()
	tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

	# Tokenize a DNA sequence into a tensor of input IDs
	sequence = "ATGCGTACGATCGTAG"
	encoding = tokenizer.encode_plus(
	    sequence,
	    return_tensors="pt",
	    return_attention_mask=False,
	    return_token_type_ids=False
	)
	input_ids = encoding["input_ids"].to(device)

	# Run a forward pass without gradient tracking and request the hidden states
	with torch.inference_mode():
	    outputs = model(input_ids=input_ids, output_hidden_states=True)
	```
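
For zero-shot scoring of a single-nucleotide variant, one common approach is to mask the variant position and compare the model's log-probabilities for the alternate and reference alleles. The sketch below illustrates only that arithmetic in plain Python. It assumes the score is defined as log p(alt) - log p(ref), which may differ from the exact definition used by the authors. The function names and the four-token toy vocabulary are hypothetical, and the logits are made-up numbers rather than model output. With the real model, the logits would presumably come from the masked position of `outputs.logits`, and the allele indices from the tokenizer.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def zero_shot_score(masked_logits, ref_id, alt_id):
    """Log-likelihood ratio of the alternate vs. reference allele.

    Assumed definition: log p(alt) - log p(ref) at the masked position.
    Negative values mean the model finds the alternate allele less likely.
    """
    log_probs = log_softmax(masked_logits)
    return log_probs[alt_id] - log_probs[ref_id]


# Toy logits over a hypothetical vocabulary ordered A, C, G, T (not real model output).
toy_logits = [2.0, 0.5, 0.1, -1.0]
score = zero_shot_score(toy_logits, ref_id=0, alt_id=1)  # reference A, alternate C
```

Because the normalization term cancels in the difference, the score here reduces to the difference of the two raw logits.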

	## Citation
	```bibtex
	@article {Zhai2024.06.04.596709,
	author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R and Scheben, Armin and Stitzer, Michelle C and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
	title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
	elocation-id = {2024.06.04.596709},
	year = {2024},
	doi = {10.1101/2024.06.04.596709},
	URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
	eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
	journal = {bioRxiv}
	}
	```

	## Contact
	Jingjing Zhai ([email protected])