---
library_name: transformers
tags:
- ner
- biomedical
- disease-recognition
- pubmedbert
- BioMedNLP
datasets:
- rjac/biobert-ner-diseases-dataset
license: mit
language:
- en
metrics:
- precision
- recall
- f1
base_model:
- microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
---

# Model Card for BioMed-NER-English

Fine-tuned BiomedNLP-BiomedBERT model for biomedical named entity recognition, achieving a 0.9868 F1 score (strict entity matching) on disease entity extraction from clinical text.

## Model Details

### Model Description

- **Developed by:** [Aashish Acharya](https://github.com/acharya-jyu)
- **Model type:** BiomedNLP-BiomedBERT (Token Classification)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext (now published as microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)
- **Source Code:** [GitHub Link](https://github.com/Acharya-jyu/ner-model)

### Model Sources

- **Base Model:** [microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext)
- **Training Dataset:** [rjac/biobert-ner-diseases-dataset](https://huggingface.co/datasets/rjac/biobert-ner-diseases-dataset)

## Uses

### Direct Use

This model extracts disease mentions from medical text using the BIO tagging scheme:

- B-Disease: Beginning of a disease mention
- I-Disease: Continuation of a disease mention
- O: Non-disease tokens

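A minimal inference sketch using the `transformers` pipeline API is shown below. The repository id is a placeholder assumption; substitute the actual Hub path of this model.

```python
from transformers import pipeline

# Placeholder repo id: replace with the actual Hub path of this model.
ner = pipeline(
    "token-classification",
    model="acharya-jyu/BioMed-NER-English",
    aggregation_strategy="simple",  # merge B-Disease/I-Disease pieces into whole entity spans
)

text = "The patient was diagnosed with type 2 diabetes mellitus and chronic kidney disease."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```

With `aggregation_strategy="simple"`, subword pieces are merged so each returned item is a complete disease span rather than an individual token.
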
### Training

**Training Data**

- Dataset: rjac/biobert-ner-diseases-dataset
- Size: 21,225 annotated medical sentences
- Split: 15,488 training (73%) / 5,737 testing (27%)
- Average sentence length: 24.3 tokens
- Disease mention frequency: 1.8 per sentence

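For reference, the dataset can be pulled directly from the Hub with the `datasets` library. Split and column names are not documented here, so the sketch below only inspects them rather than assuming a schema.

```python
from datasets import load_dataset

# Download the annotated NER dataset used for fine-tuning.
dataset = load_dataset("rjac/biobert-ner-diseases-dataset")

# Inspect available splits and columns before building a preprocessing pipeline.
print(dataset)
first_split = next(iter(dataset.values()))
print(first_split.column_names)
print(first_split[0])
```
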
**Training Procedure**

**Training Hyperparameters**

- Learning rate: 5e-5
- Batch size: 8
- Epochs: 8
- Optimizer: AdamW with weight decay (0.01)
- Warmup steps: 500
- Early stopping patience: 5
- Loss function: Cross-entropy with label smoothing (0.1)
- Gradient accumulation steps: 4
- Max gradient norm: 1.0

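The exact training script lives in the linked GitHub repository; the sketch below only mirrors the hyperparameters listed above using the Hugging Face `Trainer` API. The output directory, label count, and callback wiring are assumptions.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    TrainingArguments,
)

base = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=3)  # O, B-Disease, I-Disease

args = TrainingArguments(
    output_dir="biomed-ner-english",      # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=8,
    weight_decay=0.01,                    # AdamW weight decay
    warmup_steps=500,
    label_smoothing_factor=0.1,           # cross-entropy with label smoothing
    gradient_accumulation_steps=4,
    max_grad_norm=1.0,
    evaluation_strategy="epoch",          # `eval_strategy` on recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,          # required for early stopping
    metric_for_best_model="f1",
)

# Early stopping with patience 5, passed to Trainer(..., callbacks=[early_stopping])
# together with the tokenized datasets and a DataCollatorForTokenClassification.
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```
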
**Evaluation**

<img src="https://cdn-uploads.huggingface.co/production/uploads/662757230601587f0be9781b/cOW2y9C8ypND8f7lpFC0W.png" width="400" alt="image">
<img src="https://cdn-uploads.huggingface.co/production/uploads/662757230601587f0be9781b/vn5UZUFhkuaz78QvnP01O.png" width="400" alt="image">

**Metrics**

Final model performance:

**Strict Entity Matching:**

- Precision: 0.9869
- Recall: 0.9868
- F1 Score: 0.9868

**Partial Entity Matching:**

- Precision: 0.9527
- Recall: 0.9456
- F1 Score: 0.9491

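The card does not state which tool produced these scores. One common way to compute entity-level precision/recall/F1 over BIO tags is `seqeval`, sketched below, where `mode="strict"` with the IOB2 scheme corresponds to strict entity matching.

```python
from seqeval.metrics import f1_score, precision_score, recall_score
from seqeval.scheme import IOB2

# Gold and predicted tag sequences, one list of BIO tags per sentence (toy example).
y_true = [["O", "B-Disease", "I-Disease", "O"], ["B-Disease", "O", "O"]]
y_pred = [["O", "B-Disease", "I-Disease", "O"], ["B-Disease", "O", "O"]]

# Strict matching: an entity counts only if its boundaries and type match exactly.
print(precision_score(y_true, y_pred, mode="strict", scheme=IOB2))
print(recall_score(y_true, y_pred, mode="strict", scheme=IOB2))
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
```
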
**Error Analysis**

- Boundary errors: 1,154
- Type errors: 0

**Environmental Impact**

- Hardware type: Google Colab GPU
- Hours used: ~2 hours
- Cloud provider: Google Cloud
- Carbon emitted: Not tracked

**Technical Specifications**

**Model Architecture**

- Base model: PubMedBERT (BiomedBERT)
- Hidden size: 768
- Attention heads: 12
- Layers: 12
- Parameters: ~110M

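These figures match the standard BERT-base configuration and can be checked directly from the base model's published config; a quick sketch:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
)
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
print(config.num_hidden_layers)    # 12
```
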
**Compute Infrastructure**

- Platform: Google Colab
- GPU: Tesla T4/P100

## Citation

```bibtex
@misc{acharya2024biomedner,
  title={BioMed-NER-English: BiomedBERT Fine-tuned for Disease Entity Recognition},
  author={Acharya, Aashish},
  year={2024},
  publisher={Hugging Face Model Hub}
}
```

## Model Card Contact

[Aashish Acharya](https://github.com/acharya-jyu)