---
library_name: transformers
tags:
- ner
- biomedical
- disease-recognition
- pubmedbert
- BioMedNLP
datasets:
- rjac/biobert-ner-diseases-dataset
license: mit
language:
- en
metrics:
- precision
- recall
- f1
base_model:
- microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
---

# Model Card for BioMed-NER-English

Fine-tuned BiomedNLP-BiomedBERT model for biomedical named entity recognition, achieving a 0.9868 F1 score (strict entity matching) on disease entity extraction from clinical text.

## Model Details

### Model Description

- **Developed by:** [Aashish Acharya](https://github.com/acharya-jyu)
- **Model type:** BiomedNLP-BiomedBERT (Token Classification)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext (now published as microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)
- **Source Code:** [GitHub Link](https://github.com/Acharya-jyu/ner-model)

### Model Sources

- **Base Model:** [microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext)
- **Training Dataset:** [rjac/biobert-ner-diseases-dataset](https://huggingface.co/datasets/rjac/biobert-ner-diseases-dataset)

## Uses

### Direct Use

This model extracts disease mentions from medical text using the BIO tagging scheme:

- B-Disease: Beginning of a disease mention
- I-Disease: Continuation of a disease mention
- O: Non-disease tokens

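A minimal inference sketch using the `transformers` pipeline API is shown below. The repository id is a placeholder assumption; substitute the actual Hub path of this model.

```python
from transformers import pipeline

# Placeholder repo id: replace with the actual Hub path of this model.
ner = pipeline(
    "token-classification",
    model="acharya-jyu/BioMed-NER-English",
    aggregation_strategy="simple",  # merge B-Disease/I-Disease pieces into whole entity spans
)

text = "The patient was diagnosed with type 2 diabetes mellitus and chronic kidney disease."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```

With `aggregation_strategy="simple"`, subword pieces are merged so each returned item is a complete disease span rather than an individual token.
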
### Training

**Training Data**

- Dataset: rjac/biobert-ner-diseases-dataset
- Size: 21,225 annotated medical sentences
- Split: 15,488 training (73%) / 5,737 testing (27%)
- Average sentence length: 24.3 tokens
- Disease mention frequency: 1.8 per sentence

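For reference, the dataset can be pulled directly from the Hub with the `datasets` library. Split and column names are not documented here, so the sketch below only inspects them rather than assuming a schema.

```python
from datasets import load_dataset

# Download the annotated NER dataset used for fine-tuning.
dataset = load_dataset("rjac/biobert-ner-diseases-dataset")

# Inspect available splits and columns before building a preprocessing pipeline.
print(dataset)
first_split = next(iter(dataset.values()))
print(first_split.column_names)
print(first_split[0])
```
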
**Training Procedure**

**Training Hyperparameters**

- Learning rate: 5e-5
- Batch size: 8
- Epochs: 8
- Optimizer: AdamW with weight decay (0.01)
- Warmup steps: 500
- Early stopping patience: 5
- Loss function: Cross-entropy with label smoothing (0.1)
- Gradient accumulation steps: 4
- Max gradient norm: 1.0

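The exact training script lives in the linked GitHub repository; the sketch below only mirrors the hyperparameters listed above using the Hugging Face `Trainer` API. The output directory, label count, and callback wiring are assumptions.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    TrainingArguments,
)

base = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=3)  # O, B-Disease, I-Disease

args = TrainingArguments(
    output_dir="biomed-ner-english",      # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=8,
    weight_decay=0.01,                    # AdamW weight decay
    warmup_steps=500,
    label_smoothing_factor=0.1,           # cross-entropy with label smoothing
    gradient_accumulation_steps=4,
    max_grad_norm=1.0,
    evaluation_strategy="epoch",          # `eval_strategy` on recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,          # required for early stopping
    metric_for_best_model="f1",
)

# Early stopping with patience 5, passed to Trainer(..., callbacks=[early_stopping])
# together with the tokenized datasets and a DataCollatorForTokenClassification.
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```
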
**Evaluation**

<img src="https://cdn-uploads.huggingface.co/production/uploads/662757230601587f0be9781b/cOW2y9C8ypND8f7lpFC0W.png" width="400" alt="image">
<img src="https://cdn-uploads.huggingface.co/production/uploads/662757230601587f0be9781b/vn5UZUFhkuaz78QvnP01O.png" width="400" alt="image">

**Metrics**

Final model performance:

**Strict Entity Matching:**

- Precision: 0.9869
- Recall: 0.9868
- F1 Score: 0.9868

**Partial Entity Matching:**

- Precision: 0.9527
- Recall: 0.9456
- F1 Score: 0.9491

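The card does not state which tool produced these scores. One common way to compute entity-level precision/recall/F1 over BIO tags is `seqeval`, sketched below, where `mode="strict"` with the IOB2 scheme corresponds to strict entity matching.

```python
from seqeval.metrics import f1_score, precision_score, recall_score
from seqeval.scheme import IOB2

# Gold and predicted tag sequences, one list of BIO tags per sentence (toy example).
y_true = [["O", "B-Disease", "I-Disease", "O"], ["B-Disease", "O", "O"]]
y_pred = [["O", "B-Disease", "I-Disease", "O"], ["B-Disease", "O", "O"]]

# Strict matching: an entity counts only if its boundaries and type match exactly.
print(precision_score(y_true, y_pred, mode="strict", scheme=IOB2))
print(recall_score(y_true, y_pred, mode="strict", scheme=IOB2))
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
```
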
**Error Analysis**

- Boundary errors: 1,154
- Type errors: 0

**Environmental Impact**

- Hardware type: Google Colab GPU
- Hours used: ~2 hours
- Cloud provider: Google Cloud
- Carbon emitted: Not tracked

**Technical Specifications**

**Model Architecture**

- Base model: PubMedBERT (BiomedBERT)
- Hidden size: 768
- Attention heads: 12
- Layers: 12
- Parameters: ~110M

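These figures match the standard BERT-base configuration and can be checked directly from the base model's published config; a quick sketch:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
)
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
print(config.num_hidden_layers)    # 12
```
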
**Compute Infrastructure**

- Platform: Google Colab
- GPU: Tesla T4/P100

## Citation

```bibtex
@misc{acharya2024biomedner,
  title={BioMed-NER-English: BiomedBERT Fine-tuned for Disease Entity Recognition},
  author={Acharya, Aashish},
  year={2024},
  publisher={Hugging Face Model Hub}
}
```

## Model Card Contact

[Aashish Acharya](https://github.com/acharya-jyu)