---
library_name: transformers
tags:
- ner
- biomedical
- disease-recognition
- pubmedbert
- BioMedNLP
datasets:
- rjac/biobert-ner-diseases-dataset
license: mit
language:
- en
metrics:
- precision
- recall
- f1
base_model:
- microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
---
# Model Card for BioMed-NER-English
Fine-tuned BiomedBERT model for biomedical named entity recognition, achieving a 0.9868 strict-matching F1 score on disease entity extraction from clinical text.
## Model Details
### Model Description
- **Developed by:** [Aashish Acharya](https://github.com/acharya-jyu)
- **Model type:** BiomedNLP-BiomedBERT (Token Classification)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
- **Source Code:** [GitHub Link](https://github.com/Acharya-jyu/ner-model)
### Model Sources
- **Base Model:** [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)
- **Training Dataset:** [rjac/biobert-ner-diseases-dataset](https://huggingface.co/datasets/rjac/biobert-ner-diseases-dataset)
## Uses
### Direct Use
The model extracts disease mentions from medical text using the BIO tagging scheme:
- B-Disease: Beginning of disease mention
- I-Disease: Continuation of disease mention
- O: Non-disease tokens
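The model can be exercised with the standard `transformers` token-classification pipeline. A minimal sketch, assuming the checkpoint is available on the Hub (the repo id below is a placeholder; substitute the actual one):

```python
from transformers import pipeline

# Placeholder repo id -- replace with the actual Hub id of this checkpoint.
ner = pipeline(
    "token-classification",
    model="acharya-jyu/BioMed-NER-English",
    aggregation_strategy="simple",  # merges B-Disease/I-Disease pieces into whole spans
)

text = "The patient was diagnosed with type 2 diabetes and chronic kidney disease."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```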
## Training Details
### Training Data
- Dataset: [rjac/biobert-ner-diseases-dataset](https://huggingface.co/datasets/rjac/biobert-ner-diseases-dataset)
- Size: 21,225 annotated medical sentences
- Split: 15,488 training (73%) / 5,737 testing (27%)
- Average sentence length: 24.3 tokens
- Disease mention frequency: 1.8 per sentence
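The dataset can be pulled directly from the Hub for inspection (column names vary between datasets, so check the printed schema rather than assuming field names):

```python
from datasets import load_dataset

ds = load_dataset("rjac/biobert-ner-diseases-dataset")
print(ds)              # splits and column names
print(ds["train"][0])  # one annotated sentence
```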
### Training Procedure
#### Training Hyperparameters
- Learning rate: 5e-5
- Batch size: 8
- Epochs: 8
- Optimizer: AdamW with weight decay (0.01)
- Warmup steps: 500
- Early stopping patience: 5
- Loss function: Cross-entropy with label smoothing (0.1)
- Gradient accumulation steps: 4
- Max gradient norm: 1.0
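A minimal sketch of how these hyperparameters map onto the `transformers` `TrainingArguments` API (an illustration, not the original training script):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="biomed-ner",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=8,
    weight_decay=0.01,               # AdamW weight decay
    warmup_steps=500,
    gradient_accumulation_steps=4,
    max_grad_norm=1.0,
    label_smoothing_factor=0.1,      # cross-entropy with label smoothing
    eval_strategy="epoch",           # "evaluation_strategy" on transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
)

# These arguments would then be passed to a Trainer together with
# callbacks=[EarlyStoppingCallback(early_stopping_patience=5)].
```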
## Evaluation
### Metrics
Final model performance:
| Matching mode | Precision | Recall | F1 |
|---|---|---|---|
| Strict entity matching | 0.9869 | 0.9868 | 0.9868 |
| Partial entity matching | 0.9527 | 0.9456 | 0.9491 |
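Strict entity matching of this kind is conventionally computed with `seqeval`, which credits a prediction only when both the entity boundary and the type match exactly; partial matching also rewards overlapping spans. A toy illustration of the strict criterion (not the original evaluation code):

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Toy example: one boundary error (prediction misses the final I-Disease token).
y_true = [["O", "B-Disease", "I-Disease", "I-Disease", "O"]]
y_pred = [["O", "B-Disease", "I-Disease", "O", "O"]]

print(precision_score(y_true, y_pred))  # 0.0 -- strict matching gives no credit
print(recall_score(y_true, y_pred))     # 0.0
print(f1_score(y_true, y_pred))         # 0.0
```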
### Error Analysis
- Boundary errors (correct entity type, inexact span): 1,154
- Type errors (wrong entity label): 0
## Environmental Impact
- Hardware type: Google Colab GPU
- Hours used: ~2
- Cloud provider: Google Cloud
- Carbon emitted: not tracked
## Technical Specifications
### Model Architecture
- Base model: BiomedBERT (formerly PubMedBERT)
- Hidden size: 768
- Attention heads: 12
- Layers: 12
- Parameters: ~110M
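These figures match the standard BERT-base configuration and can be read directly from the base model's config:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
)
print(cfg.hidden_size, cfg.num_attention_heads, cfg.num_hidden_layers)  # 768 12 12
```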
### Compute Infrastructure
- Platform: Google Colab
- GPU: Tesla T4/P100
## Citation
```bibtex
@misc{acharya2024biomedner,
  title={BioMed-NER-English: BiomedBERT Fine-tuned for Disease Entity Recognition},
  author={Acharya, Aashish},
  year={2024},
  publisher={Hugging Face Model Hub}
}
```
## Model Card Contact
[Aashish Acharya](https://github.com/acharya-jyu)