---
library_name: transformers
tags:
  - ner
  - biomedical
  - disease-recognition
  - pubmedbert
  - BioMedNLP
datasets:
  - rjac/biobert-ner-diseases-dataset
license: mit
language:
  - en
metrics:
  - precision
  - recall
  - f1
base_model:
  - microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
---

# Model Card for BioMed-NER-English

A fine-tuned BiomedNLP-BiomedBERT model for medical named entity recognition, achieving a 0.9868 F1 score on disease entity extraction from clinical text.

## Model Details

### Model Description

- **Developed by**: Aashish Acharya
- **Model type**: BiomedNLP-BiomedBERT (Token Classification)
- **Language(s)**: English
- **License**: MIT
- **Finetuned from model**: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
- **Source Code**: GitHub Link


## Uses

### Direct Use

This model extracts disease mentions from medical text using the BIO tagging scheme (a minimal usage sketch follows the list below):

- **B-Disease**: beginning of a disease mention
- **I-Disease**: continuation of a disease mention
- **O**: non-disease tokens
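A minimal inference sketch using the `transformers` pipeline API. The model id `acharya-jyu/BioMed-NER-English` is assumed from this repository, and the example sentence is illustrative:

```python
from transformers import pipeline

# Token-classification pipeline; "simple" aggregation merges B-/I- word
# pieces into whole entity spans.
ner = pipeline(
    "token-classification",
    model="acharya-jyu/BioMed-NER-English",  # assumed repository id
    aggregation_strategy="simple",
)

text = "The patient was diagnosed with type 2 diabetes and hypertension."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```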

## Training

### Training Data

- **Dataset**: [rjac/biobert-ner-diseases-dataset](https://huggingface.co/datasets/rjac/biobert-ner-diseases-dataset)
- **Size**: 21,225 annotated medical sentences
- **Split**: 15,488 training (73%) / 5,737 testing (27%)
- **Average sentence length**: 24.3 tokens
- **Disease mention frequency**: 1.8 per sentence
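The dataset can be pulled directly from the Hub with the `datasets` library; a minimal sketch (split names should be checked against the dataset repository):

```python
from datasets import load_dataset

# Download the annotated disease-NER dataset from the Hugging Face Hub.
ds = load_dataset("rjac/biobert-ner-diseases-dataset")
print(ds)  # inspect the available splits and features
```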

### Training Procedure

#### Training Hyperparameters

- Learning rate: 5e-5
- Batch size: 8
- Epochs: 8
- Optimizer: AdamW with weight decay (0.01)
- Warmup steps: 500
- Early stopping patience: 5
- Loss function: cross-entropy with label smoothing (0.1)
- Gradient accumulation steps: 4
- Max gradient norm: 1.0
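A sketch of how these hyperparameters might map onto `transformers` `TrainingArguments`; the output path and evaluation/save strategies are assumptions, not taken from this card:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="biomed-ner-english",   # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=8,
    weight_decay=0.01,                 # AdamW weight decay
    warmup_steps=500,
    label_smoothing_factor=0.1,        # cross-entropy with label smoothing
    gradient_accumulation_steps=4,
    max_grad_norm=1.0,
    eval_strategy="epoch",             # assumed; required for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Early stopping with patience 5 is supplied to the Trainer as a callback:
# Trainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=5)])
```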

## Evaluation

### Metrics

Final model performance:

**Strict Entity Matching:**

- Precision: 0.9869
- Recall: 0.9868
- F1 Score: 0.9868

**Partial Entity Matching:**

- Precision: 0.9527
- Recall: 0.9456
- F1 Score: 0.9491
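Strict matching credits an entity only when both its boundaries and its type agree with the gold annotation, while partial matching also credits overlapping spans. A minimal sketch of strict scoring with `seqeval` (the use of `seqeval` here is an assumption, not confirmed by this card):

```python
from seqeval.metrics import f1_score, precision_score, recall_score
from seqeval.scheme import IOB2

# Toy gold and predicted tag sequences in the BIO scheme described above.
y_true = [["O", "B-Disease", "I-Disease", "O", "B-Disease"]]
y_pred = [["O", "B-Disease", "I-Disease", "O", "O"]]

# mode="strict" counts an entity as correct only on an exact span+type match.
print(precision_score(y_true, y_pred, mode="strict", scheme=IOB2))
print(recall_score(y_true, y_pred, mode="strict", scheme=IOB2))
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
```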

### Error Analysis

- Boundary Errors: 1,154 (predicted span overlaps a gold entity but its boundaries differ)
- Type Errors: 0 (span matched with the wrong entity type)

## Environmental Impact

- **Hardware Type**: Google Colab GPU
- **Hours used**: ~2 hours
- **Cloud Provider**: Google Cloud
- **Carbon Emitted**: not tracked

## Technical Specifications

### Model Architecture

- Base model: PubMedBERT
- Hidden size: 768
- Attention heads: 12
- Layers: 12
- Parameters: ~110M
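These dimensions can be verified from the published config; a small sketch using `AutoConfig` (the base-model id is taken from the metadata above):

```python
from transformers import AutoConfig

# Inspect the architecture dimensions reported in this section.
config = AutoConfig.from_pretrained(
    "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
)
print(config.hidden_size, config.num_attention_heads, config.num_hidden_layers)
```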

### Compute Infrastructure

- Platform: Google Colab
- GPU: Tesla T4/P100

## Citation

```bibtex
@misc{acharya2024biomednerenglish,
  title={BioMed-NER-English: BiomedBERT Fine-tuned for Disease Entity Recognition},
  author={Acharya, Aashish},
  year={2024},
  publisher={Hugging Face Model Hub}
}
```

## Model Card Contact

Aashish Acharya