---
library_name: transformers
tags:
- ner
- biomedical
- disease-recognition
- pubmedbert
- BioMedNLP
datasets:
- rjac/biobert-ner-diseases-dataset
license: mit
language:
- en
metrics:
- precision
- recall
- f1
base_model:
- microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
---

# Model Card for BioMed-NER-English

Fine-tuned BiomedNLP-BiomedBERT model for biomedical named entity recognition, achieving a 0.9868 F1 score on disease entity extraction from clinical text.

## Model Details

### Model Description

- **Developed by:** [Aashish Acharya](https://github.com/acharya-jyu)
- **Model type:** BiomedNLP-BiomedBERT (Token Classification)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
- **Source Code:** [GitHub Link](https://github.com/Acharya-jyu/ner-model)

### Model Sources

- **Base Model:** [microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext)
- **Training Dataset:** [rjac/biobert-ner-diseases-dataset](https://huggingface.co/datasets/rjac/biobert-ner-diseases-dataset)

## Uses

### Direct Use

The model extracts disease mentions from medical text using the BIO tagging scheme (a minimal inference sketch is given in the Usage Example section at the end of this card):

- **B-Disease:** beginning of a disease mention
- **I-Disease:** continuation of a disease mention
- **O:** non-disease tokens

## Training

### Training Data

- **Dataset:** [rjac/biobert-ner-diseases-dataset](https://huggingface.co/datasets/rjac/biobert-ner-diseases-dataset)
- **Size:** 21,225 annotated medical sentences
- **Split:** 15,488 training (73%) / 5,737 testing (27%)
- **Average sentence length:** 24.3 tokens
- **Disease mention frequency:** 1.8 per sentence

### Training Procedure

The model was fine-tuned with the following hyperparameters (see the training sketch at the end of this card):

- Learning rate: 5e-5
- Batch size: 8
- Epochs: 8
- Optimizer: AdamW with weight decay (0.01)
- Warmup steps: 500
- Early stopping patience: 5
- Loss function: cross-entropy with label smoothing (0.1)
- Gradient accumulation steps: 4
- Max gradient norm: 1.0

## Evaluation

### Metrics

Final model performance on the test split (see the evaluation note at the end of this card for how strict and partial matching differ):

| Matching mode           | Precision | Recall | F1 Score |
|-------------------------|-----------|--------|----------|
| Strict entity matching  | 0.9869    | 0.9868 | 0.9868   |
| Partial entity matching | 0.9527    | 0.9456 | 0.9491   |

### Error Analysis

- Boundary errors: 1,154
- Type errors: 0

## Environmental Impact

- **Hardware type:** Google Colab GPU
- **Hours used:** ~2
- **Cloud provider:** Google Cloud
- **Carbon emitted:** not tracked

## Technical Specifications

### Model Architecture

- Base model: PubMedBERT
- Hidden size: 768
- Attention heads: 12
- Layers: 12
- Parameters: ~110M

### Compute Infrastructure

- Platform: Google Colab
- GPU: Tesla T4/P100

## Citation

```bibtex
@misc{acharya2024biomedner,
  title={BioMed-NER-English: Fine-tuned BiomedNLP-BiomedBERT for Disease Entity Recognition},
  author={Acharya, Aashish},
  year={2024},
  publisher={Hugging Face Model Hub}
}
```

## Model Card Contact

[Aashish Acharya](https://github.com/acharya-jyu)
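
## Usage Example

A minimal inference sketch using the `transformers` pipeline API. The repository ID below is a placeholder (this card does not state the model's Hub ID); substitute the actual ID. `aggregation_strategy="simple"` merges subword B-Disease/I-Disease tags into whole entity spans.

```python
from transformers import pipeline

# Placeholder repository ID -- replace with this model's actual Hub ID.
MODEL_ID = "acharya-jyu/BioMed-NER-English"

# Token-classification pipeline; aggregation_strategy="simple" merges
# word-piece B-/I-Disease tags into complete entity spans.
ner = pipeline("token-classification", model=MODEL_ID, aggregation_strategy="simple")

text = "The patient presented with type 2 diabetes mellitus and chronic kidney disease."
for entity in ner(text):
    print(f"{entity['word']!r} -> {entity['entity_group']} (score={entity['score']:.3f})")
```

Each returned dict also carries `start`/`end` character offsets, which is convenient for highlighting the extracted spans in the source text.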
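## Training Sketch

For reproducibility, a hedged sketch of how the hyperparameters listed above map onto `transformers.Trainer`. Dataset loading, tokenization, and label alignment are elided: `tokenized_train` and `tokenized_eval` are hypothetical pre-processed splits of the training dataset, and the output directory name is illustrative.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

BASE = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
labels = ["O", "B-Disease", "I-Disease"]

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForTokenClassification.from_pretrained(BASE, num_labels=len(labels))

args = TrainingArguments(
    output_dir="biomed-ner-english",  # illustrative
    learning_rate=5e-5,               # reported learning rate
    per_device_train_batch_size=8,    # reported batch size
    num_train_epochs=8,               # reported epochs
    weight_decay=0.01,                # AdamW weight decay
    warmup_steps=500,                 # reported warmup steps
    label_smoothing_factor=0.1,       # cross-entropy label smoothing
    gradient_accumulation_steps=4,
    max_grad_norm=1.0,
    eval_strategy="epoch",            # "evaluation_strategy" in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,    # hypothetical pre-tokenized, label-aligned split
    eval_dataset=tokenized_eval,      # hypothetical
    data_collator=DataCollatorForTokenClassification(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # reported patience
)
trainer.train()
```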
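## Evaluation Note

The card does not state which scorer produced the strict and partial numbers. Strict matching credits a predicted entity only when both its span and its type exactly match a gold annotation; partial matching also gives credit for overlapping spans (as implemented, for example, in `nervaluate`). Strict span-level scores of this kind are commonly computed with `seqeval`; a toy sketch with invented label sequences:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Toy gold and predicted BIO sequences for two sentences.
y_true = [["O", "B-Disease", "I-Disease", "O"], ["B-Disease", "O", "O"]]
y_pred = [["O", "B-Disease", "I-Disease", "O"], ["B-Disease", "I-Disease", "O"]]

# seqeval scores at the entity level: a prediction counts only if its
# span and type exactly match a gold entity (strict matching).
print(precision_score(y_true, y_pred))  # 0.5 (1 of 2 predicted entities exact)
print(recall_score(y_true, y_pred))     # 0.5 (1 of 2 gold entities recovered)
print(f1_score(y_true, y_pred))         # 0.5
```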