Every Cure Take Home
How to create an API endpoint that adheres to an OpenAPI spec?
How to host an API publicly and for free?
Can use Docker + Hugging Face Spaces.
What type of Hugging Face models do entity extraction?
NER models. Some are fine-tuned on medical terminology, such as d4data/biomedical-ner-all, BioBERT or ClinicalBERT. Could also use LLM calls, but it's hard to judge which would perform better without benchmarking (potential improvement), and they might be more expensive than a simpler fine-tuned BERT model.
BioBERT was trained in 2020; not much documentation on HF, but it's the most popular, with 700k downloads last month.
ClinicalBERT (2023): 47k downloads last month.
Bio ClinicalBERT (2019): 3M downloads.
Clinical NER leaderboard is useful: https://huggingface.co/spaces/m42-health/clinical_ner_leaderboard#:~:text=The%20main%20goal%20of%20the,entities%20across%20diverse%20medical%20domains.
Indeed, LLMs are up there.
What do entities mean in the context of this challenge?
In this context, entities are the named entities that Named Entity Recognition (NER) picks out of text, in particular medical entities (diseases, names of molecules, proteins, medical procedures, etc.).
There are models specifically trained to do NER on text; we'll leverage those.
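Minimal sketch of the NER step with the transformers token-classification pipeline. Assumptions: d4data/biomedical-ner-all as the default model (any of the candidates could be swapped in), a 0.5 score cutoff picked arbitrarily, and the helper names `load_ner` / `keep_confident`, which are made up for these notes. Not benchmarked.

```python
from typing import Callable, Dict, List


def load_ner(model_name: str = "d4data/biomedical-ner-all") -> Callable[[str], List[Dict]]:
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so keep_confident() can be used and tested without transformers.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    return pipeline("token-classification", model=model_name, aggregation_strategy="simple")


def keep_confident(entities: List[Dict], min_score: float = 0.5) -> List[Dict]:
    """Drop low-score spans and keep only the fields the API would return."""
    return [
        {"text": e["word"], "label": e["entity_group"], "start": e["start"], "end": e["end"]}
        for e in entities
        if float(e["score"]) >= min_score  # min_score is an assumed cutoff
    ]
```

Usage would be something like `keep_confident(load_ner()(text))`; the label names depend on whichever model is loaded.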
How to extract text out of a PDF?
pdfplumber works pretty well. As stated below, we'll keep images and tables out of scope here: pdfplumber does extract text from tables, but without time to assess how good that extraction is, we don't know how reliable it is.
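Rough sketch of the text-only extraction with pdfplumber. `extract_pdf_text` and `join_pages` are names invented for these notes; tables and images are not handled specially, and pages with no extractable text are simply skipped.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages where nothing was extracted."""
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_pdf_text(path: str) -> str:
    """Extract plain text from every page of a PDF."""
    import pdfplumber  # lazy import: only needed when a real PDF is read

    with pdfplumber.open(path) as pdf:
        # The generator is fully consumed here, while the file is still open.
        return join_pages(page.extract_text() for page in pdf.pages)
```

Separating pages with a blank line is an arbitrary choice; it keeps page boundaries visible when eyeballing the output.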
How to extract meaningful context that's not just related to the text content? Words around it?
Attention mechanism comes to mind.
Caveats of pdfplumber
We shouldn't include the appendix and references in the mix.
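One way to keep the appendix and references out: cut the extracted text at the first line that looks like a back-matter heading. This is a heuristic sketch, not something validated on real papers; the heading list and the name `strip_back_matter` are assumptions.

```python
import re

# Short lines that usually open the back matter, e.g. "References", "7. References",
# "Appendix A". The set of headings is an assumption and will miss unusual layouts.
_BACK_MATTER = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?"
    r"(?:references|bibliography|appendices|appendix(?:[ \t]+\w+)?)"
    r"[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_back_matter(text: str) -> str:
    """Return the text up to the first References/Appendix-style heading line."""
    match = _BACK_MATTER.search(text)
    return text[: match.start()].rstrip() if match else text
```

The heading must sit alone on its line, so a body sentence that merely starts with "References to ..." is left untouched. Anything after the first matching heading is dropped, which is the intent here but would also drop body text if a paper puts an appendix mid-document.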
torch and uv
torch only works with Python 3.12 in this setup, so pin the Python version.
UV_PYTHON=3.12 uv init
uv add transformers torch pdfplumber marimo gliner
Separate model and app -> probably cleaner, but I don't have the time to do separate model/app deployments (two APIs, etc.). For now, the model in HF with a GPU should run fine.
What's the context size of these BERT models? Do I need to chunk the extracted text?
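BERT-base style models (BioBERT, ClinicalBERT) cap input at 512 tokens, so a full paper needs chunking. A simple sketch using overlapping word windows; the 200-word window and 20-word overlap are guesses meant to stay under 512 tokens after subword splitting, not measured values, and `chunk_words` is a made-up helper name.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap is there so an entity straddling a boundary still appears whole in one chunk; the price is duplicate hits that need de-duplicating afterwards. Words are not tokens, so counting with the model's own tokenizer would be the more exact option.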
Test the FastAPI app
It's got a nice test module
Looks good
https://huggingface.co/blaze999/Medical-NER
https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch
nice: https://huggingface.co/urchade/gliner_base
What's the max length that GLiNER accepts?
Haven't been able to find it.
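For reference, a sketch of calling GLiNER with free-form labels. The label list is whatever gets passed in (nothing here is a fixed GLiNER vocabulary), the 0.5 threshold is arbitrary, and `load_gliner` / `extract_entities` are names made up for these notes. Since the max input length is still unknown, splitting long text before calling it seems the safe default.

```python
from typing import Dict, List


def load_gliner(model_name: str = "urchade/gliner_base"):
    """Load a GLiNER model (downloads weights on first use)."""
    from gliner import GLiNER  # lazy import so the helper below is testable alone

    return GLiNER.from_pretrained(model_name)


def extract_entities(
    model, text: str, labels: List[str], threshold: float = 0.5
) -> Dict[str, List[str]]:
    """Run GLiNER and group the unique entity strings by label."""
    grouped: Dict[str, List[str]] = {}
    for ent in model.predict_entities(text, labels, threshold=threshold):
        bucket = grouped.setdefault(ent["label"], [])
        if ent["text"] not in bucket:  # keep first occurrence, drop repeats
            bucket.append(ent["text"])
    return grouped
```

Usage would be along the lines of `extract_entities(load_gliner(), text, ["disease", "drug", "protein"])`, with the labels chosen to match the entity types the challenge asks for.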
Parts to the problem
- Check how good pdfplumber or PyMuPDF is at extracting text without butchering it.
- I think for now I could focus on text and list image or table parsing as an improvement.
- Identify a suitable model for the task
- Write out the FastAPI endpoint matching the OpenAPI spec
- Write out caching based on filename/content (SHA)
- Write out effective logging in the API backend
- Write out testing of the endpoint
- Deploy
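The caching and logging items can be sketched together: key the cache on a SHA-256 of the file content (so a renamed copy of the same PDF still hits) and log hits and misses. The in-memory dict, the logger name `entity_api`, and the function names are assumptions for illustration; a real deployment might want a persistent or size-bounded cache.

```python
import hashlib
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("entity_api")  # hypothetical logger name

# In-memory cache: lost on restart and unbounded, fine for a sketch only.
_cache: Dict[str, List[dict]] = {}


def content_key(data: bytes) -> str:
    """Cache key derived from the file content, independent of the filename."""
    return hashlib.sha256(data).hexdigest()


def cached_extract(data: bytes, extract: Callable[[bytes], List[dict]]) -> List[dict]:
    """Return cached entities for identical content, else run extract() and store."""
    key = content_key(data)
    if key in _cache:
        logger.info("cache hit sha256=%s", key[:12])
        return _cache[key]
    logger.info("cache miss sha256=%s size=%d bytes", key[:12], len(data))
    result = extract(data)
    _cache[key] = result
    return result
```

Hashing content rather than the filename avoids both false hits (same name, different file) and false misses (same file, different name).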