everycure-ner-pdf / learning.md
Luis Chaves

Every Cure Take Home

How to create an API endpoint that adheres to an OpenAPI spec?

How to host publicly and for free an API?

can use Docker + Hugging Face Spaces

What type of hugging face models do entity type extraction?

NER models; some are fine-tuned on medical terminology, such as d4data/biomedical-ner-all, BioBERT, or ClinicalBERT. Could also use LLM calls, but it's hard to judge which would perform better without benchmarking (a potential improvement), and LLMs would likely be more expensive than a simpler fine-tuned BERT model.

BioBERT was trained in 2020; not much documentation on HF, but it's the most popular (~700k downloads last month). ClinicalBERT (2023): ~47k downloads last month.

Bio+Clinical BERT (2019): ~3M downloads.

the clinical NER leaderboard is useful: https://huggingface.co/spaces/m42-health/clinical_ner_leaderboard

indeed LLMs are up there

What do entities mean in the context of this challenge?

In this context, entities are named entities in the NER (Named Entity Recognition) sense, in particular medical entities (diseases, names of molecules, proteins, medical procedures, etc.)

There are models specifically trained to do NER on text; we'll leverage those.
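A minimal sketch of how one of those models could be wired up through the transformers `pipeline` API — the model id is the one mentioned above, but everything else (function name, example sentence) is an assumption, not the final implementation:

```python
# Sketch: biomedical NER via the transformers pipeline.
# Guarded under __main__ so the model only downloads when run directly.
from transformers import pipeline


def build_ner(model_id: str = "d4data/biomedical-ner-all"):
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge sub-word tokens into whole entity spans
    )


if __name__ == "__main__":
    ner = build_ner()  # downloads the model on first call
    for ent in ner("Patient was prescribed metformin for type 2 diabetes."):
        print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```

`aggregation_strategy="simple"` is what turns per-token predictions back into readable entity spans.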

how to extract text out of a pdf?

pdfplumber works pretty well. As stated below, we'll keep images and tables out of scope; pdfplumber does extract text from tables, but without time to assess the extraction quality we don't know how reliable it is.

how to extract meaningful context that's not just related to the text content? ways around it?

attention mechanism comes to mind

caveats of pdfplumber

we shouldn't include the appendix and references in the mix

torch and uv

torch only works with Python 3.12, so pin the interpreter version

```shell
UV_PYTHON=3.12 uv init
uv add transformers torch pdfplumber marimo gliner
```

separate model and app -> probably cleaner, but don't have the time

TODO: separate model/app deployments (two APIs, etc.); for now, the model on HF with a GPU should run fine

what's the context size of these BERT models? do I need to chunk the input?
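If chunking is needed, a naive word-based splitter is enough to start — a sketch where `MAX_WORDS` is a stand-in for the real token limit (BERT-style models typically cap at 512 tokens, so a word count is only a rough proxy):

```python
# Sketch: split long documents into word-count-bounded chunks so each
# chunk fits within the model's max sequence length.
MAX_WORDS = 200  # assumption; tune against the tokenizer's true limit


def chunk_text(text: str, max_words: int = MAX_WORDS) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

A tokenizer-aware splitter (counting actual tokens, overlapping windows at chunk boundaries) would be a natural improvement.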

test the FastAPI app

it's got a nice test module (TestClient)

looks good

https://huggingface.co/blaze999/Medical-NER

https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch

nice: https://huggingface.co/urchade/gliner_base

what's the max length that GLiNER accepts?

haven't been able to find it

Parts to the problem

  • Check how well pdfplumber or PyMuPDF extracts text without butchering it.
    • For now I could focus on text and list image/table parsing as an improvement.
  • Identify a suitable model for the task
  • Write the FastAPI endpoint matching the OpenAPI spec
    • Write caching based on filename/content (SHA)
    • Write effective logging in the API backend
  • Write tests for the endpoint
  • Deploy
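The caching step above could be sketched as a content-addressed cache keyed on the SHA-256 of the uploaded PDF bytes, so re-uploads of the same file skip re-processing — the in-memory dict is a stand-in for whatever store the deployment ends up using:

```python
# Sketch: SHA-256 content-addressed cache for extraction results.
import hashlib

_cache: dict[str, list[dict]] = {}


def cache_key(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def get_or_compute(content: bytes, compute) -> list[dict]:
    key = cache_key(content)
    if key not in _cache:
        _cache[key] = compute(content)  # only runs on a cache miss
    return _cache[key]
```

Keying on content rather than filename means two differently named copies of the same PDF share one cache entry.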