# Every Cure Take Home

## How to create an API endpoint that adheres to an OpenAPI spec?

## How to host an API publicly and for free?

Can use Docker + Hugging Face.

## What type of Hugging Face models do entity type extraction?

NER models. Some are fine-tuned on medical terminology, such as d4data/biomedical-ner-all, BioBERT or ClinicalBERT.

Could also use LLM calls, but it's hard to judge which would perform better without benchmarking (potential improvement). LLM calls might also be more expensive than a simpler fine-tuned BERT model.

- BioBERT was trained in 2020; not much documentation on HF, but it's the most popular, with 700k downloads last month.
- ClinicalBERT (2023): 47k downloads last month.
- Bio_ClinicalBERT (2019): 3M downloads.

The Clinical NER leaderboard is useful; indeed, LLMs are up there.

## What do entities mean in the context of this challenge?

In this context, entities are meant in the sense of [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition), and in particular medical entities (diseases, names of molecules, proteins, medical procedures, etc.).

There are models specifically trained to do NER on text; we'll leverage those.

## How to extract text out of a PDF?

pdfplumber works pretty well.

As stated below, we'll keep images and tables out of scope. pdfplumber does extract text from tables, but without time to assess the quality of that extraction we don't know how reliable it is.

## How to extract meaningful context that's not just related to the text content?

Words around it? The attention mechanism comes to mind.

## Caveats of pdfplumber

We shouldn't include the appendix and references in the mix.

## torch and uv

torch only works with Python 3.12 in this setup.

- `UV_PYTHON=3.12 uv init`
- `uv add transformers torch pdfplumber marimo gliner`

## Separate model and app -> probably cleaner

But I don't have the time to do separate model/app deployments (two APIs, etc.). For now, the model in HF with a GPU should run fine.

## What's the context size of these BERT models?
Do I need to chunk the extracted text?

## Test the FastAPI app

It's got a nice test module.

## Looks good

- https://huggingface.co/blaze999/Medical-NER
- Nice: https://huggingface.co/urchade/gliner_base

## What's the max length that GLiNER accepts?

Haven't been able to find it.

## Parts to the problem

- Check how good pdfplumber or PyMuPDF is at extracting text without butchering it.
- I think for now I could focus on text and list image or table parsing as an improvement.
- Identify a suitable model for the task.
- Write out the FastAPI endpoint matching the OpenAPI spec.
- Write out caching based on filename/content (SHA).
- Write out effective logging in the API backend.
- Write out testing of the endpoint.
- Deploy.