🍊 Korean Medical DPR (Dense Passage Retrieval)

1. Intro

This is a Bi-Encoder retrieval model for the medical domain.
It uses SapBERT-KO-EN as its base model so that it can handle medical records written in a mix of Korean and English.
Questions are encoded with the Question Encoder, and passages with the Context Encoder.

(β€» This model was trained on the Ultra-Large AI Healthcare Question-Answering dataset from AI Hub.)

2. Model

(1) Self Alignment Pretraining (SAP)

Korean medical records mix Korean and English, so the model must recognize English terms as well.
Using Multi Similarity Loss, the model was trained so that terms sharing the same concept code have high similarity (see the pair-construction sketch after the example below).

e.g.) C3843080 || κ³ ν˜ˆμ•• μ§ˆν™˜
      C3843080 || Hypertension
      C3843080 || High Blood Pressure
      C3843080 || HTN
      C3843080 || HBP
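
To make the role of the concept codes concrete, below is a minimal sketch of how positive pairs can be derived from such "code || term" entries. The grouping logic is an illustrative assumption, not the exact SapBERT-KO-EN pipeline.

from collections import defaultdict
from itertools import combinations

# KOSTOM-style entries: "concept code || term" (values from the example above)
entries = [
    ('C3843080', 'κ³ ν˜ˆμ•• μ§ˆν™˜'),
    ('C3843080', 'Hypertension'),
    ('C3843080', 'High Blood Pressure'),
    ('C3843080', 'HTN'),
    ('C3843080', 'HBP'),
]

# Group terms by concept code; any two terms sharing a code form a positive pair.
terms_by_code = defaultdict(list)
for code, term in entries:
    terms_by_code[code].append(term)

positive_pairs = [pair
                  for terms in terms_by_code.values()
                  for pair in combinations(terms, 2)]
print(positive_pairs[:2])  # [('κ³ ν˜ˆμ•• μ§ˆν™˜', 'Hypertension'), ('κ³ ν˜ˆμ•• μ§ˆν™˜', 'High Blood Pressure')]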

(2) Dense Passage Retrieval (DPR)

Additional fine-tuning is required to turn SapBERT-KO-EN into a retrieval model.
It was fine-tuned DPR-style, with a Bi-Encoder computing the similarity between queries and passages.
The training set was built by augmenting an existing dataset with mixed Korean-English samples, as in the example below (a substitution sketch follows it).

e.g.) Korean disease name: κ³ ν˜ˆμ••
      English disease name: Hypertension
      Query (original): 아버지가 κ³ ν˜ˆμ••μΈλ° 그게 뭔지 λͺ¨λ₯΄κ² μ–΄. κ³ ν˜ˆμ••μ΄ 뭔지 μ„€λͺ…μ’€ ν•΄μ€˜.
      Query (augmented): 아버지가 Hypertension 인데 그게 뭔지 λͺ¨λ₯΄κ² μ–΄. Hypertension 이 뭔지 μ„€λͺ…μ’€ ν•΄μ€˜.
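
The substitution step can be sketched as below, assuming a simple dictionary-based replacement; term_map and augment_query are hypothetical names, not part of the released code.

# Hypothetical Korean -> English disease-name map (illustrative only)
term_map = {'κ³ ν˜ˆμ••': 'Hypertension'}

def augment_query(query: str) -> str:
    # Replace each Korean disease name with its English equivalent
    for ko_term, en_term in term_map.items():
        query = query.replace(ko_term, en_term)
    return query

print(augment_query('아버지가 κ³ ν˜ˆμ••μΈλ° 그게 뭔지 λͺ¨λ₯΄κ² μ–΄. κ³ ν˜ˆμ••μ΄ 뭔지 μ„€λͺ…μ’€ ν•΄μ€˜.'))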

3. Training

(1) Self Alignment Pretraining (SAP)

The base model and hyperparameters used to train SapBERT-KO-EN are listed below.
KOSTOM, a medical terminology dictionary covering both Korean and English terms, served as the training data.

  • Model : klue/bert-base
  • Dataset : KOSTOM
  • Epochs : 1
  • Batch Size : 64
  • Max Length : 64
  • Dropout : 0.1
  • Pooler : 'cls'
  • Eval Step : 100
  • Threshold : 0.8
  • Scale Positive Sample : 1
  • Scale Negative Sample : 60
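
The last three values suggest the alpha, beta, and base parameters of MultiSimilarityLoss in the pytorch-metric-learning library; the mapping below is an assumption, shown only as a minimal sketch of how the loss would be instantiated.

import torch
from pytorch_metric_learning import losses

# Assumed mapping (not confirmed by the released training code):
# Scale Positive Sample -> alpha, Scale Negative Sample -> beta, Threshold -> base
ms_loss = losses.MultiSimilarityLoss(alpha=1, beta=60, base=0.8)

embeddings = torch.randn(8, 768)                 # CLS embeddings of 8 terms
labels = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])  # same concept code => positives
loss = ms_loss(embeddings, labels)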

(2) Dense Passage Retrieval (DPR)

The base model and hyperparameters used for fine-tuning are as follows.

  • Model : SapBERT-KO-EN(klue/bert-base)
  • Dataset : Ultra-Large AI Healthcare Question-Answering dataset (AI Hub)
  • Epochs : 10
  • Batch Size : 64
  • Dropout : 0.1
  • Pooler : 'cls'
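
For reference, DPR fine-tuning of this kind typically optimizes an in-batch-negative objective: each question's paired passage is the positive, and the other passages in the batch act as negatives. The function below is a generic sketch of that objective, not the released training code.

import torch
import torch.nn.functional as F

def dpr_in_batch_loss(q_emb: torch.Tensor, c_emb: torch.Tensor) -> torch.Tensor:
    # (batch, batch) similarity matrix between all questions and all contexts
    scores = q_emb @ c_emb.T
    # Diagonal entries are the gold question-passage pairs
    targets = torch.arange(q_emb.size(0))
    return F.cross_entropy(scores, targets)

q_emb = torch.randn(4, 768)  # question-encoder CLS outputs
c_emb = torch.randn(4, 768)  # context-encoder CLS outputs
loss = dpr_in_batch_loss(q_emb, c_emb)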

4. Example

이 λͺ¨λΈμ€ μ§ˆλ¬Έμ„ μΈμ½”λ”©ν•˜λŠ” λͺ¨λΈλ‘œ, Context λͺ¨λΈκ³Ό ν•¨κ»˜ μ‚¬μš©ν•΄μ•Ό ν•©λ‹ˆλ‹€.
λ™μΌν•œ μ§ˆλ³‘μ— κ΄€ν•œ 질문과 ν…μŠ€νŠΈκ°€ 높은 μœ μ‚¬λ„λ₯Ό λ³΄μΈλ‹€λŠ” 사싀을 확인할 수 μžˆμŠ΅λ‹ˆλ‹€.

(β€» The sample passages in the code below are medical texts generated with ChatGPT.)
(β€» Owing to the nature of the training data, the model works better on text that is more polished than these samples.)

import numpy as np
from transformers import AutoModel, AutoTokenizer

# Question Model
q_model_path = 'snumin44/medical-biencoder-ko-bert-question'
q_model = AutoModel.from_pretrained(q_model_path)
q_tokenizer = AutoTokenizer.from_pretrained(q_model_path)

# Context Model
c_model_path = 'snumin44/medical-biencoder-ko-bert-context'
c_model = AutoModel.from_pretrained(c_model_path)
c_tokenizer = AutoTokenizer.from_pretrained(c_model_path)


query = 'high blood pressure 처방 사둀'

targets = [
    """κ³ ν˜ˆμ•• 진단.
    ν™˜μž 상담 및 μƒν™œμŠ΅κ΄€ ꡐ정 ꢌ고. 저염식, κ·œμΉ™μ μΈ μš΄λ™, κΈˆμ—°, 금주 μ§€μ‹œ.
    ν™˜μž 재방문. ν˜ˆμ••: 150/95mmHg. μ•½λ¬ΌμΉ˜λ£Œ μ‹œμž‘. Amlodipine 5mg 1일 1회 처방.""",
    
    """응급싀 도착 ν›„ μœ„ λ‚΄μ‹œκ²½ 진행.
    μ†Œκ²¬: Gastric ulcerμ—μ„œ Forrest IIb 관찰됨. μΆœν˜ˆμ€ μ†ŒλŸ‰μ˜ μ‚ΌμΆœμ„± 좜혈 ν˜•νƒœ.
    처치: 에피넀프린 μ£Όμ‚¬λ‘œ 좜혈 κ°μ†Œ 확인. Hemoclip 2개둜 좜혈 λΆ€μœ„ ν΄λ¦¬ν•‘ν•˜μ—¬ μ§€ν˜ˆ μ™„λ£Œ.""",
    
    """ν˜ˆμ€‘ 높은 지방 수치 및 지방간 μ†Œκ²¬.
    λ‹€λ°œμ„± gallstones 확인. 증상 없을 경우 κ²½κ³Ό κ΄€μ°° ꢌμž₯.
    우츑 renal cyst, μ–‘μ„± κ°€λŠ₯μ„± λ†’μœΌλ©° 좔가적인 처치 λΆˆν•„μš” 함."""
]

# Encode the query with the question encoder and take the CLS pooler output
query_feature = q_tokenizer(query, return_tensors='pt')
query_outputs = q_model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    # Cosine similarity between two 1-D vectors
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

for idx, target in enumerate(targets):
    # Encode each passage with the context encoder and compare it to the query
    target_feature = c_tokenizer(target, return_tensors='pt')
    target_outputs = c_model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")

Output:

Similarity between query and target 0: 0.2674
Similarity between query and target 1: 0.0416
Similarity between query and target 2: 0.0476

Citing

@inproceedings{liu2021self,
    title={Self-Alignment Pretraining for Biomedical Entity Representations},
    author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
    booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    pages={4228--4238},
    month={June},
    year={2021}
}
@inproceedings{karpukhin2020dense,
    title={Dense Passage Retrieval for Open-Domain Question Answering},
    author={Karpukhin, Vladimir and O{\u{g}}uz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
    booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2020}
}