📖 How It Works

Where the data comes from · how the pipeline works · what confidence means

🗄 Where does the data come from?
🧬 WHO ATC Classification
Anatomical Therapeutic Chemical (ATC): the WHO standard for classifying drugs, e.g. A10BA02 = Metformin (Antidiabetics).

The system uses ATC as a backbone to map drugs → therapeutic class → MDC.
whocc.no/atc_ddd_index
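To make the drug → therapeutic class step concrete, here is a minimal sketch that splits an ATC code into its five standard levels and looks up the class by the second-level prefix. The function name `atc_levels` and the table `ATC_CLASS_NAMES` are hypothetical and not taken from the tool; the further step from class to MDC is not shown because this page does not describe it.

```python
def atc_levels(code):
    """Split a 7-character ATC code into its five hierarchical levels."""
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return {
        "anatomical_group": code[:1],          # A
        "therapeutic_subgroup": code[:3],      # A10
        "pharmacological_subgroup": code[:4],  # A10B
        "chemical_subgroup": code[:5],         # A10BA
        "chemical_substance": code,            # A10BA02
    }

# Hypothetical lookup table keyed by the second ATC level (illustration only).
ATC_CLASS_NAMES = {"A10": "Drugs used in diabetes"}

def therapeutic_class(code):
    """Return the therapeutic class name for an ATC code, or None if unknown."""
    return ATC_CLASS_NAMES.get(atc_levels(code)["therapeutic_subgroup"])
```

For example, `therapeutic_class("A10BA02")` resolves metformin's code through the `A10` prefix.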
๐Ÿฅ Drug-Disease Crosswalk
Manually curated knowledge base, referencing:
โ€ข Thai Drug Database (Health Systems Research Institute)
โ€ข Indications from prescribing information
โ€ข Thai NHSO approved indications
โ€ข DrugBank / RxNorm crosswalk

Each drug has multiple indications with confidence weights.
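One way to picture "multiple indications with confidence weights" is a mapping from generic name to a list of (ICD code, weight) pairs. The layout below is an assumption for illustration and may not match the real `DRUG_ICD_KB`; the helper `primary_indication` is hypothetical, and the second amoxicillin pair is an invented placeholder (only the primary codes and the 0.95 / 0.70 weights come from this page).

```python
# Hypothetical KB layout: generic name -> list of (ICD-10-TM code, base weight).
DRUG_ICD_KB = {
    "levothyroxine": [("E03.9", 0.95)],
    # The second pair is a made-up alternate, included only to show the shape.
    "amoxicillin": [("J06.9", 0.70), ("J18.9", 0.50)],
}

def primary_indication(drug):
    """Return the highest-weight (code, weight) pair, or None if the drug is unknown."""
    indications = DRUG_ICD_KB.get(drug.lower())
    if not indications:
        return None
    return max(indications, key=lambda pair: pair[1])
```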
📋 ICD-10-TM (Thai Modification)
ICD-10-TM is the Thai Modification of the WHO ICD-10, maintained by the
Department of Medical Services, Ministry of Public Health. Used for:
• National Health Security Office (NHSO) UC Scheme reimbursement claims
• TDRG v6.3 grouper
• 43-file standard dataset

Available for download at icd10.thaicasemix.com
⚙ How does the Pipeline work?
Stage 1
keyword
Direct keyword lookup
Drug names are normalized first (stripping dosage, dosage form, salt form), then exact-matched against the knowledge base.
Thai alias e.g. "กลูโคฟาจ" → "metformin" → E11.9
Fastest · deterministic · covers ~70% of common drugs
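The normalize-then-exact-match step can be sketched as follows. This is a minimal illustration, not the tool's code: the regular expression, the dosage-form and salt word lists, and the table name `KB_PRIMARY_CODE` are all assumptions, and the real normalizer likely handles many more cases.

```python
import re

# Hypothetical tables for illustration; the real ones live in drug_mapper.py.
THAI_DRUG_ALIASES = {"กลูโคฟาจ": "metformin"}
KB_PRIMARY_CODE = {"metformin": "E11.9", "amlodipine": "I10"}

# Assumed noise patterns: strengths like "500 mg", plus a few form and salt words.
_STRENGTH = re.compile(r"\b\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml|iu)\b", re.IGNORECASE)
_NOISE = {"tab", "tablet", "tablets", "cap", "capsule", "capsules",
          "syrup", "inj", "injection", "hydrochloride", "hcl", "besylate"}

def normalize(raw):
    """Lowercase, strip strength / dosage form / salt form, then resolve Thai aliases."""
    text = _STRENGTH.sub(" ", raw.lower())
    tokens = [t for t in re.split(r"[\s,()/]+", text) if t and t not in _NOISE]
    name = " ".join(tokens)
    return THAI_DRUG_ALIASES.get(name, name)

def keyword_lookup(raw):
    """Exact-match the normalized name; None means fall through to the next stage."""
    return KB_PRIMARY_CODE.get(normalize(raw))
```

Because the lookup is a plain dictionary hit, the result is deterministic, which is what makes this stage the cheapest one to run first.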
↓ if confidence < threshold
Stage 2
TF-IDF
TF-IDF Cosine Similarity (char n-gram)
Uses character n-grams (2–4 chars) instead of word tokenization, which handles:
• Brand name: "Cefurox" → cefuroxime
• Salt form: "Amlodipine Besylate" → amlodipine
• Abbreviation: "co-trimox" → cotrimoxazole
The input is vectorized with TF-IDF, then compared by cosine similarity against the drug corpus.
Score = cosine_sim × base_weight (from the KB)
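The idea behind this stage can be sketched with the standard library alone. This is an illustrative re-implementation under stated assumptions, not the tool's code (which may well use a library vectorizer): the smoothed IDF formula, the function names, and the four-drug corpus are all chosen for the example, and query n-grams unseen in the corpus are simply dropped.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """All character n-grams of length n_min..n_max, lowercased."""
    text = text.lower()
    return [text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def build_index(corpus):
    """Return (idf, vectors): smoothed IDF per n-gram and a TF-IDF vector per drug."""
    docs = {name: Counter(char_ngrams(name)) for name in corpus}
    df = Counter(g for tf in docs.values() for g in tf)  # document frequency
    idf = {g: math.log((1 + len(docs)) / (1 + c)) + 1.0 for g, c in df.items()}
    vectors = {name: {g: tf[g] * idf[g] for g in tf} for name, tf in docs.items()}
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def best_match(query, idf, vectors):
    """Return (cosine_sim, drug_name) for the closest corpus entry."""
    tf = Counter(char_ngrams(query))
    q_vec = {g: c * idf[g] for g, c in tf.items() if g in idf}  # unseen n-grams dropped
    return max((cosine(q_vec, vec), name) for name, vec in vectors.items())

IDF, VECTORS = build_index(["cefuroxime", "amlodipine", "cotrimoxazole", "metformin"])
```

The stage's score would then be the returned similarity multiplied by the matched drug's base_weight, as in the formula above.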
↓ if confidence < 40%
Stage 3
fuzzy
Fuzzy String Match (difflib)
Uses Python's standard-library difflib.get_close_matches, which is built on SequenceMatcher
Handles bad typos: "Metofrmin" → metformin, "Furosemid" → furosemide
Slower than TF-IDF but catches edge cases that the n-gram stage misses
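A minimal sketch of this stage with `difflib.get_close_matches`. The candidate list, the wrapper name `fuzzy_match`, and the cutoff of 0.6 (difflib's default) are assumptions for the example; the tool's actual cutoff is not stated on this page.

```python
import difflib

# Small stand-in for the KB's drug names.
KNOWN_DRUGS = ["metformin", "furosemide", "metoprolol", "amlodipine"]

def fuzzy_match(raw, cutoff=0.6):
    """Return the closest known drug name, or None if nothing clears the cutoff."""
    hits = difflib.get_close_matches(raw.lower(), KNOWN_DRUGS, n=1, cutoff=cutoff)
    return hits[0] if hits else None
```

With these candidates, the transposed "Metofrmin" and the truncated "Furosemid" both resolve to the intended names, while an input with no characters in common with any candidate returns `None`.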
↓ if confidence is still < 40% and --model is specified
Stage 4
LLM
Mistral-7B Fallback (optional)
Calls mistral-7b-instruct-v0.2.Q4_K_M.gguf via llama-cpp-python
Prompt asks the LLM: "What disease is this drug used to treat? Answer as ICD-10-TM JSON"
Returns top-3 candidates with reasoning.
Used only as a last resort · requires model file (~4GB) · CPU-only, slower
⚠ The LLM may hallucinate ICD codes that don't exist, so always validate its output against the ICD-10-TM reference table
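The validation step recommended above can be sketched as a filter over the model's JSON reply. The reply schema (a `candidates` list of objects with `code` and `reasoning` keys), the function name, and the three-code reference set are assumptions for illustration; the actual prompt output format is not documented here, and the model call itself is not shown.

```python
import json

# Tiny stand-in for the ICD-10-TM reference table.
VALID_ICD10_TM = {"E11.9", "I10", "J45.9"}

def validate_candidates(llm_reply, valid_codes):
    """Parse the reply and keep at most three candidates whose code exists in valid_codes."""
    try:
        payload = json.loads(llm_reply)
    except json.JSONDecodeError:
        return []  # malformed output is treated as "no usable answer"
    candidates = payload.get("candidates", []) if isinstance(payload, dict) else []
    kept = [c for c in candidates
            if isinstance(c, dict) and c.get("code") in valid_codes]
    return kept[:3]
```

Returning an empty list on any parse failure keeps a bad reply from ever reaching the output; the drug simply stays in the manual queue.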
📊 What is the Confidence Score?
Confidence is not a statistical probability, but a composite score derived from:
Stage 1 & 3
confidence = base_weight
base_weight is defined in the KB according to specificity:
• 0.95 = drugs with essentially a single indication (levothyroxine → hypothyroidism)
• 0.70 = drugs covering multiple disease groups (amoxicillin → multiple infections)
Stage 2 (TF-IDF)
confidence = cosine_sim × base_weight
cosine_sim = similarity of the char n-gram vectors
On an exact match (cosine_sim = 1.0), the score equals the full base_weight
On a partial match, the score drops in proportion to cosine_sim
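The two formulas above can be written as one small function. The stage labels and the function name are hypothetical; only the arithmetic comes from this page.

```python
def confidence(stage, base_weight, cosine_sim=1.0):
    """Composite score: base_weight alone for stages 1 and 3, scaled by cosine_sim for stage 2."""
    if stage in ("keyword", "fuzzy"):
        return base_weight
    if stage == "tfidf":
        return cosine_sim * base_weight
    raise ValueError(f"unknown stage: {stage!r}")
```

As a worked case, a TF-IDF hit with cosine_sim 0.8 on a drug whose base_weight is 0.70 scores 0.8 × 0.70 = 0.56, while a keyword hit on a 0.95 drug keeps the full 0.95.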
AUTO ≥ 70%
Ready to use, no review needed
REVIEW 40% to < 70%
Should be checked by a pharmacist or coder, who may choose an alternate code
MANUAL < 40%
Must be coded manually, or the drug must be added to the KB
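The three tiers reduce to two threshold checks. A minimal sketch, assuming confidence is held as a fraction between 0 and 1 and that a score of exactly 70% counts as AUTO; the function name is hypothetical.

```python
def triage(confidence):
    """Map a confidence score in [0, 1] to its review tier."""
    if confidence >= 0.70:
        return "AUTO"
    if confidence >= 0.40:
        return "REVIEW"
    return "MANUAL"
```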
📚 Knowledge Base Coverage: 165 drugs · 42 Thai aliases
Cardiovascular (31)
atorvastatin → E78.0 +1
simvastatin → E78.0 +1
rosuvastatin → E78.0
lovastatin → E78.0
amlodipine → I10 +1
nifedipine → I10 +1
lisinopril → I10 +1
enalapril → I10 +1
ramipril → I10
losartan → I10 +1
valsartan → I10
carvedilol → I50.0 +1
bisoprolol → I50.0 +1
metoprolol → I10 +1
propranolol → I10 +1
digoxin → I50.0 +1
furosemide → I50.0 +1
spironolactone → I50.0 +1
warfarin → I48.9 +2
aspirin → I25.1 +1
clopidogrel → I25.1 +1
nitroglycerin → I20.9
isosorbide → I20.9 +1
amiodarone → I48.9 +1
ivabradine → I50.0
sacubitril → I50.0
heparin → I82.9 +1
enoxaparin → I82.9
rivaroxaban → I48.9 +1
apixaban → I48.9
dabigatran → I48.9
Respiratory (23)
salbutamol → J45.9 +1
albuterol → J45.9
terbutaline → J45.9
formoterol → J45.9 +1
salmeterol → J45.9 +1
tiotropium → J44.9
ipratropium → J44.9 +1
budesonide → J45.9 +1
fluticasone → J45.9
beclomethasone → J45.9
montelukast → J45.9
theophylline → J45.9 +1
acetylcysteine → J44.9 +1
ambroxol → J22
amoxicillin → J06.9 +2
amoxiclav → J18.9 +1
cefuroxime → J18.9
ceftriaxone → J18.9 +1
azithromycin → J18.9 +1
clarithromycin → J18.9
erythromycin → J18.9
doxycycline → J18.9 +1
levofloxacin → J18.9 +1
Diabetes (17)
metformin → E11.9
glipizide → E11.9
glibenclamide → E11.9
gliclazide → E11.9
glimepiride → E11.9
sitagliptin → E11.9
vildagliptin → E11.9
saxagliptin → E11.9
empagliflozin → E11.9 +1
dapagliflozin → E11.9
canagliflozin → E11.9
pioglitazone → E11.9
acarbose → E11.9
insulin → E10.9 +1
liraglutide → E11.9
semaglutide → E11.9
dulaglutide → E11.9
Neuro / Psych (17)
phenytoin → G40.9
valproate → G40.9 +1
carbamazepine → G40.9 +1
levetiracetam → G40.9
clonazepam → G40.9 +1
diazepam → F41.9 +1
alprazolam → F41.9
sertraline → F32.9
fluoxetine → F32.9
escitalopram → F32.9 +1
amitriptyline → F32.9 +1
risperidone → F20.9
olanzapine → F20.9 +1
haloperidol → F20.9
levodopa → G20
donepezil → G30.9
memantine → G30.9
Other (12)
metronidazole → A06.0 +1
paracetamol → R50.9 +1
acetaminophen → R50.9
hydrocortisone → E27.4
fludrocortisone → E27.1
tamsulosin → N40
finasteride → N40
sildenafil → N52.9 +1
imatinib → C91.1 +1
vitamin c → E50
zinc → E60
primaquine → B51.9
GI / Gastro (12)
omeprazole → K25.9 +1
esomeprazole → K21.0 +1
pantoprazole → K21.0 +1
lansoprazole → K21.0
ranitidine → K21.0 +1
famotidine → K21.0
domperidone → K21.0 +1
metoclopramide → K21.0 +1
ondansetron → R11
bisacodyl → K59.0
lactulose → K59.0 +1
sucralfate → K25.9
Pain & Rheum (12)
ibuprofen → M79.3 +1
diclofenac → M79.3 +1
naproxen → M79.3
celecoxib → M05.9 +1
tramadol → R52.9
morphine → R52.1
codeine → R52.9 +1
pregabalin → G62.9 +1
gabapentin → G62.9 +1
prednisolone → M05.9 +1
allopurinol → M10.9
colchicine → M10.9
Haematology / HIV (12)
ferrous sulfate → D50.9
ferrous → D50.9
folic acid → D52.9
vitamin b12 → D51.9
hydroxocobalamin → D51.9
erythropoietin → D63.1
tenofovir → B20
lamivudine → B20 +1
efavirenz → B20
lopinavir → B20
atazanavir → B20
dolutegravir → B20
Infection (9)
cephalexin → L02.9 +1
cefazolin → L02.9
ciprofloxacin → N39.0 +1
fluconazole → B37.9
cotrimoxazole → N39.0 +1
vancomycin → A41.9 +1
meropenem → A41.9
piperacillin → A41.9
loperamide → A09 +1
Tropical / Parasitic (7)
chloroquine → B54
artemether → B54
artesunate → B54
albendazole → B77.9 +1
mebendazole → B83.9
praziquantel → B65.9
ivermectin → B74.0 +1
Oncology (6)
dexamethasone → C80.9
tamoxifen → C50.9
anastrozole → C50.9
letrozole → C50.9
cyclophosphamide → C80.9
methotrexate → C80.9 +1
Endocrine (5)
levothyroxine → E03.9
methimazole → E05.9
propylthiouracil → E05.9
calcium carbonate → E55.9 +1
vitamin d → E55.9
Renal (2)
calcitriol → N18.9 +1
sodium bicarbonate → N18.9
ℹ Add drugs by extending DRUG_ICD_KB and THAI_DRUG_ALIASES in drug_mapper.py; no model restart is needed