How It Works
Where the data comes from · how the pipeline works · what confidence means
Where does the data come from?
WHO ATC Classification
Anatomical Therapeutic Chemical (ATC): the WHO standard for classifying drugs, e.g.
A10BA02 = Metformin (Antidiabetics).
The system uses ATC as a backbone to map drugs → therapeutic class → MDC (Major Diagnostic Category).
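As a minimal sketch of how an ATC code encodes that hierarchy, the function below slices a 7-character code into its five standard levels. The function name and dictionary keys are illustrative, not taken from drug_mapper.py, and the later class → MDC step is not shown because its table is not described here.

```python
from __future__ import annotations


def atc_levels(code: str) -> dict[str, str]:
    """Split a 7-character ATC code into its five hierarchical levels."""
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return {
        "anatomical_main_group": code[:1],     # 1 letter, e.g. A
        "therapeutic_subgroup": code[:3],      # + 2 digits, e.g. A10
        "pharmacological_subgroup": code[:4],  # + 1 letter, e.g. A10B
        "chemical_subgroup": code[:5],         # + 1 letter, e.g. A10BA
        "chemical_substance": code[:7],        # + 2 digits, e.g. A10BA02
    }


levels = atc_levels("A10BA02")
print(levels["therapeutic_subgroup"])  # A10
```

Grouping drugs by the 3-character prefix is one plausible way to reach a therapeutic class before mapping further.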
Drug-Disease Crosswalk
Manually curated knowledge base, referencing:
• Thai Drug Database (Health Systems Research Institute)
• Indications from prescribing information
• Thai NHSO approved indications
• DrugBank / RxNorm crosswalk
Each drug has multiple indications with confidence weights.
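One way to picture "multiple indications with confidence weights" is a mapping from drug name to a list of (code, weight) pairs. The shape below is an assumption, not the actual layout of DRUG_ICD_KB, and the alternate codes and their weights are placeholders chosen only for illustration.

```python
from __future__ import annotations

# Hypothetical KB shape: drug name -> list of (ICD-10-TM code, base_weight).
# Alternate codes and weights here are placeholders, not the tool's data.
EXAMPLE_KB: dict[str, list[tuple[str, float]]] = {
    "levothyroxine": [("E03.9", 0.95)],
    "amoxicillin": [("J06.9", 0.70), ("J18.9", 0.50), ("N39.0", 0.40)],
}


def ranked_indications(drug: str, kb: dict[str, list[tuple[str, float]]]):
    """Return the highest-weighted indication plus the remaining alternates."""
    entries = sorted(kb.get(drug, []), key=lambda entry: entry[1], reverse=True)
    if not entries:
        return None
    primary, *alternates = entries
    return {"primary": primary, "alternates": alternates}


result = ranked_indications("amoxicillin", EXAMPLE_KB)
print(result["primary"], len(result["alternates"]))
```

Keeping the alternates alongside the primary code lets a reviewer see the other candidate codes without a second lookup.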
ICD-10-TM (Thai Modification)
ICD-10-TM is the Thai Modification of the WHO ICD-10, maintained by the
Department of Medical Services, Ministry of Public Health. Used for:
• National Health Security Office (NHSO) UC Scheme reimbursement claims
• TDRG v6.3 grouper
• 43-file standard dataset
Available for download at
icd10.thaicasemix.com
How does the pipeline work?
Stage 1
keyword
Direct keyword lookup
Drug names are normalized first (stripping dosage, dosage form, salt form), then exact-matched against the knowledge base.
Thai alias, e.g. "กลูโคฟาจ" → "metformin" → E11.9
Fastest · deterministic · covers ~70% of common drugs
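The normalize-then-lookup step could look roughly like the sketch below. The regular expression, the dosage-form and salt word lists, and the two small tables are assumptions made for illustration; the real tables are DRUG_ICD_KB and THAI_DRUG_ALIASES in drug_mapper.py, whose exact structure is not shown here.

```python
import re

# Illustrative subsets only (hypothetical names and contents).
KB_CODES = {"metformin": "E11.9", "amlodipine": "I10"}
THAI_ALIASES = {"กลูโคฟาจ": "metformin"}

# Strengths such as "500mg", "5 mg", "2.5 ml".
_DOSE = re.compile(r"\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml|iu)\b")
_FORMS = {"tablet", "tab", "capsule", "cap", "syrup", "injection", "inj"}
_SALTS = {"besylate", "hydrochloride", "hcl", "maleate", "mesylate"}


def normalize(raw: str) -> str:
    """Lowercase, then drop dosage, dosage-form, and salt-form tokens."""
    text = _DOSE.sub(" ", raw.strip().lower())
    tokens = re.split(r"[\s,()/]+", text)
    kept = [t for t in tokens if t and t not in _FORMS and t not in _SALTS]
    return " ".join(kept)


def keyword_lookup(raw: str):
    """Resolve a Thai alias if present, then exact-match against the KB."""
    name = normalize(raw)
    name = THAI_ALIASES.get(name, name)
    return KB_CODES.get(name)  # None means "fall through to the next stage"


print(keyword_lookup("Amlodipine Besylate 5 mg Tablet"))  # I10
print(keyword_lookup("กลูโคฟาจ 500 mg"))                   # E11.9
```

Because every step is a dictionary lookup on a normalized key, the stage stays deterministic: the same input always yields the same code or the same miss.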
↓ if confidence < threshold
Stage 2
TF-IDF
TF-IDF Cosine Similarity (char n-gram)
Uses character n-grams (2–4 chars) instead of word tokenization, which handles:
• Brand name: "Cefurox" → cefuroxime
• Salt form: "Amlodipine Besylate" → amlodipine
• Abbreviation: "co-trimox" → cotrimoxazole
Names are vectorized with TF-IDF, then compared by cosine similarity against the drug corpus.
Score = cosine_sim × base_weight (from the KB)
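To make the scoring concrete, here is a dependency-free sketch of character n-gram TF-IDF with cosine similarity. It is not the tool's own code (which may well use a library vectorizer); the corpus, the weights, and all function names are hypothetical, and the smoothed IDF formula is one common choice among several.

```python
import math
from collections import Counter

CORPUS = ["cefuroxime", "ceftriaxone", "amlodipine", "cotrimoxazole"]
BASE_WEIGHT = {name: 0.70 for name in CORPUS}  # placeholder weights


def char_ngrams(text, n_min=2, n_max=4):
    """All character n-grams of length n_min..n_max."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def build_idf(corpus):
    """Smoothed inverse document frequency for every n-gram in the corpus."""
    doc_freq = Counter()
    for name in corpus:
        doc_freq.update(set(char_ngrams(name)))
    n_docs = len(corpus)
    return {g: math.log((1 + n_docs) / (1 + df)) + 1.0
            for g, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; n-grams unseen in the corpus are ignored."""
    counts = Counter(g for g in char_ngrams(text) if g in idf)
    return {g: tf * idf[g] for g, tf in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, corpus, base_weight):
    """Return (name, cosine_sim, score) with score = cosine_sim * base_weight."""
    idf = build_idf(corpus)
    q = tfidf_vector(query, idf)
    sim, name = max((cosine(q, tfidf_vector(c, idf)), c) for c in corpus)
    return name, sim, sim * base_weight[name]


name, sim, score = best_match("Cefurox", CORPUS, BASE_WEIGHT)
print(name, round(sim, 2), round(score, 2))
```

Character n-grams let a truncated brand name share most of its features with the full generic name, which word-level tokens would not.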
↓ if confidence < 40%
Stage 3
fuzzy
Fuzzy String Match (difflib)
Uses difflib.get_close_matches from the Python standard library, which is built on SequenceMatcher
Handles bad typos: "Metofrmin" → metformin, "Furosemid" → furosemide
Slower than TF-IDF, but catches edge cases that the n-gram stage misses
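A small sketch of this stage using only the standard library follows. The candidate list and the fuzzy_match helper are illustrative; the 0.6 cutoff is difflib's default, and the value the tool actually uses is not stated here.

```python
import difflib

# Illustrative candidate names; the real list would come from the KB.
KNOWN_DRUGS = ["metformin", "metoprolol", "furosemide", "fluconazole"]


def fuzzy_match(name, candidates=KNOWN_DRUGS, cutoff=0.6):
    """Return (best_candidate, similarity_ratio), or None if nothing is close."""
    query = name.lower()
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    ratio = difflib.SequenceMatcher(None, query, best).ratio()
    return best, ratio


print(fuzzy_match("Metofrmin"))
print(fuzzy_match("Furosemid"))
print(fuzzy_match("xyzxyz"))  # None
```

Transposed letters break many n-grams at once, while SequenceMatcher still finds the long common runs on either side of the swap.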
↓ if confidence is still < 40% and --model is specified
Stage 4
LLM
Mistral-7B Fallback (optional)
Calls mistral-7b-instruct-v0.2.Q4_K_M.gguf via llama-cpp-python
Prompt asks the LLM: "What disease is this drug used to treat? Answer as ICD-10-TM JSON"
Returns top-3 candidates with reasoning.
Used only as a last resort · requires the model file (~4 GB) · CPU-only, so slower
Warning: the LLM may hallucinate ICD codes that don't exist; always validate against the ICD-10-TM reference table
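The validation step can be sketched as a filter over the model's JSON reply. The reply shape assumed here (a list of objects with "code" and "reasoning" keys) and the tiny code set are hypothetical; a real run would load the full ICD-10-TM reference table instead.

```python
import json

# Stand-in for the ICD-10-TM reference table (hypothetical subset).
VALID_CODES = {"E11.9", "I10", "J45.9"}


def validate_candidates(llm_json, valid_codes=VALID_CODES):
    """Keep only candidates whose code exists in the reference table."""
    try:
        candidates = json.loads(llm_json)
    except json.JSONDecodeError:
        return []  # unparseable reply: treat as no usable answer
    if not isinstance(candidates, list):
        return []
    kept = []
    for item in candidates:
        if not isinstance(item, dict):
            continue
        code = str(item.get("code", "")).strip().upper()
        if code in valid_codes:
            kept.append({**item, "code": code})
    return kept


reply = json.dumps([
    {"code": "e11.9", "reasoning": "type 2 diabetes"},
    {"code": "E99.99", "reasoning": "not a real code"},
])
print(validate_candidates(reply))
```

Dropping unknown codes silently is one option; logging them for review is another reasonable choice.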
What is the Confidence Score?
Confidence is not a statistical probability; it is a composite score derived as follows:
Stage 1 & 3
confidence = base_weight
base_weight is defined in the KB according to specificity:
• 0.95 = drugs with mostly a single indication (levothyroxine → hypothyroidism)
• 0.70 = drugs covering multiple disease groups (amoxicillin → multiple infections)
Stage 2 (TF-IDF)
confidence = cosine_sim × base_weight
cosine_sim = similarity of the char n-gram vectors
An exact match gives cosine_sim = 1.0, so the drug gets its full base_weight
A partial match reduces the score proportionally
AUTO ≥ 70%
Ready to use, no review needed
REVIEW 40% to < 70%
Should be checked by a pharmacist / coder for alternative codes
MANUAL < 40%
Must be coded manually, or add the drug to the KB
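The scoring rule and the three tiers can be written out as a short sketch. The function names are illustrative, and placing exactly 40% in REVIEW is an assumption about the boundary.

```python
def stage2_confidence(cosine_sim, base_weight):
    """Stage 2 score: similarity scaled by the KB's base weight."""
    return cosine_sim * base_weight


def review_tier(confidence):
    """Map a confidence in [0, 1] to AUTO / REVIEW / MANUAL."""
    if confidence >= 0.70:
        return "AUTO"
    if confidence >= 0.40:  # assumed: exactly 40% counts as REVIEW
        return "REVIEW"
    return "MANUAL"


# Exact match on a single-indication drug: 1.0 x 0.95 = 0.95
print(review_tier(stage2_confidence(1.0, 0.95)))  # AUTO
# Partial match on a multi-indication drug: 0.8 x 0.70 = 0.56
print(review_tier(stage2_confidence(0.8, 0.70)))  # REVIEW
# Weak match: 0.5 x 0.70 = 0.35
print(review_tier(stage2_confidence(0.5, 0.70)))  # MANUAL
```

Note that a drug with base_weight 0.70 can reach AUTO in Stage 2 only on an exact match, since any lower similarity pulls the product under 70%.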
Knowledge Base Coverage
165 drugs · 42 Thai aliases
atorvastatin → E78.0 +1
simvastatin → E78.0 +1
rosuvastatin → E78.0
lovastatin → E78.0
amlodipine → I10 +1
nifedipine → I10 +1
lisinopril → I10 +1
enalapril → I10 +1
ramipril → I10
losartan → I10 +1
valsartan → I10
carvedilol → I50.0 +1
bisoprolol → I50.0 +1
metoprolol → I10 +1
propranolol → I10 +1
digoxin → I50.0 +1
furosemide → I50.0 +1
spironolactone → I50.0 +1
warfarin → I48.9 +2
aspirin → I25.1 +1
clopidogrel → I25.1 +1
nitroglycerin → I20.9
isosorbide → I20.9 +1
amiodarone → I48.9 +1
ivabradine → I50.0
sacubitril → I50.0
heparin → I82.9 +1
enoxaparin → I82.9
rivaroxaban → I48.9 +1
apixaban → I48.9
dabigatran → I48.9
salbutamol → J45.9 +1
albuterol → J45.9
terbutaline → J45.9
formoterol → J45.9 +1
salmeterol → J45.9 +1
tiotropium → J44.9
ipratropium → J44.9 +1
budesonide → J45.9 +1
fluticasone → J45.9
beclomethasone → J45.9
montelukast → J45.9
theophylline → J45.9 +1
acetylcysteine → J44.9 +1
ambroxol → J22
amoxicillin → J06.9 +2
amoxiclav → J18.9 +1
cefuroxime → J18.9
ceftriaxone → J18.9 +1
azithromycin → J18.9 +1
clarithromycin → J18.9
erythromycin → J18.9
doxycycline → J18.9 +1
levofloxacin → J18.9 +1
metformin → E11.9
glipizide → E11.9
glibenclamide → E11.9
gliclazide → E11.9
glimepiride → E11.9
sitagliptin → E11.9
vildagliptin → E11.9
saxagliptin → E11.9
empagliflozin → E11.9 +1
dapagliflozin → E11.9
canagliflozin → E11.9
pioglitazone → E11.9
acarbose → E11.9
insulin → E10.9 +1
liraglutide → E11.9
semaglutide → E11.9
dulaglutide → E11.9
phenytoin → G40.9
valproate → G40.9 +1
carbamazepine → G40.9 +1
levetiracetam → G40.9
clonazepam → G40.9 +1
diazepam → F41.9 +1
alprazolam → F41.9
sertraline → F32.9
fluoxetine → F32.9
escitalopram → F32.9 +1
amitriptyline → F32.9 +1
risperidone → F20.9
olanzapine → F20.9 +1
haloperidol → F20.9
levodopa → G20
donepezil → G30.9
memantine → G30.9
metronidazole → A06.0 +1
paracetamol → R50.9 +1
acetaminophen → R50.9
hydrocortisone → E27.4
fludrocortisone → E27.1
tamsulosin → N40
finasteride → N40
sildenafil → N52.9 +1
imatinib → C91.1 +1
vitamin c → E50
zinc → E60
primaquine → B51.9
omeprazole → K25.9 +1
esomeprazole → K21.0 +1
pantoprazole → K21.0 +1
lansoprazole → K21.0
ranitidine → K21.0 +1
famotidine → K21.0
domperidone → K21.0 +1
metoclopramide → K21.0 +1
ondansetron → R11
bisacodyl → K59.0
lactulose → K59.0 +1
sucralfate → K25.9
ibuprofen → M79.3 +1
diclofenac → M79.3 +1
naproxen → M79.3
celecoxib → M05.9 +1
tramadol → R52.9
morphine → R52.1
codeine → R52.9 +1
pregabalin → G62.9 +1
gabapentin → G62.9 +1
prednisolone → M05.9 +1
allopurinol → M10.9
colchicine → M10.9
ferrous sulfate → D50.9
ferrous → D50.9
folic acid → D52.9
vitamin b12 → D51.9
hydroxocobalamin → D51.9
erythropoietin → D63.1
tenofovir → B20
lamivudine → B20 +1
efavirenz → B20
lopinavir → B20
atazanavir → B20
dolutegravir → B20
cephalexin → L02.9 +1
cefazolin → L02.9
ciprofloxacin → N39.0 +1
fluconazole → B37.9
cotrimoxazole → N39.0 +1
vancomycin → A41.9 +1
meropenem → A41.9
piperacillin → A41.9
loperamide → A09 +1
chloroquine → B54
artemether → B54
artesunate → B54
albendazole → B77.9 +1
mebendazole → B83.9
praziquantel → B65.9
ivermectin → B74.0 +1
dexamethasone → C80.9
tamoxifen → C50.9
anastrozole → C50.9
letrozole → C50.9
cyclophosphamide → C80.9
methotrexate → C80.9 +1
levothyroxine → E03.9
methimazole → E05.9
propylthiouracil → E05.9
calcium carbonate → E55.9 +1
vitamin d → E55.9
calcitriol → N18.9 +1
sodium bicarbonate → N18.9
Note: add drugs by extending DRUG_ICD_KB and THAI_DRUG_ALIASES in drug_mapper.py; no model restart is needed