LLM in Medicine

Large Language Models in Medicine

A 20-minute introduction

  • 🧠 What are LLMs and how do they work?
  • 🏥 Applications in clinical practice and drug development
  • 🔬 From general models to specialized AI (pathology, oncology)
  • ⚠️ Risks, limitations, and best practices

What is a Large Language Model?

From text prediction to clinical reasoning

  • 🤖 LLM = neural network trained to predict the next word
  • 📚 Trained on billions of texts: books, papers, websites, code
  • 🧠 Emergent abilities: reasoning, summarization, translation, coding
  • 💬 Examples: ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), LLaMA (Meta)
  • 📈 Rapid evolution: GPT-3 (2020) → GPT-4 (2023) → new models every few months

How Do LLMs Work? — Overview

The journey from text to response

"The patient has elevated liver enzymes"
          ↓
   1. TOKENIZATION — break into pieces
          ↓
   2. EMBEDDINGS — convert to numbers
          ↓
   3. SELF-ATTENTION — understand context
          ↓
   4. GENERATION — predict next word
          ↓
"suggesting possible hepatocellular injury..."

Step 1: Tokenization

Breaking text into digestible pieces

  • 🧩 Tokens = small chunks of text (words, parts of words, punctuation)
  • 📏 Computers can’t read words directly — they need standardized units
  • 🔤 Common words → single token; rare words → multiple tokens

Example:

Text               Tokens
“heart”            [heart] (1 token)
“cardiomyopathy”   [card, io, my, opathy] (4 tokens)
“Patient stable”   [Patient, stable] (2 tokens)
“ALT: 85 U/L”      [ALT, :, 85, U, /, L] (6 tokens)
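The idea above can be sketched as a toy greedy subword tokenizer. This is illustrative only: the vocabulary here is invented, and real tokenizers (e.g. BPE) learn their vocabularies from data.

```python
# Toy greedy subword tokenizer (illustrative; the vocabulary is invented).
VOCAB = {"heart", "card", "io", "my", "opathy", "patient", "stable"}

def tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary piece from the left."""
    word = word.lower()
    tokens = []
    while word:
        for end in range(len(word), 0, -1):
            piece = word[:end]
            # Fall back to a single character if no piece matches.
            if piece in VOCAB or end == 1:
                tokens.append(piece)
                word = word[end:]
                break
    return tokens

print(tokenize("heart"))           # ['heart']
print(tokenize("cardiomyopathy"))  # ['card', 'io', 'my', 'opathy']
```

Common words survive as one piece; rare medical terms get split into several, exactly as in the table above.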

Step 2: Embeddings

Turning words into numbers with meaning

  • 🔢 Computers only understand numbers — not words
  • 🧮 Embedding = a list of numbers (vector) representing a word’s meaning
  • 📐 Similar words → similar vectors → close in “meaning space”

Intuition:

"liver"  → [0.2, -0.5, 0.8, 0.1, ...]  ←─┐
"hepatic"→ [0.3, -0.4, 0.7, 0.2, ...]  ←─┤ Similar!
"kidney" → [0.1, -0.3, 0.6, 0.0, ...]  ←─┘ Related organ
"car"    → [-0.8, 0.9, -0.2, 0.5, ...] ← Very different!

Step 3: Self-Attention

How each word “looks at” every other word

  • 👁️ Self-attention = the mechanism that captures context
  • 🔗 Each word can “attend to” (focus on) other words in the sentence
  • 🎯 This resolves ambiguities and builds understanding

Medical example:

“The liver is enlarged and shows steatosis. It may indicate NAFLD.”

What does “it” refer to?

  • A human knows: “it” = the enlarged, steatotic liver (the finding)
  • Self-attention learns: when processing “it”, focus strongly on “liver”

Self-Attention: A Closer Look

How the model decides what’s important

Sentence: “The patient with chest pain was diagnosed with MI and treated with aspirin.”

When processing “MI”, the model attends to:

Word      Attention Weight   Why?
chest     High               Anatomical location of MI
pain      High               Key symptom
patient   Medium             Subject of sentence
aspirin   High               Standard treatment
with      Low                Function word
the       Low                Function word

The model learns these patterns from billions of examples.
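A minimal sketch of the computation behind those weights: score a query token against every key token, scale, then normalize with softmax. The vectors are invented for illustration; real models use learned query/key/value projections and many attention heads.

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores of one query against all keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Toy vectors: "MI" as the query; "chest", "pain", "the" as keys.
mi    = [0.9, 0.8, 0.1]
chest = [0.8, 0.7, 0.0]
pain  = [0.7, 0.9, 0.1]
the   = [0.0, 0.1, 0.9]

weights = attention_weights(mi, [chest, pain, the])
# Content words ("chest", "pain") receive more weight than
# the function word "the".
```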

Step 4: Generation

Predicting one word at a time

  • 🔁 LLMs generate text one token at a time
  • 🎲 At each step: “Given everything so far, what’s the most likely next word?”
  • 📊 The model outputs probabilities for every possible next token

Example generation:

Input:  "The ECG shows ST"
         ↓
Model:  elevation (85%), depression (10%), segment (3%), ...
         ↓
Output: "The ECG shows ST elevation"
         ↓
Model:  in (70%), suggesting (15%), consistent (10%), ...
         ↓
Output: "The ECG shows ST elevation in leads..."

Pretraining and Fine-tuning

How models learn general and specialized knowledge

  • 📚 Pretraining: learn from billions of texts (books, web, papers)
    • Goal: understand language, grammar, facts, reasoning patterns
    • Result: a “generalist” model (GPT-4, Claude, LLaMA)
  • 🎯 Fine-tuning: additional training on specialized data
    • Goal: adapt to specific domain or task
    • Result: a “specialist” model (Med-PaLM, BioMistral, Owkin Zero)
  • 💬 RLHF (Reinforcement Learning from Human Feedback):
    • Humans rate responses → model learns preferences
    • Makes models more helpful, harmless, and honest

The Context Window

How much can the model “see” at once?

  • 📋 Context window = maximum tokens the model can process at once
  • 🔁 How chatbots work: they re-read the entire conversation every time you send a message
  • ⚠️ When the window fills up → older messages are forgotten
Model (2025)        Context Window   ~Pages
GPT-5.1             256K tokens      ~200
Claude Sonnet 4.5   200K tokens      ~150
Gemini 3            2M tokens        ~1500
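Why chatbots "forget" can be sketched as a truncation step: the full conversation is re-sent on every turn, and the oldest messages are dropped once the token budget is exceeded. Token counts below are rough word counts; real APIs count subword tokens.

```python
# Sketch of context-window truncation (word counts stand in for tokens).
def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = len(msg.split())
        if used + cost > max_tokens:
            break                       # older messages are forgotten
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["msg one two", "msg three four", "msg five six"]
print(fit_to_window(history, max_tokens=6))  # keeps only the 2 newest
```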

Dataset Shift: AI Models Are Not Eternal

Why today’s model may fail tomorrow

  • 📅 Knowledge cutoff: model only knows what existed during training
    • Each model has a training date (check documentation)
    • Won’t know new drugs, updated guidelines, recent outbreaks
  • 📊 Dataset shift: the world changes, but the model doesn’t
    • New diseases, evolving treatment protocols, changing resistance patterns
  • 🔄 Model drift: performance degrades over time without updates

Important

An AI model is a snapshot of knowledge at training time. Always verify against current guidelines.

Clinical Applications: Overview

Where LLMs can help clinicians

  • 📝 Documentation: draft reports, discharge summaries, referral letters
  • 🔍 Literature search: summarize papers, synthesize evidence
  • 🩺 Clinical support: differential diagnosis, drug interactions
  • 📊 Data extraction: structure info from free-text notes (ICD coding)
  • 🎓 Education: explain concepts to patients in plain language

Example 1: Discharge Summary

From structured data to readable text

Input prompt:

“Write a discharge summary for: 72-year-old male, admitted for chest pain. Troponin elevated. Coronary angiography showed 90% LAD stenosis. PCI with DES performed. Discharged on aspirin, clopidogrel, atorvastatin, metoprolol.”

LLM output:

“A 72-year-old male was admitted with acute chest pain and elevated troponin levels consistent with NSTEMI. Coronary angiography revealed 90% stenosis of the left anterior descending artery. Successful percutaneous coronary intervention was performed with drug-eluting stent placement. The patient was discharged in stable condition on dual antiplatelet therapy (aspirin 100mg, clopidogrel 75mg), atorvastatin 80mg, and metoprolol 50mg. Follow-up in cardiology clinic in 4 weeks recommended.”

Example 2: Laboratory Interpretation

Pattern recognition in lab panels

Input prompt:

“Interpret: ALT 85 U/L (ref 7-56), AST 72 U/L (ref 8-48), GGT 120 U/L (ref 9-48), ALP normal, Bilirubin normal, Albumin normal.”

LLM output:

“The liver panel shows elevated transaminases (ALT 85, AST 72) with ALT > AST, and elevated GGT (120), with preserved synthetic function (normal albumin) and no cholestasis (normal ALP, bilirubin). This pattern suggests hepatocellular injury with possible etiologies including: NAFLD/NASH (given ALT > AST ratio), medication-induced hepatotoxicity, viral hepatitis, or early alcoholic liver disease. Recommend: hepatitis B/C serology, metabolic panel, medication review, and liver ultrasound.”

Example 3: Differential Diagnosis

Clinical reasoning support

Input prompt:

“55-year-old female with progressive fatigue, weight loss, and hyperpigmentation. Labs show hyponatremia (128 mEq/L), hyperkalemia (5.8 mEq/L), low morning cortisol. Generate differential diagnosis.”

LLM output:

“The clinical picture strongly suggests primary adrenal insufficiency (Addison’s disease):

  • Fatigue, weight loss = non-specific but common
  • Hyperpigmentation = ACTH-driven melanin (pathognomonic)
  • Hyponatremia + hyperkalemia = mineralocorticoid deficiency
  • Low cortisol = confirms axis dysfunction

Differential includes: autoimmune adrenalitis (most common), adrenal metastases, infections (TB, fungal), adrenal hemorrhage. Recommend ACTH stimulation test and adrenal CT.”

From Text to Multimodal AI

Beyond chatbots: specialized medical AI

  • 💬 Text-only LLMs: documentation, literature — limited for diagnosis
  • 🔬 Specialized AI: pathology slides, radiology, oncology predictions
  • 🧬 Multimodal AI: images + genomics + clinical records + labs
    • Discovers patterns invisible to single-modality analysis

Note

The most impactful medical AI combines domain expertise with multimodal data.

AI in Digital Pathology

From slides to molecular insights

  • 🔬 Digital pathology: whole-slide imaging (WSI) enables AI analysis
  • 🧠 Deep learning can identify patterns invisible to the human eye
  • 🎯 Key applications:
    • Cancer detection and grading
    • Biomarker prediction from H&E slides
    • Tumor microenvironment analysis
    • Prognosis prediction
  • Advantage: faster, more consistent, preserves tissue for other tests

Agentic AI in Drug Development

From chatbots to autonomous research agents

  • 🤖 Agentic AI = AI systems that can plan, reason, and execute tasks autonomously
  • 🧬 Applications in pharma:
    • Target identification and validation
    • Biomarker discovery for patient stratification
    • Clinical trial design and optimization
    • Competitive landscape analysis
  • 🔬 Example: Owkin K Pro — agentic AI co-pilot for biopharma
    • Natural language queries on multimodal patient data
    • Access to curated oncology datasets (2,200+ patients, 6 modalities)
    • Results: 70% faster target identification, trial duration reduced by 35%

⚠️ The Hallucination Problem

When AI confidently makes things up

  • 🎭 Hallucination = plausible but false information
  • 🤝 LLMs are trained to sound helpful, not to be truthful
  • 🩺 Clinical examples:
    • Invented drug dosages or interactions
    • Fake citations (non-existent papers, wrong authors)
    • Incorrect reference ranges or guidelines
  • Rule: always verify — especially dosages, references, and protocols

Warning

Plausible ≠ Correct. A fluent, confident answer can still be completely wrong.

Other Limitations

What AI cannot do (yet)

  • ⚖️ Bias: reproduces biases in training data (demographic, geographic)
  • 🔒 Privacy: commercial APIs may retain prompts — never input PHI
  • 🧮 Reasoning limits: still unreliable for complex multi-step calculations
  • 📋 Accountability: who is responsible for an AI-assisted error?
  • 🏥 Regulatory status: most tools are not FDA/CE approved for diagnosis

Best Practices for Clinical Use

How to use AI responsibly

✅ Do                                         ❌ Don’t
Use as draft generator                        Trust blindly without review
Verify all factual claims                     Input PHI on public APIs
Use validated tools for clinical decisions    Use ChatGPT for diagnosis
Keep human in the loop                        Assume AI “knows” medicine
Document AI-assisted decisions                Cite LLM outputs as sources

Key Parameters to Know

Controlling LLM behavior

  • 🌡️ Temperature: controls randomness/creativity
    • Low (0.0-0.3): consistent, deterministic — best for clinical use
    • High (0.7-1.0): creative, varied — for brainstorming only
  • ✂️ Max tokens: limit output length (controls cost and verbosity)
  • 🎯 System prompt: define role, constraints, and behavior
    • Example: “You are a clinical assistant. Always cite sources. Never provide specific drug doses without verification.”

Clinical tip: For medical tasks, always use low temperature. “Interpret this ECG” at temperature 0.1 gives consistent, reproducible output. At 0.8, you get creative but potentially dangerous variations.
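How temperature reshapes the next-token distribution can be shown with a temperature-scaled softmax. The logits below are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T before softmax: low T sharpens, high T flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

# Invented logits, e.g. for "elevation", "depression", "segment".
logits = [3.0, 1.0, 0.5]

low  = softmax_with_temperature(logits, 0.1)   # near-deterministic
high = softmax_with_temperature(logits, 1.0)   # more spread out
# At T=0.1 nearly all probability mass sits on the top token;
# at T=1.0 the alternatives keep meaningful probability.
```

This is why low temperature gives reproducible clinical output: the top token is chosen almost every time.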

The Future: Multimodal + Agentic

Where medical AI is heading

  • 🖼️ Multimodal integration: text + images + genomics + EMR
  • 🤖 Agentic systems: autonomous research and analysis
  • 🔗 Foundation models for biology: pretrained on molecular data
  • 🏥 Clinical integration: embedded in workflows (not separate tools)
  • ⚖️ Regulation evolving: EU AI Act, FDA guidance on AI/ML devices

Take-Home Messages

What to remember from this seminar

  1. 🧠 LLMs are powerful for documentation, literature synthesis, and education
  2. ⚠️ They hallucinate — always verify clinical information
  3. 🔬 Specialized AI (pathology, oncology) outperforms general chatbots
  4. 🤖 Agentic AI (like K Pro) is transforming drug development
  5. 👨‍⚕️ Keep human oversight — AI assists, doesn’t replace expertise

Questions?

Thank you for your attention

📧 Contact: luca.vedovelli@ubep.unipd.it

🔗 Resources: