LLM in Medicine

Large Language Models in Medicine

A 20-minute introduction

  • 🧠 What are LLMs and how do they work?
  • 🏥 Applications in clinical practice and drug development
  • 🔬 From general models to specialized AI (pathology, oncology)
  • ⚠️ Risks, limitations, and best practices

What is a Large Language Model?

From text prediction to clinical reasoning

  • 🤖 LLM = neural network trained to predict the next word
  • 📚 Trained on billions of texts: books, papers, websites, code
  • 🧠 Emergent abilities: reasoning, summarization, translation, coding
  • 💬 Examples: ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), LLaMA (Meta)
  • 📈 Rapid evolution: GPT-3 (2020) → GPT-4 (2023) → new models every few months

How Do LLMs Work? — Overview

The journey from text to response

"The patient has elevated liver enzymes"
          ↓
   1. TOKENIZATION — break into pieces
          ↓
   2. EMBEDDINGS — convert to numbers
          ↓
   3. SELF-ATTENTION — understand context
          ↓
   4. GENERATION — predict next word
          ↓
"suggesting possible hepatocellular injury..."

Step 1: Tokenization

Breaking text into digestible pieces

  • 🧩 Tokens = small chunks of text (words, parts of words, punctuation)
  • 📏 Computers can’t read words directly — they need standardized units
  • 🔤 Common words → single token; rare words → multiple tokens

Example:

Text               Tokens
“heart”            [heart] (1 token)
“cardiomyopathy”   [card, io, my, opathy] (4 tokens)
“Patient stable”   [Patient, stable] (2 tokens)
“ALT: 85 U/L”      [ALT, :, 85, U, /, L] (6 tokens)
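The idea above can be sketched as a toy greedy subword tokenizer. This is illustrative only: the vocabulary here is invented, and real tokenizers (e.g. BPE) learn their vocabularies from data.

```python
# Toy greedy subword tokenizer (illustrative; the vocabulary is invented).
VOCAB = {"heart", "card", "io", "my", "opathy", "patient", "stable"}

def tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary piece from the left."""
    word = word.lower()
    tokens = []
    while word:
        for end in range(len(word), 0, -1):
            piece = word[:end]
            # Fall back to a single character if no piece matches.
            if piece in VOCAB or end == 1:
                tokens.append(piece)
                word = word[end:]
                break
    return tokens

print(tokenize("heart"))           # ['heart']
print(tokenize("cardiomyopathy"))  # ['card', 'io', 'my', 'opathy']
```

Common words survive as one piece; rare medical terms get split into several, exactly as in the table above.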

Step 2: Embeddings

Turning words into numbers with meaning

  • 🔢 Computers only understand numbers — not words
  • 🧮 Embedding = a list of numbers (vector) representing a word’s meaning
  • 📐 Similar words → similar vectors → close in “meaning space”

Intuition:

"liver"  → [0.2, -0.5, 0.8, 0.1, ...]  ←─┐
"hepatic"→ [0.3, -0.4, 0.7, 0.2, ...]  ←─┤ Similar!
"kidney" → [0.1, -0.3, 0.6, 0.0, ...]  ←─┘ Related organ
"car"    → [-0.8, 0.9, -0.2, 0.5, ...] ← Very different!

Step 3: Self-Attention

How each word “looks at” every other word

  • 👁️ Self-attention = the mechanism that captures context
  • 🔗 Each word can “attend to” (focus on) other words in the sentence
  • 🎯 This resolves ambiguities and builds understanding

Medical example:

“The liver is enlarged and shows steatosis. It may indicate NAFLD.”

What does “it” refer to?

  • A human knows: “it” = the enlarged, steatotic liver (the finding)
  • Self-attention learns: when processing “it”, focus strongly on “liver”

Self-Attention: A Closer Look

How the model decides what’s important

Sentence: “The patient with chest pain was diagnosed with MI and treated with aspirin.”

When processing “MI”, the model attends to:

Word      Attention Weight   Why?
chest     High               Anatomical location of MI
pain      High               Key symptom
patient   Medium             Subject of sentence
aspirin   High               Standard treatment
with      Low                Function word
the       Low                Function word

The model learns these patterns from billions of examples.
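A minimal sketch of the computation behind those weights: score a query token against every key token, scale, then normalize with softmax. The vectors are invented for illustration; real models use learned query/key/value projections and many attention heads.

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores of one query against all keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Toy vectors: "MI" as the query; "chest", "pain", "the" as keys.
mi    = [0.9, 0.8, 0.1]
chest = [0.8, 0.7, 0.0]
pain  = [0.7, 0.9, 0.1]
the   = [0.0, 0.1, 0.9]

weights = attention_weights(mi, [chest, pain, the])
# Content words ("chest", "pain") receive more weight than
# the function word "the".
```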

Step 4: Generation

Predicting one word at a time

  • 🔁 LLMs generate text one token at a time
  • 🎲 At each step: “Given everything so far, what’s the most likely next word?”
  • 📊 The model outputs probabilities for every possible next token

Example generation:

Input:  "The ECG shows ST"
         ↓
Model:  elevation (85%), depression (10%), segment (3%), ...
         ↓
Output: "The ECG shows ST elevation"
         ↓
Model:  in (70%), suggesting (15%), consistent (10%), ...
         ↓
Output: "The ECG shows ST elevation in leads..."

Pretraining and Fine-tuning

How models learn general and specialized knowledge

  • 📚 Pretraining: learn from billions of texts (books, web, papers)
    • Goal: understand language, grammar, facts, reasoning patterns
    • Result: a “generalist” model (GPT-4, Claude, LLaMA)
  • 🎯 Fine-tuning: additional training on specialized data
    • Goal: adapt to specific domain or task
    • Result: a “specialist” model (Med-PaLM, BioMistral, Owkin Zero)
  • 💬 RLHF (Reinforcement Learning from Human Feedback):
    • Humans rate responses → model learns preferences
    • Makes models more helpful, harmless, and honest

The Context Window

How much can the model “see” at once?

  • 📋 Context window = maximum tokens the model can process at once
  • 🔁 How chatbots work: they re-read the entire conversation every time you send a message
  • ⚠️ When the window fills up → older messages are forgotten
Model (2025)        Context Window   ~Pages
GPT-5.1             256K tokens      ~200
Claude Sonnet 4.5   200K tokens      ~150
Gemini 3            2M tokens        ~1500
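Why chatbots "forget" can be sketched as a truncation step: the full conversation is re-sent on every turn, and the oldest messages are dropped once the token budget is exceeded. Token counts below are rough word counts; real APIs count subword tokens.

```python
# Sketch of context-window truncation (word counts stand in for tokens).
def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = len(msg.split())
        if used + cost > max_tokens:
            break                       # older messages are forgotten
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["msg one two", "msg three four", "msg five six"]
print(fit_to_window(history, max_tokens=6))  # keeps only the 2 newest
```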

Dataset Shift: AI Models Are Not Eternal

Why today’s model may fail tomorrow

  • 📅 Knowledge cutoff: model only knows what existed during training
    • Each model has a training date (check documentation)
    • Won’t know new drugs, updated guidelines, recent outbreaks
  • 📊 Dataset shift: the world changes, but the model doesn’t
    • New diseases, evolving treatment protocols, changing resistance patterns
  • 🔄 Model drift: performance degrades over time without updates

Important

An AI model is a snapshot of knowledge at training time. Always verify against current guidelines.

Clinical Applications: Overview

Where LLMs can help clinicians

  • 📝 Documentation: draft reports, discharge summaries, referral letters
  • 🔍 Literature search: summarize papers, synthesize evidence
  • 🩺 Clinical support: differential diagnosis, drug interactions
  • 📊 Data extraction: structure info from free-text notes (ICD coding)
  • 🎓 Education: explain concepts to patients in plain language

Example 1: Discharge Summary

From structured data to readable text

Input prompt:

“Write a discharge summary for: 72-year-old male, admitted for chest pain. Troponin elevated. Coronary angiography showed 90% LAD stenosis. PCI with DES performed. Discharged on aspirin, clopidogrel, atorvastatin, metoprolol.”

LLM output:

“A 72-year-old male was admitted with acute chest pain and elevated troponin levels consistent with NSTEMI. Coronary angiography revealed 90% stenosis of the left anterior descending artery. Successful percutaneous coronary intervention was performed with drug-eluting stent placement. The patient was discharged in stable condition on dual antiplatelet therapy (aspirin 100mg, clopidogrel 75mg), atorvastatin 80mg, and metoprolol 50mg. Follow-up in cardiology clinic in 4 weeks recommended.”

Example 2: Laboratory Interpretation

Pattern recognition in lab panels

Input prompt:

“Interpret: ALT 85 U/L (ref 7-56), AST 72 U/L (ref 8-48), GGT 120 U/L (ref 9-48), ALP normal, Bilirubin normal, Albumin normal.”

LLM output:

“The liver panel shows elevated transaminases (ALT 85, AST 72) with ALT > AST, and elevated GGT (120), with preserved synthetic function (normal albumin) and no cholestasis (normal ALP, bilirubin). This pattern suggests hepatocellular injury with possible etiologies including: NAFLD/NASH (given ALT > AST ratio), medication-induced hepatotoxicity, viral hepatitis, or early alcoholic liver disease. Recommend: hepatitis B/C serology, metabolic panel, medication review, and liver ultrasound.”

Example 3: Differential Diagnosis

Clinical reasoning support

Input prompt:

“55-year-old female with progressive fatigue, weight loss, and hyperpigmentation. Labs show hyponatremia (128 mEq/L), hyperkalemia (5.8 mEq/L), low morning cortisol. Generate differential diagnosis.”

LLM output:

“The clinical picture strongly suggests primary adrenal insufficiency (Addison’s disease):

  • Fatigue, weight loss = non-specific but common
  • Hyperpigmentation = ACTH-driven melanin (pathognomonic)
  • Hyponatremia + hyperkalemia = mineralocorticoid deficiency
  • Low cortisol = confirms axis dysfunction

Differential includes: autoimmune adrenalitis (most common), adrenal metastases, infections (TB, fungal), adrenal hemorrhage. Recommend ACTH stimulation test and adrenal CT.”

From Text to Multimodal AI

Beyond chatbots: specialized medical AI

  • 💬 Text-only LLMs: documentation, literature — limited for diagnosis
  • 🔬 Specialized AI: pathology slides, radiology, oncology predictions
  • 🧬 Multimodal AI: images + genomics + clinical records + labs
    • Discovers patterns invisible to single-modality analysis

Note

The most impactful medical AI combines domain expertise with multimodal data.

AI in Digital Pathology

From slides to molecular insights

  • 🔬 Digital pathology: whole-slide imaging (WSI) enables AI analysis
  • 🧠 Deep learning can identify patterns invisible to the human eye
  • 🎯 Key applications:
    • Cancer detection and grading
    • Biomarker prediction from H&E slides
    • Tumor microenvironment analysis
    • Prognosis prediction
  • Advantage: faster, more consistent, preserves tissue for other tests

Agentic AI in Drug Development

From chatbots to autonomous research agents

  • 🤖 Agentic AI = AI systems that can plan, reason, and execute tasks autonomously
  • 🧬 Applications in pharma:
    • Target identification and validation
    • Biomarker discovery for patient stratification
    • Clinical trial design and optimization
    • Competitive landscape analysis
  • 🔬 Example: Owkin K Pro — agentic AI co-pilot for biopharma
    • Natural language queries on multimodal patient data
    • Access to curated oncology datasets (2,200+ patients, 6 modalities)
    • Results: 70% faster target identification, trial duration reduced by 35%

⚠️ The Hallucination Problem

When AI confidently makes things up

  • 🎭 Hallucination = plausible but false information
  • 🤝 LLMs are trained to sound helpful, not to be truthful
  • 🩺 Clinical examples:
    • Invented drug dosages or interactions
    • Fake citations (non-existent papers, wrong authors)
    • Incorrect reference ranges or guidelines
  • Rule: always verify — especially dosages, references, and protocols

Warning

Plausible ≠ Correct. A fluent, confident answer can still be completely wrong.

Other Limitations

What AI cannot do (yet)

  • ⚖️ Bias: reproduces biases in training data (demographic, geographic)
  • 🔒 Privacy: commercial APIs may retain prompts — never input PHI
  • 🧮 Reasoning limits: still unreliable for complex multi-step calculations
  • 📋 Accountability: who is responsible for an AI-assisted error?
  • 🏥 Regulatory status: most tools are not FDA/CE approved for diagnosis

Best Practices for Clinical Use

How to use AI responsibly

✅ Do                                         ❌ Don’t
Use as draft generator                        Trust blindly without review
Verify all factual claims                     Input PHI on public APIs
Use validated tools for clinical decisions    Use ChatGPT for diagnosis
Keep human in the loop                        Assume AI “knows” medicine
Document AI-assisted decisions                Cite LLM outputs as sources

Key Parameters to Know

Controlling LLM behavior

  • 🌡️ Temperature: controls randomness/creativity
    • Low (0.0-0.3): consistent, deterministic — best for clinical use
    • High (0.7-1.0): creative, varied — for brainstorming only
  • ✂️ Max tokens: limit output length (controls cost and verbosity)
  • 🎯 System prompt: define role, constraints, and behavior
    • Example: “You are a clinical assistant. Always cite sources. Never provide specific drug doses without verification.”

Clinical tip: For medical tasks, always use low temperature. “Interpret this ECG” at temperature 0.1 gives consistent, reproducible output. At 0.8, you get creative but potentially dangerous variations.
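How temperature reshapes the next-token distribution can be shown with a temperature-scaled softmax. The logits below are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T before softmax: low T sharpens, high T flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

# Invented logits, e.g. for "elevation", "depression", "segment".
logits = [3.0, 1.0, 0.5]

low  = softmax_with_temperature(logits, 0.1)   # near-deterministic
high = softmax_with_temperature(logits, 1.0)   # more spread out
# At T=0.1 nearly all probability mass sits on the top token;
# at T=1.0 the alternatives keep meaningful probability.
```

This is why low temperature gives reproducible clinical output: the top token is chosen almost every time.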

The Future: Multimodal + Agentic

Where medical AI is heading

  • 🖼️ Multimodal integration: text + images + genomics + EMR
  • 🤖 Agentic systems: autonomous research and analysis
  • 🔗 Foundation models for biology: pretrained on molecular data
  • 🏥 Clinical integration: embedded in workflows (not separate tools)
  • ⚖️ Regulation evolving: EU AI Act, FDA guidance on AI/ML devices

Take-Home Messages

What to remember from this seminar

  1. 🧠 LLMs are powerful for documentation, literature synthesis, and education
  2. ⚠️ They hallucinate — always verify clinical information
  3. 🔬 Specialized AI (pathology, oncology) outperforms general chatbots
  4. 🤖 Agentic AI (like K Pro) is transforming drug development
  5. 👨‍⚕️ Keep human oversight — AI assists, doesn’t replace expertise

Questions?

Thank you for your attention

📧 Contact: luca.vedovelli@ubep.unipd.it

🔗 Resources: