Chapter 04

Digital Sovereignty

Local LLMs, VRAM Optimization, Llama 4, and the Sovereign Software Stack

Running language models locally can reduce exposure to third-party data retention and subpoena risk in some scenarios. Actual privacy posture depends on configuration, hardware, network setup, and threat model. This is not security or legal advice.

GPU Options for Local AI (2026)

GPU Model	VRAM	Bandwidth	Ideal Workload
NVIDIA RTX 3090	24GB GDDR6X	936 GB/s	Budget large VRAM; 7B-30B models
NVIDIA RTX 4090	24GB GDDR6X	1,008 GB/s	Proven baseline; 30B models
NVIDIA RTX 5090	32GB GDDR7	1,792 GB/s	Flagship; 70B models at Q2/IQ2-class quantization (Q4 needs 35GB+)
Mac Studio M3 Ultra	Up to 512GB (Shared)	819 GB/s	Ultra-large models (up to 405B, quantized)

VRAM and Model Parameter Ratios

VRAM Requirements by Quantization

Model Size	VRAM (FP16/Raw)	VRAM (Q4)	VRAM (Q2/IQ2)
7B - 8B	14GB - 16GB	4GB - 5GB	2GB - 3GB
30B - 34B	64GB+	19GB - 20GB	10GB - 12GB
70B	140GB+	35GB - 40GB	20GB - 22GB
405B	810GB+	200GB+	120GB+
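The figures in this table follow from simple arithmetic: weight memory is roughly the parameter count times the bits stored per weight. The sketch below reproduces that estimate. The effective bits-per-weight values (4.5 for Q4, 2.5 for Q2/IQ2) and the 2GB headroom for context and runtime buffers are assumptions chosen for illustration, not measured values; real GGUF files vary by quantization variant.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed headroom fit in the given VRAM."""
    return estimate_weights_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


# Assumed effective bits per weight for each column of the table.
BITS = {"FP16": 16.0, "Q4": 4.5, "Q2/IQ2": 2.5}

if __name__ == "__main__":
    for size in (8, 32, 70, 405):
        cells = ", ".join(
            f"{name}: {estimate_weights_gb(size, bits):.1f}GB"
            for name, bits in BITS.items()
        )
        print(f"{size}B -> {cells}")
```

Under these assumptions a 70B model at Q4 does not fit in 32GB, while the same model at Q2/IQ2 does, which is why quantization level matters as much as raw VRAM when choosing hardware.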

Bandwidth and Throughput

Performance Benchmarks

Setup	VRAM Total	Bandwidth	Speed (32B Q4)	Price (2026)
Single RTX 5090	32GB	1,792 GB/s	61 tok/s	$2,500 - $3,800
Dual RTX 3090 (Used)	48GB	936 GB/s (per card)	30 tok/s	$1,600 - $1,800
Mac Studio M4 Max	128GB	546 GB/s	40-60 tok/s	$3,500 - $5,000
NVIDIA H100 (PCIe)	80GB	2,000 GB/s	150+ tok/s	$25,000+
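Bandwidth and throughput are linked by a rough rule of thumb: during single-stream token generation, each new token requires reading approximately all model weights from memory once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below applies that rule to two rows of the table. The 19GB model size is taken from the low end of the 30B-34B Q4 range above, and the bandwidth-bound assumption ignores compute limits, KV-cache reads, and multi-GPU overhead, so treat the output as a ceiling rather than a prediction.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/second if decoding is purely memory-bandwidth bound."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumption: a 32B model at Q4 occupies about 19GB.
MODEL_GB = 19.0

# (bandwidth in GB/s, speed figure from the table in tok/s)
SETUPS = {
    "Single RTX 5090": (1792.0, 61.0),
    "Dual RTX 3090 (Used)": (936.0, 30.0),
}

if __name__ == "__main__":
    for name, (bandwidth, table_speed) in SETUPS.items():
        ceiling = decode_ceiling_tok_s(bandwidth, MODEL_GB)
        print(f"{name}: ceiling ~{ceiling:.0f} tok/s, "
              f"table figure {table_speed:.0f} tok/s "
              f"({table_speed / ceiling:.0%} of ceiling)")
```

With these inputs both rows land at roughly 60 to 65 percent of their ceiling. A quoted speed that exceeds the ceiling for its hardware is worth re-checking against the model size and quantization actually used.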

The Sovereign Software Stack

Engine Layer: Ollama or llama.cpp for running GGUF-quantized models.
Interface Layer: Open WebUI or LM Studio for a ChatGPT-like front end.
Workflow Builder: n8n for self-hosted RAG pipelines.
Character Framework: WAFT for interactive world models and dynamic AI characters.

Recommended models: Dolphin 3.0 (Llama 3.1 8B) for instruction following with few refusals; Qwen2.5-Coder-32B for strong coding performance.
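Once the engine layer is running, any tool on the same machine can query it over HTTP. The sketch below sends a prompt to Ollama's local `/api/generate` endpoint using only the Python standard library. It assumes Ollama is listening on its default port (11434) and that the model has already been pulled; the tag `qwen2.5-coder:32b` mentioned in the note after the code is an assumed tag name, so check `ollama list` for the exact name on your machine.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str,
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, `generate("qwen2.5-coder:32b", "Write a function that reverses a string.")` returns the model's reply as a string. The request never leaves the machine, which is the point of the stack.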

Offline AI for Healthcare

IMPORTANT: This section discusses general informational uses of local AI tools alongside, not in place of, professional medical care. Nothing here is medical advice. Always consult a qualified licensed clinician for any medical decision. If you may have a medical emergency, call your doctor or your local emergency number immediately. Healthcare costs vary widely; figures cited are illustrative. Local AI tools, where appropriate and used responsibly, may help with general health literacy, but they do not diagnose, treat, cure, or prevent any disease.

A local retrieval-augmented system over published medical literature can, in principle, support general health literacy and patient self-education. This is a description of an architecture, not a clinical recommendation. Such a system is not a medical device, has not been evaluated by any regulatory authority, and must never be used in place of a licensed clinician. In a medical emergency, contact emergency services.

Offline Healthcare AI Stack

Layer	Tool	Purpose	Hardware Req.
Inference engine	Ollama / llama.cpp	Run quantized medical LLM	RTX 3090 or M2 Pro+
Medical knowledge base	PubMed OA + Merck Manual + UpToDate (offline export)	RAG source corpus	~50GB SSD
Vector database	ChromaDB or Qdrant (self-hosted)	Semantic search over corpus	8GB RAM minimum
RAG orchestration	Anything-LLM or Open WebUI (RAG mode)	Query routing + context injection	Same machine
Wearable telemetry	Withings / Garmin local sync	Vital trends without cloud upload	Local WiFi only
Emergency reference	Where There Is No Doctor (offline PDF + embedded)	Field-level triage guide	Offline-first
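The vector database and RAG orchestration layers in this table amount to two steps: find the passages most similar to the question, then inject them into the prompt sent to the model. The sketch below illustrates that flow in plain Python. It deliberately swaps in a toy bag-of-words cosine similarity for the learned embeddings that ChromaDB or Qdrant would use, and the corpus entries are hypothetical placeholders, so it shows the shape of the pipeline rather than a usable retriever.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Return the k (doc_id, text) pairs most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        corpus.items(),
        key=lambda item: cosine(query_vec, Counter(tokenize(item[1]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: list[tuple[str, str]]) -> str:
    """Inject retrieved passages into the prompt as labeled context."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the context below and cite passage ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical placeholder passages standing in for the indexed corpus.
CORPUS = {
    "doc-1": "Placeholder passage about hydration guidance for hot weather and outdoor work.",
    "doc-2": "Placeholder passage giving an overview of how vaccines train the immune system.",
    "doc-3": "Placeholder passage on sleep hygiene basics and their relationship to daytime alertness.",
}

if __name__ == "__main__":
    question = "sleep and alertness"
    print(build_prompt(question, retrieve(question, CORPUS, k=1)))
```

Asking the model to cite passage ids keeps every answer traceable to a specific source document, which matters more for a literature-reading tool than fluent prose does.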

Open Models Some Explore for Reading Medical Literature

Model	Parameters	Strength	General Reading Use
Med42-v2 (M42 Health)	70B (Q4)	Trained on medical text	Summarizing public medical literature
BioMistral-7B	7B (Q4)	Biomedical literature comprehension	Research summaries, literature lookup
Llama 3.1-70B	70B (Q4)	General reasoning	General reading and study
OpenBioLLM-70B	70B (Q4)	USMLE-level medical knowledge	Studying medical concepts

These tools are for general health literacy and reading public literature only. They are NOT for triage, diagnosis, treatment, drug, or mental-health decisions, and are not a substitute for a licensed clinician. If you may have a medical emergency, call your doctor or your local emergency number immediately.

Critical implementation notes:
(1) All medical AI outputs are informational, not diagnostic. The stack must surface this disclaimer in the UI with every response.
(2) The corpus must be version-controlled and updated quarterly - stale medical literature is worse than no literature.
(3) For community deployments, run the stack on a dedicated machine accessible over the LAN - members can query from any device without internet.
(4) Pair with a physical medical kit: tourniquets, wound closure strips, SAM splints, and a printed copy of "Where There Is No Doctor" - analog backup for power-out scenarios.
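Notes (1) and (2) are easy to enforce in code rather than by convention. The sketch below wraps every model answer with a fixed disclaimer and flags a corpus that has gone too long without an update. The disclaimer wording, the function names, and the 92-day reading of "quarterly" are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumed wording; adapt to the deployment's own disclaimer text.
DISCLAIMER = (
    "Informational only, not diagnostic. Consult a licensed clinician; "
    "in an emergency, call your local emergency number."
)

# Assumption: "quarterly" is treated as at most 92 days between updates.
MAX_CORPUS_AGE = timedelta(days=92)


def with_disclaimer(answer: str) -> str:
    """Append the disclaimer to a model answer before it reaches the UI."""
    return f"{answer.rstrip()}\n\n{DISCLAIMER}"


def corpus_is_stale(last_updated: date, today: date,
                    max_age: timedelta = MAX_CORPUS_AGE) -> bool:
    """True if the corpus has not been updated within the allowed window."""
    return today - last_updated > max_age


if __name__ == "__main__":
    print(with_disclaimer("Summary of the retrieved passages goes here."))
    print("Corpus stale:", corpus_is_stale(date(2026, 1, 1), date.today()))
```

Routing every response through a single wrapper means the disclaimer cannot be skipped by a new front end, and a staleness check at startup can warn the operator before outdated literature is served.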

Local research tools can reduce information asymmetry, but they are not a substitute for a licensed medical professional. Always work with a qualified clinician for any medical decision.

The information here is a starting point for your own research, not a professional recommendation. Figures are rough estimates and any products named are referenced for information only, not endorsements.
