Running LLM Apps on Shared/VPS: CPU vs RAM vs VRAM Reality
Every hosting engineer eventually faces the same question: “Can I run my LLM application on shared hosting or a small VPS?” It sounds feasible – after all, Python scripts run fine, vector operations run fine, and embeddings don’t look heavy at first glance. But once you deploy a 7B or 13B model, add retrieval, quantise weights, and scale concurrent users, reality hits: LLM workloads behave nothing like traditional web apps.
This guide breaks down the messy, practical truth from a security-oriented, infrastructure-level perspective. After a decade hardening WordPress hosting environments, the same patterns emerge repeatedly – resource contention, memory starvation, mutex locks, thermal throttling, and noisy neighbours. LLM workloads amplify all of them.
We will cover:
- The real bottlenecks: CPU vs RAM vs VRAM
- How quantisation and model size impact hosting limits
- The concurrency traps nobody warns you about
- When shared hosting is acceptable – and when it breaks
- How to use lightweight retrieval (see vector db explained) to reduce load
- Security and isolation concerns when mixing LLM code with PHP apps
- A quick benchmarking checklist for your setup
- Links to deeper hosting and DevOps resources such as using AI to diagnose issues, AI direct on your code editor of choice, hosting articles, and our devops category
Why LLM Hosting Is Fundamentally Different
Most WordPress or PHP applications behave predictably: CPU spikes appear under load, RAM usage plateaus, and caching smooths most of the workflow. LLM workloads rip all that up. They generate bursty CPU demand, unpredictable RAM spikes, deeply non-linear latency under concurrency, and GPU requirements that no shared host can meet.
The issue is simple: language models don’t scale linearly. Doubling the tokens roughly doubles the work in most layers, but in standard self-attention every token attends to every other token, so that part of the cost grows roughly quadratically with context length. Models with 4K context windows behave very differently from those with 16K or 128K windows, even if the parameter count stays fixed.
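To make the non-linearity concrete, here is a rough sketch of how attention work grows with context length when a full prompt is processed. It assumes standard full self-attention (no sliding window or sparse variants), and the function name is illustrative. The results are relative ratios, not measurements.

```python
def relative_attention_cost(context_tokens: int, baseline_tokens: int = 4096) -> float:
    """Return attention-score work relative to a baseline context length.

    Assumes full self-attention, where the score matrix has
    context_tokens x context_tokens entries (quadratic growth).
    """
    if context_tokens <= 0 or baseline_tokens <= 0:
        raise ValueError("token counts must be positive")
    return (context_tokens / baseline_tokens) ** 2


# Compare the three window sizes mentioned above against a 4K baseline.
for n in (4096, 16384, 131072):
    ratio = relative_attention_cost(n)
    print(f"{n:>7} tokens -> {ratio:.0f}x the attention work of a 4K prompt")
```

Under that assumption, a 4x longer context costs about 16x the attention work, which is why parameter count alone is a poor predictor of hosting requirements.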
In a shared hosting environment, these nonlinearities collide with noisy neighbours and limited OS-level control. That’s why LLM deployment must be treated more like data engineering than web hosting.
CPU vs RAM vs VRAM: The Real Priorities
Hosting conversations often fixate on CPU cores, but in LLM workloads, CPU is just one piece. Below is a simplified breakdown of which component matters most for which operation.
| Operation Type | CPU Demand | RAM Demand | VRAM/GPU Need |
|---|---|---|---|
| Token generation (CPU inference) | Very High | Moderate | None |
| Embedding generation | High | Moderate | Optional (GPU speeds it massively) |
| Vector search (FAISS/SQLite/Pinecone) | Low | Low | None |
| Full LLM inference (7B+) | Extremely High | High | Essential for real-time use |
| Serving multiple users | Burst-heavy | High (multi-process) | GPU strongly advised |
Notice something? Only one part of the stack truly depends on a GPU: the model inference itself. Embeddings, RAG lookups, vector search, JSON parsing, and safety checks generally run fine on CPU, even if a GPU speeds some of them up.
This is why many teams deploy only the embedding + retrieval layer on a VPS while sending inference to an external GPU provider.
Why Shared Hosting Struggles With LLM Workloads
A shared environment has inherent limits tightly coupled to security and resource fairness. LLM workloads violate most of those assumptions.
Process Isolation
Shared hosting caps the number of concurrent processes and threads. A single LLM inference may use dozens of threads (OpenBLAS, MKL, or llama.cpp parallelism). Hosts usually throttle this automatically.
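If you control the Python process, one way to stay under a host's thread limits is to cap the numeric libraries yourself. The sketch below assumes your stack uses OpenMP, OpenBLAS, or MKL underneath; the cap of 2 is a hypothetical value, and llama.cpp takes its thread count through its own setting rather than these variables.

```python
import os

# Hypothetical cap: match this to the cores your plan actually grants.
MAX_THREADS = "2"

THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")

# These variables are read when the native libraries initialise, so set them
# before importing numpy, torch, or anything else that loads a BLAS backend.
# setdefault keeps any value the host or your service manager already set.
for var in THREAD_VARS:
    os.environ.setdefault(var, MAX_THREADS)

print({var: os.environ[var] for var in THREAD_VARS})
```

Setting the cap explicitly tends to give more predictable latency than letting the host throttle a process that asked for more threads than it is allowed.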
RAM Caps and OOM Killer
There is no graceful failure. When the model footprint plus temporary buffers exceed the RAM limit, your process is killed instantly. Quantised 4-bit models help, but not enough.
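A back-of-envelope estimate shows why 4-bit quantisation alone does not save you. The sketch below assumes roughly 4.5 bits per weight as an effective figure for a 4-bit format (real files carry extra scale metadata, so check the actual file size), and it counts weights only: KV cache, buffers, and the runtime come on top.

```python
def estimate_weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def headroom_gib(params_billion: float, bits_per_weight: float, ram_limit_gib: float) -> float:
    """RAM left under the cap after loading weights (negative means it cannot fit)."""
    return ram_limit_gib - estimate_weights_gib(params_billion, bits_per_weight)


ASSUMED_BITS = 4.5  # assumption, not a measured value
for size in (3, 7, 13):
    weights = estimate_weights_gib(size, ASSUMED_BITS)
    spare = headroom_gib(size, ASSUMED_BITS, ram_limit_gib=4.0)
    print(f"{size}B: ~{weights:.1f} GiB of weights, {spare:+.1f} GiB headroom under a 4 GiB cap")
```

Whatever headroom remains has to cover the context cache and the operating system, so a small positive number is not a comfortable margin.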
No Access to GPU
Even if the provider technically offers Nvidia cards, shared hosting users almost never receive CUDA access. Most providers forbid it outright for security reasons.
Noisy Neighbour Contention
Other sites spike CPU → your model slows mid-generation → users see broken streams or timeouts.
Security Risks
Running Python inference next to PHP workloads is a risk multiplier. Shell access, Python modules, long-running daemons – none of this belongs on shared hosting from a security-hardening perspective.
When Shared Hosting Works (Rarely)
There are three cases where shared hosting is acceptable:
Case 1: Embeddings Only
If your application only generates embeddings using lightweight libraries or remote services, shared hosting can serve as the frontend while offloading the heavy lifting.
Case 2: Proxying to Remote GPUs
Your hosting acts only as a “router” between users and your GPU endpoint (e.g., OpenAI, Replicate, Groq, custom A100 node).
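The router role can be very small. The following is a minimal sketch using only the Python standard library; the endpoint URL, payload fields, and bearer-token header are all hypothetical placeholders, so substitute whatever your provider's API actually specifies.

```python
import json
import urllib.request

# Hypothetical endpoint and payload shape: replace with your provider's real API.
INFERENCE_URL = "https://gpu.example.com/v1/generate"


def build_inference_request(prompt: str, api_key: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a JSON POST for the remote inference endpoint."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        INFERENCE_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def forward(prompt: str, api_key: str, timeout_s: float = 30.0) -> dict:
    """Send the request and decode the JSON reply.

    The timeout keeps a stalled upstream from tying up a local worker forever.
    """
    request = build_inference_request(prompt, api_key)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the proxy easy to test without network access, and the explicit timeout matters on hosts that limit concurrent processes.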
Case 3: Micro-models (1B-3B models)
Extremely small models like Phi-2 or TinyLlama can run under tight constraints. But expect terrible latency under load.
Anything beyond this requires a VPS or dedicated instance.
The VPS Reality Check: Limits Still Apply
A VPS gives you more freedom, but the CPU/RAM/IOPS ceilings still drive the entire experience. Below is a rough reality table for Llama models on a typical 2 vCPU / 4 GB RAM VPS.
| Model Size | Quantisation | Feasible? | Expected Latency | Concurrency |
|---|---|---|---|---|
| 3B | Q4_K_M | Yes | 1.5 – 3s/token | 1 user |
| 7B | Q4_K_M | Barely | 4 – 7s/token | 1 user |
| 13B | Q4_K_M | No | N/A | N/A |
| 7B GPU (remote) | Any | Yes | 0.05s/token | 10+ users |
On CPU-only environments, even 7B models become impractical under concurrency. CPU inference is extremely expensive; the moment two users run generation simultaneously, your VPS will bottleneck hard.
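Per-token figures are easier to judge once converted into full response times. This small sketch uses the best-case latencies from the table above and an assumed 200-token reply; the reply length is an illustrative choice, not a benchmark.

```python
def response_seconds(tokens: int, seconds_per_token: float) -> float:
    """Total generation time for one reply, ignoring prompt processing and queueing."""
    return tokens * seconds_per_token


REPLY_TOKENS = 200  # assumed reply length for illustration

# Best-case seconds per token, taken from the table above.
scenarios = (("3B on CPU", 1.5), ("7B on CPU", 4.0), ("7B on remote GPU", 0.05))

for label, seconds_per_token in scenarios:
    total = response_seconds(REPLY_TOKENS, seconds_per_token)
    print(f"{label}: ~{total:.0f}s for a {REPLY_TOKENS}-token reply")
```

Even the optimistic CPU figures put a single reply in the range of minutes, before a second user arrives.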
This is why many hosting engineers recommend a hybrid pattern: deploy the web layer on a VPS and forward inference to GPU endpoints. It lowers attack surface, simplifies scaling, and frees your OS from multithread chaos.
Vector Databases on a VPS: Surprisingly Easy
Retrieval-augmented generation (RAG) is often cheaper to run than the LLM itself. Most vector databases – depending on the backend, such as FAISS, SQLite-based approaches, or cloud backends – consume negligible CPU.
For a full primer, see vector db explained.
What matters most is:
- Embedding generation (CPU heavy)
- Index build time (medium)
- Search complexity (very low)
- Concurrent lookups (low)
In many RAG workflows, vector search represents less than 2% of total CPU time. The bottleneck is nearly always the model inference, not the retrieval layer.
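To illustrate why the lookup itself is cheap, here is a brute-force cosine-similarity search in plain Python. It is a toy sketch with made-up 3-dimensional embeddings and hypothetical document names; production systems use a library such as FAISS and far larger vectors, but the core operation is the same handful of multiplications per document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, documents, k=2):
    """Return the ids of the k documents most similar to the query.

    documents maps a document id to its embedding vector.
    """
    ranked = sorted(
        documents.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy embeddings; real ones typically have hundreds of dimensions.
docs = {
    "pricing": [0.9, 0.1, 0.0],
    "backups": [0.1, 0.9, 0.1],
    "ssl": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.1], docs))
```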
Concurrency: The Silent LLM Killer
From a systems-hardening viewpoint, concurrency is the most dangerous misconfiguration. Engineers often test their LLM endpoint with a single user and assume everything will scale. It never does.
Key concurrency traps include:
Python GIL Misunderstanding
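Developers assume Python threads will scale. They won’t: the global interpreter lock lets only one thread execute Python bytecode at a time, while numeric libraries release the lock and spawn their own native threads underneath, so adding Python threads mostly adds contention.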
Multiple Inference Workers
Running two model workers on a 2 vCPU machine instantly chokes the host. The OS spends more time context switching than generating tokens.
Context Window Expansion
Longer context = more attention computation = slower inference = requests overlap longer = concurrency spike = feedback loop.
PHP-to-Python Bridges
Where WordPress sites call Python scripts, many use shell_exec or FastCGI wrappers. Under high load, these choke, queue, or deadlock.
Security Implications: Mixed PHP + LLM Environments
This is where my hosting hardening experience kicks in: LLM workloads introduce attack avenues rarely considered in typical WordPress setups.
Long-running Inference Workers
Any daemonised inference worker sitting next to PHP-FPM increases attack surface and reduces your ability to enforce least privilege.
Model Poisoning or Prompt Injection
If your model interacts with public inputs, you must isolate it from filesystem access unless you trust your sanitisation pipeline.
Package Supply Chain
LLM frameworks frequently pull deep dependency trees: transformers, safetensors, accelerate, BLAS libraries. Each is a potential vulnerability root.
System Resource Exhaustion (DoS by Design)
Attackers can send large contexts or long prompts to intentionally spike CPU for minutes.
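A basic mitigation is to bound both input size and requested output before any request reaches the model. This is a minimal sketch; the limits and the function name are hypothetical, and a character cap is only a rough proxy for token count.

```python
MAX_PROMPT_CHARS = 4000  # hypothetical limit; tune to your model and hardware
MAX_NEW_TOKENS = 256     # hypothetical ceiling on generated tokens


def validate_request(prompt: str, requested_tokens: int) -> int:
    """Reject oversized or empty prompts and clamp the generation length.

    Returns the number of new tokens the caller is allowed to generate.
    """
    if not prompt.strip():
        raise ValueError("empty prompt")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    # Clamp rather than reject: a large request still gets a bounded answer.
    return max(1, min(requested_tokens, MAX_NEW_TOKENS))
```

Pair this with per-client rate limiting at the web layer, since many small requests can exhaust CPU as effectively as one large one.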
This is why shared hosting is not just impractical but unsafe for LLM workloads.
How to Diagnose LLM Hosting Bottlenecks
If your hosting environment is slow, unstable, or spiking unexpectedly, use the techniques illustrated in using AI to diagnose issues or run lightweight host-side profiling:
- Check CPU throttling with `top` or `htop`
- Monitor RAM peaks with `dmesg | grep -i oom`
- Enable venv-level dependency pinning
- Run inference benchmarks at different concurrency levels
- Simulate burst traffic with Locust or k6
The key is measuring under concurrency, not single user mode.
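A concurrency sweep does not need heavy tooling to get started. The sketch below times the same call at several concurrency levels using the standard library; `fake_inference` is a placeholder that only sleeps, so it will not show contention until you swap in a call to your real endpoint.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_inference():
    """Placeholder workload; replace with a call to your real endpoint."""
    time.sleep(0.02)


def timed(call) -> float:
    """Return how long a single call takes, in seconds."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def benchmark(call, concurrency: int, requests: int = 8) -> dict:
    """Fire `requests` calls with at most `concurrency` in flight and summarise latency."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: timed(call), range(requests)))
    return {
        "concurrency": concurrency,
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
        "wall_s": time.perf_counter() - wall_start,
    }


for level in (1, 2, 4):
    print(benchmark(fake_inference, level))
```

Watch how the median and maximum latency change between levels: on a CPU-bound host they typically climb sharply as soon as concurrency exceeds the core count. Locust or k6 remain the better choice for sustained or distributed load.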
Recommended Deployment Patterns
From experience, the following patterns work best across production WordPress sites integrating AI workflows.
Pattern 1: VPS for RAG, Remote GPUs for Inference
- Host WordPress + vector DB on a VPS
- Send inference to GPU provider
- Fast, safe, scalable
Pattern 2: Edge API Gateway + VPS
- Use Cloudflare Workers for request routing
- Keep heavy workloads off server
- Minimises attack surface
Pattern 3: Fully Managed AI Stack
- Use cloud providers for embeddings, storage, inference
- WordPress is purely a frontend
- Zero server load
Buying Guide: How to Choose Hosting for LLM Apps
Focus on these factors:
CPU Generation
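Newer-generation CPUs can outperform older ones by 2x or more on ML workloads, largely thanks to wider SIMD instruction sets such as AVX2 and AVX-512. Ask the provider which CPU model backs the plan.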
RAM Headroom
Avoid any VPS with less than 8 GB RAM for model hosting.
GPU Access
If the provider offers GPUs, ensure you get:
- CUDA access
- Stable VRAM allocation
- No multi-tenant GPU sharing
Storage IOPS
High IOPS helps with model load times and vector database operations.
Network Performance
If inference is offloaded, network latency becomes critical.
Running LLM Apps on Shared/VPS FAQ
No. RAM and CPU constraints make it impractical and insecure.
Yes, lightly. But expect slow throughput.
For any real-time user-facing inference, yes.
Only for small models or CPU-bound workflows. For production traffic, use GPUs.
Often yes – they are lightweight. For large datasets or high traffic, separate them.
Summary
Running LLM apps on shared or small VPS hosting is possible, but only in narrow circumstances. CPU-bound inference is slow, concurrency falls apart quickly, and RAM caps hit hard. Retrieval layers are surprisingly cheap to host, but full model inference is best delegated to GPUs or dedicated inference infrastructure.
For deeper infrastructure patterns, browse our hosting articles or explore the broader systems architectures inside our devops category.