Running LLM Apps on Shared/VPS: CPU vs RAM vs VRAM Reality

Every hosting engineer eventually faces the same question: “Can I run my LLM application on shared hosting or a small VPS?” It sounds feasible – after all, Python scripts run fine, vector operations run fine, and embeddings don’t look heavy at first glance. But once you deploy a 7B or 13B model, add retrieval, quantise weights, and scale concurrent users, reality hits: LLM workloads behave nothing like traditional web apps.

This guide breaks down the messy, practical truth from a security-oriented, infrastructure-level perspective. After a decade hardening WordPress hosting environments, the same patterns emerge repeatedly – resource contention, memory starvation, mutex locks, thermal throttling, and noisy neighbours. LLM workloads amplify all of them.

We will cover:

  • The real bottlenecks: CPU vs RAM vs VRAM
  • How quantisation and model size impact hosting limits
  • The concurrency traps nobody warns you about
  • When shared hosting is acceptable – and when it breaks
  • How lightweight retrieval (see vector db explained) reduces load
  • Security and isolation concerns when mixing LLM code with PHP apps
  • A quick benchmarking checklist for your setup
  • Links to deeper hosting and DevOps resources such as using AI to diagnose issues, AI direct on your code editor of choice, hosting articles, and our devops category

Why LLM Hosting Is Fundamentally Different

Most WordPress or PHP applications behave predictably: CPU spikes appear under load, RAM usage plateaus, and caching smooths most of the workflow. LLM workloads rip all that up. They generate bursty CPU demand, unpredictable RAM spikes, deeply non-linear latency under concurrency, and GPU requirements that no shared host can meet.

The issue is simple: language models don’t scale linearly. Doubling the tokens doubles much of the compute, but self-attention over a prompt grows roughly with the square of the context length, which bends the curve sharply upward. Models with 4K context windows behave very differently from those with 16K or 128K windows, even if the parameter count stays fixed.
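To make the non-linearity concrete, here is a minimal sketch of how the attention work for processing a whole prompt grows with context length. It uses a common rough estimate of about 4 × n² × d FLOPs per layer (the query-key scores plus the weighted sum over values). The hidden size and layer count are illustrative assumptions for a 7B-class model, not measurements of any specific one.

```python
def attention_flops(seq_len: int, d_model: int = 4096, n_layers: int = 32) -> int:
    """Rough FLOPs spent on attention when processing seq_len tokens at once.

    Assumes ~2*n^2*d FLOPs for Q @ K^T and ~2*n^2*d for scores @ V per layer.
    d_model and n_layers are illustrative defaults, not exact model values.
    """
    per_layer = 4 * seq_len * seq_len * d_model
    return per_layer * n_layers


def relative_cost(seq_len: int, baseline: int = 4096) -> float:
    """How many times more attention work seq_len needs than the baseline length."""
    return attention_flops(seq_len) / attention_flops(baseline)


if __name__ == "__main__":
    for n in (4096, 16384, 131072):
        print(f"{n:>7} tokens: {relative_cost(n):,.0f}x the 4K attention cost")
```

Under these assumptions a 16K prompt needs about 16 times the attention work of a 4K prompt, and a 128K prompt about 1,024 times, although the token counts only grew 4x and 32x.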

In a shared hosting environment, these nonlinearities collide with noisy neighbours and limited OS-level control. That’s why LLM deployment must be treated more like data engineering than web hosting.


CPU vs RAM vs VRAM: The Real Priorities

Hosting conversations often fixate on CPU cores, but in LLM workloads, CPU is just one piece. Below is a simplified breakdown of which component matters most for which operation.

| Operation Type | CPU Demand | RAM Demand | VRAM/GPU Need |
|---|---|---|---|
| Token generation (CPU inference) | Very High | Moderate | None |
| Embedding generation | High | Moderate | Optional (GPU speeds it massively) |
| Vector search (FAISS/SQLite/Pinecone) | Low | Low | None |
| Full LLM inference (7B+) | Extremely High | High | Essential for real-time use |
| Serving multiple users | Burst-heavy | High (multi-process) | GPU strongly advised |

Notice something? Only one part of the stack truly depends on a GPU: the model inference itself. Embeddings at modest volume, RAG lookups, vector search, JSON parsing, and safety checks generally run fine on CPU.

This is why many teams deploy only the embedding + retrieval layer on a VPS while sending inference to an external GPU provider.


Why Shared Hosting Struggles With LLM Workloads

A shared environment has inherent limits tightly coupled to security and resource fairness. LLM workloads violate most of those assumptions.

Process Isolation

Shared hosting caps the number of concurrent processes and threads. A single LLM inference may use dozens of threads (OpenBLAS, MKL, or llama.cpp parallelism). Hosts usually throttle this automatically.

RAM Caps and OOM Killer

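There is no graceful failure. When the model footprint plus temporary buffers exceeds the RAM limit, the kernel’s OOM killer terminates your process immediately. 4-bit quantisation helps, but not enough: a 7B model still needs at least 3.5 GB for the weights alone, before any context buffers.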
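A quick back-of-envelope check shows why quantisation alone rarely rescues a capped plan. The sketch below estimates weight memory as parameters × bits per weight ÷ 8. The `overhead_gb` default is a placeholder assumption for KV cache, buffers, and the runtime. Real quantisation formats also tend to average slightly more bits per weight than their nominal figure, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits_in_ram(params_billions: float, bits_per_weight: float,
                ram_limit_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if weights plus an assumed overhead fit under the RAM cap.

    overhead_gb stands in for KV cache, buffers, and the runtime itself;
    measure it on your own stack rather than trusting this default.
    """
    needed = weight_memory_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= ram_limit_gb


if __name__ == "__main__":
    for size in (3, 7, 13):
        gb = weight_memory_gb(size, 4)
        print(f"{size}B at 4-bit: ~{gb:.1f} GB weights, fits in 4 GB: {fits_in_ram(size, 4, 4.0)}")
```

With these numbers a 7B model at 4 bits needs about 3.5 GB for weights, so it already fails a 4 GB cap once the assumed 1 GB of overhead is added.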

No Access to GPU

Even if the provider technically offers Nvidia cards, shared hosting users almost never receive CUDA access. Most providers forbid it outright for security reasons.

Noisy Neighbour Contention

Other sites spike CPU → your model slows mid-generation → users see broken streams or timeouts.

Security Risks

Running Python inference next to PHP workloads is a risk multiplier. Shell access, Python modules, long-running daemons – none of this belongs on shared hosting from a security-hardening perspective.


When Shared Hosting Works (Rarely)

There are three cases where shared hosting is acceptable:

Case 1: Embeddings Only

If your application only generates embeddings using lightweight libraries or remote services, shared hosting can serve as the frontend while offloading the heavy lifting.

Case 2: Proxying to Remote GPUs

Your hosting acts only as a “router” between users and your GPU endpoint (e.g., OpenAI, Replicate, Groq, custom A100 node).

Case 3: Micro-models (1B-3B models)

Extremely small models like Phi-2 or TinyLlama can run under tight constraints. But expect terrible latency under load.

Anything beyond this requires a VPS or dedicated instance.


The VPS Reality Check: Limits Still Apply

A VPS gives you more freedom, but the CPU/RAM/IOPS ceilings still drive the entire experience. Below is a rough reality table for Llama models on a typical 2 vCPU / 4 GB RAM VPS.

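| Model Size | Quantisation | Feasible? | Expected Latency | Concurrency |
|---|---|---|---|---|
| 3B | Q4_K_M | Yes | 1.5 – 3 s/token | 1 user |
| 7B | Q4_K_M | Barely | 4 – 7 s/token | 1 user |
| 13B | Q4_K_M | No | N/A | N/A |
| 7B GPU (remote) | Any | Yes | 0.05 s/token | 10+ users |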

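On CPU-only environments, even 7B models become impractical under concurrency. CPU inference is extremely expensive: at 4 to 7 s/token, a 200-token reply takes roughly 13 to 23 minutes for a single user. The moment two users run generation simultaneously, they compete for the same two cores and your VPS bottlenecks hard.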

This is why many hosting engineers recommend a hybrid pattern: deploy the web layer on a VPS and forward inference to GPU endpoints. It lowers attack surface, simplifies scaling, and frees your OS from multithread chaos.
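The hybrid pattern can be sketched with nothing but the standard library. The endpoint URL, payload fields, and bearer-token header below are hypothetical placeholders. Every provider defines its own request shape, so check the provider's documentation before adapting this.

```python
import json
import urllib.request

# Hypothetical endpoint and payload shape; replace with your provider's documented API.
INFERENCE_URL = "https://gpu.example.com/v1/generate"


def build_inference_request(prompt: str, api_key: str,
                            max_tokens: int = 256,
                            url: str = INFERENCE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the remote GPU endpoint."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def forward_prompt(prompt: str, api_key: str, timeout_s: float = 30.0) -> dict:
    """Send the prompt upstream and return the decoded JSON response."""
    request = build_inference_request(prompt, api_key)
    # A hard timeout keeps a slow upstream from pinning web workers indefinitely.
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the web layer easy to test without network access, and the VPS never loads model weights at all.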


Vector Databases on a VPS: Surprisingly Easy

The retrieval half of retrieval-augmented generation (RAG) is usually far cheaper to run than the LLM itself. Most vector search backends, whether FAISS, SQLite-based approaches, or cloud services, consume little CPU at query time.

For a full primer, see vector db explained.

What matters most is:

  • Embedding generation (CPU heavy)
  • Index build time (medium)
  • Search complexity (very low)
  • Concurrent lookups (low)

In many RAG workflows, vector search represents less than 2% of total CPU time. The bottleneck is nearly always the model inference, not the retrieval layer.
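To show how little work a lookup involves, here is a minimal brute-force cosine-similarity search in pure Python. It is only an illustration of the idea: libraries such as FAISS use optimised indexes and native code, and the document IDs and vectors here are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], documents: dict[str, list[float]], k: int = 3):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in documents.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
    docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], docs, k=2))
```

Each lookup is a handful of multiply-adds per stored vector, which is why even this naive approach stays cheap for small collections compared with generating a single token.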


Concurrency: The Silent LLM Killer

From a systems-hardening viewpoint, concurrency is the most dangerous misconfiguration. Engineers often test their LLM endpoint with a single user and assume everything will scale. It never does.

Key concurrency traps include:

Python GIL Misunderstanding

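Developers assume Python threads will scale. They won’t. The GIL lets only one thread execute Python bytecode at a time, while numeric libraries release it and spawn their own native threads underneath, so adding Python threads mostly adds contention for the same cores.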
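One common mitigation is to cap the native thread pools so that workers × threads does not exceed the available cores. The sketch below sets the widely used `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, and `MKL_NUM_THREADS` environment variables. The even split across workers is an assumption to tune, and inference runtimes such as llama.cpp have their own thread settings that this does not touch.

```python
import os


def threads_per_worker(cpu_cores: int, workers: int) -> int:
    """Split cores evenly across workers, never dropping below one thread."""
    return max(1, cpu_cores // max(1, workers))


def blas_thread_env(cpu_cores: int, workers: int) -> dict[str, str]:
    """Environment variables that cap common native thread pools."""
    n = str(threads_per_worker(cpu_cores, workers))
    return {
        "OMP_NUM_THREADS": n,
        "OPENBLAS_NUM_THREADS": n,
        "MKL_NUM_THREADS": n,
    }


def apply_thread_caps(workers: int) -> None:
    """Set the caps in this process; call before importing NumPy or similar."""
    cores = os.cpu_count() or 1
    os.environ.update(blas_thread_env(cores, workers))
```

The variables are generally read when the numeric library initialises, so `apply_thread_caps` needs to run at the very top of the worker's entry point.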

Multiple Inference Workers

Running two model workers on a 2 vCPU machine instantly chokes the host. The OS spends more time context switching than generating tokens.
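Rather than starting a second worker, a small host can queue requests behind a single generation slot. The sketch below uses `asyncio.Semaphore` for that. `fake_generate` is a stand-in for a real model call, and the limit of one concurrent generation is an assumption suited to a 2 vCPU machine.

```python
import asyncio

MAX_CONCURRENT_GENERATIONS = 1  # assumption: one generation at a time on 2 vCPUs


class InferenceGate:
    """Lets at most `limit` generations run at once; the rest wait their turn."""

    def __init__(self, limit: int = MAX_CONCURRENT_GENERATIONS):
        self._limit = limit
        self._semaphore = None  # created lazily inside the running event loop
        self.active = 0
        self.peak_active = 0

    async def run(self, generate, prompt: str):
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(self._limit)
        async with self._semaphore:
            self.active += 1
            self.peak_active = max(self.peak_active, self.active)
            try:
                return await generate(prompt)
            finally:
                self.active -= 1


async def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def demo():
    gate = InferenceGate()
    results = await asyncio.gather(*(gate.run(fake_generate, p) for p in ("a", "b", "c")))
    return results, gate.peak_active


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A real, blocking model call would also need to run off the event loop, for example via `asyncio.to_thread`, otherwise it stalls every other request while it generates.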

Context Window Expansion

Longer context = more attention computation = slower inference = requests overlap longer = concurrency spike = feedback loop.

PHP-to-Python Bridges

Where WordPress sites call Python scripts, many use shell_exec or FastCGI wrappers. Under high load, these choke, queue, or deadlock.


Security Implications: Mixed PHP + LLM Environments

This is where my hosting hardening experience kicks in: LLM workloads introduce attack avenues rarely considered in typical WordPress setups.

Long-running Inference Workers

Any daemonised inference worker sitting next to PHP-FPM increases attack surface and reduces your ability to enforce least privilege.

Model Poisoning or Prompt Injection

If your model interacts with public inputs, you must isolate it from filesystem access unless you trust your sanitisation pipeline.

Package Supply Chain

LLM frameworks frequently pull deep dependency trees: transformers, safetensors, accelerate, BLAS libraries. Each is a potential vulnerability root.

System Resource Exhaustion (DoS by Design)

Attackers can send large contexts or long prompts to intentionally spike CPU for minutes.
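A first line of defence is to bound both input size and output length before a request ever reaches the model. This is a minimal sketch: the limits are assumed values to tune against your model's context window, and character counts are only a crude proxy for tokens.

```python
MAX_PROMPT_CHARS = 4000      # assumed cap; tune to your model's context window
MAX_OUTPUT_TOKENS = 256      # assumed ceiling on generation length


class PromptRejected(ValueError):
    """Raised when a request exceeds the configured limits."""


def validate_request(prompt: str, requested_tokens: int) -> tuple[str, int]:
    """Reject empty or oversized prompts and clamp the requested output length."""
    if not prompt.strip():
        raise PromptRejected("empty prompt")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise PromptRejected(
            f"prompt is {len(prompt)} chars; limit is {MAX_PROMPT_CHARS}"
        )
    clamped = max(1, min(requested_tokens, MAX_OUTPUT_TOKENS))
    return prompt, clamped
```

Rejecting early is cheap. Per-client rate limiting and request timeouts belong alongside it, since a flood of small prompts can exhaust the CPU just as effectively as one large one.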

This is why shared hosting is not just impractical but unsafe for LLM workloads.


How to Diagnose LLM Hosting Bottlenecks

If your hosting environment is slow, unstable, or spiking unexpectedly, use the techniques illustrated in using AI to diagnose issues or run lightweight host-side profiling:

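  • Check CPU saturation with top or htop (the st value in top shows time stolen by the hypervisor)
  • Check for OOM kills with dmesg | grep -i oom, and watch RAM peaks with free -m
  • Pin dependencies per virtual environment so results are reproducible between runs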
  • Run inference benchmarks at different concurrency levels
  • Simulate burst traffic with Locust or k6

The key is to measure under concurrency, not in single-user mode.
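Before reaching for Locust or k6, a few lines of Python can show how latency moves as concurrency rises. In this sketch `fake_inference` is a stand-in to replace with a real call to your endpoint, and the concurrency levels and request counts are arbitrary assumptions.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(fn, payload) -> float:
    """Run fn(payload) once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    fn(payload)
    return time.perf_counter() - start


def benchmark(fn, payload, concurrency_levels=(1, 2, 4), requests_per_level=8):
    """Return {concurrency: {"p50": seconds, "max": seconds}} for each level."""
    results = {}
    for level in concurrency_levels:
        # max_workers bounds how many calls are in flight at the same time.
        with ThreadPoolExecutor(max_workers=level) as pool:
            latencies = list(pool.map(lambda _: timed_call(fn, payload),
                                      range(requests_per_level)))
        results[level] = {
            "p50": statistics.median(latencies),
            "max": max(latencies),
        }
    return results


def fake_inference(prompt: str) -> str:
    """Stand-in for a real request to your endpoint; replace before use."""
    time.sleep(0.01)
    return prompt


if __name__ == "__main__":
    for level, stats in benchmark(fake_inference, "hello").items():
        print(f"concurrency {level}: p50={stats['p50']:.3f}s max={stats['max']:.3f}s")
```

If the median at concurrency 2 is far worse than double the median at concurrency 1, the host is contending for cores or memory rather than simply sharing them.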


From experience, the following patterns work best across production WordPress sites integrating AI workflows.

Pattern 1: VPS for RAG, Remote GPUs for Inference

  • Host WordPress + vector DB on a VPS
  • Send inference to GPU provider
  • Fast, safe, scalable

Pattern 2: Edge API Gateway + VPS

  • Use Cloudflare Workers for request routing
  • Keep heavy workloads off server
  • Minimises attack surface

Pattern 3: Fully Managed AI Stack

  • Use cloud providers for embeddings, storage, inference
  • WordPress is purely a frontend
  • Near-zero server load

Buying Guide: How to Choose Hosting for LLM Apps

Focus on these factors:

CPU Generation

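Newer CPU generations can outperform older ones by 2x or more on ML workloads, so check which CPU model a plan actually runs on rather than comparing vCPU counts alone.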
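One plausible reason for the generational gap, assuming an x86 Linux host, is support for wider SIMD instruction sets such as AVX2 and AVX-512, which CPU inference libraries commonly use. The sketch below reads the feature flags from `/proc/cpuinfo`. The list of flags worth checking is an assumption, not an exhaustive or authoritative set.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Extract the feature flags from /proc/cpuinfo-style text (Linux, x86)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(flags: set[str]) -> dict[str, bool]:
    """Report which of a few assumed-relevant SIMD features are present."""
    wanted = ("avx", "avx2", "avx512f", "fma")
    return {name: name in flags for name in wanted}


def read_local_flags(path: str = "/proc/cpuinfo") -> set[str]:
    """Read the flags of the machine this runs on (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return parse_cpu_flags(handle.read())
```

Running `simd_support(read_local_flags())` on a trial instance before committing to a plan shows what the hypervisor actually exposes, which can differ from the advertised CPU model.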

RAM Headroom

Avoid any VPS with less than 8GB RAM for model hosting.

GPU Access

If the provider offers GPUs, ensure you get:

  • CUDA access
  • Stable VRAM allocation
  • No multi-tenant GPU sharing

Storage IOPS

High IOPS helps with model load times and vector database operations.

Network Performance

If inference is offloaded, network latency becomes critical.


Running LLM Apps on Shared/VPS FAQ

Can I run Llama 7B on shared hosting?

No. RAM and CPU constraints make it impractical and insecure.

Can I run embeddings on shared hosting?

Yes, lightly. But expect slow throughput.

Do I need a GPU?

For any real-time user-facing inference, yes.

Is VPS enough for LLMs?

Only for small models or CPU-bound workflows. For production traffic, use GPUs.

Should I keep vector DB on the same machine as WordPress?

Often yes – they are lightweight. For large datasets or high traffic, separate them.


Summary

Running LLM apps on shared or small VPS hosting is possible, but only in narrow circumstances. CPU-bound inference is slow, concurrency falls apart quickly, and RAM caps hit hard. Retrieval layers are surprisingly cheap to host, but full model inference is best delegated to GPUs or dedicated inference infrastructure.

For deeper infrastructure patterns, browse our hosting articles or explore the broader systems architectures inside our devops category.

Writer: Paul Wright

Content Creator with over 20 years experience Programming, Hosting, WordPress, AI & DevOps

Paul Wright is a developer with extensive experience in programming, hosting infrastructure, WordPress performance, cloud architecture, DevOps workflows, and artificial intelligence tools. At Tech IT EZ, Paul leads the site’s technical content, covering everything from performance benchmarking and uptime analysis to developer workflows, optimization strategies, and AI-enhanced productivity. With more than two decades working across software, infrastructure, and digital systems, Paul brings a grounded, engineering-driven approach to his writing. His articles distill complex topics into practical, actionable insights, helping readers understand and improve the systems they rely on. Paul’s technical reviews are independently verified by Tech IT EZ’s Senior Technical Expert Reviewer, ensuring accuracy and trust across all engineering-focused content.
