
Maximizing GPU Efficiency: NVIDIA MPS vs Dedicated GPU Allocation for LLM Inference
How GPU sharing can cut infrastructure costs by 50% with only ~7.5% performance impact
This report presents a comprehensive performance comparison between NVIDIA Multi-Process Service (MPS) [1] GPU sharing and dedicated GPU allocation for large language model inference workloads. Using vLLM serving infrastructure deployed on Google Kubernetes Engine (GKE) with `NVIDIA A100-80GB` GPUs, we evaluated the `Qwen3-4B-FP8` model across key performance metrics including throughput, latency, and resource utilization.
The Challenge: GPU Underutilization in LLM Inference
GPUs like the A100-80GB are powerful, but many LLM inference workloads fail to fully utilize them. Consider a typical deployment scenario:
Model Size: Llama3-8B (~15GB of VRAM for the weights)
GPU Memory: 80GB available
The model weights alone occupy less than 20% of the available memory. This massive underutilization presents an optimization opportunity. However, naive approaches such as running multiple independent processes on the same GPU often fail due to CUDA context limitations and memory management issues.
NVIDIA MPS: Beyond Simple GPU Time-Slicing
NVIDIA Multi-Process Service (MPS) is a binary-compatible client-server runtime that enables multiple CUDA applications to share a single GPU context.
Spatial vs Temporal Sharing
Traditional Time-Slicing: each process gets exclusive use of the whole GPU for a short time slice, and the GPU context-switches between processes, so only one workload runs at any instant.
MPS Spatial Sharing: processes submit work through a shared MPS server and run concurrently on different SMs, so the GPU is partitioned in space rather than in time.
Key Technical Advantages
1. Concurrent Kernel Execution: Multiple processes can execute CUDA kernels concurrently on different Streaming Multiprocessors (SMs), eliminating the context-switching overhead of time-slicing approaches.
2. Hardware-Level Resource Isolation: MPS provides configurable SM allocation and memory partitioning, as shown in the sketch after this list.
3. Quality of Service Controls: Priority-based scheduling ensures critical workloads receive necessary resources while background processes use remaining capacity.
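The partitioning in point 2 is exposed to each MPS client through environment variables. Below is a minimal sketch of constraining one vLLM server process; the 50% SM share, the 40GB memory cap, and the Hugging Face model path are illustrative assumptions, not the exact settings used in this benchmark.

```python
import os
import subprocess

# Limit this MPS client to roughly half of the GPU's SMs.
os.environ["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "50"

# Cap this client's pinned device memory (illustrative 40GB on GPU 0).
os.environ["CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"] = "0=40G"

# Start one vLLM server as an MPS client; a second process launched the same
# way can use the remaining SMs and memory on the same GPU.
subprocess.run([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "Qwen/Qwen3-4B-FP8",
    "--gpu-memory-utilization", "0.5",
    "--port", "8000",
])
```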
Load Testing Parameters
Our benchmark simulated realistic production traffic patterns against the vLLM OpenAI-compatible endpoint.
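The full request mix is not listed here, but the load generator follows a common pattern: many concurrent completion requests fired at the serving endpoint. A minimal sketch, with the endpoint URL, concurrency, prompt, and token budget all as illustrative assumptions rather than the actual benchmark parameters:

```python
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/completions"  # assumed vLLM endpoint
CONCURRENCY = 32                              # illustrative concurrency level
PROMPT = "Explain GPU spatial sharing in one paragraph."

async def one_request(session: aiohttp.ClientSession) -> float:
    """Send one completion request and return its end-to-end latency in seconds."""
    payload = {"model": "Qwen/Qwen3-4B-FP8", "prompt": PROMPT, "max_tokens": 256}
    start = time.perf_counter()
    async with session.post(URL, json=payload) as resp:
        await resp.json()
    return time.perf_counter() - start

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(
            *(one_request(session) for _ in range(CONCURRENCY))
        )
    print(f"mean end-to-end latency: {sum(latencies) / len(latencies):.3f}s")

asyncio.run(main())
```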
Deployment Configurations
We compared two deployments of the same model on A100-80GB nodes:
MPS Shared Configuration: two vLLM processes share one GPU, each limited to 50% of GPU memory
Dedicated GPU Configuration: one vLLM process per GPU with an 80% GPU memory allocation
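Only one engine parameter differs between the two setups: the fraction of GPU memory each vLLM process is allowed to claim. A minimal sketch using the vLLM Python API, with the memory fractions taken from the resource-efficiency numbers below; the Hugging Face model path is an assumption:

```python
from vllm import LLM

def build_engine(mps_shared: bool) -> LLM:
    """Build a vLLM engine for either the MPS-shared or the dedicated setup."""
    return LLM(
        model="Qwen/Qwen3-4B-FP8",
        # MPS shared: each of the two processes on the GPU claims 50% of memory.
        # Dedicated: the single process on the GPU claims 80%.
        gpu_memory_utilization=0.5 if mps_shared else 0.8,
    )
```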
Performance Analysis

Throughput Performance
The results demonstrate that MPS provides excellent performance with minimal degradation:
| Metric | Dedicated GPU | MPS Shared | Delta |
|---|---|---|---|
| Output Throughput | 1,557.1 tokens/s | 1,448.2 tokens/s | -7.5% |
| Request Throughput | 6.17 req/s | 5.73 req/s | -7.1% |
Latency Analysis
Latency metrics reveal where MPS overhead manifests:
| Metric | Dedicated GPU | MPS Shared | Delta |
|---|---|---|---|
| TTFT (Median) | 497.5 ms | 504.7 ms | +1.4% |
| TPOT (Median) | 52.3 ms | 62.5 ms | +19.5% |
The latency impact primarily affects decode token generation (TPOT) rather than initial prefill (TTFT). This suggests MPS overhead is more pronounced during the iterative decode phase where multiple processes compete for SM resources.
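As a reference for how these two metrics relate, here is a small sketch of deriving TTFT and TPOT from the timestamps of a single streamed request; the function and parameter names are illustrative, not part of the benchmark harness.

```python
def ttft_and_tpot(request_start: float,
                  first_token_time: float,
                  last_token_time: float,
                  num_output_tokens: int) -> tuple[float, float]:
    """Derive TTFT and TPOT (in seconds) from one streamed request.

    TTFT covers the prefill phase; TPOT averages per-token latency over the
    remaining decode steps.
    """
    ttft = first_token_time - request_start
    tpot = (last_token_time - first_token_time) / max(num_output_tokens - 1, 1)
    return ttft, tpot

# With the medians reported above, the 19.5% TPOT increase under MPS adds
# roughly 10 ms per decoded token (62.5 ms vs 52.3 ms).
```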
Resource Efficiency
Memory Utilization Strategy:
MPS: 50% allocation per process → 2 processes per GPU
Dedicated: 80% allocation → 1 process per GPU
Effective Density: 2x pod density with MPS
Performance vs Cost Trade-off: doubling pod density halves the per-pod GPU cost, while output throughput drops by only ~7.5%.
This represents exceptional cost efficiency: every 1% of performance given up yields roughly 6.7% of cost reduction.
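A quick sanity check of that ratio from the figures above:

```python
cost_reduction_pct = 50.0    # two pods per GPU instead of one dedicated pod
performance_loss_pct = 7.5   # output-throughput drop measured under MPS

# Cost saved for every 1% of throughput given up.
print(cost_reduction_pct / performance_loss_pct)  # ≈ 6.7
```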
Infrastructure Setup and Technical Details
Kubernetes Cluster Configuration
Node Pool Specifications: GKE node pool backed by NVIDIA A100-80GB GPUs.
MPS Configuration: GPU sharing enabled on the node pool, with two MPS clients (vLLM pods) per GPU.
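The exact node-pool and MPS daemon settings are not reproduced here. As a rough illustration, the sketch below builds a vLLM pod spec with the official Kubernetes Python client; the container image, model path, and arguments are assumptions rather than the manifests used in this study.

```python
from kubernetes import client

def vllm_pod(name: str, gpu_memory_utilization: float) -> client.V1Pod:
    """Build a vLLM pod spec that requests one (MPS-shared) GPU."""
    container = client.V1Container(
        name="vllm",
        image="vllm/vllm-openai:latest",  # assumed serving image
        args=[
            "--model", "Qwen/Qwen3-4B-FP8",
            "--gpu-memory-utilization", str(gpu_memory_utilization),
        ],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"},  # one shared GPU client on an MPS node pool
        ),
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(containers=[container], restart_policy="Always"),
    )

# Two such pods with gpu_memory_utilization=0.5 can be scheduled onto the same
# A100-80GB node when the node pool has MPS sharing enabled.
```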
Future Directions and Advanced Optimizations
MIG
Evaluate NVIDIA Multi-Instance GPU (MIG), which partitions the GPU into hardware-isolated instances, as an alternative sharing strategy and compare it against MPS.
Multi-Model Serving
Next-generation MPS deployments could support heterogeneous model serving, running different models side by side on the same GPU with per-process SM and memory limits.
Dynamic Resource Allocation
Advanced MPS implementations could adjust resource allocation and bin-pack pods and workloads based on real-time demand.
Conclusion
Our evaluation demonstrates that NVIDIA MPS provides a viable GPU sharing solution with manageable performance trade-offs. The ability to achieve 50% cost reduction with only 7.5% performance impact makes it a compelling solution for many production scenarios.
Key Takeaways:
MPS is production-ready for cost-sensitive workloads with manageable performance trade-offs
Hybrid strategies work best: use MPS for smaller LLMs that fit comfortably on a single GPU and for development environments; reserve dedicated GPUs for critical, performance-sensitive production workloads
As GPU costs continue to dominate AI infrastructure budgets, technologies like MPS will become increasingly important. The cost efficiency we observed makes a compelling business case for adoption, especially in scenarios where maximum performance isn't required.
The future of AI infrastructure isn't just about faster GPUs – it's about using them more intelligently. NVIDIA MPS provides a powerful tool in that optimization toolkit, enabling organizations to build cost-effective, scalable AI systems without sacrificing functionality.
[1] https://docs.nvidia.com/deploy/mps/index.html
About This Study: This benchmark was conducted using Google Kubernetes Engine infrastructure with NVIDIA A100-80GB GPUs and vLLM serving engine.