Multi-GPU Setup: Boltz vs. Ollama+OpenWebUI

To maximize GPU utilization while running Boltz (a large, VRAM-heavy model) and Ollama+OpenWebUI (a lightweight LLM inference stack) on the same machine, here is a structured approach:


1. Separate GPUs for Different Workloads

Best Practice:

  • Use separate GPUs for Boltz and Ollama+OpenWebUI.
  • Why:
    • Boltz requires high VRAM (e.g., most of the H100's ~80 GB) and, when distributed across several GPUs, low-latency inter-GPU communication (NVLink between NVLink-capable cards).
    • Ollama+OpenWebUI serves smaller models (7B parameters or less) with modest VRAM needs (roughly 16–32 GB).
    • Separating them avoids resource contention (VRAM, compute, memory bandwidth) and gives each tool optimal performance.

Implementation:

  • Assign the H100 to Boltz (via CUDA_VISIBLE_DEVICES=0).
  • Assign the L40 to Ollama+OpenWebUI (via CUDA_VISIBLE_DEVICES=1).
  • Ensure both GPUs are recognized and functional via nvidia-smi (see the sketch below).
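
A minimal sketch of this split, assuming Boltz exposes a CLI entry point (the boltz predict call below is a placeholder for however Boltz is actually launched on this server) and Ollama runs as a foreground service:

```bash
# List both GPUs and note their indices (which one is the H100, which the L40).
nvidia-smi -L

# Pin Boltz to GPU 0 (the H100); the exact Boltz command is an assumption here.
CUDA_VISIBLE_DEVICES=0 boltz predict input.yaml &

# Pin Ollama to GPU 1 (the L40); OpenWebUI itself needs no GPU and only talks
# to the Ollama HTTP API.
CUDA_VISIBLE_DEVICES=1 ollama serve
```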

2. Sharing Both GPUs for a Single Workload

Use Case:

  • If a single workload needs both GPUs (e.g., one large inference job using model parallelism), the cards have to exchange data directly.
  • Why:
    • NVLink gives low-latency transfers between NVLink-capable GPUs (e.g., H100↔H100); the L40 has no NVLink, so H100↔L40 traffic runs over PCIe 4.0 (the topology check below shows what is actually wired up).
    • Cross-GPU execution requires CUDA-aware MPI or a distributed framework (e.g., PyTorch Distributed, Horovod).
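
Before relying on cross-GPU transfers, it is worth checking what interconnect actually exists between the two cards:

```bash
# Print the GPU interconnect topology: NV# entries mean NVLink,
# PIX/PHB/SYS mean the path runs over PCIe/host bridges.
nvidia-smi topo -m
```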

Implementation:

  • Configure the workload to see both GPUs (e.g., CUDA_VISIBLE_DEVICES=0,1).
  • Expect inter-GPU traffic between the H100 and the L40 to run over PCIe 4.0; NVLink only applies between NVLink-capable cards.
  • Ensure the tool (Boltz or Ollama) actually supports multi-GPU execution (e.g., model sharding, pipeline parallelism); see the launch sketch below.
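
A hedged example of what a two-GPU launch could look like with PyTorch Distributed; inference.py stands in for whatever multi-GPU-capable entry point the tool actually provides:

```bash
# Expose both GPUs and start one worker process per GPU with torchrun.
# How the model is split (tensor/pipeline parallelism) is decided inside
# the script, not by this launcher.
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 inference.py
```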

3. Key Considerations

| Factor | Recommendation |
| --- | --- |
| GPU allocation | Use separate GPUs for Boltz and Ollama+OpenWebUI to avoid resource contention. |
| Inter-GPU communication | NVLink only links NVLink-capable cards (H100↔H100); H100↔L40 transfers run over PCIe 4.0. |
| Software compatibility | Ensure Boltz and Ollama support multi-GPU setups (e.g., CUDA-aware MPI, PyTorch Distributed). |
| VRAM utilization | Allocate the H100 to Boltz (high VRAM) and the L40 to Ollama (low VRAM). |
| Driver configuration | Install NVIDIA driver 535+ and ensure CUDA 12.x compatibility. |
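
A quick way to confirm the driver and CUDA versions the table asks for:

```bash
# The nvidia-smi header shows the installed driver version and the highest
# CUDA version that driver supports (target: driver 535+, CUDA 12.x).
nvidia-smi | head -n 4

# Version of the CUDA toolkit itself, if one is installed locally.
nvcc --version
```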

4. Optimal Workflow

  • Separate GPUs (recommended):
    • Boltz: use the H100 for large-model training/inference.
    • Ollama+OpenWebUI: use the L40 for lightweight LLM inference (see the container sketch below).
    • Advantages: each tool gets its full VRAM budget, no resource contention, predictable latency.
  • Shared GPU:
    • Boltz: uses the H100 for large-scale tasks.
    • Ollama+OpenWebUI: shares the same H100 for lightweight tasks (e.g., model serving).
    • Trade-off: everything runs on one card, so VRAM and compute must be managed carefully.
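
If Ollama+OpenWebUI runs in containers on this host (an assumption), pinning it to the L40 could look roughly like the following; image names, ports, and volumes follow the projects' documented defaults:

```bash
# Run Ollama on GPU 1 (the L40) only, leaving GPU 0 (the H100) to Boltz.
docker run -d --gpus device=1 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# OpenWebUI needs no GPU of its own; it just points at the Ollama API.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
```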

5. Final Advice

  • Prioritize separate GPUs: for most use cases, Boltz and Ollama+OpenWebUI each benefit from a dedicated GPU.
  • Cross-GPU setups are optional: only span both cards when a workflow genuinely needs it (e.g., model parallelism), and remember that H100↔L40 traffic runs over PCIe rather than NVLink.
  • Monitor performance: use nvidia-smi (see the one-liner below) and Prometheus/Grafana to track GPU utilization, VRAM, and latency.
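
For a quick look without a full Prometheus/Grafana stack, nvidia-smi can poll the key metrics directly; for Grafana dashboards, NVIDIA's DCGM exporter is the usual metrics source.

```bash
# Poll utilization and memory for both GPUs every 5 seconds.
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
           --format=csv -l 5
```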

By separating the workloads, you ensure maximum efficiency for both tools while leveraging the full potential of your hardware. 🚀