Multi-GPU Setup: Boltz vs. Ollama+OpenWebUI

To maximize GPU utilization while running Boltz (a large, VRAM-heavy model) and Ollama+OpenWebUI (a lightweight LLM inference stack) on the same machine, here is a structured approach:


1. Separate GPUs for Different Workloads

Best Practice:

  • Use separate GPUs for Boltz and Ollama+OpenWebUI.
  • Why:
    • Boltz requires high VRAM (e.g., most of the H100's ~80 GB) and, when distributed across several GPUs, low-latency inter-GPU communication (NVLink between NVLink-capable cards).
    • Ollama+OpenWebUI serves smaller models (7B parameters or less) with modest VRAM needs (roughly 16–32 GB).
    • Separating them avoids resource contention (VRAM, compute, memory bandwidth) and gives each tool optimal performance.

Implementation:

  • Assign the H100 to Boltz (via CUDA_VISIBLE_DEVICES=0).
  • Assign the L40 to Ollama+OpenWebUI (via CUDA_VISIBLE_DEVICES=1).
  • Ensure both GPUs are recognized and functional via nvidia-smi (see the sketch below).
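
A minimal sketch of this split, assuming Boltz exposes a CLI entry point (the boltz predict call below is a placeholder for however Boltz is actually launched on this server) and Ollama runs as a foreground service:

```bash
# List both GPUs and note their indices (which one is the H100, which the L40).
nvidia-smi -L

# Pin Boltz to GPU 0 (the H100); the exact Boltz command is an assumption here.
CUDA_VISIBLE_DEVICES=0 boltz predict input.yaml &

# Pin Ollama to GPU 1 (the L40); OpenWebUI itself needs no GPU and only talks
# to the Ollama HTTP API.
CUDA_VISIBLE_DEVICES=1 ollama serve
```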

2. Sharing Both GPUs for a Single Workload

Use Case:

  • If a single workload needs both GPUs (e.g., one large inference job using model parallelism), the cards have to exchange data directly.
  • Why:
    • NVLink gives low-latency transfers between NVLink-capable GPUs (e.g., H100↔H100); the L40 has no NVLink, so H100↔L40 traffic runs over PCIe 4.0 (the topology check below shows what is actually wired up).
    • Cross-GPU execution requires CUDA-aware MPI or a distributed framework (e.g., PyTorch Distributed, Horovod).
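
Before relying on cross-GPU transfers, it is worth checking what interconnect actually exists between the two cards:

```bash
# Print the GPU interconnect topology: NV# entries mean NVLink,
# PIX/PHB/SYS mean the path runs over PCIe/host bridges.
nvidia-smi topo -m
```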

Implementation:

  • Configure the workload to see both GPUs (e.g., CUDA_VISIBLE_DEVICES=0,1).
  • Expect inter-GPU traffic between the H100 and the L40 to run over PCIe 4.0; NVLink only applies between NVLink-capable cards.
  • Ensure the tool (Boltz or Ollama) actually supports multi-GPU execution (e.g., model sharding, pipeline parallelism); see the launch sketch below.
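
A hedged example of what a two-GPU launch could look like with PyTorch Distributed; inference.py stands in for whatever multi-GPU-capable entry point the tool actually provides:

```bash
# Expose both GPUs and start one worker process per GPU with torchrun.
# How the model is split (tensor/pipeline parallelism) is decided inside
# the script, not by this launcher.
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 inference.py
```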

3. Key Considerations

| Factor | Recommendation |
| --- | --- |
| GPU allocation | Use separate GPUs for Boltz and Ollama+OpenWebUI to avoid resource contention. |
| Inter-GPU communication | NVLink only links NVLink-capable cards (H100↔H100); H100↔L40 transfers run over PCIe 4.0. |
| Software compatibility | Ensure Boltz and Ollama support multi-GPU setups (e.g., CUDA-aware MPI, PyTorch Distributed). |
| VRAM utilization | Allocate the H100 to Boltz (high VRAM) and the L40 to Ollama (low VRAM). |
| Driver configuration | Install NVIDIA driver 535+ and ensure CUDA 12.x compatibility. |
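
A quick way to confirm the driver and CUDA versions the table asks for:

```bash
# The nvidia-smi header shows the installed driver version and the highest
# CUDA version that driver supports (target: driver 535+, CUDA 12.x).
nvidia-smi | head -n 4

# Version of the CUDA toolkit itself, if one is installed locally.
nvcc --version
```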

4. Optimal Workflow

  • Separate GPUs (recommended):
    • Boltz: use the H100 for large-model training/inference.
    • Ollama+OpenWebUI: use the L40 for lightweight LLM inference (see the container sketch below).
    • Advantages: each tool gets its full VRAM budget, no resource contention, predictable latency.
  • Shared GPU:
    • Boltz: uses the H100 for large-scale tasks.
    • Ollama+OpenWebUI: shares the same H100 for lightweight tasks (e.g., model serving).
    • Trade-off: everything runs on one card, so VRAM and compute must be managed carefully.
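
If Ollama+OpenWebUI runs in containers on this host (an assumption), pinning it to the L40 could look roughly like the following; image names, ports, and volumes follow the projects' documented defaults:

```bash
# Run Ollama on GPU 1 (the L40) only, leaving GPU 0 (the H100) to Boltz.
docker run -d --gpus device=1 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# OpenWebUI needs no GPU of its own; it just points at the Ollama API.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
```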

5. Final Advice

  • Prioritize separate GPUs: for most use cases, Boltz and Ollama+OpenWebUI each benefit from a dedicated GPU.
  • Cross-GPU setups are optional: only span both cards when a workflow genuinely needs it (e.g., model parallelism), and remember that H100↔L40 traffic runs over PCIe rather than NVLink.
  • Monitor performance: use nvidia-smi (see the one-liner below) and Prometheus/Grafana to track GPU utilization, VRAM, and latency.
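
For a quick look without a full Prometheus/Grafana stack, nvidia-smi can poll the key metrics directly; for Grafana dashboards, NVIDIA's DCGM exporter is the usual metrics source.

```bash
# Poll utilization and memory for both GPUs every 5 seconds.
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
           --format=csv -l 5
```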

By separating the workloads, you ensure maximum efficiency for both tools while leveraging the full potential of your hardware. 🚀