When using two H100 GPUs (with NVLink interconnect) for Boltz and other inference tools, the goal is to maximize GPU utilization, minimize latency, and keep the different workloads from interfering with one another. Here's a structured approach to configure the GPUs optimally:
1. Key Considerations for H100 GPUs
- NVLink: Enables low-latency, high-bandwidth GPU-to-GPU communication (up to 900 GB/s bidirectional on H100 SXM, roughly 600 GB/s over an NVLink bridge between PCIe cards; several times what PCIe alone provides). Ideal for distributed inference or model parallelism.
- VRAM: Each H100 has 80 GB VRAM, so you can run multiple large models or split a single model across both GPUs.
- CUDA Compatibility: Ensure CUDA 12.x and NVIDIA drivers 535+ are installed for full NVLink support.
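Before picking a layout, it's worth confirming that both GPUs, the CUDA runtime, and peer access are actually visible to your framework. A minimal sketch using PyTorch (assuming PyTorch is installed in the same environment as Boltz):

```python
# Quick sanity check of the two-GPU setup.
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime version:", torch.version.cuda)   # should report 12.x
print("Visible GPUs:", torch.cuda.device_count())    # expect 2

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB VRAM")

# Peer-to-peer access between GPU 0 and GPU 1 indicates NVLink/P2P is usable.
if torch.cuda.device_count() >= 2:
    print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
```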
2. Recommended Configuration Options
Option A: Separate GPUs for Different Tasks (Best for Parallel Workloads)
Use Case: Running Boltz (large model) and other inference tools (e.g., Ollama, OpenWebUI, or smaller models) independently.
Configuration:
- GPU 0: Assign to Boltz (e.g., `CUDA_VISIBLE_DEVICES=0`).
- GPU 1: Assign to other inference tools (e.g., `CUDA_VISIBLE_DEVICES=1`).
- Why:
- Avoids resource contention (VRAM, compute bandwidth).
- Ensures maximum efficiency for each task.
- NVLink is not needed unless you're using model parallelism.
Implementation:
- Use `CUDA_VISIBLE_DEVICES=0` for Boltz and `CUDA_VISIBLE_DEVICES=1` for other tools.
- Ensure both GPUs are recognized via `nvidia-smi`.
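A minimal launch sketch for this layout, pinning each process to its own GPU via `CUDA_VISIBLE_DEVICES`; the Boltz command line and the second serving script are placeholders for whatever you actually run:

```python
# Launch Boltz on GPU 0 and another inference workload on GPU 1,
# each in its own process with its own CUDA_VISIBLE_DEVICES setting.
import os
import subprocess

def launch(cmd, gpu):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)   # this process only sees the chosen GPU
    return subprocess.Popen(cmd, env=env)

boltz_proc = launch(["boltz", "predict", "input.yaml"], gpu=0)   # placeholder invocation
other_proc = launch(["python", "serve_other_model.py"], gpu=1)   # placeholder script

boltz_proc.wait()
other_proc.wait()
```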
Option B: Shared GPUs for a Single Model (Best for Large-Scale Inference)
Use Case: Running a single large model (e.g., Boltz) or distributed inference across both GPUs.
Configuration:
- GPU 0 and GPU 1: Assign to Boltz (e.g., `CUDA_VISIBLE_DEVICES=0,1`).
- Why:
- Leverages NVLink for low-latency communication (critical for distributed training/inference).
- Enables model parallelism (split the model across GPUs).
- Maximizes available VRAM (2 × 80 GB = 160 GB combined).
Implementation:
- Use `CUDA_VISIBLE_DEVICES=0,1` to allocate both GPUs to the same model.
- Use CUDA-aware MPI or a distributed framework (e.g., PyTorch Distributed, Horovod) for inter-GPU communication.
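Whether Boltz itself can shard a single prediction across both GPUs depends on its own multi-GPU support, so treat the following as a generic pattern rather than Boltz's API: a minimal two-process PyTorch sketch using the NCCL backend, which rides on NVLink automatically when it is available. The model and inputs are placeholders.

```python
# Minimal two-process, two-GPU inference sketch with the NCCL backend.
# Launch with: torchrun --nproc_per_node=2 this_script.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)              # rank 0 -> GPU 0, rank 1 -> GPU 1

    model = torch.nn.Linear(1024, 1024).cuda(rank).eval()     # placeholder model
    x = torch.randn(8, 1024, device=f"cuda:{rank}")           # placeholder input

    with torch.no_grad():
        y = model(x)

    # Combine a summary statistic from both GPUs over NCCL (NVLink when available).
    s = y.sum()
    dist.all_reduce(s, op=dist.ReduceOp.SUM)
    if rank == 0:
        print("combined output sum:", s.item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```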
Option C: Hybrid Approach (Best for Mixed Workloads)
Use Case: Running Boltz on one GPU and other tools on the second GPU, while allowing direct GPU-to-GPU transfers for lightweight tasks.
Configuration:
- GPU 0: Boltz (e.g., `CUDA_VISIBLE_DEVICES=0`).
- GPU 1: Other tools (e.g., `CUDA_VISIBLE_DEVICES=1`).
- Peer-to-peer transfers: Use NVLink for direct GPU-to-GPU data movement in lightweight tasks (e.g., handing off intermediate results or cached tensors between the GPUs).
Implementation:
- Use `CUDA_VISIBLE_DEVICES=0` for Boltz and `CUDA_VISIBLE_DEVICES=1` for other tools.
- Peer-to-peer access over NVLink is handled by the CUDA driver; a process that can see both devices (e.g., `CUDA_VISIBLE_DEVICES=0,1`) can copy tensors between them directly, as shown below.
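A minimal sketch of what that direct GPU-to-GPU path looks like in PyTorch; the tensor here is a stand-in for whatever intermediate result you want to hand from GPU 0 to GPU 1:

```python
# Move an intermediate result directly between GPUs.
# When peer access is available (NVLink), .to("cuda:1") copies
# GPU-to-GPU without staging through host memory.
import torch

assert torch.cuda.device_count() >= 2
print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

x = torch.randn(4096, 4096, device="cuda:0")   # e.g., produced on GPU 0
y = x.to("cuda:1")                              # direct device-to-device transfer
print(y.device, y.shape)
```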
3. Best Practices for Maximum Efficiency
| Factor | Recommendation |
|---|---|
| GPU Allocation | Use separate GPUs for Boltz and other tools to avoid resource contention. |
| Inter-GPU Communication | Use NVLink for shared GPU workflows (e.g., model parallelism). |
| Software Compatibility | Ensure Boltz and other tools support multi-GPU workflows (e.g., CUDA-aware MPI). |
| VRAM Utilization | Give the VRAM-hungry Boltz workload a dedicated H100 and pack the lighter tools onto the other; monitor headroom on both. |
| Driver Configuration | Install NVIDIA drivers 535+ and ensure CUDA 12.x compatibility. |
4. Tools for Monitoring and Optimization
- nvidia-smi: Monitor GPU utilization, VRAM, and temperature.
- Prometheus + Grafana: Track real-time metrics for GPU usage and latency.
- CUDA Profiler (Nsight): Optimize kernel performance and memory transfers.
- Model Optimization: Use quantization (e.g., 4-bit, 8-bit) for smaller models to reduce VRAM usage.
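For lightweight scripted monitoring (e.g., as the basis of a Prometheus exporter), NVIDIA's NVML Python bindings expose the same counters as `nvidia-smi`. A minimal sketch, assuming the `nvidia-ml-py` (pynvml) package is installed:

```python
# Print per-GPU utilization and VRAM usage via NVML.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(h)
    util = pynvml.nvmlDeviceGetUtilizationRates(h)
    mem = pynvml.nvmlDeviceGetMemoryInfo(h)
    print(f"GPU {i} ({name}): {util.gpu}% util, "
          f"{mem.used / 1024**3:.1f}/{mem.total / 1024**3:.1f} GiB VRAM")
pynvml.nvmlShutdown()
```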
5. Final Recommendation
- Prioritize Separate GPUs: For most use cases, Boltz and other inference tools will benefit from dedicated GPUs to avoid contention.
- Use NVLink for Shared Workloads: Only use it if you're running distributed inference or model parallelism for a single model.
- Monitor Performance: Use `nvidia-smi` and Prometheus/Grafana to track GPU utilization and optimize resource allocation.
By separating the workloads, you ensure maximum efficiency for both Boltz and other tools while leveraging the full potential of your H100 GPUs. 🚀