When using two H100 GPUs (with NVLink interconnect) for Boltz and other inference tools, the goal is to maximize GPU utilization, minimize latency, and keep the different workloads from interfering with one another. Here's a structured approach to configure the GPUs optimally:
1. Key Considerations for H100 GPUs
- NVLink: Enables low-latency, high-bandwidth GPU-to-GPU communication (up to 900 GB/s bidirectional on H100 SXM, roughly 600 GB/s over an NVLink bridge between PCIe cards; several times what PCIe alone provides). Ideal for distributed inference or model parallelism.
- VRAM: Each H100 has 80 GB VRAM, so you can run multiple large models or split a single model across both GPUs.
- CUDA Compatibility: Ensure CUDA 12.x and NVIDIA drivers 535+ are installed for full NVLink support.
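Before picking a layout, it's worth confirming that both GPUs, the CUDA runtime, and peer access are actually visible to your framework. A minimal sketch using PyTorch (assuming PyTorch is installed in the same environment as Boltz):

```python
# Quick sanity check of the two-GPU setup.
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime version:", torch.version.cuda)   # should report 12.x
print("Visible GPUs:", torch.cuda.device_count())    # expect 2

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB VRAM")

# Peer-to-peer access between GPU 0 and GPU 1 indicates NVLink/P2P is usable.
if torch.cuda.device_count() >= 2:
    print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
```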
2. Recommended Configuration Options
Option A: Separate GPUs for Different Tasks (Best for Parallel Workloads)
Use Case: Running Boltz (large model) and other inference tools (e.g., Ollama, OpenWebUI, or smaller models) independently.
Configuration:
- GPU 0: Assign to Boltz (e.g., `CUDA_VISIBLE_DEVICES=0`).
- GPU 1: Assign to other inference tools (e.g., `CUDA_VISIBLE_DEVICES=1`).
- Why:
- Avoids resource contention (VRAM, compute bandwidth).
- Ensures maximum efficiency for each task.
- NVLink is not needed unless you're using model parallelism.
Implementation:
- Use `CUDA_VISIBLE_DEVICES=0` for Boltz and `CUDA_VISIBLE_DEVICES=1` for other tools.
- Ensure both GPUs are recognized via `nvidia-smi`.
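A minimal launch sketch for this layout, pinning each process to its own GPU via `CUDA_VISIBLE_DEVICES`; the Boltz command line and the second serving script are placeholders for whatever you actually run:

```python
# Launch Boltz on GPU 0 and another inference workload on GPU 1,
# each in its own process with its own CUDA_VISIBLE_DEVICES setting.
import os
import subprocess

def launch(cmd, gpu):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)   # this process only sees the chosen GPU
    return subprocess.Popen(cmd, env=env)

boltz_proc = launch(["boltz", "predict", "input.yaml"], gpu=0)   # placeholder invocation
other_proc = launch(["python", "serve_other_model.py"], gpu=1)   # placeholder script

boltz_proc.wait()
other_proc.wait()
```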
Option B: Shared GPUs for a Single Model (Best for Large-Scale Inference)
Use Case: Running a single large model (e.g., Boltz) or distributed inference across both GPUs.
Configuration:
- GPU 0 and GPU 1: Assign to Boltz (e.g., `CUDA_VISIBLE_DEVICES=0,1`).
- Why:
- Leverages NVLink for low-latency communication (critical for distributed training/inference).
- Enables model parallelism (split the model across GPUs).
- Maximizes available VRAM (2 × 80 GB = 160 GB combined).
Implementation:
- Use `CUDA_VISIBLE_DEVICES=0,1` to allocate both GPUs to the same model.
- Use CUDA-aware MPI or a distributed framework (e.g., PyTorch Distributed, Horovod) for inter-GPU communication.
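Whether Boltz itself can shard a single prediction across both GPUs depends on its own multi-GPU support, so treat the following as a generic pattern rather than Boltz's API: a minimal two-process PyTorch sketch using the NCCL backend, which rides on NVLink automatically when it is available. The model and inputs are placeholders.

```python
# Minimal two-process, two-GPU inference sketch with the NCCL backend.
# Launch with: torchrun --nproc_per_node=2 this_script.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)              # rank 0 -> GPU 0, rank 1 -> GPU 1

    model = torch.nn.Linear(1024, 1024).cuda(rank).eval()     # placeholder model
    x = torch.randn(8, 1024, device=f"cuda:{rank}")           # placeholder input

    with torch.no_grad():
        y = model(x)

    # Combine a summary statistic from both GPUs over NCCL (NVLink when available).
    s = y.sum()
    dist.all_reduce(s, op=dist.ReduceOp.SUM)
    if rank == 0:
        print("combined output sum:", s.item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```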
Option C: Hybrid Approach (Best for Mixed Workloads)
Use Case: Running Boltz on one GPU and other tools on the second GPU, while allowing direct GPU-to-GPU transfers for lightweight tasks.
Configuration:
- GPU 0: Boltz (e.g., `CUDA_VISIBLE_DEVICES=0`).
- GPU 1: Other tools (e.g., `CUDA_VISIBLE_DEVICES=1`).
- Peer-to-peer transfers: Use NVLink for direct GPU-to-GPU data movement in lightweight tasks (e.g., handing off intermediate results or cached tensors between the GPUs).
Implementation:
- Use `CUDA_VISIBLE_DEVICES=0` for Boltz and `CUDA_VISIBLE_DEVICES=1` for other tools.
- Peer-to-peer access over NVLink is handled by the CUDA driver; a process that can see both devices (e.g., `CUDA_VISIBLE_DEVICES=0,1`) can copy tensors between them directly, as shown below.
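A minimal sketch of what that direct GPU-to-GPU path looks like in PyTorch; the tensor here is a stand-in for whatever intermediate result you want to hand from GPU 0 to GPU 1:

```python
# Move an intermediate result directly between GPUs.
# When peer access is available (NVLink), .to("cuda:1") copies
# GPU-to-GPU without staging through host memory.
import torch

assert torch.cuda.device_count() >= 2
print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

x = torch.randn(4096, 4096, device="cuda:0")   # e.g., produced on GPU 0
y = x.to("cuda:1")                              # direct device-to-device transfer
print(y.device, y.shape)
```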
3. Best Practices for Maximum Efficiency
| Factor | Recommendation |
|---|---|
| GPU Allocation | Use separate GPUs for Boltz and other tools to avoid resource contention. |
| Inter-GPU Communication | Use NVLink for shared GPU workflows (e.g., model parallelism). |
| Software Compatibility | Ensure Boltz and other tools support multi-GPU workflows (e.g., CUDA-aware MPI). |
| VRAM Utilization | Give the VRAM-hungry Boltz workload a dedicated H100 and pack the lighter tools onto the other; monitor headroom on both. |
| Driver Configuration | Install NVIDIA drivers 535+ and ensure CUDA 12.x compatibility. |
4. Tools for Monitoring and Optimization
- nvidia-smi: Monitor GPU utilization, VRAM, and temperature.
- Prometheus + Grafana: Track real-time metrics for GPU usage and latency.
- CUDA Profiler (Nsight): Optimize kernel performance and memory transfers.
- Model Optimization: Use quantization (e.g., 4-bit, 8-bit) for smaller models to reduce VRAM usage.
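For lightweight scripted monitoring (e.g., as the basis of a Prometheus exporter), NVIDIA's NVML Python bindings expose the same counters as `nvidia-smi`. A minimal sketch, assuming the `nvidia-ml-py` (pynvml) package is installed:

```python
# Print per-GPU utilization and VRAM usage via NVML.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(h)
    util = pynvml.nvmlDeviceGetUtilizationRates(h)
    mem = pynvml.nvmlDeviceGetMemoryInfo(h)
    print(f"GPU {i} ({name}): {util.gpu}% util, "
          f"{mem.used / 1024**3:.1f}/{mem.total / 1024**3:.1f} GiB VRAM")
pynvml.nvmlShutdown()
```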
5. Final Recommendation
- Prioritize Separate GPUs: For most use cases, Boltz and other inference tools will benefit from dedicated GPUs to avoid contention.
- Use NVLink for Shared Workloads: Only use it if you're running distributed inference or model parallelism for a single model.
- Monitor Performance: Use `nvidia-smi` and Prometheus/Grafana to track GPU utilization and optimize resource allocation.
By separating the workloads, you ensure maximum efficiency for both Boltz and other tools while leveraging the full potential of your H100 GPUs. 🚀