When using **two H100 GPUs** (with **NVLink** interconnect) for **Boltz** and **other inference tools**, the goal is to **maximize GPU utilization**, **minimize latency**, and **ensure compatibility** between models. Here's a structured approach to configure the GPUs optimally:
---
### **1. Key Considerations for H100 GPUs**

- **NVLink**: Enables **low-latency, high-bandwidth communication** between GPUs (up to 900 GB/s of bidirectional bandwidth with fourth-generation NVLink). Ideal for **distributed inference** or **model parallelism**.
- **VRAM**: Each H100 has **80 GB of VRAM**, so you can run **multiple large models** or **split a single model** across both GPUs.
- **CUDA Compatibility**: Ensure **CUDA 12.x** and **NVIDIA driver 535+** are installed for full NVLink support.

---
### **2. Recommended Configuration Options**
#### **Option A: Separate GPUs for Different Tasks (Best for Parallel Workloads)**

**Use Case**: Running **Boltz** (large model) and **other inference tools** (e.g., Ollama, OpenWebUI, or smaller models) **independently**.

**Configuration**:

- **GPU 0**: Assign to **Boltz** (e.g., `CUDA_VISIBLE_DEVICES=0`).
- **GPU 1**: Assign to **other inference tools** (e.g., `CUDA_VISIBLE_DEVICES=1`).
- **Why**:
  - Avoids **resource contention** (VRAM, compute bandwidth).
  - Ensures **maximum efficiency** for each task.
  - NVLink is **not needed** unless you're using **model parallelism**.
**Implementation**:

- Use `CUDA_VISIBLE_DEVICES=0` for Boltz and `CUDA_VISIBLE_DEVICES=1` for other tools.
- Ensure both GPUs are **recognized** via `nvidia-smi`.
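The per-process pinning above can be sketched with a small Python launcher. The placeholder command below just echoes what the child sees; in practice it would be the real Boltz or Ollama launch command:

```python
import os
import subprocess
import sys

def launch_on_gpu(cmd, gpu_id):
    """Spawn `cmd` as a child process with only one GPU visible to it.

    `cmd` is a placeholder here; a real deployment would pass the
    actual Boltz or other-tool launch command.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.Popen(cmd, env=env)

# Each child sees exactly one physical GPU as its CUDA device 0.
show = [sys.executable, "-c",
        "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
procs = [launch_on_gpu(show, 0),   # would be Boltz
         launch_on_gpu(show, 1)]   # would be the other tools
for p in procs:
    p.wait()
```

Because `CUDA_VISIBLE_DEVICES` is set per child, neither process can even see the other's GPU, which is what prevents contention.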
#### **Option B: Shared GPUs for a Single Model (Best for Large-Scale Inference)**

**Use Case**: Running **a single large model** (e.g., Boltz) or **distributed inference** across both GPUs.

**Configuration**:

- **GPU 0 and GPU 1**: Assign both to **Boltz** (e.g., `CUDA_VISIBLE_DEVICES=0,1`).
- **Why**:
  - Leverages **NVLink** for **low-latency communication** (critical for distributed training/inference).
  - Enables **model parallelism** (split the model across GPUs).
  - Maximizes **combined VRAM** (2 × 80 GB).
**Implementation**:

- Use `CUDA_VISIBLE_DEVICES=0,1` to expose both GPUs to the same process.
- Use **CUDA-aware MPI** or a **distributed framework** (e.g., PyTorch Distributed, Horovod) for inter-GPU communication.
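The frameworks above handle the actual sharding, but the core idea of model parallelism, splitting consecutive layers across the two cards, can be illustrated with a toy greedy partitioner (the layer sizes are hypothetical, and this is a stand-in for what tools like Accelerate's `device_map="auto"` do):

```python
def split_layers(layer_sizes_gb, n_gpus=2):
    """Greedily assign consecutive layers to GPUs by memory footprint.

    Toy illustration of pipeline/model parallelism: keeping shards
    contiguous means only adjacent shards exchange activations, so
    cross-GPU traffic stays on the NVLink between the two H100s.
    """
    budget = sum(layer_sizes_gb) / n_gpus   # even VRAM target per GPU
    assignment, gpu, used = [], 0, 0.0
    for size in layer_sizes_gb:
        if used + size > budget and gpu < n_gpus - 1:
            gpu += 1                        # spill to the next GPU
            used = 0.0
        assignment.append(gpu)
        used += size
    return assignment

# Hypothetical per-layer footprints (GB) for a large model:
sizes = [10, 10, 10, 10, 10, 10, 10, 10]
print(split_layers(sizes))  # -> [0, 0, 0, 0, 1, 1, 1, 1]
```

Real frameworks also account for activations, KV caches, and communication cost, but the contiguous-split principle is the same.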
#### **Option C: Hybrid Approach (Best for Mixed Workloads)**

**Use Case**: Running **Boltz** on one GPU and **other tools** on the second GPU, while keeping **peer-to-peer access** available for lightweight cross-GPU tasks.

**Configuration**:

- **GPU 0**: Boltz (e.g., `CUDA_VISIBLE_DEVICES=0`).
- **GPU 1**: Other tools (e.g., `CUDA_VISIBLE_DEVICES=1`).
- **Peer-to-Peer Access**: NVLink allows direct **GPU-to-GPU memory access** (P2P) for lightweight tasks (e.g., serving smaller models or sharing cached tensors).
**Implementation**:

- Use `CUDA_VISIBLE_DEVICES=0` for Boltz and `CUDA_VISIBLE_DEVICES=1` for other tools.
- Verify the NVLink topology with `nvidia-smi topo -m`; peer access itself is enabled per process through the CUDA P2P API (e.g., `cudaDeviceEnablePeerAccess`).
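As a sketch, the matrix printed by `nvidia-smi topo -m` can be scanned for NVLink (`NV#`) entries. The sample output below is illustrative, not captured from a real system:

```python
# Illustrative sample of `nvidia-smi topo -m` output for two H100s;
# "NV18" means the pair is connected by 18 NVLink links.
SAMPLE_TOPO = """\
        GPU0    GPU1
GPU0     X      NV18
GPU1    NV18     X
"""

def nvlink_connected(topo_matrix):
    """Return True if any GPU pair in the topology matrix is NVLink-connected."""
    for row in topo_matrix.splitlines()[1:]:   # skip the header row
        cells = row.split()[1:]                # skip the row label
        if any(cell.startswith("NV") for cell in cells):
            return True
    return False

print(nvlink_connected(SAMPLE_TOPO))  # -> True
```

If the matrix shows only `PIX`/`PHB`/`SYS` entries, the GPUs communicate over PCIe instead, and Option B's performance expectations should be revised downward.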
---
### **3. Best Practices for Maximum Efficiency**

| **Factor** | **Recommendation** |
|---|---|
| **GPU Allocation** | Use **separate GPUs** for Boltz and other tools to avoid resource contention. |
| **Inter-GPU Communication** | Use **NVLink** for shared-GPU workflows (e.g., model parallelism). |
| **Software Compatibility** | Ensure **Boltz** and other tools support **multi-GPU workflows** (e.g., CUDA-aware MPI). |
| **VRAM Utilization** | Dedicate one H100 to Boltz (VRAM-heavy) and the other to lighter inference tools. |
| **Driver Configuration** | Install **NVIDIA driver 535+** and ensure **CUDA 12.x** compatibility. |
---
### **4. Tools for Monitoring and Optimization**

- **nvidia-smi**: Monitor GPU utilization, VRAM, and temperature.
- **Prometheus + Grafana**: Track real-time metrics for GPU usage and latency.
- **Nsight (CUDA profiler)**: Optimize kernel performance and memory transfers.
- **Model Optimization**: Use **quantization** (e.g., 4-bit, 8-bit) for smaller models to reduce VRAM usage.
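For lightweight scripted monitoring, `nvidia-smi`'s CSV query output is easy to parse. The sample string below is illustrative rather than real telemetry:

```python
import csv
import io

# Format produced by:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#              --format=csv,noheader,nounits
SAMPLE = "0, 87, 61440\n1, 12, 8192\n"  # illustrative values, not real telemetry

def parse_gpu_stats(text):
    """Turn the CSV query output into a list of per-GPU dicts."""
    rows = csv.reader(io.StringIO(text))
    return [
        {"gpu": int(idx), "util_pct": int(util), "mem_used_mib": int(mem)}
        for idx, util, mem in rows
    ]

for stats in parse_gpu_stats(SAMPLE):
    print(stats)
```

A cron job or exporter can feed these dicts straight into Prometheus to drive the Grafana dashboards mentioned above.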
---
### **5. Final Recommendation**

- **Prioritize Separate GPUs**: For most use cases, **Boltz** and **other inference tools** benefit from **dedicated GPUs** to avoid contention.
- **Use NVLink for Shared Workloads**: Reserve it for **distributed inference** or **model parallelism** on a single model.
- **Monitor Performance**: Use `nvidia-smi` and **Prometheus/Grafana** to track GPU utilization and tune resource allocation.

By separating the workloads, you ensure **maximum efficiency** for both Boltz and other tools while leveraging the full potential of your H100 GPUs. 🚀