When using **two H100 GPUs** (with **NVLink** interconnect) for **Boltz** and **other inference tools**, the goal is to **maximize GPU utilization**, **minimize latency**, and **ensure compatibility** between models. Here's a structured approach to configuring the GPUs optimally:

---

### **1. Key Considerations for H100 GPUs**

- **NVLink**: Enables **low-latency, high-bandwidth communication** between GPUs (up to **900 GB/s** of bidirectional bandwidth per GPU with fourth-generation NVLink). Ideal for **distributed inference** or **model parallelism**.
- **VRAM**: Each H100 has **80 GB VRAM**, so you can run **multiple large models** or **split a single model** across both GPUs.
- **CUDA Compatibility**: Ensure **CUDA 12.x** and **NVIDIA drivers 535+** are installed for full NVLink support.

---

### **2. Recommended Configuration Options**

#### **Option A: Separate GPUs for Different Tasks (Best for Parallel Workloads)**

**Use Case**: Running **Boltz** (large model) and **other inference tools** (e.g., Ollama, OpenWebUI, or smaller models) **independently**.

**Configuration**:
- **GPU 0**: Assign to **Boltz** (e.g., `CUDA_VISIBLE_DEVICES=0`).
- **GPU 1**: Assign to **other inference tools** (e.g., `CUDA_VISIBLE_DEVICES=1`).
- **Why**:
  - Avoids **resource contention** (VRAM, compute bandwidth).
  - Ensures **maximum efficiency** for each task.
  - NVLink is **not needed** unless you're using **model parallelism**.

**Implementation**:
- Use `CUDA_VISIBLE_DEVICES=0` for Boltz and `CUDA_VISIBLE_DEVICES=1` for other tools (see the Option A sketch below).
- Ensure both GPUs are **recognized** via `nvidia-smi`.

#### **Option B: Shared GPUs for a Single Model (Best for Large-Scale Inference)**

**Use Case**: Running **a single large model** (e.g., Boltz) or **distributed inference** across both GPUs.

**Configuration**:
- **GPU 0 and GPU 1**: Assign to **Boltz** (e.g., `CUDA_VISIBLE_DEVICES=0,1`).
- **Why**:
  - Leverages **NVLink** for **low-latency communication** (critical for distributed training/inference).
  - Enables **model parallelism** (splitting the model across GPUs).
  - Maximizes **VRAM utilization** (80 GB per GPU, 160 GB combined).

**Implementation**:
- Use `CUDA_VISIBLE_DEVICES=0,1` to allocate both GPUs to the same model.
- Use **CUDA-aware MPI** or a **distributed framework** (e.g., PyTorch Distributed, Horovod) for inter-GPU communication (see the Option B sketch below).

#### **Option C: Hybrid Approach (Best for Mixed Workloads)**

**Use Case**: Running **Boltz** on one GPU and **other tools** on the second GPU, while keeping the option of **direct GPU-to-GPU transfers** for lightweight tasks.

**Configuration**:
- **GPU 0**: Boltz (e.g., `CUDA_VISIBLE_DEVICES=0`).
- **GPU 1**: Other tools (e.g., `CUDA_VISIBLE_DEVICES=1`).
- **Peer-to-Peer Access**: Use **NVLink** for **CUDA peer-to-peer (P2P) memory access**, so lightweight tasks (e.g., moving tensors or cached activations between GPUs) bypass host memory.

**Implementation**:
- Use `CUDA_VISIBLE_DEVICES=0` for Boltz and `CUDA_VISIBLE_DEVICES=1` for other tools.
- Enable P2P access in your framework (e.g., `cudaDeviceEnablePeerAccess` in CUDA, or check it from PyTorch as in the Option C sketch below).
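**Option A sketch**: a minimal launcher that pins each process to its own GPU by setting `CUDA_VISIBLE_DEVICES` in the child environment. The `boltz predict input.yaml` invocation and the `serve_small_models.py` script are placeholders; substitute your actual commands.

```python
import os
import subprocess

# Pin each workload to its own GPU by overriding CUDA_VISIBLE_DEVICES in the
# child's environment; each child then sees a single GPU as device 0.
def launch(cmd, gpu_index):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return subprocess.Popen(cmd, env=env)

# Hypothetical commands: substitute your real Boltz invocation and serving tool.
boltz = launch(["boltz", "predict", "input.yaml"], gpu_index=0)
other = launch(["python", "serve_small_models.py"], gpu_index=1)

boltz.wait()
other.wait()
```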
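**Option B sketch**: the inter-GPU plumbing in PyTorch runs through NCCL, which automatically uses NVLink when it is present. This is a minimal sketch of that communication layer (an `all_reduce` across both GPUs, launched with `torchrun`), not a Boltz-specific recipe; whether Boltz itself shards across GPUs depends on its own multi-GPU support.

```python
import os
import torch
import torch.distributed as dist

# Minimal two-GPU collective sketch. Launch with:
#   torchrun --nproc_per_node=2 nvlink_allreduce.py
# NCCL routes inter-GPU traffic over NVLink automatically when available,
# so no NVLink-specific configuration is needed here.
def main():
    dist.init_process_group(backend="nccl")      # torchrun supplies rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)            # one process per GPU

    # Stand-in for a sharded tensor in model-parallel inference.
    x = torch.ones(4, device=f"cuda:{local_rank}") * (local_rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)     # crosses NVLink between the H100s
    print(f"rank {dist.get_rank()}: {x.tolist()}")  # both ranks print [3.0, 3.0, 3.0, 3.0]

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```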
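**Option C sketch**: what CUDA actually provides over NVLink is peer-to-peer (P2P) memory access, which lets one GPU read or write the other's VRAM without staging through the host. A small PyTorch check that assumes both GPUs are visible to the process:

```python
import torch

# Check whether GPU 0 and GPU 1 can address each other's memory directly
# (CUDA peer-to-peer). Over NVLink this lets device-to-device copies
# bypass host memory entirely.
assert torch.cuda.device_count() >= 2, "expects two visible GPUs"
print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

# A direct device-to-device copy; with P2P available, PyTorch/CUDA route
# this over NVLink rather than staging through host RAM.
a = torch.randn(1024, 1024, device="cuda:0")
b = a.to("cuda:1")
torch.cuda.synchronize()
print(b.device)  # cuda:1
```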
---

### **3. Best Practices for Maximum Efficiency**

| **Factor** | **Recommendation** |
|-----------------------------|--------------------------------------------------------------------------------------|
| **GPU Allocation** | Use **separate GPUs** for Boltz and other tools to avoid resource contention. |
| **Inter-GPU Communication** | Use **NVLink** for shared GPU workflows (e.g., model parallelism). |
| **Software Compatibility** | Ensure **Boltz** and other tools support **multi-GPU workflows** (e.g., CUDA-aware MPI). |
| **VRAM Utilization** | With two identical 80 GB H100s, give Boltz (the most VRAM-hungry workload) a dedicated GPU and pack smaller models onto the other. |
| **Driver Configuration** | Install **NVIDIA drivers 535+** and ensure **CUDA 12.x** compatibility. |

---

### **4. Tools for Monitoring and Optimization**

- **nvidia-smi**: Monitor GPU utilization, VRAM, and temperature.
- **Prometheus + Grafana**: Track real-time metrics for GPU usage and latency.
- **Nsight Systems / Nsight Compute**: Profile kernel performance and memory transfers.
- **Model Optimization**: Use **quantization** (e.g., 4-bit, 8-bit) for smaller models to reduce VRAM usage.

A minimal programmatic monitoring sketch appears at the end of this note.

---

### **5. Final Recommendation**

- **Prioritize Separate GPUs**: For most use cases, **Boltz** and **other inference tools** will benefit from **dedicated GPUs** to avoid contention.
- **Use NVLink for Shared Workloads**: Only rely on it if you're running **distributed inference** or **model parallelism** for a single model.
- **Monitor Performance**: Use `nvidia-smi` and **Prometheus/Grafana** to track GPU utilization and optimize resource allocation.

By separating the workloads, you ensure **maximum efficiency** for both Boltz and the other tools while leveraging the full potential of your H100 GPUs. 🚀
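**Monitoring sketch**: as a companion to the recommendations in section 4, here is a minimal poller using the `nvidia-ml-py` bindings (`pynvml`), which read the same counters `nvidia-smi` displays. The 10-sample loop and 1-second interval are arbitrary choices for illustration.

```python
import time
import pynvml  # pip install nvidia-ml-py

# Poll utilization and VRAM for both H100s; these are the same counters
# nvidia-smi shows, queried through NVML.
pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(2)]
    for _ in range(10):  # sample for roughly 10 seconds
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)
            print(f"GPU {i}: {util:3d}% util, "
                  f"{mem.used / 2**30:5.1f}/{mem.total / 2**30:.1f} GiB VRAM")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```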