projects/neosphere/ml-server/20250707-main.md

## Notes

## TODOs

- [x] Order cables and transceivers for the 25 Gbit/s NICs (do your research)
- [x] Cable length and cable type: DAC cables.

## Discussion with qwerty

- multi-GPU setup needs extra considerations

1. Interconnect Compatibility
2. Driver & Software Configuration: multi-GPU support
3. Power & Cooling
## Info

- Bjoern Schwalb's number: +49 177 7539 085
- Idea: central LDAP server for user management in the high-performance computing environment, OR use Ansible for user management
## Requirements

**Linux operating system** (we currently run Ubuntu 22.04), since we would ideally like to integrate the server into our Slurm workload scheduler.

We need a connection to our **Qumulo storage** (25 Gbit on our current servers).

GPU architectures: NVIDIA **A100**, **H100**, or **L40S** with at least 48 GB of GPU memory. (These are recommended for many AI applications, among them the program we mainly want to run, boltz-2: [https://github.com/jwohlwend/boltz](https://github.com/jwohlwend/boltz).)

## Tasks estimate

- Unpack the hardware, mount it in the rack, cable it up, assemble it (potentially a custom build?)
- Install and configure Ubuntu 22.04 LTS (24.04 LTS??); I would like to set this up with Ansible so that the other two servers can be managed together as a cluster at the same time. I think Bjoern would like that too, since Ansible is entirely Python-based and they know their way around Python. Alternatively I could simply clone the old machines the way Holger does, which I don't really like.
- Install and configure the 25 Gbit NIC driver (is only one NIC needed? and transceivers?)
- Configure an interface failover bond across the two 25 Gbit ports, one to each of the two 25 Gbit switches (a sketch follows this list)
- Install the NVIDIA drivers: nvidia-smi, CUDA, multi-GPU support
- Adjust fstab so that the Qumulo storage is mounted on the new server via NFS
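
A minimal netplan sketch for that failover bond, assuming Ubuntu's default netplan/systemd-networkd stack; the interface names, addresses, and file name are placeholders and must be replaced with the real ones:

```yaml
# /etc/netplan/60-bond0.yaml -- hypothetical file name
network:
  version: 2
  renderer: networkd
  ethernets:
    ens1f0np0: {}               # first 25 Gbit port (placeholder name)
    ens1f1np1: {}               # second 25 Gbit port (placeholder name)
  bonds:
    bond0:
      interfaces: [ens1f0np0, ens1f1np1]
      parameters:
        mode: active-backup     # pure failover, no LACP needed on the switches
        mii-monitor-interval: 100
        primary: ens1f0np0
      addresses: [10.0.0.50/24] # placeholder address
      routes:
        - to: default
          via: 10.0.0.1         # placeholder gateway
      nameservers:
        addresses: [10.0.0.1]   # placeholder DNS
```

Apply with `netplan try` (it rolls back automatically if the link drops) and check which port is the active slave in `/proc/net/bonding/bond0`.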

## Hardware quote

| Item | Description | Qty | Unit price | Total price |
| --- | --- | --- | --- | --- |
| 1 | HPE DL380a Gen11 4DW CTO Svr | 1 | 6.084,55 € | 6.084,55 € |
| 2 | INT Xeon-G 6526Y CPU for HPE 2.8 GHz - 16 cores - 37.5 MB L3 cache | 2 | 1.429,21 € | 2.858,42 € |
| 3 | HPE 64GB 2Rx4 PC5-5600B-R Smart Kit | 8 | 442,39 € | 3.539,14 € |
| 4 | HPE DL380a Gen11 8SFF x4 U.3 NVMe BC | 1 | 217,72 € | 217,72 € |
| 5 | HPE 960G NVMe RI SFF BC U.3ST V2 MV SSD | 2 | 304,54 € | 609,07 € |
| 6 | BCM 57414 10/25GbE 2p SFP28 Adptr | 2 | 191,62 € | 383,23 € |
| 7 | HPE Smart Hybrid Capacitor w/ 260mm Cbl | 1 | 73,08 € | 73,08 € |
| 8 | HPE MR416i-o Gen11 SPDM Storage Cntlr | 1 | 841,15 € | 841,15 € |
| 9 | BCM 5719 1Gb 4p BASE-T OCP Adptr | 1 | 83,71 € | 83,71 € |
| 10 | **NVIDIA H100 NVL 94GB PCIe Accelerator** | 2 | 31.368,55 € | 62.737,10 € |
| 11 | HPE 1800W-2200W FS Ti Ht Plg PS Kit | 4 | 280,36 € | 1.121,42 € |
| 12 | HPE iLO Adv 1-svr Lic 3yr Support | 1 | 325,82 € | 325,82 € |
| 13 | HPE Cmp Cloud Mgmt Srv FIO Enablement | 1 | 0,86 € | 0,86 € |
| 14 | HPE AL STG SVR 4120 OCP2 Upg Cbl Kit | 1 | 58,42 € | 58,42 € |
| 15 | HPE DL380a Gen11 OROC Prim TM Cbl Kit | 1 | 98,83 € | 98,83 € |
| 16 | HPE DL380a Gen11 GPU 16p v3 FIO Cbl | 1 | 139,72 € | 139,72 € |
| 17 | HPE DL3XX Gen11 2U Com Bezel Kit | 1 | 27,47 € | 27,47 € |
| 18 | HPE Apollo 4200 Gen10+ CMA | 1 | 74,22 € | 74,22 € |
| 19 | HPE AL STG SVR 4120 Stnd Heat Sink Kit | 2 | 55,72 € | 111,43 € |
| 20 | HPE DL3XX Gen11 Ball Bearing Rail 8 Kit | 1 | 112,28 € | 112,28 € |
| 21 | HPE COM Std 3yr Up ProLiant SaaS | 1 | 354,47 € | 354,47 € |
| 22 | HPE iLO Advanced Non Blade Support HPE **5Y Tech Care Essential** SVC | 1 | 26,16 € | 26,16 € |
| 23 | HPE DL380a Gen11 Support - belongs to item 22 | 1 | 12.048,30 € | 12.048,30 € |

projects/neosphere/ml-server/20250708-qwerty-summary.md

### **Comprehensive Project Plan: AI Server for Boltz with H100/L40 GPUs & Qumulo Storage**

---

#### **1. Hardware & Infrastructure**

- **Server**: HPE DL3XX Gen11 (512 GB RAM, 25 Gbit NICs, dual GPU slots).
- **GPUs**:
  - **Option 1**: 2x H100 NVL (94 GB VRAM each, NVLink for inter-GPU communication).
  - **Option 2**: 2x L40 (48 GB VRAM each, PCIe 4.0 for inter-GPU communication).
- **Storage**: Qumulo cluster (300 TB) mounted via NFS on each compute node.
- **Networking**:
  - 25 Gbit bonding (active-backup mode) for redundancy and high throughput.
  - 25 Gbit transceivers (SFP28) for NICs and switches.
- **Power/Cooling**: Ensure the PSUs support the dual-GPU power draw (e.g., H100 NVL: up to ~400 W each, L40: up to ~300 W each).

---

#### **2. OS & Software Stack**

- **OS**: Ubuntu 24.04 LTS (latest stable release for H100/L40 support).
- **Automation**:
  - Use **Ansible** for OS installation, GPU driver setup, and cluster management (3 nodes).
  - Create playbooks for (a skeleton sketch follows this section):
    - Ubuntu 24.04 installation.
    - NVIDIA driver + CUDA toolkit.
    - 25 Gbit NIC bonding.
    - NFS mount configuration for Qumulo.
- **CUDA/ROCm**: Install latest CUDA toolkit for NVIDIA GPUs (or ROCm for AMD).
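
A minimal skeleton of how that playbook could be structured, assuming hypothetical role names and an inventory group `mlservers`; it sketches the intended layout rather than a finished implementation:

```yaml
# site.yml -- hypothetical top-level playbook
- name: Configure ML/GPU servers
  hosts: mlservers          # inventory group (assumption)
  become: true
  roles:
    - common                # base packages, users, sshd hardening
    - network_bond          # 25 Gbit active-backup bond via netplan
    - nvidia_gpu            # driver, CUDA toolkit, nvidia-smi sanity check
    - qumulo_nfs            # NFS client packages and mount entries
```

Running it against an inventory that lists all three hosts keeps the existing servers and the new one configured identically.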

---

#### **3. Network Configuration**

- **Firewall**: Deploy **OPNsense** to:
  - Enforce Qumulo/NFS access controls.
  - Monitor traffic between compute nodes and storage.
- **Bonding**: Configure the 25 Gbit NICs with `bonding-mode=active-backup` for failover.
- **NFS**: Optimize Qumulo NFS mounts with `noatime`, `async`, and `tcp` for large datasets.
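
As a sketch, that mount could be managed with the `ansible.posix.mount` module, which also writes the fstab entry; the export path and mount point are placeholders:

```yaml
# Hypothetical task from the qumulo_nfs role
- name: Mount Qumulo export via NFS
  ansible.posix.mount:
    src: qumulo.example.internal:/ml-data   # placeholder export
    path: /mnt/qumulo                       # placeholder mount point
    fstype: nfs
    # large read/write sizes and TCP for throughput; _netdev waits for the network
    opts: rw,hard,proto=tcp,noatime,rsize=1048576,wsize=1048576,_netdev
    state: mounted
```

On 25 Gbit links, `nconnect` (multiple TCP connections per mount) may also be worth testing if the kernel and the Qumulo cluster support it.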

---

#### **4. GPU & Multi-GPU Setup**

- **Driver Installation**:
  - Install NVIDIA drivers (e.g., `nvidia-driver-535`) and verify with `nvidia-smi` (see the sketch at the end of this section).
- **Multi-GPU Support**:
  - For H100: Enable **NVLink** for low-latency inter-GPU communication.
  - For L40: Use PCIe 4.0 for maximum bandwidth.
- **Boltz Compatibility**: Ensure Boltz is configured for multi-GPU use (CUDA-aware MPI or distributed memory).
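
A hedged sketch of the driver tasks, assuming Ubuntu's packaged server driver; the package names (`nvidia-driver-535-server`, `nvidia-utils-535-server`) are one plausible choice and should be checked against the release actually recommended for the H100 NVL:

```yaml
# Hypothetical tasks from the nvidia_gpu role
- name: Install NVIDIA server driver and utilities
  ansible.builtin.apt:
    name:
      - nvidia-driver-535-server
      - nvidia-utils-535-server
    update_cache: true

- name: List the GPUs visible to the driver
  ansible.builtin.command: nvidia-smi --list-gpus
  register: gpu_list
  changed_when: false

- name: Fail if fewer than two GPUs are detected
  ansible.builtin.assert:
    that:
      - gpu_list.stdout_lines | length >= 2
```

In practice the node needs a reboot after the driver install before `nvidia-smi` reports both cards.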

---

#### **5. Storage & Performance**

- **RAM**: 512 GB is sufficient for most workloads; monitor with `htop` or `free -h`.
- **Disk I/O**: Use NVMe SSDs for `/tmp` and `/var` to reduce latency.
- **Monitoring**:
  - Track GPU utilization (`nvidia-smi`), network throughput (`iftop`), and storage I/O (`iostat`).
  - Deploy **Prometheus + Grafana** for centralized metrics (a minimal scrape config follows below).
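
A minimal Prometheus scrape configuration, assuming node_exporter and NVIDIA's dcgm-exporter run on the server on their default ports (9100 and 9400); the hostname is a placeholder:

```yaml
# prometheus.yml (fragment) -- a sketch, not a complete config
scrape_configs:
  - job_name: node          # CPU, RAM, disk, network metrics via node_exporter
    static_configs:
      - targets: ["mlserver01.example.internal:9100"]
  - job_name: dcgm          # GPU utilization, VRAM, temperature via dcgm-exporter
    static_configs:
      - targets: ["mlserver01.example.internal:9400"]
```

Ready-made Grafana dashboards exist for both exporters and can be imported instead of being built from scratch.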

---

#### **6. Security & Monitoring**

- **Security**:
  - Enable **SELinux/AppArmor** for access control.
  - Regularly back up configurations and critical data.
- **Logging**: Set up an ELK stack for centralized logging.
- **Disaster Recovery**: Define steps for GPU/NIC failure, including hot-swappable components.

---

#### **7. Additional Tasks**

- **Documentation**: Record all configurations (Ansible playbooks, firewall rules, GPU settings).
- **Testing**: Validate NFS performance and GPU utilization with sample datasets.
- **Optimization**: Tune Boltz for memory efficiency and parallel processing.

---

### **Summary**

This project integrates high-performance hardware (H100/L40 GPUs, 25 Gbit networking) with Qumulo storage to run Boltz efficiently. Key steps include Ubuntu 24.04 setup, Ansible automation, GPU driver installation, NFS optimization, and security monitoring. Prioritize multi-GPU communication, RAM management, and disaster recovery to ensure reliability for large-scale data analysis.


projects/neosphere/ml-server/20250708-qwerty-talk.md

Your task list is comprehensive, but there are additional considerations to ensure smooth operation, especially with high-performance hardware and large datasets. Here's a breakdown of what you might be missing:

---

### **1. Hardware & Networking Considerations**

- **PCIe Slot Availability**:
  Ensure the server has enough PCIe slots (and lanes) for both GPUs (H100/L40) and the 25 Gbit NICs. Dual H100 NVL cards may also need adjacent slots if the NVLink bridge is to be used.
- **Transceivers**:
  If using 25 Gbit SFP28 ports, confirm that the NICs and switches support the required transceivers or DAC cables (SFP28 for 25 Gbit).
- **Power & Cooling**:
  Verify the PSUs can handle the combined power draw of dual GPUs (H100 NVL: up to ~400 W each, L40: up to ~300 W each). Ensure adequate cooling for the server chassis.

---

### **2. OS & Software Stack**

- **Ubuntu Version**:
  While Ubuntu 22.04 is stable, **24.04 is recommended** for better support of newer hardware (e.g., H100 GPUs) and security updates.
- **Ansible Playbooks**:
  Create reusable Ansible playbooks for:
  - OS installation (e.g., Ubuntu 24.04).
  - GPU driver installation (NVIDIA).
  - Network bonding (e.g., `bonding-mode=active-backup`; a sketch follows this section).
  - NFS mount configuration.
- **CUDA & ROCm**:
  Install the latest CUDA toolkit (for NVIDIA GPUs) or ROCm (for AMD GPUs) to ensure compatibility with Boltz.
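
One way the bonding playbook could look, assuming the netplan definition is kept as a template inside the role; the file and template names are placeholders:

```yaml
# Hypothetical tasks from a network_bond role
- name: Deploy netplan definition for the 25 Gbit active-backup bond
  ansible.builtin.template:
    src: 60-bond0.yaml.j2          # placeholder template (see the netplan sketch in the 2025-07-07 notes)
    dest: /etc/netplan/60-bond0.yaml
    mode: "0600"
  notify: Apply netplan

# handlers/main.yml of the same role
- name: Apply netplan
  ansible.builtin.command: netplan apply
```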

---

### **3. Network Configuration**

- **Bonding Mode**:
  Use `bonding-mode=active-backup` for failover redundancy; this mode needs no special switch configuration. LACP (802.3ad) would only be required for an aggregating bond mode, not for active-backup.
- **Firewall Rules**:
  Configure OPNsense to allow traffic between:
  - The server and Qumulo storage (NFS).
  - The server and other compute nodes (if using Ansible for clustering).
- **Qumulo NFS Optimization**:
  Use NFS mount options like `noatime`, `async`, or `tcp` for large datasets. Consider setting up a cron job to check the NFS mount status (a sketch follows below).
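
A sketch of that check as an Ansible-managed cron job; the mount point is a placeholder:

```yaml
# Hypothetical task: remount the Qumulo export if it has disappeared
- name: Check Qumulo NFS mount every 5 minutes
  ansible.builtin.cron:
    name: check-qumulo-mount
    user: root
    minute: "*/5"
    job: "mountpoint -q /mnt/qumulo || mount /mnt/qumulo"
```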

---

### **4. GPU & Multi-GPU Support**

- **Driver Installation**:
  Install the latest NVIDIA drivers (e.g., via the `nvidia-driver-535` package) and verify that `nvidia-smi` works.
- **Multi-GPU Configuration**:
  - For **H100 GPUs**: Enable NVLink (if supported) for faster inter-GPU communication.
  - For **L40 GPUs**: Use PCIe 4.0 for optimal bandwidth.
- **CUDA Multi-GPU Support**:
  Ensure Boltz is configured to use multiple GPUs (e.g., via CUDA-aware MPI or distributed memory frameworks).

---

### **5. Storage & Performance**

- **RAM Sufficiency**:
  512 GB RAM is sufficient for most workloads, but monitor memory usage with tools like `htop` or `free -h`. Optimize Boltz to minimize memory overhead (e.g., batch processing).
- **Disk I/O**:
  Use NVMe SSDs for temporary storage (e.g., `/tmp`, `/var`) to reduce disk contention.

---

### **6. Security & Monitoring**

- **Centralized Logging**:
  Set up a centralized logging system (e.g., an ELK stack) to monitor server activity and GPU utilization.
- **SELinux/AppArmor**:
  Enable security modules (e.g., AppArmor) to restrict access to critical resources.
- **Prometheus + Grafana**:
  Monitor GPU utilization (`nvidia-smi`), network throughput (`iftop`), and storage I/O (`iostat`).
---
|
||||
|
||||
### **7. Additional Tasks**
|
||||
- **Backup & Recovery**:
|
||||
Set up regular backups for critical data (e.g., Boltz configurations, NFS mounts).
|
||||
- **Disaster Recovery Plan**:
|
||||
Define steps for hardware failure (e.g., GPU replacement, network redundancy).
|
||||
- **Documentation**:
|
||||
Document all configurations (Ansible playbooks, firewall rules, GPU settings) for future reference.
|
||||
|
||||
---
|
||||
|
||||
### **Summary of Missing Items**
|
||||
| Category | Missing Tasks |
|
||||
|--------------------|-------------------------------------------------------------------------------|
|
||||
| **Hardware** | PCIe slot availability, power/cooling, transceiver compatibility |
|
||||
| **Networking** | LACP support for bonding, OPNsense firewall rules, NFS optimization |
|
||||
| **GPU Setup** | NVLink/NVMe configuration, CUDA multi-GPU support, driver compatibility |
|
||||
| **Storage** | Disk I/O optimization, RAM monitoring, temporary storage setup |
|
||||
| **Security** | SELinux/AppArmor, centralized logging, access controls |
|
||||
| **Monitoring** | Prometheus/Grafana, regular backups, disaster recovery plan |
|
||||

projects/neosphere/ml-server/20250711-qwerty_multi-GPU.md

### **Multi-GPU Setup: Boltz vs. Ollama+OpenWebUI**

To maximize GPU utilization while running **Boltz** (likely a large-scale AI model) and **Ollama+OpenWebUI** (a smaller, lightweight LLM inference tool), here's a structured approach:

---

### **1. Separate GPUs for Different Workloads**

**Best Practice**:

- **Use separate GPUs** for **Boltz** and **Ollama+OpenWebUI**.
- **Why**:
  - **Boltz** likely requires **high VRAM** (e.g., 94 GB for the H100 NVL) and **low-latency inter-GPU communication** (NVLink) for distributed tasks.
  - **Ollama+OpenWebUI** uses **smaller models** (e.g., 7B or less) and **low VRAM** (e.g., 16–32 GB).
  - Separating them avoids **resource contention** (e.g., VRAM, compute bandwidth) and ensures each tool gets optimal performance.

**Implementation**:

- Assign the **H100** to **Boltz** (via `CUDA_VISIBLE_DEVICES=0`).
- Assign the **L40** to **Ollama+OpenWebUI** (via `CUDA_VISIBLE_DEVICES=1`; a Compose sketch follows this list).
- Ensure both GPUs are **recognized and functional** via `nvidia-smi`.
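
If Ollama and OpenWebUI are run as containers (an assumption, the notes do not say how they are deployed), the GPU pinning can also be expressed in a Compose file instead of exporting `CUDA_VISIBLE_DEVICES` by hand; the image tag and device index are illustrative:

```yaml
# docker-compose.yml (sketch) -- expose only one host GPU to the Ollama container
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]   # host GPU index 1 (the L40 in this scenario)
              capabilities: [gpu]

volumes:
  ollama:
```

This requires the NVIDIA Container Toolkit on the host; Boltz then keeps host GPU 0 to itself (e.g., `CUDA_VISIBLE_DEVICES=0` in its environment).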

---

### **2. Shared GPU for Same Workload**

**Use Case**:

- If both tools **require the same GPU** (e.g., for a single inference task or model parallelism), use **NVLink** for **low-latency communication**.
- **Why**:
  - NVLink (H100) or PCIe 4.0 (L40) enables **cross-GPU data transfer** for distributed inference or model parallelism.
  - Requires **CUDA-aware MPI** or **distributed memory frameworks** (e.g., PyTorch Distributed, Horovod).

**Implementation**:

- Configure both tools to use **both GPUs** (e.g., `CUDA_VISIBLE_DEVICES=0,1`).
- Use **NVLink** (for H100) or **PCIe 4.0** (for L40) to minimize latency.
- Ensure **Boltz** and **Ollama** are compatible with **multi-GPU workflows** (e.g., model sharding, pipeline parallelism).

---

### **3. Key Considerations**

| **Factor** | **Recommendation** |
| --- | --- |
| **GPU Allocation** | Use **separate GPUs** for Boltz and Ollama+OpenWebUI to avoid resource contention. |
| **Inter-GPU Communication** | Use **NVLink** (H100) or **PCIe 4.0** (L40) for shared-GPU workflows. |
| **Software Compatibility** | Ensure **Boltz** and **Ollama** support multi-GPU setups (e.g., CUDA-aware MPI). |
| **VRAM Utilization** | Allocate the **H100** to Boltz (high VRAM) and the **L40** to Ollama (low VRAM). |
| **Driver Configuration** | Install **NVIDIA drivers 535+** and ensure **CUDA 12.x** compatibility. |

---

### **4. Optimal Workflow**

- **Separate GPUs**:
  - **Boltz**: Use the H100 with **NVLink** for distributed training/inference.
  - **Ollama+OpenWebUI**: Use the L40 for lightweight LLM inference.
  - **Advantages**: Maximized VRAM, reduced latency, and no resource contention.
- **Shared GPU**:
  - **Boltz**: Use the H100 for large-scale tasks.
  - **Ollama+OpenWebUI**: Use the H100 for lightweight tasks (e.g., model serving).
  - **Advantages**: Single-GPU utilization, but requires careful resource management.

---

### **5. Final Advice**

- **Prioritize Separate GPUs**: For most use cases, **Boltz** and **Ollama+OpenWebUI** will benefit from **dedicated GPUs**.
- **NVLink is Optional**: Only use it if you need **cross-GPU communication** for advanced workflows (e.g., model parallelism).
- **Monitor Performance**: Use `nvidia-smi` and **Prometheus/Grafana** to track GPU utilization, VRAM, and latency.

By separating the workloads, you ensure **maximum efficiency** for both tools while leveraging the full potential of your hardware. 🚀

---

When using **two H100 GPUs** (with **NVLink** interconnect) for **Boltz** and **other inference tools**, the goal is to **maximize GPU utilization**, **minimize latency**, and **ensure compatibility** between models. Here's a structured approach to configure the GPUs optimally:

---

### **1. Key Considerations for H100 GPUs**

- **NVLink**: Enables **low-latency, high-bandwidth communication** between GPUs (up to 600 GB/s over the H100 NVL bridge, far beyond PCIe). Ideal for **distributed inference** or **model parallelism**.
- **VRAM**: Each H100 NVL has **94 GB VRAM**, so you can run **multiple large models** or **split a single model** across both GPUs.
- **CUDA Compatibility**: Ensure **CUDA 12.x** and **NVIDIA drivers 535+** are installed for full NVLink support.

---

### **2. Recommended Configuration Options**

#### **Option A: Separate GPUs for Different Tasks (Best for Parallel Workloads)**

**Use Case**: Running **Boltz** (large model) and **other inference tools** (e.g., Ollama, OpenWebUI, or smaller models) **independently**.

**Configuration**:

- **GPU 0**: Assign to **Boltz** (e.g., `CUDA_VISIBLE_DEVICES=0`).
- **GPU 1**: Assign to **other inference tools** (e.g., `CUDA_VISIBLE_DEVICES=1`).
- **Why**:
  - Avoids **resource contention** (VRAM, compute bandwidth).
  - Ensures **maximum efficiency** for each task.
  - NVLink is **not needed** unless you're using **model parallelism**.

**Implementation**:

- Use `CUDA_VISIBLE_DEVICES=0` for Boltz and `CUDA_VISIBLE_DEVICES=1` for other tools.
- Ensure both GPUs are **recognized** via `nvidia-smi`.

#### **Option B: Shared GPUs for a Single Model (Best for Large-Scale Inference)**

**Use Case**: Running **a single large model** (e.g., Boltz) or **distributed inference** across both GPUs.

**Configuration**:

- **GPU 0 and GPU 1**: Assign to **Boltz** (e.g., `CUDA_VISIBLE_DEVICES=0,1`).
- **Why**:
  - Leverages **NVLink** for **low-latency communication** (critical for distributed training/inference).
  - Enables **model parallelism** (splitting the model across GPUs).
  - Maximizes **VRAM utilization** (94 GB per GPU).

**Implementation**:

- Use `CUDA_VISIBLE_DEVICES=0,1` to allocate both GPUs to the same model.
- Use **CUDA-aware MPI** or **distributed memory frameworks** (e.g., PyTorch Distributed, Horovod) for communication.

#### **Option C: Hybrid Approach (Best for Mixed Workloads)**

**Use Case**: Running **Boltz** on one GPU and **other tools** on the second GPU, while **allowing peer-to-peer access** for lightweight tasks.

**Configuration**:

- **GPU 0**: Boltz (e.g., `CUDA_VISIBLE_DEVICES=0`).
- **GPU 1**: Other tools (e.g., `CUDA_VISIBLE_DEVICES=1`).
- **Peer-to-Peer Access**: Use **NVLink** for direct GPU-to-GPU memory access in lightweight tasks (e.g., serving smaller models or caching).

**Implementation**:

- Use `CUDA_VISIBLE_DEVICES=0` for Boltz and `CUDA_VISIBLE_DEVICES=1` for other tools.
- Verify that peer-to-peer access over NVLink is available (e.g., check the topology with `nvidia-smi topo -m`).

---

### **3. Best Practices for Maximum Efficiency**

| **Factor** | **Recommendation** |
| --- | --- |
| **GPU Allocation** | Use **separate GPUs** for Boltz and other tools to avoid resource contention. |
| **Inter-GPU Communication** | Use **NVLink** for shared-GPU workflows (e.g., model parallelism). |
| **Software Compatibility** | Ensure **Boltz** and other tools support **multi-GPU workflows** (e.g., CUDA-aware MPI). |
| **VRAM Utilization** | With two identical H100 NVL cards, dedicate one to Boltz (which needs the full 94 GB) and the other to the lighter inference tools. |
| **Driver Configuration** | Install **NVIDIA drivers 535+** and ensure **CUDA 12.x** compatibility. |

---

### **4. Tools for Monitoring and Optimization**

- **nvidia-smi**: Monitor GPU utilization, VRAM, and temperature.
- **Prometheus + Grafana**: Track real-time metrics for GPU usage and latency.
- **CUDA Profiler (Nsight)**: Optimize kernel performance and memory transfers.
- **Model Optimization**: Use **quantization** (e.g., 4-bit, 8-bit) for smaller models to reduce VRAM usage.

---

### **5. Final Recommendation**

- **Prioritize Separate GPUs**: For most use cases, **Boltz** and **other inference tools** will benefit from **dedicated GPUs** to avoid contention.
- **Use NVLink for Shared Workloads**: Only use it if you're running **distributed inference** or **model parallelism** for a single model.
- **Monitor Performance**: Use `nvidia-smi` and **Prometheus/Grafana** to track GPU utilization and optimize resource allocation.

By separating the workloads, you ensure **maximum efficiency** for both Boltz and other tools while leveraging the full potential of your H100 GPUs. 🚀