This commit is contained in:
Petar Cubela
2025-07-15 15:36:24 +02:00
parent f6398bf45c
commit 252a91dbcc
162 changed files with 4584 additions and 54 deletions

@@ -0,0 +1,89 @@
Your task list is comprehensive, but there are additional considerations to ensure smooth operation, especially with high-performance hardware and large datasets. Here's a breakdown of what you might be missing:
---
### **1. Hardware & Networking Considerations**
- **PCIe Slot Availability**:
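Ensure the server has enough PCIe slots and lanes for both GPUs (H100/L40) and the 25 Gbit NICs. NVLink does not depend on the motherboard's PCIe version: a pair of H100 PCIe cards is linked with NVLink bridges, which requires the cards to sit in suitably spaced adjacent slots, while the L40 has no NVLink and communicates over PCIe only.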
- **Transceivers**:
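25 Gbit Ethernet ports use the SFP28 form factor (SFP+ tops out at 10 Gbit, and QSFP+ carries 40 Gbit). Confirm that the NICs and switches accept the same SFP28 transceivers or DAC cables, and check the vendors' compatibility lists, since some switches reject third-party optics.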
- **Power & Cooling**:
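Verify the PSU can handle the combined power draw of dual GPUs (e.g., roughly 350 W per H100 PCIe card or 300 W per L40; check the datasheet for the exact model) on top of CPUs, drives, and NICs. Ensure the chassis provides adequate airflow, since data-center cards of this class are typically passively cooled.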
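
The power check above is simple arithmetic, and a small script makes the assumptions explicit. The following is a minimal sketch, not a sizing tool: the component list, the wattages, and the 20% headroom are illustrative placeholders to be replaced with datasheet values for the actual build, and `Component` and `required_psu_watts` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    watts: float  # assumed peak draw per unit, taken from the vendor datasheet
    count: int = 1


def required_psu_watts(components, headroom=0.2):
    """Return the summed peak draw plus a fractional safety headroom."""
    total = sum(c.watts * c.count for c in components)
    return total * (1 + headroom)


if __name__ == "__main__":
    # All figures below are illustrative assumptions, not measurements.
    build = [
        Component("GPU", 350, count=2),
        Component("CPU", 300, count=2),
        Component("NICs, drives, fans, RAM", 250),
    ]
    print(f"Suggested PSU capacity: {required_psu_watts(build):.0f} W")
```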
---
### **2. OS & Software Stack**
- **Ubuntu Version**:
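Ubuntu 22.04 LTS and 24.04 LTS both support this hardware with current NVIDIA drivers. **24.04 is the better default for a new install** because of its newer kernel and longer remaining support window, but first confirm that the required driver and CUDA versions are available for it.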
- **Ansible Playbooks**:
Create reusable Ansible playbooks for:
- OS installation (e.g., Ubuntu 24.04).
- GPU driver installation (NVIDIA).
- Network bonding (e.g., `mode: active-backup` under `parameters` in netplan).
- NFS mount configuration.
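- **CUDA (not ROCm)**:
H100 and L40 are NVIDIA GPUs, so only the CUDA stack is relevant; ROCm targets AMD GPUs and is not needed here. Install a CUDA toolkit version that is compatible with both the installed driver and the PyTorch build Boltz uses, rather than simply the latest release.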
---
### **3. Network Configuration**
- **Bonding Mode**:
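Use `active-backup` mode for failover redundancy; it needs no special switch configuration. LACP (Link Aggregation Control Protocol) is only required for `802.3ad` mode, which aggregates bandwidth across both links and must be configured on the switch as well.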
- **Firewall Rules**:
Configure OPNsense to allow traffic between:
- The server and Qumulo storage (NFS).
- The server and the Ansible control node (SSH), plus any other compute nodes it must reach.
- **Qumulo NFS Optimization**:
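Tune the NFS mount options for large datasets, for example `noatime`, larger `rsize`/`wsize` values, and `nconnect` for multiple TCP connections; `tcp` and `async` are already the defaults on current Linux clients. Follow Qumulo's recommendations for the exact values. Consider a cron job or monitoring check that verifies the NFS mount is present.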
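
The mount check mentioned above can be sketched as a small parser for `/proc/mounts`. This is a minimal illustration under stated assumptions: the mount point `/mnt/qumulo`, the server address, and the function name `find_nfs_mount` are hypothetical. For real use, pass in the content of `/proc/mounts` and exit non-zero when the function returns `None`, so that cron or a monitoring agent can alert.

```python
def find_nfs_mount(mounts_text, mount_point):
    """Return the option list of an NFS mount at mount_point, or None if absent.

    mounts_text is the content of /proc/mounts: one mount per line with
    whitespace-separated fields (device, mount point, fs type, options, ...).
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, target, fstype, options = fields[:4]
        if target == mount_point and fstype in ("nfs", "nfs4"):
            return options.split(",")
    return None


if __name__ == "__main__":
    # Illustrative sample; on a real host read open("/proc/mounts").read() instead.
    sample = (
        "10.0.0.5:/data /mnt/qumulo nfs4 rw,noatime,proto=tcp 0 0\n"
        "/dev/nvme0n1p2 / ext4 rw,relatime 0 0\n"
    )
    opts = find_nfs_mount(sample, "/mnt/qumulo")
    print("mounted with options:" if opts else "not mounted", opts)
```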
---
### **4. GPU & Multi-GPU Support**
- **Driver Installation**:
Install a current NVIDIA driver branch (e.g., the `nvidia-driver-535` package or a newer branch) and verify that `nvidia-smi` lists both GPUs.
- **Multi-GPU Configuration**:
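- For **H100 GPUs**: If the cards are H100 PCIe models with NVLink bridges installed, verify the link with `nvidia-smi nvlink --status`.
- For **L40 GPUs**: There is no NVLink; install each card in a full x16 PCIe 4.0 (or newer) slot for optimal bandwidth.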
- **CUDA Multi-GPU Support**:
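Ensure Boltz is configured to use multiple GPUs. Boltz is built on PyTorch, so multi-GPU runs are normally handled through its own device options and PyTorch's distributed tooling rather than CUDA-aware MPI; check the Boltz documentation for the exact flags.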
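
The "verify `nvidia-smi` works" step can be automated by parsing its CSV query output. The sketch below assumes the NVIDIA driver is installed when `query_gpus` is called; the helper names and the sample output are illustrative, and the memory figures are placeholders rather than measured values.

```python
import csv
import io
import subprocess

# Query flags supported by nvidia-smi; memory.total is reported in MiB with nounits.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_inventory(csv_text):
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        index, name, memory_mib = (field.strip() for field in row)
        gpus.append({"index": int(index), "name": name, "memory_mib": int(memory_mib)})
    return gpus


def query_gpus():
    """Run nvidia-smi and return the parsed inventory (needs the NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_inventory(out)


if __name__ == "__main__":
    # Illustrative sample so the parser can be tried on a machine without GPUs.
    sample = "0, NVIDIA H100 PCIe, 81559\n1, NVIDIA H100 PCIe, 81559\n"
    for gpu in parse_gpu_inventory(sample):
        print(gpu)
```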
---
### **5. Storage & Performance**
- **RAM Sufficiency**:
512 GB RAM is sufficient for most workloads, but monitor memory usage with tools like `htop` or `free -h`. Optimize Boltz to minimize memory overhead (e.g., batch processing).
- **Disk I/O**:
Use NVMe SSDs for temporary storage (e.g., `/tmp`, `/var`) to reduce disk contention.
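
For unattended memory monitoring, the same numbers that `free -h` shows can be read from `/proc/meminfo`. The following is a minimal sketch; the function names are hypothetical and the sample values are illustrative (a real 512 GB host reports a somewhat lower `MemTotal`).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer value.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """Fraction of total RAM the kernel estimates is available to new workloads."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


if __name__ == "__main__":
    # Illustrative sample; on a real host read open("/proc/meminfo").read() instead.
    sample = "MemTotal:       536870912 kB\nMemAvailable:   402653184 kB\n"
    info = parse_meminfo(sample)
    print(f"available: {available_fraction(info):.0%}")
```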
---
### **6. Security & Monitoring**
- **Centralized Logging**:
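Set up a centralized logging system (e.g., ELK stack) to collect system, driver, and application logs from the server. GPU utilization is a metric rather than a log and fits better in the monitoring stack.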
- **SELinux/AppArmor**:
Ubuntu ships with AppArmor enabled by default; keep it enabled and add profiles where needed to restrict access to critical resources.
- **Prometheus + Grafana**:
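Monitor GPU utilization (e.g., via NVIDIA's DCGM exporter), network throughput, and storage I/O (e.g., via `node_exporter`). `nvidia-smi`, `iftop`, and `iostat` remain useful for interactive spot checks but do not export metrics to Prometheus on their own.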
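
Custom values can also be fed to Prometheus by writing them in the text exposition format, for example into a `.prom` file picked up by the `node_exporter` textfile collector. The sketch below only renders the text; the metric name `gpu_utilization_percent`, the helper `render_gauge`, and the sample values are hypothetical, and label values are assumed not to need escaping.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge metric in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are written
    as-is, so they must not contain quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative values; in practice they would come from a GPU query.
    text = render_gauge(
        "gpu_utilization_percent",
        "GPU utilization as reported by the driver",
        [({"gpu": "0"}, 37), ({"gpu": "1"}, 42)],
    )
    print(text, end="")
```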
---
### **7. Additional Tasks**
- **Backup & Recovery**:
Set up regular backups for critical data (e.g., Boltz configurations, Ansible playbooks, and any results that live only on the NFS mounts).
- **Disaster Recovery Plan**:
Define steps for hardware failure (e.g., GPU replacement, network redundancy).
- **Documentation**:
Document all configurations (Ansible playbooks, firewall rules, GPU settings) for future reference.
---
### **Summary of Missing Items**
| Category | Missing Tasks |
|--------------------|-------------------------------------------------------------------------------|
| **Hardware** | PCIe slot availability, power/cooling, transceiver compatibility |
| **Networking** | Bonding mode and matching switch configuration, OPNsense firewall rules, NFS optimization |
| **GPU Setup** | NVLink configuration, CUDA multi-GPU support, driver compatibility |
| **Storage** | Disk I/O optimization, RAM monitoring, temporary storage setup |
| **Security** | SELinux/AppArmor, centralized logging, access controls |
| **Monitoring** | Prometheus/Grafana |
| **Additional** | Regular backups, disaster recovery plan, documentation |