notes/projects/neosphere/ml-server/20250708-qwerty-talk.md
Petar Cubela 252a91dbcc
2025-07-15 15:36:24 +02:00

Your task list is comprehensive, but there are additional considerations to ensure smooth operation, especially with high-performance hardware and large datasets. Here's a breakdown of what you might be missing:
---
### **1. Hardware & Networking Considerations**
- **PCIe Slot Availability**:
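Ensure the server has enough PCIe slots (and enough physical spacing) for both GPUs (H100/L40) and the 25 Gbit NICs. Dual H100 PCIe cards can be linked with NVLink bridges, which requires the cards to sit in adjacent slots; the L40 does not support NVLink, so its inter-GPU traffic runs over PCIe (e.g., PCIe 4.0 or higher).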
- **Transceivers**:
25 Gbit ports use the SFP28 form factor (SFP+ tops out at 10 Gbit; QSFP28 ports can be broken out into 4x25 Gbit). Confirm the NICs and switches support the required transceivers or DAC cables on both ends.
- **Power & Cooling**:
Verify the PSU can handle the combined power draw of dual GPUs (e.g., H100 PCIe: up to ~350 W each; L40: up to ~300 W each) on top of CPUs, RAM, NICs, and drives. Ensure adequate cooling for the server chassis.
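
The power check above is simple arithmetic, and a minimal sketch makes the headroom explicit. The function name, the 20% headroom, and all wattages below are illustrative assumptions; substitute the figures from the vendor datasheets.

```python
def required_psu_watts(gpu_watts, gpu_count, other_watts, headroom=0.2):
    """Return the PSU capacity (in watts) needed for the load plus headroom.

    other_watts covers everything that is not a GPU: CPUs, RAM, NICs,
    drives, and fans. headroom is a fraction (0.2 means 20% spare capacity).
    """
    load = gpu_watts * gpu_count + other_watts
    return load * (1 + headroom)


# Hypothetical figures: two 350 W GPUs plus 500 W for the rest of the system.
needed = required_psu_watts(gpu_watts=350, gpu_count=2, other_watts=500)
print(f"Plan for at least {needed:.0f} W of PSU capacity")
```

With redundant PSUs, each supply should be able to carry this load alone; otherwise a single PSU failure takes the server down under full GPU load.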
---
### **2. OS & Software Stack**
- **Ubuntu Version**:
While Ubuntu 22.04 is stable, **24.04 is recommended** for better support of newer hardware (e.g., H100 GPUs) and security updates.
- **Ansible Playbooks**:
Create reusable Ansible playbooks for:
- OS installation (e.g., Ubuntu 24.04).
- GPU driver installation (NVIDIA).
- Network bonding (e.g., `active-backup` mode).
- NFS mount configuration.
- **CUDA & ROCm**:
Install a CUDA toolkit version that is supported by both the installed NVIDIA driver and Boltz. ROCm applies only to AMD GPUs and is not needed for H100/L40 cards.
---
### **3. Network Configuration**
- **Bonding Mode**:
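Use `active-backup` mode for failover redundancy; it needs no special switch support. If you want aggregated bandwidth across both links instead, use `802.3ad` (LACP, Link Aggregation Control Protocol) mode, which requires the switches to support LACP and to be configured for it.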
- **Firewall Rules**:
Configure OPNsense to allow traffic between:
- The server and Qumulo storage (NFS).
- The server and other compute nodes (if using Ansible for clustering).
- **Qumulo NFS Optimization**:
Use NFS mount options like `noatime`, `async`, or `tcp` for large datasets. Consider setting up a cron job to check NFS mount status.
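
The mount-status check suggested above could be sketched as follows. This is a minimal example, not a finished monitoring script: the mount point `/mnt/qumulo` and the export `qumulo:/data` are hypothetical placeholders, and the function names are made up for illustration. It reads the `/proc/mounts` layout (device, mount point, filesystem type, options).

```python
def nfs_mounted(mount_point, mounts_text):
    """Return True if mount_point is listed as an NFS mount in mounts_text.

    mounts_text follows the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point and fields[2] in ("nfs", "nfs4"):
            return True
    return False


def check_mount(mount_point, mounts_path="/proc/mounts"):
    """Return 0 if the NFS mount is present, 1 otherwise (usable as an exit code)."""
    with open(mounts_path) as f:
        return 0 if nfs_mounted(mount_point, f.read()) else 1


# Illustrative sample in /proc/mounts format (hypothetical export and path).
sample = "qumulo:/data /mnt/qumulo nfs4 rw,noatime,proto=tcp 0 0\n"
print(nfs_mounted("/mnt/qumulo", sample))
```

A cron job can call `sys.exit(check_mount("/mnt/qumulo"))` and alert on a non-zero exit status. Note that this only confirms the mount is listed; a hung NFS server would still show up as mounted.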
---
### **4. GPU & Multi-GPU Support**
- **Driver Installation**:
Install the latest NVIDIA drivers (e.g., via `nvidia-driver-535` package) and verify `nvidia-smi` works.
- **Multi-GPU Configuration**:
- For **H100 GPUs**: Enable NVLink (if supported) for faster inter-GPU communication.
- For **L40 GPUs**: NVLink is not available, so inter-GPU traffic goes over PCIe. Use PCIe 4.0 x16 slots for optimal bandwidth.
- **CUDA Multi-GPU Support**:
Ensure Boltz is configured to use multiple GPUs (e.g., via CUDA-aware MPI or distributed memory frameworks).
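
To verify that both GPUs are visible after the driver install, the `nvidia-smi` query interface can be scripted. The sketch below assumes `nvidia-smi` is on the `PATH`; the helper names and the sample rows are illustrative, not output captured from this server.

```python
import csv
import io
import subprocess

# --query-gpu and --format=csv,noheader,nounits are standard nvidia-smi options.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, name, memory.total' CSV rows into a list of dicts."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, name, mem_mib = (field.strip() for field in row)
        gpus.append({"index": int(index), "name": name, "memory_mib": int(mem_mib)})
    return gpus


def list_gpus():
    """Run nvidia-smi and return the parsed GPU list (requires the NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Illustrative sample rows; real names and memory sizes will differ.
sample = "0, NVIDIA H100 PCIe, 81559\n1, NVIDIA H100 PCIe, 81559\n"
print(len(parse_gpu_csv(sample)), "GPUs found in sample")
```

On the server itself, `len(list_gpus()) == 2` is a quick sanity check to run from an Ansible task or a post-install script. For the interconnect, `nvidia-smi topo -m` prints the link type between each GPU pair.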
---
### **5. Storage & Performance**
- **RAM Sufficiency**:
512 GB RAM is sufficient for most workloads, but monitor memory usage with tools like `htop` or `free -h`. Optimize Boltz to minimize memory overhead (e.g., batch processing).
- **Disk I/O**:
Use NVMe SSDs for temporary storage (e.g., `/tmp`, `/var`) to reduce disk contention.
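
Beyond interactive tools, memory headroom can be checked from a script by reading `/proc/meminfo`. This is a minimal sketch; the function names and the 10% threshold are assumptions chosen for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values.

    Most fields are reported in KiB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def available_fraction(info):
    """Return MemAvailable / MemTotal as a fraction between 0 and 1."""
    return info["MemAvailable"] / info["MemTotal"]


def memory_low(meminfo_path="/proc/meminfo", threshold=0.10):
    """Return True if less than `threshold` of RAM is available (assumed 10%)."""
    with open(meminfo_path) as f:
        return available_fraction(parse_meminfo(f.read())) < threshold


# Illustrative sample in /proc/meminfo format.
sample = "MemTotal:       1000 kB\nMemFree:         100 kB\nMemAvailable:    250 kB\n"
print(available_fraction(parse_meminfo(sample)))
```

`MemAvailable` is used rather than `MemFree` because it accounts for page cache that the kernel can reclaim, which matters on a server that streams large datasets over NFS.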
---
### **6. Security & Monitoring**
- **Centralized Logging**:
Set up a centralized logging system (e.g., ELK stack) to monitor server activity and GPU utilization.
- **SELinux/AppArmor**:
Enable security modules (e.g., AppArmor) to restrict access to critical resources.
- **Prometheus + Grafana**:
Collect GPU utilization, network throughput, and storage I/O through Prometheus exporters and chart them in Grafana. `nvidia-smi`, `iftop`, and `iostat` remain useful for interactive spot checks.
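
One lightweight way to get GPU numbers into Prometheus is to write them in the text exposition format to a file that node_exporter's textfile collector picks up. The sketch below shows only the formatting step; the metric name `gpu_utilization_percent` and the function name are assumptions, and a ready-made exporter such as NVIDIA's DCGM exporter may be a better fit in practice.

```python
def format_gpu_metrics(samples):
    """Render GPU utilization samples in the Prometheus text exposition format.

    samples: iterable of (gpu_index, utilization_percent) tuples.
    """
    lines = [
        "# HELP gpu_utilization_percent GPU utilization as reported by nvidia-smi.",
        "# TYPE gpu_utilization_percent gauge",
    ]
    for index, util in samples:
        # Label values are quoted; the sample value follows after a space.
        lines.append(f'gpu_utilization_percent{{gpu="{index}"}} {util}')
    return "\n".join(lines) + "\n"


# Hypothetical readings for two GPUs.
print(format_gpu_metrics([(0, 87), (1, 42)]), end="")
```

The output would be written to a `*.prom` file in the directory passed to node_exporter via `--collector.textfile.directory`, ideally by writing to a temporary file and renaming it so that a scrape never reads a half-written file.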
---
### **7. Additional Tasks**
- **Backup & Recovery**:
Set up regular backups for critical data (e.g., Boltz configurations, NFS mounts).
- **Disaster Recovery Plan**:
Define steps for hardware failure (e.g., GPU replacement, network redundancy).
- **Documentation**:
Document all configurations (Ansible playbooks, firewall rules, GPU settings) for future reference.
---
### **Summary of Missing Items**
| Category | Missing Tasks |
|--------------------|-------------------------------------------------------------------------------|
| **Hardware** | PCIe slot availability, power/cooling, transceiver compatibility |
| **Networking** | Bonding mode and switch support, OPNsense firewall rules, NFS optimization |
| **GPU Setup** | NVLink configuration, CUDA multi-GPU support, driver compatibility |
| **Storage** | Disk I/O optimization, RAM monitoring, temporary storage setup |
| **Security** | SELinux/AppArmor, centralized logging, access controls |
| **Monitoring** | Prometheus/Grafana |
| **Operations** | Regular backups, disaster recovery plan, documentation |