projects/neosphere/ml-server/20250708-qwerty-summary.md
### **Comprehensive Project Plan: AI Server for Boltz with H100/L40 GPUs & Qumulo Storage**

---

#### **1. Hardware & Infrastructure**
- **Server**: HPE ProLiant DL3XX Gen11 (512 GB RAM, 25 Gbit NICs, dual GPU slots).
- **GPUs**:
  - **Option 1**: 2x H100 (80 GB vRAM each for the PCIe card, 94 GB each for the NVL variant; NVLink bridge for inter-GPU communication).
  - **Option 2**: 2x L40 (48 GB vRAM each; no NVLink, so inter-GPU communication runs over PCIe 4.0).
- **Storage**: Qumulo cluster (300 TB) mounted via NFS on each compute node.
- **Networking**:
  - 25 Gbit bonding (active-backup mode) for redundancy. Active-backup does not aggregate bandwidth; use 802.3ad (LACP) if combined throughput is required.
  - 25 Gbit SFP28 transceivers for NICs and switches.
- **Power/Cooling**: Ensure the PSUs and cooling support dual GPU power draw (e.g., H100 PCIe: up to ~350 W each, L40: up to ~300 W each).
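
The power check above can be made concrete with a small calculation. This is a minimal sketch, not a sizing tool: the per-GPU wattages, the 600 W figure for the rest of the system, the 1600 W PSU rating, and the 20% headroom are all placeholder assumptions to be replaced with datasheet values for the chosen server and cards.

```python
from dataclasses import dataclass


@dataclass
class PowerBudget:
    gpu_watts: float        # per-GPU board power (assumed; check the datasheet)
    gpu_count: int
    base_watts: float       # CPUs, RAM, NICs, drives, fans (assumed estimate)
    psu_watts: float        # rated output of a single PSU (assumed)
    headroom: float = 0.2   # fraction of PSU capacity kept in reserve

    def total_draw(self) -> float:
        """Estimated peak draw of the whole server in watts."""
        return self.gpu_watts * self.gpu_count + self.base_watts

    def fits(self) -> bool:
        """True if one PSU alone can carry the load with headroom,
        so the server keeps running if a redundant PSU fails."""
        return self.total_draw() <= self.psu_watts * (1.0 - self.headroom)


# Placeholder figures for the two GPU options.
h100 = PowerBudget(gpu_watts=350, gpu_count=2, base_watts=600, psu_watts=1600)
l40 = PowerBudget(gpu_watts=300, gpu_count=2, base_watts=600, psu_watts=1600)

for label, budget in (("2x H100", h100), ("2x L40", l40)):
    print(f"{label}: {budget.total_draw():.0f} W, fits one PSU: {budget.fits()}")
```

With these placeholder numbers the H100 option slightly exceeds the single-PSU budget, which illustrates why the PSU rating should be fixed only after the GPU choice is made.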

---

#### **2. OS & Software Stack**
- **OS**: Ubuntu 24.04 LTS (current LTS release, with driver support for H100/L40).
- **Automation**:
  - Use **Ansible** for OS installation, GPU driver setup, and cluster management (3 nodes).
  - Create playbooks for:
    - Ubuntu 24.04 installation.
    - NVIDIA driver + CUDA toolkit.
    - 25 Gbit NIC bonding.
    - NFS mount configuration for Qumulo.
- **CUDA/ROCm**: Install a CUDA toolkit version supported by the installed NVIDIA driver. ROCm applies only to AMD GPUs and is not needed for the H100/L40 options.
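
Ansible needs an inventory describing the three nodes before any playbook can run. The sketch below renders an INI-style inventory from a dictionary. The group name `ml_nodes`, the host names, the addresses, and the `ansible` user are hypothetical and stand in for the real cluster values.

```python
def build_inventory(group: str, hosts: dict[str, str], user: str) -> str:
    """Render an INI-style Ansible inventory for one host group."""
    lines = [f"[{group}]"]
    for name, address in sorted(hosts.items()):
        lines.append(f"{name} ansible_host={address}")
    # Group variables shared by every node in the group.
    lines += ["", f"[{group}:vars]", f"ansible_user={user}"]
    return "\n".join(lines) + "\n"


# Hypothetical node names and addresses for the three-node cluster.
NODES = {
    "ml-node-01": "10.0.0.11",
    "ml-node-02": "10.0.0.12",
    "ml-node-03": "10.0.0.13",
}

inventory = build_inventory("ml_nodes", NODES, user="ansible")
print(inventory)
```

Generating the inventory from one data structure keeps the node list in a single place, which helps if the same list also feeds monitoring or firewall rules.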

---

#### **3. Network Configuration**
- **Firewall**: Deploy **OPNsense** to:
  - Enforce Qumulo/NFS access controls.
  - Monitor traffic between compute nodes and storage.
- **Bonding**: Configure the 25 Gbit NICs as a bond in `active-backup` mode for failover (in netplan, `mode: active-backup` under the bond's `parameters`).
- **NFS**: Mount the Qumulo NFS exports with `noatime`, `async`, and `tcp` for large datasets.
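
Once the bond is up, failover readiness can be checked from the status file the Linux bonding driver exposes under `/proc/net/bonding/`. The following is a rough sketch that parses that text. The interface name `bond0` and the member names in the sample are assumptions, and the rule "at least two links up" is one reasonable definition of readiness, not the only one.

```python
from pathlib import Path


def parse_bond_status(text: str) -> dict:
    """Extract mode, active interface, and per-member link state
    from the text of /proc/net/bonding/<bond>."""
    status = {"mode": None, "active": None, "slaves": {}}
    current = None  # name of the member section being parsed, if any
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            status["mode"] = value
        elif key == "Currently Active Slave":
            status["active"] = value
        elif key == "Slave Interface":
            current = value
            status["slaves"][current] = None
        elif key == "MII Status" and current is not None:
            status["slaves"][current] = value
    return status


def failover_ready(status: dict) -> bool:
    """Failover is possible only in active-backup mode with >= 2 links up."""
    up = [name for name, state in status["slaves"].items() if state == "up"]
    return "active-backup" in (status["mode"] or "") and len(up) >= 2


def read_bond(name: str = "bond0") -> dict:
    # "bond0" is an assumed interface name.
    return parse_bond_status(Path(f"/proc/net/bonding/{name}").read_text())


# Abbreviated sample of the status file, with hypothetical NIC names.
SAMPLE = """Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: ens1f0
MII Status: up

Slave Interface: ens1f0
MII Status: up

Slave Interface: ens1f1
MII Status: down
"""

sample_status = parse_bond_status(SAMPLE)
print(sample_status["active"], failover_ready(sample_status))
```

On a node, `failover_ready(read_bond())` could run as a periodic check so a silently failed backup link is noticed before the active link also fails.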

---

#### **4. GPU & Multi-GPU Setup**
- **Driver Installation**:
  - Install NVIDIA drivers (e.g., `nvidia-driver-535`) and verify the installation with `nvidia-smi`.
- **Multi-GPU Support**:
  - For H100: Use an **NVLink** bridge for low-latency inter-GPU communication.
  - For L40: NVLink is not available; inter-GPU traffic uses PCIe 4.0, so place both cards in x16 slots for maximum bandwidth.
- **Boltz Compatibility**: Confirm that Boltz is configured for multi-GPU use (e.g., CUDA-aware MPI or a distributed-memory setup).
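
The `nvidia-smi` verification can be scripted so that each node reports whether both GPUs are visible. This sketch uses the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`. The sample output values are illustrative only, and the expected count of two GPUs per node is taken from the hardware plan.

```python
import csv
import io
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse 'index, name, memory.total' rows (memory in MiB)."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, name, mem_mib = (field.strip() for field in row)
        gpus.append({"index": int(index), "name": name, "memory_mib": int(mem_mib)})
    return gpus


def query_gpus() -> list[dict]:
    """Run nvidia-smi on the local node and parse its output."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_gpu_csv(out)


def check_gpus(gpus: list[dict], expected_count: int = 2) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if len(gpus) != expected_count:
        problems.append(f"expected {expected_count} GPUs, found {len(gpus)}")
    if len({g["name"] for g in gpus}) > 1:
        problems.append("GPU models differ")
    return problems


# Illustrative output for a two-GPU node (values are made up).
SAMPLE = "0, NVIDIA L40, 46068\n1, NVIDIA L40, 46068\n"
gpus = parse_gpu_csv(SAMPLE)
print(gpus, check_gpus(gpus))
```

On a real node, `check_gpus(query_gpus())` would replace the sample. `nvidia-smi topo -m` additionally shows whether the two cards are connected via NVLink or only via PCIe.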

---

#### **5. Storage & Performance**
- **RAM**: 512 GB should be sufficient for most workloads; monitor usage with `htop` or `free -h`.
- **Disk I/O**: Use local NVMe SSDs for `/tmp` and `/var` to reduce latency.
- **Monitoring**:
  - Track GPU utilization (`nvidia-smi`), network throughput (`iftop`), and storage I/O (`iostat`).
  - Deploy **Prometheus + Grafana** for centralized metrics.
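
For scripted RAM checks, the same figures that `free -h` shows can be read from `/proc/meminfo`. The sketch below parses that file and flags low available memory. The 10% threshold is an arbitrary assumption, and the sample values are illustrative for a 512 GB node.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Map /proc/meminfo field names to their numeric values
    (kB for most fields; plain counts for the HugePages counters)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields: dict[str, int]) -> float:
    """Share of total RAM the kernel estimates is available to new work."""
    return fields["MemAvailable"] / fields["MemTotal"]


def memory_ok(fields: dict[str, int], min_fraction: float = 0.10) -> bool:
    # min_fraction is an assumed alert threshold, not a recommendation.
    return available_fraction(fields) >= min_fraction


def read_meminfo() -> dict[str, int]:
    return parse_meminfo(Path("/proc/meminfo").read_text())


# Illustrative excerpt for a 512 GB node.
SAMPLE = """MemTotal:       528000000 kB
MemFree:         20000000 kB
MemAvailable:   132000000 kB
"""

sample_fields = parse_meminfo(SAMPLE)
print(f"available: {available_fraction(sample_fields):.0%}")
```

A check like `memory_ok(read_meminfo())` is a stopgap for ad hoc scripts; once Prometheus is deployed, the same signal would normally come from its node metrics.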

---

#### **6. Security & Monitoring**
- **Security**:
  - Enable **AppArmor** (Ubuntu's default) or **SELinux** for mandatory access control.
  - Regularly back up configurations and critical data.
- **Logging**: Set up an ELK stack for centralized logging.
- **Disaster Recovery**: Define recovery steps for GPU/NIC failure, and record which components are hot-swappable (typically PSUs, fans, and drives) and which require downtime (typically GPUs and NICs).
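
Configuration backups can start as a timestamped archive of the relevant directories. This is a minimal sketch using only the standard library. The directory names are hypothetical, and a real setup would also copy the archive off the node (for example to the Qumulo share) and prune old archives.

```python
import tarfile
import tempfile
import time
from pathlib import Path


def backup_configs(sources, dest_dir, label="configs") -> Path:
    """Pack the given files/directories into <label>-<timestamp>.tar.gz."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{label}-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        for src in map(Path, sources):
            if src.exists():  # silently skip paths that are not present
                tar.add(str(src), arcname=src.name)
    return archive


def list_backup(archive) -> list[str]:
    """Return the sorted member names of a backup archive."""
    with tarfile.open(str(archive), "r:gz") as tar:
        return sorted(tar.getnames())


# Self-contained demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    playbooks = Path(tmp) / "ansible"   # hypothetical config directory
    playbooks.mkdir()
    (playbooks / "site.yml").write_text("- hosts: all\n")
    archive = backup_configs([playbooks, Path(tmp) / "missing"], Path(tmp) / "out")
    names = list_backup(archive)
    print(archive.name, names)
```

Listing the archive after writing it is a cheap restore check: a backup that cannot be opened is discovered immediately instead of during recovery.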

---

#### **7. Additional Tasks**
- **Documentation**: Record all configurations (Ansible playbooks, firewall rules, GPU settings).
- **Testing**: Validate NFS performance and GPU utilization with sample datasets.
- **Optimization**: Tune Boltz for memory efficiency and parallel processing.
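
A first NFS performance check can be a timed sequential write. The sketch below writes random data into a directory and reports MiB/s. The sizes are small defaults for a quick run, and the demonstration targets the local temp directory; for a real test the directory would be the Qumulo mount point (a path such as `/mnt/qumulo` is an assumption) and the size should exceed the client's page cache.

```python
import os
import tempfile
import time


def write_throughput(directory, size_mib=64, block_mib=4) -> float:
    """Write about size_mib of data under `directory` and return MiB/s.
    The timing includes fsync, so buffered data is flushed before the clock stops."""
    block = os.urandom(block_mib * 1024 * 1024)
    blocks = size_mib // block_mib
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the test file even on failure
    return (blocks * block_mib) / elapsed


# Quick local demonstration; point `directory` at the NFS mount for a real test.
rate = write_throughput(tempfile.gettempdir(), size_mib=8, block_mib=4)
print(f"sequential write: {rate:.1f} MiB/s")
```

Including `fsync` in the timing matters on an `async` mount, where writes otherwise return before the data reaches the storage cluster. Dedicated tools give more complete results, but a script like this is easy to run from every node for a first comparison.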

---

### **Summary**
This project integrates high-performance hardware (H100/L40 GPUs, 25 Gbit networking) with Qumulo storage to run Boltz efficiently. Key steps include Ubuntu 24.04 setup, Ansible automation, GPU driver installation, NFS optimization, and security monitoring. Prioritize multi-GPU communication, RAM management, and disaster recovery to ensure reliability for large-scale data analysis.