Comprehensive Project Plan: AI Server for Boltz with H100/L40 GPUs & Qumulo Storage
1. Hardware & Infrastructure
- Server: HPE ProLiant DL3XX Gen11 (512 GB RAM, 25 Gbit NICs, dual GPU slots).
- GPUs:
- Option 1: 2x H100 (80 GB VRAM each on the PCIe card, 94 GB on the H100 NVL variant; NVLink bridge for inter-GPU communication).
- Option 2: 2x L40 (48 GB VRAM each; no NVLink, so inter-GPU communication runs over PCIe 4.0).
- Storage: Qumulo cluster (300 TB) mounted via NFS on each compute node.
- Networking:
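- 25 Gbit bonding (active-backup mode) for redundancy. Only one link carries traffic at a time, so throughput stays at a single 25 Gbit link; aggregating both links would need 802.3ad (LACP) instead.
- 25 Gbit transceivers (SFP28) for NICs and switches.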
- Power/Cooling: Ensure the PSUs support dual GPU power draw (e.g., H100 PCIe: ~350 W each, L40: ~300 W each) on top of the base system load.
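The PSU check is simple arithmetic, sketched below. The GPU wattage is a parameter to fill in from the datasheet of the exact card ordered; the 800 W base-system load and the 20% headroom are illustrative assumptions, not HPE figures.

```python
def required_psu_watts(gpu_watts: float, gpu_count: int = 2,
                       base_system_watts: float = 800.0,
                       headroom: float = 0.2) -> float:
    """Return the PSU capacity needed for the GPUs plus the rest of the server.

    base_system_watts and headroom are assumed placeholder values.
    """
    if gpu_count < 0 or gpu_watts < 0 or base_system_watts < 0:
        raise ValueError("power figures must be non-negative")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be a fraction in [0, 1)")
    load = gpu_watts * gpu_count + base_system_watts
    return load * (1 + headroom)


def psu_is_sufficient(psu_watts: float, gpu_watts: float, **kwargs) -> bool:
    """True if the PSU rating covers the estimated load including headroom."""
    return psu_watts >= required_psu_watts(gpu_watts, **kwargs)


if __name__ == "__main__":
    # Example with an assumed 350 W per GPU and an assumed 2200 W PSU.
    print(f"required: {required_psu_watts(350):.0f} W")
    print("sufficient:", psu_is_sufficient(2200, 350))
```

With redundant PSUs, the rating to compare against is what a single PSU can deliver, since the server must survive on one supply.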
2. OS & Software Stack
- OS: Ubuntu 24.04 LTS (its packaged NVIDIA drivers support both H100 and L40).
- Automation:
- Use Ansible for OS installation, GPU driver setup, and cluster management (3 nodes).
- Create playbooks for:
- Ubuntu 24.04 installation.
- NVIDIA driver + CUDA toolkit.
- 25 Gbit NIC bonding.
- NFS mount configuration for Qumulo.
- CUDA: Install a CUDA toolkit version supported by the installed NVIDIA driver. ROCm applies only to AMD GPUs and is not needed for either GPU option in this plan.
3. Network Configuration
- Firewall: Deploy OPNsense to:
- Enforce Qumulo/NFS access controls.
- Monitor traffic between compute nodes and storage.
- Bonding: Configure 25 Gbit NICs with `bonding-mode=active-backup` for failover.
- NFS: Optimize Qumulo NFS mounts with `noatime`, `async`, and `tcp` for large datasets.
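A minimal sketch of how the bonding and NFS settings could be verified and generated on a node. It assumes the bond is named `bond0` and that the kernel exposes its state in `/proc/net/bonding/bond0`; the Qumulo host name, export path, and mount point used in the example call are hypothetical placeholders.

```python
from pathlib import Path


def parse_bond_status(text: str) -> dict:
    """Parse the text of /proc/net/bonding/<bond> into a small dict."""
    status = {"mode": None, "active_slave": None, "slaves": {}}
    current = None  # name of the slave section being read, if any
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            status["mode"] = value
        elif key == "Currently Active Slave":
            status["active_slave"] = value
        elif key == "Slave Interface":
            current = value
            status["slaves"][current] = {}
        elif current is not None and key in ("MII Status", "Speed"):
            status["slaves"][current][key] = value
    return status


def bond_is_healthy(status: dict, expected_slaves: int = 2) -> bool:
    """True if the bond is active-backup and every member link is up."""
    slaves = status["slaves"]
    return (
        status["mode"] is not None
        and "active-backup" in status["mode"]
        and status["active_slave"] in slaves
        and len(slaves) == expected_slaves
        and all(s.get("MII Status") == "up" for s in slaves.values())
    )


def nfs_fstab_line(server: str, export: str, mountpoint: str,
                   options=("noatime", "async", "tcp")) -> str:
    """Build an /etc/fstab entry using the mount options named in the plan."""
    return f"{server}:{export} {mountpoint} nfs {','.join(options)} 0 0"


if __name__ == "__main__":
    bond_file = Path("/proc/net/bonding/bond0")  # assumed bond name
    if bond_file.exists():
        print("bond healthy:", bond_is_healthy(parse_bond_status(bond_file.read_text())))
    # Hypothetical host, export, and mount point.
    print(nfs_fstab_line("qumulo.example.internal", "/boltz", "/mnt/qumulo"))
```

Checking that both member links report `up` matters for active-backup: a bond with one dead member still passes traffic, so a failed link can go unnoticed until the second one fails.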
4. GPU & Multi-GPU Setup
- Driver Installation:
- Install NVIDIA drivers (e.g., `nvidia-driver-535`) and verify with `nvidia-smi`.
- Multi-GPU Support:
- For H100: Enable NVLink for low-latency inter-GPU communication.
- For L40: Inter-GPU traffic uses PCIe 4.0; seat both cards in x16 slots for maximum bandwidth.
- Boltz Compatibility: Ensure Boltz is configured for multi-GPU use (CUDA-aware MPI or distributed memory).
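The driver check can be scripted so that it runs on every node after installation. The sketch below calls `nvidia-smi` with its CSV query interface and confirms that the expected number of GPUs is visible; the expected count of two comes from this plan, and the function names are illustrative.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ("index", "name", "memory.total", "utilization.gpu")


def _to_int(field: str):
    """Convert a numeric field, returning None for values such as '[N/A]'."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_csv(text: str) -> list:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        index, name, memory, util = record
        gpus.append({
            "index": _to_int(index),
            "name": name.strip(),
            "memory_mib": _to_int(memory),
            "utilization_pct": _to_int(util),
        })
    return gpus


def query_gpus() -> list:
    """Run nvidia-smi and return one dict per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={','.join(QUERY_FIELDS)}",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    try:
        gpus = query_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("nvidia-smi not usable:", exc)
    else:
        for gpu in gpus:
            print(gpu)
        print("both GPUs visible:", len(gpus) == 2)
```

For the H100 option, `nvidia-smi topo -m` additionally shows whether the two cards are connected over NVLink or only over PCIe.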
5. Storage & Performance
- RAM: 512 GB is sufficient for most workloads; monitor with `htop` or `free -h`.
- Disk I/O: Use NVMe SSDs for `/tmp` and `/var` to reduce latency.
- Monitoring:
- Track GPU utilization (`nvidia-smi`), network throughput (`iftop`), and storage I/O (`iostat`).
- Deploy Prometheus + Grafana for centralized metrics.
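To show how a node-level value such as available RAM ends up in Prometheus, the sketch below reads `/proc/meminfo` and renders two gauges in the Prometheus text exposition format. The metric names are hypothetical, and in practice an exporter such as node_exporter already publishes memory metrics; this only illustrates the shape of the data.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into a dict of values in bytes (or raw counts)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        number = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            number *= 1024  # the kernel reports these fields in KiB
        values[key.strip()] = number
    return values


def memory_metrics(meminfo: dict) -> str:
    """Render total and available memory as Prometheus text-format gauges."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    lines = [
        "# TYPE boltz_node_memory_total_bytes gauge",  # hypothetical metric name
        f"boltz_node_memory_total_bytes {total}",
        "# TYPE boltz_node_memory_available_ratio gauge",  # hypothetical metric name
        f"boltz_node_memory_available_ratio {available / total:.4f}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    meminfo_file = Path("/proc/meminfo")
    if meminfo_file.exists():
        print(memory_metrics(parse_meminfo(meminfo_file.read_text())), end="")
```

A low available ratio during a run is the signal that the 512 GB assumption should be revisited.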
6. Security & Monitoring
- Security:
- Keep AppArmor (the Ubuntu default) in enforcing mode for access control; SELinux is the alternative on other distributions.
- Regularly back up configurations and critical data.
- Logging: Set up ELK stack for centralized logging.
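- Disaster Recovery: Define steps for GPU/NIC failure, distinguishing hot-swappable components (typically PSUs, fans, and drives) from those that require downtime to replace (GPUs and PCIe NICs).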
7. Additional Tasks
- Documentation: Record all configurations (Ansible playbooks, firewall rules, GPU settings).
- Testing: Validate NFS performance and GPU utilization with sample datasets.
- Optimization: Tune Boltz for memory efficiency and parallel processing.
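For the NFS validation step, a rough sequential throughput probe can be run against the mounted share before any real dataset is used. This is a minimal sketch, not a replacement for a proper benchmark tool; the file size and block size are arbitrary defaults, and the mount point in the usage note is a hypothetical path.

```python
import os
import tempfile
import time
from pathlib import Path


def measure_throughput(directory: str, size_mib: int = 256, block_mib: int = 4) -> dict:
    """Write then read a temporary file in `directory`; return MiB/s figures."""
    if size_mib < block_mib or block_mib <= 0:
        raise ValueError("size_mib must be at least block_mib, and block_mib positive")
    block = os.urandom(block_mib * 1024 * 1024)  # incompressible test data
    blocks = size_mib // block_mib
    fd, name = tempfile.mkstemp(dir=directory, prefix="nfs_bench_")
    path = Path(name)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(blocks):
                handle.write(block)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out to the server
        write_seconds = time.perf_counter() - start

        start = time.perf_counter()
        read_bytes = 0
        with open(path, "rb") as handle:
            while chunk := handle.read(len(block)):
                read_bytes += len(chunk)
        read_seconds = time.perf_counter() - start
    finally:
        path.unlink(missing_ok=True)  # always remove the test file

    mib = read_bytes / 2**20
    return {
        "bytes": read_bytes,
        "write_mib_s": mib / write_seconds,
        "read_mib_s": mib / read_seconds,
    }
```

A call such as `measure_throughput("/mnt/qumulo", size_mib=4096)` would exercise the share. The read figure is likely to be inflated by the client page cache unless the file is larger than RAM or the cache is dropped first, so the write figure is the more trustworthy of the two.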
Summary
This project integrates high-performance hardware (H100/L40 GPUs, 25 Gbit networking) with Qumulo storage to run Boltz efficiently. Key steps include Ubuntu 24.04 setup, Ansible automation, GPU driver installation, NFS optimization, and security monitoring. Prioritize multi-GPU communication, RAM management, and disaster recovery to ensure reliability for large-scale data analysis.