notes/projects/neosphere/ml-server/20250708-qwerty-summary.md
Petar Cubela 252a91dbcc
2025-07-15 15:36:24 +02:00


Comprehensive Project Plan: AI Server for Boltz with H100/L40 GPUs & Qumulo Storage


1. Hardware & Infrastructure

  • Server: HP DL3XX Gen11 (512 GB RAM, 25 Gbit NICs, dual GPU slots).
  • GPUs:
    • Option 1: 2x H100 NVL (94 GB vRAM each, NVLink bridge for inter-GPU communication).
    • Option 2: 2x L40 (48 GB vRAM each, PCIe 4.0 for inter-GPU communication).
  • Storage: Qumulo cluster (300 TB) mounted via NFS on each compute node.
  • Networking:
    • 25 Gbit bonding (active-backup mode) for redundancy and high throughput.
    • 25 Gbit transceivers (SFP28/QSFP28) for NICs and switches.
  • Power/Cooling: Ensure the PSU supports dual-GPU power draw (e.g., H100 PCIe/NVL: ~350 W each, L40: ~300 W each).

2. OS & Software Stack

  • OS: Ubuntu 24.04 LTS (latest stable release for H100/L40 support).
  • Automation:
    • Use Ansible for OS installation, GPU driver setup, and cluster management (3 nodes).
    • Create playbooks for:
      • Ubuntu 24.04 installation.
      • NVIDIA driver + CUDA toolkit.
      • 25 Gbit NIC bonding.
      • NFS mount configuration for Qumulo.
  • CUDA: Install the latest CUDA toolkit matching the driver version (both GPU options are NVIDIA, so ROCm is not required).
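
A minimal sketch of one such playbook is shown below; the host group `mlservers`, the Qumulo hostname, and the mount path are placeholders, not values taken from this plan:

```yaml
# Hypothetical playbook sketch: host group, Qumulo address, and paths are assumptions.
- hosts: mlservers
  become: true
  tasks:
    - name: Install NVIDIA driver and CUDA toolkit
      ansible.builtin.apt:
        name:
          - nvidia-driver-535
          - nvidia-cuda-toolkit
        state: present
        update_cache: true

    - name: Mount Qumulo share via NFS
      ansible.posix.mount:
        src: qumulo.example.internal:/data
        path: /mnt/qumulo
        fstype: nfs
        opts: noatime,proto=tcp,rsize=1048576,wsize=1048576
        state: mounted
```

The `ansible.posix.mount` task also writes the entry to /etc/fstab, so the mount survives reboots.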

3. Network Configuration

  • Firewall: Deploy OPNsense to:
    • Enforce Qumulo/NFS access controls.
    • Monitor traffic between compute nodes and storage.
  • Bonding: Configure 25 Gbit NICs with bonding-mode=active-backup for failover.
  • NFS: Optimize Qumulo NFS mounts for large datasets with noatime, proto=tcp, and large rsize/wsize values (e.g., 1048576); note that async is an export-side option configured on the storage, not a client mount option.
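
On Ubuntu 24.04 the bonding from the list above would be configured through netplan; as a sketch, assuming placeholder interface names and addresses:

```yaml
# Hypothetical /etc/netplan/01-bond.yaml sketch; interface names and the
# address are placeholders for this environment.
network:
  version: 2
  ethernets:
    enp1s0f0: {}
    enp1s0f1: {}
  bonds:
    bond0:
      interfaces: [enp1s0f0, enp1s0f1]
      parameters:
        mode: active-backup
        mii-monitor-interval: 100
      addresses: [10.0.10.11/24]
```

Apply with `netplan try` first, which rolls back automatically if the change cuts off connectivity.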

4. GPU & Multi-GPU Setup

  • Driver Installation:
    • Install NVIDIA drivers (e.g., nvidia-driver-535) and verify nvidia-smi.
  • Multi-GPU Support:
    • For H100: Enable NVLink for low-latency inter-GPU communication.
    • For L40: No NVLink is available; inter-GPU traffic runs over PCIe 4.0, so place both cards on the same root complex where possible.
  • Boltz Compatibility: Ensure Boltz is configured for multi-GPU use (CUDA-aware MPI or distributed memory).
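
After driver installation, the GPU setup above can be verified with standard nvidia-smi subcommands; this is a generic checklist, not specific to this server:

```shell
# Confirm both GPUs are visible and the driver is loaded.
nvidia-smi

# Show the inter-GPU link topology: H100 pairs should report NV* links,
# while L40 pairs will show PCIe paths (PHB/NODE/SYS) instead.
nvidia-smi topo -m

# On H100, check NVLink status per link.
nvidia-smi nvlink --status
```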

5. Storage & Performance

  • RAM: 512 GB is sufficient for most workloads; monitor with htop or free -h.
  • Disk I/O: Use NVMe SSDs for /tmp and /var to reduce latency.
  • Monitoring:
    • Track GPU utilization (nvidia-smi), network throughput (iftop), and storage I/O (iostat).
    • Deploy Prometheus + Grafana for centralized metrics.
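
The ad-hoc checks above map to the following commands; `bond0` is the assumed name of the bonded interface from the networking section:

```shell
# Stream per-GPU utilization and memory usage, one sample per second.
nvidia-smi dmon -s um

# Extended per-device disk I/O statistics, refreshed every second.
iostat -x 1

# Live traffic on the bonded 25 Gbit interface (assumed name: bond0).
iftop -i bond0
```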

6. Security & Monitoring

  • Security:
    • Enable AppArmor for mandatory access control (Ubuntu's default; SELinux is not standard on Ubuntu).
    • Regularly back up configurations and critical data.
  • Logging: Set up ELK stack for centralized logging.
  • Disaster Recovery: Define steps for GPU/NIC failure, including hot-swappable components.

7. Additional Tasks

  • Documentation: Record all configurations (Ansible playbooks, firewall rules, GPU settings).
  • Testing: Validate NFS performance and GPU utilization with sample datasets.
  • Optimization: Tune Boltz for memory efficiency and parallel processing.
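
For the NFS validation step, a sequential fio run against the Qumulo mount is a reasonable baseline; the directory and sizes below are placeholders:

```shell
# Hypothetical NFS throughput check; path and sizes are placeholders.
# --direct=1 bypasses the page cache, so results reflect the network and
# storage path rather than local RAM.
fio --name=seqwrite --directory=/mnt/qumulo/fio-test \
    --rw=write --bs=1M --size=10G --numjobs=4 --direct=1 --group_reporting

fio --name=seqread --directory=/mnt/qumulo/fio-test \
    --rw=read --bs=1M --size=10G --numjobs=4 --direct=1 --group_reporting
```

With 4 jobs at 1 MiB block size this should be able to saturate a single 25 Gbit link (~3 GB/s) if the storage side keeps up.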

Summary

This project integrates high-performance hardware (H100/L40 GPUs, 25 Gbit networking) with Qumulo storage to run Boltz efficiently. Key steps include Ubuntu 24.04 setup, Ansible automation, GPU driver installation, NFS optimization, and security monitoring. Prioritize multi-GPU communication, RAM management, and disaster recovery to ensure reliability for large-scale data analysis.