notes/projects/neosphere/ml-server/20250708-qwerty-summary.md
Petar Cubela 252a91dbcc
2025-07-15 15:36:24 +02:00


Comprehensive Project Plan: AI Server for Boltz with H100/L40 GPUs & Qumulo Storage


1. Hardware & Infrastructure

  • Server: HP DL3XX Gen11 (512 GB RAM, 25 Gbit NICs, dual GPU slots).
  • GPUs:
    • Option 1: 2x H100 NVL (94 GB vRAM each, NVLink bridge for inter-GPU communication).
    • Option 2: 2x L40 (48 GB vRAM each, PCIe 4.0 for inter-GPU communication).
  • Storage: Qumulo cluster (300 TB) mounted via NFS on each compute node.
  • Networking:
    • 25 Gbit bonding (active-backup mode) for redundancy and high throughput.
    • 25 Gbit transceivers (SFP28/QSFP28) for NICs and switches.
  • Power/Cooling: Ensure the PSU supports dual-GPU power draw (e.g., H100 PCIe/NVL: ~350 W each, L40: ~300 W each).

2. OS & Software Stack

  • OS: Ubuntu 24.04 LTS (latest stable release for H100/L40 support).
  • Automation:
    • Use Ansible for OS installation, GPU driver setup, and cluster management (3 nodes).
    • Create playbooks for:
      • Ubuntu 24.04 installation.
      • NVIDIA driver + CUDA toolkit.
      • 25 Gbit NIC bonding.
      • NFS mount configuration for Qumulo.
  • CUDA: Install the latest CUDA toolkit matching the driver version (both GPU options are NVIDIA, so ROCm is not required).
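
A minimal sketch of one such playbook is shown below; the host group `mlservers`, the Qumulo hostname, and the mount path are placeholders, not values taken from this plan:

```yaml
# Hypothetical playbook sketch: host group, Qumulo address, and paths are assumptions.
- hosts: mlservers
  become: true
  tasks:
    - name: Install NVIDIA driver and CUDA toolkit
      ansible.builtin.apt:
        name:
          - nvidia-driver-535
          - nvidia-cuda-toolkit
        state: present
        update_cache: true

    - name: Mount Qumulo share via NFS
      ansible.posix.mount:
        src: qumulo.example.internal:/data
        path: /mnt/qumulo
        fstype: nfs
        opts: noatime,proto=tcp,rsize=1048576,wsize=1048576
        state: mounted
```

The `ansible.posix.mount` task also writes the entry to /etc/fstab, so the mount survives reboots.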

3. Network Configuration

  • Firewall: Deploy OPNsense to:
    • Enforce Qumulo/NFS access controls.
    • Monitor traffic between compute nodes and storage.
  • Bonding: Configure 25 Gbit NICs with bonding-mode=active-backup for failover.
  • NFS: Optimize Qumulo NFS mounts for large datasets with noatime, proto=tcp, and large rsize/wsize values (e.g., 1048576); note that async is an export-side option configured on the storage, not a client mount option.
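
On Ubuntu 24.04 the bonding from the list above would be configured through netplan; as a sketch, assuming placeholder interface names and addresses:

```yaml
# Hypothetical /etc/netplan/01-bond.yaml sketch; interface names and the
# address are placeholders for this environment.
network:
  version: 2
  ethernets:
    enp1s0f0: {}
    enp1s0f1: {}
  bonds:
    bond0:
      interfaces: [enp1s0f0, enp1s0f1]
      parameters:
        mode: active-backup
        mii-monitor-interval: 100
      addresses: [10.0.10.11/24]
```

Apply with `netplan try` first, which rolls back automatically if the change cuts off connectivity.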

4. GPU & Multi-GPU Setup

  • Driver Installation:
    • Install NVIDIA drivers (e.g., nvidia-driver-535) and verify nvidia-smi.
  • Multi-GPU Support:
    • For H100: Enable NVLink for low-latency inter-GPU communication.
    • For L40: No NVLink is available; inter-GPU traffic runs over PCIe 4.0, so place both cards on the same root complex where possible.
  • Boltz Compatibility: Ensure Boltz is configured for multi-GPU use (CUDA-aware MPI or distributed memory).
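
After driver installation, the GPU setup above can be verified with standard nvidia-smi subcommands; this is a generic checklist, not specific to this server:

```shell
# Confirm both GPUs are visible and the driver is loaded.
nvidia-smi

# Show the inter-GPU link topology: H100 pairs should report NV* links,
# while L40 pairs will show PCIe paths (PHB/NODE/SYS) instead.
nvidia-smi topo -m

# On H100, check NVLink status per link.
nvidia-smi nvlink --status
```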

5. Storage & Performance

  • RAM: 512 GB is sufficient for most workloads; monitor with htop or free -h.
  • Disk I/O: Use NVMe SSDs for /tmp and /var to reduce latency.
  • Monitoring:
    • Track GPU utilization (nvidia-smi), network throughput (iftop), and storage I/O (iostat).
    • Deploy Prometheus + Grafana for centralized metrics.
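
The ad-hoc checks above map to the following commands; `bond0` is the assumed name of the bonded interface from the networking section:

```shell
# Stream per-GPU utilization and memory usage, one sample per second.
nvidia-smi dmon -s um

# Extended per-device disk I/O statistics, refreshed every second.
iostat -x 1

# Live traffic on the bonded 25 Gbit interface (assumed name: bond0).
iftop -i bond0
```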

6. Security & Monitoring

  • Security:
    • Enable AppArmor for mandatory access control (Ubuntu's default; SELinux is not standard on Ubuntu).
    • Regularly back up configurations and critical data.
  • Logging: Set up ELK stack for centralized logging.
  • Disaster Recovery: Define steps for GPU/NIC failure, including hot-swappable components.

7. Additional Tasks

  • Documentation: Record all configurations (Ansible playbooks, firewall rules, GPU settings).
  • Testing: Validate NFS performance and GPU utilization with sample datasets.
  • Optimization: Tune Boltz for memory efficiency and parallel processing.
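
For the NFS validation step, a sequential fio run against the Qumulo mount is a reasonable baseline; the directory and sizes below are placeholders:

```shell
# Hypothetical NFS throughput check; path and sizes are placeholders.
# --direct=1 bypasses the page cache, so results reflect the network and
# storage path rather than local RAM.
fio --name=seqwrite --directory=/mnt/qumulo/fio-test \
    --rw=write --bs=1M --size=10G --numjobs=4 --direct=1 --group_reporting

fio --name=seqread --directory=/mnt/qumulo/fio-test \
    --rw=read --bs=1M --size=10G --numjobs=4 --direct=1 --group_reporting
```

With 4 jobs at 1 MiB block size this should be able to saturate a single 25 Gbit link (~3 GB/s) if the storage side keeps up.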

Summary

This project integrates high-performance hardware (H100/L40 GPUs, 25 Gbit networking) with Qumulo storage to run Boltz efficiently. Key steps include Ubuntu 24.04 setup, Ansible automation, GPU driver installation, NFS optimization, and security monitoring. Prioritize multi-GPU communication, RAM management, and disaster recovery to ensure reliability for large-scale data analysis.