
Your task list is comprehensive, but there are additional considerations to ensure smooth operation, especially with high-performance hardware and large datasets. Here's a breakdown of what you might be missing:


1. Hardware & Networking Considerations

  • PCIe Slot Availability:
    Ensure the server has enough PCIe slots (and physical spacing) for both GPUs (H100/L40) and the 25 Gbit NICs. Dual H100 PCIe cards can be paired with an NVLink bridge, which requires adjacent, compatible slots; also confirm the motherboard's slots are PCIe 4.0 or higher so the GPUs get full bandwidth.
  • Transceivers:
    If using 25 Gbit ports, confirm the NICs and switches support the required transceivers (SFP28 for 25 GbE; SFP+ tops out at 10 GbE).
  • Power & Cooling:
    Verify the PSU can handle the combined power draw of dual GPUs (H100 PCIe: ~350 W each, L40: ~300 W each) on top of CPUs and NICs. Ensure adequate cooling for the server chassis.
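
If a quick software-side sanity check is useful once an OS is on the box, a short Ansible pre-flight play can confirm the GPUs and 25 Gbit NICs actually show up on the PCIe bus. The inventory group name and grep patterns below are assumptions, not part of the plan:

```yaml
# Hypothetical pre-flight check: confirm GPUs and NICs are visible on the PCIe bus.
- hosts: ml_servers          # assumed inventory group
  become: true
  tasks:
    - name: List NVIDIA GPUs on the PCIe bus
      ansible.builtin.shell: lspci | grep -i nvidia
      register: gpu_pci
      changed_when: false

    - name: List Ethernet controllers (look for the 25G NICs here)
      ansible.builtin.shell: lspci | grep -i ethernet
      register: nic_pci
      changed_when: false

    - name: Show what was found
      ansible.builtin.debug:
        msg:
          - "{{ gpu_pci.stdout_lines }}"
          - "{{ nic_pci.stdout_lines }}"
```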

2. OS & Software Stack

  • Ubuntu Version:
    While Ubuntu 22.04 is stable, 24.04 ships a newer kernel with better support for recent hardware (e.g., H100 GPUs) and a longer window of security updates.
  • Ansible Playbooks:
    Create reusable Ansible playbooks (a minimal sketch follows this list) for:
    • OS installation (e.g., Ubuntu 24.04).
    • GPU driver installation (NVIDIA).
    • Network bonding (e.g., netplan with mode: active-backup).
    • NFS mount configuration.
  • CUDA & ROCm:
    Both candidate GPUs (H100/L40) are NVIDIA, so install a CUDA toolkit version that matches the driver and is supported by Boltz; ROCm is only relevant if AMD GPUs are ever added.
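
A minimal playbook sketch for the driver and NFS items, assuming Ubuntu hosts in an inventory group named ml_servers; the driver version, NFS server name, and paths are placeholders to adapt:

```yaml
# Sketch of a provisioning playbook: NVIDIA driver + NFS client + mount.
- hosts: ml_servers             # assumed inventory group
  become: true
  tasks:
    - name: Install NVIDIA driver and NFS client packages
      ansible.builtin.apt:
        name:
          - nvidia-driver-535   # example version; pin to whatever gets validated
          - nfs-common
        state: present
        update_cache: true

    - name: Mount the Qumulo NFS export (server and paths are placeholders)
      ansible.posix.mount:
        src: qumulo.example.internal:/datasets
        path: /mnt/datasets
        fstype: nfs
        opts: rw,hard,noatime,nconnect=8
        state: mounted
```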

3. Network Configuration

  • Bonding Mode:
    Use active-backup (mode 1) for simple failover; it works with any switch and does not need LACP. If the goal is aggregated bandwidth across both 25 Gbit links instead, use 802.3ad, which does require LACP (Link Aggregation Control Protocol) support on the switches. A netplan sketch follows this list.
  • Firewall Rules:
    Configure OPNsense to allow traffic between:
    • The server and Qumulo storage (NFS).
    • The server and other compute nodes (if using Ansible for clustering).
  • Qumulo NFS Optimization:
    Use NFS mount options suited to large datasets, e.g. noatime, hard, and generous rsize/wsize (tcp is already the default transport). Consider a systemd automount or a cron job that checks the NFS mount is still healthy.
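
A netplan sketch for the active-backup bond; the NIC names, addresses, and gateway are placeholders to adapt to the real interfaces:

```yaml
# /etc/netplan/01-bond.yaml (sketch): active-backup failover over the two 25G NICs
network:
  version: 2
  ethernets:
    ens1f0:                    # placeholder NIC names; check `ip link` on the real box
      dhcp4: false
    ens1f1:
      dhcp4: false
  bonds:
    bond0:
      interfaces: [ens1f0, ens1f1]
      parameters:
        mode: active-backup
        mii-monitor-interval: 100
        primary: ens1f0
      addresses: [10.0.10.20/24]      # example address
      routes:
        - to: default
          via: 10.0.10.1              # example gateway
      nameservers:
        addresses: [10.0.10.53]       # example resolver
```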

4. GPU & Multi-GPU Support

  • Driver Installation:
    Install the latest NVIDIA drivers (e.g., via nvidia-driver-535 package) and verify nvidia-smi works.
  • Multi-GPU Configuration:
    • For H100 GPUs: Enable NVLink (if supported) for faster inter-GPU communication.
    • For L40 GPUs: Use PCIe 4.0 for optimal bandwidth.
  • CUDA Multi-GPU Support:
    Ensure Boltz actually uses both GPUs. It is PyTorch-based, so multi-GPU use is typically controlled through its device settings or CUDA_VISIBLE_DEVICES rather than CUDA-aware MPI; a verification sketch follows this list.
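
A small verification play, assuming the NVIDIA driver is already installed; the expected GPU count and inventory group are assumptions for this box:

```yaml
# Sketch: verify both GPUs are visible and inspect the inter-GPU topology (NVLink vs PCIe).
- hosts: ml_servers             # assumed inventory group
  become: true
  tasks:
    - name: Count visible GPUs
      ansible.builtin.shell: nvidia-smi -L | wc -l
      register: gpu_count
      changed_when: false

    - name: Fail if fewer than 2 GPUs are visible
      ansible.builtin.assert:
        that: gpu_count.stdout | int >= 2
        fail_msg: "Expected 2 GPUs, found {{ gpu_count.stdout }}"

    - name: Show GPU-to-GPU topology (look for NV# links on H100, PIX/PHB on L40)
      ansible.builtin.command: nvidia-smi topo -m
      register: topo
      changed_when: false

    - name: Print topology matrix
      ansible.builtin.debug:
        var: topo.stdout_lines
```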

5. Storage & Performance

  • RAM Sufficiency:
    512 GB RAM is sufficient for most workloads, but monitor memory usage with tools like htop or free -h. Optimize Boltz to minimize memory overhead (e.g., batch processing).
  • Disk I/O:
    Use NVMe SSDs for temporary storage (e.g., /tmp, /var) to reduce disk contention.
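
If a dedicated NVMe device is set aside for local scratch, it could be formatted and mounted roughly like this; the device path and mount point are placeholders, and the filesystem step is destructive:

```yaml
# Sketch: dedicate an NVMe device to local scratch (device name is a placeholder).
- hosts: ml_servers             # assumed inventory group
  become: true
  tasks:
    - name: Create an ext4 filesystem on the scratch NVMe (destructive, verify the device first)
      community.general.filesystem:
        dev: /dev/nvme1n1       # placeholder device
        fstype: ext4

    - name: Mount it as local scratch
      ansible.posix.mount:
        src: /dev/nvme1n1
        path: /scratch
        fstype: ext4
        opts: noatime
        state: mounted
```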

6. Security & Monitoring

  • Centralized Logging:
    Set up a centralized logging system (e.g., ELK stack) to monitor server activity and GPU utilization.
  • SELinux/AppArmor:
    Enable security modules (e.g., AppArmor) to restrict access to critical resources.
  • Prometheus + Grafana:
    Monitor GPU utilization (nvidia-smi or the DCGM exporter), network throughput (iftop), and storage I/O (iostat); a scrape-config sketch follows this list.
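
A Prometheus scrape-config sketch, assuming node_exporter on its default port 9100 and NVIDIA's dcgm-exporter on its default port 9400; the hostname is a placeholder:

```yaml
# prometheus.yml fragment (sketch): host metrics plus GPU metrics via dcgm-exporter
scrape_configs:
  - job_name: ml-server-node
    static_configs:
      - targets: ["ml-server01.example.internal:9100"]   # node_exporter (default port)

  - job_name: ml-server-gpu
    static_configs:
      - targets: ["ml-server01.example.internal:9400"]   # dcgm-exporter (default port)
```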

7. Additional Tasks

  • Backup & Recovery:
    Set up regular backups for critical data (e.g., Boltz configurations, data on the NFS mounts); a cron sketch follows this list.
  • Disaster Recovery Plan:
    Define steps for hardware failure (e.g., GPU replacement, network redundancy).
  • Documentation:
    Document all configurations (Ansible playbooks, firewall rules, GPU settings) for future reference.
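
One lightweight option for the configuration backups is a cron-driven rsync to the NFS share; the schedule and all paths below are placeholders:

```yaml
# Sketch: nightly rsync of local configuration to the NFS share (paths are placeholders).
- hosts: ml_servers             # assumed inventory group
  become: true
  tasks:
    - name: Nightly backup of /etc and Boltz configuration to the Qumulo share
      ansible.builtin.cron:
        name: "config backup"
        minute: "0"
        hour: "2"
        job: "rsync -a /etc /opt/boltz/config /mnt/datasets/backups/$(hostname)/"
```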

Summary of Missing Items

| Category   | Missing Tasks |
| ---------- | ------------- |
| Hardware   | PCIe slot availability, power/cooling headroom, transceiver compatibility (SFP28) |
| Networking | Bonding mode choice (active-backup vs. LACP/802.3ad), OPNsense firewall rules, NFS mount optimization |
| GPU Setup  | NVLink/PCIe configuration, CUDA multi-GPU support, driver compatibility |
| Storage    | Disk I/O optimization, RAM monitoring, local NVMe scratch setup |
| Security   | SELinux/AppArmor, centralized logging, access controls |
| Monitoring | Prometheus/Grafana, regular backups, disaster recovery plan |