
Your task list is comprehensive, but there are additional considerations to ensure smooth operation, especially with high-performance hardware and large datasets. Here's a breakdown of what you might be missing:


1. Hardware & Networking Considerations

  • PCIe Slot Availability:
    Ensure the server has enough PCIe slots (and physical spacing) for both GPUs (H100/L40) and the 25 Gbit NICs. Dual H100 PCIe cards can be paired with an NVLink bridge, which requires adjacent, compatible slots; also confirm the motherboard's slots are PCIe 4.0 or higher so the GPUs get full bandwidth.
  • Transceivers:
    If using 25 Gbit ports, confirm the NICs and switches support the required transceivers (SFP28 for 25 GbE; SFP+ tops out at 10 GbE).
  • Power & Cooling:
    Verify the PSU can handle the combined power draw of dual GPUs (H100 PCIe: ~350 W each, L40: ~300 W each) on top of CPUs and NICs. Ensure adequate cooling for the server chassis.
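
If a quick software-side sanity check is useful once an OS is on the box, a short Ansible pre-flight play can confirm the GPUs and 25 Gbit NICs actually show up on the PCIe bus. The inventory group name and grep patterns below are assumptions, not part of the plan:

```yaml
# Hypothetical pre-flight check: confirm GPUs and NICs are visible on the PCIe bus.
- hosts: ml_servers          # assumed inventory group
  become: true
  tasks:
    - name: List NVIDIA GPUs on the PCIe bus
      ansible.builtin.shell: lspci | grep -i nvidia
      register: gpu_pci
      changed_when: false

    - name: List Ethernet controllers (look for the 25G NICs here)
      ansible.builtin.shell: lspci | grep -i ethernet
      register: nic_pci
      changed_when: false

    - name: Show what was found
      ansible.builtin.debug:
        msg:
          - "{{ gpu_pci.stdout_lines }}"
          - "{{ nic_pci.stdout_lines }}"
```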

2. OS & Software Stack

  • Ubuntu Version:
    While Ubuntu 22.04 is stable, 24.04 ships a newer kernel with better support for recent hardware (e.g., H100 GPUs) and a longer window of security updates.
  • Ansible Playbooks:
    Create reusable Ansible playbooks (a minimal sketch follows this list) for:
    • OS installation (e.g., Ubuntu 24.04).
    • GPU driver installation (NVIDIA).
    • Network bonding (e.g., netplan with mode: active-backup).
    • NFS mount configuration.
  • CUDA & ROCm:
    Both candidate GPUs (H100/L40) are NVIDIA, so install a CUDA toolkit version that matches the driver and is supported by Boltz; ROCm is only relevant if AMD GPUs are ever added.
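
A minimal playbook sketch for the driver and NFS items, assuming Ubuntu hosts in an inventory group named ml_servers; the driver version, NFS server name, and paths are placeholders to adapt:

```yaml
# Sketch of a provisioning playbook: NVIDIA driver + NFS client + mount.
- hosts: ml_servers             # assumed inventory group
  become: true
  tasks:
    - name: Install NVIDIA driver and NFS client packages
      ansible.builtin.apt:
        name:
          - nvidia-driver-535   # example version; pin to whatever gets validated
          - nfs-common
        state: present
        update_cache: true

    - name: Mount the Qumulo NFS export (server and paths are placeholders)
      ansible.posix.mount:
        src: qumulo.example.internal:/datasets
        path: /mnt/datasets
        fstype: nfs
        opts: rw,hard,noatime,nconnect=8
        state: mounted
```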

3. Network Configuration

  • Bonding Mode:
    Use active-backup (mode 1) for simple failover; it works with any switch and does not need LACP. If the goal is aggregated bandwidth across both 25 Gbit links instead, use 802.3ad, which does require LACP (Link Aggregation Control Protocol) support on the switches. A netplan sketch follows this list.
  • Firewall Rules:
    Configure OPNsense to allow traffic between:
    • The server and Qumulo storage (NFS).
    • The server and other compute nodes (if using Ansible for clustering).
  • Qumulo NFS Optimization:
    Use NFS mount options suited to large datasets, e.g. noatime, hard, and generous rsize/wsize (tcp is already the default transport). Consider a systemd automount or a cron job that checks the NFS mount is still healthy.
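
A netplan sketch for the active-backup bond; the NIC names, addresses, and gateway are placeholders to adapt to the real interfaces:

```yaml
# /etc/netplan/01-bond.yaml (sketch): active-backup failover over the two 25G NICs
network:
  version: 2
  ethernets:
    ens1f0:                    # placeholder NIC names; check `ip link` on the real box
      dhcp4: false
    ens1f1:
      dhcp4: false
  bonds:
    bond0:
      interfaces: [ens1f0, ens1f1]
      parameters:
        mode: active-backup
        mii-monitor-interval: 100
        primary: ens1f0
      addresses: [10.0.10.20/24]      # example address
      routes:
        - to: default
          via: 10.0.10.1              # example gateway
      nameservers:
        addresses: [10.0.10.53]       # example resolver
```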

4. GPU & Multi-GPU Support

  • Driver Installation:
    Install the latest NVIDIA drivers (e.g., via nvidia-driver-535 package) and verify nvidia-smi works.
  • Multi-GPU Configuration:
    • For H100 GPUs: Enable NVLink (if supported) for faster inter-GPU communication.
    • For L40 GPUs: Use PCIe 4.0 for optimal bandwidth.
  • CUDA Multi-GPU Support:
    Ensure Boltz actually uses both GPUs. It is PyTorch-based, so multi-GPU use is typically controlled through its device settings or CUDA_VISIBLE_DEVICES rather than CUDA-aware MPI; a verification sketch follows this list.
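
A small verification play, assuming the NVIDIA driver is already installed; the expected GPU count and inventory group are assumptions for this box:

```yaml
# Sketch: verify both GPUs are visible and inspect the inter-GPU topology (NVLink vs PCIe).
- hosts: ml_servers             # assumed inventory group
  become: true
  tasks:
    - name: Count visible GPUs
      ansible.builtin.shell: nvidia-smi -L | wc -l
      register: gpu_count
      changed_when: false

    - name: Fail if fewer than 2 GPUs are visible
      ansible.builtin.assert:
        that: gpu_count.stdout | int >= 2
        fail_msg: "Expected 2 GPUs, found {{ gpu_count.stdout }}"

    - name: Show GPU-to-GPU topology (look for NV# links on H100, PIX/PHB on L40)
      ansible.builtin.command: nvidia-smi topo -m
      register: topo
      changed_when: false

    - name: Print topology matrix
      ansible.builtin.debug:
        var: topo.stdout_lines
```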

5. Storage & Performance

  • RAM Sufficiency:
    512 GB RAM is sufficient for most workloads, but monitor memory usage with tools like htop or free -h. Optimize Boltz to minimize memory overhead (e.g., batch processing).
  • Disk I/O:
    Use NVMe SSDs for temporary storage (e.g., /tmp, /var) to reduce disk contention.
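
If a dedicated NVMe device is set aside for local scratch, it could be formatted and mounted roughly like this; the device path and mount point are placeholders, and the filesystem step is destructive:

```yaml
# Sketch: dedicate an NVMe device to local scratch (device name is a placeholder).
- hosts: ml_servers             # assumed inventory group
  become: true
  tasks:
    - name: Create an ext4 filesystem on the scratch NVMe (destructive, verify the device first)
      community.general.filesystem:
        dev: /dev/nvme1n1       # placeholder device
        fstype: ext4

    - name: Mount it as local scratch
      ansible.posix.mount:
        src: /dev/nvme1n1
        path: /scratch
        fstype: ext4
        opts: noatime
        state: mounted
```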

6. Security & Monitoring

  • Centralized Logging:
    Set up a centralized logging system (e.g., ELK stack) to monitor server activity and GPU utilization.
  • SELinux/AppArmor:
    Enable security modules (e.g., AppArmor) to restrict access to critical resources.
  • Prometheus + Grafana:
    Monitor GPU utilization (nvidia-smi or the DCGM exporter), network throughput (iftop), and storage I/O (iostat); a scrape-config sketch follows this list.
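
A Prometheus scrape-config sketch, assuming node_exporter on its default port 9100 and NVIDIA's dcgm-exporter on its default port 9400; the hostname is a placeholder:

```yaml
# prometheus.yml fragment (sketch): host metrics plus GPU metrics via dcgm-exporter
scrape_configs:
  - job_name: ml-server-node
    static_configs:
      - targets: ["ml-server01.example.internal:9100"]   # node_exporter (default port)

  - job_name: ml-server-gpu
    static_configs:
      - targets: ["ml-server01.example.internal:9400"]   # dcgm-exporter (default port)
```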

7. Additional Tasks

  • Backup & Recovery:
    Set up regular backups for critical data (e.g., Boltz configurations, data on the NFS mounts); a cron sketch follows this list.
  • Disaster Recovery Plan:
    Define steps for hardware failure (e.g., GPU replacement, network redundancy).
  • Documentation:
    Document all configurations (Ansible playbooks, firewall rules, GPU settings) for future reference.
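
One lightweight option for the configuration backups is a cron-driven rsync to the NFS share; the schedule and all paths below are placeholders:

```yaml
# Sketch: nightly rsync of local configuration to the NFS share (paths are placeholders).
- hosts: ml_servers             # assumed inventory group
  become: true
  tasks:
    - name: Nightly backup of /etc and Boltz configuration to the Qumulo share
      ansible.builtin.cron:
        name: "config backup"
        minute: "0"
        hour: "2"
        job: "rsync -a /etc /opt/boltz/config /mnt/datasets/backups/$(hostname)/"
```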

Summary of Missing Items

| Category   | Missing Tasks |
| ---------- | ------------- |
| Hardware   | PCIe slot availability, power/cooling headroom, transceiver compatibility (SFP28) |
| Networking | Bonding mode choice (active-backup vs. LACP/802.3ad), OPNsense firewall rules, NFS mount optimization |
| GPU Setup  | NVLink/PCIe configuration, CUDA multi-GPU support, driver compatibility |
| Storage    | Disk I/O optimization, RAM monitoring, local NVMe scratch setup |
| Security   | SELinux/AppArmor, centralized logging, access controls |
| Monitoring | Prometheus/Grafana, regular backups, disaster recovery plan |