notes/projects/neosphere/ml-server/20250827-new-server-setup.md
2025-09-21 19:15:25 +02:00

mail to bjoern

  • boltz: sample data for testing the setup; boltz apparently only needs cuda installed

  • ollama: folder on qumulo for docker data

  • nfs docker mount timing

  • explicit gpu

  • searxng openwebui

  • collect all URLs

Steps

  1. mounting and cabling
  2. check bios settings
  3. setup storage ?? or do they have hardware raid1
  4. setup iLo
  5. os installation via usb stick - prepare beforehand
  6. ansible base install (sec, packages, docker)
  7. ansible compose
  8. ansible nfs - mount qumulo share(s)
  9. manual 25 Gbit/s config -> use saved netplan file
  10. manual nvidia install following the manual below (nvidia driver, cuda driver and container toolkit)
  11. install beszel agent
  12. spin up containers and test them
  13. install boltz and test it
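
Steps 8 and 12 interact with the "nfs docker mount timing" item: containers that keep data on the qumulo share should only start once the share is actually mounted. A minimal sketch of one way to guard for that, assuming `mountpoint` from util-linux is available. The function name `wait_for_mount` and the paths in the commented example are hypothetical, not taken from the actual setup.

```shell
#!/usr/bin/env bash
# Wait until a path is an active mount point, or give up after a timeout.
# Usage: wait_for_mount <path> [timeout_seconds]   (default timeout: 60)
wait_for_mount() {
  local path="$1" timeout="${2:-60}" waited=0
  until mountpoint -q "$path"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timeout: $path is not mounted after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Example with hypothetical paths: start the stack only once the share is there.
# wait_for_mount /mnt/qumulo 120 && docker compose -f /opt/stack/compose.yml up -d
```

Polling in a wrapper script is only one option; expressing the dependency in systemd (Docker starting after the mount unit) may turn out cleaner once the mount layout is settled.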

TODO

  • (optional) clean from snap
  • [=] beszel reverse proxying via firewall. sophos intuitively not made for this
  • [=] install beszel agent on all devices
  • extend network diagram
  • write ansible playbook?
  • test ansible construct
  • prepare boot stick

base

  • Hostname: neo-srv-ai-01
  • IP Address: 192.168.60.203
  • Floating IP: 192.168.60.213
  • iLo IP: 192.168.50.213

ansible-roles

  • geerlingguy.security

  • geerlingguy.docker

  • nfs-client (mount qumulo shares)

  • users (separate)

  • nvidia (driver) -> do manually

  • interfaces (25 Gbit/s NICs) -> do manually

Manual nvidia driver, cuda driver and container toolkit

NVIDIA driver

Check if GPUs are recognized by the base OS:

sudo lspci | grep -i nvidia

This should print some output if NVIDIA devices are found.

Search for required drivers for your GPUs:

sudo ubuntu-drivers devices

Automatically install all drivers:

sudo ubuntu-drivers autoinstall

Reboot the system for changes to take effect:

sudo reboot

Show GPU stats with:

nvidia-smi
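
For a quick check after the reboot, a compact per-GPU line can be easier to compare against the expected hardware than the full `nvidia-smi` table. A small sketch using `nvidia-smi`'s query mode; the wrapper name `gpu_summary` is hypothetical.

```shell
#!/usr/bin/env bash
# Print one CSV line per GPU: index, name, driver version, total memory.
# Returns 127 if nvidia-smi is not on PATH (driver missing or not yet loaded).
gpu_summary() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; driver not installed or not on PATH" >&2
    return 127
  fi
  nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv,noheader
}

# Usage: gpu_summary
```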

Cuda driver

Disable Secure Boot in BIOS

Install the CUDA toolkit and drivers:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt -y install cuda-toolkit-12-8
sudo apt install -y cuda-drivers
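
After the toolkit install, `nvcc` is usually not on `PATH` yet. The sketch below assumes NVIDIA's deb packages placed the toolkit under `/usr/local/cuda` (a symlink to the versioned directory); that default and the helper name `add_cuda_to_path` are assumptions to check on the actual machine.

```shell
#!/usr/bin/env bash
# Prepend a CUDA bin directory to PATH if it exists and is not already listed.
# The default location is an assumption based on NVIDIA's deb packages.
add_cuda_to_path() {
  local cuda_bin="${1:-/usr/local/cuda/bin}"
  [ -d "$cuda_bin" ] || { echo "no such directory: $cuda_bin" >&2; return 1; }
  case ":$PATH:" in
    *":$cuda_bin:"*) ;;                     # already present, nothing to do
    *) export PATH="$cuda_bin:$PATH" ;;
  esac
}

# Usage: add_cuda_to_path && nvcc --version
```

To make this permanent, the same `export` line could go into a shell profile for the users that need the compiler.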

Container toolkit

Install the Nvidia Container toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
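
NVIDIA's install guide follows the package install with a step that registers the NVIDIA runtime with Docker and restarts the daemon. It is not in the notes above, so treat the sketch below as a likely missing step rather than a confirmed part of this setup; the wrapper name `configure_docker_nvidia_runtime` is hypothetical.

```shell
#!/usr/bin/env bash
# Register the NVIDIA runtime in Docker's daemon config and restart Docker.
# Returns 127 if nvidia-ctk (shipped with nvidia-container-toolkit) is missing.
configure_docker_nvidia_runtime() {
  if ! command -v nvidia-ctk >/dev/null 2>&1; then
    echo "nvidia-ctk not found; install nvidia-container-toolkit first" >&2
    return 127
  fi
  sudo nvidia-ctk runtime configure --runtime=docker \
    && sudo systemctl restart docker
}

# Usage: configure_docker_nvidia_runtime
```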

Test a simple cuda container and nvidia-smi command inside:

docker run --rm --gpus all nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
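
For the "explicit gpu" item from the mail list: instead of `--gpus all`, Docker also accepts a device list, which needs an extra layer of quoting when more than one GPU is named. A small sketch of a helper that builds that value; the name `gpus_arg` is hypothetical and the GPU indices are examples only.

```shell
#!/usr/bin/env bash
# Build the value for docker's --gpus option from a list of GPU indices.
# The inner double quotes are part of the value, e.g.:
#   gpus_arg 0 2   prints   "device=0,2"
gpus_arg() {
  local IFS=,
  printf '"device=%s"' "$*"
}

# Example: run nvidia-smi on GPU 0 only.
# docker run --rm --gpus "$(gpus_arg 0)" nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
```

Pinning containers to specific GPUs this way keeps, for example, ollama and boltz from competing for the same device, if that is the intent behind the item.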