notes/projects/neosphere/ml-server/20250827-new-server-setup.md
Petar Cubela 584265c22c 20250907
2025-09-07 13:07:01 +02:00


## Mail to Bjoern
- boltz: sample data for testing the setup; boltz apparently only needs CUDA installed
- ollama: folder on Qumulo for Docker data
- [x] nfs docker mount timing
- [x] explicit gpu
- [x] searxng openwebui
- [x] collect all URLs
## Steps
1. [x] mounting and cabling
2. [x] check bios settings
3. [x] set up storage (or do they have hardware RAID 1?)
4. [x] set up iLO
5. [x] OS installation via USB stick (prepare beforehand)
6. [x] ansible base install (sec, packages, docker)
7. [x] ansible compose
8. [x] ansible nfs - mount qumulo share(s)
9. [x] manual 25 Gbit/s config -> use saved netplan file
10. [x] manual NVIDIA install following the manual below (NVIDIA driver, CUDA driver and container toolkit)
11. [x] install beszel agent
12. [x] spin up containers and test them
13. [x] install [boltz](https://github.com/jwohlwend/boltz) and test it
## TODO
- [ ] (optional) clean from snap
- [=] beszel reverse proxying via firewall; Sophos does not seem to be made for this
- [=] install beszel agent on all devices
- [ ] extend network diagram
- [x] write ansible playbook?
- [x] test ansible construct
- [x] prepare boot stick
## base
- Hostname: neo-srv-ai-01
- IP Address: 192.168.60.203
- Floating IP: 192.168.60.213
- iLO IP: 192.168.50.213
## ansible-roles
- [x] geerlingguy.security
- [x] geerlingguy.docker
- [x] nfs-client (mount qumulo shares)
- [ ] users (separate)
- [x] nvidia (driver) -> do manually
- [x] interfaces (25 Gbit/s NICs) -> do manually
## Manual nvidia driver, cuda driver and container toolkit
### NVIDIA driver
Check if GPUs are recognized by the base OS:
```bash
sudo lspci | grep -i nvidia
```
This should print some output if NVIDIA devices are found.
Search for required drivers for your GPUs:
```bash
sudo ubuntu-drivers devices
```
Automatically install all drivers:
```bash
sudo ubuntu-drivers autoinstall
```
Reboot the system for changes to take effect:
```bash
sudo reboot
```
Show GPU stats with:
```bash
nvidia-smi
```
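For a more compact check than the full `nvidia-smi` table, its query flags can be used. A minimal sketch: the function name `gpu_summary` is made up for illustration, and the selected fields are just one reasonable choice.

```bash
# Print one CSV line per GPU: index, name, driver version, total memory.
# gpu_summary is a hypothetical helper name, not part of any NVIDIA tooling.
gpu_summary() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: driver missing or reboot still pending" >&2
        return 1
    fi
    nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv,noheader
}
```

Calling `gpu_summary` after the reboot should list every installed GPU; a non-zero exit status means the driver is not usable yet.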
### CUDA driver
**Disable Secure Boot in BIOS** before installing.

Install the CUDA toolkit and drivers:
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt -y install cuda-toolkit-12-8
sudo apt install -y cuda-drivers
```
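The toolkit packages typically do not put `nvcc` on the `PATH` by themselves. A sketch of the usual post-install step, assuming the default install prefix `/usr/local/cuda-12.8` for the `cuda-toolkit-12-8` package:

```bash
# Assumed default prefix for the cuda-toolkit-12-8 apt package
CUDA_HOME=/usr/local/cuda-12.8

# Make nvcc and the CUDA libraries visible in the current shell
export PATH="$CUDA_HOME/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Report the compiler version if the toolkit is really there
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found under $CUDA_HOME/bin" >&2
fi
```

To make this permanent, the two `export` lines can be appended to `~/.bashrc` or a file under `/etc/profile.d/`.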
### Container toolkit
Install the Nvidia Container toolkit:
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
```
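NVIDIA's install guide follows the package installation with a runtime configuration step, which may be needed here as well, for example for compose files that set `runtime: nvidia`. A sketch of that step, assuming Docker is managed by systemd; the variable `toolkit_configured` exists only for illustration:

```bash
# Register the NVIDIA runtime in /etc/docker/daemon.json and restart Docker.
# Guarded so it does nothing on a host where the toolkit is not installed.
if command -v nvidia-ctk >/dev/null 2>&1; then
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
    toolkit_configured=yes
else
    echo "nvidia-ctk not found; install nvidia-container-toolkit first" >&2
    toolkit_configured=no
fi
```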
Test by running `nvidia-smi` inside a simple CUDA container:
```bash
docker run --rm --gpus all nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
```
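To pin a container to one specific GPU rather than all of them, `--gpus` also accepts a device selector. A sketch, assuming GPU index 0 exists and reusing the image tag from the test above; `run_on_gpu` is a hypothetical helper name:

```bash
# Run nvidia-smi in a container that can see only the selected GPU.
# The argument is a single GPU index or UUID (defaults to 0).
run_on_gpu() {
    local device="${1:-0}"
    docker run --rm --gpus "device=${device}" \
        nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
}
```

`run_on_gpu 1` should then show exactly one GPU in the `nvidia-smi` table. Selecting several GPUs at once needs extra quoting (`--gpus '"device=0,1"'`), which this helper does not handle.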