Petar Cubela
2025-09-07 13:07:01 +02:00
parent c83d178b77
commit 584265c22c
92 changed files with 3011 additions and 100 deletions

@@ -0,0 +1,117 @@
## Mail to Bjoern
- boltz: sample data for testing the setup; boltz apparently only needs CUDA installed
- ollama: folder on Qumulo for Docker data
- [x] nfs docker mount timing
- [x] explicit gpu
- [x] searxng openwebui
- [x] collect all URLs
## Steps
1. [x] mounting and cabling
2. [x] check BIOS settings
3. [x] set up storage (or do they have hardware RAID 1?)
4. [x] set up iLO
5. [x] OS installation via USB stick (prepare beforehand)
6. [x] ansible base install (sec, packages, docker)
7. [x] ansible compose
8. [x] ansible nfs - mount qumulo share(s)
9. [x] manual 25 Gbit/s config -> use saved netplan file
10. [x] manual NVIDIA install (NVIDIA driver, CUDA driver and container toolkit)
11. [x] install beszel agent
12. [x] spin up containers and test them
13. [x] install [boltz](https://github.com/jwohlwend/boltz) and test it
## TODO
- [ ] (optional) clean from snap
- [=] beszel reverse proxying via firewall; Sophos does not seem to be designed for this
- [=] install beszel agent on all devices
- [ ] extend network diagram
- [x] write ansible playbook?
- [x] test ansible construct
- [x] prepare boot stick
## base
- Hostname: neo-srv-ai-01
- IP Address: 192.168.60.203
- Floating IP: 192.168.60.213
- iLO IP: 192.168.50.213
## ansible-roles
- [x] geerlingguy.security
- [x] geerlingguy.docker
- [x] nfs-client (mount qumulo shares)
- [ ] users (separate)
- [x] nvidia (driver) -> do manually
- [x] interfaces (25 Gbit/s NICs) -> do manually
## Manual NVIDIA driver, CUDA driver and container toolkit
### NVIDIA driver
Check if GPUs are recognized by the base OS:
```bash
sudo lspci | grep -i nvidia
```
This should print at least one line if NVIDIA devices are found.
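For a scripted check (for example from a playbook), the same lookup can be turned into a count. This is only a sketch: the helper name `count_nvidia_devices` is made up for illustration, and it counts every PCI function that mentions NVIDIA, which may include audio functions and not only the GPUs themselves.

```bash
#!/usr/bin/env bash
# Count lines of lspci-style text on stdin that mention NVIDIA (case-insensitive).
# Hypothetical helper; reading from stdin lets it be tried without real hardware.
count_nvidia_devices() {
    # grep -c prints the count; it exits non-zero on zero matches, hence "|| true".
    grep -ci 'nvidia' || true
}

# On the server itself (lspci comes from the pciutils package):
if command -v lspci >/dev/null 2>&1; then
    echo "NVIDIA PCI functions found: $(lspci | count_nvidia_devices)"
fi
```

A count of `0` on the server would point at a seating, power, or BIOS problem rather than a driver problem.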
Search for required drivers for your GPUs:
```bash
sudo ubuntu-drivers devices
```
Automatically install all drivers:
```bash
sudo ubuntu-drivers autoinstall
```
Reboot the system for changes to take effect:
```bash
sudo reboot
```
Show GPU stats with:
```bash
nvidia-smi
```
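When the output is meant for a log or a monitoring check rather than for reading, `nvidia-smi` can also print selected fields as CSV. A minimal sketch, assuming the driver is installed and the host has been rebooted; the wrapper name `gpu_summary` is hypothetical:

```bash
#!/usr/bin/env bash
# Print one CSV line per GPU: index, name, driver version, total memory.
# Returns non-zero if nvidia-smi is missing or fails.
gpu_summary() {
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv,noheader
}

if ! gpu_summary; then
    echo "nvidia-smi missing or failing; is the driver installed and the host rebooted?" >&2
fi
```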
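### CUDA driver
**Disable Secure Boot in BIOS** before installing; with Secure Boot enabled, unsigned NVIDIA kernel modules may fail to load.

Install the CUDA toolkit and drivers: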
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt -y install cuda-toolkit-12-8
sudo apt install -y cuda-drivers
```
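The toolkit packages do not put `nvcc` on the `PATH` by themselves. The sketch below assumes the default install prefix `/usr/local/cuda-12.8` for `cuda-toolkit-12-8`; adjust `CUDA_HOME` if the toolkit lives elsewhere.

```bash
#!/usr/bin/env bash
# Assumption: cuda-toolkit-12-8 was installed to its default prefix.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.8}"

# Prepend the toolkit's bin directory to PATH unless it is already there.
case ":${PATH}:" in
    *":${CUDA_HOME}/bin:"*) ;;
    *) PATH="${CUDA_HOME}/bin:${PATH}" ;;
esac
export PATH

# Confirm that the compiler is reachable.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found under ${CUDA_HOME}/bin" >&2
fi
```

To make the change permanent, the `PATH` lines could go into a shell profile such as `~/.bashrc`.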
### Container toolkit
Install the NVIDIA Container Toolkit:
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
```
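NVIDIA's install guide follows the package installation with registering the runtime in Docker's configuration and restarting the daemon. The sketch below wraps those two commands; the `run` helper and the `DRY_RUN` switch are hypothetical additions for previewing, not part of any NVIDIA tooling.

```bash
#!/usr/bin/env bash
# DRY_RUN=1 (hypothetical switch for this sketch) prints commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

configure_docker_gpu_runtime() {
    # Register the NVIDIA runtime in /etc/docker/daemon.json.
    run sudo nvidia-ctk runtime configure --runtime=docker || return 1
    # Docker only picks up daemon.json changes after a restart.
    run sudo systemctl restart docker
}

# Preview only; call the function without DRY_RUN=1 on the server to apply.
DRY_RUN=1 configure_docker_gpu_runtime
```

Restarting Docker stops running containers unless live-restore is enabled, so this is best done before the containers are spun up.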
Test by running `nvidia-smi` inside a simple CUDA container:
```bash
docker run --rm --gpus all nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
```
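For the "explicit gpu" item, a container can be limited to specific GPUs instead of `all`. The sketch below is one way to do this with `docker run`, not necessarily how the compose files on this host do it; the helper `gpus_arg` is hypothetical and the GPU indices are examples.

```bash
#!/usr/bin/env bash
# Build the value for docker's --gpus flag from a list of GPU indices.
# The inner double quotes are needed when more than one device is listed.
# Runs in a subshell so the IFS change does not leak into the caller.
gpus_arg() (
    IFS=','
    printf '"device=%s"' "$*"
)

IMAGE="nvidia/cuda:13.0.0-base-ubuntu24.04"

if command -v docker >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    # Only GPU 0 should be listed inside the container.
    docker run --rm --gpus "$(gpus_arg 0)" "$IMAGE" nvidia-smi -L
fi
```

`gpus_arg 0 1` yields `"device=0,1"`, which exposes the first two GPUs.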