20250907
projects/neosphere/ml-server/20250827-new-server-setup.md
@@ -0,0 +1,117 @@
## Mail to Bjoern

- boltz: sample data for testing the setup; boltz apparently only needs CUDA installed
- ollama: folder on Qumulo for Docker data

- [x] nfs docker mount timing
- [x] explicit gpu
- [x] searxng openwebui
- [x] collect all URLs

## Steps

1. [x] mounting and cabling
2. [x] check BIOS settings
3. [x] set up storage (or do they have hardware RAID 1?)
4. [x] set up iLO
5. [x] OS installation via USB stick (prepare beforehand)
6. [x] ansible base install (sec, packages, docker)
7. [x] ansible compose
8. [x] ansible nfs - mount qumulo share(s)
9. [x] manual 25 Gbit/s config -> use saved netplan file
10. [x] manual NVIDIA install following the manual section (NVIDIA driver, CUDA driver and container toolkit)
11. [x] install beszel agent
12. [x] spin up containers and test them
13. [x] install [boltz](https://github.com/jwohlwend/boltz) and test it

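Step 9 can be sanity-checked once the saved netplan file has been applied. The following is a minimal sketch, assuming the kernel exposes the negotiated link speed in Mb/s under `/sys/class/net/<iface>/speed`. The interface name `ens1f0` and the helper `link_speed_ok` are hypothetical and not taken from the saved netplan file.

```bash
#!/usr/bin/env bash
# Return 0 if the interface reports at least the wanted speed (Mb/s),
# 1 if it is slower (or the link is down), 2 if the speed cannot be read.
link_speed_ok() {
  local iface="$1" want="${2:-25000}" root="${3:-/sys/class/net}"
  local speed
  speed="$(cat "$root/$iface/speed" 2>/dev/null)" || return 2
  [ "$speed" -ge "$want" ]
}

# ens1f0 is an assumed interface name; replace it with the real 25 Gbit/s NIC.
if link_speed_ok ens1f0; then
  echo "ens1f0 runs at 25 Gbit/s or faster"
else
  echo "ens1f0 is slower than 25 Gbit/s, down, or not present"
fi
```

The optional third argument only exists so the check can be pointed at a different sysfs root for testing.
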
## TODO

- [ ] (optional) remove snap
- [=] beszel reverse proxying via firewall; Sophos is apparently not made for this
- [=] install beszel agent on all devices
- [ ] extend network diagram
- [x] write ansible playbook?
- [x] test ansible construct
- [x] prepare boot stick

## base

- Hostname: neo-srv-ai-01
- IP address: 192.168.60.203
- Floating IP: 192.168.60.213
- iLO IP: 192.168.50.213

## ansible-roles

- [x] geerlingguy.security
- [x] geerlingguy.docker
- [x] nfs-client (mount qumulo shares)

- [ ] users (separate)

- [x] nvidia (driver) -> do manually
- [x] interfaces (25 Gbit/s NICs) -> do manually

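The `nfs docker mount timing` item in the checklist concerns Docker starting before the NFS share is mounted. One common way to handle this (not necessarily what was done on this server) is a systemd drop-in using `RequiresMountsFor=`. The sketch below only prints such a drop-in. The mount point `/mnt/qumulo`, the file name `wait-for-nfs.conf` and the helper `docker_mount_dropin` are assumptions.

```bash
#!/usr/bin/env bash
# Print a systemd drop-in that makes docker.service require and wait for
# the mount unit backing the given path.
docker_mount_dropin() {
  local mountpoint="$1"
  printf '[Unit]\nRequiresMountsFor=%s\n' "$mountpoint"
}

# /mnt/qumulo is a hypothetical mount point for the qumulo share.
docker_mount_dropin /mnt/qumulo

# To apply (as root), write the output to
#   /etc/systemd/system/docker.service.d/wait-for-nfs.conf
# and run: systemctl daemon-reload
```
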
## Manual NVIDIA driver, CUDA driver and container toolkit

### NVIDIA driver

Check if the GPUs are recognized by the base OS:

```bash
sudo lspci | grep -i nvidia
```

This should print at least one line per NVIDIA device it finds.

Search for the required drivers for your GPUs:

```bash
sudo ubuntu-drivers devices
```

Automatically install all drivers:

```bash
sudo ubuntu-drivers autoinstall
```

Reboot the system for the changes to take effect:

```bash
sudo reboot
```

Show GPU stats with:

```bash
nvidia-smi
```

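For a more compact check after the reboot, `nvidia-smi` can print selected fields as CSV. This is a small sketch; the helper name `gpu_summary` is made up for illustration, and the chosen fields are only one reasonable selection.

```bash
#!/usr/bin/env bash
# Print one CSV line per GPU, or a hint if the driver is not usable yet.
gpu_summary() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: driver not installed or reboot pending"
    return 1
  fi
  nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv,noheader
}

gpu_summary || true
```
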
### CUDA driver

**Disable Secure Boot in BIOS**

Install the CUDA drivers:

```bash
# Add NVIDIA's CUDA apt repository keyring for Ubuntu 24.04
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# Install the CUDA 12.8 toolkit and the driver packages
sudo apt install -y cuda-toolkit-12-8
sudo apt install -y cuda-drivers
```

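After the installation, the Secure Boot state and the toolkit can be verified. This is a sketch under two assumptions: `mokutil` is available on the host, and the toolkit landed in `/usr/local/cuda-12.8` (NVIDIA's usual prefix for this package). The helper `check_cuda` is hypothetical.

```bash
#!/usr/bin/env bash
# Assumption: the toolkit is installed under /usr/local/cuda-12.8.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.8}"

check_cuda() {
  # Report the Secure Boot state if mokutil is installed.
  if command -v mokutil >/dev/null 2>&1; then
    mokutil --sb-state || true
  fi
  # Report the compiler version if nvcc is where we expect it.
  if [ -x "$CUDA_HOME/bin/nvcc" ]; then
    "$CUDA_HOME/bin/nvcc" --version
  else
    echo "nvcc not found under $CUDA_HOME"
    return 1
  fi
}

check_cuda || echo "CUDA check failed" >&2
```
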
### Container toolkit

Install the NVIDIA Container Toolkit:

```bash
# Add the repository signing key and the apt source list
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the toolkit (root privileges are required)
sudo apt update
sudo apt install -y nvidia-container-toolkit
```

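Docker only picks up the GPUs once the NVIDIA runtime is registered in its daemon configuration. The notes do not list this step, so the following is a hedged sketch: it checks `/etc/docker/daemon.json` for the runtime and prints the usual `nvidia-ctk` commands if it is missing. The helper `has_nvidia_runtime` is hypothetical.

```bash
#!/usr/bin/env bash
# Succeed if the given Docker daemon.json mentions the NVIDIA runtime.
has_nvidia_runtime() {
  local cfg="${1:-/etc/docker/daemon.json}"
  [ -f "$cfg" ] && grep -q 'nvidia-container-runtime' "$cfg"
}

if has_nvidia_runtime; then
  echo "NVIDIA runtime already registered with Docker"
else
  echo "NVIDIA runtime missing; register it with:"
  echo "  sudo nvidia-ctk runtime configure --runtime=docker"
  echo "  sudo systemctl restart docker"
fi
```
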
Test a simple CUDA container by running `nvidia-smi` inside it:

```bash
docker run --rm --gpus all nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
```

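For the `explicit gpu` item from the checklist, a container can be pinned to specific devices instead of `--gpus all`. This is a sketch; the helper `gpus_arg` is hypothetical and GPU index 0 is an assumption. The embedded double quotes are part of the value Docker expects when several devices are listed.

```bash
#!/usr/bin/env bash
# Build the value for docker's --gpus flag from a list of GPU indices,
# e.g. "device=0,1" (including the literal double quotes).
gpus_arg() {
  local IFS=,
  printf '"device=%s"' "$*"
}

# Print the command for GPU 0 only; drop the leading `echo` to run it.
echo docker run --rm --gpus "$(gpus_arg 0)" nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi -L
```
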