Charmed Kubeflow on Ubuntu 22.04 with MicroK8s

This is a collection of notes on how to get Kubeflow running on Ubuntu 22.04 with MicroK8s. I'll add to it over time and try to fill in some of the resources that have been most useful.

Miller Hooks has been guiding me through some explorations in self-hosting the ops for training ML models and other GPU-heavy workloads on my own in-house infrastructure.

I'm excited about the future of machine learning and large language models but want to be an active participant and not a passive consumer of commercial APIs and various walled gardens.

Good clean fun 🌱

PRs to this post welcome!

Ubuntu System

I started out with an SSD and 64GB of RAM and over the course of the build ended up with NVMe storage and 128GB of RAM. I would recommend starting with the latter. 😅

This machine has two NVIDIA A4500 GPUs, since the purpose here is to run ML workloads and pipelines in Kubeflow.

Remove apache2 right out of the gate, so nothing is squatting on port 80 when we get to ingress later:

sudo apt remove apache2
sudo apt autoremove

I'm using SSH and connecting to Ubuntu from my MacBook Pro. It's more ergonomic for me to work that way, but I still installed Ubuntu Desktop so that I can use the GUI for some things. Very handy for troubleshooting.

sudo apt update
sudo apt install openssh-server curl build-essential
# Install Git
sudo apt install git

# Configure Git
git config --global user.name "Joel Hooks"
git config --global user.email "joel@pissoff.party"
git config --global init.defaultBranch 'main'
git config --global credential.helper store

I set the machine up with tools that I like such as zsh, neovim, the GitHub CLI, atuin, nodenv, and other bits and bobs that I personally like to have on hand, but they aren't essential to this process.

We are also using Tailscale, so I set that up:

curl -fsSL https://tailscale.com/install.sh | sh
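
The install script just puts the package in place; you still have to bring the interface up and authenticate (the first run prints a login URL to visit):

# bring the node onto the tailnet (prints an auth URL on first run)
sudo tailscale up
# print this machine's Tailscale IPv4 address, handy for SSH
tailscale ip -4

From there I SSH to the Tailscale IP from the MacBook.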

NVidia Drivers

You can list the available drivers with:

ubuntu-drivers devices

At the time nvidia-driver-530 was the latest driver. I installed it with:

sudo apt install nvidia-driver-530
sudo reboot now

Now we'll install the CUDA toolkit:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
sudo apt-get -y install nvidia-gds

From here you should be able to run nvidia-smi and see the GPUs!

You need to add cuda to your path:

export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"

These go in your .zshrc or .bashrc file (or wherever you keep that stuff).
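
Once those exports are in place (open a fresh shell or source the rc file), a quick sanity check that the toolkit is actually wired up:

# should report the CUDA release you just installed
nvcc --version
# should list both GPUs
nvidia-smi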

Go ahead and reboot again.

sudo reboot now

Docker and the NVIDIA Container Toolkit

Some documentation here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installation-guide

Just let them handle the install:

curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

This should print the familiar nvidia-smi table showing the GPUs in the machine, this time from inside a container.

🎉
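
If that container run fails, the first thing to check is whether nvidia-ctk actually registered the runtime with Docker. It writes its config into /etc/docker/daemon.json, so (assuming a stock Docker install) you should see an nvidia entry under runtimes there:

# confirm the nvidia runtime got registered with Docker
cat /etc/docker/daemon.json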

MicroK8s

MicroK8s is a single-node Kubernetes cluster that is easy to install and manage. It's a great way to get started with Kubernetes. It's fuckin cool tbh.

We are mostly following along with this guide: https://charmed-kubeflow.io/docs/get-started-with-charmed-kubeflow

sudo snap install microk8s --classic --channel=1.24/stable

Note that we specified a version for the install that is NOT the latest version. This is because Kubeflow has specific requirements and the latest MicroK8s doesn't run it properly, so we pinned to a specific known-working version.
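
If you want to see which channels are available before pinning, snap will list them all:

# the channels section at the bottom includes 1.24/stable
snap info microk8s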

We need to set up some permissions for MicroK8s on the system and join a group:

sudo usermod -a -G microk8s $USER
newgrp microk8s

# make sure the kube config directory exists before writing to it
mkdir -p ~/.kube
sudo chown -f -R $USER ~/.kube
microk8s config > ~/.kube/config

This copies the config into our home directory. There are instructions for how to manage this when multiple users are on the system, and it requires some additional work.
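
A quick way to prove the config works and the cluster is up:

# should show a single node in the Ready state
microk8s kubectl get nodes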

Fire up some MicroK8s goodies:

microk8s enable dns gpu hostpath-storage host-access ingress metallb:10.64.140.43-10.64.140.49
microk8s status --wait-ready

This differs from the docs in that we added gpu, since we have GPUs available. DNS, storage, ingress, and the load balancer are all required to run Kubeflow.

Generally I let this settle a bit before I do anything else and monitor the pods as they spin up:

watch -n 1 microk8s kubectl get pods --all-namespaces

I also like to use a tool like btop to monitor the system resources:

sudo snap install btop

That helps get a feel for the whole thing as we proceed.

cert-manager

We need to install cert-manager to manage the TLS certificates for the cluster. This isn't a requirement for Kubeflow, but it's great to have in place.

microk8s helm3 repo add jetstack https://charts.jetstack.io
microk8s helm3 install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --set installCRDs=true
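
Before moving on to issuers, it's worth confirming the cert-manager pods actually came up:

# expect cert-manager, cainjector, and webhook pods all Running
microk8s kubectl get pods -n cert-manager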

This also requires issuers to be set up. I'm using Let's Encrypt, so I followed the instructions here: https://cert-manager.io/docs/configuration/acme/

One big note here is that the ingressClassName needs to be public instead of nginx as the docs suggest. This is because we are using the MicroK8s ingress controller instead of the Nginx ingress controller.

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    email: youremail@example.com # change this
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-staging
    solvers:
    - http01:
        ingress:
          class: public
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: youremail@example.com # change this
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: public

Save that to a letsencrypt-issuer.yaml file and apply it:

microk8s kubectl apply -f letsencrypt-issuer.yaml
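
Give it a few seconds, then check that both issuers registered with Let's Encrypt:

# both issuers should report READY True
microk8s kubectl get clusterissuer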

💡 set aliases for microk8s kubectl and microk8s helm3 to make your life easier!
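
Something like this in your .zshrc or .bashrc does the trick (the alias names are just my pick; this post spells out the full microk8s form so it's copy-pasteable either way):

alias kubectl='microk8s kubectl'
alias helm='microk8s helm3'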

Kubeflow

You could stop here and mess around for a while. It's a good stopping point before we dump Kubeflow onto the system, which is pretty heavy. You can set it up and tear it down as needed without breaking things (very much), which is a cool feature of Kubernetes in general.

Charmed Kubeflow is a straightforward way to stand up Kubeflow inside of MicroK8s. It uses Juju, which is a tool for managing Kubernetes applications. This shit is still a black box to me.

sudo snap install juju --classic --channel=2.9/stable

We are specifying a version of Juju here as well, because Charmed Kubeflow 1.7 wants Juju 2.9. All of these steps take a minute. I like to watch btop and microk8s kubectl get pods -A -w to see everything roll in.

juju bootstrap microk8s
juju add-model kubeflow
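
You can confirm both landed before moving on:

# the controller bootstraps as microk8s-localhost
juju controllers
# the kubeflow model should be listed here
juju models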

Now run these magic numbers:

sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360

Kubeflow is doing a lot of shit, and all of those pods watch a lot of files, which blows past the kernel's default inotify limits. Beyond that it's over my head, but it seems to work.
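
One gotcha: values set with sudo sysctl like that don't survive a reboot. To make them stick, drop them into a file under /etc/sysctl.d (the 99-kubeflow.conf filename is just my pick):

# persist the inotify bumps across reboots
echo 'fs.inotify.max_user_instances=1280' | sudo tee /etc/sysctl.d/99-kubeflow.conf
echo 'fs.inotify.max_user_watches=655360' | sudo tee -a /etc/sysctl.d/99-kubeflow.conf
# reload all sysctl config without rebooting
sudo sysctl --system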

Alright, now we are ready to kick off Kubeflow. This can take about an hour, so don't sweat it. I like to take a walk and think about The Cloud while it runs.

juju deploy kubeflow --trust  --channel=1.7/stable

Once this is going, we can also watch the Kubeflow pods Juju is controlling:

watch -c 'juju status --color | grep -E "blocked|error|maintenance|waiting|App|Unit"'

If you're watching it and notice the pod named tensorboard-controller stuck in a state labeled Waiting for gateway relation, run this to kick it into gear:

juju run --unit istio-pilot/0 -- "export JUJU_DISPATCH_PATH=hooks/config-changed; ./dispatch"

It's an upstream issue and consistently popped up for me.

Kubeflow Dashboard

In the Kubeflow tutorial they show you how to get the dashboard up and running on the machine that you've installed Kubeflow on, which is fine, but I wanted to access the dashboard from anywhere, all the time. So much fucking MLOps. Globally.

For that we need a couple of things in place. First you'll need a domain that resolves to your machine's public IP address. There are loads of dynamic DNS services (gamers use them a lot, for instance), and you can use something like KubeSail to deploy a nice little pod on your new Kubernetes cluster to monitor for DNS changes and update your provider. I used Cloudflare, and it worked great.

You can also figure out your IP address manually and point an A record of any domain or sub-domain you own at that IP. This isn't as durable, or as fun, but it works to get started.
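
Any of the what's-my-IP services will tell you what to put in that A record; api.ipify.org is just one example:

# prints your public IPv4 address
curl -s https://api.ipify.org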

DNS resolvable top-level domains are foundational to all of this, so getting this part working was important to me.

To get access to my Kubeflow dashboard externally at a resolvable domain, I created an ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kubeflow-dashboard-ingress
  namespace: kubeflow
  labels:
    app: istio-ingressgateway
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    # this starts with `letsencrypt-staging` so we can test it out
    cert-manager.io/cluster-issuer: "letsencrypt-staging"
spec:
  # this needs to be `public` and not `nginx`!
  ingressClassName: public
  tls:
  - hosts:
    - your.cool-domain.com
    secretName: tls-secret
  rules:
  - host: your.cool-domain.com
    http:
      paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              # this is the service that the ingress will route to
              # NOT the istio-ingressgateway
              name: istio-ingressgateway-workload
              port:
                number: 80

Save that as whatever.yaml and apply it:

microk8s kubectl apply -f whatever.yaml
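
Before blaming Kubernetes for anything, confirm DNS actually resolves to your public IP (dig comes from the dnsutils package if you don't have it):

# should print the same address the A record points at
dig +short your.cool-domain.com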

Assuming your domain is pointing at your machine's public IP address, you should be able to access the dashboard at https://your.cool-domain.com.

Just kidding. There's still more to do. 🤡

You can check the status of the underlying Let's Encrypt certificate with this:

microk8s kubectl describe -n kubeflow certificate tls-secret

You can also look at the ClusterIssuer to see what's going on there:

microk8s kubectl describe clusterissuer letsencrypt-staging

Assuming the DNS can resolve, it should work and you'll see something that tells you the ready status is True. Then you'll know it's cool to switch the YAML to use letsencrypt-prod instead of letsencrypt-staging and apply it again.

microk8s kubectl apply -f whatever.yaml

This Stack Overflow answer is an excellent reference.

Now, you need to tell Kubeflow what the domain is through this config:

juju config dex-auth public-url=https://your.cool-domain.com
juju config oidc-gatekeeper public-url=https://your.cool-domain.com

juju config dex-auth static-username=admin
juju config dex-auth static-password=admin

Keep watching the pods and you'll see the oidc-gatekeeper pod restart, and then you'll be able to log in with the username and password you set above.

Tearing it all down

One nice thing is that wiping the whole thing and rebuilding it is straightforward. I had to do this dozens of times and wrote this post as a way to remember all the steps.

sudo snap remove microk8s --purge; juju unregister -y microk8s-localhost
sudo snap remove --purge juju
rm -rf ~/.local/share/juju

Gone.

What can you use Kubeflow for?

I'll let ChatGPT answer that. I dunno, do whatever you want! Let me know.