Kubernetes @ ICS
Some of the information below may be inaccessible to anyone outside of helpdesk staff. Please send mail to helpdesk@ics.uci.edu for more information.
- Kubernetes Management Resources (support group only)
- Internal Repositories
- Kubernetes Tutorials
- Kubernetes ICS Hosted Sites
Common Commands
kubectl
kubectl cluster-info
% kubectl cluster-info
Kubernetes master is running at https://rancher.ics.uci.edu/k8s/clusters/local
CoreDNS is running at https://rancher.ics.uci.edu/k8s/clusters/local/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
kubectl get nodes
% kubectl get nodes
NAME           STATUS   ROLES               AGE   VERSION
kyvernitis-1   Ready    controlplane,etcd   28d   v1.14.1
kyvernitis-2   Ready    controlplane,etcd   28d   v1.14.1
kyvernitis-3   Ready    controlplane,etcd   28d   v1.14.1
kyvernitis-4   Ready    worker              28d   v1.14.1
kyvernitis-5   Ready    worker              28d   v1.14.1
kyvernitis-6   Ready    worker              28d   v1.14.1
kubectl get pods
kubectl -n kube-system get pods
NAME          READY   STATUS    RESTARTS   AGE
canal-5n86q   2/2     Running   0          37d
canal-8sq4l   2/2     Running   2          37d
Building Containers
Docker containers can be run on the Kubernetes @ ICS cluster.
- Pull: choose and download a base container from Docker Hub
- Modify: add your changes via a Dockerfile
- Build: build the image with docker build
- Push: push the image to the ICS container registry (Harbor)

See Example: Ubuntu below for a full walkthrough.
Harbor
Default SA
For JupyterLab, it appears necessary to put the imagePullSecrets value into the default service account (SA):
kubectl -n jhub-ics53-stage edit sa default
docker build -t containers.ics.uci.edu/jupyter/ics-notebook .
SA default config:
<config>
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
imagePullSecrets:
- name: harbor-jupyter
kind: ServiceAccount
metadata:
  creationTimestamp: "2019-06-27T20:59:48Z"
  name: default
  namespace: jhub-ics53-stage
  resourceVersion: "53685477"
  selfLink: /api/v1/namespaces/jhub-ics53-stage/serviceaccounts/default
  uid: 7fa3a015-991e-11e9-89c7-78e7d1227a28
secrets:
- name: default-token-dnlw4
</config>
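A non-interactive way to make the same change, if you'd rather not hand-edit (a sketch; assumes the harbor-jupyter secret already exists in the namespace):

# Patch imagePullSecrets into the default SA without opening an editor
kubectl -n jhub-ics53-stage patch sa default \
  -p '{"imagePullSecrets": [{"name": "harbor-jupyter"}]}'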
Example: Ubuntu
Choose
We have chosen the base Ubuntu 18.04 Docker image.
Dockerfile
We're going to add some utilities to the image, so let's make our own Dockerfile:
FROM ubuntu:18.04
MAINTAINER ICS Computing Support Helpdesk <helpdesk@ics.uci.edu>
RUN apt-get update
RUN apt-get install -y gcc gdb net-tools telnet iputils-ping iproute2 ssh tcpdump strace
There is a lot of documentation for Harbor and JupyterLab saying that imagePullSecrets can be put into the helm release object, but that has been remarkably unsuccessful for us. We thought we had it working for a day, but then it stopped.
Note this is really bad, because the default SA is created and deleted with the namespace, and we haven't set up a YAML manifest to recreate it.
Docker Build
docker build -t ubuntu-instruction .
Docker Push
# docker login containers.ics.uci.edu
# docker tag ubuntu-instruction containers.ics.uci.edu/hans/ubuntu-instruction
# docker push containers.ics.uci.edu/hans/ubuntu-instruction
Docker images
# docker images
Backup Strategies
- Ephemeral components are not backed up
- Persistent components are backed up
- Kubernetes is not covered by disaster recovery (DR)
Ephemeral Components
Definition: Ephemeral: temporary, fleeting, impermanent.
The overwhelming majority of Kubernetes components are created to be ephemeral. For this reason, a number of components would be permanently lost in a disaster:
- User Containers or Pods
- Name Spaces
- Local storage on a container
- Rook (Ceph) based persistent volumes and claims.
Persistent Components
- ETCD database
- Harbor, the ICS container registry
- Container images in the ICS container registry
- NetApp-based Persistent Volumes and Persistent Volume Claims
- Additional kubedb-managed databases
Disaster Recovery
There are currently no disaster recovery plans for the ICS Kubernetes cluster. Running production or time-critical workloads in the ICS Kubernetes cluster is not recommended.
Vulnerable Services/Deployments
kubedb
This deployment has persistent data.
- namespace: kubedb
- provided by: kubernetes/kubedb/helmrelease.yml
harbor
This deployment has persistent data.
- namespace: harbor
- database: postgres.kubedb.com/postgres 11.1-v1 Running 43d
kube-ics
hackmd
This deployment has persistent data.
- namespace: hack
- database: postgres.kubedb.com/postgres
- provided by:
Backups
kubedb
snap yaml file
Create a snap.yml file:
apiVersion: kubedb.com/v1alpha1
kind: Snapshot
metadata:
  name: harbor-pgsql-snap
  namespace: harbor
  labels:
    kubedb.com/kind: Postgres
spec:
  databaseName: postgres
  local:
    mountPath: /repo
    persistentVolumeClaim:
      claimName: snap-harbor-pvc
Running the snap yaml file
kubectl apply -f snap.yml
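To check on the snapshot afterwards (the resource name follows from the kubedb.com CRD group shown in the manifest above; `kubectl get events` in the namespace also surfaces failures):

# List kubedb snapshots and inspect the one we just created
kubectl -n harbor get snapshots.kubedb.com
kubectl -n harbor describe snapshots.kubedb.com harbor-pgsql-snap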
Backup
Running the snap yaml file will create a directory named archived-<dbase>-kubedb-<snapshotName>-pvc-<alphanumeric-code> in cybertron's vol11, under the ics_kyvernitis/k8s_pv directory. Inside this directory will be a dumpfile.sql.
Snaps
Filer snapshots are enabled for volume 11 on cybertron, so we will have several copies of each backup.
Scheduling
Use a cron expression in the backupSchedule to run backups regularly. Make sure your PVC pre-exists (a sketch follows). Use `kubectl get events` to see errors.
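A minimal PVC sketch for the claim the schedule below references (the storageClassName is an assumption; pick a NetApp-backed class so the dump lands on backed-up storage):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: harbor-snap-pvc
  namespace: harbor
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: cyb-inst-scratch  # assumption: any class backed by the NetApp filer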
Run: `kubectl -n harbor edit pg hans-pg`
spec:
  backupSchedule:
    cronExpression: "@every 1d"
    databaseName: postgres
    local:
      mountPath: /repo
      persistentVolumeClaim:
        claimName: harbor-snap-pvc
Restores
DR
Tips and Tricks
Cheatsheet
All namespaces
kubectl get namespace
Events
Because the Kubernetes designers decided to jettison proper log reporting as a conduit to rapid product delivery, not all events are reported in a sensible location. Much like systemd, actually. Anyhow, see what you're missing:
kubectl get events
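Events are per-namespace and unsorted by default; this variant (standard kubectl flags) lists everything across the cluster in time order:

# All namespaces, oldest event first
kubectl get events --all-namespaces --sort-by='.metadata.creationTimestamp'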
Logs
You can return logs from a pod by running the 'log' command on a pod name (oddly, for once, declaring the object type is unnecessary). This was instrumental in solving DTR 71791.
ex.
kubectl -n jhub-ics53-stage log hub-9747d8b7-2xsjx
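Two standard flags that help during debugging:

# Follow the log live
kubectl -n jhub-ics53-stage logs -f hub-9747d8b7-2xsjx
# Log from the previous (crashed) instance of the container
kubectl -n jhub-ics53-stage logs --previous hub-9747d8b7-2xsjx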
Cordoning a Worknode
If you want to run maintenance on a node, use the cordon and drain commands:
kubectl cordon kyvernitis-5
kubectl drain kyvernitis-5 --ignore-daemonsets
[2:07:30 hans@medusa] kubectl get node
NAME           STATUS                     ROLES               AGE   VERSION
kyvernitis-1   Ready                      controlplane,etcd   81d   v1.14.1
kyvernitis-2   Ready                      controlplane,etcd   81d   v1.14.1
kyvernitis-3   Ready                      controlplane,etcd   81d   v1.14.1
kyvernitis-4   Ready                      worker              81d   v1.14.1
kyvernitis-5   Ready,SchedulingDisabled   worker              81d   v1.14.1
kyvernitis-6   Ready                      worker              81d   v1.14.1
kubectl uncordon kyvernitis-5
Docker Container Secrets
The following directions should work:
…but they don't. Haven't discovered why yet. And you shouldn't put your password into a secret anyhow.
Use a robot account in harbor
kubectl create secret docker-registry NAME_HERE \
  --docker-server=containers.ics.uci.edu \
  --docker-username="ics53Jupyter" \
  --docker-password="token" \
  --docker-email="k8s@hub.ics.uci.edu" \
  -n NAMESPACE_HERE \
  --dry-run -o yaml
To save Docker credentials that allow login to containers.ics.uci.edu:
kubectl -n jhub-ics53-stage create secret generic regcred \
  --from-file=.dockerconfigjson=/home/luser/.docker/config.json \
  --type=kubernetes.io/dockerconfigjson
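A sketch of consuming that secret from a pod spec (the pod name and image below are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: regcred-test            # hypothetical test pod
  namespace: jhub-ics53-stage
spec:
  imagePullSecrets:
    - name: regcred             # the secret created above
  containers:
    - name: notebook
      image: containers.ics.uci.edu/jupyter/ics-notebook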
Jupyter Hub
- Staging Hub (move to hub.stage.ics.uci.edu).
- Image: zhaofengli/ics-k8s-hub:2019042201
Building
This section describes how to build a new ics-k8s-hub image.
Base Image
Choose a base image from the Jupyter docker-stacks.
Troubleshooting
How to See Install Options for Installed Charts
A
helm get values metalb
helm get values jhub
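To see the chart's computed defaults along with the user-supplied values (`--all` is a standard `helm get values` flag):

# Dump all values, not just the overrides
helm get values --all jhub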
Check your Chrome Console for Errors
JupyterLab errors will often be visible in your Chrome console.
Harbor Stops Accepting Logins
Q Found that the https://containers.ics.uci.edu Harbor server is not accepting logins.
A
Harbor APIs were all returning 500; check the core logs.
Try a Different Browser
It's a crazy world of continuous integration: both browsers and Kubernetes update constantly. If you run into a problem you can't explain with your Chrome console, remember to try a different browser.
PENDING_UPGRADE
Q This happened when bringing up a Jupyter staging environment. Repeated attempts to bring up the helm release saw no changes.
A Delete any configmaps that have the same name as your project from kube-system.
helm history jhub-stage
REVISION  UPDATED                   STATUS           CHART             DESCRIPTION
1         Tue Aug 27 14:58:21 2019  DEPLOYED         jupyterhub-0.8.2  Install complete
2         Wed Nov 27 11:49:30 2019  PENDING_UPGRADE  jupyterhub-0.8.2  Preparing upgrade
3         Wed Nov 27 11:58:08 2019  PENDING_UPGRADE  jupyterhub-0.8.2  Preparing upgrade
Get the list of relevant config maps:
kubectl -n kube-system get configmaps | grep jhub-stage
jhub-stage.v1   1   91d
jhub-stage.v2   1   41m
jhub-stage.v3   1   32m
Delete the relevant config maps:
kubectl -n kube-system delete configmap jhub-stage.v3 jhub-stage.v2 jhub-stage.v1
configmap "jhub-stage.v3" deleted
configmap "jhub-stage.v2" deleted
configmap "jhub-stage.v1" deleted
Your helm install is now gone:
helm history jhub-stage
Error: release: "jhub-stage" not found
Release your helm chart again:
helm upgrade --install jhub-stage jupyterhub/jupyterhub --namespace jhub-stage --version=0.8.2 --values config.yml
Error:
Error: release jhub-stage failed: poddisruptionbudgets.policy "hub" already exists
Alright, at this point things went a little off the rails. I wound up wiping out the entire jhub-stage namespace and needed to start from scratch. Furthermore, `helm upgrade --install` didn't seem to do what it used to do (see about updating helm). In the meantime:
/usr/local/bin/helm install jupyterhub/jupyterhub --namespace jhub-stage --version=0.8.2 --values config.yml
Error: timed out waiting for the condition
This happened when recreating the jhub-stage helm release:
/usr/local/bin/helm install jupyterhub/jupyterhub --namespace jhub-stage --version=0.8.2 --values config.yml
Ask tiller what happened:
kubectl -n kube-system log tiller-deploy-54979dfb45-qkvjj
...
[tiller] 2019/11/27 21:29:40 warning: Release aged-anteater pre-install jupyterhub/templates/image-puller/job.yaml could not complete: timed out waiting for the condition
...
kubectl -n jhub-stage -o wide get pods
It can take a long time for the hook-image-puller pods to pull down your image. Let those guys finish and then run the helm install command again.
Error: a release named jhub-stage already exists
A
Run `helm ls --all jhub-stage` to check the status of the release, or run `helm del --purge jhub-stage` to delete it.
VSCode Can't Open Directory
Q Every time I click on or change a directory in the file explorer, the explorer crashes.
A In my case, there was a symlink in my home directory that pointed to a non-existent file. The presence of this dangling symlink crashed the file explorer. I deleted the symlink and the file explorer worked normally. -- hans 06/01/2019
VS Code GIT complains too many changes to track
You should see if there are any directories you can add to .gitignore. My repo directory contained:
.vagrant
vagrant
vendor
runtime
web/assets
none of which I was going to add to git.
unable to validate against any pod security policy
Q You receive the following messages:
Warning FailedCreate daemonset/test Error creating: pods "test-" is forbidden: unable to validate against any pod security policy: []
Commonly, this message was received when running a deployment in a namespace.
A
See the docs on pod security policies
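A minimal RBAC sketch that grants a namespace's default service account use of an existing policy (the policy name "restricted" and the namespace "test-ns" are assumptions, not names from this cluster):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: psp-use
  namespace: test-ns              # hypothetical namespace
rules:
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  verbs: ["use"]
  resourceNames: ["restricted"]   # assumption: an existing PodSecurityPolicy
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-use
  namespace: test-ns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: psp-use
subjects:
- kind: ServiceAccount
  name: default
  namespace: test-ns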
Pods/PVC stuck "Terminating"
A See if there is a matching pvc which may be preventing the pod from deleting:
# kubectl -n jhub get pvc
NAME         STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
claim-hans   Terminating   pvc-2dcdf159-c089-11e9-ba6d-525400aa5c0c   10Gi       RWO            cyb-inst-scratch   4d3h
# kubectl -n jhub patch pvc claim-hans -p '{"metadata":{"finalizers":null}}'
persistentvolumeclaim/claim-hans patched
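If the pod itself is the thing stuck in Terminating, a force delete sometimes clears it (the pod name here is illustrative):

# Skip the graceful-termination wait and delete the pod object immediately
kubectl -n jhub delete pod jupyter-hans --grace-period=0 --force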
A Brute force: rebooting the node that the pod was running on appears to have cleared it. I believed simply restarting docker would have done the trick, but alas, it did not.
404 Default Backend
Q Users see: "503 : Service Unavailable. Your server appears to be down. Try restarting it from the hub."
A They likely ran out of space on their home directory. In this example, the user is in ics53. Use the following commands to get to the relevant error message:
kubectl get namespaces
kubectl -n jhub-ics53 get pods
kubectl -n jhub-ics53 describe pod jupyter-username
kubectl -n jhub-ics53 logs jupyter-username

The last command will show: "Disk quota exceeded"
The above commands may need to be run in conjunction with starting and accessing the user's server from https://ics53.hub.ics.uci.edu/hub/admin.
Another reason for the error is running out of inodes. Run the following command on the user's home dir:
/home/support/bin/inodecount
The undergrad limit is currently 75k inodes.
Kubelet has disk pressure
This is caused when the server runs out of disk space.
A Add disk space; see Growing Logical Volumes.
A Prune Docker images (see the sketch below).
Also see Kubernetes garbage collection.
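A sketch of pruning on the affected node (standard Docker commands; note `-a` removes every image not used by a container, so subsequent pulls will be slower):

# Remove all unused images
docker image prune -a
# Also remove stopped containers, unused networks, and dangling build cache
docker system prune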
No Kind Job
Error: UPGRADE FAILED: no kind "Job" is registered for version "batch/v1" in scheme "k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30"
A https://github.com/helm/helm/issues/6894
At this time (2019/12/19), the only solution is to upgrade Helm to 3 or downgrade to 2.15. This is what is wrong with this confederation of software: everybody breaks one piece or another on a weekly, if not daily, basis.