Troubleshooting

If you run into problems, there are several ways to investigate them. This document describes the available options.

NOTE: cluster$ indicates that the commands should be run as root on your OAS machine.

Known issues

Check whether the problem you have encountered is already listed in our issue tracker.

Run the CLI tests

To get an overall status of your cluster you can run the tests from the command line.

There are two types of tests: testinfra tests, and behave tests.

Testinfra tests

Testinfra tests are split into two groups; let’s call them blackbox and clearbox tests. The blackbox tests run on your provisioning machine and test the OAS cluster from the outside. For example, the certificate check verifies that OAS returns valid certificates for the provided services. The clearbox tests run on the OAS host and check, for example, whether Docker is installed in the right version.

First, enter the test directory in the Git repository on your provisioning machine.

To run the tests against your cluster, first export the CLUSTER_DIR environment variable with the location of your cluster config directory:

export CLUSTER_DIR="../clusters/CLUSTERNAME"

Run all tests:

py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*'
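If py.test cannot find the inventory, the cause is often a wrong CLUSTER_DIR. The following is a minimal sketch of a pre-flight check; check_cluster_dir is a hypothetical helper name, not part of OpenAppStack, and it only assumes that inventory.yml lives directly in CLUSTER_DIR, as in the command above.

```shell
# Hypothetical helper (not part of OpenAppStack): verify that CLUSTER_DIR
# is set and contains the inventory.yml file that py.test is pointed at.
check_cluster_dir() {
    if [ -z "${CLUSTER_DIR:-}" ]; then
        echo "CLUSTER_DIR is not set" >&2
        return 1
    fi
    if [ ! -f "${CLUSTER_DIR}/inventory.yml" ]; then
        echo "No inventory.yml found in ${CLUSTER_DIR}" >&2
        return 1
    fi
    echo "Using inventory ${CLUSTER_DIR}/inventory.yml"
}
```

You could then chain it in front of the test run, for example: check_cluster_dir && py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*'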

Known Issues

  • The default SSH backend for testinfra tests is paramiko, which doesn’t work out of the box. It fails to connect to the host because the ed25519 host key was not verified. Therefore we need to force plain ssh:// with either connection=ssh or --hosts=ssh://…

Behave tests

Behave tests run in a browser and test if all the interfaces are up and running and correctly connected to each other. They are integrated in the openappstack CLI command suite.

Prerequisites

By default, the behave tests use the Chromium browser and Chromium webdriver. If you want or need to use the Firefox webdriver, please refer to the manual behave test instructions below, under Advanced usage.

Install Chromedriver, e.g. on Debian/Ubuntu:

apt install chromium-chromedriver

Usage

To run all behave tests, run the following command in this repository:

python -m openappstack CLUSTERNAME test

In the future, this command will run all tests, but currently only the behave tests are implemented. To learn more about the test subcommand, run:

python -m openappstack CLUSTERNAME test --help

You can also run the behave tests for a single application, e.g.:

python -m openappstack CLUSTERNAME test --behave-tags nextcloud

Advanced usage

Testinfra tests

Specify host manually:

py.test -s --hosts='ssh://root@example.openappstack.net'

Run only tests tagged with prometheus:

py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*' -m prometheus

Run cert test manually using the ansible inventory file:

py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*' -m certs

Run cert test manually against a different cluster, not configured in any ansible inventory file, either by using pytest:

FQDN='example.openappstack.net' py.test -sv -m 'certs'

or directly:

FQDN='example.openappstack.net' pytest/test_certs.py

Running testinfra tests with local gitlab-runner docker executor

Export the following environment variables like this:

export CI_REGISTRY_IMAGE='open.greenhost.net:4567/openappstack/openappstack'
export SSH_PRIVATE_KEY="$(cat ~/.ssh/id_ed25519_oas_ci)"
export COSMOS_API_TOKEN='…'

then:

gitlab-runner exec docker --env CI_REGISTRY_IMAGE="$CI_REGISTRY_IMAGE" --env SSH_PRIVATE_KEY="$SSH_PRIVATE_KEY" --env COSMOS_API_TOKEN="$COSMOS_API_TOKEN" bootstrap

Behave tests

Using Firefox instead of Chromium

If you want to use Firefox instead of Chromium, you need to install the gecko driver:

apt install firefox-geckodriver

Now you only need to add -D browser=firefox to the behave command line options, so run:

python -m openappstack CLUSTER_NAME test --behave-param='-D browser=firefox'

Using behave without the OpenAppStack CLI

Go to the test/behave directory and run the following for the nextcloud and onlyoffice tests:

behave -D nextcloud.url=https://files.example.openappstack.net \
       -D nextcloud.password="$(cat ../../clusters/YOUR_CLUSTERNAME/secrets/nextcloud_admin_password)" \
       -t nextcloud

You can replace nextcloud with grafana or rocketchat to test the other applications.

Run behave tests in the openappstack-ci Docker image

docker run --rm -it open.greenhost.net:4567/openappstack/openappstack/openappstack-ci sh

  apk --no-cache add git
  git clone https://open.greenhost.net/openappstack/openappstack.git
  cd openappstack/test/behave
  behave -D nextcloud.url=https://files.ci-20410.ci.openappstack.net \
   -D nextcloud.admin.password=…

SSH access

You can log in to your VPS over SSH. Some programs that are available to the root user on the VPS:

  • kubectl, the Kubernetes control program. The root user is connected to the cluster automatically.
  • helm, the “Kubernetes package manager”. Use e.g. helm ls --all-namespaces to see what apps are installed in your cluster. You can also use it to perform manual upgrades; see helm --help.

Using kubectl to debug your cluster

You can use kubectl, the Kubernetes control program, to inspect and manipulate your Kubernetes cluster. Once you have installed kubectl, use the OAS CLI to get access to your cluster:

$ python -m openappstack my-cluster info

Look for these lines:

To use kubectl with this cluster, copy-paste this in your terminal:

export KUBECONFIG=/home/you/projects/openappstack/clusters/my-cluster/secrets/kube_config_cluster.yml

Copy the whole export line into your terminal. In the same terminal window, kubectl will connect to your cluster.

HTTPS Certificates

OAS uses cert-manager to automatically fetch Let’s Encrypt certificates for all deployed services. If you experience invalid SSL certificates, e.g. your browser warns you when visiting Rocketchat (https://chat.example.org), here’s how to debug this. The official cert-manager Troubleshooting Issuing ACME Certificates documentation is also a useful resource.

In this example we fix a failed certificate request for chat.example.org. We will start by checking if cert-manager is set up correctly.

Did you create your cluster using the --acme-staging argument? Please check the resulting value of the acme_staging key in clusters/YOUR_CLUSTERNAME/group_vars/all/settings.yml. If this is set to true, certificates are fetched from the Let’s Encrypt staging API, which can’t be validated by default in your browser.
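To read the setting without opening the file, a small helper along these lines may be convenient. This is only a sketch: acme_staging_value is a hypothetical name, and the sed expression assumes that acme_staging appears as a plain top-level "acme_staging: true" or "acme_staging: false" line. A real YAML parser would be more robust.

```shell
# Hypothetical helper: print the value of the top-level acme_staging key
# from a settings.yml file. Assumes a simple one-line "key: value" entry;
# prints nothing if the key is absent.
acme_staging_value() {
    sed -n 's/^acme_staging:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1"
}
```

For example: acme_staging_value clusters/YOUR_CLUSTERNAME/group_vars/all/settings.yml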

Are all cert-manager pods in the oas namespace in the READY state?

$ kubectl -n oas get pods | grep cert-manager

Are there any cm-acme-http-solver-* pods still running, indicating that there are unfinished certificate requests?

$ kubectl get pods --all-namespaces | grep cm-acme-http-solver

Show the logs of the main cert-manager pod:

$ kubectl -n oas logs -l "app.kubernetes.io/name=cert-manager"

You can grep for your cluster domain or for any specific subdomain to narrow down results.
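One way to do that is sketched below. The function name cert_manager_logs_for is hypothetical. Note that when a label selector is used, kubectl logs returns only the last 10 lines per pod by default, so the sketch adds --tail=-1 to request the full log. grep -F treats the domain as a literal string, so the dots are not interpreted as wildcards.

```shell
# Hypothetical helper: show cert-manager log lines that mention a domain.
# --tail=-1 requests all lines; with a label selector, kubectl logs
# would otherwise return only the last 10 lines per pod.
cert_manager_logs_for() {
    kubectl -n oas logs -l "app.kubernetes.io/name=cert-manager" --tail=-1 |
        grep -F -- "$1"
}
```

For example: cert_manager_logs_for chat.example.org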

Query for failed certificates, certificate requests, challenges, or orders:

$ kubectl get --all-namespaces certificate,certificaterequest,challenge,order | grep -iE '(false|pending)'
oas-apps    certificate.cert-manager.io/oas-rocketchat                 False   oas-rocketchat                 15h
oas-apps    certificaterequest.cert-manager.io/oas-rocketchat-2045852889                 False   15h
oas-apps    challenge.acme.cert-manager.io/oas-rocketchat-2045852889-1775447563-837515681   pending   chat.example.org   15h
oas-apps    order.acme.cert-manager.io/oas-rocketchat-2045852889-1775447563                 pending   15h

We see that the Rocketchat certificate resources have been in a bad state for 15 hours.
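If the listing is long, it can help to reduce each row to namespace, kind, and name, which are the values needed for the follow-up commands. The sketch below assumes the column layout shown in the sample output above (namespace first, then kind.group/name). list_unready is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get --all-namespaces ...` output on
# stdin and print "<namespace> <kind> <name>" for every row that contains
# "false" or "pending" (case-insensitive, same filter as above).
list_unready() {
    grep -iE '(false|pending)' | awk '{
        split($2, parts, "/")   # $2 looks like kind.group/name
        kind = parts[1]
        sub(/\..*/, "", kind)   # drop the API group, keep the short kind
        print $1, kind, parts[2]
    }'
}
```

For example: kubectl get --all-namespaces certificate,certificaterequest,challenge,order | list_unready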

Show the status message of the certificate resource:

$ kubectl -n oas-apps get certificate oas-rocketchat -o jsonpath="{.status.conditions[*]['message']}"
Waiting for CertificateRequest "oas-rocketchat-2045852889" to complete

We see that the certificate is waiting for the certificaterequest; let’s query its status message:

$ kubectl -n oas-apps get certificaterequest oas-rocketchat-2045852889 -o jsonpath="{.status.conditions[*]['message']}"
Waiting on certificate issuance from order oas-apps/oas-rocketchat-2045852889-1775447563: "pending"

Show the related order resource and look at the status and events:

$ kubectl -n oas-apps describe order oas-rocketchat-2045852889-1775447563

Show the failed challenge resource reason:

$ kubectl -n oas-apps get challenge oas-rocketchat-2045852889-1775447563-837515681 -o jsonpath='{.status.reason}'
Waiting for http-01 challenge propagation: wrong status code '503', expected '200'

In this example, deleting the challenge fixed the issue and a proper certificate could be fetched:

$ kubectl -n oas-apps delete challenges.acme.cert-manager.io oas-rocketchat-2045852889-1775447563-837515681
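The lookups above can be combined into one helper that prints the certificate message and the reason of each related challenge. This is a rough sketch, not an OAS tool: cert_status is a hypothetical name, and it assumes that challenge names start with the certificate name followed by a dash, as they do in this example.

```shell
# Hypothetical helper: print the status message of a certificate and the
# reason of every challenge whose name starts with "<certificate>-".
# Usage: cert_status NAMESPACE CERTIFICATE
cert_status() {
    ns="$1"
    cert="$2"
    echo "certificate/${cert}:"
    kubectl -n "$ns" get certificate "$cert" \
        -o jsonpath="{.status.conditions[*]['message']}"
    echo
    # `-o name` prints one "kind.group/name" entry per line.
    kubectl -n "$ns" get challenge -o name | grep "/${cert}-" |
    while read -r challenge; do
        echo "${challenge}:"
        kubectl -n "$ns" get "$challenge" -o jsonpath='{.status.reason}'
        echo
    done
}
```

For example: cert_status oas-apps oas-rocketchat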

Application installation fails

Find applications that fail to install:

helm ls --all-namespaces | grep -i -v DEPLOYED
kubectl get helmreleases --all-namespaces | grep -i -v DEPLOYED

The nextcloud installation process in particular is brittle and error-prone. Let’s take it as an example of how to debug the root cause.

Purge OAS and install from scratch

If things ever fail beyond recovery, here’s how to completely purge an OAS installation in order to start from scratch:

cluster$ apt purge docker-ce-cli containerd.io
cluster$ mount | grep -E '^(.*kubelet|nsfs.*docker)' | cut -d' ' -f 3 | xargs umount
cluster$ rm -rf /var/lib/docker /var/lib/OpenAppStack /etc/kubernetes /var/lib/etcd /var/lib/rancher /var/lib/kubelet /var/log/OpenAppStack /var/log/containers /var/log/pods
cluster$ systemctl reboot
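Because these commands are destructive, you may want to preview which mount points the umount line would act on before running it. The sketch below reuses the same filter without calling umount. purge_mount_targets is a hypothetical helper name.

```shell
# Hypothetical dry-run helper: read `mount` output on stdin and print the
# mount points selected by the purge filter, without unmounting anything.
purge_mount_targets() {
    grep -E '^(.*kubelet|nsfs.*docker)' | cut -d' ' -f 3
}
```

For example: mount | purge_mount_targets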