Troubleshooting

If you run into problems, there are several ways to investigate them. This document describes them.

Note

cluster$ indicates that the commands should be run as root on your OAS machine.

We would love to hear from you! If you have problems, please create an issue in our issue tracker or reach out as described on our contact page. We want to be in communication with our users, and we want to help you if you run into problems.

Known issues

If you run into a problem, please check our issue tracker to see if others have run into the same problem. We might have suggested a workaround or temporary solution in one of our issues. If your problem is not described in an issue, please open a new one so we can solve the problems you encounter.

Run the CLI tests

To get an overall status of your cluster you can run the tests from the command line.

There are two types of tests: [testinfra](https://testinfra.readthedocs.io/en/latest/) tests, and [Taiko](https://taiko.dev) tests.

Testinfra tests

Testinfra tests are split into two groups; let's call them blackbox and clearbox tests. The blackbox tests run on your provisioning machine and test the OAS cluster from the outside. For example, the certificate check verifies that OAS returns valid certificates for the provided services. The clearbox tests run on the OAS host and check, for example, whether the right version of Docker is installed. Our testinfra tests are a combination of blackbox and clearbox tests.

First, enter the test directory in the Git repository on your provisioning machine.

cd test

To run the tests against your cluster, first export the CLUSTER_DIR environment variable with the location of your cluster config directory (replace oas.example.org with your cluster name):

export CLUSTER_DIR="../clusters/oas.example.org"
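Before starting a test run, it can help to confirm that CLUSTER_DIR really points at a cluster config directory. The following is a minimal sketch, not part of the OAS tooling: check_cluster_dir is a hypothetical helper name, and it assumes only that the directory contains the inventory.yml file that the test commands in this document refer to.

```shell
# Hypothetical helper: verify that CLUSTER_DIR is set and that it contains
# the inventory.yml file used by the test commands.
check_cluster_dir() {
    if [ -z "${CLUSTER_DIR:-}" ]; then
        echo "CLUSTER_DIR is not set" >&2
        return 1
    fi
    if [ ! -f "${CLUSTER_DIR}/inventory.yml" ]; then
        echo "No inventory.yml found in ${CLUSTER_DIR}" >&2
        return 1
    fi
    echo "Using inventory ${CLUSTER_DIR}/inventory.yml"
}
```

Call check_cluster_dir after the export; a non-zero exit status means the path is probably wrong.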

Run all tests

py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*'

Test all applications

These tests check that:

  • The applications return proper certificates

  • All helm releases are successfully installed

  • All app pods are running and healthy

These tests include all optional applications and will fail for optional applications that are not installed.

pytest -s -m 'app' --connection=ansible --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*'

Test a specific application

pytest -s -m 'app' --app="wordpress" --connection=ansible --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*'

Known issues

The default SSH backend for testinfra tests is paramiko, which doesn’t work out of the box: it fails to connect to the host because the ed25519 host key is not verified. Therefore we need to force plain SSH with either --connection=ssh or --hosts=ssh://…

Taiko tests

Taiko tests run in a browser and test if all the interfaces are up and running and correctly connected to each other. They are integrated in the openappstack CLI command suite.

Prerequisites

Install [Taiko](https://taiko.dev) on your provisioning machine:

npm install -g taiko

Run Taiko tests

To run all Taiko tests, run the following command in this repository:

python -m openappstack CLUSTERNAME test

To learn more about the test subcommand, run:

python -m openappstack CLUSTERNAME test --help

You can also run a Taiko test for a specific application only, for example:

python -m openappstack CLUSTERNAME test --taiko-tags nextcloud

Advanced usage

Testinfra tests

Specify host manually:

py.test -s --hosts='ssh://root@example.openappstack.net'

Run only tests tagged with prometheus:

py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*' -m prometheus

Run cert test manually using the ansible inventory file:

py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*' -m certs

Run cert test manually against a different cluster, not configured in any ansible inventory file, either by using pytest:

FQDN='example.openappstack.net' py.test -sv -m 'certs'

or directly:

FQDN='example.openappstack.net' pytest/test_certs.py

Running Testinfra tests with local gitlab-runner docker executor

Export the following environment variables like this:

export CI_REGISTRY_IMAGE='open.greenhost.net:4567/openappstack/openappstack'
export SSH_PRIVATE_KEY="$(cat ~/.ssh/id_ed25519_oas_ci)"
export COSMOS_API_TOKEN='…'

then:

gitlab-runner exec docker --env CI_REGISTRY_IMAGE="$CI_REGISTRY_IMAGE" --env SSH_PRIVATE_KEY="$SSH_PRIVATE_KEY" --env COSMOS_API_TOKEN="$COSMOS_API_TOKEN" bootstrap

Taiko tests

Using Taiko without the OpenAppStack CLI

Go to the test/taiko directory, set the environment variables for the tests you want to run, and start taiko.

For nextcloud & onlyoffice tests:

export DOMAIN='oas.example.net'
export SSO_USERNAME='user1'
export SSO_USER_PW='...'
export TAIKO_TESTS='nextcloud'
taiko --observe taiko-tests.js

You can replace nextcloud with grafana or wordpress to test the other applications, or with all to test all applications.

SSH access

You can log in to your VPS over SSH. Some programs that are available to the root user on the VPS:

  • kubectl, the Kubernetes control program. The root user is connected to the cluster automatically.

  • helm is the “Kubernetes package manager”. For example, use helm ls --all-namespaces to see what apps are installed in your cluster. You can also use it to perform manual upgrades; see helm --help.

  • flux is the Flux command line tool.

Using kubectl to debug your cluster

You can use kubectl, the Kubernetes control program, to inspect and manipulate your Kubernetes cluster. Once you have installed kubectl, use the OAS CLI to get access to your cluster:

$ python -m openappstack oas.example.org info

Look for these lines:

To use kubectl with this cluster, copy-paste this in your terminal:
export KUBECONFIG=/home/you/projects/openappstack/clusters/oas.example.org/kube_config_cluster.yml

Copy the whole export line into your terminal. In the same terminal window, kubectl will connect to your cluster.

HTTPS Certificates

OAS uses cert-manager to automatically fetch Let’s Encrypt certificates for all deployed services. If you experience invalid SSL certificates, e.g. your browser warns you when visiting Rocketchat (https://chat.oas.example.org), a useful resource for troubleshooting is the official cert-manager Troubleshooting Issuing ACME Certificates documentation. First, try the steps below.

In this example we fix a failed certificate request for https://chat.oas.example.org. We will start by checking if cert-manager is set up correctly.

Is your cluster using the live ACME server?

$ kubectl get clusterissuers -o yaml | grep 'server:'

This should return server: https://acme-v02.api.letsencrypt.org/directory and not something with the word staging in it.
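The live-versus-staging check can be scripted. This is a minimal sketch under the assumption stated above, namely that a staging issuer has the word staging in its server URL. check_acme_server is a hypothetical helper name; it only reads text from standard input and does not call kubectl itself.

```shell
# Hypothetical helper: read "server:" lines on stdin and report whether any
# of them points at a staging ACME endpoint.
# Usage on a machine with cluster access:
#   kubectl get clusterissuers -o yaml | check_acme_server
check_acme_server() {
    servers="$(grep 'server:' || true)"
    if [ -z "$servers" ]; then
        echo "no ACME server found"
        return 2
    fi
    if printf '%s\n' "$servers" | grep -qi 'staging'; then
        echo "staging"
        return 1
    fi
    echo "live"
}
```

The function prints live and exits with status 0 only when at least one server line was found and none of them mentions staging.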

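Are all cert-manager pods in the READY state? The command below looks in the cert-manager namespace; if your installation runs cert-manager in another namespace, such as oas, adjust -n accordingly.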

$ kubectl -n cert-manager get pods
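If the pod list is long, a small filter can point out the pods that need attention. This is a sketch that assumes the default kubectl get pods column layout for a single namespace (NAME, READY, STATUS, RESTARTS, AGE), where READY is shown as ready/total containers. pods_not_ready is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods` output (single namespace, so
# NAME is column 1 and READY is column 2) on stdin and print the names of
# pods that have fewer ready containers than expected.
# Pods with status Completed (for example finished jobs) are skipped.
# Usage:
#   kubectl -n cert-manager get pods | pods_not_ready
pods_not_ready() {
    awk 'NR > 1 && $3 != "Completed" {
        split($2, count, "/")
        if (count[1] != count[2]) print $1
    }'
}
```

No output means that every listed pod reports all of its containers as ready.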

Cert-manager uses a “custom resource” to keep track of your certificates, so you can also check the status of your certificates with the command below. It returns all the certificates for all applications on your system; the output shown is an example of healthy certificates.

$ kubectl get certificates -A
NAMESPACE   NAME                           READY   SECRET                         AGE
oas         hydra-public.tls               True    hydra-public.tls               14d
oas         single-sign-on-userpanel.tls   True    single-sign-on-userpanel.tls   14d
oas-apps    oas-nextcloud-files            True    oas-nextcloud-files            14d
oas-apps    oas-nextcloud-office           True    oas-nextcloud-office           14d
oas         grafana-tls                    True    grafana-tls                    13d
oas         alertmanager-tls               True    alertmanager-tls               13d
oas         prometheus-tls                 True    prometheus-tls                 13d
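To spot unhealthy certificates in a longer list, you could filter on the READY column. The sketch below assumes the column layout of the example output above (NAMESPACE, NAME, READY, SECRET, AGE); not_ready_certificates is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get certificates -A` output on stdin
# and print namespace/name for every certificate whose READY column is not
# "True". The header line is skipped.
# Usage:
#   kubectl get certificates -A | not_ready_certificates
not_ready_certificates() {
    awk 'NR > 1 && $3 != "True" { print $1 "/" $2 }'
}
```

For the healthy example output above, the function prints nothing.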

If there are problems, you can check for the specific certificaterequests:

$ kubectl get certificaterequests -A

If you still need more information, you can dig into the logs of the cert-manager pod:

$ kubectl -n oas logs -l "app.kubernetes.io/name=cert-manager"

You can grep for your cluster domain or for any specific subdomain to narrow down results.
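As a sketch of that filtering step: the hypothetical helper below uses grep -F, so the dots in a domain name are matched literally instead of as regular expression wildcards. The domain in the usage comment is the example domain used throughout this document.

```shell
# Hypothetical helper: keep only the log lines on stdin that mention the
# given domain. -F treats the domain as a fixed string, not a regex.
# Usage:
#   kubectl -n oas logs -l "app.kubernetes.io/name=cert-manager" \
#     | logs_for_domain chat.oas.example.org
logs_for_domain() {
    grep -F -- "$1"
}
```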

Example

Query for failed certificates, certificate requests, challenges, or orders:

$ kubectl get --all-namespaces certificate,certificaterequest,challenge,order | grep -iE '(false|pending)'
oas-apps    certificate.cert-manager.io/oas-rocketchat                 False   oas-rocketchat                 15h
oas-apps    certificaterequest.cert-manager.io/oas-rocketchat-2045852889                 False   15h
oas-apps    challenge.acme.cert-manager.io/oas-rocketchat-2045852889-1775447563-837515681   pending   chat.oas.example.org   15h
oas-apps    order.acme.cert-manager.io/oas-rocketchat-2045852889-1775447563                 pending   15h

We see that the Rocketchat certificate resources have been in a bad state for 15 hours.
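Each line of that output already contains what you need to inspect the resource: the namespace in the first column and kind/name in the second. As a sketch, the hypothetical helper below turns such lines into kubectl describe commands. It only prints the commands and runs nothing, so you can review them first.

```shell
# Hypothetical helper: read `kubectl get --all-namespaces ...` output on
# stdin and print a `kubectl describe` command for every resource line
# (column 2 has the form kind.group/name) that mentions false or pending.
# Usage:
#   kubectl get --all-namespaces certificate,certificaterequest,challenge,order \
#     | describe_commands
describe_commands() {
    awk 'index($2, "/") > 0 && tolower($0) ~ /(false|pending)/ {
        print "kubectl -n " $1 " describe " $2
    }'
}
```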

Show certificate resource status message:

$ kubectl -n oas-apps get certificate oas-rocketchat -o jsonpath="{.status.conditions[*]['message']}"
Waiting for CertificateRequest "oas-rocketchat-2045852889" to complete

We see that the certificate is waiting for the certificaterequest, so let's query its status message:

$ kubectl -n oas-apps get certificaterequest oas-rocketchat-2045852889 -o jsonpath="{.status.conditions[*]['message']}"
Waiting on certificate issuance from order oas-apps/oas-rocketchat-2045852889-1775447563: "pending"

Show the related order resource and look at the status and events:

$ kubectl -n oas-apps describe order oas-rocketchat-2045852889-1775447563

Show the failed challenge resource reason:

$ kubectl -n oas-apps get challenge oas-rocketchat-2045852889-1775447563-837515681 -o jsonpath='{.status.reason}'
Waiting for http-01 challenge propagation: wrong status code '503', expected '200'

In this example, deleting the challenge fixed the issue, and a proper certificate could be fetched:

$ kubectl -n oas-apps delete challenges.acme.cert-manager.io oas-rocketchat-2045852889-1775447563-837515681

Application installation or upgrade failures

Application installations and upgrades are managed by flux. Flux uses helm-controller to install and upgrade applications with helm charts.

An application installed with Flux consists of a kustomization. This is a resource that defines where the information about the application is stored in our Git repository. The kustomization contains a helmrelease, which is an object that represents an installation of a Helm chart. Read more about the difference between kustomizations and helmreleases in the Flux documentation.

To find out if all kustomizations have been applied correctly, run the following flux command in your cluster:

cluster$ flux get kustomizations -A

If all your kustomizations are in a Ready state, take a look at your helmreleases:

cluster$ flux get helmreleases -A

Often, you can resolve complications with kustomizations or helmreleases by telling Flux to reconcile them. For example, the following command makes sure that the Nextcloud helmrelease is brought into the state that OpenAppStack wants it to be in:

cluster$ flux reconcile helmrelease nextcloud

Purge OAS and install from scratch

If ever things fail beyond possible recovery, here’s how to completely purge an OAS installation in order to start from scratch:

Warning

You will lose all your data! This completely destroys OpenAppStack and takes everything offline. If you choose to do this, you will need to re-install OpenAppStack and make sure that your data is stored somewhere other than the VPS that runs OpenAppStack.

cluster$ /usr/local/bin/k3s-killall.sh
cluster$ systemctl disable k3s
cluster$ rm -rf /var/lib/{rancher,OpenAppStack,kubelet,cni,docker,etcd} /etc/{kubernetes,rancher} /var/log/{OpenAppStack,containers,pods} /tmp/k3s /etc/systemd/system/k3s.service
cluster$ systemctl reboot
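Because the purge cannot be undone, you may want to see which of those paths actually exist before running the rm command above. The following is a dry-run sketch that deletes nothing. list_purge_targets is a hypothetical helper name, and the optional root argument exists only so the function can be tried against a scratch directory.

```shell
# Hypothetical dry-run helper: print which of the paths removed by the purge
# exist under the given root (default: the real filesystem root).
# It only lists paths; it never deletes anything.
list_purge_targets() {
    root="${1:-}"
    for p in \
        /var/lib/rancher /var/lib/OpenAppStack /var/lib/kubelet \
        /var/lib/cni /var/lib/docker /var/lib/etcd \
        /etc/kubernetes /etc/rancher \
        /var/log/OpenAppStack /var/log/containers /var/log/pods \
        /tmp/k3s /etc/systemd/system/k3s.service
    do
        if [ -e "${root}${p}" ]; then
            echo "${root}${p}"
        fi
    done
}
```

Run list_purge_targets without arguments on the cluster to list what the purge would remove, and make sure none of it holds data you still need.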