Troubleshooting¶
If you run into problems, this document describes a few things you can do to investigate them.
NOTE: cluster$ indicates that the command should be run as root on your OAS machine.
Known issues¶
Check whether the problem you have encountered is already in our issue tracker.
Run the cli tests¶
To get an overall status of your cluster you can run the tests from the command line.
There are two types of tests: testinfra tests, and behave tests.
Testinfra tests¶
Testinfra tests are split into two groups, let's call them blackbox and clearbox
tests. The blackbox tests run on your provisioning machine and test the OAS
cluster from the outside. For example, the certificate check verifies that the
OAS returns valid certificates for the provided services. The clearbox tests run
on the OAS host and check, for example, whether the right version of docker is
installed.
First, enter the test directory in the Git repository on your provisioning
machine:
cd test
To run the tests against your cluster, first export the CLUSTER_DIR environment
variable with the location of your cluster config directory:
export CLUSTER_DIR="../clusters/CLUSTERNAME"
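Because every test command below reads ${CLUSTER_DIR}/inventory.yml, it can help to confirm that the variable points at a usable directory first. The following is a minimal sketch; check_cluster_dir is a hypothetical helper name and not part of OpenAppStack.

```shell
# Sketch: verify that a cluster config directory contains an inventory file.
# check_cluster_dir is a hypothetical helper, not an OpenAppStack command.
# It takes the directory as its first argument and falls back to $CLUSTER_DIR.
check_cluster_dir() {
    dir="${1:-${CLUSTER_DIR:-}}"
    if [ -z "$dir" ]; then
        echo "CLUSTER_DIR is not set" >&2
        return 1
    fi
    if [ ! -f "$dir/inventory.yml" ]; then
        echo "No inventory.yml found in $dir" >&2
        return 1
    fi
    echo "Using inventory $dir/inventory.yml"
}
```

Run check_cluster_dir in the shell where you exported CLUSTER_DIR. A non-zero exit status means the test commands would not find their inventory.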
Run all tests:
py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*'
Test all applications. This checks for:
- proper certificate
- helm release successfully installed
- all app pods are running and healthy
pytest -s -m 'app' --connection=ansible --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*'
Test a specific application:
pytest -s -m 'app' --app="wordpress" --connection=ansible --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*'
Known Issues¶
- The default ssh backend for testinfra tests is paramiko, which doesn’t work out of the box. It fails to connect to the host because the ed25519 hostkey was not verified. Therefore we need to force plain ssh:// with either --connection=ssh or --hosts=ssh://…
Behave tests¶
Behave tests run in a browser and test if all the interfaces are up
and running and correctly connected to each other. They are integrated in the
openappstack CLI command suite.
Prerequisites¶
By default the behave tests use the Chromium browser and Chromium webdriver. If you want/need to use the Firefox webdriver please refer to the manual behave test instructions below, under Advanced Usage.
Install Chromedriver, e.g. for Debian/Ubuntu use:
apt install chromium-chromedriver
Usage¶
To run all behave tests, run the following command in this repository:
python -m openappstack CLUSTERNAME test
In the future, this command will run all tests, but for now only behave is
implemented. To learn more about the test subcommand, run:
python -m openappstack CLUSTERNAME test --help
You can also run the behave tests for a specific application only, e.g.:
python -m openappstack CLUSTERNAME test --behave-tags nextcloud
Advanced usage¶
Testinfra tests¶
Specify host manually:
py.test -s --hosts='ssh://root@example.openappstack.net'
Run only tests tagged with prometheus:
py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*' -m prometheus
Run cert test manually using the ansible inventory file:
py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*' -m certs
Run cert test manually against a different cluster, not configured in any ansible inventory file, either by using pytest:
FQDN='example.openappstack.net' py.test -sv -m 'certs'
or directly:
FQDN='example.openappstack.net' pytest/test_certs.py
Running testinfra tests with local gitlab-runner docker executor¶
Export the following environment variables like this:
export CI_REGISTRY_IMAGE='open.greenhost.net:4567/openappstack/openappstack'
export SSH_PRIVATE_KEY="$(cat ~/.ssh/id_ed25519_oas_ci)"
export COSMOS_API_TOKEN='…'
then:
gitlab-runner exec docker --env CI_REGISTRY_IMAGE="$CI_REGISTRY_IMAGE" --env SSH_PRIVATE_KEY="$SSH_PRIVATE_KEY" --env COSMOS_API_TOKEN="$COSMOS_API_TOKEN" bootstrap
Behave tests¶
Using Firefox instead of Chromium¶
If you want to use Firefox instead of Chromium, you need to install the gecko driver:
apt install firefox-geckodriver
Now you only need to add -D browser=firefox to the behave command line
options, so run:
python -m openappstack CLUSTERNAME test --behave-param='-D browser=firefox'
Using behave without the OpenAppStack CLI¶
Go to the test/behave directory and run the following for the nextcloud & onlyoffice tests:
behave -D nextcloud.url=https://files.example.openappstack.net \
-D nextcloud.password="$(cat ../../clusters/YOUR_CLUSTERNAME/secrets/nextcloud_admin_password)" \
-t nextcloud
You can replace nextcloud with grafana or rocketchat to test the other
applications.
Run behave tests in openappstack-ci docker image¶
docker run --rm -it open.greenhost.net:4567/openappstack/openappstack/openappstack-ci sh
apk --no-cache add git
git clone https://open.greenhost.net/openappstack/openappstack.git
cd openappstack/test/behave
behave -D nextcloud.url=https://files.ci-20410.ci.openappstack.net \
-D nextcloud.admin.password=…
SSH access¶
You can log in to your VPS via SSH. Some programs that are available to the root user on the VPS:
- kubectl, the Kubernetes control program. The root user is connected to the cluster automatically.
- helm is the “Kubernetes package manager”. Use e.g. helm ls --all-namespaces to see what apps are installed in your cluster. You can also use it to perform manual upgrades; see helm --help.
Using kubectl to debug your cluster¶
You can use kubectl, the Kubernetes control program, to inspect and manipulate
your Kubernetes cluster. Once you have installed kubectl, get access to your
cluster with the OAS CLI:
$ python -m openappstack my-cluster info
Look for these lines:
To use kubectl with this cluster, copy-paste this in your terminal:
export KUBECONFIG=/home/you/projects/openappstack/clusters/my-cluster/secrets/kube_config_cluster.yml
Copy the whole export line into your terminal. In the same terminal window,
kubectl will connect to your cluster.
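Since the export only applies to the terminal where you pasted it, a quick check can save some confusion when you switch windows. The following is a minimal sketch; check_kubeconfig is a hypothetical helper name, and it assumes KUBECONFIG holds a single path rather than a colon-separated list.

```shell
# Sketch: confirm that KUBECONFIG points at a readable file in this shell.
# check_kubeconfig is a hypothetical helper; it assumes a single path.
check_kubeconfig() {
    if [ -z "${KUBECONFIG:-}" ]; then
        echo "KUBECONFIG is not set in this shell" >&2
        return 1
    fi
    if [ ! -r "$KUBECONFIG" ]; then
        echo "Cannot read $KUBECONFIG" >&2
        return 1
    fi
    echo "kubectl will use $KUBECONFIG"
}
```

If the check passes, a command such as kubectl get nodes should reach your cluster from that terminal.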
HTTPS Certificates¶
OAS uses cert-manager to automatically fetch Let’s Encrypt certificates for all
deployed services. If you experience invalid SSL certificates, e.g. your browser
warns you when visiting Rocketchat (https://chat.example.org), here’s how to
debug this. The official cert-manager documentation, Troubleshooting Issuing
ACME Certificates, is also a useful resource.
In this example we fix a failed certificate request for chat.example.org.
We will start by checking if cert-manager is set up correctly.
Did you create your cluster using the --acme-staging argument?
Please check the resulting value of the acme_staging key in
clusters/YOUR_CLUSTERNAME/group_vars/all/settings.yml. If this is set to true,
certificates are fetched from the Let’s Encrypt staging API, which can’t be
validated by default in your browser.
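The staging check can be done from the command line on your provisioning machine. The following is a minimal sketch; acme_staging_enabled is a hypothetical helper name, and it assumes the key appears as a top-level "acme_staging: true" line in the settings file.

```shell
# Sketch: succeed if a settings file enables the Let's Encrypt staging API.
# acme_staging_enabled is a hypothetical helper. It assumes a top-level
# "acme_staging: true" line; other YAML spellings of true are not detected.
acme_staging_enabled() {
    grep -Eqi '^acme_staging:[[:space:]]*true[[:space:]]*(#.*)?$' "$1"
}
```

Call it with the path to your settings.yml. An exit status of 0 means staging certificates are in use, so browser warnings are expected.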
Are all cert-manager pods in the oas namespace in the READY state?
$ kubectl -n oas get pods | grep cert-manager
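Reading the READY column by eye gets tedious when there are many pods. The following is a minimal sketch that filters the listing down to pods that are not fully ready; not_ready_pods is a hypothetical helper name, and it assumes the default kubectl columns NAME READY STATUS RESTARTS AGE.

```shell
# Sketch: print pods whose READY column (e.g. "0/1") shows fewer ready
# containers than expected. Reads "kubectl get pods" output on stdin.
# not_ready_pods is a hypothetical helper; it assumes the default columns
# NAME READY STATUS RESTARTS AGE and skips lines without an "x/y" value.
not_ready_pods() {
    awk '$2 ~ /^[0-9]+\/[0-9]+$/ {
        split($2, ready, "/")
        if (ready[1] != ready[2]) print $1, $2, $3
    }'
}
```

Pipe the command above into it, for example kubectl -n oas get pods | grep cert-manager | not_ready_pods. Empty output means all listed pods are ready.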
Are there any cm-acme-http-solver-* pods still running, indicating that there
are unfinished certificate requests?
$ kubectl get pods --all-namespaces | grep cm-acme-http-solver
Show the logs of the main cert-manager pod:
$ kubectl -n oas logs -l "app.kubernetes.io/name=cert-manager"
You can grep for your cluster domain or for any specific subdomain to narrow
down results.
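When grepping for a domain, the dots in the name are regular expression wildcards unless you ask for a fixed-string match. The following is a minimal sketch of such a filter; logs_for_domain is a hypothetical helper name.

```shell
# Sketch: keep only log lines that contain the given domain literally.
# logs_for_domain is a hypothetical helper; it reads log lines on stdin.
# grep -F treats the argument as a fixed string, so "." matches only a dot.
logs_for_domain() {
    grep -F -- "$1"
}
```

For example: kubectl -n oas logs -l "app.kubernetes.io/name=cert-manager" --tail=-1 | logs_for_domain chat.example.org. The --tail=-1 option is added here because kubectl logs shows only the last 10 lines per pod by default when a label selector is used.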
Query for failed certificates, certificate requests, challenges or orders:
$ kubectl get --all-namespaces certificate,certificaterequest,challenge,order | grep -iE '(false|pending)'
oas-apps certificate.cert-manager.io/oas-rocketchat False oas-rocketchat 15h
oas-apps certificaterequest.cert-manager.io/oas-rocketchat-2045852889 False 15h
oas-apps challenge.acme.cert-manager.io/oas-rocketchat-2045852889-1775447563-837515681 pending chat.example.org 15h
oas-apps order.acme.cert-manager.io/oas-rocketchat-2045852889-1775447563 pending 15h
We see that the Rocketchat certificate resources have been in a bad state for 15 hours.
Show the certificate resource status message:
$ kubectl -n oas-apps get certificate oas-rocketchat -o jsonpath="{.status.conditions[*]['message']}"
Waiting for CertificateRequest "oas-rocketchat-2045852889" to complete
We see that the certificate is waiting for the certificaterequest, so let's
query its status message:
$ kubectl -n oas-apps get certificaterequest oas-rocketchat-2045852889 -o jsonpath="{.status.conditions[*]['message']}"
Waiting on certificate issuance from order oas-apps/oas-rocketchat-2045852889-1775447563: "pending"
Show the related order resource and look at the status and events:
$ kubectl -n oas-apps describe order oas-rocketchat-2045852889-1775447563
Show the failed challenge resource reason:
$ kubectl -n oas-apps get challenge oas-rocketchat-2045852889-1775447563-837515681 -o jsonpath='{.status.reason}'
Waiting for http-01 challenge propagation: wrong status code '503', expected '200'
In this example, deleting the challenge fixed the issue and a proper certificate could be fetched:
$ kubectl -n oas-apps delete challenges.acme.cert-manager.io oas-rocketchat-2045852889-1775447563-837515681
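If several applications are affected, you may want a compact list of the stuck challenges before inspecting them one by one. The following is a minimal sketch that extracts them from the listing shown earlier; pending_challenges is a hypothetical helper name, and it assumes the column layout of that example output.

```shell
# Sketch: print "namespace name" for every challenge in the "pending" state.
# Reads "kubectl get --all-namespaces ...,challenge,..." output on stdin.
# pending_challenges is a hypothetical helper; it assumes the namespace is
# in column 1, the resource in column 2 and the state in column 3.
pending_challenges() {
    awk '$2 ~ /^challenge\.acme\.cert-manager\.io\// && $3 == "pending" {
        sub(/^[^\/]*\//, "", $2)   # strip the "challenge.acme.../" prefix
        print $1, $2
    }'
}
```

For example: kubectl get --all-namespaces certificate,certificaterequest,challenge,order | pending_challenges. The sketch only lists the challenges; check the reason of each one as shown above before you delete it.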
Application installation fails¶
Find applications that fail to install:
helm ls --all-namespaces | grep -i -v DEPLOYED
kubectl get helmreleases --all-namespaces | grep -i -v DEPLOYED
The nextcloud installation process in particular is brittle and error-prone. Let's take it as an example of how to debug the root cause.
Purge OAS and install from scratch¶
If things ever fail beyond recovery, here’s how to completely purge an OAS installation in order to start from scratch:
cluster$ apt purge docker-ce-cli containerd.io
cluster$ mount | egrep '^(.*kubelet|nsfs.*docker)' | cut -d' ' -f 3 | xargs umount
cluster$ rm -rf /var/lib/docker /var/lib/OpenAppStack /etc/kubernetes /var/lib/etcd /var/lib/rancher /var/lib/kubelet /var/log/OpenAppStack /var/log/containers /var/log/pods
cluster$ systemctl reboot