Engine Troubleshooting Guide
Introduction
This guide describes tools for troubleshooting the azeti Engine server stack. It requires a deep understanding of the system and is considered advanced reading.
General overview of how services are configured within an azeti Engine Installation
For a general overview, read the general VM installation guide:
https://azetinetworks.atlassian.net/wiki/display/AC/azeti+Cloud+-+Install
Identifying Services on a given azeti Engine Installation
To identify the VMs/containers, query the *azeti-cfg* database.
To access the *azeti-cfg* database, check how Tomcat is configured to reach it in the {{/etc/default/tomcat8}} file.
JAVA_OPTS="$JAVA_OPTS -Dazeti.ssc.config.jdbc.driverClassName=org.postgresql.Driver -Dazeti.ssc.config.jdbc.url=jdbc:postgresql://127.0.0.1:5432/azeti-cfg?ApplicationName=SSCConfig
Then query the azeti_ssc_app_config table to see how the applications access the specific services:
select * from azeti_ssc_app_config;
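The connection details can also be pulled out of the JAVA_OPTS line programmatically. The following is a minimal sketch, assuming the line has the shape shown above; the helper names `jdbc_settings` and `jdbc_target` are hypothetical and not part of the product.

```python
import re
from urllib.parse import urlsplit, parse_qs

def jdbc_settings(java_opts_line):
    """Return the -Dazeti.ssc.config.jdbc.* properties found in a JAVA_OPTS line."""
    pairs = re.findall(r"-D(azeti\.ssc\.config\.jdbc\.[\w.]+)=(\S+)", java_opts_line)
    # Strip a trailing double quote in case the property is the last one on the line.
    return {name: value.rstrip('"') for name, value in pairs}

def jdbc_target(jdbc_url):
    """Split a jdbc:postgresql:// URL into host, port, database and application name."""
    parts = urlsplit(jdbc_url[len("jdbc:"):])  # drop the "jdbc:" prefix
    return {
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "application": parse_qs(parts.query).get("ApplicationName", [None])[0],
    }

# The line as it appears in /etc/default/tomcat8 in this guide.
line = ('JAVA_OPTS="$JAVA_OPTS '
        '-Dazeti.ssc.config.jdbc.driverClassName=org.postgresql.Driver '
        '-Dazeti.ssc.config.jdbc.url=jdbc:postgresql://127.0.0.1:5432/azeti-cfg'
        '?ApplicationName=SSCConfig')

settings = jdbc_settings(line)
target = jdbc_target(settings["azeti.ssc.config.jdbc.url"])
print(target)
```

The resulting host, port and database name are the values to pass to a PostgreSQL client before running the query above.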
Identifying Services on a given VM
To identify the services running on a VM, check how they are monitored, started and stopped in the {{/etc/monit/conf.d/azeti_cloud_monit}} file.
For example, this is how PostgreSQL is monitored, started and stopped:
# Postgresql 9.4
check process postgresql with pidfile /var/lib/postgresql/9.4/main/postmaster.pid
    group database
    stop program = "/etc/init.d/postgresql stop"
    start program = "/etc/init.d/postgresql restart"
    if failed unixsocket /var/run/postgresql/.s.PGSQL.5432 protocol pgsql then start
    if failed unixsocket /var/run/postgresql/.s.PGSQL.5432 protocol pgsql then alert
    if failed host localhost port 5432 protocol pgsql then start
    if failed host localhost port 5432 protocol pgsql then alert
    if failed host localhost port 5432 protocol pgsql then exec /etc/monit/slack
    if failed host localhost port 5432 protocol pgsql then exec /etc/monit/hipchat
    if 5 restarts within 5 cycles then timeout
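To get a quick list of all monitored services and their start/stop commands, the monit file can be scanned with a few lines of Python. This is a rough sketch, assuming one directive per line as is usual in monit files; `monit_services` is a hypothetical helper and it only understands the `check process`, `start program` and `stop program` directives.

```python
import re

def monit_services(config_text):
    """Map each 'check process' name to its pidfile and start/stop programs."""
    services = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = re.match(r"check process (\S+) with pidfile (\S+)", line)
        if m:
            current = services.setdefault(m.group(1), {"pidfile": m.group(2)})
            continue
        m = re.match(r'(start|stop) program\s*=\s*"([^"]+)"', line)
        if m and current is not None:
            current[m.group(1)] = m.group(2)
    return services

# Excerpt of the PostgreSQL entry from this guide.
sample = '''
# Postgresql 9.4
check process postgresql with pidfile /var/lib/postgresql/9.4/main/postmaster.pid
    group database
    stop program = "/etc/init.d/postgresql stop"
    start program = "/etc/init.d/postgresql restart"
'''

services = monit_services(sample)
print(services)
```

On a real VM the text would be read from {{/etc/monit/conf.d/azeti_cloud_monit}} instead of the inline sample.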
API Test
The api/test call provides an automated way to check the status of the application's internal services. It sends an event message from a simulated gateway (a SiteController called CloudCheck and a Sensor called CloudTest), follows this message through all the internals, and asserts that every step has been performed correctly.
Pre-requisites
To perform this test, an application user with two specific roles is mandatory:
ADMIN or OPERATOR role (to be able to make an API call and subscribe to the application topics)
plus
GATEWAY role (to send a message as a SiteController called CloudCheck for a Sensor called CloudTest).
Call
The call is a POST request to https://azeticloudXX.azeti.net/SSCServices/api/test with the user described in the previous step.
curl -sL -u 'test@azeti.org:xxxxxxxx' -X POST https://azeticloudXX.azeti.net/SSCServices/api/test | python -m json.tool
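For scripted checks, the same call can be made from Python with the standard library only. This is a sketch under assumptions: the function names are hypothetical, the host and credentials are the placeholders from the curl example, and an empty request body is sent explicitly, which differs slightly from the body-less curl call.

```python
import base64
import json
import urllib.request

def build_test_request(base_url, user, password):
    """Build the authenticated POST request for SSCServices/api/test."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/SSCServices/api/test",
        data=b"",  # empty body (assumption); the method is set explicitly below
        method="POST",
        headers={"Authorization": "Basic " + token},
    )

def run_api_test(base_url, user, password, timeout=30):
    """Send the request and return the decoded JSON answer."""
    request = build_test_request(base_url, user, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request does not touch the network, so it is safe to inspect.
request = build_test_request("https://azeticloudXX.azeti.net", "test@azeti.org", "xxxxxxxx")
print(request.get_method(), request.full_url)
```

Calling `run_api_test(...)` against a real installation returns the same JSON that `python -m json.tool` pretty-prints in the curl pipeline.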
Tests Performed
There are currently 8 tests. Each returns 0 if the test failed or 1 if it succeeded.
- Api
This is the basic test of reaching the SSCServices/api/test API endpoint and successfully authenticating the user of the call.
- Database
This is the basic test of making a relational database query for the number of locations and checking for a valid numeric result.
- Broker
This is the basic test of making a Broker API request and checking the Store, Memory and Temp percentages and the Current Connections Count.
To pass the test, Store must be below 75%, Memory below 90%, Temp below 75% and the Current Connections Count below 5000.
- BrokerPub
This is the basic test of publishing a random event message to the Broker from the CloudTest sensor of the CloudCheck gateway.
- BrokerSub
This is the basic test of subscribing to the ACP application topic for the CloudTest sensor of the CloudCheck gateway and checking that the random event message has been successfully received at the ACP application level within 750 milliseconds.
- BrokerActivity
This test checks the Broker queues that the ACP application uses internally and returns an error when one of the queues has more than 200 pending messages.
In addition, if the outbound queues have been enabled but no consumer is dequeuing them (no STOMP client connected), the test deletes them when their pending size is higher than 100K.
- TsInflux08
This is the basic test of querying the InfluxDB version 0.8 time series database to ensure that the random event message has been stored within 1200 milliseconds.
- TsInfluxDB
This is the basic test of querying the InfluxDB version 0.13.+ time series database to ensure that the random event message has been stored within 1200 milliseconds.
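The pass criteria stated for the Broker and BrokerActivity tests can be expressed as simple threshold checks. The sketch below only restates the limits from the descriptions above; the function names and the shape of the input dictionaries are assumptions, and this is not the product's implementation.

```python
# Limits taken from the Broker and BrokerActivity descriptions above.
BROKER_LIMITS = {"Store": 75, "Memory": 90, "Temp": 75, "CurrentConnections": 5000}
PENDING_LIMIT = 200

def broker_violations(values):
    """Return the broker metrics that are not strictly below their limit."""
    return {name: values[name]
            for name, limit in BROKER_LIMITS.items()
            if values[name] >= limit}

def queues_over_limit(pending_by_queue):
    """Return the queues with more than PENDING_LIMIT pending messages."""
    return {queue: pending
            for queue, pending in pending_by_queue.items()
            if pending > PENDING_LIMIT}

# Illustrative input values.
healthy = {"Store": 10, "Memory": 0, "Temp": 0, "CurrentConnections": 150}
violations = broker_violations(healthy)
over = queues_over_limit({"azeti.sscbroker_hd": 308, "azeti.notification.in": 54})
print(violations, over)
```

An empty result from both functions corresponds to a Broker and BrokerActivity result of 1.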
Additional Results
Apart from the tests described above, the answer also returns the number of Errors (which must be 0 if all the tests returned 1) with the corresponding ErrorMessages.
The number of Warnings (independent of the test results) with the corresponding WarningMessages.
The total Time spent on the tests in milliseconds.
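Because the answer is a JSON list of single-key objects, it is convenient to merge it into one dictionary before evaluating it. The following sketch does that and cross-checks the Errors counter against the individual test results; the helper names are hypothetical, and the sample is an abbreviated, illustrative answer with placeholder error messages.

```python
# The 8 test names listed in this guide.
TESTS = ("Api", "Database", "Broker", "BrokerPub", "BrokerSub",
         "BrokerActivity", "TsInflux08", "TsInfluxDB")

def flatten_answer(answer):
    """Merge the list of single-key objects returned by api/test into one dict."""
    merged = {}
    for item in answer:
        merged.update(item)
    return merged

def failed_tests(result):
    """Names of the tests that did not return 1."""
    return [name for name in TESTS if result.get(name) != 1]

# Abbreviated sample answer (messages shortened to placeholders).
sample = [
    {"Api": 1}, {"Database": 1}, {"Broker": 1}, {"BrokerPub": 1},
    {"BrokerSub": 1}, {"TsInflux08": 0}, {"TsInfluxDB": 1}, {"BrokerActivity": 0},
    {"Errors": 2}, {"ErrorMessages": ["TsInflux08 ...", "BrokerActivity ..."]},
    {"Warnings": 0}, {"WarningMessages": []}, {"Time": 2946},
]

result = flatten_answer(sample)
failed = failed_tests(result)
print(failed, result["Errors"] == len(failed))
```

A monitoring script can alert whenever `failed` is non-empty and attach `result["ErrorMessages"]` to the alert.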
Example of a successful answer without warnings
"Errors": 0
"Warnings": 0
Note: The error counter of 0 corresponds to all 8 current tests being successful and delivering 1 as a result.
"Api": 1
"Database": 1
"Broker": 1
"BrokerPub": 1
"BrokerSub": 1
"TsInflux08": 1
"TsInfluxDB": 1
"BrokerActivity": 1
[
  { "Api": 1 },
  { "Database": 1 },
  { "Broker": 1 },
  { "BrokerPub": 1 },
  { "BrokerSub": 1 },
  { "TsInflux08": 1 },
  { "TsInfluxDB": 1 },
  { "BrokerActivity": 1 },
  { "InfoMessages": [
    "Api Test result: user: 580751",
    "Database Test result: locations: 4",
    "Broker Uptime result: 8 hours 6 minutes. CurrentConnections: 150. Store: 10% Memory: 0% Temp: 0%",
    "BrokerPub Test result: Value: -177231141 Time: 2016-08-24T19:49:30.539Z+0200 Published: true",
    "BrokerSub Test result: Value: -177231141 Time: 2016-08-24T19:49:30.539Z+0200 Values: true Times: true Points: 1",
    "TsInflux08 Test result: Value: -177231141 Time: 2016-08-24T19:49:30.539Z+0200 Values: true Times: true Points: 1",
    "TsInfluxDB Test result: Value: -177231141 Time: 2016-08-24T19:49:30.539Z+0200 Values: true Times: true Points: 1",
    "BrokerActivity Test result: [With Zero Pending: 26] [Non Existing: org.apache.activemq:type=Broker,brokerName=activemq,destinationType=Queue,destinationName=*_outbound_*]"
  ] },
  { "Errors": 0 },
  { "ErrorMessages": [] },
  { "Warnings": 0 },
  { "WarningMessages": [] },
  { "Time": 3119 }
]
Example of an answer with ERRORS
"Errors": 2
"Warnings": 0
Note: The error counter of 2 corresponds to 2 tests delivering 0 as a result.
"TsInflux08": 0
"BrokerActivity": 0
Note: The error counter of 2 produced the corresponding 2 ErrorMessages with further details.
[
  { "Api": 1 },
  { "Database": 1 },
  { "Broker": 1 },
  { "BrokerPub": 1 },
  { "BrokerSub": 1 },
  { "TsInflux08": 0 },
  { "TsInfluxDB": 1 },
  { "BrokerActivity": 0 },
  { "InfoMessages": [
    "Api Test result: user: 580751",
    "Database Test result: locations: 4",
    "Broker Uptime result: 1 hour 26 minutes. CurrentConnections: 152. Store: 1% Memory: 0% Temp: 0%",
    "BrokerPub Test result: Value: -195000897 Time: 2016-08-24T13:09:09.764Z+0200 Published: true",
    "BrokerSub Test result: Value: -195000897 Time: 2016-08-24T13:09:09.764Z+0200 Values: true Times: true Points: 1",
    "TsInfluxDB Test result: Value: 552711248 Time: 2016-08-24T13:08:33.386Z+0200 Values: false Times: false Points: 1",
    "BrokerActivity Test result: [With Zero Pending: 21] [azeti.ts_hd_topersist 1] [azeti.sscbroker_events 1] [azeti.notification.in 54] [azeti.idbts_values_topersist 1] [Non Existing: org.apache.activemq:type=Broker,brokerName=activemq,destinationType=Queue,destinationName=*_outbound_*]"
  ] },
  { "Errors": 2 },
  { "ErrorMessages": [
    "TsInflux08 empty result! not yet persisted after waiting for 1200ms? query : select value from \"sp.89fc49f5-0da3-47e9-8a65-4e149a17364d.cutype.3.checkunit.c50fbc07-2efa-421a-8f27-43bd42a3d45e.Value\" where value = -195000897 and time > 1472036949764 limit 5",
    "BrokerActivity Internal Pending Limits Reached for: azeti.sscbroker_hd: 308 Enq/Deq: 1072/1058"
  ] },
  { "Warnings": 0 },
  { "WarningMessages": [] },
  { "Time": 2946 }
]
About the InfoMessages
Some details about the InfoMessages.
"InfoMessages": [
  "Api Test result: user: 9303",
  "Database Test result: locations: 17",
  "Broker Uptime result: 3 hours 8 minutes. CurrentConnections: 115. Store: 4% Memory: 0% Temp: 0%",
  "BrokerPub Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Published: true",
  "BrokerSub Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1",
  "TsInflux08 Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1",
  "TsInfluxDB Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1",
  "BrokerActivity Test result: [With Zero Pending: 26] [Non Existing: org.apache.activemq:type=Broker,brokerName=activemq,destinationType=Queue,destinationName=*_outbound_*]"
]
Even when a test succeeds with result 1, the InfoMessages are returned so that the performed tests can be analysed in detail.
Note how the InfoMessages deliver further details on how each test performed:
"Api Test result: user: 9303"
Displays that the user 9303 was successfully authenticated.
"Database Test result: locations: 17"
Displays that the database test query returned 17 locations.
"Broker Uptime result: 3 hours 8 minutes. CurrentConnections: 115. Store: 4% Memory: 0% Temp: 0%"
Displays the actual Broker values.
"BrokerPub Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Published: true"
Displays the value and timestamp of the published test event.
"BrokerSub Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1"
Displays the correctness of the value and timestamp of the test event received by the application subscription.
"TsInflux08 Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1"
Displays the correctness of the value and timestamp of the test event stored in InfluxDB 0.8.
"TsInfluxDB Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1"
Displays the correctness of the value and timestamp of the test event stored in InfluxDB 0.13.+.
"BrokerActivity Test result: [With Zero Pending: 26] [Non Existing: org.apache.activemq:type=Broker,brokerName=activemq,destinationType=Queue,destinationName=*_outbound_*]"
Displays the number of Broker queues with 0 pending messages plus the queues to test that do not exist.
Troubleshooting
If a test returns 0 (error), analyse the ErrorMessages together with the InfoMessages to understand what is going on.
Repeating the same call after a minute helps to determine whether the problem was temporary (e.g. due to heavy load) or more serious.
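The "repeat after a minute" advice can be automated. The sketch below assumes a callable that performs the api/test call and returns the answer as a dictionary with an "Errors" count; `classify_failure` and its parameters are hypothetical names, and the sleep function is injectable so that the logic can be tried without waiting.

```python
import time

def classify_failure(run_test, wait_seconds=60, sleep=time.sleep):
    """Run the test and, if it reports errors, run it once more after a pause.

    run_test is any callable returning a dict with an "Errors" count
    (an assumption of this sketch, not a product API).
    """
    first = run_test()
    if first["Errors"] == 0:
        return "ok"
    sleep(wait_seconds)
    second = run_test()
    return "temporary" if second["Errors"] == 0 else "persistent"

# Usage with canned answers instead of a live system.
answers = iter([{"Errors": 2}, {"Errors": 0}])
outcome = classify_failure(lambda: next(answers), sleep=lambda seconds: None)
print(outcome)
```

A "persistent" outcome is the signal to continue with the ErrorMessages and the tips in the following section.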
Troubleshooting
Some useful troubleshooting tips for live environments:
Display current activity on PostgreSQL database
su postgres
psql -d postgres
select count(*) as total, datname, application_name, waiting, state, query
from pg_stat_activity
group by datname, application_name, waiting, state, query
order by total desc;
InfluxDB Monitoring
System Monitoring
https://github.com/influxdata/influxdb/blob/master/monitor/README.md
How to use the SHOW STATS command and the _internal database to monitor InfluxDB
https://www.influxdata.com/how-to-use-the-show-stats-command-and-the-_internal-database-to-monitor-influxdb/
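The SHOW STATS command can also be sent over the InfluxDB HTTP API through its /query endpoint, which applies to the 0.13.+ instance and not to InfluxDB 0.8. The sketch below only builds the URL; the base URL http://localhost:8086 (the default InfluxDB port), the absence of authentication and the helper name `influx_query_url` are assumptions.

```python
from urllib.parse import urlencode

def influx_query_url(base_url, query, db=None):
    """Build a URL for the InfluxDB /query endpoint."""
    params = {"q": query}
    if db:
        params["db"] = db  # optional target database, e.g. _internal
    return base_url.rstrip("/") + "/query?" + urlencode(params)

stats_url = influx_query_url("http://localhost:8086", "SHOW STATS")
internal_url = influx_query_url("http://localhost:8086", "SHOW MEASUREMENTS", db="_internal")
print(stats_url)
print(internal_url)
```

The resulting URLs can be fetched with curl or `urllib.request.urlopen`; the response is JSON.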