
OMS Health Check

Updated at: Sep 12, 2019 GMT+08:00

OMS Status Check

Indicator name: OMS Status

Indicator description: The OMS status check includes the HA status check and the resource status check. The values of the HA status include active, standby, and NULL, indicating the active node, the standby node, and an unknown state, respectively. The values of the resource status include normal, abnormal, and NULL. The OMS is unhealthy if the HA status is NULL, or if the resource status is NULL or abnormal.

Table 1 OMS status description

  HA status:
    active indicates that the node is the active node.
    standby indicates that the node is the standby node.
    NULL indicates that the status is unknown.

  Resource status:
    normal indicates that all resources are in the normal state.
    abnormal indicates that one or more resources are abnormal.
    NULL indicates that the status is unknown.

Recovery guidance:

  1. Log in to the active management node and run su - omm to switch to user omm. Then run ${CONTROLLER_HOME}/sbin/status-oms.sh to view the OMS status (see the command sketch after this list).
  2. If the HA status is NULL, the system may be being restarted. NULL is an intermediate status, and the HA status will automatically change to a normal value.
  3. If the resource status is abnormal, certain MRS Manager component resources are abnormal. Check whether the status of components such as acs, aos, cep, controller, feed_watchdog, fms, gaussDB, httpd, iam, ntp, okerberos, oldap, pms, and tomcat is normal.
  4. If any MRS Manager component resource is abnormal, see the information about the MRS Manager component status check to rectify the fault.
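A minimal command sketch of step 1, run interactively on the active management node; it assumes that ${CONTROLLER_HOME} is defined in the omm user's environment as referenced above:

    # Switch to user omm on the active management node.
    su - omm
    # Query the OMS HA status and resource status.
    ${CONTROLLER_HOME}/sbin/status-oms.sh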

Manager Component Status Check

Indicator name: Manager Component Status

Indicator description: The Manager component status check includes the component resource running status check and resource HA status check. The values of the resource running status include Normal, Abnormal, and others, and the values of the resource HA status include Normal, Exception, and others. The Manager components include acs, aos, cep, controller, feed_watchdog, floatip, fms, gaussDB, heartBeatCheck, httpd, iam, ntp, okerberos, oldap, pms, and tomcat. When the running status and HA status are not Normal, the MRS Manager components are unhealthy.

Table 2 Manager component status description

  Resource running status:
    Normal indicates that the resource is running properly.
    Abnormal indicates that the resource is running abnormally.
    Stopped indicates that the resource is stopped.
    Unknown indicates that the resource status is unknown.
    Starting indicates that the resource is being started.
    Stopping indicates that the resource is being stopped.
    Active_normal indicates that the resource is running properly as the active module.
    Standby_normal indicates that the resource is running properly as the standby module.
    Raising_active indicates that the resource is being promoted to the active module.
    Lowing_standby indicates that the resource is being demoted to the standby module.
    No_action indicates that no action is performed.
    Repairing indicates that the resource is being repaired.
    NULL indicates that the resource status is unknown.

  Resource HA status:
    Normal indicates that the resource HA status is normal.
    Exception indicates that the resource HA status is faulty.
    Non_steady indicates that the resource HA status is not steady.
    Unknown indicates that the resource HA status is unknown.
    NULL indicates that the resource HA status is unknown.

Recovery guidance:

  1. Log in to the active management node, and run su - omm to switch to user omm. Run ${CONTROLLER_HOME}/sbin/status-oms.sh to view the status of OMS.
  2. If floatip, okerberos, and oldap are abnormal, see ALM-12002, ALM-12004, and ALM-12005 respectively to resolve the problems.
  3. If other resources are abnormal, you are advised to view the logs of the faulty modules (a log-viewing sketch follows this list).

    If the controller resource is abnormal, view the /var/log/Bigdata/controller/controller.log log file of the faulty node.

    If the cep resource is abnormal, view the /var/log/Bigdata/omm/oms/cep/cep.log log file of the faulty node.

    If the aos resource is abnormal, view the /var/log/Bigdata/controller/aos/aos.log log file of the faulty node.

    If the feed_watchdog resource is abnormal, view the /var/log/Bigdata/watchdog/watchdog.log log file of the faulty node.

    If the httpd resource is abnormal, view the /var/log/Bigdata/httpd/error_log log file of the faulty node.

    If the fms resource is abnormal, view the /var/log/Bigdata/omm/oms/fms/fms.log log file of the faulty node.

    If the pms resource is abnormal, view the /var/log/Bigdata/omm/oms/pms/pms.log log file of the faulty node.

    If the iam resource is abnormal, view the /var/log/Bigdata/omm/oms/iam/iam.log log file of the faulty node.

    If the gaussDB resource is abnormal, view the /var/log/Bigdata/omm/oms/db/omm_gaussdba.log log file of the faulty node.

    If the ntp resource is abnormal, view the /var/log/Bigdata/omm/oms/ha/scriptlog/ha_ntp.log log file of the faulty node.

    If the tomcat resource is abnormal, view the /var/log/Bigdata/tomcat/catalina.log log file of the faulty node.

  4. If the problem cannot be resolved by viewing logs, contact maintenance personnel and send the collected fault logs.
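A minimal sketch of the log check in step 3, using the controller log as an example; substitute the path of whichever module's resource is reported as abnormal (the paths are listed above):

    # Switch to user omm on the faulty node.
    su - omm
    # Inspect the most recent entries of the faulty module's log, for example the controller log.
    tail -n 200 /var/log/Bigdata/controller/controller.log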

OMA Running Status

Indicator name: OMA Status

Indicator description: This indicator is used to check the running status of OMA. The values include running and not running. If the value is not running, the OMA is unhealthy.

Recovery guidance:

  1. Log in to the unhealthy node, and run su - omm to switch to user omm.
  2. Run ${OMA_PATH}/restart_oma_app to start the OMA manually, and perform the check again (see the command sketch after this list). If the check result is still unhealthy, go to 3.
  3. If the problem cannot be resolved by manually starting the OMA, you are advised to view and analyze the OMA log file /var/log/Bigdata/omm/oma/omm_agent.log.
  4. If the problem cannot be resolved by viewing logs, contact maintenance personnel and send the collected fault logs.
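A minimal command sketch of steps 1 to 3, assuming that ${OMA_PATH} is defined in the omm user's environment as referenced in step 2:

    # Switch to user omm on the unhealthy node.
    su - omm
    # Restart the OMA manually, then run the health check again.
    ${OMA_PATH}/restart_oma_app
    # If the result is still unhealthy, inspect the OMA log.
    tail -n 200 /var/log/Bigdata/omm/oma/omm_agent.log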

SSH Trust Relationship Between Each Node and the Active Management Node

Indicator name: Authentication: the authentication between the active OMS node and each other node

Indicator description: This indicator is used to check whether the SSH trust relationship is normal. If you can use SSH to log in to other nodes from the active management node as user omm without entering a password, the SSH trust relationship is healthy; otherwise, it is unhealthy. If you can use SSH to log in to other nodes from the active management node but cannot use SSH to log in to the active management node from other nodes, the SSH trust relationship is also unhealthy.

Recovery guidance:

  1. If this indicator is abnormal, the SSH trust relationship between a node and the active management node is abnormal. Check whether the /home/omm directory is owned by user omm. If the directory is owned by another user, the SSH trust relationship may be abnormal. You are advised to run chown omm:wheel /home/omm to correct the ownership and perform the check again (see the command sketch after this list). If the ownership of /home/omm is normal, go to 2.
  2. If the SSH trust relationship is abnormal, the heartbeat between the Controller and the NodeAgent will be abnormal. As a result, an alarm indicating a node failure will be generated. In this case, see ALM-12006 to handle the alarm.
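A minimal sketch of the ownership check and fix in step 1; node-1 is a placeholder for any other node in the cluster and is used only to verify password-free SSH login as user omm:

    # On the active management node, check the owner of the /home/omm directory.
    ls -ld /home/omm
    # If the owner is not omm, correct the ownership.
    chown omm:wheel /home/omm
    # As user omm, verify that SSH login to another node does not prompt for a password.
    su - omm
    ssh node-1 "hostname"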

Process Running Time

Indicator name: NodeAgent Runtime, Controller Runtime, and Tomcat Runtime

Indicator description: These indicators are used to check the running time of the NodeAgent, Controller, and Tomcat processes. If the running time is less than half an hour (1800s), the process may have been restarted. You are advised to check the running time half an hour later. If the running time is still less than half an hour after multiple checks, the process is abnormal.

Recovery guidance:

  1. Log in to the unhealthy node, and run su - omm to switch to user omm.
  2. Run the following command to view the PID by process name (a combined command sketch for the three processes follows this list):

    ps -ef | grep NodeAgent

  3. Run the following command to view the process start time by PID:

    ps -p pid -o lstart

  4. Check whether the process start time is normal. If the process repeatedly restarts, go to 5.
  5. View the related logs and analyze restart causes.

    If the running time of NodeAgent is abnormal, check the /var/log/Bigdata/nodeagent/agentlog/agent.log log file.

    If the running time of Controller is abnormal, check the /var/log/Bigdata/controller/controller.log log file.

    If the running time of Tomcat is abnormal, check the /var/log/Bigdata/tomcat/web.log log file.

  6. If the problem cannot be resolved by viewing logs, contact maintenance personnel and send the collected fault logs.
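A minimal sketch combining steps 2 and 3, using NodeAgent as the example process; replace the grep pattern with Controller or Tomcat as needed, and replace <pid> with the PID found by the first command:

    # Switch to user omm on the unhealthy node.
    su - omm
    # Find the PID of the process, for example NodeAgent.
    ps -ef | grep NodeAgent | grep -v grep
    # Show the start time of the process with the PID found above.
    ps -p <pid> -o lstart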

Account and Password Expiry Check

Indicator name: Account and Password Expiry Check

Indicator description: This indicator is used to check the expiration time of the accounts and passwords of the two OS users of the MRS system, omm and ommdba. If the validity period of an account or password is 15 days or less, the check result is unhealthy.

Recovery guidance: If the account or password validity period is less than or equal to 15 days, you are advised to contact maintenance personnel to resolve the problem. The expiry information can be checked as shown in the sketch below.
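As a minimal sketch, assuming a standard Linux OS where the chage utility is available (the steps above do not specify a command for this check), the expiry information for the two users can be displayed as follows; run the commands as root, or as each user for that user's own account:

    # Show account and password expiry information for the omm and ommdba OS users.
    chage -l omm
    chage -l ommdba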
