3. Monitoring Scenarios

This section describes the following:

  Navigation
  Alerts
  Interval Rules

Navigation

Use the Ambari SCOM main navigation tree to browse cluster, HDFS and MapReduce performance metrics.

Cluster Summary

This scenario checks the health state of clusters. Choose a cluster by clicking its Cluster Name; the view then visualizes:

  Cluster Services
  Participating Hosts
  Live vs. Dead Nodes
  Space Utilization

After you select a Cluster Service, Participating Hosts populates automatically.

Cluster Diagram

See a layout of Services and Components across your cluster hosts.

HDFS Service Summary

This scenario checks the health state of HDFS Cluster Services. Choose a cluster by clicking its Parent Cluster Name; the view then visualizes:

  Files Summary metrics
  Block Summary metrics
  I/O Summary metrics
  Capacity Remaining

HDFS NameNode

This scenario checks the health state of the NameNode Host Component. Choose a cluster by clicking its Parent Cluster Name; the view then visualizes:

  Memory Heap Utilization
  Thread Status
  Garbage Collection Time (ms)
  Average RPC Wait Time

MapReduce Service Summary

This scenario checks the health state of MapReduce Cluster Services. Choose a cluster by clicking its Parent Cluster Name; the view then visualizes:

  Jobs Summary
  TaskTrackers Summary
  Slots Utilization
  Maps vs. Reducers

MapReduce JobTracker

This scenario checks the health state of the JobTracker Host Component. Choose a cluster by clicking its Parent Cluster Name; the view then visualizes:

  Memory Heap Utilization
  Threads Status
  Garbage Collection Time (ms)
  Average RPC Wait Time

Alerts

The following Alerts are configured by Ambari SCOM:

Name | Alert Message | Description | Threshold
Capacity Remaining | There is little or no space capacity remaining in HDFS. | Percentage of available space on all HDFS nodes together is less than the upper/lower threshold. | Warning: 30, Critical: 10
Under-Replicated Blocks | Number of under-replicated blocks in the HDFS is too high. | Percentage of under-replicated blocks is more than the lower/upper threshold. | Warning: 1, Critical: 5
Corrupted Blocks | There are corrupted file blocks in HDFS. | Gives a critical alert if the number of corrupted blocks is more than the threshold. | 1
DataNodes Down | A significant number of DataNodes are down in the cluster. | Percentage of dead HDFS DataNodes in the cluster is more than the lower/upper threshold. | Warning: 10, Critical: 20
Failed Jobs | MapReduce jobs are failing too frequently. | Percentage of failed MapReduce jobs is more than the lower/upper threshold. | Warning: 10, Critical: 40
Invalid TaskTrackers | There are TaskTracker nodes which are in the invalid state. | Gives a critical alert if there is at least one blacklisted TaskTracker. | 1
Memory Heap Usage | JobTracker is working under high memory pressure. | Percentage of used JobTracker memory heap is more than the lower/upper threshold. | Warning: 80, Critical: 90
Memory Heap Usage | NameNode is working under high memory pressure. | Percentage of used NameNode memory heap is more than the lower/upper threshold. | Warning: 80, Critical: 90
TaskTrackers Down | A significant number of TaskTrackers are down in the cluster. | Percentage of dead MapReduce TaskTrackers is more than the lower/upper threshold. | Warning: 10, Critical: 20
TaskTracker Service State | TaskTracker component is not ... | Turns the TaskTracker service to the warning state if the TaskTracker service is unavailable. |
NameNode Service State | NameNode component is not ... | Gives a critical alert if a NameNode service is unavailable. |
Secondary NameNode Service State | Secondary NameNode component is not ... | Gives a warning alert if a Secondary NameNode service is unavailable. |
JobTracker Service State | JobTracker component is not ... | Gives a critical alert if a JobTracker service is unavailable. |
Oozie Server Service State | Oozie Server component is not ... | Gives a critical alert if an Oozie Server service is unavailable. |
Hive Metastore State | Hive Metastore component is not ... | Gives a critical alert if a Hive Metastore service is unavailable. |
HiveServer State | HiveServer component is not ... | Gives a critical alert if a Hive Server service is unavailable. |
WebHCat Server Service State | WebHCat Server component is not ... | Gives a critical alert if a WebHCat Server service is unavailable. |

Viewing

The Cluster Diagram view shows when an alert has been raised on an object in the cluster; this is indicated by a marker on the cluster icon. You can find out more about any alert by accessing the Alert View, which is available from the Tasks panel on the right.

The Alert View shows all of the alerts for the selected object. You can see details about any alert, or edit its monitor, by selecting it in the list.
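The two-level thresholds in the Alerts table can be read as a simple comparison against a Warning value and a Critical value. The following is a minimal Python sketch of that reading, not the management pack's actual implementation: the class and function names are hypothetical, the state name "Healthy" and the strict comparisons at the boundary are assumptions, and only the threshold numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoLevelThreshold:
    """Hypothetical container for one row's Warning/Critical pair."""
    warning: float
    critical: float
    # True for "more than" alerts (e.g. DataNodes Down),
    # False for "less than" alerts (e.g. Capacity Remaining).
    higher_is_worse: bool = True


def evaluate(value: float, t: TwoLevelThreshold) -> str:
    """Map a percentage to a health state (boundary handling is an assumption)."""
    if t.higher_is_worse:
        if value > t.critical:
            return "Critical"
        if value > t.warning:
            return "Warning"
    else:
        if value < t.critical:
            return "Critical"
        if value < t.warning:
            return "Warning"
    return "Healthy"


# Threshold values taken from the Alerts table.
DATANODES_DOWN = TwoLevelThreshold(warning=10, critical=20)
CAPACITY_REMAINING = TwoLevelThreshold(warning=30, critical=10, higher_is_worse=False)

if __name__ == "__main__":
    for pct in (5, 15, 25):
        print("DataNodes down", pct, "->", evaluate(pct, DATANODES_DOWN))
    for pct in (50, 25, 8):
        print("Capacity remaining", pct, "->", evaluate(pct, CAPACITY_REMAINING))
```

Note the direction flag: Capacity Remaining alerts when the value falls below a threshold, so its Warning value (30) is larger than its Critical value (10), while the other percentage alerts work the opposite way.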

Another way to see all of the alerts for a specific object, or to override the default thresholds and properties, is to access the Health Explorer. You can bring up the Health Explorer by right-clicking any object in the diagram view and selecting the corresponding entry from the menu.

The list on the left shows all of the alerts for the selected object. You can see the Monitor Properties by right-clicking any alert in the list and selecting the corresponding entry from the menu. This shows details about the monitor that is associated with the alert and lets you override the properties and thresholds of the monitor.

You can also see the state changes of an object in the Health Explorer by selecting an alert and picking the State Changes tab on the right. This tab shows the time, as well as the from and to states, of every state change for the monitor associated with the selected alert. The tab also shows the state of the object that triggered the state change.

Customizing

By selecting Overrides you can change the default values of the monitor (Critical Threshold, Warning Threshold, Interval). Check the override box and enter a new value. Then select the destination management pack where the overrides will be stored.

Interval Rules

The following table lists performance rules that have default intervals for alert checks that might require additional tuning to suit your environment. Evaluate these rules to determine whether the default intervals are appropriate for your environment. If a default interval is not appropriate, obtain a baseline for the relevant performance counters, and then adjust the interval by applying an override to the rule.

Name | Description | Interval (secs)
Collect HDFS Blocks Read | |
Collect HDFS Blocks Written | |
Collect HDFS Bytes Read | |
Collect HDFS Bytes Written | |
Collect HDFS Capacity Non-DFS Used (GB) | |
Collect HDFS Capacity Remaining (GB) | |
Collect HDFS Capacity Total (GB) | |
Collect HDFS Capacity Used (GB) | |
Collect HDFS Corrupted Blocks | |
Collect HDFS Dead DataNodes | |
Collect HDFS Decommissioned DataNodes | |
Collect HDFS Files Appended | |
Collect HDFS Files Created | |
Collect HDFS Files Deleted | |
Collect HDFS Live DataNodes | |
Collect HDFS Missing Blocks | |
Collect HDFS Pending Deletion Blocks | |
Collect HDFS Pending Replication Blocks | |
Collect HDFS Total Blocks | |
Collect HDFS Total Files | |
Collect HDFS Under-Replicated Blocks | |
Collect Live vs Dead DataNodes Widget Data | |
Collect Space Utilization Widget Data | |
Collect JVM Errors Logged | |
Collect JVM Fatal Errors Logged | |
Collect JVM Heap Memory Committed | |
Collect JVM Heap Memory Used | This rule collects amount of heap memory used by Host Component. |
Collect JVM Non Heap Memory Committed | This rule collects amount of non-heap memory committed to Host Component. |
Collect JVM Non Heap Memory Used | This rule collects amount of non-heap memory used by Host Component. |
Collect JVM Number of Garbage Collections | This rule collects number of garbage collections performed for Host Component process. |
Collect JVM Threads Blocked | This rule collects number of blocked threads for Host Component process. |
Collect JVM Threads New | This rule collects number of new threads for Host Component process. |
Collect JVM Threads Runnable | This rule collects number of runnable ... |
Collect JVM Threads Terminated | This rule collects number of terminated ... |
Collect JVM Threads Timed Waiting | This rule collects number of timed waiting ... |
Collect JVM Threads Waiting | This rule collects number of waiting threads for Host Component process. |
Collect JVM Time Spent in Garbage Collection (ms) | This rule collects time spent in garbage collection of Host Component process. |
Collect MapReduce Dead TaskTrackers | This rule collects number of dead TaskTrackers for cluster. |
Collect MapReduce Jobs Completed | This rule collects number of completed ... |
Collect MapReduce Jobs Failed | This rule collects number of failed ... |
Collect MapReduce Jobs Failed (%) | This rule collects percent of failed MapReduce jobs in cluster. |
Collect MapReduce Jobs Killed | This rule collects number of killed ... |
Collect MapReduce Jobs Preparing | This rule collects number of preparing ... |
Collect MapReduce Jobs Running | This rule collects number of running ... |
Collect MapReduce Jobs Submitted | This rule collects number of submitted ... |
Collect MapReduce Live TaskTrackers | This rule collects number of live TaskTrackers for cluster. |
Collect MapReduce Map Slots Reserved | This rule collects number of reserved map slots for cluster. |
Collect MapReduce Maps Completed | This rule collects number of completed maps ... |
Collect MapReduce Maps Failed | This rule collects number of failed map ... |
Collect MapReduce Maps Killed | This rule collects number of killed map tasks for cluster. |
Collect MapReduce Maps Launched | This rule collects number of launched map ... |
Collect MapReduce Number of TaskTrackers | This rule collects total number of TaskTrackers in cluster. |
Collect MapReduce Occupied Map Slots | This rule collects number of occupied map slots for cluster. |
Collect MapReduce Reduced Slots Occupied | This rule collects number of occupied reduce slots for cluster. |
Collect MapReduce Reduced Slots Reserved | This rule collects number of reserved reduce slots for cluster. |
Collect MapReduce Reduces Completed | This rule collects number of completed reduce ... |
Collect MapReduce Reduces Failed | This rule collects number of failed reduce ... |
Collect MapReduce Reduces Killed | This rule collects number of killed reduce ... |
Collect MapReduce Reduces Launched | This rule collects number of launched reduce ... |
Collect MapReduce Running Map Tasks | This rule collects number of running map ... |
Collect MapReduce Running Reduce tasks | This rule collects number of running reduce ... |
Collect MapReduce TaskTrackers Blacklisted | This rule collects number of blacklisted TaskTrackers in cluster. |
Collect MapReduce TaskTrackers Decommissioned | This rule collects number of decommissioned TaskTrackers in cluster. |
Collect MapReduce TaskTrackers Graylisted | This rule collects number of graylisted TaskTrackers in cluster. |
Collect MapReduce Waiting Map Tasks | This rule collects number of waiting map ... |
Collect MapReduce Waiting Reduce tasks | This rule collects number of waiting reduce ... |
Collect Network Bytes Received | This rule collects bytes received by Host Component. |
Collect Network Bytes Sent | This rule collects bytes sent by Host Component. |
Collect Queue Average Wait Time | This rule collects queue average time (ms) of remote procedure calls to Host Component. |
Collect RPC Authorization Failures | This rule collects number of failed remote procedure call authorization attempts to Host Component. |
Collect RPC Processing Average Time | This rule collects average processing time (ms) of remote procedure calls to Host Component. |
Collect RPC Processing Number of Operations | This rule collects number of processing remote procedure calls to Host Component. |
Collect RPC Queue Number of Operations | This rule collects number of queued remote procedure calls to Host Component. |
Collect TaskTracker Map Slots | This rule collects number of available map slots on TaskTracker. |
Collect TaskTracker Reduce Slots | This rule collects number of available reduce slots on TaskTracker. |
Collect TaskTracker Running Map Tasks | This rule collects number of running map tasks on TaskTracker. |
Collect TaskTracker Running Reduce tasks | This rule collects number of running reduce tasks on TaskTracker. |
Collect TaskTracker Shuffle Exceptions Caught | This rule collects number of caught exceptions for shuffle running on TaskTracker. |
Collect TaskTracker Shuffle Failed Outputs | This rule collects number of failed outputs for shuffle running on TaskTracker. |
Collect TaskTracker Shuffle Handler Busy (%) | This rule collects percentage of busy shuffle handlers on TaskTracker. |
Collect TaskTracker Shuffle Output Bytes | This rule collects number of bytes produced by shuffle running on TaskTracker. |
Collect TaskTracker Shuffle Success Outputs | This rule collects number of successful outputs for shuffle running on TaskTracker. |
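The override mechanism described under Customizing and Interval Rules can be pictured as a set of default values plus a few replaced fields. The sketch below is only an illustration in Python, not how the management pack stores overrides: the `MonitorDefaults` class, the `with_overrides` helper, and the 300-second interval are hypothetical placeholders, and only the 80/90 thresholds are taken from the NameNode Memory Heap Usage alert.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MonitorDefaults:
    """Hypothetical model of the values a monitor lets you override."""
    warning_threshold: float
    critical_threshold: float
    interval_secs: int


def with_overrides(defaults: MonitorDefaults, **overrides) -> MonitorDefaults:
    """Return a copy of the defaults with the given fields replaced.

    dataclasses.replace raises TypeError for a field name that does not
    exist, so a mistyped override is rejected instead of silently ignored.
    """
    return replace(defaults, **overrides)


# Thresholds from the NameNode Memory Heap Usage alert; the interval is a
# placeholder, not a documented default.
NAMENODE_HEAP = MonitorDefaults(
    warning_threshold=80, critical_threshold=90, interval_secs=300
)

# After baselining the counter, lengthen the check interval only.
tuned = with_overrides(NAMENODE_HEAP, interval_secs=600)

if __name__ == "__main__":
    print("default:", NAMENODE_HEAP)
    print("tuned:  ", tuned)
```

Keeping the defaults immutable and producing a new object mirrors the idea that an override lives separately from the monitor's shipped values, so it can be removed later without losing the original settings.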
