监控关键指标

下面提供了一组推荐的指标来监测。指标分为6类,每一类表示一个共同的系统组件,可能导致该指标报告一个警告/临界值。该类别如下:

  • Application: Application metrics are KPIs that frequently indicate in the customer's application.
  • Memory: Memory metrics are KPIs that may be used to indicate abnormal memory utilization.
  • Network: Network metrics are KPIs that may indicate problems on the network layer.
  • Storage: Storage metrics are KPIs that may be used to indicate abnormal disk utilization.
  • Service: Service/Other metrics are KPIs are a mix between metrics that indicate abnormal Database operation (such as migrations outside of a maintenance event) or system problems that may cause abnormal Database operations (such as time skew).服务/其他指标kpi指标表明之间的混合数据库操作异常(如迁移以外的维修事件)或系统问题,可能会导致异常的数据库操作(如时间偏移)。
  • Trend: Trend metrics are useful stats to allow operations deeper understanding of system behaviors leading up to a particular event.趋势指标是有用的数据,允许业务深入了解系统行为导致一个特定的事件。

除了监控统计以下操作也应该监控Linux重要点,如空闲磁盘空间、空闲RAM,交换等

指标表 Metric Table

更多指标查看参考手册章节中的度量指标

名称 用法
available_pct
Category: storage
Location: namespace
IF available_pct drops below 20% THEN may indicate that defrag is unable to keep up with the current load, warn operations
IF available_pct drops below 15% THEN critical alert to operations, usable disk resources are critically low may result in a stop-writes if situation if available_pct drops drops below 5%.
cluster_size
Category: network
Location: statistics
IF cluster_size does not equal the expected cluster size and cluster in not undergoing maintenance THEN Operations need to investigate the cause.
hwm-breached
Category: storage
Location: namespace
IF hwm-breached is true THEN alert operations that memory or disk resources are strained, may indicate need to increase cluster capacity.
stop-writes
Category: storage
Location: namespace
IF stop-writes is true THEN critical alert to operations, system has entered stop-writes mode, until cause is reverted system will reject all writes from the Application.
timediff_lastship_cur_secs
Category: service
Location: xdr
Number of seconds it took since the last shipment was logged to when it was finally sent to remote cluster. This is an approximation of latency between what has been comitted locally to what has been shipped.
xdr-uptime
Category: service
Location: xdr
IF xdr-uptime is below 300 and the cluster is not undergoing maintenance THEN This XDR on this node was restarted within the last 5 minutes.