Metrics
This section lists the metrics exposed by the main classes.
({scope} refers to the current scope of the StatsLogger passed in.)
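To make the naming concrete, here is a minimal sketch of how scope prefixes compose into full metric names. The ScopedStatsLogger class below is hypothetical and invented purely for illustration; it is not the actual StatsLogger API.

```java
// Illustrative only: a hypothetical scoped stats logger showing how the {scope}
// prefix used throughout this document is assembled.
public class ScopedStatsLoggerExample {

    /** Hypothetical scoped logger: each scope() call appends a path segment. */
    static final class ScopedStatsLogger {
        private final String scope;

        ScopedStatsLogger(String scope) {
            this.scope = scope;
        }

        ScopedStatsLogger scope(String name) {
            return new ScopedStatsLogger(scope.isEmpty() ? name : scope + "/" + name);
        }

        String metricName(String metric) {
            return scope + "/" + metric;
        }
    }

    public static void main(String[] args) {
        // A stats logger scoped to "dl" is passed into a component; its {scope} is "dl".
        ScopedStatsLogger root = new ScopedStatsLogger("dl");
        ScopedStatsLogger zkScope = root.scope("zk");

        // {scope}/zk/get_data expands to "dl/zk/get_data"
        System.out.println(zkScope.metricName("get_data"));
    }
}
```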
MonitoredFuturePool
{scope}/tasks_pending
Gauge. The number of tasks pending in this future pool. If this value becomes high, the pool's execution rate cannot keep up with the submission rate, which causes high task_pending_time and affects the callers that use this future pool. It can also cause heavy JVM GC if the backlog keeps building up.
{scope}/task_pending_time
OpStats. The time that tasks spend waiting to be executed. It becomes high when either tasks_pending is building up or task_execution_time is high, blocking other tasks from executing.
{scope}/task_execution_time
OpStats. The time that tasks spend executing. If it becomes high and there aren't enough threads in this executor, it blocks other tasks from executing, which drives up task_pending_time and impacts end-user latency.
{scope}/task_enqueue_time
OpStats. The time that tasks spend on submission. Submission time also impacts end-user latency.
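The sketch below shows, under stated assumptions, how a future pool could derive these four metrics with plain java.util.concurrent primitives. The stat names match the list above, but recordMillis and pendingTasks are hypothetical stand-ins for a real OpStats logger and gauge; this is not the actual MonitoredFuturePool implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a rough sketch of where the four future pool metrics come from.
public class FuturePoolMetricsSketch {

    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    private final AtomicInteger pendingTasks = new AtomicInteger(); // backs {scope}/tasks_pending

    private static void recordMillis(String metric, long startNanos) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        System.out.println(metric + " = " + elapsedMs + " ms"); // stand-in for an OpStats logger
    }

    public void submit(Runnable task) {
        long enqueueStart = System.nanoTime();
        pendingTasks.incrementAndGet();
        executor.submit(() -> {
            // {scope}/task_pending_time: time between submission and the task starting to run.
            recordMillis("task_pending_time", enqueueStart);
            pendingTasks.decrementAndGet();
            long execStart = System.nanoTime();
            try {
                task.run();
            } finally {
                // {scope}/task_execution_time: time spent running the task itself.
                recordMillis("task_execution_time", execStart);
            }
        });
        // {scope}/task_enqueue_time: time the caller spends handing the task to the pool.
        recordMillis("task_enqueue_time", enqueueStart);
    }

    public static void main(String[] args) throws Exception {
        FuturePoolMetricsSketch pool = new FuturePoolMetricsSketch();
        pool.submit(() -> System.out.println("hello"));
        pool.executor.shutdown();
        pool.executor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```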
MonitoredScheduledThreadPoolExecutor
{scope}/pending_tasks
Gauge. The number of tasks pending in this thread pool executor. If this value becomes high, the executor's execution rate cannot keep up with the submission rate, which causes high task_pending_time and affects the callers that use this executor. It can also cause heavy JVM GC if the queue keeps building up.
{scope}/completed_tasks
Gauge. The number of tasks completed by this thread pool executor.
{scope}/total_tasks
Gauge. The number of tasks submitted to this thread pool executor.
{scope}/task_pending_time
OpStats. The time that tasks spend waiting to be executed. It becomes high when either pending_tasks is building up or task_execution_time is high, blocking other tasks from executing.
{scope}/task_execution_time
OpStats. The time that tasks spend executing. If it becomes high and there aren't enough threads in this executor, it blocks other tasks from executing, which drives up task_pending_time and impacts end-user latency.
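As a hedged sketch, the three gauges above map naturally onto accessors that a standard java.util.concurrent.ScheduledThreadPoolExecutor already provides. The example below only illustrates that mapping; it is not the actual MonitoredScheduledThreadPoolExecutor code.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative only: standard executor accessors that correspond to the three gauges.
public class ScheduledExecutorGaugesSketch {

    public static void main(String[] args) throws Exception {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(2);

        for (int i = 0; i < 10; i++) {
            executor.schedule(() -> { /* no-op task */ }, 100, TimeUnit.MILLISECONDS);
        }

        // {scope}/pending_tasks: tasks queued but not yet executed.
        System.out.println("pending_tasks   = " + executor.getQueue().size());
        // {scope}/completed_tasks: tasks that have finished executing.
        System.out.println("completed_tasks = " + executor.getCompletedTaskCount());
        // {scope}/total_tasks: tasks ever scheduled on this executor (approximate).
        System.out.println("total_tasks     = " + executor.getTaskCount());

        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```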
OrderedScheduler
OrderedScheduler is a thread-pool-based ScheduledExecutorService. It is composed of multiple MonitoredScheduledThreadPoolExecutors, each wrapped into a MonitoredFuturePool. Both aggregated stats and per-executor stats are exposed.
Aggregated Stats
{scope}/task_pending_time
OpStats. The time that tasks spend waiting to be executed. It becomes high when either pending_tasks is building up or task_execution_time is high, blocking other tasks from executing.
{scope}/task_execution_time
OpStats. The time that tasks spend executing. If it becomes high and there aren't enough threads in this executor, it blocks other tasks from executing, which drives up task_pending_time and impacts end-user latency.
{scope}/futurepool/tasks_pending
Gauge. The number of tasks pending in this future pool. If this value becomes high, the pool's execution rate cannot keep up with the submission rate, which causes high task_pending_time and affects the callers that use this future pool. It can also cause heavy JVM GC if the backlog keeps building up.
{scope}/futurepool/task_pending_time
OpStats. The time that tasks spend waiting to be executed. It becomes high when either tasks_pending is building up or task_execution_time is high, blocking other tasks from executing.
{scope}/futurepool/task_execution_time
OpStats. The time that tasks spend executing. If it becomes high and there aren't enough threads in this executor, it blocks other tasks from executing, which drives up task_pending_time and impacts end-user latency.
{scope}/futurepool/task_enqueue_time
OpStats. The time that tasks spend on submission. Submission time also impacts end-user latency.
Per Executor Stats
Stats about individual executors are exposed under {scope}/{name}-executor-{id}-0, where {name} is the scheduler name and {id} is the index of the executor in the pool. The corresponding stats of its future pool are exposed under {scope}/{name}-executor-{id}-0/futurepool. See MonitoredScheduledThreadPoolExecutor and MonitoredFuturePool for more details.
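The sketch below illustrates how such a per-executor stats scope could be assembled, assuming a key is mapped to an executor index by a sign-safe hash modulo the pool size. The hashing scheme and names are assumptions for illustration, not necessarily the exact OrderedScheduler behavior.

```java
// Illustrative only: deriving a per-executor scope like {scope}/{name}-executor-{id}-0.
public class PerExecutorScopeSketch {

    static int chooseExecutor(Object key, int numExecutors) {
        // Sign-safe modulo so negative hash codes still map to a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numExecutors;
    }

    static String executorScope(String scope, String schedulerName, int id) {
        return scope + "/" + schedulerName + "-executor-" + id + "-0";
    }

    public static void main(String[] args) {
        int numExecutors = 8;
        int id = chooseExecutor("my-log-stream", numExecutors);
        // Prints the full path of one per-executor future pool gauge.
        System.out.println(executorScope("dl/scheduler", "DLM", id) + "/futurepool/tasks_pending");
    }
}
```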
ZooKeeperClient
Operation Stats
All operation stats are exposed under {scope}/zk. The stats are latency OpStats on zookeeper operations.
{scope}/zk/{op}
Latency stats per operation. The operations are create_client, get_data, set_data, delete, get_children, multi, get_acl, set_acl and sync.
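A minimal sketch of recording one such latency stat around a zookeeper call is shown below. It assumes a ZooKeeper server at localhost:2181 and uses println as a stand-in for the OpStats logger; it is not the actual ZooKeeperClient instrumentation.

```java
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: timing a zookeeper call the way a {scope}/zk/{op} stat would.
public class ZkOpLatencySketch {

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        long start = System.nanoTime();
        try {
            zk.getData("/", false, null);
            record("zk/get_data", start, true);   // success latency -> {scope}/zk/get_data
        } catch (KeeperException ke) {
            record("zk/get_data", start, false);  // failed operations are typically timed too
        } finally {
            zk.close();
        }
    }

    static void record(String op, long startNanos, boolean success) {
        long micros = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - startNanos);
        System.out.println(op + (success ? " success " : " failure ") + micros + " us");
    }
}
```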
Watched Event Stats
All stats on zookeeper watched events are exposed under {scope}/watcher. The stats are Counters for the watched events this client receives:
{scope}/watcher/state/{keeper_state}
The number of KeeperState changes that this client received. The states are Disconnected, SyncConnected, AuthFailed, ConnectedReadOnly, SaslAuthenticated and Expired. Monitoring metrics such as SyncConnected or Expired helps in understanding the health of this zookeeper client.
{scope}/watcher/events/{event}
The number of `Watcher.Event`s received by this client. The events are None, NodeCreated, NodeDeleted, NodeDataChanged and NodeChildrenChanged.
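The sketch below shows a Watcher that bumps per-state and per-event counters following the naming layout above. The in-memory counter map is a stand-in for a real StatsLogger, not the actual ZooKeeperClient implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Illustrative only: counting watched events per keeper state and event type.
public class WatchedEventCountersSketch implements Watcher {

    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    private void inc(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    @Override
    public void process(WatchedEvent event) {
        inc("watcher/state/" + event.getState());   // e.g. watcher/state/SyncConnected
        inc("watcher/events/" + event.getType());   // e.g. watcher/events/NodeDataChanged
    }

    public static void main(String[] args) {
        WatchedEventCountersSketch watcher = new WatchedEventCountersSketch();
        watcher.process(new WatchedEvent(
                Watcher.Event.EventType.NodeDataChanged,
                Watcher.Event.KeeperState.SyncConnected,
                "/some/path"));
        watcher.counters.forEach((name, count) -> System.out.println(name + " = " + count));
    }
}
```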
Watcher Manager Stats
This ZooKeeperClient provides a watcher manager to manage watchers for applications. It tracks the mapping between paths and watchers, which is what makes it possible to remove watchers. The stats are Gauges on the number of watchers managed by this zookeeper client.
{scope}/watcher_manager/total_watches
Total number of watches managed by this watcher manager. If it keeps growing, it usually means that watchers are leaking (resources aren't closed properly), which will eventually cause OOM.
{scope}/watcher_manager/num_child_watches
Total number of paths watched by this watcher manager.
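A toy watcher manager along these lines is sketched below: it keeps a path-to-watchers map so watchers can be removed later, and exposes the two gauges above. It is an illustration only, not the actual watcher manager implementation.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;
import org.apache.zookeeper.Watcher;

// Illustrative only: a path -> watchers map backing the two watcher manager gauges.
public class WatcherManagerSketch {

    private final Map<String, Set<Watcher>> childWatches = new ConcurrentHashMap<>();

    void registerChildWatcher(String path, Watcher watcher) {
        childWatches.computeIfAbsent(path, p -> new CopyOnWriteArraySet<>()).add(watcher);
    }

    void unregisterChildWatcher(String path, Watcher watcher) {
        Set<Watcher> watchers = childWatches.get(path);
        if (watchers != null) {
            watchers.remove(watcher);
            if (watchers.isEmpty()) {
                childWatches.remove(path);
            }
        }
    }

    // {scope}/watcher_manager/total_watches
    int totalWatches() {
        return childWatches.values().stream().mapToInt(Set::size).sum();
    }

    // {scope}/watcher_manager/num_child_watches
    int numChildWatches() {
        return childWatches.size();
    }

    public static void main(String[] args) {
        WatcherManagerSketch manager = new WatcherManagerSketch();
        Watcher watcher = event -> { };
        manager.registerChildWatcher("/ledgers", watcher);
        System.out.println("total_watches = " + manager.totalWatches());
        System.out.println("num_child_watches = " + manager.numChildWatches());
    }
}
```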
BookKeeperClient
TODO: add bookkeeper stats here
DistributedReentrantLock
All stats related to locks are exposed under {scope}/lock.
{scope}/acquire
OpStats. The time spent acquiring locks.
{scope}/release
OpStats. The time spent releasing locks.
{scope}/reacquire
OpStats. The lock expires when the underlying zookeeper session expires, and the reentrant lock automatically attempts to re-acquire it on session expiry. This metric measures the time spent re-acquiring locks.
{scope}/internalTryRetries
Counter. The number of retries spent re-creating internal locks. Typically, a new internal lock is created after a session expires.
{scope}/acquireTimeouts
Counter. The number of timeouts that callers experienced while acquiring locks.
{scope}/tryAcquire
OpStats. The time each internal lock spends on acquiring.
{scope}/tryTimeouts
Counter. The number of timeouts encountered while internal locks were trying to acquire.
{scope}/unlock
OpStats. The time the caller spends unlocking internal locks.
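The sketch below shows how the acquire OpStats and acquireTimeouts counter could be recorded, using a plain java.util.concurrent ReentrantLock as a stand-in for the zookeeper-backed lock and println as a stand-in for the stats logger; it is not the DistributedReentrantLock implementation.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: timing lock acquisition and counting acquire timeouts.
public class LockAcquireStatsSketch {

    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicLong acquireTimeouts = new AtomicLong(); // {scope}/acquireTimeouts

    boolean acquire(long timeout, TimeUnit unit) throws InterruptedException {
        long start = System.nanoTime();
        boolean acquired = lock.tryLock(timeout, unit);
        long micros = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
        if (acquired) {
            System.out.println("acquire success " + micros + " us"); // {scope}/acquire OpStats
        } else {
            acquireTimeouts.incrementAndGet();
            System.out.println("acquire timeout after " + micros + " us");
        }
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        LockAcquireStatsSketch sketch = new LockAcquireStatsSketch();
        if (sketch.acquire(1, TimeUnit.SECONDS)) {
            sketch.lock.unlock();
        }
    }
}
```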
BKLogHandler
The log handler is a base class for managing log segments, so all metrics in this class relate to log segment retrieval and are exposed under {scope}/logsegments. They are all OpStats in the format of {scope}/logsegments/{op}. The operations are:
- force_get_list: force fetching the list of log segments.
- get_list: get the list of log segments; it might just be served from the local log segment cache.
- get_filtered_list: get the filtered list of log segments.
- get_full_list: get the full list of log segments.
- get_inprogress_segment: time between when an inprogress log segment is created and when the handler reads it.
- get_completed_segment: time between when a log segment is completed and when the handler reads it.
- negative_get_inprogress_segment: records the negative values of get_inprogress_segment.
- negative_get_completed_segment: records the negative values of get_completed_segment.
- recover_last_entry: recovering the last entry from a log segment.
- recover_scanned_entries: the number of entries scanned during recovery.
See BKLogWriteHandler for write handlers.
See BKLogReadHandler for read handlers.
BKLogReadHandler
The core logic in the log read handler is the readahead worker. Most readahead stats are exposed under {scope}/readahead_worker.
{scope}/readahead_worker/wait
Counter. The number of times the readahead worker has to wait. If this keeps increasing, it usually means the readahead cache keeps getting full because the reader is reading too slowly.
{scope}/readahead_worker/repositions
Counter. The number of repositions the readahead worker performs. A reposition happens when the readahead worker finds that it isn't advancing to a new log segment and forces re-positioning.
{scope}/readahead_worker/entry_piggy_back_hits
Counter. Incremented when the last add confirmed is advanced by a piggybacked LAC.
{scope}/readahead_worker/entry_piggy_back_misses
Counter. Incremented when the last add confirmed isn't advanced by a read entry because it doesn't piggyback a newer LAC.
{scope}/readahead_worker/read_entries
OpStats. Stats on the number of entries read per readahead read batch.
{scope}/readahead_worker/read_lac_counter
Counter. Stats on the number of readLastConfirmed operations.
{scope}/readahead_worker/read_lac_and_entry_counter
Counter. Stats on the number of readLastConfirmedAndEntry operations.
{scope}/readahead_worker/cache_full
Counter. Incremented each time the readahead worker finds the cache full. If it keeps increasing, the reader is reading too slowly.
{scope}/readahead_worker/resume
OpStats. Stats on the readahead worker resuming reading from the wait state.
{scope}/readahead_worker/long_poll_interruption
OpStats. Stats on the number of interruptions to long polls. The interruptions usually come from receiving zookeeper notifications.
{scope}/readahead_worker/notification_execution
OpStats. Stats on executing the notifications received from zookeeper.
{scope}/readahead_worker/metadata_reinitialization
OpStats. Stats on metadata re-initialization after receiving notifications of log segment updates.
{scope}/readahead_worker/idle_reader_warn
Counter. Incremented each time the readahead worker detects itself becoming idle.
BKLogWriteHandler
Log write handlers are responsible for log segment creation and deletion. All the metrics are exposed under {scope}/segments.
{scope}/segments/open
OpStats. Latency characteristics on starting a new log segment.
{scope}/segments/close
OpStats. Latency characteristics on completing an inprogress log segment.
{scope}/segments/recover
OpStats. Latency characteristics on recovering a log segment.
{scope}/segments/delete
OpStats. Latency characteristics on deleting a log segment.
BKAsyncLogWriter
{scope}/log_writer/write
OpStats. Latency of write operations.
{scope}/log_writer/write/queued
OpStats. Time that write operations spend in the queue. If {scope}/log_writer/write latency is high, it might be because write operations are pending in the queue for a long time due to log segment rolling.
{scope}/log_writer/bulk_write
OpStats. Latency of bulk_write operations.
{scope}/log_writer/bulk_write/queued
OpStats. Time that bulk_write operations spend in the queue. If {scope}/log_writer/bulk_write latency is high, it might be because the operations are pending in the queue for a long time due to log segment rolling.
{scope}/log_writer/get_writer
OpStats. The time spent getting the writer. It can spike when a log segment roll happens while getting the writer. It is a good stat to look at when latency is dominated by queuing time.
{scope}/log_writer/pending_request_dispatch
Counter. The number of queued operations dispatched after a log segment is rolled. It measures how many operations were queued because of log segment rolling.
BKAsyncLogReader
{scope}/async_reader/future_set
OpStats. Time spent satisfying the futures of read requests. If it is high, the caller is taking a long time to process the results of read requests, which has the side effect of blocking subsequent reads.
{scope}/async_reader/schedule
OpStats. Time spent scheduling next reads.
{scope}/async_reader/background_read
OpStats. Time spent on background reads.
{scope}/async_reader/read_next_exec
OpStats. Time spent executing reader#readNext().
{scope}/async_reader/time_between_read_next
OpStats. Time between two consecutive reader#readNext() calls. If it is high, the caller is calling reader#readNext() too slowly.
{scope}/async_reader/delay_until_promise_satisfied
OpStats. Total latency of the read requests.
{scope}/async_reader/idle_reader_error
Counter. The number of idle reader errors.
BKDistributedLogManager
Future Pools
Stats about the future pools used by writers are exposed under {scope}/writer_future_pool, while stats about the future pools used by readers are exposed under {scope}/reader_future_pool. See MonitoredFuturePool for detailed stats.
Distributed Locks
Stats about the locks used by writers are exposed under {scope}/lock, while those used by readers are exposed under {scope}/read_lock/lock. See DistributedReentrantLock for detailed stats.
Log Handlers
{scope}/logsegments
All basic stats of log handlers are exposed under {scope}/logsegments. See BKLogHandler for detailed stats.
{scope}/segments
Stats about write log handlers are exposed under {scope}/segments. See BKLogWriteHandler for detailed stats.
{scope}/readahead_worker
Stats about read log handlers are exposed under {scope}/readahead_worker. See BKLogReadHandler for detailed stats.
Writers
All writer-related metrics are exposed under {scope}/log_writer. See BKAsyncLogWriter for detailed stats.
Readers
All reader-related metrics are exposed under {scope}/async_reader. See BKAsyncLogReader for detailed stats.
BKDistributedLogNamespace
ZooKeeper Clients
Several zookeeper clients are created per namespace for different purposes. They are:
{scope}/dlzk_factory_writer_shared
Stats about the zookeeper client shared by all DL writers.
{scope}/dlzk_factory_reader_shared
Stats about the zookeeper client shared by all DL readers.
{scope}/bkzk_factory_writer_shared
Stats about the zookeeper client used by the bookkeeper client shared by all DL writers.
{scope}/bkzk_factory_reader_shared
Stats about the zookeeper client used by the bookkeeper client shared by all DL readers.
See ZooKeeperClient for detailed zookeeper stats.
BookKeeper Clients
All bookkeeper-client-related stats are exposed directly under the current {scope}. See BookKeeperClient for detailed stats.
Utils
{scope}/factory/thread_pool
Stats about the ordered scheduler used by this namespace. See OrderedScheduler for detailed stats.
{scope}/factory/readahead_thread_pool
Stats about the readahead thread pool executor used by this namespace. See MonitoredScheduledThreadPoolExecutor for detailed stats.
{scope}/writeLimiter
Stats about the global write limiter used by this namespace.
DistributedLogManager
All the core stats about readers and writers are exposed under the current {scope} via BKDistributedLogManager.