Metrics

This section lists the metrics exposed by the main classes.

({scope} refers to the current scope of the StatsLogger that is passed in.)
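
For example, if the StatsLogger is scoped at dlog, the MonitoredFuturePool gauge tasks_pending shows up as dlog/tasks_pending. Below is a minimal sketch of how the names compose, assuming the StatsLogger follows the org.apache.bookkeeper.stats interface; ScopeExample is a made-up class for illustration, and the exact wiring inside DistributedLog may differ:

    import java.util.concurrent.TimeUnit;

    import org.apache.bookkeeper.stats.Counter;
    import org.apache.bookkeeper.stats.NullStatsLogger;
    import org.apache.bookkeeper.stats.OpStatsLogger;
    import org.apache.bookkeeper.stats.StatsLogger;

    public class ScopeExample {
        public static void main(String[] args) {
            // `root` stands in for the StatsLogger handed to a component; a no-op
            // logger is used here purely to illustrate how metric names compose.
            StatsLogger root = NullStatsLogger.INSTANCE;

            // {scope}/futurepool/task_pending_time
            OpStatsLogger taskPendingTime =
                    root.scope("futurepool").getOpStatsLogger("task_pending_time");

            // {scope}/watcher/state/SyncConnected
            Counter syncConnected =
                    root.scope("watcher").scope("state").getCounter("SyncConnected");

            // Recording samples (signatures assumed from recent bookkeeper-stats releases).
            taskPendingTime.registerSuccessfulEvent(5, TimeUnit.MILLISECONDS);
            syncConnected.inc();
        }
    }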

MonitoredFuturePool

{scope}/tasks_pending

Gauge. The number of tasks pending in this future pool. A high value means that the execution rate of the future pool cannot keep up with the submission rate, which causes high task_pending_time and affects the callers that use this future pool. It can also cause heavy JVM GC if the backlog keeps building up.

{scope}/task_pending_time

OpStats. The characteristics of the time that tasks spend waiting to be executed. It becomes high either because tasks_pending is building up or because task_execution_time is high and blocks other tasks from executing.

{scope}/task_execution_time

OpStats. The characteristics of the time that tasks spend on execution. If it becomes high and there aren't enough threads in this executor, it blocks other tasks from executing, which causes high task_pending_time and impacts end-user latency.

{scope}/task_enqueue_time

OpStats. The time that tasks spend on submission. Submission time also impacts end-user latency.
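
Taken together, these four metrics cover the life cycle of a submitted task: enqueue, wait in the queue, then execute. The sketch below shows how such metrics can be derived by timing a wrapped task against the bookkeeper-stats interfaces; InstrumentedPool is an illustrative class, not the actual MonitoredFuturePool implementation:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.bookkeeper.stats.Gauge;
    import org.apache.bookkeeper.stats.OpStatsLogger;
    import org.apache.bookkeeper.stats.StatsLogger;

    // Illustrative wrapper only: it mirrors the relationship between the four
    // metrics described above, not the real MonitoredFuturePool internals.
    public class InstrumentedPool {
        private final ExecutorService executor = Executors.newFixedThreadPool(4);
        private final AtomicInteger tasksPending = new AtomicInteger(0);
        private final OpStatsLogger taskEnqueueTime;
        private final OpStatsLogger taskPendingTime;
        private final OpStatsLogger taskExecutionTime;

        public InstrumentedPool(StatsLogger statsLogger) {
            this.taskEnqueueTime = statsLogger.getOpStatsLogger("task_enqueue_time");
            this.taskPendingTime = statsLogger.getOpStatsLogger("task_pending_time");
            this.taskExecutionTime = statsLogger.getOpStatsLogger("task_execution_time");
            // {scope}/tasks_pending : sampled on demand by the stats provider.
            statsLogger.registerGauge("tasks_pending", new Gauge<Number>() {
                @Override
                public Number getDefaultValue() { return 0; }
                @Override
                public Number getSample() { return tasksPending.get(); }
            });
        }

        public void submit(final Runnable task) {
            final long enqueueStart = System.nanoTime();
            tasksPending.incrementAndGet();
            executor.submit(new Runnable() {
                @Override
                public void run() {
                    // task_pending_time: time spent in the queue before execution starts.
                    long executionStart = System.nanoTime();
                    tasksPending.decrementAndGet();
                    taskPendingTime.registerSuccessfulEvent(
                            executionStart - enqueueStart, TimeUnit.NANOSECONDS);
                    task.run();
                    // task_execution_time: time spent running the task itself.
                    taskExecutionTime.registerSuccessfulEvent(
                            System.nanoTime() - executionStart, TimeUnit.NANOSECONDS);
                }
            });
            // task_enqueue_time: time the caller spent submitting the task.
            taskEnqueueTime.registerSuccessfulEvent(
                    System.nanoTime() - enqueueStart, TimeUnit.NANOSECONDS);
        }
    }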

MonitoredScheduledThreadPoolExecutor

{scope}/pending_tasks

Gauge. The number of tasks pending in this thread pool executor. A high value means that the execution rate of the executor cannot keep up with the submission rate, which causes high task_pending_time and affects the callers that use this executor. It can also cause heavy JVM GC if the queue keeps building up.

{scope}/completed_tasks

Gauge. The number of tasks completed by this thread pool executor.

{scope}/total_tasks

Gauge. The number of tasks submitted to this thread pool executor.

{scope}/task_pending_time

OpStats. The characteristics of the time that tasks spend waiting to be executed. It becomes high either because pending_tasks is building up or because task_execution_time is high and blocks other tasks from executing.

{scope}/task_execution_time

OpStats. The characteristics of the time that tasks spend on execution. If it becomes high and there aren't enough threads in this executor, it blocks other tasks from executing, which causes high task_pending_time and impacts end-user latency.
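
The three gauges above correspond to standard java.util.concurrent thread pool counters (queue size, completed task count, total task count). The sketch below shows one way to register them, assuming the bookkeeper-stats Gauge interface; ExecutorGauges is an illustrative helper, not the actual MonitoredScheduledThreadPoolExecutor code:

    import java.util.concurrent.ScheduledThreadPoolExecutor;

    import org.apache.bookkeeper.stats.Gauge;
    import org.apache.bookkeeper.stats.StatsLogger;

    public class ExecutorGauges {
        // Registers pending_tasks, completed_tasks and total_tasks gauges for a
        // ScheduledThreadPoolExecutor, mirroring the metrics described above.
        public static void register(final ScheduledThreadPoolExecutor executor,
                                    StatsLogger statsLogger) {
            statsLogger.registerGauge("pending_tasks", new Gauge<Number>() {
                @Override public Number getDefaultValue() { return 0; }
                @Override public Number getSample() { return executor.getQueue().size(); }
            });
            statsLogger.registerGauge("completed_tasks", new Gauge<Number>() {
                @Override public Number getDefaultValue() { return 0; }
                @Override public Number getSample() { return executor.getCompletedTaskCount(); }
            });
            statsLogger.registerGauge("total_tasks", new Gauge<Number>() {
                @Override public Number getDefaultValue() { return 0; }
                @Override public Number getSample() { return executor.getTaskCount(); }
            });
        }
    }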

OrderedScheduler

OrderedScheduler is a thread-pool-based ScheduledExecutorService. It is composed of multiple MonitoredScheduledThreadPoolExecutor instances, each of which is wrapped into a MonitoredFuturePool. Both aggregated stats and per-executor stats are exposed.

Aggregated Stats

{scope}/task_pending_time

OpStats. The characteristics of the time that tasks spend waiting to be executed. It becomes high either because pending_tasks is building up or because task_execution_time is high and blocks other tasks from executing.

{scope}/task_execution_time

OpStats. The characteristics of the time that tasks spend on execution. If it becomes high and there aren't enough threads in this executor, it blocks other tasks from executing, which causes high task_pending_time and impacts end-user latency.

{scope}/futurepool/tasks_pending

Gauge. The number of tasks pending in this future pool. A high value means that the execution rate of the future pool cannot keep up with the submission rate, which causes high task_pending_time and affects the callers that use this future pool. It can also cause heavy JVM GC if the backlog keeps building up.

{scope}/futurepool/task_pending_time

OpStats. The characteristics of the time that tasks spend waiting to be executed. It becomes high either because tasks_pending is building up or because task_execution_time is high and blocks other tasks from executing.

{scope}/futurepool/task_execution_time

OpStats. The characteristics of the time that tasks spend on execution. If it becomes high and there aren't enough threads in this executor, it blocks other tasks from executing, which causes high task_pending_time and impacts end-user latency.

{scope}/futurepool/task_enqueue_time

OpStats. The time that tasks spend on submission. Submission time also impacts end-user latency.

Per Executor Stats

Stats about individual executors are exposed under {scope}/{name}-executor-{id}-0, where {name} is the scheduler name and {id} is the index of the executor in the pool. The corresponding future pool stats are exposed under {scope}/{name}-executor-{id}-0/futurepool. See MonitoredScheduledThreadPoolExecutor and MonitoredFuturePool for more details.

ZooKeeperClient

Operation Stats

All operation stats are exposed under {scope}/zk. The stats are latency OpStats on zookeeper operations.

{scope}/zk/{op}

Latency stats on operations. These operations are create_client, get_data, set_data, delete, get_children, multi, get_acl, set_acl and sync.
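
As an illustration, a get_data sample corresponds to timing a single ZooKeeper getData call and recording the latency under {scope}/zk/get_data. The sketch below assumes the bookkeeper-stats OpStatsLogger interface; TimedZkRead is an illustrative helper, not the actual ZooKeeperClient code:

    import java.util.concurrent.TimeUnit;

    import org.apache.bookkeeper.stats.OpStatsLogger;
    import org.apache.bookkeeper.stats.StatsLogger;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class TimedZkRead {
        // Records the latency of a synchronous getData call under {scope}/zk/get_data.
        public static byte[] getData(ZooKeeper zk, StatsLogger statsLogger, String path)
                throws KeeperException, InterruptedException {
            OpStatsLogger getDataStats = statsLogger.scope("zk").getOpStatsLogger("get_data");
            long start = System.nanoTime();
            try {
                byte[] data = zk.getData(path, false, new Stat());
                getDataStats.registerSuccessfulEvent(System.nanoTime() - start, TimeUnit.NANOSECONDS);
                return data;
            } catch (KeeperException | InterruptedException e) {
                getDataStats.registerFailedEvent(System.nanoTime() - start, TimeUnit.NANOSECONDS);
                throw e;
            }
        }
    }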

Watched Event Stats

All stats on zookeeper watched events are exposed under {scope}/watcher. The stats are Counters for the watched events that this client receives:

{scope}/watcher/state/{keeper_state}

The number of KeeperState changes that this client receives. The states are Disconnected, SyncConnected, AuthFailed, ConnectedReadOnly, SaslAuthenticated and Expired. Monitoring metrics like SyncConnected or Expired helps in understanding the health of this zookeeper client.

{scope}/watcher/events/{event}

The number of `Watcher.Event`s received by this client. Those events are None, NodeCreated, NodeDeleted, NodeDataChanged, NodeChildrenChanged.
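
The sketch below shows how such counters can be bumped from a ZooKeeper Watcher, assuming the bookkeeper-stats Counter interface; CountingWatcher is an illustrative class and not necessarily how ZooKeeperClient wires its watchers internally:

    import org.apache.bookkeeper.stats.StatsLogger;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    public class CountingWatcher implements Watcher {
        private final StatsLogger stateStats;   // {scope}/watcher/state
        private final StatsLogger eventStats;   // {scope}/watcher/events

        public CountingWatcher(StatsLogger watcherScope) {
            this.stateStats = watcherScope.scope("state");
            this.eventStats = watcherScope.scope("events");
        }

        @Override
        public void process(WatchedEvent event) {
            // Every watched event carries a KeeperState and an EventType; count both,
            // e.g. {scope}/watcher/state/SyncConnected or {scope}/watcher/events/NodeDataChanged.
            stateStats.getCounter(event.getState().name()).inc();
            eventStats.getCounter(event.getType().name()).inc();
        }
    }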

Watcher Manager Stats

This ZooKeeperClient provides a watcher manager to manage watchers for applications. It tracks the mapping between paths and watchers, which is what makes removing watchers possible. The stats are Gauges for the number of watchers managed by this zookeeper client.

{scope}/watcher_manager/total_watches

The total number of watches managed by this watcher manager. If it keeps growing, it usually means that watchers are leaking (resources aren't closed properly), which will eventually cause OOM.

{scope}/watcher_manager/num_child_watches

The total number of paths watched by this watcher manager.

BookKeeperClient

TODO: add bookkeeper stats here.

DistributedReentrantLock

All stats related to locks are exposed under {scope}/lock.

{scope}/acquire

OpStats. The characteristics of the time spent on acquiring locks.

{scope}/release

OpStats. The characteristics of the time spent on releasing locks.

{scope}/reacquire

OpStats. The lock expires when the underlying zookeeper session expires; the reentrant lock then attempts to re-acquire the lock automatically. This metric measures the characteristics of the time spent on re-acquiring locks.

{scope}/internalTryRetries

Counter. The number of retries spent on re-creating internal locks. Typically, a new internal lock is created when the session expires.

{scope}/acquireTimeouts

Counter. The number of timeouts that callers experience when acquiring locks.

{scope}/tryAcquire

OpStats. The characteristics of the time that each internal lock spends on acquiring.

{scope}/tryTimeouts

Counter. The number of timeouts encountered when internal locks try acquiring.

{scope}/unlock

OpStats. The characteristics of the time that the caller spends on unlocking internal locks.

BKLogHandler

The log handler is a base class for managing log segments, so all the metrics in this class are related to log segment retrieval and are exposed under {scope}/logsegments. They are all OpStats in the format of {scope}/logsegments/{op}. Those operations are:

  • force_get_list: force fetching the list of log segments.
  • get_list: get the list of log segments; it might just retrieve from the local log segment cache.
  • get_filtered_list: get the filtered list of log segments.
  • get_full_list: get the full list of log segments.
  • get_inprogress_segment: time between when an inprogress log segment is created and when the handler reads it.
  • get_completed_segment: time between when a log segment is completed and when the handler reads it.
  • negative_get_inprogress_segment: records the negative values of get_inprogress_segment.
  • negative_get_completed_segment: records the negative values of get_completed_segment.
  • recover_last_entry: recovering the last entry from a log segment.
  • recover_scanned_entries: the number of entries scanned during recovery.

See BKLogWriteHandler for write handlers.

See BKLogReadHandler for read handlers.

BKLogReadHandler

The core logic in the log read handler is the readahead worker. Most of the readahead stats are exposed under {scope}/readahead_worker.

{scope}/readahead_worker/wait

Counter. The number of times the readahead worker waits. If this keeps increasing, it usually means the readahead cache keeps getting full because the reader is slow at reading.

{scope}/readahead_worker/repositions

Counter. The number of repositions that the readahead worker encounters. A reposition means that the readahead worker finds it isn't advancing to a new log segment and forces a re-positioning.

{scope}/readahead_worker/entry_piggy_back_hits

Counter. It increases when the last add confirmed is advanced because of the piggy-backed LAC.

{scope}/readahead_worker/entry_piggy_back_misses

Counter. It increases when the last add confirmed isn't advanced by a read entry because the entry doesn't piggy-back a newer LAC.

{scope}/readahead_worker/read_entries

OpStats. Stats on the number of entries read per readahead batch.

{scope}/readahead_worker/read_lac_counter

Counter. Stats on the number of readLastConfirmed operations.

{scope}/readahead_worker/read_lac_and_entry_counter

Counter. Stats on the number of readLastConfirmedAndEntry operations.

{scope}/readahead_worker/cache_full

Counter. It increases each time the readahead worker finds the cache has become full. If it keeps increasing, it means the reader is slow at reading.

{scope}/readahead_worker/resume

OpStats. Stats on the readahead worker resuming reading from the wait state.

{scope}/readahead_worker/long_poll_interruption

OpStats. Stats on the number of interruptions to long polls. The interruptions usually come from receiving zookeeper notifications.

{scope}/readahead_worker/notification_execution

OpStats. Stats on executing the notifications received from zookeeper.

{scope}/readahead_worker/metadata_reinitialization

OpStats. Stats on metadata reinitialization after receiving notifications of log segment updates.

{scope}/readahead_worker/idle_reader_warn

Counter. It increases each time the readahead worker detects itself becoming idle.

BKLogWriteHandler

Log write handlers are responsible for log segment creation and deletion. All the metrics are exposed under {scope}/segments.

{scope}/segments/open

OpStats. Latency characteristics on starting a new log segment.

{scope}/segments/close

OpStats. Latency characteristics on completing an inprogress log segment.

{scope}/segments/recover

OpStats. Latency characteristics on recovering a log segment.

{scope}/segments/delete

OpStats. Latency characteristics on deleting a log segment.

BKAsyncLogWriter

{scope}/log_writer/write

OpStats. Latency characteristics of write operations.

{scope}/log_writer/write/queued

OpStats. Latency characteristics of the time that write operations spend in the queue. If {scope}/log_writer/write latency is high, it might be because write operations are pending in the queue for a long time due to log segment rolling.

{scope}/log_writer/bulk_write

OpStats. Latency characteristics of bulk_write operations.

{scope}/log_writer/bulk_write/queued

OpStats. Latency characteristics of the time that bulk_write operations spend in the queue. If {scope}/log_writer/bulk_write latency is high, it might be because write operations are pending in the queue for a long time due to log segment rolling.

{scope}/log_writer/get_writer

OpStats. The time spent on getting the writer. It can spike when log segment rolling happens while getting the writer. It is a good stat to look into when the latency is caused by queuing time.

{scope}/log_writer/pending_request_dispatch

Counter. The number of queued operations that are dispatched after a log segment is rolled. It measures how many operations have been queued because of log segment rolling.

BKAsyncLogReader

{scope}/async_reader/future_set

OpStats. Time spent on satisfying the futures of read requests. If it is high, it means that the caller takes time to process the results of read requests. The side effect is blocking subsequent reads.

{scope}/async_reader/schedule

OpStats. Time spent on scheduling next reads.

{scope}/async_reader/background_read

OpStats. Time spent on background reads.

{scope}/async_reader/read_next_exec

OpStats. Time spent on executing reader#readNext().

{scope}/async_reader/time_between_read_next

OpStats. Time between two consecutive reader#readNext() calls. If it is high, it means that the caller is slow at calling reader#readNext().

{scope}/async_reader/delay_until_promise_satisfied

OpStats. Total latency for the read requests.

{scope}/async_reader/idle_reader_error

Counter. The number of idle reader errors.

BKDistributedLogManager

Future Pools

The stats about the future pools used by writers are exposed under {scope}/writer_future_pool, while the stats about the future pools used by readers are exposed under {scope}/reader_future_pool. See MonitoredFuturePool for detailed stats.

Distributed Locks

The stats about the locks used by writers are exposed under {scope}/lock, while those used by readers are exposed under {scope}/read_lock/lock. See DistributedReentrantLock for detailed stats.

Log Handlers

{scope}/logsegments

All basic stats of log handlers are exposed under {scope}/logsegments. See BKLogHandler for detailed stats.

{scope}/segments

The stats about write log handlers are exposed under {scope}/segments. See BKLogWriteHandler for detailed stats.

{scope}/readahead_worker

The stats about read log handlers are exposed under {scope}/readahead_worker. See BKLogReadHandler for detailed stats.

Writers

All writer-related metrics are exposed under {scope}/log_writer. See BKAsyncLogWriter for detailed stats.

Readers

All reader-related metrics are exposed under {scope}/async_reader. See BKAsyncLogReader for detailed stats.

BKDistributedLogNamespace

ZooKeeper Clients

Several zookeeper clients are created per namespace for different purposes. They are:

{scope}/dlzk_factory_writer_shared

Stats about the zookeeper client shared by all DL writers.

{scope}/dlzk_factory_reader_shared

Stats about the zookeeper client shared by all DL readers.

{scope}/bkzk_factory_writer_shared

Stats about the zookeeper client used by the bookkeeper client that is shared by all DL writers.

{scope}/bkzk_factory_reader_shared

Stats about the zookeeper client used by the bookkeeper client that is shared by all DL readers.

See ZooKeeperClient for detailed zookeeper stats.

BookKeeper Clients

All the bookkeeper client related stats are exposed directly under the current {scope}. See BookKeeperClient for detailed stats.

Utils

{scope}/factory/thread_pool

Stats about the ordered scheduler used by this namespace. See OrderedScheduler for detailed stats.

{scope}/factory/readahead_thread_pool

Stats about the readahead thread pool executor used by this namespace. See MonitoredScheduledThreadPoolExecutor for detailed stats.

{scope}/writeLimiter

Stats about the global write limiter used by this namespace.

DistributedLogManager

All the core stats about readers and writers are exposed under the current {scope} via BKDistributedLogManager.