Hynek Schlawack. Get Instrumented. How Prometheus Can Unify Your Metrics

1 Hynek Schlawack Get Instrumented How Prometheus Can Unify Your Metrics

2 Goals

10 Service Level
    Indicator
    Objective
    (Agreement)

14 Metrics
    avg latency
    server load
    12:00 12:01 12:02 12:03 12:04

16 Instrument

28 Metric Types
    counter
    gauge
    summary
    histogram
        buckets (1s, 0.5s, 0.25s, ...)

36 Averages
    avg(request time) ≠ avg(UX)
    avg({1, 1, 1, 1, 10}) = 2.8
    median({1, 1, 1, 1, 10}) = 1
    median({1, 1, 100_000}) = 1
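The numbers on the slide can be checked with Python's standard `statistics` module. This is only a small sketch: the variable names are made up for illustration, and the talk does not say which unit the values are in.

```python
from statistics import mean, median

# Request times taken from the slide (unit not specified in the talk).
mostly_fast = [1, 1, 1, 1, 10]
one_outlier = [1, 1, 100_000]

avg_mostly_fast = mean(mostly_fast)        # pulled up by the single slow request
median_mostly_fast = median(mostly_fast)   # what the typical request experienced
median_one_outlier = median(one_outlier)   # the outlier is invisible here

print(avg_mostly_fast, median_mostly_fast, median_one_outlier)
```

Neither number alone describes the data well: the average overstates the typical request, and the median hides the very slow one.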

38 Percentiles
    nth percentile P of a data set = P ≥ n% of values

41 50th percentile = 1 ms
    50% of requests done by 1 ms

44 Percentiles
    P       {1, 1, 100_000}
    50th    1
    95th    90_000
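The table can be reproduced with a short percentile function. The sketch below assumes linear interpolation between neighbouring values, which is one common definition and gives roughly the 90_000 shown for the 95th percentile; the talk does not say which definition it used, and the function name is hypothetical.

```python
def percentile(values, n):
    """Return the n-th percentile (0 <= n <= 100) using linear interpolation."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Fractional position of the percentile within the sorted data.
    pos = (len(ordered) - 1) * n / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    # Interpolate between the two neighbouring values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


data = [1, 1, 100_000]
p50 = percentile(data, 50)   # the median
p95 = percentile(data, 95)   # dominated by the single slow request
print(p50, round(p95))
```

Unlike the median, a high percentile makes the slow outlier visible.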

50 Naming
    Instead of:
        backend1_app_http_reqs_msgs_post
        backend1_app_http_reqs_msgs_get
    one metric name with labels:
        app_http_reqs_total{meth="post", path="/msgs", backend="1"}
        app_http_reqs_total{meth="get", path="/msgs", backend="1"}

54
    1. resolution = scraping interval
    2. missing scrapes = less resolution

57 Pull: Problems
    short-lived jobs
    target discovery

61 Configuration
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets:
              - 'localhost:9090'

    {instance="localhost:9090",job="prometheus"}

63 Pull: Problems
    target discovery
    short-lived jobs
    Heroku/NATed systems

68 Pull: Advantages
    multiple Prometheis easy
    outage detection
    predictable, no self-DoS
    easy to instrument 3rd parties

73 Metrics Format
    # HELP req_seconds Time spent processing a request in seconds.
    # TYPE req_seconds histogram
    req_seconds_count
    req_seconds_sum

76 Percentiles
    req_seconds_bucket{le="0.05"} 0.0
    req_seconds_bucket{le="0.25"} 1.0
    req_seconds_bucket{le="0.5"}
    req_seconds_bucket{le="0.75"}
    req_seconds_bucket{le="1.0"}
    req_seconds_bucket{le="2.0"}
    req_seconds_bucket{le="+inf"} 390.0
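The `le` ("less than or equal") buckets are cumulative: each observation is counted in every bucket whose upper bound covers it. The toy class below is a sketch of that bookkeeping and is not the `prometheus_client` implementation; the class name and the observed values are made up for illustration.

```python
import math


class ToyHistogram:
    """Toy model of a cumulative histogram; not the prometheus_client class."""

    def __init__(self, bounds):
        # The last bucket always catches everything.
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        self.count += 1
        self.sum += value
        # Cumulative buckets: bump every bucket whose bound covers the value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1


h = ToyHistogram([0.05, 0.25, 0.5, 0.75, 1.0, 2.0])
for seconds in (0.2, 0.3, 0.3, 0.9, 5.0):  # hypothetical request durations
    h.observe(seconds)

for bound, n in zip(h.bounds, h.bucket_counts):
    print(f'le="{bound}" {n}')
```

Because the buckets are cumulative, the last bucket always equals the total number of observations.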

82 Aggregation
    sum(rate(req_seconds_count[1m]))

83 Aggregation
    sum(rate(req_seconds_count{dc="west"}[1m]))

84 Aggregation
    sum(rate(req_seconds_count[1m])) by (dc)
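What these queries compute can be sketched in plain Python. This is a simplification under stated assumptions: the sample data and the `instance` names are hypothetical, and the rate here is just the increase between the first and last sample of the window, whereas Prometheus's `rate()` also handles counter resets and extrapolates to the window edges.

```python
from collections import defaultdict

# Hypothetical counter samples inside a 1-minute window:
# (dc, instance) -> [(unix_timestamp, req_seconds_count), ...]
samples = {
    ("west", "a"): [(0, 100), (30, 130), (60, 160)],
    ("west", "b"): [(0, 50), (30, 65), (60, 80)],
    ("east", "c"): [(0, 10), (30, 40), (60, 70)],
}


def simple_rate(points):
    """Per-second increase between the first and last sample of the window."""
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)


def sum_rate_by_dc(series):
    """Rough analogue of: sum(rate(req_seconds_count[1m])) by (dc)."""
    totals = defaultdict(float)
    for (dc, _instance), points in series.items():
        totals[dc] += simple_rate(points)
    return dict(totals)


per_dc = sum_rate_by_dc(samples)   # one value per data center
total = sum(per_dc.values())       # analogue of the query without "by (dc)"
print(per_dc, total)
```

Selecting `{dc="west"}` corresponds to reading `per_dc["west"]` only.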

89 Percentiles
    histogram_quantile(0.9, rate(req_seconds_bucket[10m]))
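`histogram_quantile` estimates a quantile from cumulative buckets by finding the bucket that contains the requested rank and interpolating linearly inside it. The function below is a sketch of that idea, not Prometheus's code. It works on raw counts rather than rates, and the bucket counts for 0.5 through 2.0 are hypothetical; only the 0.05, 0.25 and +inf values come from the slide.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound (or an empty bucket): fall back to
                # the previous bound instead of interpolating.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


buckets = [
    (0.05, 0.0), (0.25, 1.0),                              # from the slide
    (0.5, 100.0), (0.75, 300.0), (1.0, 380.0), (2.0, 390.0),  # hypothetical
    (math.inf, 390.0),                                     # from the slide
]
p90 = histogram_quantile(0.9, buckets)
print(round(p90, 3))
```

The estimate can only be as precise as the bucket boundaries, which is why bucket choice matters.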

95 Internal
    great for ad-hoc
    1 expr per graph
    templating

99 PromDash
    best integration
    former official
    now deprecated
    don't bother

104 Grafana
    pretty & powerful
    many integrations
    mix and match!
    use this!

112 Alerts & Scrying
    ALERT DiskWillFillIn4Hours
      IF predict_linear(node_filesystem_free[1h], 4*3600) < 0
      FOR 5m
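The alert relies on `predict_linear`, which fits a straight line through the samples of the last hour and extends it four hours ahead. A minimal sketch of that idea with a least-squares fit follows. The sample values are hypothetical, and this version projects from the last sample's timestamp, which only approximates what Prometheus does at evaluation time.

```python
def predict_linear(points, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) points and
    evaluate it `seconds_ahead` seconds after the last timestamp."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    var = sum((t - mean_t) ** 2 for t, _ in points)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (points[-1][0] + seconds_ahead)


# Hypothetical free-space samples over one hour, one every 10 minutes,
# shrinking by 500 bytes per second.
free_bytes = [(t, 4_000_000 - 500 * t) for t in range(0, 3601, 600)]

predicted = predict_linear(free_bytes, 4 * 3600)
disk_will_fill = predicted < 0   # the alert's IF condition
print(predicted, disk_will_fill)
```

The `FOR 5m` clause is not modelled here; it requires the condition to hold for five minutes before the alert fires.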

116 Environment

118 HAProxy MySQL etcd Consul nginx statsd graphite collectd Django Kubernetes redis PostgreSQL Varnish SNMP CouchDB InfluxDB MongoDB Apache

121 cadvisor node_exporter

122 System Insight load memory disk procs network I/O

126 mtail
    follow (log) files
    extract metrics using regex
    can be better than direct
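The idea of extracting metrics from log lines with a regex can be sketched in Python. mtail has its own program language, so this only illustrates the principle; the log format, the regex and the function name are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical access-log lines; the format is an assumption for this sketch.
LOG_LINES = [
    '10.0.0.1 "GET /msgs" 200',
    '10.0.0.2 "POST /msgs" 201',
    'this line does not match and is skipped',
    '10.0.0.1 "GET /msgs" 500',
]

LINE_RE = re.compile(r'"(?P<meth>[A-Z]+) (?P<path>\S+)" (?P<status>\d{3})')


def count_requests(lines):
    """Count requests per (method, path, status), skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["meth"], match["path"], match["status"])] += 1
    return counts


requests_total = count_requests(LOG_LINES)
for (meth, path, status), n in sorted(requests_total.items()):
    print(f'reqs_total{{meth="{meth}", path="{path}", status="{status}"}} {n}')
```

The captured groups play the role of labels, so one pattern yields a whole family of time series.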

131 Moar
    Edges: web servers/haproxy
    black box
    databases
    network

135 So Far
    system stats
    outside look
    3rd party components

136 Code

140 cat-or.not
    HTTP service
    upload picture
    meow!/nope!

143
    from flask import Flask, g, request
    from cat_or_not import is_cat

    app = Flask(__name__)

    @app.route("/analyze", methods=["post"])
    def analyze():
        g.auth.check(request)
        return ("meow!" if is_cat(request.files["pic"]) else "nope!")

    if __name__ == "__main__":
        app.run()

144 pip install prometheus_client

145
    from prometheus_client import start_http_server

    # ...

    if __name__ == "__main__":
        # Expose the metrics on port 8000, next to the app itself.
        start_http_server(8000)
        app.run()

151
    process_virtual_memory_bytes
    process_resident_memory_bytes
    process_start_time_seconds
    process_cpu_seconds_total
    process_open_fds 8.0
    process_max_fds

155
    from prometheus_client import Histogram, Gauge

    REQUEST_TIME = Histogram(
        "cat_or_not_request_seconds",
        "Time spent in HTTP requests.")
    ANALYZE_TIME = Histogram(
        "cat_or_not_analyze_seconds",
        "Time spent analyzing pictures.")
    IN_PROGRESS = Gauge(
        "cat_or_not_in_progress_requests",
        "Number of requests in progress.")

157
    def analyze():
        g.auth.check(request)
        with ANALYZE_TIME.time():
            result = is_cat(
                request.files["pic"].stream)
        return "meow!" if result else "nope!"
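The `with ANALYZE_TIME.time():` block measures how long the enclosed code takes and records the duration as one observation. The sketch below shows that pattern with the standard library only. The `timed` helper and the plain list used as a sink are hypothetical stand-ins, not part of `prometheus_client`.

```python
import time
from contextlib import contextmanager

observed_seconds = []  # stand-in for a histogram's observations


@contextmanager
def timed(sink):
    """Record the wall-clock duration of the enclosed block in `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed calls are timed too.
        sink.append(time.perf_counter() - start)


with timed(observed_seconds):
    sum(range(1000))  # placeholder for the real work, e.g. is_cat(...)

print(len(observed_seconds))
```

Timing only the `is_cat` call, rather than the whole view, separates analysis time from authentication and HTTP overhead.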

161
    from prometheus_client import Counter, Histogram

    AUTH_TIME = Histogram("auth_seconds", "Time spent authenticating.")
    AUTH_ERRS = Counter("auth_errors_total", "Errors while authing.")
    AUTH_WRONG_CREDS = Counter("auth_wrong_creds_total", "Wrong credentials.")

    class Auth:
        def auth(self, request):
            # Retry until authentication succeeds or the credentials are rejected.
            while True:
                try:
                    return self._auth(request)
                except WrongCredsError:
                    # Wrong credentials: count them and give up.
                    AUTH_WRONG_CREDS.inc()
                    raise
                except Exception:
                    # Any other failure is counted and the call is retried.
                    AUTH_ERRS.inc()

162
    @app.route("/analyze", methods=["post"])
    def analyze():
        g.auth.check(request)
        with ANALYZE_TIME.time():
            result = is_cat(
                request.files["pic"].stream)
        return "meow!" if result else "nope!"

163 pip install prometheus_async

164 Wrapper
    from prometheus_async.aio import time

    @time(REQUEST_TIME)
    async def view(request):
        # ...

168 Goodies
    aiohttp-based metrics export
    also in thread!
    Consul Agent integration

169 Wrap Up

174 vrmd.de
