API monitoring with Prometheus

When you run APIs in production, you need to know what is happening at all times - how many requests each endpoint handles per second, how long they take, whether errors are spiking, and when they do spike, whether the problem is in your code or in an external system your service calls.

Without this visibility, the first sign of trouble is usually a user complaint or an escalation from the business side. By then, you are already behind.

Zato gives you this visibility through a Prometheus /metrics endpoint. Prometheus is the most widely adopted open-source monitoring system for this kind of work - it collects numeric measurements (metrics) from your Zato server at regular intervals, stores them as time series, and lets you query, graph, and alert on them.

If you have not used Prometheus before, the key idea is simple: Zato exposes numbers at a URL, Prometheus fetches them every 15 seconds, and you write queries against the collected data. Further monitoring tools then connect to Prometheus to build your observability dashboards.

In this guide you will connect a Prometheus-compatible scraper to Zato, verify that metrics are flowing, and build your first queries for request rates, latencies, and error breakdowns. The whole thing will take about 15 minutes.

Here's what you'll have at the end:

  • A working /metrics endpoint exposing request rates, latencies, error breakdowns, service invocations, pub/sub throughput, and scheduler job outcomes
  • Your scraper collecting metrics from Zato every 15 seconds
  • A set of ready-to-use PromQL queries you can paste into dashboards and alerts

Checking the endpoint

The /metrics endpoint is created automatically when you start a container. It uses HTTP Basic Auth - the default username is metrics and the password is either whatever you set in Zato_Metrics_Password before creating the environment, or a random string that was generated for you.

You can always change the password in Dashboard under Security -> HTTP Basic Auth - look for the entry named metrics.

Let's verify that the endpoint is working. Open a terminal and run:

curl http://metrics:your-password@localhost:11223/metrics

You should see output like this:

# HELP zato_rest_channel_requests_total Total REST channel requests ...
# TYPE zato_rest_channel_requests_total counter
zato_rest_channel_requests_total{channel_name="crm.api",status_code="2xx",error_source="none"} 53.0

# HELP zato_rest_channel_request_duration_seconds Duration of REST channel requests ...
# TYPE zato_rest_channel_request_duration_seconds histogram
zato_rest_channel_request_duration_seconds_bucket{channel_name="crm.api",le="0.005"} 19.0
...

Notice a few things about the output:

  • Every metric starts with zato_ - that's the prefix for all Zato metrics
  • Durations are in seconds (_seconds), not milliseconds
  • Sizes are in bytes (_bytes)
  • Counters end with _total

If you get a 401, the password is wrong - go to Security -> HTTP Basic Auth in Dashboard and update it.

How the pieces fit together

Zato /metrics endpoint
    |  scraped every 15s
    v
Prometheus - collects and stores, queried via PromQL
    |
    v
Dashboards - Grafana, Datadog, Perses, ...

Prometheus sits in the middle: it pulls metrics from Zato's /metrics endpoint every 15 seconds, stores them as time series, and makes them available through a query language called PromQL.

On its own, Prometheus comes with a basic query interface where you can type queries and see simple graphs. It works, but it is not designed for building rich dashboards or managing alert routing. That is where the third piece comes in - you connect a dashboarding or alerting tool to Prometheus and it handles the visualization side. Popular choices include Grafana, Datadog, Perses, New Relic, and Elastic Observability. They all speak PromQL, so the queries you learn here work in any of them.

The rest of this guide focuses on the first two pieces - getting metrics out of Zato and into Prometheus. Once that works, connecting a dashboard tool is just a matter of pointing it at your Prometheus server's address.
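
For example, if you pick Grafana, a provisioned data source can be as small as this - a minimal sketch in which the file path and the URL (Prometheus assumed reachable at prometheus:9090) are placeholders to adapt to your environment:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    # Assumed address - point this at your actual Prometheus server
    url: http://prometheus:9090
    access: proxy
    isDefault: true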

Connecting Prometheus to Zato

Now that the Zato /metrics endpoint is working, let's point your scraper at it.

Add this to your prometheus.yml:

scrape_configs:
  - job_name: zato
    scrape_interval: 15s
    metrics_path: /metrics
    basic_auth:
      username: metrics
      password: your-password
    static_configs:
      - targets:
          - zato-server:11223
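
If you prefer not to keep the password in prometheus.yml itself, Prometheus also supports reading it from a file - the path below is just an example:

    basic_auth:
      username: metrics
      # File containing nothing but the password; example path
      password_file: /etc/prometheus/zato-metrics-password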

After the first scrape has completed (i.e. within 15 seconds), confirm that data is arriving as expected. Open your Prometheus UI (e.g. http://localhost:9090) and run:

up{job="zato"}

You should see a single result - up{instance="zato-server:11223",job="zato"} with a value of 1 - which confirms that Prometheus is connected to Zato.

Now, let's start querying.

Your first queries

How much traffic is each REST channel handling?

sum by (channel_name) (rate(zato_rest_channel_requests_total[5m]))

What does it show? Requests per second for each channel, averaged over 5 minutes. Channel names come from what you configured in Dashboard.

What fraction of requests are errors?

sum by (channel_name) (rate(zato_rest_channel_requests_total{status_code=~"4xx|5xx"}[5m]))
  /
sum by (channel_name) (rate(zato_rest_channel_requests_total[5m]))

The status_code label uses classes like 2xx and 5xx rather than raw codes like 200 or 503. This keeps the number of time series low and makes queries simpler.

What is the p99 latency per channel?

histogram_quantile(0.99,
  sum by (channel_name, le) (rate(zato_rest_channel_request_duration_seconds_bucket[5m]))
)

Replace 0.99 with 0.95 or 0.5 for other percentiles.
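
Percentiles describe the tail. If you also want the mean latency per channel, every Prometheus histogram exposes _sum and _count series that you can divide:

sum by (channel_name) (rate(zato_rest_channel_request_duration_seconds_sum[5m]))
  /
sum by (channel_name) (rate(zato_rest_channel_request_duration_seconds_count[5m]))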

Finding out who caused the error

When a request fails, you need to know whether the problem is in your Zato service or in one of the external systems it talks to. Those external systems - CRM, billing, payment gateways, any API your service calls - are called upstreams.

Every request counter has an error_source label that answers this question directly from PromQL, without correlating logs.

Here's what each value means:

Value        When it is set
none         The request succeeded (2xx, 3xx)
gateway      Zato returned the error - a service exception, validation failure, or misconfiguration
upstream     An external system your service calls failed - timeout, connection refused, DNS failure
auth         Authentication or authorization denied the request
rate_limit   Rate limiting rejected the request

For example, these two queries watch failing traffic split by who caused it:

rate(zato_rest_channel_requests_total{error_source="upstream"}[5m])
rate(zato_rest_channel_requests_total{error_source="gateway"}[5m])
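
You can also ask what share of all failures comes from upstreams rather than from Zato itself - if this ratio is close to 1, the fix belongs in the external system, not in your service code:

sum(rate(zato_rest_channel_requests_total{error_source="upstream"}[5m]))
  /
sum(rate(zato_rest_channel_requests_total{error_source!="none"}[5m]))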

Monitoring services and other components

REST channels are not the only instrumented component. Error ratio per service:

sum by (service_name) (rate(zato_service_invocations_total{outcome="error"}[5m]))
  /
sum by (service_name) (rate(zato_service_invocations_total[5m]))

Messages published per second, per pub/sub topic:

sum by (topic_name) (rate(zato_pubsub_messages_published_total[5m]))

Success ratio of scheduler jobs:

sum by (job_name) (rate(zato_scheduler_executions_total{outcome="ok"}[5m]))
  /
sum by (job_name) (rate(zato_scheduler_executions_total[5m]))

And the number of requests the server is processing right now - a gauge, so you query it directly:

zato_server_requests_in_flight

Setting up an SLO alert

Now that you have metrics flowing and basic queries working, let's set up something more advanced - a burn-rate SLO alert.

An SLO (Service Level Objective) is a target you set for how reliable your service should be - for example, "99.9% of requests should succeed over a rolling 30-day window". The 0.1% that is allowed to fail is your error budget. A burn-rate alert fires when that budget is being consumed faster than expected, catching both sudden spikes and sustained degradation before you run out:

(
  sum(rate(zato_rest_channel_requests_total{status_code=~"5xx"}[1h]))
    / sum(rate(zato_rest_channel_requests_total[1h]))
  > (14.4 * 0.001)
)
and
(
  sum(rate(zato_rest_channel_requests_total{status_code=~"5xx"}[5m]))
    / sum(rate(zato_rest_channel_requests_total[5m]))
  > (14.4 * 0.001)
)

This fires when the 5xx rate exceeds 1.44% in both the 1-hour and 5-minute windows simultaneously. Requiring both windows to breach at the same time avoids alert fatigue from brief spikes that resolve on their own.
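
To have Prometheus actually page you, wrap the expression in a rule file loaded via rule_files in prometheus.yml. This is a sketch - the alert name, the for: duration, and the severity label are placeholders to adapt:

# slo-burn-rate.rules.yml
groups:
  - name: zato-slo
    rules:
      - alert: ZatoErrorBudgetBurn
        expr: |
          (
            sum(rate(zato_rest_channel_requests_total{status_code=~"5xx"}[1h]))
              / sum(rate(zato_rest_channel_requests_total[1h]))
            > (14.4 * 0.001)
          )
          and
          (
            sum(rate(zato_rest_channel_requests_total{status_code=~"5xx"}[5m]))
              / sum(rate(zato_rest_channel_requests_total[5m]))
            > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: Error budget burning 14.4x faster than the 30-day SLO allows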

How many time series will this produce?

In Prometheus, every unique combination of a metric name and its label values creates a separate time series. For example, if you have 10 channels and 5 status code classes, that's 50 time series just for zato_rest_channel_requests_total. When the number of time series grows too large - typically because a label has too many possible values - Prometheus uses more memory and queries slow down. This is known as a cardinality problem.

Zato avoids this by design. All labels come from configuration objects you define in Dashboard - channel names, service names, connection names, topic names, job names. No label is ever derived from unbounded request data like URL paths, query strings, user IDs, or IP addresses.

For a deployment with 50 REST channels, 20 outgoing connections, 100 services, 10 pub/sub topics, and 20 scheduler jobs, the total comes to around 6,400 time series. Even if your deployment is larger, the number grows linearly with the number of configured objects, not with traffic volume.
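
To check the actual number in your own deployment rather than estimate it, count the Zato series directly in Prometheus:

count({__name__=~"zato_.*"})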

Histogram bucket values

All duration histograms use 12 buckets from 5 ms to 30 seconds:

0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0

All size histograms use 10 buckets from 64 bytes to 16 MB:

64, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216

Both sets are shared across all histograms of the same kind, so aggregating across channels or connections in PromQL works correctly.
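
Shared boundaries also let you query against a specific bucket across all channels at once. For example, the fraction of all requests completing in under 100 ms - 0.1 is one of the shared boundaries, so the cumulative le bucket can be used directly:

sum(rate(zato_rest_channel_request_duration_seconds_bucket{le="0.1"}[5m]))
  /
sum(rate(zato_rest_channel_request_duration_seconds_count[5m]))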

FAQ

I get a 401 Unauthorized when curling /metrics - what's wrong?

The password doesn't match. Go to Security -> HTTP Basic Auth in Dashboard, find the entry named metrics, click Change password, and set it to what your scraper is using.

My scraper reports a timeout

Under normal load a Zato scrape completes in under 100 ms. When you see a timeout, start with connectivity - wrong host or port, firewall, or the process not listening where you think it is. If that checks out, increase scrape_timeout in your scraper config to at least 5 seconds.
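
In prometheus.yml that is a single line on the job:

  - job_name: zato
    scrape_interval: 15s
    # Per-job override; must not exceed scrape_interval
    scrape_timeout: 5s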

Metric reference

For a complete reference of every metric Zato exposes - including type, labels, bucket values, and example queries - see Prometheus metric reference.