When you run APIs in production, you need to know what is happening at all times - how many requests each endpoint handles per second, how long they take, whether errors are spiking, and when they do spike, whether the problem is in your code or in an external system your service calls.

Without this visibility, the first sign of trouble is usually a user complaint or an escalation from the business side. By then, you are already behind.
Zato gives you this visibility through a Prometheus /metrics endpoint. Prometheus is the most widely adopted open-source monitoring system for this kind of work - it collects numeric measurements (metrics) from your Zato server at regular intervals, stores them as time series, and lets you query, graph, and alert on them.
If you have not used Prometheus before, the key idea is simple: Zato exposes numbers at a URL, Prometheus fetches them every 15 seconds, and you write queries against the collected data. On top of that, you connect further monitoring tools to Prometheus to build your observability dashboards.
In this guide you will connect a Prometheus-compatible scraper to Zato, verify that metrics are flowing, and build your first queries for request rates, latencies, and error breakdowns. The whole thing will take about 15 minutes.
Here's what you'll have at the end:
- A /metrics endpoint exposing request rates, latencies, error breakdowns, service invocations, pub/sub throughput, and scheduler job outcomes

The /metrics endpoint is created automatically when you start a container. It uses HTTP Basic Auth - the default username is metrics and the password is either whatever you set in Zato_Metrics_Password before creating the environment, or a random string that was generated for you.
You can always change the password in Dashboard under Security -> HTTP Basic Auth, look for the entry named metrics.
Let's verify that the endpoint is working. Open a terminal and run (using your own password and host):

curl -u metrics:your-password http://localhost:11223/metrics
You should see output like this:
# HELP zato_rest_channel_requests_total Total REST channel requests ...
# TYPE zato_rest_channel_requests_total counter
zato_rest_channel_requests_total{channel_name="crm.api",status_code="2xx",error_source="none"} 53.0
# HELP zato_rest_channel_request_duration_seconds Duration of REST channel requests ...
# TYPE zato_rest_channel_request_duration_seconds histogram
zato_rest_channel_request_duration_seconds_bucket{channel_name="crm.api",le="0.005"} 19.0
...
Notice a few things about the output:
- Every name starts with zato_ - that's the prefix for all Zato metrics
- Durations are in seconds (_seconds), not milliseconds
- Sizes are in bytes (_bytes)
- Counters end in _total

If you get a 401, the password is wrong - go to Security -> HTTP Basic Auth in Dashboard and update it.
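If you want to sanity-check a scrape from a script rather than by eye, parsing the exposition format takes only a few lines. The sketch below is illustrative, not an official client - it ignores escaped characters, timestamps, and commas inside label values:

```python
import re

# Matches one sample line: metric_name{labels} value
# A minimal sketch - real scrapers should use a Prometheus client library.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)

def parse_metrics(text):
    """Yield (metric_name, labels_dict, value) for each sample line,
    skipping # HELP / # TYPE comments and blank lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        match = LINE_RE.match(line)
        if not match:
            continue
        labels = {}
        if match.group('labels'):
            # Naive split - does not handle commas inside quoted values
            for pair in match.group('labels').split(','):
                key, _, value = pair.partition('=')
                labels[key] = value.strip('"')
        yield match.group('name'), labels, float(match.group('value'))

# Sample taken from the curl output shown above
sample = '''# HELP zato_rest_channel_requests_total Total REST channel requests
# TYPE zato_rest_channel_requests_total counter
zato_rest_channel_requests_total{channel_name="crm.api",status_code="2xx",error_source="none"} 53.0'''

for name, labels, value in parse_metrics(sample):
    print(name, labels['channel_name'], labels['status_code'], value)
```

This is handy in smoke tests - assert that a given metric name appears and that its value is a number, without standing up a full Prometheus server.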
Prometheus sits in the middle - it pulls metrics from Zato's /metrics endpoint every 15 seconds, stores them as time series, and makes them available via a query language called PromQL.
On its own, Prometheus comes with a basic query interface where you can type queries and see simple graphs. It works, but it is not designed for building rich dashboards or managing alert routing. That is where the third piece comes in - you connect a dashboarding or alerting tool to Prometheus and it handles the visualization side. Popular choices include Grafana, Datadog, Perses, New Relic, and Elastic Observability. They all speak PromQL, so the queries you learn here work in any of them.
The rest of this guide focuses on the first two pieces - getting metrics out of Zato and into Prometheus. Once that works, connecting a dashboard tool is just a matter of pointing it at your Prometheus server's address.
Now that the Zato /metrics endpoint is working, let's point your scraper at it.
Add this to your prometheus.yml:
scrape_configs:
  - job_name: zato
    scrape_interval: 15s
    metrics_path: /metrics
    basic_auth:
      username: metrics
      password: your-password
    static_configs:
      - targets:
          - zato-server:11223
After the first scrape has completed (i.e. after 15 seconds), confirm that data is arriving as expected. Open your Prometheus UI (e.g. http://localhost:9090) and run:

up{job="zato"}

You should see a series like up{instance="zato-server:11223",job="zato"} with a value of 1, which confirms that Prometheus is connected to Zato correctly.

Good - let's start querying now.
How much traffic is each REST channel handling?

sum by (channel_name) (rate(zato_rest_channel_requests_total[5m]))

What does it show? Requests per second for each channel, averaged over the last 5 minutes. Channel names come from what you configured in Dashboard.
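Latency comes from the duration histogram you saw in the /metrics output. For example, an estimated 95th percentile per channel over the same 5-minute window (0.95 is just an example quantile):

```
histogram_quantile(
  0.95,
  sum by (channel_name, le) (rate(zato_rest_channel_request_duration_seconds_bucket[5m]))
)
```

histogram_quantile interpolates within bucket boundaries, so the result is an estimate rather than an exact percentile - precise enough for dashboards and alerts.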
What fraction of requests are errors?
sum by (channel_name) (rate(zato_rest_channel_requests_total{status_code=~"4xx|5xx"}[5m]))
/
sum by (channel_name) (rate(zato_rest_channel_requests_total[5m]))
The status_code label uses classes like 2xx and 5xx rather than raw codes like 200 or 503. This keeps the number of time series low and makes queries simpler.
When a request fails, you need to know whether the problem is in your Zato service or in one of the external systems it talks to. Those external systems - CRM, billing, payment gateways, any API your service calls - are called upstreams.
Every request counter has an error_source label that answers this question directly from PromQL, without correlating logs.
Here's what each value means:
| Value | When it is set |
|---|---|
| none | The request succeeded (2xx, 3xx) |
| gateway | Zato returned the error - a service exception, validation failure, or misconfiguration |
| upstream | An external system your service calls failed - timeout, connection refused, DNS failure |
| auth | Authentication or authorization denied the request |
| rate_limit | Rate limiting rejected the request |
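With that label in place, one query splits the failure rate by where the problem originated - a sketch using the counter shown earlier:

```
sum by (channel_name, error_source) (
  rate(zato_rest_channel_requests_total{error_source!="none"}[5m])
)
```

If upstream dominates, you look at the external system; if gateway dominates, you look at your own services - no log correlation needed.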

Now that you have metrics flowing and basic queries working, let's set up something more advanced - a burn-rate SLO alert.
An SLO (Service Level Objective) is a target you set for how reliable your service should be - for example, "99.9% of requests should succeed over a rolling 30-day window". The 0.1% that is allowed to fail is your error budget. A burn-rate alert fires when that budget is being consumed faster than expected, catching both sudden spikes and sustained degradation before you run out:
(
sum(rate(zato_rest_channel_requests_total{status_code=~"5xx"}[1h]))
/ sum(rate(zato_rest_channel_requests_total[1h]))
> (14.4 * 0.001)
)
and
(
sum(rate(zato_rest_channel_requests_total{status_code=~"5xx"}[5m]))
/ sum(rate(zato_rest_channel_requests_total[5m]))
> (14.4 * 0.001)
)
This fires when the 5xx rate exceeds 1.44% in both the 1-hour and 5-minute windows simultaneously. Requiring both windows to breach at the same time avoids alert fatigue from brief spikes that resolve on their own.
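To turn this expression into an actual alert, wrap it in a Prometheus alerting rule. A sketch, assuming a rules file loaded via rule_files in prometheus.yml - the group name, alert name, for duration, and labels below are placeholders to adapt:

```yaml
groups:
  - name: zato-slo
    rules:
      - alert: ZatoErrorBudgetBurn
        expr: |
          (
            sum(rate(zato_rest_channel_requests_total{status_code=~"5xx"}[1h]))
            / sum(rate(zato_rest_channel_requests_total[1h]))
            > (14.4 * 0.001)
          )
          and
          (
            sum(rate(zato_rest_channel_requests_total{status_code=~"5xx"}[5m]))
            / sum(rate(zato_rest_channel_requests_total[5m]))
            > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: Zato REST error budget burning at 14.4x the sustainable rate
```

The for: 2m clause adds a further guard - the condition must hold for two consecutive minutes before the alert fires.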
In Prometheus, every unique combination of a metric name and its label values creates a separate time series. For example, if you have 10 channels and 5 status code classes, that's 50 time series just for zato_rest_channel_requests_total. When the number of time series grows too large - typically because a label has too many possible values - Prometheus uses more memory and queries slow down. This is known as a cardinality problem.
Zato avoids this by design. All labels come from configuration objects you define in Dashboard - channel names, service names, connection names, topic names, job names. No label is ever derived from unbounded request data like URL paths, query strings, user IDs, or IP addresses.
For a deployment with 50 REST channels, 20 outgoing connections, 100 services, 10 pub/sub topics, and 20 scheduler jobs, the total comes to around 6,400 time series. Even if your deployment is larger, the number grows linearly with the number of configured objects, not with traffic volume.
All duration histograms use 12 buckets, from 5 ms to 30 seconds. All size histograms use 10 buckets, from 64 bytes to 16 MB. Both sets are shared across all histograms of the same kind, so aggregating across channels or connections in PromQL works correctly.
I get a 401 Unauthorized when curling /metrics - what's wrong?
The password doesn't match. Go to Security -> HTTP Basic Auth in Dashboard, find the entry named metrics, click Change password, and set it to what your scraper is using.
My scraper reports a timeout
Under normal load a Zato scrape completes in under 100 ms. When you see a timeout, start with connectivity - wrong host or port, firewall, or the process not listening where you think it is. If that checks out, increase scrape_timeout in your scraper config to at least 5 seconds.
For a complete reference of every metric Zato exposes - including type, labels, bucket values, and example queries - see Prometheus metric reference.