Monitoring¶
Consolidated reference for BigBrotr's observability stack: Prometheus metrics, alerting rules, Grafana dashboards, and structured logging.
Overview¶
BigBrotr exposes operational telemetry through three channels:
- Prometheus metrics -- `/metrics` HTTP endpoint on each service
- Structured logging -- key=value text or JSON output
- Grafana dashboards -- auto-provisioned visualization
```mermaid
flowchart LR
    subgraph Services
        F["Finder<br/><small>:8001</small>"]
        V["Validator<br/><small>:8002</small>"]
        M["Monitor<br/><small>:8003</small>"]
        S["Synchronizer<br/><small>:8004</small>"]
    end
    P["Prometheus<br/><small>:9090</small>"]
    G["Grafana<br/><small>:3000</small>"]
    F -->|"/metrics"| P
    V -->|"/metrics"| P
    M -->|"/metrics"| P
    S -->|"/metrics"| P
    P -->|"datasource"| G
    style P fill:#E65100,color:#fff,stroke:#BF360C
    style G fill:#1B5E20,color:#fff,stroke:#0D3B0F
```
Prometheus Metrics¶
Each service exposes four metric types via the `/metrics` endpoint, implemented in `src/bigbrotr/core/metrics.py`.
Metric Types¶
| Metric Name | Prometheus Type | Labels | Description |
|---|---|---|---|
| `service_info` | Info | `service` | Static service metadata (version, name) |
| `service_gauge` | Gauge | `service`, `name` | Point-in-time operational state |
| `service_counter` | Counter | `service`, `name` | Cumulative totals since startup |
| `cycle_duration_seconds` | Histogram | `service` | Cycle execution latency distribution |
Gauge Names¶
| Gauge `name` label | Description |
|---|---|
| `consecutive_failures` | Number of consecutive failed cycles (resets to 0 on success) |
| `progress` | Current batch progress (0.0 to 1.0) |
| `last_cycle_timestamp` | Unix timestamp of the last completed cycle |
Counter Names¶
| Counter `name` label | Description |
|---|---|
| `cycles_success` | Total successful cycles since startup |
| `cycles_failed` | Total failed cycles since startup |
| `errors` | Total errors encountered since startup |
Histogram Buckets¶
The `cycle_duration_seconds` histogram uses 10 buckets spanning typical cycle durations: 1s, 5s, 10s, 30s, 60s, 120s, 300s, 600s, 1800s, 3600s.
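The declarations below are a minimal sketch using `prometheus_client`; names, labels, and buckets follow the tables above, but the actual implementation lives in `src/bigbrotr/core/metrics.py` and may differ (for example, `prometheus_client` exposes Counter samples with a `_total` suffix).

```python
from prometheus_client import Counter, Gauge, Histogram, Info, start_http_server

# Four metric families, mirroring the tables above.
service_info = Info("service", "Static service metadata")  # exposed as service_info
service_gauge = Gauge(
    "service_gauge", "Point-in-time operational state", ["service", "name"]
)
service_counter = Counter(
    "service_counter", "Cumulative totals since startup", ["service", "name"]
)
cycle_duration_seconds = Histogram(
    "cycle_duration_seconds",
    "Cycle execution latency distribution",
    ["service"],
    buckets=(1, 5, 10, 30, 60, 120, 300, 600, 1800, 3600),
)

start_http_server(8000)  # serve /metrics on :8000
service_info.info({"service": "finder", "version": "1.0.0"})  # illustrative values
service_gauge.labels(service="finder", name="progress").set(0.42)
service_counter.labels(service="finder", name="cycles_success").inc()
with cycle_duration_seconds.labels(service="finder").time():
    pass  # one cycle's work would run here
```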
MetricsConfig¶
Metrics are configured per service in the service YAML file:
```yaml
metrics:
  enabled: true      # Enable /metrics endpoint (default: false)
  port: 8000         # HTTP port (default: 8000)
  host: "0.0.0.0"    # Bind address (use 0.0.0.0 in containers)
  path: "/metrics"   # URL path (default: /metrics)
```
Warning
Metrics are disabled by default. Set `metrics.enabled: true` in the service configuration to expose the `/metrics` endpoint.
API Reference
See `bigbrotr.core.metrics` for the complete `Metrics` API.
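Once enabled, the endpoint can be spot-checked directly; the output below is illustrative of the Prometheus text exposition format:

```console
$ curl -s http://localhost:8000/metrics | grep consecutive_failures
service_gauge{service="finder",name="consecutive_failures"} 0.0
```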
Health Checks¶
Each service container uses the `/metrics` endpoint as its Docker health check:

```yaml
healthcheck:
  test: ["CMD", "curl", "-sf", "http://localhost:8000/metrics"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 30s
```
Note
This is a real health check, not a PID-based fake check. If the metrics endpoint is unreachable, Docker marks the container as unhealthy.
Prometheus Configuration¶
Scrape Configuration¶
Prometheus scrapes all service endpoints every 30 seconds. The configuration is in `deployments/*/monitoring/prometheus/prometheus.yaml`:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "finder"
    static_configs:
      - targets: ["finder:8001"]
    scrape_interval: 30s

  - job_name: "validator"
    static_configs:
      - targets: ["validator:8002"]
    scrape_interval: 30s

  - job_name: "monitor"
    static_configs:
      - targets: ["monitor:8003"]
    scrape_interval: 30s

  - job_name: "synchronizer"
    static_configs:
      - targets: ["synchronizer:8004"]
    scrape_interval: 30s

  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```
Data Retention¶
Prometheus is configured with 30-day data retention. Data is persisted to a named Docker volume (`prometheus-data`).
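The retention window typically comes from Prometheus's standard `--storage.tsdb.retention.time` flag; a compose-style sketch (the service block and image tag are illustrative):

```yaml
prometheus:
  image: prom/prometheus
  command:
    - "--config.file=/etc/prometheus/prometheus.yaml"
    - "--storage.tsdb.retention.time=30d"   # 30-day retention
  volumes:
    - prometheus-data:/prometheus           # named volume for TSDB data
```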
Alerting Rules¶
Seven alerting rules are defined in `deployments/bigbrotr/monitoring/prometheus/rules/alerts.yml`:
ServiceDown¶
Fires when a service's `/metrics` endpoint is unreachable for more than 5 minutes. This indicates the service process has crashed or the container is unhealthy.
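A sketch of the rule, consistent with the condition in the summary table below (the job selector is inferred from the scrape configuration; the shipped rule in `alerts.yml` may differ):

```yaml
alert: ServiceDown
expr: up{job=~"finder|validator|monitor|synchronizer"} == 0
for: 5m
labels:
  severity: critical
```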
HighFailureRate¶
```yaml
alert: HighFailureRate
expr: rate(bigbrotr_errors_total[5m]) > 0.1
for: 5m
labels:
  severity: warning
```
Fires when the error rate exceeds 0.1 errors/second over a 5-minute window. This may indicate connectivity issues, database problems, or relay protocol changes.
PoolExhausted¶
```yaml
alert: PoolExhausted
expr: bigbrotr_pool_available_connections == 0 and bigbrotr_pool_max_connections > 0
for: 2m
labels:
  severity: critical
```
Fires when the database connection pool has zero available connections for more than 2 minutes. Queries will queue or time out. Resolution: increase `pool.limits.max_size` in `brotr.yaml`.
ConsecutiveFailures¶
```yaml
alert: ConsecutiveFailures
expr: service_gauge{name="consecutive_failures"} >= 5
for: 1m
labels:
  severity: warning
```
Fires when a service has accumulated 5 or more consecutive failures. This typically means the service is about to auto-stop (default `max_consecutive_failures: 5`).
SlowCycles¶
```yaml
alert: SlowCycles
expr: histogram_quantile(0.99, rate(cycle_duration_seconds_bucket[5m])) > 300
for: 5m
labels:
  severity: warning
```
Fires when the p99 cycle duration exceeds 5 minutes. This may indicate network congestion, database contention, or relay connectivity issues.
DatabaseConnectionsHigh¶
Fires when the total PostgreSQL connection count exceeds 80. Check pool sizing and the PgBouncer configuration.
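A sketch of such a rule, assuming the connection count comes from postgres_exporter's `pg_stat_activity_count` gauge (the actual expression in `alerts.yml` may differ):

```yaml
alert: DatabaseConnectionsHigh
expr: sum(pg_stat_activity_count) > 80   # total backends across all states
for: 5m
labels:
  severity: warning
```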
CacheHitRatioLow¶
```yaml
alert: CacheHitRatioLow
expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.95
for: 10m
labels:
  severity: warning
```
Fires when the PostgreSQL buffer cache hit ratio drops below 95%. This may indicate insufficient `shared_buffers` or dataset growth exceeding available memory.
Alert Summary¶
| Alert | Condition | Duration | Severity |
|---|---|---|---|
| ServiceDown | `up == 0` | 5 min | critical |
| HighFailureRate | Error rate > 0.1/s | 5 min | warning |
| PoolExhausted | Available pool connections == 0 | 2 min | critical |
| ConsecutiveFailures | Consecutive failures >= 5 | 1 min | warning |
| SlowCycles | p99 cycle duration > 300s | 5 min | warning |
| DatabaseConnectionsHigh | PG connections > 80 | 5 min | warning |
| CacheHitRatioLow | Buffer cache hit < 95% | 10 min | warning |
Grafana Dashboards¶
Grafana is auto-provisioned with a Prometheus datasource and a BigBrotr dashboard. The dashboard provides per-service panels organized in rows.
Dashboard Panels¶
Each service (Finder, Validator, Monitor, Synchronizer) has a row with:
| Panel | Visualization | Query |
|---|---|---|
| Last Cycle | Stat (single value) | `time() - service_gauge{service="<name>", name="last_cycle_timestamp"}` |
| Cycle Duration | Histogram | `cycle_duration_seconds{service="<name>"}` |
| Error Count (24h) | Stat | `increase(service_counter{service="<name>", name="errors"}[24h])` |
| Consecutive Failures | Stat | `service_gauge{service="<name>", name="consecutive_failures"}` |
Thresholds¶
The "Last Cycle" panel uses color thresholds to indicate staleness:
| Color | Condition | Meaning |
|---|---|---|
| Green | < 30 minutes | Normal operation |
| Yellow | 30-60 minutes | Potentially delayed |
| Red | > 60 minutes | Service may be stalled |
Provisioning¶
Grafana provisioning files are located in `deployments/*/monitoring/grafana/provisioning/`:

```text
grafana/provisioning/
+-- datasources/
|   +-- prometheus.yaml    # Prometheus datasource (auto-configured)
+-- dashboards/
    +-- dashboards.yaml    # Dashboard provider configuration
    +-- bigbrotr.json      # BigBrotr dashboard definition
```
Note
In the BigBrotr deployment, dashboards are configured as non-editable to prevent drift from the provisioned state.
Port Mappings¶
| Deployment | Grafana Port | Prometheus Port |
|---|---|---|
| BigBrotr | 3000 | 9090 |
| LilBrotr | 3001 | 9091 |
Structured Logging¶
BigBrotr uses the custom `Logger` class (`src/bigbrotr/core/logger.py`) for structured logging with key=value pairs.
Text Mode (Default)¶
```text
2026-02-09 12:00:00 INFO synchronizer: sync_completed events=1500 duration=45.2
2026-02-09 12:00:01 WARNING monitor: check_failed relay=wss://example.com error="connection timeout"
2026-02-09 12:00:02 ERROR validator: cycle_failed consecutive_failures=3 error="pool exhausted"
```

Format: `{timestamp} {level} {service}: {message} {key=value pairs}`
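A call site that produces such a line might look like the following; the constructor argument and method signature are assumptions for illustration, not the documented API:

```python
from bigbrotr.core.logger import Logger  # module path per this page

logger = Logger("synchronizer")  # constructor signature assumed
logger.info("sync_completed", events=1500, duration=45.2)  # kwargs become key=value pairs
```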
JSON Mode¶
For cloud log aggregation (ELK, Loki, CloudWatch), enable JSON output through the logger configuration:
```json
{
  "timestamp": "2026-02-09T12:34:56+00:00",
  "level": "info",
  "service": "synchronizer",
  "message": "sync_completed",
  "events": 1500,
  "duration": 45.2
}
```
Log Levels¶
| Level | Use Case |
|---|---|
| `DEBUG` | Detailed operational tracing (per-relay, per-event) |
| `INFO` | Normal operational events (cycle start/end, batch progress) |
| `WARNING` | Recoverable issues (relay timeouts, retry attempts) |
| `ERROR` | Failures requiring attention (cycle failures, database errors) |
Set the log level via CLI:
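For example (both the entrypoint and the `--log-level` flag spelling are illustrative, not confirmed CLI options):

```console
$ python -m bigbrotr.synchronizer --log-level DEBUG
```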
Docker Log Management¶
Docker Compose is configured with JSON file logging and size limits:
| Service | Max Log Size |
|---|---|
| Seeder | 10 MB |
| Finder | 50 MB |
| Validator | 50 MB |
| Monitor | 50 MB |
| Synchronizer | 100 MB |
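In compose terms this corresponds to the `json-file` logging driver with a `max-size` option; an illustrative stanza using the synchronizer's limit from the table above:

```yaml
synchronizer:
  logging:
    driver: json-file
    options:
      max-size: "100m"   # other services use 10-50 MB per the table
```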
Useful PromQL Queries¶
Service Health¶
```promql
# Are all services up?
up{job=~"finder|validator|monitor|synchronizer"}

# Time since last successful cycle per service
time() - service_gauge{name="last_cycle_timestamp"}

# Consecutive failures
service_gauge{name="consecutive_failures"}
```
Performance¶
```promql
# Average cycle duration (5m window)
rate(cycle_duration_seconds_sum[5m]) / rate(cycle_duration_seconds_count[5m])

# p99 cycle duration
histogram_quantile(0.99, rate(cycle_duration_seconds_bucket[5m]))

# Error rate per service
rate(service_counter{name="errors"}[5m])
```
Throughput¶
```promql
# Successful cycles per hour
increase(service_counter{name="cycles_success"}[1h])

# Total errors in past 24h
increase(service_counter{name="errors"}[24h])
```
Related Documentation¶
- Architecture -- System architecture and module reference
- Services -- Deep dive into the six independent services
- Configuration -- YAML configuration reference (MetricsConfig details)
- Database -- PostgreSQL schema and stored functions