Monitoring Setup

Configure Prometheus metrics collection, Grafana dashboards, and alerting for BigBrotr services.

Overview

All continuous BigBrotr services expose a /metrics endpoint in Prometheus exposition format. Seeder is a one-shot service and does not expose a metrics endpoint. The Docker Compose stack includes Prometheus and Grafana pre-configured, but you can also connect to an external monitoring stack.

Metrics Exposed

Metric	Type	Description
`service_info`	Info	Static service metadata (name, version)
`service_gauge`	Gauge	Point-in-time state (consecutive_failures, last_cycle_timestamp, progress)
`service_counter_total`	Counter	Cumulative totals (cycles_success, cycles_failed, errors by type)
`cycle_duration_seconds`	Histogram	Cycle latency with 10 buckets (1s to 1h)
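
To make the table concrete, here is a minimal Python sketch that parses Prometheus exposition text and returns one tuple per sample. The sample payload is an assumption made for illustration: the label names and values are modeled on the queries used later in this guide and were not captured from a running BigBrotr service.

```python
import re

# Hypothetical payload: labels and values are illustrative assumptions,
# not output copied from a real BigBrotr service.
SAMPLE = '''\
# HELP service_gauge Point-in-time state
# TYPE service_gauge gauge
service_gauge{service="finder",name="consecutive_failures"} 0
service_gauge{service="finder",name="progress"} 0.5
# TYPE service_counter_total counter
service_counter_total{service="finder",name="cycles_success"} 42
'''

# metric name, optional {labels}, then the sample value
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no samples
        match = _LINE.match(line)
        if match is None:
            continue  # ignore anything this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for metric, labels, value in parse_metrics(SAMPLE):
        print(metric, labels, value)
```

For anything beyond a quick check, a maintained parser such as the one shipped with the official Prometheus Python client is likely a better choice than this regular-expression sketch.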

1. Start the Monitoring Stack

Using Docker Compose (included)

The default docker-compose.yaml starts Prometheus and Grafana automatically:

cd deployments/bigbrotr
docker compose up -d prometheus grafana

Endpoints:

Service	URL
Prometheus	`http://localhost:9090`
Grafana	`http://localhost:3000`

Note

The default Grafana username is admin. The password is the value of GRAFANA_PASSWORD in .env.

Using an external Prometheus

If you already run Prometheus, add a scrape target for each service. Inside the Docker network, every service listens on container port 8000. On the host, each service is mapped to its own port (8001-8007):

scrape_configs:
  - job_name: finder
    static_configs:
      - targets: ["finder:8000"]       # or localhost:8001 from host
  - job_name: validator
    static_configs:
      - targets: ["validator:8000"]    # or localhost:8002 from host
  - job_name: monitor
    static_configs:
      - targets: ["monitor:8000"]      # or localhost:8003 from host
  - job_name: synchronizer
    static_configs:
      - targets: ["synchronizer:8000"] # or localhost:8004 from host
  - job_name: refresher
    static_configs:
      - targets: ["refresher:8000"]    # or localhost:8005 from host
  - job_name: api
    static_configs:
      - targets: ["api:8000"]          # or localhost:8006 from host
  - job_name: dvm
    static_configs:
      - targets: ["dvm:8000"]          # or localhost:8007 from host

2. Enable Service Metrics

Each service must have metrics enabled in its YAML config. Set metrics.enabled: true:

# config/services/finder.yaml
metrics:
  enabled: true
  port: 8000
  host: "0.0.0.0"
  path: "/metrics"

All services listen on container port 8000 by default. Docker Compose maps each to a unique host port:

Service	Container Port	Host Port (default)	Override Variable
Finder	8000	8001	`FINDER_METRICS_PORT`
Validator	8000	8002	`VALIDATOR_METRICS_PORT`
Monitor	8000	8003	`MONITOR_METRICS_PORT`
Synchronizer	8000	8004	`SYNCHRONIZER_METRICS_PORT`
Refresher	8000	8005	`REFRESHER_METRICS_PORT`
Api	8000	8006	`API_METRICS_PORT`
Dvm	8000	8007	`DVM_METRICS_PORT`
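
The mapping above can be turned into a list of host-side metrics URLs. The Python sketch below uses the default ports and override variable names from the table. It assumes the override variables are visible in the environment passed to the function, and that every service keeps the default /metrics path. The function name is hypothetical.

```python
import os

# Default host ports, copied from the table above.
DEFAULT_HOST_PORTS = {
    "finder": 8001,
    "validator": 8002,
    "monitor": 8003,
    "synchronizer": 8004,
    "refresher": 8005,
    "api": 8006,
    "dvm": 8007,
}


def host_metrics_urls(env=None):
    """Map each service to its host-side metrics URL.

    `env` is any mapping of variable names to values; it defaults to the
    process environment. A variable such as FINDER_METRICS_PORT overrides
    the default host port for that service.
    """
    env = os.environ if env is None else env
    urls = {}
    for service, default_port in DEFAULT_HOST_PORTS.items():
        variable = f"{service.upper()}_METRICS_PORT"
        port = int(env.get(variable, default_port))
        urls[service] = f"http://localhost:{port}/metrics"
    return urls


if __name__ == "__main__":
    for service, url in host_metrics_urls().items():
        print(f"{service:13s} {url}")
```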

3. Configure Prometheus Targets

The included Prometheus configuration is at monitoring/prometheus/prometheus.yaml. It scrapes every service endpoint every 30 seconds, and the bundled Prometheus retains data for 30 days.

To verify targets are being scraped:

Open http://localhost:9090/targets.
Every endpoint should show the state UP.
If a target shows DOWN, check that the service is running and that the scrape target uses the correct port.
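
The same check can be scripted against the Prometheus HTTP API, which exposes target state at /api/v1/targets. The sketch below assumes Prometheus is reachable at http://localhost:9090, as in the default stack. The helper names are hypothetical.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from the Prometheus HTTP API.

    Requires a running Prometheus; the base URL is an assumption that
    matches the default Docker Compose stack.
    """
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for each active target not 'up'."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            problems.append((
                target.get("labels", {}).get("job", "?"),
                target.get("scrapeUrl", "?"),
                target.get("lastError", ""),
            ))
    return problems


if __name__ == "__main__":
    for job, url, error in unhealthy_targets(fetch_targets()):
        print(f"DOWN {job} {url}: {error}")
```

An empty result means every active target reported as up at its last scrape.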

4. Import Grafana Dashboards

The BigBrotr deployment auto-provisions Grafana with:

A Prometheus datasource pointing to http://prometheus:9090
A dashboard directory at monitoring/grafana/dashboards/

To add a custom dashboard:

Open Grafana at http://localhost:3000
Navigate to Dashboards > New > Import
Paste the JSON or upload a file
Select the Prometheus datasource

Tip

The auto-provisioned dashboard includes per-service panels for cycle time, cycle duration, error counts (24h), and consecutive failures. The Validator has additional candidate progress panels.

5. Set Up Alerting Rules

BigBrotr includes seven alerting rules in monitoring/prometheus/rules/alerts.yml:

Alert	Expression	Duration	Severity
ServiceDown	`up == 0`	5 minutes	critical
HighFailureRate	`sum by (service) (rate(service_counter_total{name=~"errors_.*"}[5m])) > 0.1`	5 minutes	warning
ConsecutiveFailures	`service_gauge{name="consecutive_failures"} >= 5`	2 minutes	critical
SlowCycles	`histogram_quantile(0.99, rate(cycle_duration_seconds_bucket[5m])) > 300`	5 minutes	warning
DatabaseConnectionsHigh	`sum(pg_stat_activity_count{datname="bigbrotr"}) > 80`	5 minutes	warning
CacheHitRatioLow	`pg_stat_database_blks_hit{datname="bigbrotr"} / (...) < 0.95`	10 minutes	warning
RefresherViewsFailing	`service_gauge{service="refresher", name="views_failed"} > 0`	10 minutes	warning
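
To see what the SlowCycles expression computes, the following Python sketch reproduces the linear interpolation that histogram_quantile applies to classic cumulative buckets. It is a simplified illustration, not the Prometheus implementation: it works on raw cumulative counts rather than per-second rates (the interpolation is the same), and the bucket boundaries in the example are assumptions, since this guide only states that there are 10 buckets between 1s and 1h.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with +Inf as the last bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, because nothing more precise is known.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical bucket boundaries and counts, for illustration only.
    example = [(1, 0), (10, 50), (60, 90), (300, 99), (math.inf, 100)]
    print(histogram_quantile(0.5, example))
```

Because the estimate is interpolated inside a bucket, its precision is bounded by the bucket width, so a threshold such as 300 seconds is most meaningful when it coincides with a bucket boundary.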

Verify alerts are loaded

Open http://localhost:9090/alerts
All seven rules should appear under the bigbrotr group
A rule in the inactive state has no alerts currently firing
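
This verification can also be scripted against the Prometheus HTTP API, which lists loaded rules at /api/v1/rules. The sketch below assumes Prometheus is reachable at http://localhost:9090 and that the rules live in a group named bigbrotr, as described above. The helper names are hypothetical.

```python
import json
import urllib.request

# Rule names copied from the alerting table in this guide.
EXPECTED_RULES = {
    "ServiceDown",
    "HighFailureRate",
    "ConsecutiveFailures",
    "SlowCycles",
    "DatabaseConnectionsHigh",
    "CacheHitRatioLow",
    "RefresherViewsFailing",
}


def fetch_rules(base_url="http://localhost:9090"):
    """Fetch loaded rule groups (requires a running Prometheus)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules", timeout=5) as resp:
        return json.load(resp)


def alert_states(payload, group="bigbrotr"):
    """Map alert name to state ('inactive', 'pending' or 'firing')."""
    states = {}
    for rule_group in payload["data"]["groups"]:
        if rule_group.get("name") != group:
            continue
        for rule in rule_group.get("rules", []):
            if rule.get("type") == "alerting":  # skip recording rules
                states[rule["name"]] = rule.get("state", "unknown")
    return states


def missing_rules(payload, expected=EXPECTED_RULES):
    """Return the expected alert names that Prometheus has not loaded."""
    return sorted(expected - set(alert_states(payload)))


if __name__ == "__main__":
    rules = fetch_rules()
    print("states:", alert_states(rules))
    print("missing:", missing_rules(rules))
```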

Configure alert notifications

To receive alerts via email, Slack, or PagerDuty, configure an Alertmanager and add it to your Prometheus config:

# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

Warning

The default Docker Compose stack does not include Alertmanager. You need to add it as a separate service or use Grafana alerting as an alternative.

6. Create Custom Dashboards

Useful PromQL queries for custom panels:

# Successful cycles per hour (by service)
increase(service_counter_total{name="cycles_success"}[1h])

# Average cycle duration (last 5 minutes)
rate(cycle_duration_seconds_sum[5m])
  / rate(cycle_duration_seconds_count[5m])

# Current consecutive failures
service_gauge{name="consecutive_failures"}

# Error rate by type
rate(service_counter_total{name=~"errors_.*"}[5m])
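
The average cycle duration query divides the growth of the histogram's _sum series by the growth of its _count series over the same window. The Python sketch below works through that arithmetic with made-up numbers. It is a simplification that ignores counter resets and the extrapolation rate() performs at the window edges.

```python
def average_cycle_duration(sum_start, sum_end, count_start, count_end):
    """Mean seconds per cycle over a window, from histogram _sum and _count.

    Arguments are the values of cycle_duration_seconds_sum and
    cycle_duration_seconds_count at the start and end of the window.
    """
    cycles = count_end - count_start
    if cycles <= 0:
        # No cycle finished in the window; the PromQL division yields NaN.
        return None
    return (sum_end - sum_start) / cycles


if __name__ == "__main__":
    # Illustrative numbers: 4 cycles finished and together took 60 seconds.
    print(average_cycle_duration(100.0, 160.0, 10, 14))
```

With these numbers the result is 15 seconds per cycle, the value a Grafana panel would show for that window.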

Tip

Use Grafana variables to create a single dashboard with a service selector dropdown. Set a $service variable from the job label values.

See also:

Docker Compose Deployment: the monitoring stack is included
Manual Deployment: add monitoring to a non-Docker deployment
Troubleshooting: diagnose metrics and alerting issues