Health & Monitoring

Monitor Repod's operational status, integrate with your alerting stack, and interpret the health endpoints.


Health endpoints

Three endpoints are available for different monitoring scenarios:

Endpoint            Auth          Use case
GET /health/live    Public        Container liveness probe: is the process running?
GET /health/ready   Public        Readiness probe: are all dependencies ready?
GET /health         Bearer token  Full status report with component-level detail

Liveness probe

curl http://localhost:8000/health/live
{"status": "ok"}

Returns 200 as long as the uvicorn process is alive. Use this for Docker HEALTHCHECK and Kubernetes livenessProbe.

Readiness probe

curl http://localhost:8000/health/ready
{"status": "ok", "ready": true}

Returns 200 when the database and critical services are ready. Returns 503 during startup or when a critical dependency is unavailable. Use this for Kubernetes readinessProbe and load-balancer health checks.
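
For deploy scripts that must wait for the backend before running further steps, the readiness semantics above (200 when ready, 503 otherwise) can be polled from Python. This is a minimal sketch, not part of Repod: the function names, the attempt count, and the delay are illustrative assumptions, and only the URL and status codes come from this page.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 while a dependency is still starting
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, or timeout


def wait_until_ready(
    url: str = "http://localhost:8000/health/ready",
    attempts: int = 30,          # assumed default, tune per deployment
    delay: float = 2.0,          # assumed pause between polls, in seconds
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the readiness endpoint until it returns 200 or attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_ready() with the defaults polls for roughly one minute. The fetch and sleep parameters exist only so the loop can be exercised without a running backend.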

Full health report

TOKEN=$(curl -s -X POST http://localhost:8000/auth/token \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"Admin1234!"}' | jq -r .access_token)

curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8000/health | jq .
{
  "status": "ok",
  "version": "v1.2.0",
  "uptime_seconds": 86432,
  "components": {
    "database": {"ok": true},
    "clamav": {"ok": true, "version": "ClamAV 1.4.3/27509"},
    "grype": {"ok": true, "db_age_hours": 4.2},
    "gpg": {"ok": true, "key_count": 1}
  }
}
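
A monitoring script usually only needs to know which components are failing. The sketch below extracts them from a report shaped like the sample above; the function name is hypothetical, and treating a component without an "ok" field as failing is an assumption, not documented behaviour.

```python
import json


def failing_components(raw_report: str) -> list[str]:
    """Return the sorted names of components whose "ok" flag is not true."""
    report = json.loads(raw_report)
    components = report.get("components", {})
    return sorted(
        name
        for name, info in components.items()
        # Assumption: a missing or non-boolean "ok" counts as a failure.
        if not (isinstance(info, dict) and info.get("ok") is True)
    )
```

Feeding it the body returned by GET /health yields an empty list when every component reports ok.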

Docker health check configuration

The docker-compose.yaml includes a built-in health check for the backend:

docker-compose.yaml (excerpt)
  backend-api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health/live"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s   # allow ClamAV signature loading time

Check health status:

docker compose ps
# Look for (healthy) or (unhealthy) next to backend-api

Prometheus metrics

Repod exposes Prometheus metrics on the backend via the prometheus-client library. Metrics are available at:

GET http://localhost:8000/metrics

Key metrics:

Metric                         Type       Description
repod_packages_total           Gauge      Total packages in the pool
repod_uploads_total            Counter    Upload attempts (labels: status)
repod_cve_findings_total       Counter    CVE findings (labels: severity)
repod_pending_review_total     Gauge      Packages awaiting CISO decision
repod_quarantine_total         Gauge      Packages in quarantine
repod_clamav_last_update       Gauge      Unix timestamp of last ClamAV DB update
repod_grype_db_age_seconds     Gauge      Age of Grype vulnerability database
http_requests_total            Counter    HTTP requests (labels: method, path, status)
http_request_duration_seconds  Histogram  Request latency

Prometheus scrape configuration

prometheus.yml
scrape_configs:
  - job_name: repod
    static_configs:
      - targets: ["repod-backend:8000"]
    metrics_path: /metrics
    scrape_interval: 30s
    # If you protect /metrics in production, add an authorization block
    # (credentials: <token>); bearer_token is the older, deprecated form

Grafana dashboard

Import the community dashboard ID XXXXX (search "Repod" on grafana.com) or build your own using the metrics above.

Recommended panels:

  • Upload success rate (last 1h)
  • CVE findings by severity (last 7d)
  • Pending review queue depth (time series)
  • ClamAV DB age (alert if > 48 h)
  • Grype DB age (alert if > 24 h)
  • HTTP request latency (p50 / p95 / p99)

Log monitoring

View logs in real time

# All containers
docker compose logs -f

# Backend only (last 200 lines)
docker compose logs backend-api --tail=200

# Filter for errors
docker compose logs backend-api 2>&1 | grep -iE "error|exception|traceback"

# Filter for security events
docker compose logs backend-api 2>&1 | grep -E "UPLOAD|SECURITY|CVE"

Structured audit log

The audit log at /repos/audit/YYYY-MM-DD.jsonl is the authoritative record of all security-relevant events. Query it with standard UNIX tools:

# Today's audit log, last 50 events, pretty-printed
# (jq handles one JSON object per line; plain json.tool rejects multi-line JSONL)
docker exec backend-api tail -n 50 \
  /repos/audit/$(date +%Y-%m-%d).jsonl | jq .

# All failed logins today
docker exec backend-api grep '"action":"LOGIN"' \
  /repos/audit/$(date +%Y-%m-%d).jsonl | grep '"result":"FAILURE"'

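# All CVE decisions in the last 7 days
# Runs on the host, so it uses the host path of the repos volume
# (assumed to be /opt/repod/repos, as in the SIEM examples). Requires GNU date.
for d in $(seq 0 6); do
  date --date="-$d days" +%Y-%m-%d
done | while read -r day; do
  f="/opt/repod/repos/audit/$day.jsonl"
  [ -f "$f" ] && grep '"action":"SECURITY_DECISION"' "$f"
done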
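
Where grep on raw text is too brittle (for example, if the JSON is written with spaces after the colons), the same seven-day query can be done by parsing each line. This is a sketch under assumptions: the helper names are hypothetical, and the "action" field and the YYYY-MM-DD.jsonl naming are taken from the examples on this page.

```python
import json
from datetime import date, timedelta
from pathlib import Path
from typing import Iterable, Iterator, Optional


def audit_paths(audit_dir: str, days: int = 7,
                today: Optional[date] = None) -> list[Path]:
    """Paths of the daily audit files for the last `days` days, newest first."""
    today = today or date.today()
    return [
        Path(audit_dir) / f"{(today - timedelta(days=n)).isoformat()}.jsonl"
        for n in range(days)
    ]


def filter_events(lines: Iterable[str], action: str) -> Iterator[dict]:
    """Yield parsed events whose "action" field equals `action`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or corrupt lines
        if isinstance(event, dict) and event.get("action") == action:
            yield event
```

To use it, loop over audit_paths("/opt/repod/repos/audit"), skip paths that do not exist, and pass each open file to filter_events(fh, "SECURITY_DECISION").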

Ship logs to a SIEM

The JSONL format is directly ingestible by most SIEM platforms:

logstash.conf
input {
  file {
    path  => "/opt/repod/repos/audit/*.jsonl"
    codec => "json"
    start_position => "beginning"
  }
}

filter {
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "repod-audit-%{+YYYY.MM.dd}"
  }
}

For Splunk, configure a Universal Forwarder monitor input:

inputs.conf
[monitor:///opt/repod/repos/audit/*.jsonl]
disabled  = false
sourcetype = repod_audit
index     = security

For Grafana Loki, add a Promtail scrape job:

promtail-config.yml
scrape_configs:
  - job_name: repod-audit
    static_configs:
      - targets: [localhost]
        labels:
          job: repod
          __path__: /opt/repod/repos/audit/*.jsonl
    pipeline_stages:
      - json:
          expressions:
            action: action
            result: result
            user: user
      - labels:
          action:
          result:

Alerting recommendations

Critical alerts (page immediately)

Condition                     Query / check
Backend container unhealthy   docker inspect backend-api --format='{{.State.Health.Status}}' returns anything other than healthy
ClamAV unreachable            components.clamav.ok == false in /health
CVE database stale > 48 h     repod_grype_db_age_seconds > 172800
Pending review queue growing  repod_pending_review_total > 10 (tune per org)
Login failures spike          > 20 FAILURE audit events in 5 min
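
The login-failure condition has no metric behind it, so it has to be evaluated against the audit log. The sketch below shows one way to do that with a sliding window. The function name is hypothetical, the field names (action, result, timestamp) follow the SIEM examples on this page, and it assumes all timestamps are ISO 8601 in a single consistent timezone form.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Iterable


def has_failure_spike(
    events: Iterable[dict],
    threshold: int = 20,
    window: timedelta = timedelta(minutes=5),
) -> bool:
    """True if more than `threshold` failed logins fall within any `window`."""
    times = sorted(
        # Accept a trailing "Z", which older Python versions do not parse.
        datetime.fromisoformat(str(e["timestamp"]).replace("Z", "+00:00"))
        for e in events
        if e.get("action") == "LOGIN"
        and e.get("result") == "FAILURE"
        and "timestamp" in e
    )
    recent = deque()
    for ts in times:
        recent.append(ts)
        # Drop failures that are older than the window relative to `ts`.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) > threshold:
            return True
    return False
```

The events can come from any source that yields parsed audit lines, for example a loop over today's JSONL file with json.loads.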

Warning alerts (notify team)

Condition                      Query / check
ClamAV DB not updated in 24 h  time() - repod_clamav_last_update > 86400
Disk space low                 df -h /opt/repod/repos shows < 10 GB free
Upload error rate > 5%         sum(rate(repod_uploads_total{status!="published"}[1h])) / sum(rate(repod_uploads_total[1h])) > 0.05
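
The arithmetic behind two of these warnings is easy to get wrong, so here it is spelled out in Python. This is an illustration only: the function names are hypothetical, the counts stand for increases of repod_uploads_total per status label over the evaluation period, and any status value other than "published" and "quarantined" in the example is made up.

```python
import time
from typing import Mapping, Optional


def upload_error_ratio(counts: Mapping[str, float]) -> float:
    """Share of uploads whose status label is not "published"."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no uploads in the period, so nothing to alert on
    return (total - counts.get("published", 0.0)) / total


def clamav_db_is_stale(
    last_update: float,
    max_age_s: float = 86400,       # 24 h, the warning threshold
    now: Optional[float] = None,
) -> bool:
    """Compare the repod_clamav_last_update Unix timestamp against `now`."""
    now = time.time() if now is None else now
    return now - last_update > max_age_s
```

For example, 95 published uploads out of 100 give a ratio of 0.05, which sits exactly on the 5% threshold and does not exceed it.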

Security event monitoring

Monitor these audit event types for security operations:

Event                             Security signal
LOGIN with result=FAILURE         Brute-force or credential stuffing attempt
USER_CREATE                       New account provisioned; verify intent
USER_ROLE_CHANGE                  Privilege escalation; verify intent
SECURITY_DECISION with approve    CVE exception approved; review justification
UPLOAD with status=quarantined    Malware or critical CVE detected
SETTINGS_CHANGE                   Configuration change; verify intent

Checking component status manually

# ClamAV — version and signature date
docker exec backend-api clamdscan --version

# Grype — database age
docker exec backend-api grype db status

# GPG — list keys in keyring
docker exec backend-api gpg --homedir /repos/gnupg --list-keys

# SQLite — check database integrity
docker exec backend-api python3 -c "
import sqlite3
for db in ['/repos/auth/users.db', '/repos/package-index/packages.db']:
    conn = sqlite3.connect(db)
    result = conn.execute('PRAGMA integrity_check').fetchone()
    print(f'{db}: {result[0]}')
"

Resource utilization

Monitor Docker resource usage with:

docker stats backend-api frontend-ui depot-apt

Default resource limits in docker-compose.yaml:

Container              CPU limit  Memory limit
backend-api            1.5 CPUs   2.5 GB
frontend-ui            0.5 CPUs   256 MB
depot-apt / depot-rpm  0.5 CPUs   128 MB

Increase limits if ClamAV or Grype scans saturate the CPU:

docker-compose.yaml
  backend-api:
    deploy:
      resources:
        limits:
          cpus: "3.0"
          memory: 4g

Disk space management

# Total space used by repos volume
du -sh /opt/repod/repos/

# Breakdown by subdirectory
du -sh /opt/repod/repos/* | sort -rh

# Largest packages in the pool
du -sh /opt/repod/repos/pool/* 2>/dev/null | sort -rh | head -20

# Quarantine contents (review before deleting)
ls -lh /opt/repod/repos/staging/quarantine/

# ClamAV database size
du -sh /opt/repod/repos/clamav-db/

# Grype database size
du -sh /opt/repod/repos/grype-db/

Configure automatic retention cleanup in Settings → Retention in the web UI (default: 90 days for audit logs, no automatic package cleanup).
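
The "Disk space low" warning from the alerting section can also be checked from a script rather than by reading df output. A minimal sketch, assuming the repos volume lives at /opt/repod/repos on the host and reading the 10 GB threshold as 10 GiB; the function name is hypothetical.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB; the page says "GB", so this unit is an assumption


def free_space_low(path: str = "/opt/repod/repos",
                   min_free_gib: float = 10.0) -> bool:
    """True if the filesystem holding `path` has less free space than the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free < min_free_gib * GIB
```

Run it from cron or a monitoring agent and raise the warning when it returns True.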