Health & Monitoring

Monitor Repod's operational status, integrate with your alerting stack, and interpret the health endpoints.

Health endpoints

Three endpoints are available for different monitoring scenarios:

Endpoint	Auth	Use case
`GET /health/live`	Public	Container liveness probe — is the process running?
`GET /health/ready`	Public	Readiness probe — are all dependencies ready?
`GET /health`	Bearer token	Full status report — component-level detail

Liveness probe

curl http://localhost:8000/health/live

{"status": "ok"}

Returns 200 as long as the uvicorn process is alive. Use this for Docker HEALTHCHECK and Kubernetes livenessProbe.

Readiness probe

curl http://localhost:8000/health/ready

{"status": "ok", "ready": true}

Returns 200 when the database and critical services are ready. Returns 503 during startup or when a critical dependency is unavailable. Use this for Kubernetes readinessProbe and load-balancer health checks.
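
The two probes can also be polled from a script, for example in a deployment smoke test. The sketch below relies only on the status codes described above (200 when healthy, 503 when not ready). The function names, the state labels, and the default base URL are illustrative and not part of Repod.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Error responses (such as 503 while not ready) are raised as HTTPError.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return 0


def describe(live_status: int, ready_status: int) -> str:
    """Map the two probe results to a coarse state label (labels are illustrative)."""
    if live_status != 200:
        return "down"
    if ready_status == 200:
        return "ready"
    return "starting-or-degraded"


def check(base_url: str = "http://localhost:8000") -> str:
    """Probe both public endpoints and return the combined state."""
    live = probe(f"{base_url}/health/live")
    ready = probe(f"{base_url}/health/ready")
    return describe(live, ready)
```

Calling `check()` against a running instance should return `"ready"` once startup has finished. A `"starting-or-degraded"` result during the first minutes after a restart is expected while dependencies load.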

Full health report

TOKEN=$(curl -s -X POST http://localhost:8000/auth/token \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"Admin1234!"}' | jq -r .access_token)

curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8000/health | jq .

{
  "status": "ok",
  "version": "v1.2.0",
  "uptime_seconds": 86432,
  "components": {
    "database": {"ok": true},
    "clamav": {"ok": true, "version": "ClamAV 1.4.3/27509"},
    "grype": {"ok": true, "db_age_hours": 4.2},
    "gpg": {"ok": true, "key_count": 1}
  }
}
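
A monitoring script usually needs only the list of failing components from this report. The sketch below works on the sample response shown above. It assumes every component object carries an `ok` flag, as in that sample. The helper names are illustrative.

```python
import json

# The sample report from this page, used here as offline test data.
SAMPLE_REPORT = json.loads("""
{
  "status": "ok",
  "version": "v1.2.0",
  "uptime_seconds": 86432,
  "components": {
    "database": {"ok": true},
    "clamav": {"ok": true, "version": "ClamAV 1.4.3/27509"},
    "grype": {"ok": true, "db_age_hours": 4.2},
    "gpg": {"ok": true, "key_count": 1}
  }
}
""")


def failing_components(report: dict) -> list[str]:
    """Return the sorted names of components whose "ok" flag is not true."""
    components = report.get("components", {})
    return sorted(name for name, info in components.items() if not info.get("ok", False))


def summarize(report: dict) -> str:
    """Build a one-line summary suitable for a log line or chat notification."""
    failed = failing_components(report)
    if failed:
        return "DEGRADED: " + ", ".join(failed)
    hours = report.get("uptime_seconds", 0) / 3600
    return f"OK: {report.get('version', 'unknown')} up {hours:.1f} h"


print(summarize(SAMPLE_REPORT))  # prints "OK: v1.2.0 up 24.0 h"
```

A component with a missing `ok` key is treated as failing, which errs on the side of raising an alert.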

Docker health check configuration

The docker-compose.yaml includes a built-in health check for the backend:

docker-compose.yaml (excerpt)

  backend-api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health/live"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s   # allow ClamAV signature loading time

Check health status:

docker compose ps
# Look for (healthy) or (unhealthy) next to backend-api

Prometheus metrics

Repod exposes Prometheus metrics on the backend via the prometheus-client library. Metrics are available at:

GET http://localhost:8000/metrics

Key metrics:

Metric	Type	Description
`repod_packages_total`	Gauge	Total packages in the pool
`repod_uploads_total`	Counter	Upload attempts (labels: `status`)
`repod_cve_findings_total`	Counter	CVE findings (labels: `severity`)
`repod_pending_review_total`	Gauge	Packages awaiting CISO decision
`repod_quarantine_total`	Gauge	Packages in quarantine
`repod_clamav_last_update`	Gauge	Unix timestamp of last ClamAV DB update
`repod_grype_db_age_seconds`	Gauge	Age of Grype vulnerability database
`http_requests_total`	Counter	HTTP requests (labels: `method`, `path`, `status`)
`http_request_duration_seconds`	Histogram	Request latency
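
For ad-hoc checks without a Prometheus server, the `/metrics` response can be read directly. The sketch below is a deliberately minimal parser for the Prometheus text format. It assumes one sample per line with no trailing timestamp. The sample values are invented for illustration; only the metric names come from the table above.

```python
# Invented sample of a /metrics response; real values will differ.
SAMPLE = """\
# HELP repod_packages_total Total packages in the pool
# TYPE repod_packages_total gauge
repod_packages_total 128.0
repod_uploads_total{status="published"} 40.0
repod_uploads_total{status="quarantined"} 2.0
repod_grype_db_age_seconds 15120.0
"""


def parse_metrics(text: str) -> dict[str, float]:
    """Parse exposition-format lines into {series: value}.

    The series key keeps its label set verbatim,
    e.g. 'repod_uploads_total{status="published"}'.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        try:
            metrics[series] = float(value)
        except ValueError:
            continue  # not a "series value" line; ignore it
    return metrics


metrics = parse_metrics(SAMPLE)
print(metrics["repod_grype_db_age_seconds"] / 3600, "hours")
```

For anything beyond a quick check, the parser shipped with the `prometheus-client` library is the more robust choice.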

Prometheus scrape configuration

prometheus.yml

scrape_configs:
  - job_name: repod
    static_configs:
      - targets: ["repod-backend:8000"]
    metrics_path: /metrics
    scrape_interval: 30s
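    # If you protect /metrics in production, add an authorization block
    # (type: Bearer, credentials: <token>); bearer_token is deprecated in current Prometheus releases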

Grafana dashboard

Import the community dashboard ID XXXXX (search "Repod" on grafana.com) or build your own using the metrics above.

Recommended panels:

Upload success rate (last 1h)
CVE findings by severity (last 7d)
Pending review queue depth (time series)
ClamAV DB age (alert if > 48 h)
Grype DB age (alert if > 24 h)
HTTP request latency (p50 / p95 / p99)
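
The two database-age panels use different inputs: `repod_clamav_last_update` is a Unix timestamp, while `repod_grype_db_age_seconds` is already an age. The sketch below shows the arithmetic behind both thresholds from the list above. The function name and return shape are illustrative.

```python
import time
from typing import Optional

CLAMAV_MAX_AGE_H = 48  # panel threshold for the ClamAV DB
GRYPE_MAX_AGE_H = 24   # panel threshold for the Grype DB


def db_freshness(
    clamav_last_update: float,
    grype_db_age_seconds: float,
    now: Optional[float] = None,
) -> dict[str, tuple[float, bool]]:
    """Return {database: (age_in_hours, is_stale)} for both scanners."""
    now = time.time() if now is None else now
    # ClamAV: the metric is a timestamp, so the age is "now minus timestamp".
    clamav_age_h = (now - clamav_last_update) / 3600
    # Grype: the metric is already an age in seconds.
    grype_age_h = grype_db_age_seconds / 3600
    return {
        "clamav": (clamav_age_h, clamav_age_h > CLAMAV_MAX_AGE_H),
        "grype": (grype_age_h, grype_age_h > GRYPE_MAX_AGE_H),
    }
```

In Grafana the same ClamAV calculation is typically written as `time() - repod_clamav_last_update`.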

Log monitoring

View logs in real time

# All containers
docker compose logs -f

# Backend only (last 200 lines)
docker compose logs backend-api --tail=200

# Filter for errors
docker compose logs backend-api 2>&1 | grep -iE "error|exception|traceback"

# Filter for security events
docker compose logs backend-api 2>&1 | grep -E "UPLOAD|SECURITY|CVE"

Structured audit log

The audit log at /repos/audit/YYYY-MM-DD.jsonl is the authoritative record of all security-relevant events. Query it with standard UNIX tools:

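# Last 50 entries of today's audit log, pretty-printed
# (jq handles one JSON object per line; plain "python3 -m json.tool" rejects multi-line JSONL input)
docker exec backend-api tail -n 50 \
  /repos/audit/$(date +%Y-%m-%d).jsonl | jq .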

# All failed logins today
docker exec backend-api grep '"action":"LOGIN"' \
  /repos/audit/$(date +%Y-%m-%d).jsonl | grep '"result":"FAILURE"'

# All CVE decisions in the last 7 days
# (runs on the host against the mounted volume; requires GNU date)
for d in $(seq 0 6); do
  date --date="-$d days" +%Y-%m-%d
done | while read -r day; do
  f="/opt/repod/repos/audit/$day.jsonl"
  [ -f "$f" ] && grep '"action":"SECURITY_DECISION"' "$f"
done
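
The `grep` patterns above match only when the JSON is serialized without spaces after the colons. Where that cannot be guaranteed, parsing each line is more robust. The sketch below filters failed logins using the `action`, `result`, and `user` fields that appear elsewhere on this page. The sample records and function names are invented for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSONL line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a partially written last line
        if isinstance(event, dict):
            yield event


def failed_logins(lines: Iterable[str]) -> list[dict]:
    """Return all LOGIN events whose result is FAILURE."""
    return [
        e for e in iter_events(lines)
        if e.get("action") == "LOGIN" and e.get("result") == "FAILURE"
    ]


# Invented sample records; note the differing whitespace.
sample = [
    '{"action":"LOGIN","result":"FAILURE","user":"alice"}',
    '{"action": "LOGIN", "result": "SUCCESS", "user": "bob"}',
    '{"action": "LOGIN", "result": "FAILURE", "user": "carol"}',
]
print([e["user"] for e in failed_logins(sample)])  # prints ['alice', 'carol']
```

To run it against a real file, pass an open file handle, for example `failed_logins(open("/opt/repod/repos/audit/2025-01-01.jsonl"))` with the date adjusted.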

Ship logs to a SIEM

The JSONL format is directly ingestible by most SIEM platforms:

Elastic / Logstash

logstash.conf

input {
  file {
    path  => "/opt/repod/repos/audit/*.jsonl"
    codec => "json"
    start_position => "beginning"
  }
}

filter {
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "repod-audit-%{+YYYY.MM.dd}"
  }
}

Splunk

Configure a Universal Forwarder monitor input:

inputs.conf

[monitor:///opt/repod/repos/audit/*.jsonl]
disabled  = false
sourcetype = repod_audit
index     = security

Loki / Grafana

promtail-config.yml

scrape_configs:
  - job_name: repod-audit
    static_configs:
      - targets: [localhost]
        labels:
          job: repod
          __path__: /opt/repod/repos/audit/*.jsonl
    pipeline_stages:
      - json:
          expressions:
            action: action
            result: result
            user: user
      - labels:
          action:
          result:

Alerting recommendations

Critical alerts (page immediately)

Condition	Query / check
Backend container unhealthy	`docker inspect backend-api --format='{{.State.Health.Status}}'` ≠ `healthy`
ClamAV unreachable	`components.clamav.ok == false` in `/health`
CVE database stale > 48 h	`repod_grype_db_age_seconds > 172800`
Pending review queue growing	`repod_pending_review_total > 10` (tune per org)
Login failures spike	> 20 `FAILURE` audit events in 5 min
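
The login-failure rule in the table above is a sliding-window count. The sketch below shows one way to evaluate it over audit timestamps, assuming they are ISO 8601 strings in the `timestamp` field, as the Logstash filter on this page suggests. The function names and defaults are illustrative.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Iterable


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def failure_spike(
    timestamps: Iterable[datetime],
    threshold: int = 20,
    window: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True if more than `threshold` events fall within any span of `window`."""
    recent = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that are older than the window relative to the newest one.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) > threshold:
            return True
    return False
```

Feed it the timestamps of `LOGIN` events with `result=FAILURE`. Grouping the events by user or source address before counting would separate a targeted attack from background noise.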

Warning alerts (notify team)

Condition	Query / check
ClamAV DB not updated in 24 h	`time() - repod_clamav_last_update > 86400`
Disk space low	`df -h /opt/repod/repos` < 10 GB free
Upload error rate > 5%	`sum(rate(repod_uploads_total{status!="published"}[1h])) / sum(rate(repod_uploads_total[1h])) > 0.05` (adjust the window as needed)
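
The disk-space and error-rate checks can also run from a cron job when no Prometheus alerting is in place. The sketch below is a minimal version under two assumptions: `published` is the only successful upload status, and the 10 GB limit is read as GiB, which is what `df -h` reports. The function names and the `rejected` status in the usage example are hypothetical.

```python
import shutil


def free_gib(path: str) -> float:
    """Return the free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def upload_error_rate(counts: dict[str, float], ok_status: str = "published") -> float:
    """Return the fraction of uploads whose status is not `ok_status`.

    `counts` maps each status label of repod_uploads_total to its value.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no uploads yet, so nothing to alert on
    return (total - counts.get(ok_status, 0)) / total


def warnings(path: str, counts: dict[str, float]) -> list[str]:
    """Collect human-readable warnings for the two thresholds in the table."""
    found = []
    if free_gib(path) < 10:
        found.append(f"disk space low on {path}")
    if upload_error_rate(counts) > 0.05:
        found.append("upload error rate above 5%")
    return found
```

For example, `warnings("/opt/repod/repos", {"published": 90, "quarantined": 6, "rejected": 4})` reports the error-rate warning. Counter values accumulate from process start, so a long-running instance reacts slowly to this ratio; a rate over a recent window reacts faster.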

Security event monitoring

Monitor these audit event types for security operations:

Event	Security signal
`LOGIN` with `result=FAILURE`	Brute-force or credential stuffing attempt
`USER_CREATE`	New account provisioned — verify intent
`USER_ROLE_CHANGE`	Privilege escalation — verify intent
`SECURITY_DECISION` with `approve`	CVE exception approved — review justification
`UPLOAD` with `status=quarantined`	Malware or critical CVE detected
`SETTINGS_CHANGE`	Configuration change — verify intent
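
The table above translates directly into a lookup that tags each audit record with its security signal. The sketch below is one way to express it. The `action`, `result`, and `status` field names appear on this page. The `decision` field used for approvals is an assumption and should be checked against a real audit record.

```python
from typing import Optional


def security_signal(event: dict) -> Optional[str]:
    """Return the security signal for an audit event, or None if it is routine."""
    action = event.get("action")
    if action == "LOGIN" and event.get("result") == "FAILURE":
        return "Brute-force or credential stuffing attempt"
    if action == "USER_CREATE":
        return "New account provisioned"
    if action == "USER_ROLE_CHANGE":
        return "Privilege escalation"
    # ASSUMPTION: the approval is stored in a field named "decision".
    if action == "SECURITY_DECISION" and event.get("decision") == "approve":
        return "CVE exception approved"
    if action == "UPLOAD" and event.get("status") == "quarantined":
        return "Malware or critical CVE detected"
    if action == "SETTINGS_CHANGE":
        return "Configuration change"
    return None


print(security_signal({"action": "UPLOAD", "status": "quarantined"}))
# prints "Malware or critical CVE detected"
```

Events that return `None`, such as successful logins, can be left out of the security feed.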

Checking component status manually

# ClamAV — version and signature date
docker exec backend-api clamdscan --version

# Grype — database age
docker exec backend-api grype db status

# GPG — list keys in keyring
docker exec backend-api gpg --homedir /repos/gnupg --list-keys

# SQLite — check database integrity
docker exec backend-api python3 -c "
import sqlite3
for db in ['/repos/auth/users.db', '/repos/package-index/packages.db']:
    conn = sqlite3.connect(db)
    result = conn.execute('PRAGMA integrity_check').fetchone()
    print(f'{db}: {result[0]}')
"

Resource utilization

Monitor Docker resource usage with:

docker stats backend-api frontend-ui depot-apt

Default resource limits in docker-compose.yaml:

Container	CPU limit	Memory limit
`backend-api`	1.5 CPUs	2.5 GB
`frontend-ui`	0.5 CPUs	256 MB
`depot-apt` / `depot-rpm`	0.5 CPUs	128 MB

Increase limits if ClamAV or Grype scans saturate the CPU:

docker-compose.yaml

  backend-api:
    deploy:
      resources:
        limits:
          cpus: "3.0"
          memory: 4g

Disk space management

# Total space used by repos volume
du -sh /opt/repod/repos/

# Breakdown by subdirectory
du -sh /opt/repod/repos/* | sort -rh

# Largest packages in the pool
du -sh /opt/repod/repos/pool/* 2>/dev/null | sort -rh | head -20

# Quarantine contents (review before deleting)
ls -lh /opt/repod/repos/staging/quarantine/

# ClamAV database size
du -sh /opt/repod/repos/clamav-db/

# Grype database size
du -sh /opt/repod/repos/grype-db/

Configure automatic retention cleanup in Settings → Retention in the web UI (default: 90 days for audit logs, no automatic package cleanup).
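
Before changing the retention setting, it can help to preview which audit files a given period would cover. The sketch below selects files by the `YYYY-MM-DD.jsonl` naming scheme described earlier. It only lists names and deletes nothing. How Repod treats the boundary day is not documented here, so the strict comparison is an assumption.

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Iterable


def expired_audit_logs(
    names: Iterable[str],
    today: date,
    retention_days: int = 90,
) -> list[str]:
    """Return the file names dated strictly earlier than today minus retention_days."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        path = Path(name)
        if path.suffix != ".jsonl":
            continue
        try:
            day = date.fromisoformat(path.stem)
        except ValueError:
            continue  # not a dated audit file
        if day < cutoff:
            expired.append(name)
    return sorted(expired)
```

A typical call is `expired_audit_logs((p.name for p in Path("/opt/repod/repos/audit").iterdir()), date.today())`.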