Health & Monitoring¶
Monitor Repod's operational status, integrate with your alerting stack, and interpret the health endpoints.
Health endpoints¶
Three endpoints are available for different monitoring scenarios:
| Endpoint | Auth | Use case |
|---|---|---|
| GET /health/live | Public | Container liveness probe: is the process running? |
| GET /health/ready | Public | Readiness probe: are all dependencies ready? |
| GET /health | Bearer token | Full status report with component-level detail |
Liveness probe¶
Returns 200 as long as the uvicorn process is alive. Use this for Docker
HEALTHCHECK and Kubernetes livenessProbe.
Readiness probe¶
Returns 200 when the database and critical services are ready. Returns 503
during startup or when a critical dependency is unavailable. Use this for
Kubernetes readinessProbe and load-balancer health checks.
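A deployment script can use the same endpoint to hold traffic until the backend reports ready. The following is a minimal sketch, not part of Repod: the helper names `probe_status` and `wait_until_ready`, the retry count, and the delay are all illustrative assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 while a dependency is still starting
    except (urllib.error.URLError, OSError):
        return 0  # backend not reachable yet


def wait_until_ready(
    probe: Callable[[], int],
    attempts: int = 30,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Example call (assumes the backend listens on localhost:8000):
# wait_until_ready(lambda: probe_status("http://localhost:8000/health/ready"))
```

Passing the probe as a callable keeps the retry loop independent of the network, so it can be exercised with a stub.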
Full health report¶
TOKEN=$(curl -s -X POST http://localhost:8000/auth/token \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"Admin1234!"}' | jq -r .access_token)
curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8000/health | jq .
{
  "status": "ok",
  "version": "v1.2.0",
  "uptime_seconds": 86432,
  "components": {
    "database": {"ok": true},
    "clamav": {"ok": true, "version": "ClamAV 1.4.3/27509"},
    "grype": {"ok": true, "db_age_hours": 4.2},
    "gpg": {"ok": true, "key_count": 1}
  }
}
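For alerting, the report can be reduced to the list of components whose `ok` flag is false. This is a minimal sketch that assumes the response shape shown above; the function name `failing_components` is hypothetical, and the sample marks ClamAV as failing purely for illustration.

```python
import json


def failing_components(report: dict) -> list[str]:
    """Return the names of components whose "ok" flag is not true."""
    components = report.get("components", {})
    return sorted(
        name for name, info in components.items() if not info.get("ok", False)
    )


# Sample payload modelled on the response above, with ClamAV marked as down.
sample = json.loads("""
{
  "components": {
    "database": {"ok": true},
    "clamav": {"ok": false},
    "grype": {"ok": true, "db_age_hours": 4.2},
    "gpg": {"ok": true, "key_count": 1}
  }
}
""")

print(failing_components(sample))  # ['clamav']
```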
Docker health check configuration¶
The docker-compose.yaml includes a built-in health check for the backend:
backend-api:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/health/live"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 120s # allow ClamAV signature loading time
Check health status:
docker inspect backend-api --format='{{.State.Health.Status}}'
Prometheus metrics¶
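Repod exposes Prometheus metrics on the backend via the prometheus-client
library. Metrics are available at the /metrics path of the backend service on
port 8000, the same target used in the scrape configuration below.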
Key metrics:
| Metric | Type | Description |
|---|---|---|
| repod_packages_total | Gauge | Total packages in the pool |
| repod_uploads_total | Counter | Upload attempts (labels: status) |
| repod_cve_findings_total | Counter | CVE findings (labels: severity) |
| repod_pending_review_total | Gauge | Packages awaiting CISO decision |
| repod_quarantine_total | Gauge | Packages in quarantine |
| repod_clamav_last_update | Gauge | Unix timestamp of last ClamAV DB update |
| repod_grype_db_age_seconds | Gauge | Age of Grype vulnerability database, in seconds |
| http_requests_total | Counter | HTTP requests (labels: method, path, status) |
| http_request_duration_seconds | Histogram | Request latency |
Prometheus scrape configuration¶
scrape_configs:
  - job_name: repod
    static_configs:
      - targets: ["repod-backend:8000"]
    metrics_path: /metrics
    scrape_interval: 30s
    # Add bearer_token if you protect /metrics in production
Grafana dashboard¶
Import the community dashboard ID XXXXX (search "Repod" on grafana.com) or
build your own using the metrics above.
Recommended panels:
- Upload success rate (last 1h)
- CVE findings by severity (last 7d)
- Pending review queue depth (time series)
- ClamAV DB age (alert if > 48 h)
- Grype DB age (alert if > 24 h)
- HTTP request latency (p50 / p95 / p99)
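The two database-age panels combine a timestamp gauge (`repod_clamav_last_update`) and an age gauge (`repod_grype_db_age_seconds`), so the comparison differs slightly for each. A minimal sketch of that logic, using the thresholds from the list above; the function name `stale_databases` is hypothetical.

```python
import time
from typing import Optional

CLAMAV_MAX_AGE_S = 48 * 3600  # ClamAV panel threshold: 48 h
GRYPE_MAX_AGE_S = 24 * 3600   # Grype panel threshold: 24 h


def stale_databases(
    clamav_last_update: float,
    grype_db_age_seconds: float,
    now: Optional[float] = None,
) -> list[str]:
    """Return which scanner databases exceed their age threshold."""
    now = time.time() if now is None else now
    stale = []
    # ClamAV is reported as a Unix timestamp, so the age must be derived.
    if now - clamav_last_update > CLAMAV_MAX_AGE_S:
        stale.append("clamav")
    # Grype is already reported as an age in seconds.
    if grype_db_age_seconds > GRYPE_MAX_AGE_S:
        stale.append("grype")
    return stale
```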
Log monitoring¶
View logs in real time¶
# All containers
docker compose logs -f
# Backend only (last 200 lines)
docker compose logs backend-api --tail=200
# Filter for errors
docker compose logs backend-api 2>&1 | grep -iE "error|exception|traceback"
# Filter for security events
docker compose logs backend-api 2>&1 | grep -E "UPLOAD|SECURITY|CVE"
Structured audit log¶
The audit log at /repos/audit/YYYY-MM-DD.jsonl is the authoritative record
of all security-relevant events. Query it with standard UNIX tools:
# Today's audit log, last 50 events, pretty-printed (jq accepts one JSON object per line)
docker exec backend-api tail -n 50 \
/repos/audit/$(date +%Y-%m-%d).jsonl | jq .
# All failed logins today
docker exec backend-api grep '"action":"LOGIN"' \
/repos/audit/$(date +%Y-%m-%d).jsonl | grep '"result":"FAILURE"'
# All CVE decisions in the last 7 days
for d in $(seq 0 6); do
date --date="-$d days" +%Y-%m-%d
done | while read -r day; do
f="/opt/repod/repos/audit/$day.jsonl" # host path; use /repos/audit/ when running inside the container
[ -f "$f" ] && grep '"action":"SECURITY_DECISION"' "$f"
done
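The grep patterns above match only when the JSON is written without spaces after the colon. Parsing each line avoids that dependency. The sketch below is one way to do it, assuming one file per day named YYYY-MM-DD.jsonl and an `action` field as in the examples above; the function names are hypothetical.

```python
import json
from datetime import date, timedelta
from pathlib import Path
from typing import Iterator, Optional


def iter_audit_events(
    audit_dir: Path, days: int = 7, today: Optional[date] = None
) -> Iterator[dict]:
    """Yield parsed events from the last `days` daily files, newest day first."""
    today = today or date.today()
    for offset in range(days):
        day = today - timedelta(days=offset)
        path = audit_dir / f"{day.isoformat()}.jsonl"
        if not path.is_file():
            continue  # no file for that day
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip blank or partially written lines
                if isinstance(event, dict):
                    yield event


def security_decisions(
    audit_dir: Path, days: int = 7, today: Optional[date] = None
) -> list[dict]:
    """Return all SECURITY_DECISION events in the window."""
    return [
        event
        for event in iter_audit_events(audit_dir, days, today)
        if event.get("action") == "SECURITY_DECISION"
    ]
```

Call it with the directory that holds the audit files, for example `security_decisions(Path("/opt/repod/repos/audit"))` on the host.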
Ship logs to a SIEM¶
The JSONL format is directly ingestible by most SIEM platforms. For example, a Logstash pipeline that reads the audit files and indexes them into Elasticsearch:
input {
  file {
    path => "/opt/repod/repos/audit/*.jsonl"
    codec => "json"
    start_position => "beginning"
  }
}
filter {
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "repod-audit-%{+YYYY.MM.dd}"
  }
}
For Splunk, configure a Universal Forwarder monitor input on the same audit directory (/opt/repod/repos/audit/).
Alerting recommendations¶
Critical alerts (page immediately)¶
| Condition | Query / check |
|---|---|
| Backend container unhealthy | docker inspect backend-api --format='{{.State.Health.Status}}' ≠ healthy |
| ClamAV unreachable | components.clamav.ok == false in /health |
| CVE database stale > 48 h | repod_grype_db_age_seconds > 172800 |
| Pending review queue growing | repod_pending_review_total > 10 (tune per org) |
| Login failures spike | > 20 FAILURE audit events in 5 min |
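The login-failure condition needs a sliding window over the audit log rather than a metric query. A minimal sketch of that check, assuming each event carries `action`, `result`, and an ISO 8601 `timestamp` as in the audit examples elsewhere on this page; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def max_failures_in_window(
    events: list[dict], window: timedelta = timedelta(minutes=5)
) -> int:
    """Return the largest number of failed logins inside any span of `window`."""
    times = sorted(
        # Older Python versions do not accept a trailing "Z", so normalise it.
        datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        for event in events
        if event.get("action") == "LOGIN" and event.get("result") == "FAILURE"
    )
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Advance the left edge until the span fits inside the window.
        while current - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

An alert would fire when the returned value exceeds 20, the threshold in the table above.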
Warning alerts (notify team)¶
| Condition | Query / check |
|---|---|
| ClamAV DB not updated in 24 h | time() - repod_clamav_last_update > 86400 |
| Disk space low | df -h /opt/repod/repos < 10 GB free |
| Upload error rate > 5% | sum(rate(repod_uploads_total{status!="published"}[1h])) / sum(rate(repod_uploads_total[1h])) > 0.05 (1h window as an example) |
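Without a Prometheus server, the upload error ratio can be approximated from the raw /metrics text. The sketch below parses only what it needs from the text exposition format and treats every `status` other than `published` as an error; the function name and sample values are illustrative. It reports a lifetime ratio of the counters, not a rate over a time window.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def upload_error_rate(metrics_text: str) -> float:
    """Fraction of counted uploads whose status label is not "published"."""
    total = 0.0
    failed = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # HELP and TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match or match.group("name") != "repod_uploads_total":
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        value = float(match.group("value"))
        total += value
        if labels.get("status") != "published":
            failed += value
    return failed / total if total else 0.0


# Made-up sample values for illustration.
sample_metrics = """\
# TYPE repod_uploads_total counter
repod_uploads_total{status="published"} 95.0
repod_uploads_total{status="quarantined"} 5.0
repod_packages_total 120.0
"""
print(upload_error_rate(sample_metrics))
```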
Security event monitoring¶
Monitor these audit event types for security operations:
| Event | Security signal |
|---|---|
| LOGIN with result=FAILURE | Brute-force or credential stuffing attempt |
| USER_CREATE | New account provisioned: verify intent |
| USER_ROLE_CHANGE | Privilege escalation: verify intent |
| SECURITY_DECISION with approve | CVE exception approved: review justification |
| UPLOAD with status=quarantined | Malware or critical CVE detected |
| SETTINGS_CHANGE | Configuration change: verify intent |
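A daily summary of these events can be produced by tallying parsed audit records against the watch list. This is a sketch under assumptions: the `action`, `result`, and `status` fields follow the table above, while `decision` is a guessed field name for the approve/reject value and should be checked against a real audit record. The function name is hypothetical.

```python
from collections import Counter
from typing import Iterable

# Actions that are flagged regardless of their other fields.
ALWAYS_FLAG = {"USER_CREATE", "USER_ROLE_CHANGE", "SETTINGS_CHANGE"}


def security_signals(events: Iterable[dict]) -> Counter:
    """Count audit events that match the watch list, keyed by a short label."""
    counts: Counter = Counter()
    for event in events:
        action = event.get("action")
        if action in ALWAYS_FLAG:
            counts[action] += 1
        elif action == "LOGIN" and event.get("result") == "FAILURE":
            counts["LOGIN:FAILURE"] += 1
        elif action == "UPLOAD" and event.get("status") == "quarantined":
            counts["UPLOAD:quarantined"] += 1
        elif action == "SECURITY_DECISION" and event.get("decision") == "approve":
            # "decision" is an assumed field name, not confirmed by the docs.
            counts["SECURITY_DECISION:approve"] += 1
    return counts
```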
Checking component status manually¶
# ClamAV — version and signature date
docker exec backend-api clamdscan --version
# Grype — database age
docker exec backend-api grype db status
# GPG — list keys in keyring
docker exec backend-api gpg --homedir /repos/gnupg --list-keys
# SQLite — check database integrity
docker exec backend-api python3 -c "
import sqlite3
for db in ['/repos/auth/users.db', '/repos/package-index/packages.db']:
    conn = sqlite3.connect(db)
    # The first row of integrity_check is 'ok' for a healthy database
    result = conn.execute('PRAGMA integrity_check').fetchone()
    print(f'{db}: {result[0]}')
    conn.close()
"
Resource utilization¶
Monitor per-container CPU and memory usage with a tool such as docker stats.
Default resource limits in docker-compose.yaml:
| Container | CPU limit | Memory limit |
|---|---|---|
| backend-api | 1.5 CPUs | 2.5 GB |
| frontend-ui | 0.5 CPUs | 256 MB |
| depot-apt / depot-rpm | 0.5 CPUs | 128 MB |
Increase the backend-api limits in docker-compose.yaml if ClamAV or Grype scans saturate the CPU.
Disk space management¶
# Total space used by repos volume
du -sh /opt/repod/repos/
# Breakdown by subdirectory
du -sh /opt/repod/repos/* | sort -rh
# Largest packages in the pool
du -sh /opt/repod/repos/pool/* 2>/dev/null | sort -rh | head -20
# Quarantine contents (review before deleting)
ls -lh /opt/repod/repos/staging/quarantine/
# ClamAV database size
du -sh /opt/repod/repos/clamav-db/
# Grype database size
du -sh /opt/repod/repos/grype-db/
Configure automatic retention cleanup in Settings → Retention in the web UI (default: 90 days for audit logs, no automatic package cleanup).
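The "disk space low" warning can also be checked from a script rather than by reading df output. A minimal sketch using the standard library; the 10 GiB default mirrors the warning threshold listed under the alerting recommendations, and the function names are hypothetical.

```python
import shutil

GIB = 1024 ** 3  # df -h reports sizes in powers of 1024


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / GIB


def is_disk_low(free: float, min_free_gib: float = 10.0) -> bool:
    """True when the free space falls below the warning threshold."""
    return free < min_free_gib


# Example call on the host (path taken from the commands above):
# is_disk_low(free_gib("/opt/repod/repos"))
```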