Hardening and Maintenance
Hardening is the set of protections that keep the backend stable when things go wrong: users submit oversized files, scans hang indefinitely, workers crash, or someone tries to run more scans than the system can handle.
Quotas and limits
app/hardening/limits.py evaluates quotas before allowing operations. Key limits:
| Setting | Default | What it guards |
|---|---|---|
| VEGA_MAX_ACTIVE_SCANS_PER_USER | 4 | Prevents one user from monopolizing scan capacity |
| VEGA_MAX_SESSIONS_PER_WORKSPACE | (configurable) | Limits legacy session count per workspace |
| VEGA_MAX_SOURCE_BYTES | 2147483648 (2 GB) | Maximum source archive size at upload time |
| VEGA_MAX_ARCHIVE_ENTRIES | 50,000 | Maximum files in an uploaded archive |
| VEGA_MAX_ARCHIVE_UNCOMPRESSED_BYTES | 5368709120 (5 GB) | Maximum extracted archive size |
These are checked at the API layer before any work begins. Requests that exceed them get a 4xx error with a clear code and message.
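As a rough illustration of a pre-flight quota check, the sketch below rejects a request before any work starts. It is not the code in app/hardening/limits.py: the `Limits` and `QuotaError` types, the `check_scan_request` function, the specific status codes (429, 413), and the error code strings are all assumptions made for the example. It also assumes a new scan is refused once the user already has the maximum number of active scans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    # Defaults copied from the settings listed above.
    max_active_scans_per_user: int = 4
    max_source_bytes: int = 2_147_483_648  # 2 GB per the table (2**31 bytes)


@dataclass(frozen=True)
class QuotaError:
    status: int   # hypothetical HTTP status for the 4xx response
    code: str     # hypothetical machine-readable error code
    message: str  # human-readable explanation


def check_scan_request(
    limits: Limits, active_scans: int, source_bytes: int
) -> Optional[QuotaError]:
    """Return a QuotaError if the request exceeds a limit, else None."""
    # Assumption: starting one more scan at the cap counts as exceeding it.
    if active_scans >= limits.max_active_scans_per_user:
        return QuotaError(
            429,
            "active_scan_limit",
            f"User already has {active_scans} active scans "
            f"(limit {limits.max_active_scans_per_user}).",
        )
    if source_bytes > limits.max_source_bytes:
        return QuotaError(
            413,
            "source_too_large",
            f"Source is {source_bytes} bytes (limit {limits.max_source_bytes}).",
        )
    return None
```

Returning an error value instead of raising keeps the check easy to unit-test; an API handler would translate a non-`None` result into the 4xx response.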
Worker heartbeats
app/hardening/workers.py maintains a registry of running worker processes. Each worker sends a heartbeat on a regular interval while it's alive.
The API exposes worker state through GET /v1/ops/workers. Operators can use this to check:
- Which workers are currently registered
- When each worker last sent a heartbeat
- Whether any workers have gone stale (heartbeat older than VEGA_WORKER_HEARTBEAT_TTL_SECONDS)
A stale worker usually means the worker process has crashed or hung. If scans are stuck in the queued state and the worker's heartbeat is stale, restart the worker.
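The staleness rule can be sketched as a comparison between each worker's last heartbeat and a TTL cutoff. The function name `find_stale_workers` and the dictionary shape (worker ID mapped to last heartbeat time) are assumptions for illustration; the actual registry in app/hardening/workers.py may store heartbeats differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def find_stale_workers(
    last_heartbeats: Dict[str, datetime],
    ttl_seconds: int,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return IDs of workers whose last heartbeat is older than the TTL."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=ttl_seconds)
    # A worker is stale when its most recent heartbeat predates the cutoff.
    return sorted(
        worker_id
        for worker_id, seen_at in last_heartbeats.items()
        if seen_at < cutoff
    )
```

Passing `now` explicitly makes the check deterministic in tests; in production it would default to the current UTC time, with `ttl_seconds` read from VEGA_WORKER_HEARTBEAT_TTL_SECONDS.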
Stale scan recovery
Before processing new scan messages, the worker's polling loop (scripts/run-scan-worker.py) checks for scans that have been running longer than VEGA_SCAN_RUNNING_STALE_SECONDS (default: 6 hours).
Scans go stale when:
- A runner task crashes without writing a failure event
- The ECS task is stopped by AWS (e.g., spot interruption) before finishing
- A network partition prevents the runner from writing its final status
The recovery loop:
1. Detects scans that are running but older than the stale threshold
2. Checks if the associated ECS task is still alive
3. If the ECS task is gone, marks the scan failed with a stale-scan error event
4. Preserves any partial findings and events that were already written
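The four steps above can be sketched as follows. This is a simplified in-memory model, not the worker's real code: the `Scan` dataclass, its field names, the `stale_scan` event type, and the `task_is_alive` callback (standing in for whatever ECS lookup the worker performs) are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional


@dataclass
class Scan:
    # Hypothetical record shape, for illustration only.
    scan_id: str
    status: str
    started_at: datetime
    ecs_task_arn: str
    events: List[dict] = field(default_factory=list)
    findings: List[dict] = field(default_factory=list)


def recover_stale_scans(
    scans: List[Scan],
    stale_seconds: int,
    task_is_alive: Callable[[str], bool],
    now: Optional[datetime] = None,
) -> List[str]:
    """Fail running scans past the stale threshold whose ECS task is gone."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=stale_seconds)
    recovered = []
    for scan in scans:
        # Step 1: only running scans older than the threshold qualify.
        if scan.status != "running" or scan.started_at > cutoff:
            continue
        # Step 2: leave the scan alone if its task is still alive.
        if task_is_alive(scan.ecs_task_arn):
            continue
        # Step 3: mark failed and record a stale-scan error event.
        scan.status = "failed"
        scan.events.append({"type": "stale_scan", "at": now.isoformat()})
        # Step 4: findings and earlier events are left untouched.
        recovered.append(scan.scan_id)
    return recovered
```

Injecting `task_is_alive` as a callback keeps the ECS dependency out of the loop itself, so the logic can be exercised without AWS access.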
Cleanup
app/hardening/cleanup.py handles removal of stale artifacts and sessions. The maintenance ECS task runs cleanup jobs periodically or on demand.
The /v1/ops/ endpoints
app/api/hardening.py exposes operational state through protected API endpoints. These require elevated group membership.
| Endpoint | What it returns |
|---|---|
| GET /v1/ops/limits | Current quota settings |
| GET /v1/ops/workers | Worker heartbeat registry |
| GET /v1/ops/cleanup | Cleanup status or trigger |
Maintenance task
The vega-maintenance ECS task definition is used for one-off jobs. It runs and exits with a status code. The most common use is database migrations:
scripts/aws/run-migrations.sh dev
Other maintenance operations (cleanup, artifact pruning) can be triggered by overriding the task command.
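One way to override the task command is through the `overrides.containerOverrides` field of the ECS RunTask API. The sketch below only builds the request arguments. The cluster name, the container name `maintenance`, and the command shown in the usage note are placeholders, not values taken from this project; substitute the names from your own task definition.

```python
from typing import Any, Dict, List


def build_run_task_kwargs(
    cluster: str,
    command: List[str],
    task_definition: str = "vega-maintenance",
    container_name: str = "maintenance",  # hypothetical container name
) -> Dict[str, Any]:
    """Build RunTask arguments that replace the container's command."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "overrides": {
            "containerOverrides": [
                # The name must match a container in the task definition.
                {"name": container_name, "command": list(command)}
            ]
        },
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("ecs").run_task(**build_run_task_kwargs("my-cluster", ["my-cleanup-command"]))`, adding the launch type and network configuration that your environment requires.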
Debugging
Scans stuck in queued:
1. Check GET /v1/ops/workers — is there an active, non-stale worker?
2. If no worker is registered or heartbeat is stale, restart the worker service.
Scans stuck in running:
1. Check whether the ECS runner task is still alive for that scan.
2. If not, the stale scan recovery loop should mark the scan failed once it has been running longer than VEGA_SCAN_RUNNING_STALE_SECONDS. You can lower this setting to speed up recovery in testing.
3. Check runner CloudWatch logs for crash indicators.
Quota errors hitting users unexpectedly:
1. Check GET /v1/ops/limits to confirm the active quota values.
2. Check how many scans the user currently has in running or queued state.
3. Cancel old stuck scans to free up capacity.