Hardening and Maintenance

Hardening is the set of protections that keep the backend stable when things go wrong: users submit oversized files, scans hang indefinitely, workers crash, or someone tries to run more scans than the system can handle.

Quotas and limits

app/hardening/limits.py evaluates quotas before allowing operations. Key limits:

Setting | Default | What it guards
VEGA_MAX_ACTIVE_SCANS_PER_USER | 4 | Prevents one user from monopolizing scan capacity
VEGA_MAX_SESSIONS_PER_WORKSPACE | (configurable) | Limits legacy session count per workspace
VEGA_MAX_SOURCE_BYTES | 2147483648 (2 GiB) | Maximum source archive size at upload time
VEGA_MAX_ARCHIVE_ENTRIES | 50,000 | Maximum files in an uploaded archive
VEGA_MAX_ARCHIVE_UNCOMPRESSED_BYTES | 5368709120 (5 GiB) | Maximum extracted archive size

These are checked at the API layer before any work begins. Requests that exceed them get a 4xx error with a clear code and message.
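To make the idea of checking limits before any work begins concrete, here is a minimal sketch of what such checks could look like. It is not the contents of app/hardening/limits.py: the names QuotaExceeded, Limits, check_active_scans, and check_archive, the error code strings, and the assumption that uploads are ZIP archives are all hypothetical. Only the default values come from the table above.

```python
import io
import zipfile
from dataclasses import dataclass


class QuotaExceeded(Exception):
    """Hypothetical error carrying a machine-readable code and a message."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code
        self.message = message


@dataclass(frozen=True)
class Limits:
    # Defaults copied from the settings table; field names are illustrative.
    max_active_scans_per_user: int = 4
    max_source_bytes: int = 2_147_483_648
    max_archive_entries: int = 50_000
    max_archive_uncompressed_bytes: int = 5_368_709_120


def check_active_scans(active_count: int, limits: Limits) -> None:
    """Reject a new scan when the user is already at the per-user cap."""
    if active_count >= limits.max_active_scans_per_user:
        raise QuotaExceeded(
            "active_scan_limit",
            f"{active_count} active scans; limit is "
            f"{limits.max_active_scans_per_user}",
        )


def check_archive(data: bytes, limits: Limits) -> None:
    """Check upload size, file count, and declared extracted size."""
    if len(data) > limits.max_source_bytes:
        raise QuotaExceeded("source_too_large", "source archive is too large")
    # Assumption: the upload is a ZIP archive. Only the central directory
    # is read here; nothing is extracted.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        files = [info for info in archive.infolist() if not info.is_dir()]
    if len(files) > limits.max_archive_entries:
        raise QuotaExceeded("archive_too_many_entries", "too many files")
    # file_size is the size declared in the archive headers.
    declared_bytes = sum(info.file_size for info in files)
    if declared_bytes > limits.max_archive_uncompressed_bytes:
        raise QuotaExceeded("archive_too_large", "extracted size is too large")
```

Because the extracted size is taken from archive headers, which a malicious archive can misstate, a real implementation would probably also enforce the byte limit while extracting.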

Worker heartbeats

app/hardening/workers.py maintains a registry of running worker processes. Each worker sends a heartbeat on a regular interval while it's alive.

The API exposes worker state through GET /v1/ops/workers. Operators can use this to check:

  • Which workers are currently registered
  • When each worker last sent a heartbeat
  • Whether any workers have gone stale (heartbeat older than VEGA_WORKER_HEARTBEAT_TTL_SECONDS)

A stale worker usually means the worker process has crashed or hung. If you see scans stuck in queued state alongside a stale worker, restart the worker.
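The staleness rule can be sketched as a simple timestamp comparison. This is an illustration only: the function name find_stale_workers, the dictionary shape, and the 60-second TTL in the usage example are assumptions, not the registry's actual data model or the real default of VEGA_WORKER_HEARTBEAT_TTL_SECONDS.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def find_stale_workers(
    heartbeats: Dict[str, datetime],
    ttl_seconds: int,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return IDs of workers whose last heartbeat is older than the TTL."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=ttl_seconds)
    return sorted(
        worker_id
        for worker_id, last_seen in heartbeats.items()
        if last_seen < cutoff
    )


# Usage with a fixed clock so the result is deterministic.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
heartbeats = {
    "worker-a": now - timedelta(seconds=10),   # fresh
    "worker-b": now - timedelta(seconds=300),  # stale with a 60 s TTL
}
stale = find_stale_workers(heartbeats, ttl_seconds=60, now=now)
```

Passing the clock in as a parameter keeps the check easy to unit test.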

Stale scan recovery

Before processing new scan messages, the worker's polling loop (scripts/run-scan-worker.py) checks for scans that have been running longer than VEGA_SCAN_RUNNING_STALE_SECONDS (default: 6 hours).

These stale scans happen when:

  • A runner task crashes without writing a failure event
  • The ECS task is stopped by AWS (e.g., spot interruption) before finishing
  • A network partition prevents the runner from writing its final status

The recovery loop:

  1. Detects scans that are running but older than the stale threshold
  2. Checks if the associated ECS task is still alive
  3. If the ECS task is gone, marks the scan failed with a stale-scan error event
  4. Preserves any partial findings and events that were already written
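The four recovery steps can be sketched in a few lines. This is a simplified in-memory model, not the code in scripts/run-scan-worker.py: the Scan dataclass, the recover_stale_scans function, the event shape, and the is_task_alive callback (which stands in for whatever ECS lookup the worker performs) are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional

SIX_HOURS = 6 * 60 * 60  # documented default for VEGA_SCAN_RUNNING_STALE_SECONDS


@dataclass
class Scan:
    scan_id: str
    status: str
    started_at: datetime
    task_arn: str
    events: List[dict] = field(default_factory=list)


def recover_stale_scans(
    scans: List[Scan],
    is_task_alive: Callable[[str], bool],
    stale_seconds: int = SIX_HOURS,
    now: Optional[datetime] = None,
) -> List[str]:
    """Mark stale running scans as failed and return their IDs."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=stale_seconds)
    recovered = []
    for scan in scans:
        # Step 1: only scans that are running and past the threshold.
        if scan.status != "running" or scan.started_at >= cutoff:
            continue
        # Step 2: leave the scan alone if its task is still alive.
        if is_task_alive(scan.task_arn):
            continue
        # Steps 3 and 4: mark failed and append an event; existing
        # events are kept, never replaced.
        scan.status = "failed"
        scan.events.append({"type": "stale_scan", "at": now.isoformat()})
        recovered.append(scan.scan_id)
    return recovered
```

Appending to the event list rather than rewriting it is what keeps partial results intact in this model.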

Cleanup

app/hardening/cleanup.py handles removal of stale artifacts and sessions. The maintenance ECS task runs cleanup jobs periodically or on demand.

The /v1/ops/ endpoints

app/api/hardening.py exposes operational state through protected API endpoints. These require elevated group membership.

Endpoint | What it returns
GET /v1/ops/limits | Current quota settings
GET /v1/ops/workers | Worker heartbeat registry
GET /v1/ops/cleanup | Cleanup status or trigger

Maintenance task

The vega-maintenance ECS task definition is used for one-off jobs. It runs and exits with a status code. The most common use is database migrations:

scripts/aws/run-migrations.sh dev

Other maintenance operations (cleanup, artifact pruning) can be triggered by overriding the task command.
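As an illustration of overriding the task command, the sketch below builds the parameters for an ECS RunTask call in the shape the boto3 ECS client accepts. The cluster name, container name, and command are placeholders, and the helper build_maintenance_request is hypothetical; only the task definition name vega-maintenance comes from this page. Nothing here contacts AWS.

```python
from typing import Any, Dict, List


def build_maintenance_request(
    cluster: str,
    container_name: str,
    command: List[str],
) -> Dict[str, Any]:
    """Build RunTask parameters that override the container command."""
    return {
        "cluster": cluster,
        "taskDefinition": "vega-maintenance",
        "count": 1,
        "overrides": {
            "containerOverrides": [
                # The name must match a container in the task definition.
                {"name": container_name, "command": list(command)},
            ]
        },
    }


# Placeholder values; substitute the real cluster, container, and command.
request = build_maintenance_request(
    cluster="example-cluster",
    container_name="example-container",
    command=["echo", "cleanup placeholder"],
)
```

With boto3 installed and credentials configured, the dictionary could be passed as boto3.client("ecs").run_task(**request). Depending on how the task definition is set up, extra parameters such as launchType and networkConfiguration may also be required.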

Debugging

Scans stuck in queued:

  1. Check GET /v1/ops/workers: is there an active, non-stale worker?
  2. If no worker is registered or the heartbeat is stale, restart the worker service.

Scans stuck in running:

  1. Check whether the ECS runner task is still alive for that scan.
  2. If not, the stale scan recovery loop should mark the scan failed once it has been running longer than VEGA_SCAN_RUNNING_STALE_SECONDS. You can decrease this setting to speed up recovery in testing.
  3. Check runner CloudWatch logs for crash indicators.

Quota errors hitting users unexpectedly:

  1. Check GET /v1/ops/limits to confirm the active quota values.
  2. Check how many scans the user currently has in running or queued state.
  3. Cancel old stuck scans to free up capacity.