Debugging Vega on AWS

When something breaks in AWS, the key is to follow the request or scan path one hop at a time. Don't start by changing Terraform or redeploying. First find out where the failure is.

General approach

1. Run the smoke test
2. Check whether the frontend loads
3. Check /v1/healthz
4. Check CloudWatch logs for the failing service
5. Check that service's dependencies (Postgres, S3, SQS, Cognito)
6. Fix the specific failure
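
Step 3 can be scripted as a one-line probe. This is a minimal sketch, not part of the repo: `vega_healthz` is a hypothetical helper name, and the default base URL is taken from the `/v1/auth/me` example later in this guide on the assumption that `/v1/healthz` is served from the same host.

```bash
# Hypothetical helper: print the HTTP status code returned by /v1/healthz.
# Usage: vega_healthz [base-url]
vega_healthz() {
  # Assumed default host; pass your own base URL if it differs.
  local base_url="${1:-https://api.dev.vega.example.com}"
  # -o /dev/null discards the body; -w prints only the status code.
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 "${base_url}/v1/healthz"
}
```

A `200` means the API answered. `000` means curl never got an HTTP response (DNS, TLS, or network failure), which points at the load balancer or networking rather than the application.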

Tail logs quickly

Almost every AWS debugging session starts here:

# Follow live logs (Ctrl+C to stop)
aws logs tail /vega/dev/vega-api \
  --region us-west-1 \
  --since 10m \
  --follow

# View logs without following
aws logs tail /vega/dev/vega-worker \
  --region us-west-1 \
  --since 1h

Adjust the log group name for the service you're debugging: vega-api, vega-worker, vega-llm-proxy, vega-maintenance, vega-core-runner.
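
If you tail logs often, a small wrapper saves retyping the log group path. This is a sketch only: `vega_logs` is a hypothetical name, and it assumes other environments follow the same `/vega/<env>/<service>` naming as the dev log groups above.

```bash
# Hypothetical helper (not part of the repo): tail a Vega service's log group.
# Usage: vega_logs <service> [env] [since]
vega_logs() {
  local service="$1" env="${2:-dev}" since="${3:-10m}"
  # Reject anything that is not one of the services listed above.
  case "$service" in
    vega-api|vega-worker|vega-llm-proxy|vega-maintenance|vega-core-runner) ;;
    *) echo "unknown service: $service" >&2; return 2 ;;
  esac
  aws logs tail "/vega/${env}/${service}" --region us-west-1 --since "$since"
}
```

For example, `vega_logs vega-worker dev 1h` prints the last hour of worker logs.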

Frontend not loading

The frontend is static files in S3, served by CloudFront.

Check:

1. CloudFront distribution status in the AWS console (should be Deployed).
2. The S3 frontend bucket: does it contain index.html and the JS/CSS bundles?
3. Browser DevTools → Network tab: what's failing?
4. Was the frontend recently built and uploaded? The deploy scripts do this automatically, but if you deployed the backend only, the frontend files may be stale.
5. If API calls are failing from the frontend, check whether CloudFront is routing /v1/* to the correct origin.

API is down

Start with the smoke test:

scripts/aws/smoke-test.sh dev

If that fails, check ECS:

aws ecs describe-services \
  --region us-west-1 \
  --cluster <cluster-name> \
  --services vega-api

Look at runningCount, pendingCount, and events. Then get the stopped task reason:

aws ecs list-tasks \
  --region us-west-1 \
  --cluster <cluster-name> \
  --service-name vega-api \
  --desired-status STOPPED

aws ecs describe-tasks \
  --region us-west-1 \
  --cluster <cluster-name> \
  --tasks <task-arn>

Common causes:

- Container failed to start → check CloudWatch logs for startup errors
- Can't connect to Postgres → check VEGA_DATABASE_URL and the database security group
- Missing secret → check Secrets Manager for the secret referenced in the task definition
- Security group blocks traffic → check that the ALB can reach the API container on port 8000

Login not working

Users sign in through Cognito. The backend validates the resulting JWT.

Checklist:

1. What error does the browser show? Open DevTools → Network → look at the /v1/auth/login or Cognito request.
2. Confirm VEGA_AUTH_PROVIDER=cognito in the API task definition.
3. Confirm VEGA_COGNITO_REGION, VEGA_COGNITO_USER_POOL_ID, and VEGA_COGNITO_APP_CLIENT_ID are correct.
4. Test with a raw API call: copy the access token from the browser and try /v1/auth/me:

```bash
curl -s https://api.dev.vega.example.com/v1/auth/me \
  -H "Authorization: Bearer <token>"
```

If /v1/auth/me returns invalid_token, the JWT is expired or the Cognito configuration is wrong.
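
To tell an expired token from a misconfigured pool, decode the token's payload and read the claims. The sketch below uses a hypothetical helper, `jwt_payload`, and does not verify the signature; it only prints the claims. Cognito access tokens carry `exp`, `iss`, `client_id`, and `token_use`.

```bash
# Hypothetical helper: print the (unverified) JSON payload of a JWT.
# Usage: jwt_payload <token>
jwt_payload() {
  local payload
  # Take the middle segment and map base64url characters back to base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops padding; restore it so base64 accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 --decode
  echo
}
```

Compare `exp` with `date +%s` to check expiry. The `iss` claim should end in the value of VEGA_COGNITO_USER_POOL_ID and contain VEGA_COGNITO_REGION, and `client_id` should match VEGA_COGNITO_APP_CLIENT_ID.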

Scan stuck in `queued`

A scan stays queued when the worker hasn't picked it up yet.

Check in order:

Is the worker running?

aws ecs describe-services \
  --region us-west-1 \
  --cluster <cluster> \
  --services vega-worker

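Is the SQS queue receiving messages?

```bash
aws sqs get-queue-attributes \
  --region us-west-1 \
  --queue-url <queue-url> \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
```

A non-zero ApproximateNumberOfMessages that keeps growing while ApproximateNumberOfMessagesNotVisible (messages in flight) stays at 0 means messages are piling up: the worker is not consuming them. ApproximateAgeOfOldestMessage is not an attribute that get-queue-attributes returns; it is a CloudWatch metric in the AWS/SQS namespace, and a steadily growing value there points to the same problem.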
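
ApproximateAgeOfOldestMessage is published as a CloudWatch metric in the AWS/SQS namespace, so you can read its trend from there. The sketch below is one way to do that. `sqs_oldest_age` is a hypothetical helper name, and it takes the queue name (the last path segment of the queue URL), not the URL.

```bash
# Hypothetical helper: max age (seconds) of the oldest message over the last 30 minutes.
# Usage: sqs_oldest_age <queue-name>
sqs_oldest_age() {
  local queue_name="$1" start end
  # Try GNU date first, then fall back to BSD/macOS date.
  start=$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
    || date -u -v-30M +%Y-%m-%dT%H:%M:%SZ)
  end=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  aws cloudwatch get-metric-statistics \
    --region us-west-1 \
    --namespace AWS/SQS \
    --metric-name ApproximateAgeOfOldestMessage \
    --dimensions "Name=QueueName,Value=${queue_name}" \
    --start-time "$start" --end-time "$end" \
    --period 300 \
    --statistics Maximum \
    --query 'sort_by(Datapoints,&Timestamp)[].[Timestamp,Maximum]' \
    --output text
}
```

A value that rises by roughly 300 seconds per datapoint means nothing has been consumed during that window.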

Check worker logs:

aws logs tail /vega/dev/vega-worker --region us-west-1 --since 30m

Is the worker's task role allowed to receive SQS messages? Check the IAM policy on the worker ECS task role.
Can the worker reach Postgres? The worker needs to claim scans in the database. Check the security group rules and VEGA_DATABASE_URL.

Worker not launching runner tasks

If the worker is running but scans stay claimed and never reach running, the worker may be failing to launch ECS tasks.

Check:

- Worker logs for RunTask errors
- IAM task role: does it have ecs:RunTask, ecs:StopTask, and ecs:DescribeTasks permissions?
- VEGA_SCAN_WORKER_EXECUTION_MODE: should be ecs in production
- Worker security group: needs outbound HTTPS to reach the ECS API endpoint
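
You can check the IAM permissions above without reading policy JSON by asking the IAM policy simulator. The sketch below uses a hypothetical helper, `role_allows`. The role ARN is whatever the worker task definition references, and the caller needs permission to run the simulation.

```bash
# Hypothetical helper: ask IAM whether a role's policies allow the given actions.
# Usage: role_allows <role-arn> <action>...
role_allows() {
  local role_arn="$1"
  shift
  # Prints one line per action: the action name and allowed/implicitDeny/explicitDeny.
  aws iam simulate-principal-policy \
    --policy-source-arn "$role_arn" \
    --action-names "$@" \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text
}
```

For example, `role_allows <worker-task-role-arn> ecs:RunTask ecs:StopTask ecs:DescribeTasks`. Without `--resource-arns` the simulation evaluates against all resources, so treat an `allowed` result as a first-pass signal only. ecs:RunTask usually also requires iam:PassRole on the roles named in the runner task definition, which is worth simulating the same way.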

Runner task failing

The runner runs as a standalone ECS task that the worker starts with RunTask. When its container exits, the task stops and ECS records the reason.

Get the failure reason:

aws ecs describe-tasks \
  --region us-west-1 \
  --cluster <cluster> \
  --tasks <task-arn>

Look at stoppedReason and containers[0].exitCode.
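
To pull just those fields out of the full describe-tasks response, add a `--query` filter. A minimal sketch, with `task_stop_summary` as a hypothetical helper name:

```bash
# Hypothetical helper: summarize why a task stopped.
# Usage: task_stop_summary <cluster> <task-arn>
task_stop_summary() {
  # Keep only the task-level stop fields and each container's exit details.
  aws ecs describe-tasks \
    --region us-west-1 \
    --cluster "$1" \
    --tasks "$2" \
    --query 'tasks[].{stopCode:stopCode,stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}' \
    --output json
}
```

ECS only keeps stopped tasks visible for a limited time (roughly an hour), so run this soon after the failure.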

Check runner logs:

aws logs tail /vega/dev/vega-core-runner --region us-west-1 --since 1h

Check S3 artifacts — if the runner got far enough, it may have uploaded a vega-core-debug-bundle.zip with full Codex state. Download it from the artifacts bucket and inspect the contents.

Common runner failures:

- Can't download source from S3 → check bucket permissions and VEGA_S3_SOURCE_BUCKET
- Codex calls failing → check LLM proxy (see below)
- Postgres write failing → check database security group and credentials
- Out of memory or CPU → increase task definition resource limits in Terraform
- exec format error → architecture mismatch (see below)

ECS task fails with `exec format error`

If a task (runner, maintenance, or any service) exits immediately with exit code 255 and CloudWatch logs show:

exec /usr/local/bin/python: exec format error

the Docker image was built for the wrong CPU architecture. The ECS Fargate task definitions are configured for X86_64 (linux/amd64), but the image was built as ARM64 (linux/arm64) — which is the native architecture on Apple Silicon Macs.

Fix:

Rebuild using build-images.sh, which defaults to linux/amd64:

scripts/aws/build-images.sh dev <sha>
scripts/aws/push-images.sh dev <sha>

If you need to build images manually:

docker build --platform linux/amd64 -t vega-api .

Do not omit --platform when building on an M-series Mac unless you know the ECS task definition explicitly targets ARM64.
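
Before pushing a manually built image, you can confirm what platform it was built for. The sketch below uses two hypothetical helpers, `image_platform` and `require_amd64`, and assumes the image exists in the local Docker image store.

```bash
# Hypothetical helper: print the OS/architecture a local image was built for.
# Usage: image_platform <image>
image_platform() {
  docker image inspect --format '{{.Os}}/{{.Architecture}}' "$1"
}

# Hypothetical guard: fail unless the image targets linux/amd64.
# Usage: require_amd64 <image>
require_amd64() {
  local platform
  platform=$(image_platform "$1") || return 1
  if [ "$platform" != "linux/amd64" ]; then
    echo "$1 is built for $platform, expected linux/amd64" >&2
    return 1
  fi
}
```

Running `require_amd64 vega-api` before the push step catches an ARM64 build on the laptop instead of in ECS.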

Findings not persisting

If the scan completed but no findings appear:

1. Did vega-core actually emit finding_updated events? Check vega-core-events.jsonl in the S3 artifacts bucket.
2. Did the backend adapter map them? Check runner logs for any finding_upsert errors.
3. Is Postgres migration 004_findings_columns.sql applied? Check the schema_migrations table.
4. Are frontend filters hiding the findings? Try clearing severity and status filters.
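
For the first check, download vega-core-events.jsonl from the artifacts bucket (for example with `aws s3 cp`, using whatever key your scan wrote it under) and count the matching events. This is a rough sketch: `count_finding_events` is a hypothetical helper, and it assumes the event name appears literally in each JSONL line.

```bash
# Hypothetical helper: count lines in a local events file that mention finding_updated.
# Usage: count_finding_events <path-to-vega-core-events.jsonl>
count_finding_events() {
  # grep -c prints the number of matching lines (0 if none).
  grep -c 'finding_updated' "$1"
}
```

A count of 0 means vega-core never emitted findings, so the problem is upstream of the adapter and the database.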

LLM proxy / AI calls failing

The runner should call the LLM proxy, not the AI provider directly. The proxy holds the provider API key.

Check:

1. Is vega-llm-proxy running?

aws ecs describe-services --cluster <cluster> --services vega-llm-proxy --region us-west-1

2. Does the runner have VEGA_LLM_PROXY_BASE_URL set to the proxy's internal address?
3. Does the runner have a valid scan-scoped proxy token?
4. Is the provider API key correct in Secrets Manager? Check LLM proxy logs for provider authentication errors:

aws logs tail /vega/dev/vega-llm-proxy --region us-west-1 --since 30m

Database migration failed

scripts/aws/run-migrations.sh dev

If it fails, inspect the stopped maintenance task:

aws ecs list-tasks \
  --cluster <cluster> \
  --desired-status STOPPED \
  --region us-west-1

aws ecs describe-tasks --cluster <cluster> --tasks <task-arn> --region us-west-1

Then check maintenance logs:

aws logs tail /vega/dev/vega-maintenance --region us-west-1 --since 30m

Common causes:

- Database connection refused → check VEGA_DATABASE_URL and security groups
- Migration already partially applied → the migration script tracks applied migrations in a schema_migrations table; inspect it to see what's applied

S3 artifact missing

1. Confirm VEGA_FILE_STORAGE_BACKEND=s3 in the runner task definition.
2. Confirm VEGA_S3_ARTIFACTS_BUCKET is set to the correct bucket name.
3. Check that the runner task role has s3:PutObject permission on the artifacts bucket.
4. Did the scan progress far enough to write the artifact? A very early failure won't produce vega-core-report.json.
5. Check runner logs around the artifact upload step.