Debugging Vega on AWS
When something breaks in AWS, the key is to follow the request or scan path one hop at a time. Don't start by changing Terraform or redeploying. First find out where the failure is.
General approach
1. Run the smoke test
2. Check whether the frontend loads
3. Check /v1/healthz
4. Check CloudWatch logs for the failing service
5. Check that service's dependencies (Postgres, S3, SQS, Cognito)
6. Fix the specific failure
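Step 3 can be scripted so the result points at the next hop to inspect. The following is a minimal sketch, not part of the Vega scripts: check_healthz and triage_healthz are hypothetical helper names, and it assumes /v1/healthz answers 200 when the API is healthy.

```bash
# Probe the API health endpoint and print only the HTTP status code.
# The base URL is passed in; substitute your environment's API host.
check_healthz() {
  base_url="$1"
  # -s: silent, -o /dev/null: discard the body, -w: print the status code,
  # --max-time: do not hang on an unreachable host.
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${base_url}/v1/healthz"
}

# Turn the status code into a hint about which hop to inspect next.
triage_healthz() {
  code="$(check_healthz "$1")"
  case "$code" in
    200) echo "API healthy: look at the worker, runner, or frontend" ;;
    000) echo "no HTTP response: check DNS, CloudFront, ALB, and security groups" ;;
    5??) echo "HTTP $code: check ECS service status and API logs" ;;
    *)   echo "HTTP $code: check routing and auth configuration" ;;
  esac
}
```

For example, triage_healthz https://api.dev.vega.example.com prints one line of guidance. curl reports 000 when it never received an HTTP response at all.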
Tail logs quickly
Almost every AWS debugging session starts here:
# Follow live logs (Ctrl+C to stop)
aws logs tail /vega/dev/vega-api \
--region us-west-1 \
--since 10m \
--follow
# View logs without following
aws logs tail /vega/dev/vega-worker \
--region us-west-1 \
--since 1h
Adjust the log group name for the service you're debugging: vega-api, vega-worker, vega-llm-proxy, vega-maintenance, vega-v16-runner.
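Since only the last path segment changes between services, a small wrapper saves retyping. This is a sketch under the assumption that every log group follows the /vega/<env>/<service> pattern seen in the commands above; vega_logs and the VEGA_ENV variable are hypothetical names introduced for illustration.

```bash
# Tail a Vega service's log group.
# Assumes the /vega/<env>/<service> naming; env defaults to dev
# and the time window defaults to 10 minutes.
vega_logs() {
  service="$1"
  since="${2:-10m}"
  env_name="${VEGA_ENV:-dev}"   # VEGA_ENV is hypothetical, not a Vega setting
  aws logs tail "/vega/${env_name}/${service}" \
    --region us-west-1 \
    --since "$since"
}
```

Usage: vega_logs vega-worker 1h shows the last hour of worker logs.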
Frontend not loading
The frontend is static files in S3, served by CloudFront.
Check:
1. CloudFront distribution status in the AWS console (should be Deployed)
2. The S3 frontend bucket — does it contain index.html and the JS/CSS bundles?
3. Browser DevTools → Network tab → what's failing?
4. Was the frontend recently built and uploaded? The deploy scripts do this automatically, but if you deployed the backend only, the frontend files may be stale.
5. If API calls are failing from the frontend, check whether CloudFront is routing /v1/* to the correct origin.
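Checks 1 and 2 can also be done from the CLI instead of the console. A minimal sketch: cf_status and frontend_files are hypothetical helper names, and the distribution ID and bucket name are values you supply.

```bash
# Print the CloudFront distribution status
# ("Deployed" once configuration changes have propagated).
cf_status() {
  aws cloudfront get-distribution --id "$1" \
    --query 'Distribution.Status' --output text
}

# List top-level objects in the frontend bucket with their timestamps,
# to confirm index.html exists and to see when it was last uploaded.
frontend_files() {
  aws s3 ls "s3://$1/"
}
```

An old timestamp on index.html after a backend-only deploy is consistent with the stale-frontend case in check 4.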
API is down
Start with the smoke test:
scripts/aws/smoke-test.sh dev
If that fails, check ECS:
aws ecs describe-services \
--region us-west-1 \
--cluster <cluster-name> \
--services vega-api
Look at runningCount, pendingCount, and events. Then get the stopped task reason:
aws ecs list-tasks \
--region us-west-1 \
--cluster <cluster-name> \
--service-name vega-api \
--desired-status STOPPED
aws ecs describe-tasks \
--region us-west-1 \
--cluster <cluster-name> \
--tasks <task-arn>
Common causes:
- Container failed to start → check CloudWatch logs for startup errors
- Can't connect to Postgres → check VEGA_DATABASE_URL and the database security group
- Missing secret → check Secrets Manager for the secret referenced in the task definition
- Security group blocks traffic → check that the ALB can reach the API container on port 8000
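The describe-services and describe-tasks output is long, so it can help to project only the fields named above with --query. This is a sketch with hypothetical helper names (ecs_service_summary, ecs_stop_reason); the cluster name and task ARN are inputs you provide.

```bash
# Summarize an ECS service: desired/running/pending counts
# plus the five most recent event messages.
ecs_service_summary() {
  cluster="$1"; service="$2"
  aws ecs describe-services \
    --region us-west-1 \
    --cluster "$cluster" \
    --services "$service" \
    --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount,events:events[:5].message}' \
    --output json
}

# Print the stop reason and per-container exit codes for one stopped task.
ecs_stop_reason() {
  cluster="$1"; task_arn="$2"
  aws ecs describe-tasks \
    --region us-west-1 \
    --cluster "$cluster" \
    --tasks "$task_arn" \
    --query 'tasks[0].{stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}' \
    --output json
}
```

A runningCount below the desired count together with repeated events about tasks failing to start usually points at the first three causes in the list.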
Login failing
Users sign in through Cognito. The backend validates the resulting JWT.
Checklist:
1. What error does the browser show? Open DevTools → Network → look at the /v1/auth/login or Cognito request.
2. Confirm VEGA_AUTH_PROVIDER=cognito in the API task definition.
3. Confirm VEGA_COGNITO_REGION, VEGA_COGNITO_USER_POOL_ID, and VEGA_COGNITO_APP_CLIENT_ID are correct.
4. Test with a raw API call: copy the access token from the browser and try /v1/auth/me:
```bash
curl -s https://api.dev.vega.example.com/v1/auth/me \
-H "Authorization: Bearer <token>"
```
- If /v1/auth/me returns invalid_token, the JWT is expired or the Cognito configuration is wrong.
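To tell an expired token from a misconfigured pool, decode the token payload and read its claims. A minimal sketch, assuming a standard three-part JWT and a base64 that accepts -d (GNU coreutils and recent macOS); jwt_payload is a hypothetical helper name, and it does not verify the signature.

```bash
# Print the decoded JSON payload (second segment) of a JWT.
jwt_payload() {
  # Take the middle segment and convert base64url to standard base64.
  payload="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # Restore the padding that base64url strips.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}
```

In a Cognito access token, compare exp with the current epoch time (date +%s), check that iss ends with the value of VEGA_COGNITO_USER_POOL_ID, and check that client_id matches VEGA_COGNITO_APP_CLIENT_ID.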
Scan stuck in queued
A scan stays queued when the worker hasn't picked it up yet.
Check in order:
- Is the worker running?
  aws ecs describe-services \
  --region us-west-1 \
  --cluster <cluster> \
  --services vega-worker
- Is the SQS queue receiving messages?
  aws sqs get-queue-attributes \
  --region us-west-1 \
  --queue-url <queue-url> \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
  A non-zero ApproximateNumberOfMessages that keeps growing between checks means messages are piling up and the worker is not consuming them. ApproximateAgeOfOldestMessage is a CloudWatch metric in the AWS/SQS namespace, not a queue attribute, so read it from CloudWatch; a growing value there confirms the backlog.
- Check worker logs:
  aws logs tail /vega/dev/vega-worker --region us-west-1 --since 30m
- Is the worker's task role allowed to receive SQS messages? Check the IAM policy on the worker ECS task role.
- Can the worker reach Postgres? The worker needs to claim scans in the database. Check the security group rules and VEGA_DATABASE_URL.
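A single queue reading does not show whether the backlog is moving, so it can be useful to sample the depth twice. This is a minimal sketch with hypothetical helper names (queue_depth, queue_trend); the queue URL is an input you supply and the 30-second default interval is arbitrary.

```bash
# Print the approximate number of visible messages in a queue.
queue_depth() {
  aws sqs get-queue-attributes \
    --region us-west-1 \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}

# Sample the depth twice and report whether the backlog is draining.
queue_trend() {
  queue_url="$1"; interval="${2:-30}"
  first="$(queue_depth "$queue_url")"
  sleep "$interval"
  second="$(queue_depth "$queue_url")"
  if [ "$second" -eq 0 ]; then
    echo "empty: $first -> $second"
  elif [ "$second" -ge "$first" ]; then
    echo "not draining: $first -> $second"
  else
    echo "draining: $first -> $second"
  fi
}
```

An empty queue while a scan is still queued suggests the message was never enqueued or was already consumed, so the API logs and the database row are the next things to look at.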
Worker not launching runner tasks
If the worker is running but scans stay claimed and never reach running, the worker may be failing to launch ECS tasks.
Check:
- Worker logs for RunTask errors
- IAM task role — does it have ecs:RunTask, ecs:StopTask, ecs:DescribeTasks permissions?
- VEGA_SCAN_WORKER_EXECUTION_MODE — should be ecs in production
- Worker security group — needs outbound HTTPS to reach the ECS API endpoint
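The first two checks can be run from the CLI. The sketch below uses hypothetical helper names (check_worker_ecs_perms, worker_runtask_logs) and assumes you know the worker task role ARN. The IAM simulation here runs without resource ARNs or condition context, so treat its result as a hint rather than proof.

```bash
# Ask IAM whether the worker task role may call the ECS actions it needs.
# Prints one "<action> <decision>" line per action; any decision other
# than "allowed" points at a missing or denied permission.
check_worker_ecs_perms() {
  role_arn="$1"
  aws iam simulate-principal-policy \
    --policy-source-arn "$role_arn" \
    --action-names ecs:RunTask ecs:StopTask ecs:DescribeTasks \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text
}

# Show only worker log lines that mention RunTask.
worker_runtask_logs() {
  aws logs tail /vega/dev/vega-worker \
    --region us-west-1 \
    --since "${1:-30m}" \
    --filter-pattern RunTask
}
```

If RunTask is allowed but launches still fail with an access error, check iam:PassRole as well: the caller of RunTask needs it for the task and execution roles named in the runner task definition.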
Runner task failing
The runner is an ECS RunTask container. When it fails, ECS stops it.
Get the failure reason:
aws ecs describe-tasks \
--region us-west-1 \
--cluster <cluster> \
--tasks <task-arn>
Look at stoppedReason and containers[0].exitCode.
Check runner logs:
aws logs tail /vega/dev/vega-v16-runner --region us-west-1 --since 1h
Check S3 artifacts — if the runner got far enough, it may have uploaded a v16-debug-bundle.zip with full Codex state. Download it from the artifacts bucket and inspect the contents.
Common runner failures:
- Can't download source from S3 → check bucket permissions and VEGA_S3_SOURCE_BUCKET
- Codex calls failing → check LLM proxy (see below)
- Postgres write failing → check database security group and credentials
- Out of memory or CPU → increase task definition resource limits in Terraform
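The exitCode from describe-tasks follows the usual convention that codes above 128 mean the process was killed by a signal (128 plus the signal number). A small sketch for reading it; explain_exit_code is a hypothetical helper, and the hints are general Linux conventions, not Vega-specific behavior.

```bash
# Translate a container exit code into a first guess at the cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exit 0: container finished cleanly" ;;
    137) echo "exit 137: SIGKILL (128+9), commonly the out-of-memory killer" ;;
    139) echo "exit 139: SIGSEGV (128+11), the process crashed" ;;
    143) echo "exit 143: SIGTERM (128+15), the task was asked to stop" ;;
    *)   echo "exit $1: application error, read the runner logs" ;;
  esac
}
```

For instance, explain_exit_code 137 suggests the out-of-memory cause from the list; ECS often reports an out-of-memory message in the container's reason field in that case as well.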
Findings not persisting
If the scan completed but no findings appear:
- Did v16 actually emit finding_updated events? Check v16-events.jsonl in the S3 artifacts bucket.
- Did the backend adapter map them? Check runner logs for any finding_upsert errors.
- Is Postgres migration 004_findings_columns.sql applied? Check the schema_migrations table.
- Are frontend filters hiding the findings? Try clearing severity and status filters.
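Once v16-events.jsonl has been downloaded, counting the finding events answers the first question quickly. This is a sketch under assumptions about the file format, which is not specified here: it assumes one JSON event per line with the event name appearing as the quoted string "finding_updated". count_finding_events is a hypothetical helper name.

```bash
# Count lines in a local v16-events.jsonl that mention finding_updated.
# Assumes one JSON event per line and that the event name is quoted.
count_finding_events() {
  grep -c '"finding_updated"' "$1"
}
```

A count of zero means v16 emitted no findings, so the problem is upstream of the backend adapter. A non-zero count with nothing in the UI shifts attention to the adapter, the migration, and the frontend filters.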
LLM proxy / AI calls failing
The runner should call the LLM proxy, not the AI provider directly. The proxy holds the provider API key.
Check:
1. Is vega-llm-proxy running?
aws ecs describe-services --cluster <cluster> --services vega-llm-proxy --region us-west-1
2. Is VEGA_LLM_PROXY_BASE_URL set to the proxy's internal address?
3. Does the runner have a valid scan-scoped proxy token?
4. Is the provider API key correct in Secrets Manager? Check LLM proxy logs for provider authentication errors.
aws logs tail /vega/dev/vega-llm-proxy --region us-west-1 --since 30m
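For check 4, the secret's metadata shows whether the key was changed recently, without printing the key itself. A minimal sketch: secret_metadata is a hypothetical helper, and the secret ID is whatever the proxy task definition references.

```bash
# Show when a secret was last changed and last accessed,
# without retrieving or printing its value.
secret_metadata() {
  aws secretsmanager describe-secret \
    --region us-west-1 \
    --secret-id "$1" \
    --query '{name:Name,lastChanged:LastChangedDate,lastAccessed:LastAccessedDate}' \
    --output json
}
```

If provider authentication errors began around the lastChanged time, a rotation is a likely cause. Tasks that read the secret at startup keep the old value until they are restarted.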
Database migration failed
scripts/aws/run-migrations.sh dev
If it fails, inspect the stopped maintenance task:
aws ecs list-tasks \
--cluster <cluster> \
--desired-status STOPPED \
--region us-west-1
aws ecs describe-tasks --cluster <cluster> --tasks <task-arn> --region us-west-1
Then check maintenance logs:
aws logs tail /vega/dev/vega-maintenance --region us-west-1 --since 30m
Common causes:
- Database connection refused → check VEGA_DATABASE_URL and security groups
- Migration already partially applied → the migration script tracks applied migrations in a schema_migrations table; inspect it to see what's applied
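Inspecting the tracking table can be done with psql from a host that can reach the database. The sketch below assumes VEGA_DATABASE_URL is a connection URI that psql accepts; show_applied_migrations is a hypothetical helper, and SELECT * is used because the table's column names are not given here.

```bash
# Dump the migration tracking table.
# Run this from somewhere with network access to the database,
# for example a bastion host or a task inside the VPC.
show_applied_migrations() {
  psql "$VEGA_DATABASE_URL" -c 'SELECT * FROM schema_migrations;'
}
```

Compare the rows against the migration files in the repository to find the first one that is missing.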
S3 artifact missing
- Confirm VEGA_FILE_STORAGE_BACKEND=s3 in the runner task definition.
- Confirm VEGA_S3_ARTIFACTS_BUCKET is set to the correct bucket name.
- Check that the runner task role has s3:PutObject permission on the artifacts bucket.
- Did the scan progress far enough to write the artifact? A very early failure won't produce v16-report.json.
- Check runner logs around the artifact upload step.
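Before digging into permissions, it is worth confirming whether the object is really absent. A minimal sketch with hypothetical helper names (artifact_exists, report_artifact); the bucket and key are inputs, since the artifact key layout is not described here.

```bash
# Succeed if the object exists and the caller may read its metadata.
artifact_exists() {
  bucket="$1"; key="$2"
  aws s3api head-object --bucket "$bucket" --key "$key" >/dev/null 2>&1
}

# Print a one-line verdict for a given bucket and key.
report_artifact() {
  if artifact_exists "$1" "$2"; then
    echo "found: s3://$1/$2"
  else
    echo "missing or not readable: s3://$1/$2"
  fi
}
```

A failure here can mean either that the object does not exist or that your own credentials cannot read it, so rerun the head-object call without the output redirection to see the actual error.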