Operations Runbooks
This page contains step-by-step command sequences for routine AWS operations. Use it when you need to deploy, migrate, or manage the Vega infrastructure.
Prerequisites
Install the following tools:
- AWS CLI: configured with the nebu-admin profile or equivalent
- Terraform: version compatible with the modules
- Docker: for building images
- curl: for smoke tests
Verify your AWS identity:
aws sts get-caller-identity
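If you work with more than one AWS account, the identity check can be turned into a guard that refuses to continue on the wrong account. The following is a minimal sketch; the function name require_account and the account ID passed to it are illustrative and are not part of the Vega scripts.

```shell
# Sketch: abort early when the active credentials belong to the wrong account.
# require_account is a hypothetical helper, not something the Vega scripts define.
require_account() {
  expected="$1"
  # --query/--output reduce the JSON response to the bare account ID.
  actual="$(aws sts get-caller-identity --query Account --output text)" || return 1
  if [ "$actual" != "$expected" ]; then
    echo "Refusing to continue: logged in to account $actual, expected $expected" >&2
    return 1
  fi
  echo "Using account $actual"
}
```

Call it as require_account followed by your 12-digit account ID before running any of the sequences below.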
Set the region if needed:
export AWS_REGION=us-west-1
Bootstrap Terraform state (first time only)
Run this once per environment before the very first Terraform plan. It creates the S3 bucket and DynamoDB table that Terraform uses to store and lock its state.
ENV=dev AWS_REGION=us-west-1 scripts/aws/bootstrap-terraform-state.sh
Once the state bucket and lock table exist, all subsequent plans and applies use them, so you do not need to run the script again for that environment.
Deployment sequences
Deploy a backend code change
Use this when you've changed Python code, scripts, or Docker configuration but not the database schema:
scripts/aws/build-images.sh dev # build new Docker images
scripts/aws/push-images.sh dev # push to ECR
scripts/aws/deploy-services.sh dev # force new ECS deployment
scripts/aws/smoke-test.sh dev # verify the deployed API is healthy
Deploy a database schema change
Use this when you've added a migration file:
scripts/aws/terraform-plan.sh dev # review infra changes (if any)
scripts/aws/terraform-apply.sh dev # apply infra changes
scripts/aws/build-images.sh dev # build new Docker images
scripts/aws/push-images.sh dev # push to ECR
scripts/aws/run-migrations.sh dev # run migrations BEFORE deploying new code
scripts/aws/deploy-services.sh dev # force new ECS deployment
scripts/aws/smoke-test.sh dev # verify
Run migrations before deploying new code
If new application code expects a database column that doesn't exist yet, it will crash on startup. Always run migrations first.
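One way to enforce this ordering is to run the steps through a small wrapper that stops at the first failure, so a failed migration never reaches the deploy step. This is a sketch only: run_steps and the SCRIPTS_DIR variable are hypothetical, and the wrapper assumes each script takes the environment name as its first argument, as in the sequences above.

```shell
# Sketch: run each step in order and stop at the first failure.
# SCRIPTS_DIR is a hypothetical override; it defaults to the scripts/aws path used on this page.
run_steps() {
  env_name="$1"
  shift
  for step in "$@"; do
    echo "==> $step $env_name"
    "${SCRIPTS_DIR:-scripts/aws}/$step.sh" "$env_name" || {
      echo "Step $step failed; skipping the remaining steps" >&2
      return 1
    }
  done
}

# Schema-change sequence from this page: migrations run before the deploy.
# run_steps dev terraform-plan terraform-apply build-images push-images \
#   run-migrations deploy-services smoke-test
```

If run-migrations fails, deploy-services and smoke-test are never invoked and the wrapper returns a non-zero status.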
Deploy an infrastructure-only change
Use this when you've only changed Terraform files:
scripts/aws/terraform-plan.sh dev # review what will change
scripts/aws/terraform-apply.sh dev # apply changes
scripts/aws/smoke-test.sh dev # verify
Individual operations
Build Docker images
scripts/aws/build-images.sh dev
Builds all service images:
- vega-api
- vega-worker
- vega-maintenance
- vega-llm-proxy
- vega-v16-runner
Each image receives two tags: the current Git SHA (for traceability) and dev-current (for the ECS task definition to reference).
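The two-tag scheme can be reproduced by hand with a single docker build that passes -t twice. This is a sketch, not the contents of build-images.sh: the short SHA, the build context ".", and the function name build_with_tags are assumptions.

```shell
# Sketch: build one image and apply both tags in a single docker build.
# The <env>-current tag follows the "dev-current" naming on this page;
# the short SHA and the "." build context are assumptions.
build_with_tags() {
  image="$1"     # e.g. vega-api
  env_name="$2"  # e.g. dev
  sha="$(git rev-parse --short HEAD)" || return 1
  docker build -t "$image:$sha" -t "$image:$env_name-current" .
}
```

For example, build_with_tags vega-api dev produces vega-api:<sha> and vega-api:dev-current from the same build.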
Push images to ECR
ECR (Elastic Container Registry) is AWS's Docker image registry. ECS pulls container images from ECR when starting tasks.
scripts/aws/push-images.sh dev
This script logs in to ECR, tags the local images with the full ECR repository URL, and pushes both the SHA and dev-current tags.
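For reference, the three steps look roughly like this when done manually. The registry host format is the standard ECR one; the helper names are hypothetical, and the sketch assumes each ECR repository is named after its image, which this page does not confirm.

```shell
# Sketch of the login, tag, and push steps. Helper names are hypothetical.
ecr_registry() {
  account_id="$1"
  region="$2"
  echo "$account_id.dkr.ecr.$region.amazonaws.com"
}

ecr_login() {
  registry="$1"
  region="$2"
  # get-login-password prints a temporary token; docker reads it from stdin.
  aws ecr get-login-password --region "$region" |
    docker login --username AWS --password-stdin "$registry"
}

push_image() {
  registry="$1"
  image="$2"
  tag="$3"
  # Assumes the ECR repository name equals the local image name.
  docker tag "$image:$tag" "$registry/$image:$tag" &&
    docker push "$registry/$image:$tag"
}
```

A manual push of one image would call ecr_login once and then push_image twice, once for the SHA tag and once for dev-current.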
Deploy ECS services
scripts/aws/deploy-services.sh dev
Forces a new deployment for the vega-api, vega-worker, and vega-llm-proxy ECS services. ECS replaces the running tasks with new ones launched from the current task definition (which points to the dev-current images), and the script waits until all services report stable.
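The equivalent AWS CLI calls are sketched below for cases where you need to redeploy a single service by hand. The function name force_redeploy is hypothetical, and this sketch assumes the ECS service names match the names listed above; deploy-services.sh may resolve them differently.

```shell
# Sketch: force a new deployment for each service, then wait for stability.
force_redeploy() {
  cluster="$1"
  shift
  for service in "$@"; do
    aws ecs update-service --cluster "$cluster" --service "$service" \
      --force-new-deployment --query 'service.serviceName' --output text || return 1
  done
  # Blocks until every listed service reaches a steady state, or the waiter times out.
  aws ecs wait services-stable --cluster "$cluster" --services "$@"
}
```

For example: force_redeploy "$cluster_name" vega-api, where cluster_name holds the value of the ecs_cluster_name Terraform output.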
Run database migrations
scripts/aws/run-migrations.sh dev
Launches the vega-maintenance ECS task with the migration command, waits for the task to complete, and fails with a non-zero exit code if the container exits with an error.
For production:
scripts/aws/run-migrations.sh prod
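The wait-and-check behavior described above can be approximated with two AWS CLI calls once you have a task ARN. This is a sketch under assumptions: the task has a single container, the ARN comes from an earlier aws ecs run-task call that is not shown, and wait_for_task is a hypothetical name.

```shell
# Sketch: wait for a one-off ECS task and turn its container exit code
# into the function's return status. Assumes a single-container task.
wait_for_task() {
  cluster="$1"
  task_arn="$2"
  aws ecs wait tasks-stopped --cluster "$cluster" --tasks "$task_arn" || return 1
  exit_code="$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].containers[0].exitCode' --output text)" || return 1
  # A container that never started reports no exit code, which also counts as failure here.
  if [ "$exit_code" != "0" ]; then
    echo "Migration task exited with code $exit_code" >&2
    return 1
  fi
  echo "Migration task succeeded"
}
```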
Run smoke tests
scripts/aws/smoke-test.sh dev
Checks that the deployed API health endpoint returns 200. To override the URL:
API_BASE_URL=https://api.dev.vega.example.com scripts/aws/smoke-test.sh dev
If Terraform is installed, the script reads api_base_url from Terraform outputs automatically.
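A manual version of the same check is a single curl call that compares the HTTP status code. The /health path below is an assumption (this page only says "health endpoint"), and check_health is a hypothetical helper name.

```shell
# Sketch: succeed only when the health endpoint answers with HTTP 200.
# The /health path is an assumption; substitute the real path if it differs.
check_health() {
  base_url="$1"
  # -o /dev/null discards the body; -w prints only the status code.
  status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$base_url/health")"
  if [ "$status" = "200" ]; then
    echo "healthy"
  else
    echo "unhealthy: HTTP $status" >&2
    return 1
  fi
}
```

For example: check_health "$(terraform -chdir=infra/terraform/envs/dev output -raw api_base_url)".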
Manual AWS operations
Plan infrastructure changes
scripts/aws/terraform-plan.sh dev
Apply infrastructure changes
scripts/aws/terraform-apply.sh dev
# Production requires explicit confirmation:
scripts/aws/terraform-apply.sh prod --confirm-prod
Update a secret
AWS Secrets Manager stores encrypted secrets. Terraform creates the secret structure; you populate the value separately.
aws secretsmanager put-secret-value \
--region us-west-1 \
--secret-id <secret-id> \
--secret-string '{"key":"value"}'
After updating a secret, redeploy the service that reads it so ECS picks up the new value:
scripts/aws/deploy-services.sh dev
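To keep secret values out of your shell history, the AWS CLI can read the secret string from a file using the file:// prefix. The following sketch wraps that call; put_secret_from_file is a hypothetical helper, and the region falls back to us-west-1 as used elsewhere on this page.

```shell
# Sketch: upload a secret value from a local JSON file instead of the command line.
put_secret_from_file() {
  secret_id="$1"
  json_file="$2"
  [ -r "$json_file" ] || { echo "Cannot read $json_file" >&2; return 1; }
  aws secretsmanager put-secret-value \
    --region "${AWS_REGION:-us-west-1}" \
    --secret-id "$secret_id" \
    --secret-string "file://$json_file"
}
```

Delete the local file once the upload succeeds.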
Restart services without new images
Force ECS to replace running tasks with fresh tasks using the current task definition:
scripts/aws/deploy-services.sh dev
This is useful when you've updated a secret or environment variable in the task definition.
Check ECS service status
aws ecs describe-services \
--region us-west-1 \
--cluster "$(terraform -chdir=infra/terraform/envs/dev output -raw ecs_cluster_name)" \
--services vega-api vega-worker vega-llm-proxy
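The full describe-services response is verbose. A --query filter can reduce it to one line per service, which a short loop can then scan for services running fewer tasks than desired. This is a sketch; service_counts and report_degraded are hypothetical helper names.

```shell
# Sketch: print "name desired running" per service, tab-separated.
service_counts() {
  cluster="$1"
  shift
  aws ecs describe-services --region "${AWS_REGION:-us-west-1}" \
    --cluster "$cluster" --services "$@" \
    --query 'services[].[serviceName,desiredCount,runningCount]' --output text
}

# Reads the lines produced above on stdin and reports any service
# that is running fewer tasks than desired. Returns 1 if any is found.
report_degraded() {
  degraded=0
  while read -r name desired running; do
    [ -n "$name" ] || continue
    if [ "$running" -lt "$desired" ]; then
      echo "$name: $running/$desired tasks running"
      degraded=1
    fi
  done
  return "$degraded"
}
```

Usage: service_counts "$cluster_name" vega-api vega-worker vega-llm-proxy | report_degraded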
Check SQS queue depth
aws sqs get-queue-attributes \
--region us-west-1 \
--queue-url <queue-url> \
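--attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
A large or steadily growing ApproximateNumberOfMessages means scans are waiting too long; the worker may be down. ApproximateNumberOfMessagesNotVisible counts messages that a worker has received but not yet deleted. The age of the oldest message is not a queue attribute and cannot be requested here; it is published as the CloudWatch metric ApproximateAgeOfOldestMessage (in seconds) in the AWS/SQS namespace.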
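A backlog check on the visible message count can be scripted as below. This is a sketch: the threshold is an arbitrary value you choose, and queue_depth and check_backlog are hypothetical helper names.

```shell
# Sketch: read the visible message count as a bare number.
queue_depth() {
  aws sqs get-queue-attributes --region "${AWS_REGION:-us-west-1}" \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' --output text
}

# Fail when the depth exceeds a threshold chosen by the operator.
check_backlog() {
  queue_url="$1"
  threshold="$2"
  depth="$(queue_depth "$queue_url")" || return 1
  if [ "$depth" -gt "$threshold" ]; then
    echo "Queue depth $depth exceeds threshold $threshold" >&2
    return 1
  fi
  echo "Queue depth $depth is within threshold $threshold"
}
```

For example: check_backlog <queue-url> 100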
Tail CloudWatch logs
# Follow live logs for the API
aws logs tail /vega/dev/vega-api \
--region us-west-1 \
--since 10m \
--follow
# View last hour of worker logs
aws logs tail /vega/dev/vega-worker \
--region us-west-1 \
--since 1h
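When you only care about failures, aws logs tail accepts a --filter-pattern option. The sketch below wraps it for any service and environment. It assumes log groups follow the /vega/<env>/<service> pattern shown above and that error lines contain the literal text ERROR, which depends on the application's log format; tail_errors is a hypothetical helper name.

```shell
# Sketch: show only log lines matching ERROR for one service.
# Assumes the /vega/<env>/<service> log group naming seen on this page.
tail_errors() {
  service="$1"       # e.g. vega-worker
  env_name="$2"      # e.g. dev
  since="${3:-1h}"   # defaults to the last hour
  aws logs tail "/vega/$env_name/$service" \
    --region "${AWS_REGION:-us-west-1}" \
    --since "$since" \
    --filter-pattern "ERROR"
}
```

For example: tail_errors vega-worker dev 30m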