Operations Runbooks

This page contains step-by-step command sequences for routine AWS operations. Use it when you need to deploy, migrate, or manage the Vega infrastructure.

Prerequisites

Install the following tools:

  • AWS CLI: configured with the nebu-admin profile or an equivalent
  • Terraform: a version compatible with the modules
  • Docker: for building images
  • curl: for smoke tests
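A quick way to confirm the tools above are on your PATH is a small preflight check. This is a minimal sketch: the function names `missing_tools` and `preflight` are hypothetical and are not part of the repository's scripts.

```shell
# Print the name of every listed tool that is not on PATH, one per line.
# Hypothetical helper, not part of scripts/aws/.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Return non-zero and list what is missing, if anything.
preflight() {
  missing=$(missing_tools aws terraform docker curl)
  if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
}
```

Calling `preflight` before a deployment sequence fails early instead of halfway through a build.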

Verify your AWS identity:

aws sts get-caller-identity

Set the region if needed:

export AWS_REGION=us-west-1

Bootstrap Terraform state (first time only)

Run this once per environment before the very first Terraform plan. It creates the S3 bucket and DynamoDB table that Terraform uses to store and lock its state.

ENV=dev AWS_REGION=us-west-1 scripts/aws/bootstrap-terraform-state.sh

This is a one-time step. Afterwards the state bucket and lock table exist, and all subsequent plans and applies use them.
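For orientation, a state bootstrap of this kind usually comes down to three AWS CLI calls. The sketch below is an illustration under assumptions, not the contents of bootstrap-terraform-state.sh: the function name, bucket name, and table name are hypothetical, and the real script may add encryption, tagging, or existence checks.

```shell
# Sketch of a Terraform state bootstrap (hypothetical helper).
# Arguments: bucket name, lock table name, region.
bootstrap_state() {
  bucket=$1 table=$2 region=$3

  # S3 bucket for the state file. Regions other than us-east-1 need an
  # explicit LocationConstraint.
  aws s3api create-bucket \
    --bucket "$bucket" \
    --region "$region" \
    --create-bucket-configuration "LocationConstraint=$region" || return 1

  # Versioning keeps earlier state files recoverable.
  aws s3api put-bucket-versioning \
    --bucket "$bucket" \
    --region "$region" \
    --versioning-configuration Status=Enabled || return 1

  # Lock table. The Terraform S3 backend expects a string hash key named LockID.
  aws dynamodb create-table \
    --region "$region" \
    --table-name "$table" \
    --attribute-definitions AttributeName=LockID,AttributeType=S \
    --key-schema AttributeName=LockID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST
}
```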


Deployment sequences

Deploy a backend code change

Use this when you've changed Python code, scripts, or Docker configuration but not the database schema:

scripts/aws/build-images.sh dev      # build new Docker images
scripts/aws/push-images.sh dev       # push to ECR
scripts/aws/deploy-services.sh dev   # force new ECS deployment
scripts/aws/smoke-test.sh dev        # verify the deployed API is healthy

Deploy a database schema change

Use this when you've added a migration file:

scripts/aws/terraform-plan.sh dev    # review infra changes (if any)
scripts/aws/terraform-apply.sh dev   # apply infra changes
scripts/aws/build-images.sh dev      # build new Docker images
scripts/aws/push-images.sh dev       # push to ECR
scripts/aws/run-migrations.sh dev    # run migrations BEFORE deploying new code
scripts/aws/deploy-services.sh dev   # force new ECS deployment
scripts/aws/smoke-test.sh dev        # verify

Run migrations before deploying new code

If new application code expects a database column that doesn't exist yet, it will crash on startup. Always run migrations first.
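The ordering rule only helps if a failed migration actually stops the deploy. One way to get that behaviour when chaining the scripts by hand is a small fail-fast wrapper. This is a sketch: `run_sequence` is a hypothetical helper, not something the repository provides.

```shell
# Run each command line in order and stop at the first one that fails.
# Each argument is one full command line (hypothetical helper).
run_sequence() {
  for step in "$@"; do
    printf '==> %s\n' "$step"
    sh -c "$step" || {
      printf 'FAILED: %s\n' "$step" >&2
      return 1
    }
  done
}

# Example: the tail of the schema-change sequence, migrations first.
# run_sequence \
#   "scripts/aws/run-migrations.sh dev" \
#   "scripts/aws/deploy-services.sh dev" \
#   "scripts/aws/smoke-test.sh dev"
```

If the migration step exits non-zero, the deploy and smoke-test steps never run.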

Deploy an infrastructure-only change

Use this when you've only changed Terraform files:

scripts/aws/terraform-plan.sh dev    # review what will change
scripts/aws/terraform-apply.sh dev   # apply changes
scripts/aws/smoke-test.sh dev        # verify

Individual operations

Build Docker images

scripts/aws/build-images.sh dev

Builds all service images:

  • vega-api
  • vega-worker
  • vega-maintenance
  • vega-llm-proxy
  • vega-v16-runner

Each image receives two tags: the current Git SHA (for traceability) and dev-current (for the ECS task definition to reference).
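The two-tag scheme can be sketched as follows. Everything here is illustrative: the helper names, the use of a short SHA, and the `docker/<name>.Dockerfile` path are assumptions, and build-images.sh may do this differently.

```shell
# Print the two tags an image would receive, one per line.
# Arguments: image name, environment, Git SHA.
image_tags() {
  printf '%s:%s\n' "$1" "$3"
  printf '%s:%s-current\n' "$1" "$2"
}

# Build one image and apply both tags in a single docker build.
# The Dockerfile location is an assumption.
build_image() {
  name=$1 environment=$2
  sha=$(git rev-parse --short HEAD) || return 1
  set --
  for tag in $(image_tags "$name" "$environment" "$sha"); do
    set -- "$@" -t "$tag"
  done
  docker build "$@" -f "docker/$name.Dockerfile" .
}
```

The SHA tag is immutable and identifies exactly which commit is in an image, while the `-current` tag is a moving pointer that the task definition can reference without being edited on every build.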

Push images to ECR

ECR (Elastic Container Registry) is AWS's Docker image registry. ECS pulls container images from ECR when starting tasks.

scripts/aws/push-images.sh dev

This script logs in to ECR, tags the local images with the full ECR repository URL, and pushes both the SHA and dev-current tags.
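The login, tag, and push steps described above look roughly like this for a single image. It is a sketch under assumptions: the helper names are hypothetical, and it assumes the ECR repository name matches the local image name.

```shell
# Registry host for an AWS account ID and region.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# Log in, then tag and push one local tag.
# Arguments: account ID, region, image name, tag.
push_image() {
  registry=$(ecr_registry "$1" "$2")

  # get-login-password prints a short-lived token; docker reads it on stdin.
  aws ecr get-login-password --region "$2" |
    docker login --username AWS --password-stdin "$registry" || return 1

  docker tag "$3:$4" "$registry/$3:$4" || return 1
  docker push "$registry/$3:$4"
}
```

Running `push_image` once for the SHA tag and once for `dev-current` would cover both tags.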

Deploy ECS services

scripts/aws/deploy-services.sh dev

Forces a new deployment for the vega-api, vega-worker, and vega-llm-proxy ECS services. ECS stops the old tasks and starts new ones using the current task definition (which points to dev-current images). The script waits for all services to report stable.
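In AWS CLI terms, forcing a deployment and waiting for stability can be sketched like this. The function name `force_redeploy` is hypothetical, and deploy-services.sh may resolve the cluster name and handle errors differently.

```shell
# Force a fresh deployment of each service, then wait for stability.
# Arguments: cluster name, then one or more service names.
force_redeploy() {
  cluster=$1
  shift
  for service in "$@"; do
    aws ecs update-service \
      --cluster "$cluster" \
      --service "$service" \
      --force-new-deployment >/dev/null || return 1
  done

  # Blocks until every listed service is stable, or the waiter times out.
  aws ecs wait services-stable --cluster "$cluster" --services "$@"
}
```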

Run database migrations

scripts/aws/run-migrations.sh dev

Launches the vega-maintenance ECS task with the migration command, waits for the task to complete, and fails with a non-zero exit code if the container exits with an error.
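The launch, wait, and exit-code check can be sketched with three AWS CLI calls. This is an assumption-laden illustration rather than the contents of run-migrations.sh: `run_oneoff_task` is a hypothetical name, and launch type and network configuration are omitted, so a Fargate task would need those flags added.

```shell
# Run a one-off ECS task and succeed only if its first container exits 0.
# Arguments: cluster, task definition, overrides JSON.
run_oneoff_task() {
  cluster=$1 taskdef=$2 overrides=$3

  task_arn=$(aws ecs run-task \
    --cluster "$cluster" \
    --task-definition "$taskdef" \
    --overrides "$overrides" \
    --query 'tasks[0].taskArn' --output text) || return 1

  aws ecs wait tasks-stopped --cluster "$cluster" --tasks "$task_arn" || return 1

  exit_code=$(aws ecs describe-tasks \
    --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].containers[0].exitCode' --output text) || return 1

  # "None" means the container never reported an exit code; treat as failure.
  [ "$exit_code" = "0" ]
}
```

An overrides value might look like `'{"containerOverrides":[{"name":"vega-maintenance","command":["migrate"]}]}'`, where both the container name and the `migrate` command are placeholders.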

For production:

scripts/aws/run-migrations.sh prod

Run smoke tests

scripts/aws/smoke-test.sh dev

Checks that the deployed API health endpoint returns 200. To override the URL:

API_BASE_URL=https://api.dev.vega.example.com scripts/aws/smoke-test.sh dev

If Terraform is installed, the script reads api_base_url from Terraform outputs automatically.
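The URL resolution and health check can be sketched as below. Two things are assumptions here: that an explicit API_BASE_URL takes precedence over the Terraform output, and that the health endpoint lives at `/health`. The function names are hypothetical.

```shell
# Pick the base URL: API_BASE_URL if set, otherwise the Terraform output.
resolve_api_base_url() {
  environment=$1
  if [ -n "${API_BASE_URL:-}" ]; then
    printf '%s\n' "$API_BASE_URL"
  elif command -v terraform >/dev/null 2>&1; then
    terraform -chdir="infra/terraform/envs/$environment" output -raw api_base_url
  else
    return 1
  fi
}

# Succeed only if the health endpoint answers 200.
# The /health path is an assumption.
smoke_test() {
  base=$(resolve_api_base_url "$1") || return 1
  status=$(curl -sS -o /dev/null -w '%{http_code}' "$base/health") || return 1
  [ "$status" = "200" ]
}
```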


Manual AWS operations

Plan infrastructure changes

scripts/aws/terraform-plan.sh dev

Apply infrastructure changes

scripts/aws/terraform-apply.sh dev

# Production requires explicit confirmation:
scripts/aws/terraform-apply.sh prod --confirm-prod

Update a secret

AWS Secrets Manager stores encrypted secrets. Terraform creates the secret structure; you populate the value separately.

aws secretsmanager put-secret-value \
  --region us-west-1 \
  --secret-id <secret-id> \
  --secret-string '{"key":"value"}'

After updating a secret, redeploy the service that reads it so ECS picks up the new value:

scripts/aws/deploy-services.sh dev

Restart services without new images

Force ECS to replace running tasks with fresh tasks using the current task definition:

scripts/aws/deploy-services.sh dev

This is useful when you've updated a secret or environment variable in the task definition.

Check ECS service status

aws ecs describe-services \
  --region us-west-1 \
  --cluster "$(terraform -chdir=infra/terraform/envs/dev output -raw ecs_cluster_name)" \
  --services vega-api vega-worker vega-llm-proxy
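The full describe-services response is long. If you only need to see whether each service is running its desired number of tasks, a `--query` filter trims it down. A minimal sketch, with a hypothetical function name:

```shell
# One row per service: name, status, desired and running task counts.
# Arguments: cluster name, then one or more service names.
ecs_service_summary() {
  cluster=$1
  shift
  aws ecs describe-services \
    --region "${AWS_REGION:-us-west-1}" \
    --cluster "$cluster" \
    --services "$@" \
    --query 'services[].[serviceName,status,desiredCount,runningCount]' \
    --output text
}
```

A service whose running count stays below its desired count is worth a look in the logs.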

Check SQS queue depth

aws sqs get-queue-attributes \
  --region us-west-1 \
  --queue-url <queue-url> \
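  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

ApproximateNumberOfMessages counts messages waiting to be picked up; ApproximateNumberOfMessagesNotVisible counts messages a worker has received but not yet deleted. ApproximateAgeOfOldestMessage is not a queue attribute: it is a CloudWatch metric in the AWS/SQS namespace, reported in seconds. A large value there means scans are waiting too long, and the worker may be down.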
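SQS publishes ApproximateAgeOfOldestMessage as a CloudWatch metric in the AWS/SQS namespace, so it can be read with the CloudWatch CLI. The sketch below is illustrative: the function names and the idea of a fixed staleness threshold are assumptions, and the caller supplies the time window.

```shell
# Maximum age (seconds) of the oldest message over a time window.
# Arguments: queue name, start time, end time (ISO 8601, UTC).
oldest_message_age() {
  aws cloudwatch get-metric-statistics \
    --region "${AWS_REGION:-us-west-1}" \
    --namespace AWS/SQS \
    --metric-name ApproximateAgeOfOldestMessage \
    --dimensions "Name=QueueName,Value=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 300 \
    --statistics Maximum \
    --query 'max(Datapoints[].Maximum)' \
    --output text
}

# Return 0 if the age exceeds the threshold, 1 if not,
# and 2 if the age is not a number (for example "None").
# Arguments: age in seconds, threshold in seconds.
queue_is_stale() {
  age=${1%.*}   # drop any fractional part
  case "$age" in
    '' | *[!0-9]*) return 2 ;;
  esac
  [ "$age" -gt "$2" ]
}
```

For example, `queue_is_stale "$(oldest_message_age <queue-name> <start> <end>)" 600` would flag a queue whose oldest message has waited more than ten minutes.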

Tail CloudWatch logs

# Follow live logs for the API
aws logs tail /vega/dev/vega-api \
  --region us-west-1 \
  --since 10m \
  --follow

# View last hour of worker logs
aws logs tail /vega/dev/vega-worker \
  --region us-west-1 \
  --since 1h
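When a log group is noisy, `aws logs tail` can filter on the server side with `--filter-pattern`. A small sketch, assuming the services log the literal word ERROR on failure lines; the function name is hypothetical.

```shell
# Show recent log lines that contain ERROR.
# Arguments: log group, look-back window (defaults to 10m).
tail_errors() {
  aws logs tail "$1" \
    --region "${AWS_REGION:-us-west-1}" \
    --since "${2:-10m}" \
    --filter-pattern ERROR \
    --format short
}
```

For example, `tail_errors /vega/dev/vega-worker 1h` would list the last hour of matching worker log lines.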