Terraform
Terraform is the tool Vega uses to define and manage AWS infrastructure. Instead of clicking things into existence in the AWS console, you write .tf files that describe what should exist, and Terraform figures out how to create or update it. This keeps infrastructure changes repeatable, reviewable, and version-controlled.
Repository layout
infra/terraform/
├── modules/                    ← Reusable building blocks
│   ├── network/                VPC, subnets, routing
│   ├── security/               Security groups
│   ├── database/               RDS Postgres instance
│   ├── storage/                S3 buckets and KMS keys
│   ├── registry/               ECR image registry
│   ├── queue/                  SQS scan queue
│   ├── observability/          CloudWatch log groups and alarms
│   ├── secrets/                Secrets Manager secrets (structure only)
│   ├── cognito/                Cognito user pool
│   ├── iam/                    IAM roles and policies for ECS tasks
│   ├── frontend_hosting/       CloudFront distribution + S3 frontend bucket
│   ├── api_smoke/              ECS cluster, API service, load balancer
│   ├── llm_proxy_service/      LLM proxy ECS service
│   ├── runner_task/            v16 runner task definition
│   ├── worker_service/         Worker ECS service
│   ├── maintenance_task/       Maintenance task definition
│   └── cost_controls/          AWS Budgets and anomaly detection
│
└── envs/                       ← Environment compositions
    ├── dev/
    │   ├── main.tf                     which modules to use and how
    │   ├── variables.tf                input variables
    │   ├── outputs.tf                  values exposed to scripts
    │   ├── backend.tf                  remote state configuration
    │   └── terraform.tfvars.example    non-secret values template
    └── prod/
        └── (same structure)
Modules are building blocks that encapsulate a set of related AWS resources. Environments (dev, prod) compose modules together with environment-specific values. Because the environments share the same modules, a change to one module affects every environment that uses it.
Terraform state
Terraform keeps a state file that records what AWS resources it has created and their current configuration. This state is how Terraform knows what to create, update, or delete when you run plan or apply.
Vega stores state remotely in S3 (durable) and uses DynamoDB for locking (prevents two people from running apply simultaneously). You need to bootstrap this once per environment before the first apply.
ENV=dev AWS_REGION=us-west-1 scripts/aws/bootstrap-terraform-state.sh
This creates:
- An S3 bucket with versioning and encryption for the Terraform state file
- A DynamoDB table for state locking
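To make the two resources concrete, here is a rough sketch of what a bootstrap of this kind typically does with the AWS CLI. It is not the contents of bootstrap-terraform-state.sh: the bucket and table names are hypothetical, AES256 encryption is an assumption, and the function only prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch only: NOT the real bootstrap-terraform-state.sh.
# Bucket/table names and the AES256 choice are assumptions for illustration.

# Print the command when DRY_RUN is unset or 1; execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

bootstrap_state() {
  local env="$1" region="$2"
  local bucket="example-tfstate-${env}"   # hypothetical name
  local table="example-tflock-${env}"     # hypothetical name

  # S3 bucket for the state file (regions other than us-east-1 need a LocationConstraint).
  run aws s3api create-bucket --bucket "$bucket" --region "$region" \
    --create-bucket-configuration "LocationConstraint=${region}"
  run aws s3api put-bucket-versioning --bucket "$bucket" \
    --versioning-configuration Status=Enabled
  run aws s3api put-bucket-encryption --bucket "$bucket" \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
  run aws s3api put-public-access-block --bucket "$bucket" \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

  # DynamoDB table for locking; the S3 backend expects a string hash key named LockID.
  run aws dynamodb create-table --table-name "$table" --region "$region" \
    --attribute-definitions AttributeName=LockID,AttributeType=S \
    --key-schema AttributeName=LockID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST
}
```

Calling bootstrap_state dev us-west-1 with DRY_RUN left at its default prints the five commands without touching AWS, which is a convenient way to review them first.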
State contains sensitive data
Terraform state can contain sensitive values like database endpoints. Keep the state bucket private and encrypted (the bootstrap script does this). Never commit .tfstate files to version control.
Planning changes
Always plan before you apply. Planning shows exactly what Terraform will create, update, or destroy — without making any changes to AWS.
scripts/aws/terraform-plan.sh dev
This runs:
1. terraform init — downloads provider plugins and configures the remote backend
2. terraform fmt -check — fails if files are not properly formatted
3. terraform validate — checks for syntax errors
4. terraform plan -out tfplan-dev — generates the plan and saves it to a file
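The four steps above can be sketched as a single shell function. This is an illustration of the sequence, not the actual terraform-plan.sh; the function name tf_plan is hypothetical, and the real script may pass additional flags.

```shell
# Sketch of the plan sequence; the real terraform-plan.sh may differ.
# Each step runs only if the previous one succeeded.
tf_plan() {
  local env="$1"
  local dir="infra/terraform/envs/${env}"

  terraform -chdir="$dir" init &&
  terraform -chdir="$dir" fmt -check &&
  terraform -chdir="$dir" validate &&
  terraform -chdir="$dir" plan -out "tfplan-${env}"
}
```

Chaining with && means a formatting or validation failure stops the run before a plan file is written, so a stale tfplan-dev is never mistaken for a fresh one. Because of -chdir, the plan file is written inside the environment directory.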
Read the plan output carefully before applying. Look for:
- Resources marked with + (will be created)
- Resources marked with ~ (will be updated in-place)
- Resources marked with -/+ (will be destroyed and recreated) — these are risky; database replacements cause data loss
- Resources marked with - (will be destroyed)
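For long plans it can help to pull out only the destructive entries. The sketch below filters the human-readable plan text for the header lines Terraform prints for replaced and destroyed resources. The function name risky_changes is hypothetical, and it complements a full read of the plan rather than replacing it.

```shell
# Reads plan text on stdin and prints only the "# <address> ..." header lines
# for resources that will be replaced or destroyed.
# "|| true" keeps the exit status 0 when nothing matches.
risky_changes() {
  grep -E '^ *# .* (must be replaced|will be destroyed)' || true
}
```

One way to use it, assuming the saved plan file from the previous step: terraform -chdir=infra/terraform/envs/dev show -no-color tfplan-dev | risky_changes. Empty output means no replacements or deletions were found by this filter.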
Applying changes
Apply uses the saved plan file, so what you reviewed is exactly what gets applied:
scripts/aws/terraform-apply.sh dev
For production:
scripts/aws/terraform-apply.sh prod --confirm-prod
The --confirm-prod flag is a safety check to prevent accidental prod changes.
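A guard of this kind is usually a few lines of argument checking. The following is a hypothetical sketch of how such a check could look, not the actual logic in terraform-apply.sh; the function name and exit codes are assumptions.

```shell
# Hypothetical guard; the real terraform-apply.sh may implement this differently.
# Returns 0 when the apply may proceed, non-zero otherwise.
check_apply_args() {
  local env="$1" flag="${2:-}"
  case "$env" in
    dev)
      return 0 ;;
    prod)
      if [ "$flag" = "--confirm-prod" ]; then
        return 0
      fi
      echo "Refusing to apply to prod without --confirm-prod" >&2
      return 1 ;;
    *)
      echo "Unknown environment: $env" >&2
      return 2 ;;
  esac
}
```

The point of the design is that the dangerous path needs an extra, deliberate argument, so a mistyped or recalled-from-history command cannot change prod by accident.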
Production apply requires review
Never run terraform-apply.sh prod without reviewing the plan output first. Database changes, IAM policy changes, and security group changes can have serious consequences.
Reading Terraform outputs
Outputs expose values that scripts need — cluster names, service names, bucket names, queue URLs, etc. Scripts use these instead of hard-coded resource IDs.
# Show all outputs for the dev environment
terraform -chdir=infra/terraform/envs/dev output
# Get a specific output value
terraform -chdir=infra/terraform/envs/dev output -raw ecs_cluster_name
terraform -chdir=infra/terraform/envs/dev output -raw api_base_url
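Inside a script, the same command is typically wrapped in a small helper so resource names are looked up rather than hard-coded. A minimal sketch follows; the helper name tf_output is hypothetical.

```shell
# Read a single Terraform output for an environment (dev or prod).
# -raw prints the bare value, which only works for string, number, and bool outputs.
tf_output() {
  local env="$1" name="$2"
  terraform -chdir="infra/terraform/envs/${env}" output -raw "$name"
}
```

For example, cluster="$(tf_output dev ecs_cluster_name)" captures the cluster name, which can then be passed to a command such as aws ecs list-services --cluster "$cluster". For list or map outputs, use terraform output -json instead of -raw.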
Environment files
Each environment directory contains:
| File | Purpose |
|---|---|
| main.tf | Which modules to instantiate and with which arguments |
| variables.tf | Declared input variables |
| outputs.tf | Values that are exported (used by scripts and operators) |
| backend.tf | Remote state configuration (S3 bucket + DynamoDB table) |
| terraform.tfvars.example | Template for non-secret variable values |
Copy terraform.tfvars.example to terraform.tfvars and fill in your values. Don't commit terraform.tfvars if it contains sensitive values.
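The copy step can be made safe to repeat with a small helper that never overwrites an existing file. This is a sketch; the function name init_tfvars is hypothetical.

```shell
# Create terraform.tfvars from the example template, but never overwrite
# an existing file that may already hold your edited values.
init_tfvars() {
  local dir="$1"
  if [ -e "${dir}/terraform.tfvars" ]; then
    echo "terraform.tfvars already exists in ${dir}; leaving it untouched" >&2
    return 0
  fi
  cp "${dir}/terraform.tfvars.example" "${dir}/terraform.tfvars"
}
```

Run it as init_tfvars infra/terraform/envs/dev. To check whether the resulting file is ignored by Git, git check-ignore -q infra/terraform/envs/dev/terraform.tfvars exits with status 0 when it is.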
Safe practices
- Always run terraform-plan.sh before terraform-apply.sh
- Read the full plan output before applying; never apply blind
- Never run prod apply from a dirty or unreviewed branch
- Don't put plaintext secrets into Terraform variables — use Secrets Manager (Terraform can create the secret structure; populate the value separately)
- Keep dev and prod in separate state files
- Use immutable image tags for production deployments
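The "dirty branch" rule is the one item here that is easy to check mechanically. A minimal sketch, assuming Git is the version control system in use; the function name require_clean_tree is hypothetical, and whether a branch has been reviewed still has to be verified by a person.

```shell
# Fail when the working tree has uncommitted or untracked changes.
# Returns 2 if the directory is not a Git repository.
require_clean_tree() {
  local repo="${1:-.}"
  local changes
  changes="$(git -C "$repo" status --porcelain)" || return 2
  if [ -n "$changes" ]; then
    echo "Working tree has uncommitted changes; commit or stash them before a prod apply" >&2
    return 1
  fi
}
```

Calling require_clean_tree at the top of a prod apply, before any Terraform command runs, ensures that what gets applied matches a commit that can be inspected later.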
Common errors
Missing AWS credentials
Error: No valid credential sources found
Run aws sts get-caller-identity to confirm you have valid credentials. Check AWS_PROFILE or AWS_ACCESS_KEY_ID.
Missing terraform.tfvars
Error: No value for required variable
Copy terraform.tfvars.example to terraform.tfvars and fill in the required values.
State lock held
Error: Error acquiring the state lock
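Another plan or apply may be running. Wait for it to finish. If you're certain no apply is in progress, you can force-unlock, but understand the risk before doing so: releasing a lock that is still in use can let two runs write to the same state and corrupt it.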
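If a force-unlock is truly needed, the lock ID is printed in the "Lock Info" section of the error message. The sketch below wraps Terraform's force-unlock command; the function name tf_force_unlock is hypothetical.

```shell
# Release a stale state lock. Only use this when no plan or apply is running.
# The lock ID comes from the "Lock Info" block of the lock error.
tf_force_unlock() {
  local env="$1" lock_id="${2:-}"
  if [ -z "$lock_id" ]; then
    echo "usage: tf_force_unlock <env> <lock-id>" >&2
    return 2
  fi
  terraform -chdir="infra/terraform/envs/${env}" force-unlock "$lock_id"
}
```

Terraform asks for interactive confirmation before releasing the lock, which gives one last chance to check with teammates that nobody is mid-apply.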
Provider validation error
The error message will include the resource type and the specific attribute. Read it carefully and check the module that manages that resource type.