Terraform

Terraform is the tool Vega uses to define and manage AWS infrastructure. Instead of clicking things into existence in the AWS console, you write .tf files that describe what should exist, and Terraform figures out how to create or update it. This makes infrastructure changes repeatable, reviewable, and version-controlled.

Repository layout

infra/terraform/
├── modules/          ← Reusable building blocks
│   ├── network/          VPC, subnets, routing
│   ├── security/         Security groups
│   ├── database/         RDS Postgres instance
│   ├── storage/          S3 buckets and KMS keys
│   ├── registry/         ECR image registry
│   ├── queue/            SQS scan queue
│   ├── observability/    CloudWatch log groups and alarms
│   ├── secrets/          Secrets Manager secrets (structure only)
│   ├── cognito/          Cognito user pool
│   ├── iam/              IAM roles and policies for ECS tasks
│   ├── frontend_hosting/ CloudFront distribution + S3 frontend bucket
│   ├── api_smoke/        ECS cluster, API service, load balancer
│   ├── llm_proxy_service/ LLM proxy ECS service
│   ├── runner_task/      v16 runner task definition
│   ├── worker_service/   Worker ECS service
│   ├── maintenance_task/ Maintenance task definition
│   └── cost_controls/    AWS Budgets and anomaly detection
│
└── envs/             ← Environment compositions
    ├── dev/
    │   ├── main.tf             which modules to use and how
    │   ├── variables.tf        input variables
    │   ├── outputs.tf          values exposed to scripts
    │   ├── backend.tf          remote state configuration
    │   └── terraform.tfvars.example  non-secret values template
    └── prod/
        └── (same structure)

Modules are building blocks that encapsulate a set of related AWS resources. Environments (dev, prod) compose modules together with environment-specific values. A change to a module therefore affects every environment that uses it.
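Before editing a module, it can help to see which environments reference it. The following is a minimal sketch, not part of the repository: the function name envs_using_module is hypothetical, and it assumes each environment's main.tf refers to modules through a source path containing modules/<name> (the actual source paths are not shown in this guide).

```shell
# List the environments whose main.tf references a given module.
# Assumption: module blocks use a source path ending in "modules/<name>",
# for example: source = "../../modules/network"
envs_using_module() {
  local module="$1"
  local root="${2:-infra/terraform}"
  local main
  for main in "$root"/envs/*/main.tf; do
    [ -f "$main" ] || continue
    # Require the closing quote so "network" does not match "network_extra".
    if grep -Eq "modules/${module}/?\"" "$main"; then
      basename "$(dirname "$main")"
    fi
  done
}

# Usage, from the repository root:
#   envs_using_module network
```

Each environment name printed is one that should be re-planned after the module changes.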

Terraform state

Terraform keeps a state file that records what AWS resources it has created and their current configuration. This state is how Terraform knows what to create, update, or delete when you run plan or apply.

Vega stores state remotely in S3 (durable) and uses DynamoDB for locking (prevents two people from running apply simultaneously). You need to bootstrap this once per environment before the first apply.

ENV=dev AWS_REGION=us-west-1 scripts/aws/bootstrap-terraform-state.sh

This creates:

- An S3 bucket with versioning and encryption for the Terraform state file
- A DynamoDB table for state locking
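For orientation, a bootstrap of this kind could be sketched with the AWS CLI as follows. This is an illustration under stated assumptions, not the contents of bootstrap-terraform-state.sh: the bucket and table names are hypothetical placeholders, the AES256 encryption choice is assumed, and the sketch is not idempotent. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Run a command, or print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

bootstrap_state() {
  local env="$1" region="$2"
  local bucket="example-tfstate-${env}"   # hypothetical bucket name
  local table="example-tflock-${env}"     # hypothetical table name

  # Note: in us-east-1, create-bucket must be called without
  # --create-bucket-configuration.
  run aws s3api create-bucket --bucket "$bucket" --region "$region" \
    --create-bucket-configuration "LocationConstraint=${region}"
  run aws s3api put-bucket-versioning --bucket "$bucket" \
    --versioning-configuration Status=Enabled
  run aws s3api put-bucket-encryption --bucket "$bucket" \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
  run aws s3api put-public-access-block --bucket "$bucket" \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
  # Terraform's S3 backend expects a string hash key named LockID.
  run aws dynamodb create-table --table-name "$table" --region "$region" \
    --attribute-definitions AttributeName=LockID,AttributeType=S \
    --key-schema AttributeName=LockID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST
}

# Usage: DRY_RUN=1 bootstrap_state dev us-west-1
```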

State contains sensitive data

Terraform state stores resource attributes in plain text, which can include sensitive values such as database endpoints and passwords. Keep the state bucket private and encrypted (the bootstrap script does this). Never commit .tfstate files to version control.
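One way to enforce the no-commit rule is a check that scans tracked paths for state files, alongside .gitignore entries such as *.tfstate and *.tfstate.*. The sketch below is an assumption about how such a check could look; find_state_files is a hypothetical helper, not an existing script.

```shell
# Print any paths that look like Terraform state files.
# Reads one path per line on stdin, so it can be fed from `git ls-files`.
find_state_files() {
  grep -E '(^|/)[^/]*\.tfstate(\.[^/]*)?$' || true
}

# Usage, for example in a pre-commit hook:
#   tracked=$(git ls-files | find_state_files)
#   [ -z "$tracked" ] || { echo "state files are tracked: $tracked" >&2; exit 1; }
```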

Planning changes

Always plan before you apply. Planning shows exactly what Terraform will create, update, or destroy — without making any changes to AWS.

scripts/aws/terraform-plan.sh dev

This runs:

1. terraform init: downloads provider plugins and configures the remote backend
2. terraform fmt -check: fails if files are not properly formatted
3. terraform validate: checks for syntax errors
4. terraform plan -out tfplan-dev: generates the plan and saves it to a file
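The four steps can be sketched as a small function. This is an approximation of what terraform-plan.sh does based on the description above, not its actual contents; plan_env and the run helper are hypothetical names. Because -chdir switches directory before Terraform runs, the plan file is written inside the environment directory.

```shell
# Run a command, or print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Run init, fmt -check, validate, and plan in order, stopping at the first failure.
plan_env() {
  local env="$1"
  local dir="infra/terraform/envs/${env}"
  run terraform -chdir="$dir" init &&
    run terraform -chdir="$dir" fmt -check &&
    run terraform -chdir="$dir" validate &&
    run terraform -chdir="$dir" plan -out "tfplan-${env}"
}

# Usage: plan_env dev
```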

Read the plan output carefully before applying. Look for:

- Resources marked with + (will be created)
- Resources marked with ~ (will be updated in-place)
- Resources marked with -/+ (will be destroyed and recreated). These are risky; database replacements cause data loss
- Resources marked with - (will be destroyed)
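To make replacements and destroys harder to miss in a long plan, the saved plan can be rendered as plain text and filtered. The sketch below relies on the per-resource header comments that recent Terraform versions print ("must be replaced", "will be destroyed"); that wording may differ between versions, so treat it as a reading aid rather than a gate. The name risky_changes is hypothetical.

```shell
# Read plain-text plan output on stdin and print the header lines of
# resources that will be replaced or destroyed.
risky_changes() {
  grep -E '^[[:space:]]*# .+ (must be replaced|will be destroyed)$' || true
}

# Usage (assumes the plan file written by the plan step):
#   terraform -chdir=infra/terraform/envs/dev show -no-color tfplan-dev | risky_changes
```

For automated checks, terraform show -json on the plan file is likely a sturdier input than the human-readable text.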

Applying changes

Apply uses the saved plan file, so what you reviewed is exactly what gets applied:

scripts/aws/terraform-apply.sh dev

For production:

scripts/aws/terraform-apply.sh prod --confirm-prod

The --confirm-prod flag is a safety check to prevent accidental prod changes.
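A guard of this kind could look like the sketch below. It is an assumption about the behavior described above, not the contents of terraform-apply.sh: apply_env and run are hypothetical names, and the prod plan file name tfplan-prod is inferred by analogy with tfplan-dev.

```shell
# Run a command, or print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Apply the saved plan for an environment; prod needs an explicit flag.
apply_env() {
  local env="$1" flag="${2:-}"
  case "$env" in
    dev) ;;
    prod)
      if [ "$flag" != "--confirm-prod" ]; then
        echo "refusing to apply to prod without --confirm-prod" >&2
        return 1
      fi
      ;;
    *)
      echo "unknown environment: $env" >&2
      return 2
      ;;
  esac
  # Passing the plan file means Terraform applies exactly what was reviewed.
  run terraform -chdir="infra/terraform/envs/${env}" apply "tfplan-${env}"
}
```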

Production apply requires review

Never run terraform-apply.sh prod without reviewing the plan output first. Database changes, IAM policy changes, and security group changes can have serious consequences.

Reading Terraform outputs

Outputs expose values that scripts need — cluster names, service names, bucket names, queue URLs, etc. Scripts use these instead of hard-coded resource IDs.

# Show all outputs for the dev environment
terraform -chdir=infra/terraform/envs/dev output

# Get a specific output value
terraform -chdir=infra/terraform/envs/dev output -raw ecs_cluster_name
terraform -chdir=infra/terraform/envs/dev output -raw api_base_url
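A script might wrap the lookup in a helper so that resource names are never hard-coded. This is a minimal sketch; tf_output is a hypothetical name. Note that -raw works only for string, number, and boolean outputs.

```shell
# Print a single Terraform output value for an environment.
tf_output() {
  local env="$1" name="$2"
  terraform -chdir="infra/terraform/envs/${env}" output -raw "$name"
}

# Usage:
#   cluster=$(tf_output dev ecs_cluster_name) || exit 1
#   aws ecs list-services --cluster "$cluster"
```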

Environment files

Each environment directory contains:

File	Purpose
`main.tf`	Which modules to instantiate and with which arguments
`variables.tf`	Declared input variables
`outputs.tf`	Values that are exported (used by scripts and operators)
`backend.tf`	Remote state configuration (S3 bucket + DynamoDB table)
`terraform.tfvars.example`	Template for non-secret variable values

Copy terraform.tfvars.example to terraform.tfvars and fill in your values. Don't commit terraform.tfvars if it contains sensitive values.
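The copy step can be made safe to repeat so that it never overwrites values you have already filled in. A small sketch, with init_tfvars as a hypothetical helper name:

```shell
# Create terraform.tfvars from the example file unless it already exists.
init_tfvars() {
  local dir="$1"
  if [ -e "$dir/terraform.tfvars" ]; then
    echo "terraform.tfvars already exists, leaving it untouched" >&2
    return 0
  fi
  cp "$dir/terraform.tfvars.example" "$dir/terraform.tfvars"
}

# Usage: init_tfvars infra/terraform/envs/dev
```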

Safe practices

Always run terraform-plan.sh before terraform-apply.sh
Read the full plan output before applying — never apply blind
Never run prod apply from a dirty or unreviewed branch
Don't put plaintext secrets into Terraform variables — use Secrets Manager (Terraform can create the secret structure; populate the value separately)
Keep dev and prod in separate state files
Use immutable image tags for production deployments
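
The "dirty branch" rule is one that a script can check mechanically. The sketch below refuses to continue when the working tree has uncommitted changes; require_clean_tree is a hypothetical helper, and whether a branch has been reviewed still has to be confirmed by a person.

```shell
# Fail when the git working tree has uncommitted or untracked changes.
require_clean_tree() {
  if [ -n "$(git status --porcelain)" ]; then
    echo "working tree has uncommitted changes; commit or stash before a prod apply" >&2
    return 1
  fi
}

# Usage: require_clean_tree && scripts/aws/terraform-apply.sh prod --confirm-prod
```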

Common errors

Missing AWS credentials

Error: No valid credential sources found

Run aws sts get-caller-identity to confirm you have valid credentials. Check AWS_PROFILE or AWS_ACCESS_KEY_ID.
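
The same check can run at the top of a script so that a credentials problem surfaces before Terraform starts. A minimal sketch, with check_aws_credentials as a hypothetical name:

```shell
# Print the caller's ARN, or fail with a hint when credentials are missing.
check_aws_credentials() {
  if ! aws sts get-caller-identity --query Arn --output text; then
    echo "no valid AWS credentials; check AWS_PROFILE or AWS_ACCESS_KEY_ID" >&2
    return 1
  fi
}

# Usage: check_aws_credentials || exit 1
```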

Missing terraform.tfvars

Error: No value for required variable

Copy terraform.tfvars.example to terraform.tfvars and fill in the required values.

State lock held

Error: Error acquiring the state lock

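Another plan or apply may be running. Wait for it to finish. If you're certain no plan or apply is in progress, you can release the lock with terraform force-unlock and the lock ID shown in the error message. Removing a lock while another apply is still running can corrupt the state, so confirm with your team first.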
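
The lock error prints a "Lock Info" block that includes the lock ID. The sketch below pulls that ID out of captured error text; lock_id_from_error is a hypothetical helper, and it assumes the ID appears on a line of the form "ID: <value>", as current Terraform versions print it.

```shell
# Extract the lock ID from Terraform's state lock error text (read from stdin).
lock_id_from_error() {
  sed -n 's/^[^A-Za-z]*ID:[[:space:]]*\([^[:space:]]*\).*$/\1/p' | head -n 1
}

# Usage, only after confirming that nothing else is running:
#   lock_id=$(lock_id_from_error < plan-error.log)
#   terraform -chdir=infra/terraform/envs/dev force-unlock "$lock_id"
```

Here plan-error.log stands in for wherever you saved the error output.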

Provider validation error

The error message includes the resource type and the specific attribute. Read it carefully and check the module that manages that resource type.