Cost and Safety
AWS costs money. This page explains what drives costs in Vega's AWS setup, how to control them, and what practices to follow to avoid surprises — both financial and operational.
What costs money
| Service | Cost driver |
|---|---|
| ECS Fargate | vCPU and memory consumed per second by running tasks. The API and worker run 24/7. Runner tasks run only during scans. |
| RDS / Aurora Postgres | Instance uptime + storage. Even an idle database costs money. |
| NAT Gateway | Hourly charge per gateway + data processed (per GB). Outbound traffic from private subnets (to SQS, S3, ECR, etc.) goes through NAT unless a VPC endpoint is configured for that service. |
| S3 | Storage bytes + request count. Source snapshots and artifact bundles accumulate. |
| CloudWatch | Log ingestion (per GB) + retention. High-verbosity logging is expensive. |
| AI provider | Token usage billed by the provider. This is often the largest variable cost. |
| CloudFront | Data transfer out + HTTP request count. Usually small for internal tools. |
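As a rough illustration of the Fargate row, the sketch below estimates cost from vCPU, memory, and running hours. The rates and task sizes are placeholders chosen for the example, not real prices or Vega's actual task sizes; look up current pricing for your region before relying on any number.

```python
def fargate_cost_usd(vcpu: float, memory_gb: float, hours: float,
                     usd_per_vcpu_hour: float, usd_per_gb_hour: float) -> float:
    """Estimate Fargate cost for one task of the given size running for `hours`."""
    hourly = vcpu * usd_per_vcpu_hour + memory_gb * usd_per_gb_hour
    return hourly * hours


# Placeholder rates for illustration only (assumption, not real pricing).
VCPU_RATE = 0.04
GB_RATE = 0.004

# Hypothetical always-on task (e.g. the API) vs. a runner used 20 hours a month.
always_on = fargate_cost_usd(0.5, 1.0, 730, VCPU_RATE, GB_RATE)
runner = fargate_cost_usd(1.0, 2.0, 20, VCPU_RATE, GB_RATE)
print(f"always-on: ${always_on:.2f}/month, runner: ${runner:.2f}/month")
```

The point of the comparison is that a small always-on task can cost more over a month than a larger task that runs only during scans.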
Cost controls Terraform module
The infra/terraform/modules/cost_controls/main.tf module configures:
- AWS Budgets — sends email alerts when monthly spend exceeds thresholds. You can have separate budgets for dev and prod.
- Cost Explorer anomaly detection — attempts to detect unusual spend patterns (e.g., a runaway scan creating too many ECS tasks).
Check that budget notification recipients are correct after deploying this module.
AI usage limits
The LLM proxy can enforce per-scan usage limits. They are off by default: a value of 0 means unlimited. Set nonzero values to cap runaway scans:
| Variable | What it limits |
|---|---|
| `VEGA_LLM_PROXY_MAX_REQUESTS_PER_SCAN` | Maximum number of AI API calls for one scan |
| `VEGA_LLM_PROXY_MAX_TOKENS_PER_SCAN` | Maximum tokens (input + output) for one scan |
| `VEGA_LLM_PROXY_MAX_COST_USD_PER_SCAN` | Estimated USD cost cap for one scan |
| `VEGA_LLM_PROXY_PRICE_USD_PER_1K_TOKENS` | Token price used for cost estimation |
Set these in the LLM proxy and runner task definitions in Terraform.
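To make the interaction between these limits concrete, here is a minimal sketch of how such a check could work. It is not the proxy's actual code: the `ScanLimits` class, the function names, and the choice to treat a limit as exceeded only when usage is strictly greater than the cap are all assumptions. The only behavior taken from this page is that 0 means unlimited and that cost is estimated from a per-1K-token price.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanLimits:
    # Hypothetical container mirroring the four variables; 0 disables a cap.
    max_requests: int = 0
    max_tokens: int = 0
    max_cost_usd: float = 0.0
    price_usd_per_1k_tokens: float = 0.0


def estimate_cost_usd(tokens: int, price_usd_per_1k_tokens: float) -> float:
    """Estimated cost: tokens divided by 1000, times the per-1K-token price."""
    return tokens / 1000 * price_usd_per_1k_tokens


def exceeded_limit(limits: ScanLimits, requests: int, tokens: int) -> Optional[str]:
    """Return the name of the first cap exceeded, or None if the scan is within limits."""
    if limits.max_requests > 0 and requests > limits.max_requests:
        return "requests"
    if limits.max_tokens > 0 and tokens > limits.max_tokens:
        return "tokens"
    cost = estimate_cost_usd(tokens, limits.price_usd_per_1k_tokens)
    if limits.max_cost_usd > 0 and cost > limits.max_cost_usd:
        return "cost"
    return None


# Example values are illustrative, not recommended settings.
dev_limits = ScanLimits(max_requests=200, max_tokens=500_000,
                        max_cost_usd=5.0, price_usd_per_1k_tokens=0.01)
print(exceeded_limit(dev_limits, requests=50, tokens=600_000))
```

Note that the cost cap is only as accurate as the configured token price: if the price is left at 0, the estimated cost is always 0 and the cost cap never triggers in this sketch.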
Dev environment safety practices
- Use the smallest RDS instance type that works (`db.t3.micro` is fine for dev).
- Set short CloudWatch log retention (7 days is enough for dev).
- Avoid running NAT-heavy tasks when not needed. Prefer VPC endpoints for S3, ECR, Secrets Manager, and CloudWatch to reduce NAT costs.
- Clean up old runner task definitions and ECR images that are no longer referenced.
- Set low AI usage caps in dev — you don't need full scan capacity for testing.
- Set a modest AWS Budget alert (e.g., $50/month for dev) so you notice if something is misbehaving.
Production safety practices
- Always review Terraform plans before applying, especially changes that touch:
  - Database instance type or storage (potential data loss)
  - Security group rules (potential security exposure)
  - IAM policies (potential privilege escalation)
- Use immutable, reviewed image tags for ECS deployments (e.g., Git SHA tags, not `latest`).
- Keep secrets in Secrets Manager, never in environment variables committed to code.
- Run migrations explicitly and intentionally — never automatically on container startup.
- Set a production AWS Budget alert. A runaway scan or misconfigured task can burn through money quickly.
- Confirm alarms are firing to the right people before you need them.
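The image tag practice above can be checked mechanically, for example in CI before a deploy. The sketch below is one possible check, not part of Vega: the function name and the `vega-api` repository name are hypothetical, and it assumes a "Git SHA tag" means 7 to 40 lowercase hex characters.

```python
import re

# Assumption: a Git SHA tag is 7 to 40 lowercase hexadecimal characters.
_SHA_TAG = re.compile(r"[0-9a-f]{7,40}")


def is_immutable_image_ref(image: str) -> bool:
    """True if the image is pinned by digest or carries a Git-SHA-like tag."""
    if "@sha256:" in image:
        return True
    # Look for the tag only in the last path segment, so a registry port
    # such as localhost:5000/repo is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # an untagged reference resolves to "latest"
    tag = last_segment.rsplit(":", 1)[1]
    return _SHA_TAG.fullmatch(tag) is not None


# Hypothetical image references for illustration.
for ref in ("vega-api:3f2a9c1", "vega-api:latest", "vega-api"):
    print(ref, is_immutable_image_ref(ref))
```

A check like this only confirms the tag format. It does not prove the tag was reviewed or that the registry enforces tag immutability.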
Debugging a cost spike
If AWS Cost Explorer shows unexpected spend:
- Check Cost Explorer by service — which service is the largest contributor?
- ECS — how many tasks are running? How long did they run? Are runner tasks being cleaned up after scans complete?
- RDS — did the instance size change? Is Multi-AZ enabled unexpectedly?
- NAT Gateway — which subnets are routing through it? Can you use VPC endpoints instead?
- CloudWatch — is log verbosity set too high? Are old log groups retaining data longer than needed?
- AI provider — check the LLM proxy usage records. Are per-scan limits set?
- S3 — are old source snapshots and artifacts accumulating without a lifecycle policy?
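The first step, finding the largest contributor by service, can also be scripted. The sketch below ranks services from a response shaped like the one boto3's Cost Explorer client returns from `get_cost_and_usage` when grouped by the `SERVICE` dimension with the `UnblendedCost` metric. The function name and the sample figures are made up for illustration, and the API call itself is left as a comment because it needs AWS credentials.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_services_by_cost(response: dict) -> List[Tuple[str, float]]:
    """Sum UnblendedCost per service across all periods, largest first."""
    totals: Dict[str, float] = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# In practice the response would come from something like:
#   boto3.client("ce").get_cost_and_usage(
#       TimePeriod={"Start": "...", "End": "..."}, Granularity="DAILY",
#       Metrics=["UnblendedCost"],
#       GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}])
# The sample below is fabricated purely to show the shape.
sample = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["Amazon Elastic Container Service"],
             "Metrics": {"UnblendedCost": {"Amount": "1.5", "Unit": "USD"}}},
            {"Keys": ["Amazon Relational Database Service"],
             "Metrics": {"UnblendedCost": {"Amount": "4.0", "Unit": "USD"}}},
        ]},
        {"Groups": [
            {"Keys": ["Amazon Elastic Container Service"],
             "Metrics": {"UnblendedCost": {"Amount": "3.0", "Unit": "USD"}}},
        ]},
    ]
}

for service, cost in rank_services_by_cost(sample):
    print(f"{service}: ${cost:.2f}")
```

AI provider spend is billed by the provider rather than AWS, so it will not appear in this ranking. Check the LLM proxy usage records for that part.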