Cost and Safety

AWS costs money. This page explains what drives costs in Vega's AWS setup, how to control them, and what practices to follow to avoid surprises — both financial and operational.

What costs money

| Service | Cost driver |
|---|---|
| ECS Fargate | vCPU and memory consumed per second by running tasks. The API and worker run 24/7. Runner tasks run only during scans. |
| RDS / Aurora Postgres | Instance uptime plus storage. Even an idle database costs money. |
| NAT Gateway | Hourly charge plus data processed through the gateway. Outbound traffic from private subnets (to SQS, S3, ECR, etc.) goes through NAT unless a VPC endpoint handles it. |
| S3 | Stored bytes plus request count. Source snapshots and artifact bundles accumulate. |
| CloudWatch | Log ingestion (per GB) plus storage for the retention period. High-verbosity logging is expensive. |
| AI provider | Token usage billed by the provider. This is often the largest variable cost. |
| CloudFront | Data transfer out plus HTTP request count. Usually small for internal tools. |
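To make the Fargate row concrete, here is a rough sketch of the per-second arithmetic. The rates and task sizes below are placeholders chosen for illustration, not actual AWS prices or Vega's real task sizes; substitute the current Fargate prices for your region.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600


@dataclass(frozen=True)
class FargateRates:
    """Hypothetical hourly rates; replace with current prices for your region."""
    usd_per_vcpu_hour: float
    usd_per_gb_hour: float


def task_cost_usd(vcpu: float, memory_gb: float, runtime_seconds: float,
                  rates: FargateRates) -> float:
    """Estimate the cost of one task: runtime times (vCPU rate + memory rate)."""
    hours = runtime_seconds / SECONDS_PER_HOUR
    hourly = vcpu * rates.usd_per_vcpu_hour + memory_gb * rates.usd_per_gb_hour
    return hours * hourly


# Placeholder rates, for illustration only.
rates = FargateRates(usd_per_vcpu_hour=0.04, usd_per_gb_hour=0.004)

# One always-on task (assumed 0.5 vCPU / 1 GB) running for a 30-day month.
always_on = task_cost_usd(0.5, 1.0, 30 * 24 * 3600, rates)

# 100 scans, each using one runner task (assumed 1 vCPU / 2 GB) for 10 minutes.
runners = 100 * task_cost_usd(1.0, 2.0, 600, rates)

print(f"always-on task: ${always_on:.2f}/month, runner tasks: ${runners:.2f}/month")
```

With numbers of this shape, the always-on services dominate the Fargate bill unless scan volume is high or runner tasks are left running after scans complete.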

Cost controls Terraform module

The cost_controls module (infra/terraform/modules/cost_controls/main.tf) configures:

  • AWS Budgets: sends email alerts when monthly spend crosses the configured thresholds. You can define separate budgets for dev and prod.
  • AWS Cost Anomaly Detection (part of Cost Explorer): flags unusual spend patterns (e.g., a runaway scan creating too many ECS tasks).

Check that budget notification recipients are correct after deploying this module.

AI usage limits

The LLM proxy can enforce per-scan usage limits. They are off by default: a value of 0 means unlimited. Enable them to cap runaway scans:

| Variable | What it limits |
|---|---|
| VEGA_LLM_PROXY_MAX_REQUESTS_PER_SCAN | Maximum number of AI API calls for one scan |
| VEGA_LLM_PROXY_MAX_TOKENS_PER_SCAN | Maximum tokens (input + output) for one scan |
| VEGA_LLM_PROXY_MAX_COST_USD_PER_SCAN | Estimated USD cost cap for one scan |
| VEGA_LLM_PROXY_PRICE_USD_PER_1K_TOKENS | Token price (USD per 1,000 tokens) used for cost estimation |

Set these in the LLM proxy and runner task definitions in Terraform.
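The sketch below shows one way these four variables could interact. It is not the proxy's actual implementation. The class and function names are hypothetical, and the value types (integers for counts, floats for USD) and the choice to flag only usage strictly above a cap are assumptions. Only the variable names and the "0 = unlimited" rule come from this page.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class ScanLimits:
    """Per-scan caps; 0 means 'unlimited' for every cap (hypothetical helper)."""
    max_requests: int = 0
    max_tokens: int = 0
    max_cost_usd: float = 0.0
    price_usd_per_1k_tokens: float = 0.0

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "ScanLimits":
        # Unset variables fall back to 0, matching the documented default.
        env = os.environ if env is None else env
        return cls(
            max_requests=int(env.get("VEGA_LLM_PROXY_MAX_REQUESTS_PER_SCAN", "0")),
            max_tokens=int(env.get("VEGA_LLM_PROXY_MAX_TOKENS_PER_SCAN", "0")),
            max_cost_usd=float(env.get("VEGA_LLM_PROXY_MAX_COST_USD_PER_SCAN", "0")),
            price_usd_per_1k_tokens=float(
                env.get("VEGA_LLM_PROXY_PRICE_USD_PER_1K_TOKENS", "0")
            ),
        )


def estimated_cost_usd(tokens: int, price_usd_per_1k_tokens: float) -> float:
    """Estimated cost = (tokens / 1000) * price per 1,000 tokens."""
    return tokens / 1000 * price_usd_per_1k_tokens


def exceeded_limits(limits: ScanLimits, requests: int, tokens: int) -> List[str]:
    """Return the names of the caps that the scan's usage has gone over."""
    exceeded = []
    if limits.max_requests > 0 and requests > limits.max_requests:
        exceeded.append("requests")
    if limits.max_tokens > 0 and tokens > limits.max_tokens:
        exceeded.append("tokens")
    cost = estimated_cost_usd(tokens, limits.price_usd_per_1k_tokens)
    if limits.max_cost_usd > 0 and cost > limits.max_cost_usd:
        exceeded.append("cost")
    return exceeded


# Example values, chosen for illustration only.
limits = ScanLimits.from_env({
    "VEGA_LLM_PROXY_MAX_TOKENS_PER_SCAN": "200000",
    "VEGA_LLM_PROXY_MAX_COST_USD_PER_SCAN": "1.50",
    "VEGA_LLM_PROXY_PRICE_USD_PER_1K_TOKENS": "0.01",
})
print(exceeded_limits(limits, requests=10_000, tokens=180_000))
```

In this example the request cap is unset (unlimited) and the token count is under its cap, but the estimated cost of 180,000 tokens at $0.01 per 1,000 tokens is about $1.80, which is over the $1.50 cost cap. A cost cap is only meaningful if the price variable is also set; with a price of 0 the estimated cost is always 0.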

Dev environment safety practices

  • Use the smallest RDS instance type that works (db.t3.micro is fine for dev).
  • Set short CloudWatch log retention (7 days is enough for dev).
  • Avoid running NAT-heavy tasks when not needed. Prefer VPC endpoints for S3, ECR, Secrets Manager, and CloudWatch to reduce NAT costs.
  • Clean up old runner task definitions and ECR images that are no longer referenced.
  • Set low AI usage caps in dev — you don't need full scan capacity for testing.
  • Set a modest AWS Budget alert (e.g., $50/month for dev) so you notice if something is misbehaving.

Production safety practices

  • Always review Terraform plans before applying, especially changes that touch:
    • Database instance type or storage (potential data loss)
    • Security group rules (potential security exposure)
    • IAM policies (potential privilege escalation)
  • Use immutable, reviewed image tags for ECS deployments (e.g., Git SHA tags, not latest).
  • Keep secrets in Secrets Manager — never in environment variables committed to code.
  • Run migrations explicitly and intentionally — never automatically on container startup.
  • Set a production AWS Budget alert. A runaway scan or misconfigured task can burn through money quickly.
  • Confirm that alarm notifications reach the right people before you need them.

Debugging a cost spike

If AWS Cost Explorer shows unexpected spend:

  1. Check Cost Explorer by service — which service is the largest contributor?
  2. ECS — how many tasks are running? How long did they run? Are runner tasks being cleaned up after scans complete?
  3. RDS — did the instance size change? Is Multi-AZ enabled unexpectedly?
  4. NAT Gateway — which subnets are routing through it? Can you use VPC endpoints instead?
  5. CloudWatch — is log verbosity set too high? Are old log groups retaining data longer than needed?
  6. AI provider — check the LLM proxy usage records. Are per-scan limits set?
  7. S3 — are old source snapshots and artifacts accumulating without a lifecycle policy?
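Step 1 can also be scripted. The sketch below ranks services by total cost from a response shaped like the Cost Explorer GetCostAndUsage API output when grouped by the SERVICE dimension (for example, from boto3's `ce` client and its `get_cost_and_usage` call). The function name and the sample figures are made up for illustration; no AWS call is made here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_services_by_cost(response: dict,
                          metric: str = "UnblendedCost") -> List[Tuple[str, float]]:
    """Sum each service's cost across all time periods, largest first."""
    totals: Dict[str, float] = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]  # one key because we group by SERVICE only
            totals[service] += float(group["Metrics"][metric]["Amount"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def _group(service: str, amount: str) -> dict:
    """Build one group entry in the GetCostAndUsage response shape."""
    return {"Keys": [service],
            "Metrics": {"UnblendedCost": {"Amount": amount, "Unit": "USD"}}}


# Made-up sample data covering two periods.
sample_response = {
    "ResultsByTime": [
        {"Groups": [
            _group("Amazon Elastic Container Service", "40.0"),
            _group("Amazon Relational Database Service", "30.0"),
            _group("EC2 - Other", "12.5"),
        ]},
        {"Groups": [
            _group("Amazon Elastic Container Service", "95.0"),
            _group("Amazon Relational Database Service", "30.0"),
            _group("EC2 - Other", "13.0"),
        ]},
    ]
}

ranked = rank_services_by_cost(sample_response)
for service, total in ranked:
    print(f"{service}: ${total:.2f}")
```

Note that Cost Explorer typically reports NAT Gateway charges under "EC2 - Other" rather than as a separate service, so a jump in that line is a hint to continue with step 4.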