Cost and Safety

AWS costs money. This page explains what drives costs in Vega's AWS setup, how to control them, and what practices to follow to avoid surprises — both financial and operational.

What costs money

Service	Cost driver
ECS Fargate	vCPU and memory consumed per second by running tasks. The API and worker run 24/7. Runner tasks run only during scans.
RDS / Aurora Postgres	Instance uptime + storage. Even an idle database costs money.
NAT Gateway	Fixed hourly charge per gateway + data processed (per GB). Outbound traffic from private subnets to internet destinations goes through NAT. Calls to AWS services that have a VPC endpoint do not.
VPC Interface Endpoints	Fixed hourly charge per endpoint per AZ, plus a per-GB data processing charge. Vega deploys 4 interface endpoints (ECR API, ECR DKR, CloudWatch Logs, Secrets Manager) in 2 AZs each, so 8 endpoint-AZ pairs are billed every hour. The S3 endpoint is a gateway endpoint, which is free.
S3	Storage bytes + request count. Source snapshots and artifact bundles accumulate.
CloudWatch	Log ingestion (per GB) + storage for as long as logs are retained. High-verbosity logging is expensive.
AI provider	Token usage billed by the provider. This is often the largest variable cost.
CloudFront	Data transfer out + HTTP request count. Usually small for internal tools.
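
The fixed interface endpoint charge is simple to estimate, because it depends only on the number of endpoints, the number of AZs, and the hourly rate. The sketch below works through that arithmetic. The function name, the 730-hour month, and the 0.01 USD rate are illustrative assumptions, not values from Vega's configuration or a quoted AWS price; look up the current rate for your region.

```python
# Sketch: fixed monthly charge for the interface endpoints listed above.
# The hourly rate and hours-per-month defaults are placeholders, not AWS prices.

INTERFACE_ENDPOINTS = ["ECR API", "ECR DKR", "CloudWatch Logs", "Secrets Manager"]
AZ_COUNT = 2


def monthly_endpoint_fixed_cost(hourly_rate_usd: float, hours: float = 730.0) -> float:
    """Fixed charge: one billed unit per endpoint per AZ per hour."""
    billed_units = len(INTERFACE_ENDPOINTS) * AZ_COUNT  # 4 endpoints x 2 AZs = 8
    return billed_units * hourly_rate_usd * hours


if __name__ == "__main__":
    # 0.01 USD/hour is an assumed example rate.
    print(f"{monthly_endpoint_fixed_cost(0.01):.2f} USD/month")
```

This covers only the fixed portion. Per-GB data processing charges come on top and depend on traffic volume.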

Cost controls Terraform module

The module at `infra/terraform/modules/cost_controls/main.tf` configures:

AWS Budgets: sends email alerts when monthly spend exceeds thresholds. You can have separate budgets for dev and prod.
AWS Cost Anomaly Detection (a Cost Explorer feature): attempts to detect unusual spend patterns (e.g., a runaway scan creating too many ECS tasks).

Check that budget notification recipients are correct after deploying this module.

AI usage limits

The LLM proxy can enforce per-scan usage limits. They are off by default: each limit is set to 0, which means unlimited. Enable them to cap runaway scans:

Variable	What it limits
`VEGA_LLM_PROXY_MAX_REQUESTS_PER_SCAN`	Maximum number of AI API calls for one scan
`VEGA_LLM_PROXY_MAX_TOKENS_PER_SCAN`	Maximum tokens (input + output) for one scan
`VEGA_LLM_PROXY_MAX_COST_USD_PER_SCAN`	Estimated USD cost cap for one scan
`VEGA_LLM_PROXY_PRICE_USD_PER_1K_TOKENS`	Token price used for cost estimation

Set these in the LLM proxy and runner task definitions in Terraform.
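
To make the semantics concrete, here is a minimal sketch of how limits like these could be evaluated for one scan. It is not the LLM proxy's actual code: the `ScanLimits` class and both function names are hypothetical, and it assumes that cost is estimated as total tokens divided by 1,000 times the configured price, and that a limit counts as exceeded only when usage goes above it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScanLimits:
    # Hypothetical container mirroring the four variables above; 0 means "no limit".
    max_requests: int = 0
    max_tokens: int = 0
    max_cost_usd: float = 0.0
    price_usd_per_1k_tokens: float = 0.0


def estimated_cost_usd(total_tokens: int, limits: ScanLimits) -> float:
    """Estimate cost from token count and the configured per-1K-token price."""
    return total_tokens / 1000 * limits.price_usd_per_1k_tokens


def exceeded_limits(requests: int, total_tokens: int, limits: ScanLimits) -> List[str]:
    """Return the names of the limits this scan has gone over (empty if none)."""
    exceeded = []
    if limits.max_requests > 0 and requests > limits.max_requests:
        exceeded.append("requests")
    if limits.max_tokens > 0 and total_tokens > limits.max_tokens:
        exceeded.append("tokens")
    cost = estimated_cost_usd(total_tokens, limits)
    if limits.max_cost_usd > 0 and cost > limits.max_cost_usd:
        exceeded.append("cost")
    return exceeded
```

Under these assumptions the cost cap only works if the price variable is set as well: with a price of 0 the estimated cost is always 0, so the cap never triggers.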

Dev environment safety practices

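Use the smallest RDS instance class that works (db.t3.micro is fine for dev on RDS Postgres; if you run Aurora, check which instance classes it supports, since its smallest options are larger).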
Set short CloudWatch log retention (7 days is enough for dev).
Avoid running NAT-heavy tasks when not needed. The private runtime VPC endpoints (S3 gateway, ECR API, ECR DKR, CloudWatch Logs, Secrets Manager) are already deployed, so private tasks do not need NAT to reach those AWS services. Do not remove these endpoints to save money: sending the same traffic through the NAT Gateway would cost more.
Clean up old runner task definitions and ECR images that are no longer referenced.
Set low AI usage caps in dev — you don't need full scan capacity for testing.
Set a modest AWS Budget alert (e.g., $50/month for dev) so you notice if something is misbehaving.

Production safety practices

Always review Terraform plans before applying, especially changes that touch:
- Database instance type or storage (potential downtime or data loss)
- Security group rules (potential security exposure)
- IAM policies (potential privilege escalation)
Use immutable, reviewed image tags for ECS deployments (e.g., Git SHA tags, not `latest`).
Keep secrets in Secrets Manager — never in environment variables committed to code.
Run migrations explicitly and intentionally — never automatically on container startup.
Set a production AWS Budget alert. A runaway scan or misconfigured task can burn through money quickly.
Confirm that alarm notifications reach the right people before you need them.
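
The image tag rule above can be checked automatically before a deploy. The sketch below is one possible check, not part of Vega's tooling: the function name is hypothetical, and it assumes images are tagged with a short or full Git commit SHA (7 to 40 lowercase hex characters). The image names in the usage lines are placeholders.

```python
import re

# Assumption: deployable images carry a Git commit SHA as their tag.
GIT_SHA_TAG = re.compile(r"^[0-9a-f]{7,40}$")


def has_git_sha_tag(image_ref: str) -> bool:
    """True if the image reference ends in a Git-SHA-style tag.

    References with no tag (which resolve to "latest"), with a mutable tag
    such as "latest", or pinned by digest are all reported as False here.
    """
    _name, sep, tag = image_ref.rpartition(":")
    # No ":" at all, or the last ":" belongs to a registry port, means no tag.
    if not sep or "/" in tag:
        return False
    return bool(GIT_SHA_TAG.match(tag))


if __name__ == "__main__":
    for ref in ("example-registry/vega-api:3f2a9c1", "example-registry/vega-api:latest"):
        print(ref, has_git_sha_tag(ref))
```

A check like this could run in CI against the image references in a Terraform plan. Digest-pinned references are also immutable, so extend the check if you deploy by digest.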

Debugging a cost spike

If AWS Cost Explorer shows unexpected spend:

Check Cost Explorer by service — which service is the largest contributor?
ECS — how many tasks are running? How long did they run? Are runner tasks being cleaned up after scans complete?
RDS — did the instance size change? Is Multi-AZ enabled unexpectedly?
NAT Gateway — which subnets are routing through it? Can you use VPC endpoints instead?
CloudWatch — is log verbosity set too high? Are old log groups retaining data longer than needed?
AI provider — check the LLM proxy usage records. Are per-scan limits set?
S3 — are old source snapshots and artifacts accumulating without a lifecycle policy?
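
The first step, finding the largest contributor, can also be scripted. The sketch below ranks services from a response shaped like the one the Cost Explorer `get_cost_and_usage` API returns when grouped by the `SERVICE` dimension with the `UnblendedCost` metric. The function name is hypothetical and the sample amounts are made up for illustration. Fetching a real response (for example with boto3) is left out so the block runs without AWS credentials.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_services_by_cost(response: dict) -> List[Tuple[str, float]]:
    """Sum UnblendedCost per service across all time periods, largest first."""
    totals: Dict[str, float] = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data in the response shape described above.
sample = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["Amazon Elastic Container Service"],
             "Metrics": {"UnblendedCost": {"Amount": "4.0", "Unit": "USD"}}},
            {"Keys": ["Amazon Relational Database Service"],
             "Metrics": {"UnblendedCost": {"Amount": "3.0", "Unit": "USD"}}},
        ]},
        {"Groups": [
            {"Keys": ["Amazon Elastic Container Service"],
             "Metrics": {"UnblendedCost": {"Amount": "6.5", "Unit": "USD"}}},
            {"Keys": ["Amazon Relational Database Service"],
             "Metrics": {"UnblendedCost": {"Amount": "3.0", "Unit": "USD"}}},
            {"Keys": ["EC2 - Other"],
             "Metrics": {"UnblendedCost": {"Amount": "1.25", "Unit": "USD"}}},
        ]},
    ]
}

if __name__ == "__main__":
    for service, cost in rank_services_by_cost(sample):
        print(f"{service}: {cost:.2f} USD")
```

Once the top service is known, continue with the matching question in the list above.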