Source Ingest

Source ingest is the process of getting a repository's code into a form Vega can scan. The backend supports two intake methods: fetching from a Git URL and extracting a user-uploaded archive.

Intake methods

Git fetch

When a user provides a Git URL, app/projects/fetcher.py clones the repository into a local directory, or into a temporary path if the snapshot is uploaded afterwards. The fetcher handles:

  • Shallow clones to avoid downloading unnecessary history
  • Authentication for private repositories (if credentials are configured)
  • Walking the source tree to build a file listing
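
The shallow clone and the tree walk can be sketched as follows. This is a minimal illustration, not the actual contents of app/projects/fetcher.py: the function names build_clone_command, shallow_clone, and list_source_files are hypothetical, and credential handling is left out.

```python
import os
import subprocess
from pathlib import Path


def build_clone_command(url: str, dest: str, depth: int = 1) -> list[str]:
    """Return a git command for a shallow, single-branch clone (hypothetical helper)."""
    # "--" ends option parsing, so a URL that starts with "-" is not read as a flag.
    return ["git", "clone", "--depth", str(depth), "--single-branch", "--", url, dest]


def shallow_clone(url: str, dest: str) -> None:
    """Run the clone; raises CalledProcessError (with git's stderr attached) on failure."""
    subprocess.run(build_clone_command(url, dest), check=True, capture_output=True, text=True)


def list_source_files(root: str) -> list[str]:
    """Walk the tree and return sorted, POSIX-style relative paths, skipping .git."""
    files = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into .git.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            files.append((Path(dirpath) / name).relative_to(root).as_posix())
    return sorted(files)
```

Capturing git's stderr keeps the error output available for the API logs when a clone fails.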

Zip/archive upload

Users can also upload a zip or tar archive. The upload flow:

  1. app/uploads/service.py handles the multipart form upload.
  2. app/storage/archive.py extracts the archive safely, enforcing the limits below.
  3. The extracted source directory becomes the snapshot.

Git upload (programmatic)

app/git_upload/service.py handles a special case: creating a temporary git remote that clients can git push to. This is used for programmatic or CI-driven workflows where you want to push code rather than specify a URL.

Archive safety

Archive extraction is a security boundary. A malicious user could craft an archive that:

  • Extracts files outside the intended directory (path traversal, e.g., ../../etc/passwd)
  • Contains millions of tiny files that exhaust disk space (zip bomb)
  • Has a huge uncompressed-to-compressed ratio (zip bomb variant)

app/storage/archive.py enforces these limits:

Setting                               Default      What it prevents
VEGA_MAX_SOURCE_BYTES                 2 GB         Archives larger than this are rejected before extraction
VEGA_MAX_ARCHIVE_ENTRIES              50,000       Archives with too many files are rejected
VEGA_MAX_ARCHIVE_FILE_BYTES           (per file)   Individual files larger than this are rejected
VEGA_MAX_ARCHIVE_UNCOMPRESSED_BYTES   5 GB         Total extracted size cap
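
The limits above could be checked against a zip's metadata before anything is extracted, roughly as in the sketch below. ArchiveLimits and check_zip_limits are hypothetical names, not the API of app/storage/archive.py, and the caller is assumed to supply the limit values from the settings.

```python
import io
import zipfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveLimits:
    """Hypothetical container for the four limits; values come from the caller."""
    max_source_bytes: int
    max_entries: int
    max_file_bytes: int
    max_uncompressed_bytes: int


def check_zip_limits(data: bytes, limits: ArchiveLimits) -> None:
    """Raise ValueError if the zip violates a limit. Reads metadata only."""
    # Reject oversized uploads before opening the archive at all.
    if len(data) > limits.max_source_bytes:
        raise ValueError("archive exceeds the maximum source size")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        if len(infos) > limits.max_entries:
            raise ValueError("archive has too many entries")
        total = 0
        for info in infos:
            # file_size is the declared uncompressed size of the entry.
            if info.file_size > limits.max_file_bytes:
                raise ValueError(f"entry too large: {info.filename}")
            total += info.file_size
            if total > limits.max_uncompressed_bytes:
                raise ValueError("total uncompressed size exceeds the cap")
```

The sizes checked here are the ones declared in the archive's own headers, so an extractor would presumably also count bytes as it writes them rather than trust the metadata alone.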

Path traversal is detected and rejected before any file is written.
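
One common way to detect traversal is to resolve every entry name against the extraction root and reject the archive if any resolved path falls outside it. The sketch below assumes that approach; resolve_member and validate_members are hypothetical names, and the real check in app/storage/archive.py may differ.

```python
import os


def resolve_member(dest_root: str, member_name: str) -> str:
    """Return the absolute target path for an archive entry, or raise ValueError."""
    root = os.path.realpath(dest_root)
    # Treat backslashes as separators too, since some archives are built on Windows.
    normalized = member_name.replace("\\", "/")
    if normalized.startswith("/"):
        raise ValueError(f"absolute path in archive: {member_name!r}")
    # realpath collapses ".." segments and follows symlinks that already exist.
    target = os.path.realpath(os.path.join(root, normalized))
    if os.path.commonpath([root, target]) != root:
        raise ValueError(f"path escapes extraction root: {member_name!r}")
    return target


def validate_members(dest_root: str, member_names: list[str]) -> list[str]:
    """Check every entry up front, so nothing is written if any entry is unsafe."""
    return [resolve_member(dest_root, name) for name in member_names]
```

Validating the full entry list before extraction starts is what allows a bad archive to be rejected without leaving partial output on disk.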

Snapshot storage

After ingest, the source is stored as an immutable snapshot. Where the snapshot lives depends on the storage backend:

Local storage (default): snapshots are stored in directories under data/snapshots/. The snapshot path is recorded in the repository record. Scans access source directly from this path.

S3 storage (production): snapshots are uploaded as zip archives to the S3 source bucket. The S3 object key is stored in the repository record. Runner tasks download the snapshot from S3 before scanning.

The key backend settings:

# Local storage (default)
VEGA_FILE_STORAGE_BACKEND=local

# S3 storage (production)
VEGA_FILE_STORAGE_BACKEND=s3
VEGA_S3_SOURCE_BUCKET=vega-prod-source-abc123
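
A minimal sketch of how these two settings might be read and validated at startup. StorageSettings and load_storage_settings are hypothetical names for illustration; only the environment variable names and the local default come from the settings above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class StorageSettings:
    """Hypothetical settings object for the snapshot storage backend."""
    backend: str
    s3_source_bucket: Optional[str] = None


def load_storage_settings(env: Mapping[str, str] = os.environ) -> StorageSettings:
    """Read the backend choice, defaulting to local storage."""
    backend = env.get("VEGA_FILE_STORAGE_BACKEND", "local").strip().lower()
    if backend == "local":
        return StorageSettings(backend="local")
    if backend == "s3":
        bucket = env.get("VEGA_S3_SOURCE_BUCKET", "").strip()
        # Fail early: an S3 backend without a bucket cannot store snapshots.
        if not bucket:
            raise ValueError("VEGA_S3_SOURCE_BUCKET must be set when the backend is s3")
        return StorageSettings(backend="s3", s3_source_bucket=bucket)
    raise ValueError(f"unknown VEGA_FILE_STORAGE_BACKEND: {backend!r}")
```

Failing at startup on a missing bucket name surfaces the misconfiguration before the first snapshot upload is attempted.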

Debugging ingest failures

Git clone failing:

  1. Check that the URL is correct and accessible from the machine running the API.
  2. For private repos, check whether git credentials are configured.
  3. Check API logs for git error output.

Archive upload failing:

  1. Check whether the upload exceeded any of the size limits above.
  2. Look for path traversal errors in the API logs. An archive that contains ../ paths is rejected.
  3. Check that the data/uploads/ directory is writable (local) or the S3 bucket is accessible (production).

Snapshot upload to S3 failing:

  1. Confirm VEGA_FILE_STORAGE_BACKEND=s3.
  2. Confirm VEGA_S3_SOURCE_BUCKET is set to the correct bucket name.
  3. Confirm the API task role has s3:PutObject on the bucket.
  4. Check API logs for S3 client errors.