Skip to content

fix(scanner): wrap untrusted repo content in prompt isolation tags#226

Open
21lakshh wants to merge 4 commits into
XortexAI:mainfrom
21lakshh:main
Open

fix(scanner): wrap untrusted repo content in prompt isolation tags#226
21lakshh wants to merge 4 commits into
XortexAI:mainfrom
21lakshh:main

Conversation

@21lakshh
Copy link
Copy Markdown

@21lakshh 21lakshh commented Jun 2, 2026

Summary

Fixes indirect prompt injection vulnerabilities in repository enrichment prompts by isolating untrusted repository content inside <untrusted_code> tags and reinforcing model instructions before generation.

Motivation / Problem

Repository-controlled content such as raw_code, docstring, and symbol_list could inject instructions into enrichment prompts and influence downstream LLM behavior during indexing.

This change adds structural prompt isolation protections to prevent repository content from being interpreted as executable instructions.

Closes #224

Changes

  • Added _escape_untrusted() helper to neutralize embedded </untrusted_code> tag escape attempts

  • Wrapped all repo-controlled fields inside <untrusted_code> isolation blocks:

    • raw_code
    • docstring
    • signature
    • qualified_name
    • symbol_list
    • file_path
  • Updated both _SYMBOL_PROMPT and _FILE_PROMPT

  • Moved scanner-controlled metadata (language, symbol_type, symbol_count) into trusted prompt context

  • Added explicit pre-instructions telling the model to treat tagged content as inert data

  • Added reinforce instructions after untrusted content using a sandwich-pattern defense

  • Added prompt isolation tests for:

    • injected payload containment
    • tag escape prevention
    • reinforce instruction placement
  • Added integration-style coverage for enrichment write paths and failure handling

  • Preserved repository fidelity without regex stripping or code mutation

Testing

  • Unit tests added / updated (pytest tests/unit)
  • Integration tests pass (pytest tests/integration)
  • Tested manually — steps below:
pytest tests/unit/test_enricher.py
pytest tests/integration

Additional verification

  • Verified injection payloads in raw_code and docstring remain fully contained inside <untrusted_code> tags

  • Verified _SYMBOL_PROMPT and _FILE_PROMPT both include reinforce instructions

  • Verified:

    • MongoDB write path
    • Pinecone write path
    • Neo4j write path
    • empty LLM output early-return handling
    • 4,000-character truncation
    • max_symbols cap handling
    • LLM error recording
    • close() delegation

Screenshots / recordings (if UI change)

N/A

Checklist

  • My PR title follows [Conventional Commits](https://www.conventionalcommits.org/) (fix(security): harden enrichment prompts against indirect injection)
  • I ran ruff check . and black --check . locally with no errors
  • I updated CHANGELOG.md if this is a user-visible change
  • I ran uv lock if I modified pyproject.toml
  • Security-sensitive files modified? Pinged @ishaanxgupta or @ved015

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces prompt isolation in src/scanner/enricher.py by wrapping untrusted repository content inside <untrusted_code> tags, and adds comprehensive unit tests in tests/unit/test_enricher.py to verify this behavior. The review feedback highlights a high-severity vulnerability where untrusted content containing the literal </untrusted_code> tag can escape the isolation block, and recommends sanitizing inputs to prevent tag escaping. Additionally, the reviewer suggests adding a test case to cover this specific tag-escaping injection scenario.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/scanner/enricher.py Outdated
Comment thread tests/unit/test_enricher.py
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 2, 2026

Greptile Summary

This PR hardens enrichment prompts against indirect prompt injection by wrapping all repo-controlled fields (raw_code, docstring, signature, qualified_name, file_path, symbol_list) inside <untrusted_code> isolation blocks, adding a _escape_untrusted() helper to neutralize tag-breakout attempts, and introducing _allowlist() to gate enum fields (language, symbol_type) that appear in the trusted instruction area.

  • _escape_untrusted() correctly replaces both </untrusted_code> and <untrusted_code> with backslash-escaped variants; the ordering (close-tag first, then open-tag) is sound because the two replacements are independent of each other.
  • _allowlist() ensures symbol_type and language in the trusted preamble can only be known Phase-1 values, and falls back to safe defaults; the allowlists are correctly documented against their Phase 1 sources.
  • Both prompts implement the sandwich pattern: pre-instruction before the isolation block and a reinforce instruction immediately after </untrusted_code>, before the Summary: marker.

Confidence Score: 5/5

Safe to merge; the isolation logic is correct and all repo-controlled fields are properly escaped and wrapped before reaching the LLM.

The escape function correctly neutralises both tag forms in a single pass, the allowlists gate the two trusted-area enum fields against their Phase-1 sources, both prompt templates implement the pre/post sandwich pattern, and the new test suite exercises injection containment, tag-escape attempts, and write-path isolation end to end. No incorrect behaviour was found on any enrichment path.

No files require special attention; the single suggestion in enricher.py is a minor defence-in-depth improvement, not a blocking concern.

Important Files Changed

Filename Overview
src/scanner/enricher.py Adds _escape_untrusted(), _allowlist(), tag constants, and allowlists; rewrites both prompt templates to isolate all repo-controlled fields inside untrusted_code blocks with sandwich-pattern reinforce instructions; updates _enrich_one_symbol and _enrich_one_file call sites to apply escaping and allowlisting before format().
tests/unit/test_enricher.py New test file with 30+ tests covering prompt isolation, tag-escape prevention, allowlist filtering, injection containment in both symbol and file enrichment paths, truncation, empty-LLM-response early-return, neo4j failure isolation, and enrich_repo stats/cap behaviour.

Sequence Diagram

sequenceDiagram
    participant MongoDB as MongoDB (Phase 1 data)
    participant Enricher as Enricher
    participant Escape as _escape_untrusted()
    participant Allowlist as _allowlist()
    participant Prompt as Prompt Builder
    participant LLM as LLM

    MongoDB->>Enricher: raw_code, docstring, signature, qualified_name, file_path, symbol_list
    MongoDB->>Enricher: language, symbol_type (enum fields)

    Enricher->>Escape: repo-controlled strings
    Escape-->>Enricher: neutralised (close/open tags escaped)

    Enricher->>Allowlist: language / symbol_type
    Allowlist-->>Enricher: allowlisted value or safe default

    Enricher->>Prompt: escaped values inside untrusted block + allowlisted enums in trusted preamble

    Note over Prompt: Trusted preamble (symbol_type, language)<br/>Pre-instruction: treat block as inert<br/>untrusted_code block: qualified_name, signature, docstring, raw_code<br/>Reinforce instruction after closing tag

    Prompt->>LLM: fully constructed prompt
    LLM-->>Enricher: summary string
    Enricher->>MongoDB: update_symbol_summary / update_file_summary
Loading

Fix All in Cursor Fix All in Codex Fix All in Claude Code

Reviews (4): Last reviewed commit: "fix(scanner): escape opening tag to clos..." | Re-trigger Greptile

Comment thread src/scanner/enricher.py
Comment thread tests/unit/test_enricher.py
Comment thread src/scanner/enricher.py
@21lakshh
Copy link
Copy Markdown
Author

21lakshh commented Jun 2, 2026

@ishaanxgupta looks good you can merge it now

@ishaanxgupta
Copy link
Copy Markdown
Member

Hi @21lakshh please have a look on the greptile suggestions once

@21lakshh
Copy link
Copy Markdown
Author

21lakshh commented Jun 3, 2026

@ishaanxgupta done, thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Security] Indirect Prompt Injection in Scanner Enrichment Pipeline

2 participants