fix(scanner): wrap untrusted repo content in prompt isolation tags by 21lakshh · Pull Request #226 · XortexAI/XMem

21lakshh · 2026-06-02T16:36:59Z

Summary

Fixes indirect prompt injection vulnerabilities in repository enrichment prompts by isolating untrusted repository content inside <untrusted_code> tags and reinforcing model instructions before generation.

Motivation / Problem

Repository-controlled content such as raw_code, docstring, and symbol_list could inject instructions into enrichment prompts and influence downstream LLM behavior during indexing.

This change adds structural prompt isolation protections to prevent repository content from being interpreted as executable instructions.

Closes #224

Changes

Added _escape_untrusted() helper to neutralize embedded </untrusted_code> tag escape attempts
Wrapped all repo-controlled fields inside <untrusted_code> isolation blocks:
- raw_code
- docstring
- signature
- qualified_name
- symbol_list
- file_path
Updated both _SYMBOL_PROMPT and _FILE_PROMPT
Moved scanner-controlled metadata (language, symbol_type, symbol_count) into trusted prompt context
Added explicit pre-instructions telling the model to treat tagged content as inert data
Added reinforce instructions after untrusted content using a sandwich-pattern defense
Added prompt isolation tests for:
- injected payload containment
- tag escape prevention
- reinforce instruction placement
Added integration-style coverage for enrichment write paths and failure handling
Preserved repository fidelity without regex stripping or code mutation

Testing

Unit tests added / updated (pytest tests/unit)
Integration tests pass (pytest tests/integration)
Tested manually — steps below:

pytest tests/unit/test_enricher.py
pytest tests/integration

Additional verification

Verified injection payloads in raw_code and docstring remain fully contained inside <untrusted_code> tags
Verified _SYMBOL_PROMPT and _FILE_PROMPT both include reinforce instructions
Verified:
- MongoDB write path
- Pinecone write path
- Neo4j write path
- empty LLM output early-return handling
- 4,000-character truncation
- max_symbols cap handling
- LLM error recording
- close() delegation

Screenshots / recordings (if UI change)

N/A

Checklist

My PR title follows [Conventional Commits](https://www.conventionalcommits.org/) (fix(security): harden enrichment prompts against indirect injection)
I ran ruff check . and black --check . locally with no errors
I updated CHANGELOG.md if this is a user-visible change
I ran uv lock if I modified pyproject.toml
Security-sensitive files modified? Pinged @ishaanxgupta or @ved015

gemini-code-assist

Code Review

This pull request introduces prompt isolation in src/scanner/enricher.py by wrapping untrusted repository content inside <untrusted_code> tags, and adds comprehensive unit tests in tests/unit/test_enricher.py to verify this behavior. The review feedback highlights a high-severity vulnerability where untrusted content containing the literal </untrusted_code> tag can escape the isolation block, and recommends sanitizing inputs to prevent tag escaping. Additionally, the reviewer suggests adding a test case to cover this specific tag-escaping injection scenario.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

greptile-apps · 2026-06-02T16:39:39Z

Greptile Summary

This PR hardens enrichment prompts against indirect prompt injection by wrapping all repo-controlled fields (raw_code, docstring, signature, qualified_name, file_path, symbol_list) inside <untrusted_code> isolation blocks, adding a _escape_untrusted() helper to neutralize tag-breakout attempts, and introducing _allowlist() to gate enum fields (language, symbol_type) that appear in the trusted instruction area.

_escape_untrusted() correctly replaces both </untrusted_code> and <untrusted_code> with backslash-escaped variants; the ordering (close-tag first, then open-tag) is sound because the two replacements are independent of each other.
_allowlist() ensures symbol_type and language in the trusted preamble can only be known Phase-1 values, and falls back to safe defaults; the allowlists are correctly documented against their Phase 1 sources.
Both prompts implement the sandwich pattern: pre-instruction before the isolation block and a reinforce instruction immediately after </untrusted_code>, before the Summary: marker.

Confidence Score: 5/5

Safe to merge; the isolation logic is correct and all repo-controlled fields are properly escaped and wrapped before reaching the LLM.

The escape function correctly neutralises both tag forms in a single pass, the allowlists gate the two trusted-area enum fields against their Phase-1 sources, both prompt templates implement the pre/post sandwich pattern, and the new test suite exercises injection containment, tag-escape attempts, and write-path isolation end to end. No incorrect behaviour was found on any enrichment path.

No files require special attention; the single suggestion in enricher.py is a minor defence-in-depth improvement, not a blocking concern.

Important Files Changed

Filename	Overview
src/scanner/enricher.py	Adds _escape_untrusted(), _allowlist(), tag constants, and allowlists; rewrites both prompt templates to isolate all repo-controlled fields inside untrusted_code blocks with sandwich-pattern reinforce instructions; updates _enrich_one_symbol and _enrich_one_file call sites to apply escaping and allowlisting before format().
tests/unit/test_enricher.py	New test file with 30+ tests covering prompt isolation, tag-escape prevention, allowlist filtering, injection containment in both symbol and file enrichment paths, truncation, empty-LLM-response early-return, neo4j failure isolation, and enrich_repo stats/cap behaviour.

Sequence Diagram

sequenceDiagram
    participant MongoDB as MongoDB (Phase 1 data)
    participant Enricher as Enricher
    participant Escape as _escape_untrusted()
    participant Allowlist as _allowlist()
    participant Prompt as Prompt Builder
    participant LLM as LLM

    MongoDB->>Enricher: raw_code, docstring, signature, qualified_name, file_path, symbol_list
    MongoDB->>Enricher: language, symbol_type (enum fields)

    Enricher->>Escape: repo-controlled strings
    Escape-->>Enricher: neutralised (close/open tags escaped)

    Enricher->>Allowlist: language / symbol_type
    Allowlist-->>Enricher: allowlisted value or safe default

    Enricher->>Prompt: escaped values inside untrusted block + allowlisted enums in trusted preamble

    Note over Prompt: Trusted preamble (symbol_type, language)<br/>Pre-instruction: treat block as inert<br/>untrusted_code block: qualified_name, signature, docstring, raw_code<br/>Reinforce instruction after closing tag

    Prompt->>LLM: fully constructed prompt
    LLM-->>Enricher: summary string
    Enricher->>MongoDB: update_symbol_summary / update_file_summary

_{Reviews (4): Last reviewed commit: "fix(scanner): escape opening tag to clos..." | Re-trigger Greptile}

21lakshh · 2026-06-02T17:24:48Z

@ishaanxgupta looks good you can merge it now

ishaanxgupta · 2026-06-03T02:32:54Z

Hi @21lakshh please have a look on the greptile suggestions once

…olation

21lakshh · 2026-06-03T04:34:09Z

@ishaanxgupta done, thanks!!

fix(scanner): wrap untrusted repo content in prompt isolation tags

cd9327e

21lakshh requested review from ishaanxgupta and ved015 as code owners June 2, 2026 16:37

github-actions Bot added tests scanner labels Jun 2, 2026

gemini-code-assist Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/scanner/enricher.py Outdated

Comment thread tests/unit/test_enricher.py

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/scanner/enricher.py

Comment thread tests/unit/test_enricher.py

fix(scanner): isolate untrusted repo content in enricher prompts

1893806

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/scanner/enricher.py

21lakshh added 2 commits June 3, 2026 09:44

fix(scanner): allowlist symbol_type and language before prompt insertion

13d7057

fix(scanner): escape opening tag to close nesting attack in prompt is…

23cdcc3

…olation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(scanner): wrap untrusted repo content in prompt isolation tags#226

fix(scanner): wrap untrusted repo content in prompt isolation tags#226
21lakshh wants to merge 4 commits into
XortexAI:mainfrom
21lakshh:main

21lakshh commented Jun 2, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

21lakshh commented Jun 2, 2026

Uh oh!

ishaanxgupta commented Jun 3, 2026

Uh oh!

21lakshh commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

21lakshh commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Problem

Changes

Testing

Additional verification

Screenshots / recordings (if UI change)

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

21lakshh commented Jun 2, 2026

Uh oh!

ishaanxgupta commented Jun 3, 2026

Uh oh!

21lakshh commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

21lakshh commented Jun 2, 2026 •

edited

Loading

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading