Skip to content

New: [AEA-6581] - llm evaluation poc#565

Open
bencegadanyi1-nhs wants to merge 4 commits intomainfrom
AEA-6581-deepeval-poc
Open

New: [AEA-6581] - llm evaluation poc#565
bencegadanyi1-nhs wants to merge 4 commits intomainfrom
AEA-6581-deepeval-poc

Conversation

@bencegadanyi1-nhs
Copy link
Copy Markdown
Contributor

Summary

  • ✨ New Feature

Details

Copilot AI review requested due to automatic review settings April 29, 2026 11:13
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a proof-of-concept LLM/RAG evaluation suite (DeepEval-based) for the EPS chatbot and wires a smoke run into the PR deployment workflow to provide early quality signals.

Changes:

  • Introduces packages/ragasEvaluation with DeepEval tests, fixtures, and a Bedrock-backed judge model plus a Lambda/KB client.
  • Adds a new Poetry dependency group (ragasEvaluation) and updates the lockfile accordingly.
  • Adds make eval-smoke / make eval-full targets and runs smoke evaluation in the PR release workflow.

Reviewed changes

Copilot reviewed 11 out of 14 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pyproject.toml Adds ragasEvaluation dependency group; bumps urllib3 patch.
poetry.lock Regenerated lockfile to include DeepEval + transitive deps and new group membership.
packages/ragasEvaluation/tests/test_chatbot_eval.py Smoke/full DeepEval test suite driving live chatbot calls and metric evaluation.
packages/ragasEvaluation/tests/conftest.py Session bootstrap + Bedrock judge fixture with xdist-safe caching.
packages/ragasEvaluation/test_cases.json Defines evaluation prompts and ground-truth expectations (incl. smoke subset).
packages/ragasEvaluation/pytest.ini Pytest defaults for the evaluation suite (incl. xdist settings).
packages/ragasEvaluation/evaluation/chatbot.py Direct Lambda invocation + KB retrieve contexts for metric inputs.
packages/ragasEvaluation/evaluation/bedrock_judge.py Custom DeepEval judge model using Bedrock converse.
packages/ragasEvaluation/.deepeval/.deepeval_telemetry.txt Adds DeepEval telemetry artefact to repo.
Makefile Adds eval-smoke and eval-full targets.
.grype.yaml Adds vulnerability ignore list.
.github/workflows/release_all_stacks.yml Adds PR-only “Chatbot RAG Evaluation” job running smoke evaluation.

python_files = test_*.py
python_functions = test_*
pythonpath = .
addopts = -v --tb=short -n 4
Comment on lines +28 to +45
def pytest_sessionstart(session: pytest.Session) -> None:
"""Resolve Lambda name and KB ID once before any tests run.

Uses a file lock so only the first xdist worker calls CloudFormation;
the rest read cached values from a shared JSON file.
"""
# Use a stable path so all workers (separate processes) share it.
cache_file = Path(tempfile.gettempdir()) / "eval_bootstrap_cache.json"
lock_file = Path(tempfile.gettempdir()) / "eval_bootstrap_cache.json.lock"

with FileLock(str(lock_file)):
if cache_file.is_file():
data = json.loads(cache_file.read_text())
os.environ["_EVAL_LAMBDA_NAME"] = data["lambda_name"]
os.environ["_EVAL_KB_ID"] = data["kb_id"]
else:
bootstrap()
cache_file.write_text(
Comment thread .grype.yaml
Comment on lines +2 to +8
- vulnerability: GHSA-38jv-5279-wg99
- vulnerability: GHSA-vfmq-68hx-4jfw
- vulnerability: GHSA-p423-j2cm-9vmq
- vulnerability: GHSA-58qw-9mgm-455v
- vulnerability: GHSA-r6ph-v2qm-q3c2
- vulnerability: GHSA-6w46-j5rx-g56g
- vulnerability: GHSA-gc5v-m9x4-r6x2
Comment on lines +213 to +254
chatbot_evaluation:
name: Chatbot RAG Evaluation
runs-on: ubuntu-22.04
container:
image: ${{ inputs.pinned_image }}
options: --user 1001:1001 --group-add 128
defaults:
run:
shell: bash
if: ${{ always() && !failure() && !cancelled() && inputs.IS_PULL_REQUEST == true
}}
needs: [ release_all_code ]
permissions:
id-token: write
contents: read
steps:
- name: copy .tool-versions
run: |
cp /home/vscode/.tool-versions "$HOME/.tool-versions"

- name: Checkout code
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd
with:
persist-credentials: false

- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
with:
aws-region: eu-west-2
role-to-assume: ${{ secrets.DEV_CLOUD_FORMATION_EXECUTE_LAMBDA_ROLE }}
role-session-name: eps-assist-me-evaluation

- name: Install dependencies
run: |
make install-python

- name: Run smoke evaluation
env:
CHATBOT_STACK_NAME: ${{ inputs.STACK_NAME }}
AWS_REGION: eu-west-2
run: |
make eval-smoke
Comment on lines +1 to +4
DEEPEVAL_ID=d0395a29-36cf-4018-afbd-c2834073dfcc
DEEPEVAL_STATUS=old
DEEPEVAL_LAST_FEATURE=evaluation
DEEPEVAL_EVALUATION_STATUS=old
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants