FIX: Stable random sampling in DatasetConfiguration #1697
Conversation
Memoize `get_seed_groups()` and `get_all_seeds()` so the random subset selected when `max_dataset_size` is set is stable for the lifetime of the configuration. Reassigning `max_dataset_size` invalidates the cache. Without this, baseline and strategy atomic attacks each call `get_all_seed_attack_groups()` independently and receive different random subsets of objectives, making baseline-vs-strategy comparison meaningless. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```python
self._scenario_strategies = scenario_strategies
self._resolved_groups_cache: Optional[dict[str, list[SeedGroup]]] = None
self._resolved_seeds_cache: Optional[list[Seed]] = None
self._max_dataset_size: Optional[int] = None
```
Could we simplify this?

Instead of a cache, what if we added a baseline scenario technique that is just `PromptSending`? We would get rid of this in `initialize`:

```python
if self._include_baseline:
    baseline_attack = self._get_baseline()
    self._atomic_attacks.insert(0, baseline_attack)
```

and

```python
def _get_baseline(self) -> AtomicAttack:
```

and instead add a tag in `_get_attack_technique_factories` that adds a `PromptSending` technique as the baseline. `_build_display_group` would also likely need to be updated to support the baseline.

There might be some hiccups, but it feels like a more natural place to include it, as an additional technique rather than trying to cache the datasets.
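A rough sketch of the proposed shape. All names below (`AtomicAttack`, `prompt_sending_baseline`, `get_attack_technique_factories`, the `"baseline"` tag) are illustrative stand-ins, not PyRIT's actual API:

```python
# Hypothetical sketch of the reviewer's proposal: register the baseline as one
# more technique factory instead of special-casing it in initialize().
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AtomicAttack:  # simplified stand-in for PyRIT's AtomicAttack
    name: str
    objectives: list[str]
    tags: set[str] = field(default_factory=set)


def prompt_sending_baseline(objectives: list[str]) -> AtomicAttack:
    # The baseline is just a plain PromptSending pass over the same objectives.
    return AtomicAttack(name="baseline", objectives=objectives, tags={"baseline"})


def get_attack_technique_factories(
    include_baseline: bool,
) -> list[Callable[[list[str]], AtomicAttack]]:
    factories: list[Callable[[list[str]], AtomicAttack]] = []
    if include_baseline:
        # Baseline goes first so it runs before the strategy techniques.
        factories.append(prompt_sending_baseline)
    # ... strategy factories would be appended here ...
    return factories


objectives = ["obj-a", "obj-b"]
attacks = [f(objectives) for f in get_attack_technique_factories(include_baseline=True)]
print(attacks[0].name)  # baseline
```

Because every factory receives the same `objectives` list, the baseline and the strategy attacks would see one population by construction, which is the appeal of this design.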
I like this design change, and I think it is the right direction. My only concern is with doing this instead of the caching/memoization. Many of our scenarios never call `_get_attack_technique_factories`, which means migrating those to the factory pattern. I can certainly add those changes here, but going forward `EncodingDatasetConfiguration.get_all_seed_attack_groups()` still makes its own call to `random.sample`, which would bypass the factory loop and reintroduce the issue. I think making both changes here makes sense; I just don't want to increase scope and leave the underlying cause of the bug latent.

I could certainly be misreading the underlying architecture, so feel free to push back on my framing of the issue if the baseline change alone would be sufficient to resolve this bug.
Pull request overview
This PR fixes a correctness issue in scenario dataset sampling: when max_dataset_size is set (and a scenario includes the default baseline), baseline and strategy atomic attacks now evaluate the same sampled objective population by memoizing the resolved random sample for the lifetime of a DatasetConfiguration.
Changes:
- Memoize `DatasetConfiguration.get_seed_groups()` and `get_all_seeds()` results, with cache invalidation when `max_dataset_size` is reassigned and defensive container copies returned to callers.
- Add unit/regression tests covering sampling stability across both resolution paths, cache invalidation, and baseline-vs-strategy objective alignment (including Encoding's override path).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `pyrit/scenario/core/dataset_configuration.py` | Adds memoization + a `max_dataset_size` property setter to ensure lifetime-stable sampling and cache invalidation. |
| `tests/unit/scenario/test_dataset_configuration.py` | Adds targeted tests for memoization behavior, defensive copy semantics, and setter-driven cache invalidation/validation. |
| `tests/unit/scenario/test_encoding.py` | Adds regression test ensuring Encoding's `get_all_seed_attack_groups()` remains stable across calls when sampling is capped. |
| `tests/unit/scenario/test_scenario.py` | Adds end-to-end regression test asserting baseline and strategy atomic attacks share the same sampled objectives. |
Comments suppressed due to low confidence (1)
pyrit/scenario/core/dataset_configuration.py:59
- The updated type annotations use `Optional[...]` (e.g., for `seed_groups`, `max_dataset_size`, the caches, and the `max_dataset_size` property). The repo style guide requires PEP 604 union syntax (`T | None`) instead of `Optional[T]` for Python 3.10+; please update the touched annotations accordingly (and you can likely drop the `Optional` import afterward).
```python
def __init__(
    self,
    *,
    seed_groups: Optional[list[SeedGroup]] = None,
    dataset_names: Optional[list[str]] = None,
    max_dataset_size: Optional[int] = None,
    scenario_strategies: Optional[Sequence[ScenarioStrategy]] = None,
) -> None:
```
```python
        mutation of the dict or per-dataset lists.
```

```python
        Subclasses can override this to filter or customize which seed groups
        are loaded based on the stored scenario_composites.
```

```diff
@@ -266,6 +310,9 @@ def get_all_seeds(self) -> list[Seed]:
     if self._dataset_names is None:
         raise ValueError("No dataset names configured. Set dataset_names to use get_all_seed_prompts.")
```

```python
        Returns:
            List[SeedPrompt]: List of SeedPrompt objects from all configured datasets.
            Returns an empty list if no prompts are found.
```
Description
When a `Scenario` runs with `include_default_baseline=True` and a `DatasetConfiguration` whose `max_dataset_size` is set, the baseline atomic attack ended up evaluating a different random subset of objectives than the strategy-based atomic attacks. Baseline-vs-strategy success-rate comparisons measured two different populations and were meaningless.

Root cause: `random.sample` ran fresh on every call to `DatasetConfiguration.get_seed_groups()` (Path 1, used by most scenarios) and `get_all_seeds()` (Path 2, used by `EncodingDatasetConfiguration`). `Scenario._get_atomic_attacks_async` and `Scenario._get_baseline_data` each called these methods independently and got different samples.

Fix: memoize both methods. The resolved sample is cached for the lifetime of the configuration object, and reassigning `max_dataset_size` invalidates the cache. Returns are defensive container copies so callers can mutate without poisoning the cache. `max_dataset_size` is now a property whose setter re-validates the value (mirroring `__init__`). Subclasses inherit the fix automatically when they use the base resolution methods. A short subclassing note in the class docstring flags the two methods that any future override must memoize itself.
Tests and Documentation

- `TestDatasetConfigurationMemoization` and `TestDatasetConfigurationMaxDatasetSizeSetter` classes in `test_dataset_configuration.py` cover both call paths, multi-dataset stability, cache invalidation, setter validation, and defensive-copy semantics. All randomness-sensitive tests patch `random.sample` for determinism.
- A regression test in `test_encoding.py` covers the Encoding override (which routes through `get_all_seeds`; this is why both paths needed memoization).
- An end-to-end regression test in `test_scenario.py` asserts `set(baseline.objectives) == set(strategy.objectives)` after `initialize_async` with `max_dataset_size` set.
- Verified by stashing the production change and watching the new tests fail (7 failures), then restoring and watching them pass.