Skip to content

gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#148089

Open
cocolato wants to merge 33 commits intopython:mainfrom
cocolato:jit-tracer-fitness
Open

gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#148089
cocolato wants to merge 33 commits intopython:mainfrom
cocolato:jit-tracer-fitness

Conversation

@cocolato
Copy link
Copy Markdown
Member

@cocolato cocolato commented Apr 4, 2026

@cocolato

This comment was marked as outdated.

@cocolato
Copy link
Copy Markdown
Member Author

cocolato commented Apr 6, 2026

It appears that the current parameters do not yet guarantee runtime safety; I will continue to work on fixes and optimizations.

@markshannon
Copy link
Copy Markdown
Member

I've commented on the issue #146073 (comment)

@cocolato

This comment was marked as outdated.

@cocolato
Copy link
Copy Markdown
Member Author

@markshannon @Fidget-Spinner gentle ping, I’d like to hear your suggestions about the current parameters

Copy link
Copy Markdown
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've a few comments, mostly broad ideas and suggestions, rather than anything that needs fixing.

I think we should merge this soon. We can tweak the parameters later as we refine our ideas.

Comment thread Include/internal/pycore_optimizer.h Outdated
* Higher = trace is more willing to stop here. */
#define EXIT_QUALITY_CLOSE_LOOP (FITNESS_INITIAL / 2)
#define EXIT_QUALITY_ENTER_EXECUTOR (FITNESS_INITIAL * 3 / 8)
#define EXIT_QUALITY_DEFAULT (FITNESS_INITIAL / 8)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd increase this to make sure that the fitness cannot drop from above EXIT_QUALITY_DEFAULT to below EXIT_QUALITY_SPECIALIZABLE in a single uop.

Comment thread Include/internal/pycore_optimizer.h Outdated
* N_BACKWARD_SLACK more bytecodes before reaching EXIT_QUALITY_CLOSE_LOOP,
* based on AVG_SLOTS_PER_INSTRUCTION. */
#define N_BACKWARD_SLACK 50
#define FITNESS_BACKWARD_EDGE (FITNESS_INITIAL - EXIT_QUALITY_CLOSE_LOOP \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A backwards edge isn't necessarily bad, although too many suggests a poor trace.

Maybe don't penalize the first backwards edge much, but penalize subsequent ones a lot.
What we really want is to avoid unrolling a loop that doesn't include the trace start.

It would be easier to reason about, if we used a simpler calculation for the value.

Comment thread Include/internal/pycore_optimizer.h Outdated

/* Backward edge penalty for JUMP_BACKWARD_NO_INTERRUPT (coroutines/yield-from).
* Smaller than FITNESS_BACKWARD_EDGE since these loops are very short. */
#define FITNESS_BACKWARD_EDGE_COROUTINE (FITNESS_BACKWARD_EDGE / 4)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not the length of the loop that matter. We want to include yields in generators in the traces of the loop that calls them, whether that is a yield from loop or a for loop doesn't much matter.

Comment thread Python/bytecodes.c
Comment thread Python/optimizer.c Outdated
Comment thread Python/optimizer.c
Comment thread Python/optimizer.c Outdated
@cocolato
Copy link
Copy Markdown
Member Author

cocolato commented Apr 14, 2026

@markshannon Thanks for review! I'm holding off on changing the fitness parameters for now, but I can run some benchmarks if you think it's necessary.

@Fidget-Spinner
Copy link
Copy Markdown
Member

Fidget-Spinner commented Apr 14, 2026

Still seeing a big slowdown in richards on https://github.com/colesbury/fastmark:

Main of this branch:

Benchmark                     Time      Useful Work
richards                      115.9 ms      (100%)
richards_super                103.5 ms      (100%)

This branch:

Benchmark                     Time      Useful Work
richards                      119.2 ms      (100%)
richards_super                118.5 ms      (100%)

I'm going to check if this is affecting the optimizer somehow.

Treat back edges as an exit, not a penalty, this way they are more likely to end at a backedge instead of ending at random spots
Comment thread Python/optimizer.c
@markshannon
Copy link
Copy Markdown
Member

Increasing the max trace length is only going to help if the trace is stopping too early.
How big are the traces on main for richards?

@cocolato

This comment was marked as outdated.

@cocolato

This comment was marked as outdated.

@Fidget-Spinner
Copy link
Copy Markdown
Member

Fidget-Spinner commented Apr 14, 2026

I think we can safely reduce the max trace length, let me do that.

There were two problems with the older code:

  1. Branch penalty was treated as before instruction count, when it should be treated as the sum over the expected trace length.
  2. Treating non-closing JUMP_BACKWARD as penalty rather than exit quality seems to make it such that we stop traces at a certain offset after seeing a JUMP_BACKWARD. This isn't what we want. instead, we want to treat it as an exit that is worth stopping at (to increase the likelihood of linking to another trace).

New code has almost no slowdown on Richards, and a huge speedup on telco benchmark.

Benchmark                     Time      Useful Work
richards                      113.4 ms      (100%)
deltablue                     196.4 ms      (100%)
raytrace                      267.3 ms      (100%)
nbody                         150.6 ms      (100%)
go                            111.5 ms      (100%)
telco                        3517.8 ms      (100%)

Main:

Benchmark                     Time      Useful Work
richards                      111.6 ms      (100%)
deltablue                     192.2 ms      (100%)
raytrace                      270.4 ms      (100%)
nbody                         151.1 ms      (100%)
go                            112.6 ms      (100%)
telco                        3809.7 ms      (100%)

Copy link
Copy Markdown
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments.

There are a few cases where we are still special casing some situations that fitness should handle and can be removed.

Comment thread Include/internal/pycore_optimizer.h Outdated

/* Exit quality thresholds: trace stops when fitness < exit_quality.
* Higher = trace is more willing to stop here. */
#define EXIT_QUALITY_CLOSE_LOOP (FITNESS_INITIAL)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#define EXIT_QUALITY_CLOSE_LOOP (FITNESS_INITIAL)
#define EXIT_QUALITY_CLOSE_LOOP (FITNESS_INITIAL - AVG_SLOTS_PER_INSTRUCTION*4)

FITNESS_INITIAL is too high a value for this, but not by much. We want to unroll tiny loops a bit and, more importantly, we don't want to special case the start instruction to avoid zero length traces.

* N_BACKWARD_SLACK more bytecodes before reaching EXIT_QUALITY_CLOSE_LOOP,
* based on AVG_SLOTS_PER_INSTRUCTION. */
#define N_BACKWARD_SLACK 50
#define EXIT_QUALITY_BACKWARD_EDGE (EXIT_QUALITY_CLOSE_LOOP / 2 - N_BACKWARD_SLACK * AVG_SLOTS_PER_INSTRUCTION)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE:

The problem here is that when tracing loops, we are treating the start of the loop as the closing point, but we want to stop at the end of the loop otherwise.
We probably need to make the back edge quality calculation a bit more complex.

  • if the jump is to the loop closing point: exit_quality = 0 (to ensure loop is closed)
  • otherwise: exit_quality = high ~(FITNESS - 10 * AVG_SLOTS_PER_INSTRUCTION)

(this can be fixed in a separate PR if would complicate this PR too much)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to do this in next PR.

Comment thread Include/internal/pycore_optimizer.h Outdated

/* Backward edge penalty for JUMP_BACKWARD_NO_INTERRUPT (coroutines/yield-from).
* Smaller than FITNESS_BACKWARD_EDGE since we want to trace through them. */
#define EXIT_QUALITY_BACKWARD_EDGE_COROUTINE (EXIT_QUALITY_BACKWARD_EDGE / 8)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we treating these backward edges differently? They may be in smaller loops, but N_BACKWARD_SLACK already handles that.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will help tracer through groutines short loops.

Comment thread Include/internal/pycore_optimizer.h Outdated
/* Heuristic backward-edge exit quality: leave room for about 1 unroll and
* N_BACKWARD_SLACK more bytecodes before reaching EXIT_QUALITY_CLOSE_LOOP,
* based on AVG_SLOTS_PER_INSTRUCTION. */
#define N_BACKWARD_SLACK 50
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#define N_BACKWARD_SLACK 50
#define N_BACKWARD_SLACK 10

50 seems way too high to me. What is the rationale for so high a number?

Comment thread Python/optimizer.c Outdated
{
// We need to check for this, otherwise the first instruction (JUMP_BACKWARD usually)
// is mistakenly thought of as an exit.
if (uop_buffer_length((_PyJitUopBuffer *)&tracer->code_buffer) > CODE_SIZE_NO_PROGRESS) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (uop_buffer_length((_PyJitUopBuffer *)&tracer->code_buffer) > CODE_SIZE_NO_PROGRESS) {

It should be impossible for fitness < quality should be impossible for traces this short.

Comment thread Python/optimizer.c Outdated
// We need to check for this, otherwise the first instruction (JUMP_BACKWARD usually)
// is mistakenly thought of as an exit.
if (uop_buffer_length((_PyJitUopBuffer *)&tracer->code_buffer) > CODE_SIZE_NO_PROGRESS) {
if (target_instr == tracer->initial_state.start_instr ||
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (target_instr == tracer->initial_state.start_instr ||
if (target_instr == tracer->initial_state.close_loop_instr) {

For non-loop traces the start and loop close are the same. For loop traces (starting on a backwards edge) we don't want to stop on the back-edge, but at the top of the loop.

Comment thread Python/optimizer.c Outdated
target_instr == tracer->initial_state.close_loop_instr) {
return EXIT_QUALITY_CLOSE_LOOP;
}
else if (target_instr->op.code == ENTER_EXECUTOR && !_PyJit_EnterExecutorShouldStopTracing(opcode)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else if (target_instr->op.code == ENTER_EXECUTOR && !_PyJit_EnterExecutorShouldStopTracing(opcode)) {
else if (target_instr->op.code == ENTER_EXECUTOR) {

The fitness should handle this. If fitness is high, we will continue tracing. If it is getting lower, then we want to stop at the ENTER_EXECUTOR to join up with an existing trace.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No there's an exception to this: we don't want to treat ENTER_EXECUTORS caused by RESUME as a EXIT_QUALITY_ENTER_EXECUTOR, and instead treat them as a default one. Stopping at RESUME forms small, fragmented, loop traces, which I've previously documented in my RESUME tracing PR as I saw actual slowdowns from it.

This is what _PyJit_EnterExecutorShouldStopTracing says:

    // Continue tracing (skip over the executor). If it's a RESUME
    // trace to form longer, more optimizeable traces.
    // We want to trace over RESUME traces. Otherwise, functions with lots of RESUME
    // end up with many fragmented traces which perform badly.
    // See for example, the richards benchmark in pyperformance.
    // For consideration: We may want to consider tracing over side traces
    // inserted into bytecode as well in the future.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree.
The whole point of fitness/quality is remove these ad-hoc special cases.

The reason you were seeing many fragmented traces before was that we didn't have a principled way to do this.

We won't see lots of small fragmented traces because, as I said, the fitness will be high for short traces and will exceed EXIT_QUALITY_ENTER_EXECUTOR

Copy link
Copy Markdown
Member

@Fidget-Spinner Fidget-Spinner Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a benchmark. It's 0.5% slower applying this change on fastmark.

Can we introduce EXIT_QUALITY_ENTER_EXECUTOR_RESUME? We need to differentiate the following ENTER_EXECUTORS (they have different qualities in reality):

  1. ENTER_EXECUTOR due to JUMP_BACKWARD (best).
  2. ENTER_EXECUTOR due to progress or is_control_flow (decent)
  3. ENTER_EXECUTOR due to RESUME (worst).

@bedevere-app
Copy link
Copy Markdown

bedevere-app bot commented Apr 15, 2026

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants