[fix] CLIPTextModel with transformers >= 5.6 and from_single_file#13843

Merged
DN6 merged 5 commits into main from clip-text-encoder-transformers-fix on Jun 1, 2026

Member

@asomoza asomoza commented May 30, 2026

What does this PR do?

This PR fixes an issue that occurs when using from_single_file with CLIPTextModel on recent transformers versions. The issue prevents apps and libraries from pinning transformers to more recent versions, even when they otherwise support them.

Tested with an older version of transformers without the fix and with a newer one (the latest as of today) with the fix, using SD 1.5 and SD 2.1. The images are identical.

SD 1.5

older without fix: [image: issue_13833_sd15_baseline_old_transformers]
newer with fix: [image: issue_13833_sd15_fixed_new_transformers]

SD 2.1

older without fix: [image: issue_13833_sd21_baseline_old_transformers]
newer with fix: [image: issue_13833_sd21_fixed_new_transformers]
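
The PR does not say how the images were compared. One simple way to confirm that two saved outputs match is to compare file digests, sketched below. The helper name and the file paths are hypothetical and not part of the PR. Note that this checks byte-for-byte equality of the files, which is stricter than pixel equality (differing metadata would make it fail).

```python
import hashlib
from pathlib import Path


def files_are_identical(path_a, path_b):
    """Return True if both files have the same SHA-256 digest."""
    digest_a = hashlib.sha256(Path(path_a).read_bytes()).hexdigest()
    digest_b = hashlib.sha256(Path(path_b).read_bytes()).hexdigest()
    return digest_a == digest_b


# Hypothetical usage, assuming both runs saved their output as PNG files:
# files_are_identical("sd15_old_transformers.png", "sd15_new_transformers.png")
```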

fixes #13833

Who can review?

@kappacommit

@github-actions github-actions Bot added fixes-issue single-file size/S PR with diff < 50 LOC and removed fixes-issue labels May 30, 2026
@asomoza
Member Author

asomoza commented May 30, 2026

@askserge review this PR

Contributor

@github-actions github-actions Bot left a comment

🤗 Serge says:

Clean, correct fix for CLIPTextModel compatibility with transformers ≥5.6 where the text_model wrapper was flattened.

Correctness

  • The hasattr(model, "text_model") check correctly distinguishes between the old (CLIPTextModel with wrapper) and new (flattened) layouts, while leaving CLIPTextModelWithProjection (which still has text_model) unaffected.
  • Both convert_ldm_clip_checkpoint and convert_open_clip_checkpoint produce keys prefixed with text_model. in the diffusers format. The removeprefix("text_model.") call correctly strips this for the flattened model, and is a no-op for keys that don't carry the prefix (e.g. text_projection.weight).
  • removeprefix (Python 3.9+) is already used elsewhere in the codebase.
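
The logic described in these points can be illustrated with a minimal, self-contained sketch. This is not the code from the PR diff: the function name `remap_text_encoder_keys` and the two stand-in classes are hypothetical, and the real change operates on actual CLIP model instances inside the single-file loading path.

```python
def remap_text_encoder_keys(model, state_dict):
    """Adapt converted keys to the layout of `model` (hypothetical helper).

    If the model still exposes a `text_model` wrapper, the keys are kept.
    Otherwise the "text_model." prefix is stripped; keys without the
    prefix pass through unchanged because removeprefix is then a no-op.
    """
    if hasattr(model, "text_model"):
        return dict(state_dict)
    return {
        key.removeprefix("text_model."): value  # requires Python 3.9+
        for key, value in state_dict.items()
    }


class WrappedEncoder:
    """Stand-in for a model that keeps the `text_model` wrapper."""

    def __init__(self):
        self.text_model = object()


class FlatEncoder:
    """Stand-in for a flattened model with no `text_model` attribute."""


# Placeholder values stand in for tensors.
converted = {
    "text_model.embeddings.token_embedding.weight": 1,
    "text_projection.weight": 2,
}

wrapped_keys = sorted(remap_text_encoder_keys(WrappedEncoder(), converted))
flat_keys = sorted(remap_text_encoder_keys(FlatEncoder(), converted))
```

With the wrapped stand-in the keys come back unchanged. With the flattened stand-in the first key loses its `text_model.` prefix, while `text_projection.weight` is left as-is.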

No issues found. LGTM.

13 LLM turns · 14 tool calls · 66.3s · 148568 in / 2630 out tokens

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ghunkins
Contributor

Thanks for the work @asomoza, would love to see this on main 🙏

@DN6 DN6 merged commit b95637a into main Jun 1, 2026
15 of 16 checks passed
@asomoza asomoza deleted the clip-text-encoder-transformers-fix branch June 1, 2026 17:10

Development

Successfully merging this pull request may close these issues.

from_single_file: CLIPTextModel has no attribute 'text_model' with transformers >= 5.6

4 participants