feat: unified rank, rich centroid naming, siphon command #4

Merged
isaac merged 47 commits from unify-rank-siphon into main 2026-05-21 03:59:36 +00:00

Summary

Unified rank + siphon feature branch for aesthetic_scorer.py.

New features

  • Unified rank_folder — single function, all 3 modes auto-detected
  • Rich centroid naming — extract --name → siglip2_2026-05-20_name.pt with metadata
  • siphon N target — copy top N%, manifest CSV, --dry-run/--flatten/--move
  • list command — simple filename listing
  • --name for rank — matches by filename with extension
  • Percentile stretch — text-only and combined modes linearly remap text scores from [p5, p95] → [0.1, 0.9], so scores that cluster in a narrow band are spread across a wider range
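
The percentile stretch in the last bullet can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code in aesthetic_scorer.py: the function names `percentile` and `percentile_stretch`, the linear-interpolation percentile, the clamping of the tails to [0, 1], and the pass-through for identical scores are all assumptions.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a pre-sorted list, q in [0, 100]."""
    if not sorted_vals:
        raise ValueError("percentile of an empty list is undefined")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac


def percentile_stretch(scores, lo_q=5, hi_q=95, lo_out=0.1, hi_out=0.9):
    """Linearly map [p5, p95] of `scores` onto [0.1, 0.9]."""
    ordered = sorted(scores)
    p_lo = percentile(ordered, lo_q)
    p_hi = percentile(ordered, hi_q)
    if p_hi <= p_lo:
        # Assumption: identical scores carry no ranking signal, so pass through.
        return list(scores)
    scale = (hi_out - lo_out) / (p_hi - p_lo)
    # Assumption: values outside [p5, p95] are extrapolated, then clamped to [0, 1].
    return [min(1.0, max(0.0, lo_out + (s - p_lo) * scale)) for s in scores]


if __name__ == "__main__":
    # Text scores bunched between 0.50 and 0.60 end up spread over most of [0, 1].
    clustered = [0.50 + 0.001 * i for i in range(101)]
    stretched = percentile_stretch(clustered)
    print(min(stretched), max(stretched))
```

Because the mapping is monotonic, it should change only how far apart the scores sit, not their order.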

Bug fixes

  • extract_features duplicate batch enqueue (last_enqueued_end guard)
  • extract_features empty batch pbar update
  • text-only rank crashes on pos-only/neg-only prompts
  • list command path check (skipped for list/test)
  • Dead code removed

Testing

python aesthetic_scorer.py test passes

lily added 48 commits 2026-05-21 03:57:38 +00:00
- Replace C-RADIO with SigLIP2 ViT-gopt-16-SigLIP2-384 for all commands
- rank-text: one embed per image, taste cosine + text cosine averaged raw
- extract: produces models/siglip_centroid.pt (1024-dim SigLIP2 centroid)
- Tokenizer: padding=max_length, max_length=64, explicit .lower() on all text
- Breaking: remove C-RADIO extractor, percentile rank, --resolution flag

Co-Authored-By: Claude <noreply@anthropic.com>
rank_text had the same per-image loop that score_with_text had.
Now both use:
  pos_sim = img_feats @ pos_embeds.T   # (B, M) matmul
  avg_pos = pos_sim.mean(dim=1)          # O(B) not O(B*M)
Pre-landing review found critical resource leaks:
- extract_features: pbar and executor leaked if encode_images raised
- rank_text: pbar leaked on CUDA OOM (executor was closed, pbar was not)
- rank_folder: same pbar leak on OOM

Also removed unused imports: math and Dict.

The try/except/raise pattern in extract_features now properly closes both
resources before re-raising. OOM handlers in rank_text/rank_folder
now also call pbar.close() before raising.
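
A rough sketch of the cleanup pattern described above, under stated assumptions: `FakeBar` stands in for the progress bar, `process_batches` and its `encode` callback are hypothetical names, and `executor.submit(list, batch)` is a placeholder for image loading. It is not the code from extract_features.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeBar:
    """Hypothetical stand-in for a progress bar with update() and close()."""

    def __init__(self):
        self.closed = False

    def update(self, n):
        pass

    def close(self):
        self.closed = True


def process_batches(batches, encode, pbar, executor):
    """Encode each batch; release pbar and executor on any failure."""
    results = []
    try:
        for batch in batches:
            future = executor.submit(list, batch)  # placeholder for loading images
            results.append(encode(future.result()))
            pbar.update(len(batch))
    except Exception:
        # Close both resources before re-raising, whatever the error type was.
        pbar.close()
        executor.shutdown(wait=False)
        raise
    pbar.close()
    executor.shutdown(wait=True)
    return results


if __name__ == "__main__":
    bar = FakeBar()
    pool = ThreadPoolExecutor(max_workers=1)
    print(process_batches([[1, 2], [3, 4]], sum, bar, pool), bar.closed)
```

A `try`/`finally` block would give the same guarantee with less duplication; the version here mirrors the try/except/raise shape the commit message names.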
The difference avg_pos - avg_neg lies in [-2, 2], but it was passed directly to
cosine_to_score, which maps [-1, 1] → [0, 1], so scores could fall outside [0, 1].
Divide by 2 first so the full [-2, 2] range maps correctly to [0, 1].
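
The arithmetic can be checked with a small sketch. The body of `cosine_to_score` is not shown in this PR, so the linear form `(cos + 1) / 2` below is an assumption, and `text_score` is a hypothetical helper name.

```python
def cosine_to_score(cos):
    """Map a cosine in [-1, 1] to a score in [0, 1] (assumed linear mapping)."""
    return (cos + 1.0) / 2.0


def text_score(avg_pos, avg_neg):
    """Score from mean positive and mean negative prompt similarity."""
    diff = avg_pos - avg_neg            # each term is in [-1, 1], so diff is in [-2, 2]
    return cosine_to_score(diff / 2.0)  # halve first so the input is back in [-1, 1]


if __name__ == "__main__":
    print(cosine_to_score(1.0 - (-1.0)))  # unnormalized difference: 1.5, out of range
    print(text_score(1.0, -1.0))          # normalized: 1.0
    print(text_score(0.3, 0.3))           # equal pull from both sides: 0.5
```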
Before: 1 future in-flight → GPU waits for next batch's load to finish
After: 2 futures queued → next batch is always already loaded when GPU finishes

Applies to extract_features, rank_text, and rank_folder.
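
As an illustration of keeping two loads in flight, here is a minimal sketch using only the standard library. `load_batch`, `run_prefetched`, and the `depth` parameter are hypothetical names, and the upper-casing merely stands in for reading and decoding images.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def load_batch(paths):
    """Placeholder for reading and decoding one batch of images."""
    return [p.upper() for p in paths]


def run_prefetched(paths, batch_size, depth=2):
    """Consume batches in order while keeping up to `depth` loads queued."""
    starts = iter(range(0, len(paths), batch_size))
    pending = deque()
    processed = []
    with ThreadPoolExecutor(max_workers=depth) as executor:

        def enqueue():
            start = next(starts, None)
            if start is not None:
                chunk = paths[start:start + batch_size]
                pending.append(executor.submit(load_batch, chunk))

        for _ in range(depth):  # prime the queue with `depth` futures
            enqueue()
        while pending:
            batch = pending.popleft().result()
            enqueue()                # refill before the (simulated) GPU step
            processed.extend(batch)  # placeholder for the GPU encode
    return processed


if __name__ == "__main__":
    print(run_prefetched(["a", "b", "c", "d", "e"], batch_size=2))
```

With `depth=1` the consumer has to wait for each load to start only after the previous batch was taken; with `depth=2` the following load is already running while the current batch is processed.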
The narrow OOM-only catch leaked pbar and executor on any other
exception (RuntimeError, CUDA errors, open_clip internal errors).
Changed to a broad `except Exception: ... raise`, matching extract_features.
Fixes re-run and edge cases where scored files could be counted alongside originals.
The enqueue logic used batch_end to compute next_batch_start, but batch_end
doesn't change between iterations when multiple entries with the same batch_end
exist in the queue. This caused the same next batch to be re-enqueued repeatedly.
Fix: track last_enqueued_end to only enqueue batches that haven't been added yet.
- Track last_enqueued_end to prevent re-enqueueing same batch
- Use batch_start + batch_size instead of batch_end for next start
- Fix pbar.update on empty batch to use valid_batch_paths not batch_paths
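
The guard can be sketched without any threading by recording which batch offsets get enqueued. This is a simplified model, not the code from extract_features: the `schedule` function and its `depth` parameter are hypothetical, and only the `last_enqueued_end` idea and the `batch_start + batch_size` step come from the commit messages above.

```python
from collections import deque


def schedule(total, batch_size, depth=2):
    """Return batch start offsets in the order they are enqueued."""
    enqueued = []
    pending = deque()
    last_enqueued_end = 0  # first offset that has not been queued yet

    def top_up():
        nonlocal last_enqueued_end
        # Only offsets at or past last_enqueued_end are eligible, so a batch
        # that is already sitting in the queue can never be added again.
        while len(pending) < depth and last_enqueued_end < total:
            batch_start = last_enqueued_end
            pending.append(batch_start)
            enqueued.append(batch_start)
            last_enqueued_end = batch_start + batch_size

    top_up()
    while pending:
        pending.popleft()  # the oldest batch "completes"
        top_up()
    return enqueued


if __name__ == "__main__":
    print(schedule(total=10, batch_size=4))  # [0, 4, 8]
```

Deriving the next start from a monotonically advancing counter, rather than from whichever queue entry happens to be inspected, is what rules out duplicates in this model.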
isaac merged commit 670b8b77cc into main 2026-05-21 03:59:36 +00:00
Reference
isaac/isaac-image-scoring!4