Python 83.2%
PowerShell 16.1%
Shell 0.7%

Isaac e727e0331e Merge feature/ensemble-scorer into main 10 commits: - ensemble: SigLIP2 + DINOv3 fused centroid mode - diverse command (farthest point sampling) - diverse --dedupe THRESHOLD (near-duplicate stripping) - scoring: recognize .jfif extension - ensemble: DINOv3 512 -> 768 default resolution - rank_folder: rename files with non-score bracket prefixes (don't skip)		2026-07-20 14:54:36 -04:00
docs	feat: full implementation — unified rank, rich naming, siphon	2026-05-20 22:37:01 -04:00
PLANS	docs: default DINOv3 to ViT-7B (biggest model)	2026-05-24 21:37:36 -04:00
test	fix: FEATURE_DIM=1024→1536, open_clip ViT-gopt-16-SigLIP2-384 outputs 1536-dim	2026-05-19 14:46:17 -04:00
utils	video_frames: default worker count to half of CPU cores	2026-05-24 00:11:54 -04:00
.gitignore	Restructure model artifacts into models/ directory and update README	2026-05-10 12:59:31 -04:00
aesthetic_scorer.py	rank_folder: rename files with non-score bracket prefixes instead of skipping	2026-07-20 14:52:24 -04:00
aesthetic_scorer.py.bak	siphon: change from percent to score threshold (e.g. --siphon 0.85)	2026-05-21 20:11:05 -04:00
aesthetic_scorer.py.main	feat: full implementation — unified rank, rich naming, siphon	2026-05-20 22:37:01 -04:00
aesthetic_scorer.py.siglip	feat: full implementation — unified rank, rich naming, siphon	2026-05-20 22:37:01 -04:00
CHANGELOG.md	scoring: recognize .jfif as a valid image extension	2026-06-08 23:42:59 -04:00
rank.ps1	Merge remote-tracking branch 'origin/main' into unify-rank-siphon	2026-05-20 23:59:05 -04:00
README.md	ensemble: bump DINOv3 default resolution 512 -> 768	2026-06-08 21:42:10 -04:00
requirements.txt	Fix percentile_rank O(n²), add OOM handlers, fix rename race	2026-05-17 19:52:03 -04:00
run_kpop_index.ps1	add diverse command — farthest point sampling for N most dissimilar images	2026-07-05 21:38:05 -04:00
setup.sh	fix: add venv availability check with correct apt package name	2026-05-17 12:38:53 -04:00
TESTING.md	Initial implementation: aesthetic image scorer	2026-05-09 17:41:00 -04:00
train.ps1	fafds	2026-05-15 15:51:21 -04:00
VERSION	chore: version bump 0.0.0.3 → 0.0.1.0	2026-05-19 15:23:19 -04:00

README.md

Aesthetic Scorer

A personal aesthetic image scorer. Load ~1000 curated favorite images, extract features with SigLIP2 or DINOv3, compute a centroid, then score any new image in milliseconds.

Supports three backends: SigLIP2 (taste + text scoring), DINOv3 (taste-only, state-of-the-art features), and Ensemble (SigLIP2 + DINOv3 fused into a single 2560-dim centroid).

Setup

# Install dependencies
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install transformers safetensors pillow pillow-heif open-clip-torch

# SigLIP2 downloads on first run (~2GB, cached locally)
# DINOv3 downloads on first run (~14GB for ViT-7B)

Requirements:

Python 3.11+
CUDA 12
PyTorch 2.1+
24GB VRAM (recommended; DINOv3 ViT-7B fits with int4 quantization)

Commands

`extract` — Build your centroid

# SigLIP2 (default)
python aesthetic_scorer.py extract "~/Photos/Favorites" --name "my-taste"

# DINOv3 (ViT-7B/16, the default and largest model)
python aesthetic_scorer.py extract "~/Photos/Favorites" --model dinov3 --name "my-taste"

# DINOv3 with a smaller variant
python aesthetic_scorer.py extract "~/Photos/Favorites" --model dinov3 \
  --model-id facebook/dinov3-vitb16-pretrain-lvd1689m --name "my-taste"

`score` — Score a single image

# Uses the newest centroid automatically (or specify one with --centroid)
python aesthetic_scorer.py score "~/Photos/image.jpg"          # → 0.7324

# Score with a named centroid
python aesthetic_scorer.py score "~/Photos/image.jpg" --name "my-taste"

# Score with DINOv3 centroid
python aesthetic_scorer.py score "~/Photos/image.jpg" --model dinov3 --name "my-taste"

`rank` — Rank a folder of images

# Taste scoring with SigLIP2 centroid (default backend)
python aesthetic_scorer.py rank "~/Photos/ToScore" --centroid models/siglip2_2026-05-24.pt --top 20

# Taste scoring with DINOv3 centroid
python aesthetic_scorer.py rank "~/Photos/ToScore" --model dinov3 --centroid models/dinov3_2026-05-24.pt --top 20

# Rank with a named centroid (auto-discovers model type)
python aesthetic_scorer.py rank "~/Photos/ToScore" --name "my-taste" --top 20

# Rank using text prompts instead of a centroid (SigLIP2 or Ensemble; not DINOv3)
python aesthetic_scorer.py rank "~/Photos/ToScore" \
  --positives "portrait|golden hour" \
  --negatives "blurry|low quality" \
  --top 20 --bottom 5

# Combined taste + text scoring (SigLIP2 or Ensemble; not DINOv3)
python aesthetic_scorer.py rank "~/Photos/ToScore" \
  --centroid models/siglip2_taste.pt \
  --positives "golden hour|sunset" --top 20

# Export ranked results to CSV
python aesthetic_scorer.py rank "~/Photos/ToScore" --csv results.csv --top 100

# Rename files with score prefix (img.jpg → [0.7324]_img.jpg)
python aesthetic_scorer.py rank "~/Photos/ToScore" --rename

# Siphon images above a score threshold to a target folder (dry-run first)
python aesthetic_scorer.py rank "~/Photos/ToScore" --siphon 0.85 --siphon-dest ./curated --dry-run
python aesthetic_scorer.py rank "~/Photos/ToScore" --siphon 0.85 --siphon-dest ./curated --move --flatten
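
The `--rename` and `--siphon` behavior above can be illustrated with a minimal sketch. This is not the project's code: `scored_name`, `siphon`, and the `SCORE_PREFIX` pattern are hypothetical names, and the sketch assumes that an existing score prefix is replaced, that other bracket prefixes are kept, and that the threshold is inclusive.

```python
import re
from pathlib import Path

# Assumed shape of an existing score prefix, e.g. "[0.7324]_".
SCORE_PREFIX = re.compile(r"^\[\d\.\d{4}\]_")


def scored_name(path: Path, score: float) -> Path:
    """Return `path` with a "[score]_" prefix, replacing any earlier score prefix."""
    base = SCORE_PREFIX.sub("", path.name)
    return path.with_name(f"[{score:.4f}]_{base}")


def siphon(scores: dict[Path, float], threshold: float) -> list[Path]:
    """Return the paths scoring at or above `threshold`, best first."""
    keep = [p for p, s in scores.items() if s >= threshold]
    return sorted(keep, key=lambda p: scores[p], reverse=True)
```

For example, `scored_name(Path("img.jpg"), 0.7324)` yields `[0.7324]_img.jpg`, and a file named `[draft]_img.jpg` keeps its non-score bracket prefix behind the new score.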

`list` — List available centroids

python aesthetic_scorer.py list

`test` — Self-test

python aesthetic_scorer.py test    # no GPU needed

Backend Selection

| Flag | Backend | Text Scoring | Feature Dim | Default Model |
| --- | --- | --- | --- | --- |
| `--model siglip2` | SigLIP2 (open_clip) | ✅ | 1536 | ViT-gopt-16-SigLIP2-384 |
| `--model dinov3` | DINOv3 (transformers) | ❌ | 4096 | ViT-7B/16 |
| `--model ensemble` | SigLIP2 + DINOv3 (fused) | ✅ (text scored in SigLIP2 space) | 2560 | SigLIP2 gopt-16 @ 384 + DINOv3 ViT-L/16 @ 768 |

Text prompts (--positives/--negatives) work with SigLIP2 and Ensemble. With Ensemble, text scoring runs in plain SigLIP2 space: the SigLIP2 sub-scorer's text encoder produces 1536-dim embeddings, and the SigLIP2 half of the fused image feature (computed in the same forward pass as the ensemble taste encoding) is dotted against them — no zero-padding tricks, no re-encoding. Taste scoring still uses the full 2560-dim fused representation. Using text prompts with DINOv3 raises a clear error.
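
As a rough illustration of text scoring in Ensemble mode, the sketch below slices the SigLIP2 part out of a fused feature and dots it against a text embedding. It assumes the SigLIP2 half occupies the first 1536 entries of the fused vector, which the text above does not state. `text_similarity` is a hypothetical helper, not the project's API.

```python
import math

SIGLIP_DIM = 1536  # SigLIP2 feature width, per the backend table


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def text_similarity(fused_image, text_embedding, siglip_dim=SIGLIP_DIM):
    """Cosine similarity between the SigLIP2 slice of a fused feature and a text embedding."""
    if len(text_embedding) != siglip_dim:
        raise ValueError("text embedding must live in SigLIP2 space")
    # ASSUMPTION: the SigLIP2 half comes first in the fused vector.
    image_half = l2_normalize(fused_image[:siglip_dim])
    return dot(image_half, l2_normalize(text_embedding))
```

Because only a slice is taken, the DINOv3 part of the fused feature never touches the text embedding, so no padding or second forward pass is needed.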

The DINOv3 variant can be selected with --model-id:

facebook/dinov3-vit7b16-pretrain-lvd1689m (default for --model dinov3, 7B params)
facebook/dinov3-vitl16-pretrain-lvd1689m (default for --model ensemble, 304M params, 768²)
facebook/dinov3-vitb16-pretrain-lvd1689m (86M params)
facebook/dinov3-vits16-pretrain-lvd1689m (22M params)

Ensemble mode

The ensemble fuses SigLIP2 ViT-gopt-16-384 (1536-dim, image-text aligned, locked to 384²) with DINOv3 ViT-L/16 (1024-dim, self-supervised, default 768²) by L2-normalized concatenation, giving a 2560-dim fused centroid scored with a single cosine similarity. Defaults are picked for the sweet spot of feature quality vs. VRAM (~8GB total) — override the DINOv3 sub-scorer with --model-id and --resolution:

# Extract an ensemble centroid
python aesthetic_scorer.py extract "~/Photos/Favorites" --model ensemble --name "my-taste"

# Score with the ensemble
python aesthetic_scorer.py rank "~/Photos/ToScore" --model ensemble --name "my-taste" --top 20
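
The fusion step can be sketched in a few lines of plain Python. This is an illustrative reading of "L2-normalized concatenation" that assumes each half is normalized on its own before the two are joined. `fuse` and `cosine` are hypothetical names, and real features would be tensors rather than lists.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(siglip_feat, dino_feat):
    """Join the two normalized features (1536 + 1024 = 2560 dims with the defaults)."""
    # ASSUMPTION: each half is normalized separately, then concatenated.
    return l2_normalize(siglip_feat) + l2_normalize(dino_feat)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Under that assumption, both halves contribute unit length to the fused vector, so neither backbone dominates the cosine just because its raw activations are larger.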

Model Artifacts

All model files are stored in models/ (gitignored):

| File | Description |
| --- | --- |
| `models/siglip2_YYYY-MM-DD_name.pt` | SigLIP2 centroid + metadata |
| `models/dinov3_YYYY-MM-DD_name.pt` | DINOv3 centroid + metadata |
| `models/ensemble_YYYY-MM-DD_name.pt` | SigLIP2 + DINOv3 ensemble centroid + metadata |

Centroids include model_type in metadata. Loading a SigLIP2 centroid with DINOv3 (or vice versa) raises a clear error.
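
A load-time check of this kind might look like the sketch below. The metadata keys `model_type` and `feature_dim` come from this README. The function name, the exception class, and the error wording are hypothetical.

```python
class CentroidMismatchError(ValueError):
    """Raised when a centroid does not match the active backend (hypothetical name)."""


def validate_centroid(meta: dict, expected_model: str, expected_dim: int) -> None:
    """Raise if the centroid metadata disagrees with the backend about to use it."""
    found_model = meta.get("model_type")
    found_dim = meta.get("feature_dim")
    if found_model != expected_model:
        raise CentroidMismatchError(
            f"centroid was built with {found_model!r}, "
            f"but the active backend is {expected_model!r}"
        )
    if found_dim != expected_dim:
        raise CentroidMismatchError(
            f"centroid has {found_dim} dims, backend produces {expected_dim}"
        )
```

Checking the dimension as well as the model type also catches a centroid built with a different DINOv3 variant, since the variants differ in feature width.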

Architecture

SigLIP2: ViT-gopt-16-SigLIP2-384 (open_clip_torch, 1.87B params, 1536-dim features). Resolution: 384×384.
DINOv3: ViT-7B/16 via HuggingFace transformers (7B params, 4096-dim features). Resolution: 518×518.
Scoring: Features → L2-normalize → cosine to centroid → [0, 1]
Text scoring (SigLIP2 and Ensemble): Text-image cosine similarity; when combined with taste scoring, both scores are percentile-ranked and then multiplied
Centroid: Streaming mean, O(1) memory — processes arbitrary dataset sizes
Centroid validation: model_type and feature_dim stored in metadata, validated on load
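
Two of the items above, the streaming mean and the percentile-ranked combination, can be sketched with the standard library alone. This is a simplified illustration, not the project's implementation: the class and function names are hypothetical, and the handling of ties in `percentile_rank` is an assumption.

```python
from bisect import bisect_right


class StreamingMean:
    """Running mean of feature vectors; memory depends on the dimension, not the image count."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim

    def update(self, feature):
        self.count += 1
        for i, x in enumerate(feature):
            self.mean[i] += (x - self.mean[i]) / self.count


def percentile_rank(values):
    """Map each value to the fraction of values less than or equal to it."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


def combine(taste_scores, text_scores):
    """Percentile-rank both score lists, then multiply them element-wise."""
    taste = percentile_rank(taste_scores)
    text = percentile_rank(text_scores)
    return [a * b for a, b in zip(taste, text)]
```

Ranking before multiplying puts both signals on the same scale, so a narrow spread of raw text similarities does not get drowned out by a wider spread of taste scores.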

Output

All scores in [0, 1] where 1.0 = perfectly typical of your favorites. Higher = more like your training set.
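
To make the score range concrete, here is a minimal sketch of a taste score. Cosine similarity naturally lies in [-1, 1], and this README does not say how it is mapped to [0, 1]. The linear rescale below is an assumption, and `taste_score` is a hypothetical name.

```python
import math


def taste_score(feature, centroid):
    """Cosine similarity between a feature and the centroid, rescaled to [0, 1]."""
    dot = sum(x * y for x, y in zip(feature, centroid))
    nf = math.sqrt(sum(x * x for x in feature))
    nc = math.sqrt(sum(y * y for y in centroid))
    if nf == 0 or nc == 0:
        return 0.0
    cos = max(-1.0, min(1.0, dot / (nf * nc)))  # guard against float drift
    # ASSUMPTION: linear map from [-1, 1] to [0, 1]; the real tool may clamp or rescale differently.
    return (cos + 1.0) / 2.0
```

With this mapping, a feature pointing in the same direction as the centroid scores 1.0, an orthogonal one 0.5, and an opposite one 0.0.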
