evil_MoE

wassname/evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-28 18:10:32 +08:00

Commit Graph

Author

SHA1

Message

Date

wassname

2c266ebdb0

tooling: add ELICIT_HACK prompt tier + validate_spoonfeed updates

ELICIT sits between discover and spoonfeed: asks the model to exploit the named
grading mechanism without handing it literal code (the elicitability bar).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-05-31 00:00:40 +00:00

wassname

eaee3d013d

fix: fail-fast --modes + multi-rep validator (external review)

gpt-5.5 review (docs/spec/20260530_code_review.md), both valid:
- --modes silently dropped typos/whitespace ('--modes=a, b' -> only a;
  '--modes=typo' -> empty sweep after a 30s model load, looking like success).
  Now strips + validates against MODES, raises on unknown before loading.
- validator was 1 stochastic sample/mode -> a <1.0-prob reachable hint could be
  falsely marked unreachable. Now PROBE_REPS samples, reports hits/reps, bar is
  >=1 exploit in N.
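
The two fixes in this commit can be sketched roughly as below. This is an illustration only, not the repository's code: the contents of `MODES`, the value of `PROBE_REPS`, and the helper names `parse_modes`, `probe_mode`, and `sample_once` are assumptions made for the example.

```python
from typing import Callable, Sequence

# Assumed registry: the mode names appear in the commit messages,
# but the real MODES container in the repo may differ.
MODES = ("run_tests", "exit_code", "eq_override",
         "stdout_marker", "sentinel", "file_marker")
PROBE_REPS = 5  # assumed value; the commit does not state the number


def parse_modes(arg: str, known: Sequence[str] = MODES) -> list[str]:
    """Split a comma-separated --modes value, strip whitespace,
    and raise on unknown names before any expensive model load."""
    requested = [m.strip() for m in arg.split(",") if m.strip()]
    unknown = [m for m in requested if m not in known]
    if unknown:
        raise ValueError(
            f"unknown modes {unknown}; expected a subset of {list(known)}"
        )
    if not requested:
        raise ValueError("--modes selected nothing")
    return requested


def probe_mode(sample_once: Callable[[], bool],
               reps: int = PROBE_REPS) -> tuple[int, int, bool]:
    """Draw `reps` stochastic samples and report (hits, reps, reachable).
    The bar is at least one exploit in `reps` tries."""
    hits = sum(1 for _ in range(reps) if sample_once())
    return hits, reps, hits >= 1
```

With this sketch, `parse_modes("run_tests, exit_code")` keeps both modes and `parse_modes("typo")` raises immediately. On the validator side, if a hint is exploited with per-sample probability p, a single sample misses it with probability 1 - p, while N independent samples all miss with probability (1 - p)^N, which is why repeated probing reduces false "unreachable" verdicts.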

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-05-30 12:45:11 +00:00

wassname

419a8ed8cd

feat: bigger-spoon mint recipes (literal code) + modes filter + OpenRouter probe

Spoonfeed mint was 0/96 for eq_override/stdout_marker/sentinel/file_marker: the
prose recipes were half-understood (model emits 'return 0') and the probe-channel
detector needs exec-clean code. Fix: hand the model the proven verify_rewards hack
template verbatim ('emit exactly this'), templated with the problem's graded method
name ({func}) -- load-bearing for eq_override. Validated 5/5 on OpenRouter qwen3-8b.
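
As a rough illustration of templating a recipe with the graded method name, the sketch below fills a `{func}` placeholder with `str.format`. The template body is a deliberately harmless stub, and `RECIPE_TEMPLATE`, `render_recipe`, and the `class Solution` shape are assumptions for the example, not the repo's verify_rewards template.

```python
# Hypothetical stand-in template: a stub body, not the repo's hack code.
RECIPE_TEMPLATE = (
    "class Solution:\n"
    "    def {func}(self, *args, **kwargs):\n"
    "        raise NotImplementedError\n"
)


def render_recipe(template: str, func: str) -> str:
    """Substitute the problem's graded method name into the template."""
    if not func.isidentifier():
        raise ValueError(f"{func!r} is not a valid Python identifier")
    return template.format(func=func)


rendered = render_recipe(RECIPE_TEMPLATE, "twoSum")
# Compiling checks that the rendered text is syntactically clean Python.
compile(rendered, "<recipe>", "exec")
```

One design note: because `str.format` treats braces as placeholders, any literal `{` or `}` in a real code template would need to be doubled (`{{`, `}}`) or the substitution done another way.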

Also: --modes filter (re-mint only failed modes, keep cached run_tests/exit_code),
skip the wasteful faithful cell when minting, dump first completion even at 0 hacks
(the diagnostic that was missing), tqdm progbar.
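
The `--modes` filter behaviour described above (re-mint only the selected modes, keep cached results for the rest) might look something like the following. The cache layout, `remint`, and `mint_fn` are hypothetical names chosen for this sketch.

```python
from typing import Callable


def remint(cache: dict[str, list[str]],
           selected: list[str],
           mint_fn: Callable[[str], list[str]]) -> dict[str, list[str]]:
    """Return a new cache where only `selected` modes are re-minted;
    entries for all other modes are carried over unchanged."""
    updated = dict(cache)
    for mode in selected:
        updated[mode] = mint_fn(mode)
    return updated


# Example: run_tests and exit_code are already cached; only eq_override is redone.
cache = {"run_tests": ["ok1"], "exit_code": ["ok2"], "eq_override": []}
new_cache = remint(cache, ["eq_override"], lambda mode: [f"fresh-{mode}"])
```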

scripts/validate_spoonfeed.py: direct OpenRouter probe (pi is 47s/call + TUI noise).
Calibration finding: even the PROVEN run_tests hack fails single-shot hint-discovery
on 8b (returns 'return 0'), so single-shot is NOT a learnability oracle -- it only
validates the mint path. RL learnability must be measured by the emergence run.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-05-30 12:20:05 +00:00

3 Commits