wassname/Judgemark-v2lp


mirror of https://github.com/wassname/Judgemark-v2lp.git synced 2026-06-27 16:10:14 +08:00


wassname 6e4444be49 tau with alternative less

2025-08-23 08:18:24 +08:00


2025-07-23 23:57:26

using meta-llama_llama-3_2-3b-instruct

aggregated_score_raw Final Judgemark (raw) = 0.088 Final Judgemark (cal) = 0.089

aggregated_score_weighted Final Judgemark (raw) = 0.089 Final Judgemark (cal) = 0.102

aggregated_score_ranked Final Judgemark (raw) = 0.034 Final Judgemark (cal) = 0.030

Hm, maybe I should try without the p value, maybe with a restricted scale. Need to load the saved logp from the run file.

2025-07-25 19:30:21

With deepseek

Normed logp Final Judgemark (raw) = 0.673 Final Judgemark (cal) = 0.736

Weighted Final Judgemark (raw) = 0.635 Final Judgemark (cal) = 0.660

argmax Final Judgemark (raw) = 0.635 Final Judgemark (cal) = 0.659

normal ranked (without stretching)

normed and weighted Final Judgemark (raw) = 0.624 Final Judgemark (cal) = 0.645

name	score	score_norm
ranked_scaled	0.67	0.79
ranked_norm	0.67	0.73
weighted	0.63	0.65
raw	0.63	0.65
weighted_norm	0.62	0.64
ranked	0.33	0.28

UPTO:

Why is ranked low on human calib? It's using Kendall tau against reference scores, so I thought it would be good. It's only using relative order.
Well, I guess ranking is smooth and stable but less pronounced... if we made it more pronounced (**10?, or squared, or softmax), would it be better on all fronts?

Analysis of Ranked Method Performance Issue

Key Implementation Detail - Correcting Misconceptions

Critical Clarification: All methods obtain the full 11-choice probability distribution in a single forward pass - there are no extra queries or argmax operations. The difference is only in how they collapse the 11-dimensional logprob vector:

Single prompt is sent to the judge model
Model sees 11 choice tokens ["0", "1", ..., "10"] (often in JSON format)
All 11 logprobs are requested in the same forward pass: logp_arr = [L₀, L₁, ... L₁₀]
Different methods collapse this vector differently:

ranked path: corr = kendalltau(choices=[0…10], values=logp_arr).correlation, then final = (corr+1)*5. Uses ordering of the 11 log-probs.
weighted path: softmax(logp_arr)-weighted mean: score = Σ( i * exp(Lᵢ)/Σexp(L) )
raw path: pick the token whose text content is parsed as score (greedy decoding)
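
The three collapse paths above can be sketched as follows. This is a minimal illustration, not the repository's code: `kendall_tau_a`, `ranked_score`, `weighted_score`, and `raw_score` are hypothetical names, the tau is a pure-Python tau-a (equal to tau-b when there are no ties; the `res.correlation` in these notes suggests a library routine such as `scipy.stats.kendalltau` is used instead), and greedy decoding is approximated by taking the argmax of the log-probs.

```python
import math

def _sign(v):
    return (v > 0) - (v < 0)

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += _sign(x[i] - x[j]) * _sign(y[i] - y[j])
    return total / (n * (n - 1) / 2)

def ranked_score(logp_arr):
    # Ordering only: tau between choice index and log-prob, mapped to [0, 10].
    choices = list(range(len(logp_arr)))
    corr = kendall_tau_a(choices, logp_arr)
    return (corr + 1) * 5

def weighted_score(logp_arr):
    # Softmax-weighted mean of the choice values (max subtracted for stability).
    m = max(logp_arr)
    weights = [math.exp(lp - m) for lp in logp_arr]
    return sum(i * w for i, w in enumerate(weights)) / sum(weights)

def raw_score(logp_arr):
    # Assumption: greedy decoding picks the most likely choice token.
    return max(range(len(logp_arr)), key=lambda i: logp_arr[i])

# Made-up log-probs peaked near choice 7 (not taken from any run).
logp_arr = [-0.5 * (i - 7.3) ** 2 for i in range(11)]
print(ranked_score(logp_arr), weighted_score(logp_arr), raw_score(logp_arr))
```

Note that the ranked score only reaches 10 when the log-probs increase monotonically from "0" to "10"; a distribution peaked at 7 lands somewhere below that even though the greedy answer is 7.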

Why Ranked Method Fails on Human Correlation

Looking at the full results:

DeepSeek-R1 (Strong Judge):

ranked: stability_calib = 1.0, separability_calib = 0.81, but human_correlation_calib = 0.624
raw: stability_calib = 0.896, separability_calib = 0.652, human_correlation_calib = 0.882

Llama-3.2-3B (Weak Judge):

ranked: stability_calib = 1.0, separability_calib = 0.809, but human_correlation_calib = 0.0
raw: stability_calib = 0.0, separability_calib = 0.081, human_correlation_calib = 0.212

Root Cause Analysis:

Scale Compression: The (correlation+1)*5 transformation maps Kendall's τ from [-1,1] to [0,10], but creates a compressed range around 5.0 for most cases.
Loss of Discriminative Power: Most log probability rankings have moderate positive correlations with ideal [0,1,2,...,10], so scores cluster around 6-8, losing ability to distinguish between good/bad models.
Efficiency vs. Accuracy Trade-off: While ranked method extracts full 11-choice distribution at zero extra cost, its ordering-based collapse may be too conservative for human preference alignment.
Paradox: Ranked method is internally consistent (high stability, separability) but externally invalid (poor human correlation).
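
One way to see what the ordering-based collapse discards: two log-prob vectors with the same ordering but very different sharpness get the same ranked score, while their softmax-weighted means differ. The sketch below uses made-up log-probs and hypothetical helper names; it illustrates the information loss only and does not by itself establish the cause of the low human correlation.

```python
import math

def _sign(v):
    return (v > 0) - (v < 0)

def ranked(logp_arr):
    # (tau-a + 1) * 5 between choice index and log-prob (no ties assumed).
    n = len(logp_arr)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += _sign(j - i) * _sign(logp_arr[j] - logp_arr[i])
    tau = total / (n * (n - 1) / 2)
    return (tau + 1) * 5

def weighted(logp_arr):
    # Softmax-weighted mean of the choice values.
    m = max(logp_arr)
    w = [math.exp(lp - m) for lp in logp_arr]
    return sum(i * wi for i, wi in enumerate(w)) / sum(w)

# Same ordering of the 11 choices, different confidence (made-up values).
broad = [-0.1 * (i - 7.3) ** 2 for i in range(11)]
sharp = [-2.0 * (i - 7.3) ** 2 for i in range(11)]

print("ranked  :", ranked(broad), ranked(sharp))      # identical
print("weighted:", weighted(broad), weighted(sharp))  # differ
```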

Proposed Solutions

The intuition about making it "more pronounced" is correct:

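Quadratic Scaling: decision = ((res.correlation+1)*5) ** 2 / 10 (squares the linear score; range stays [0, 10])
Power Transform: decision = (res.correlation ** 3 + 1) * 5 (range [0, 10]; note that cubing pulls moderate τ toward 5 and only keeps the extremes)
Direct Tau Scaling: decision = res.correlation * 10 + 5 (range [-5, 15], so it allows negative scores and scores above 10)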
Sigmoid Enhancement: Apply sigmoid to stretch middle range

The fundamental issue: linear scaling (τ+1)*5 doesn't preserve enough correlation signal to maintain human alignment while creating stable rankings.
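
To compare the proposed transforms side by side, here is a small sketch that evaluates each one at a few τ values. The function names are hypothetical, and the sigmoid form with steepness `k=4.0` is an assumption, since the notes do not specify one.

```python
import math

def linear(tau):
    return (tau + 1) * 5              # current mapping, [-1, 1] -> [0, 10]

def squared(tau):
    return linear(tau) ** 2 / 10      # quadratic in the linear score, [0, 10]

def cubic(tau):
    return (tau ** 3 + 1) * 5         # shrinks moderate tau toward 5

def direct(tau):
    return tau * 10 + 5               # range [-5, 15]

def sigmoid_stretch(tau, k=4.0):
    # Assumed form: k is a hypothetical steepness parameter.
    return 10 / (1 + math.exp(-k * tau))

for tau in (-1.0, 0.0, 0.5, 1.0):
    print(tau, linear(tau), squared(tau), cubic(tau), direct(tau),
          round(sigmoid_stretch(tau), 3))
```

All of these are monotone in τ, so they keep the per-item ordering of ranked scores. What changes is the spacing between scores, which affects averages over items and therefore the per-model means that go into stability, separability, and human correlation.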

name	judgemark_score	judgemark_score_calib	stability	stability_calib	separability	separability_calib	human_correlation	human_correlation_calib
weighted	0.668	0.716	0.898	0.879	0.558	0.635	0.881	0.876
raw	0.678	0.731	0.896	0.896	0.572	0.652	0.882	0.882
ranked	0.607	0.81	0.896	1.0	0.531	0.81	0.624	0.624
ranked_norm	0.575	0.588	0.897	0.677	0.435	0.513	0.814	0.801
ranked_scaled	0.607	0.81	0.896	1.0	0.531	0.81	0.624	0.624
weighted_norm	0.562	0.551	0.897	0.54	0.424	0.498	0.777	0.775

Run ID: 8b031c01-ac20-4f6d-8838-2d8481c61a55__deepseek_deepseek-r1

name	judgemark_score	judgemark_score_calib	stability_calib	separability	separability_calib	human_correlation	human_correlation_calib
weighted	0.089	0.102	0.0	0.077	0.087	0.225	0.261
raw	0.088	0.089	0.0	0.074	0.081	0.229	0.212
ranked	0.262	0.706	1.0	0.393	0.809	0.0	0.0
ranked_norm	0.065	0.068	0.0	0.053	0.059	0.176	0.17
ranked_scaled	0.262	0.706	1.0	0.393	0.809	0.0	0.0
weighted_norm	0.028	0.046	0.0	0.042	0.061	0.0	0.033

Run ID: ab20b598-845b-4da7-9f4c-56ec05405e28__meta-llama_llama-3_2-3b-instruct

2025-08-20 05:52:06

So I'm still confused why rank is worse. And why norm and scaled perform better in some cases and worse in others.

Look at raw scores into tau:

Processing run 8b031c01-ac20-4f6d-8838-2d8481c61a55__deepseek_deepseek-r1...

RAW weighted k_tau (7.056521739130434, 5.692083333333334, 6.568333333333333, 6.972083333333333, 6.02625, 5.292916666666667, 4.985, 4.421666666666667, 3.9104166666666664, 5.91625, 5.632083333333333, 7.324166666666667, 5.82875, 7.610833333333333, 8.12625, 4.810416666666667, 5.170833333333333) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8823529411764705 calibrated k_tau (6.913741010558691, 4.745233729099284, 6.136213642220673, 6.867702086872284, 5.3350524958555905, 3.9637780431712675, 3.4473454755860238, 2.6185410443378423, 1.9714928350431529, 5.109234456943669, 4.540345540824161, 7.3509579707211765, 4.885496283271738, 7.700297215002195, 8.38722761924037, 3.1579257830331477, 3.8035150809926783) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8823529411764705

raw k_tau (7.2075, 5.891666666666667, 6.645833333333333, 7.17, 5.872083333333333, 5.295416666666667, 4.9225, 4.42375, 3.9130434782608696, 5.767916666666666, 5.505833333333333, 7.350416666666667, 5.902083333333334, 7.7004166666666665, 8.099166666666667, 4.839166666666666, 5.04625) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.9117647058823529 calibrated k_tau (7.116312056737589, 4.941891289390237, 6.232299054373523, 7.107176996608079, 4.875130409086237, 3.880235484714111, 3.2264773280337447, 2.6151016881173845, 1.9454358641574738, 4.691218682067615, 4.2111636112913695, 7.342314215232809, 4.98624091614085, 7.830943570767808, 8.34094973789701, 3.125561954719158, 3.4652997467373283) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.9117647058823529

ranked k_tau (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68, 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647 calibrated k_tau (7.0, 4.3999999999999995, 2.5238095238095246, 8.913207547169812, 6.085271317829456, 3.0, 0.0, 3.5, 0.04761904761904766, 5.480620155038759, 2.761904761904762, 10.0, 5.930232558139535, 8.845283018867924, 7.9849056603773585, 5.0, 4.328571428571428) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647

ranked norm k_tau (7.248333333333333, 5.461666666666667, 6.952083333333333, 7.414166666666667, 5.75625, 4.51375, 3.9595833333333332, 3.9333333333333336, 4.483478260869565, 5.414166666666667, 5.159166666666667, 6.91875, 5.63875, 6.787083333333333, 6.379583333333334, 4.099166666666666, 4.35) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.7058823529411764 calibrated k_tau (7.477241596803178, 4.977537889200114, 7.178479071511619, 7.701605757701165, 5.445191021992888, 3.600066247674374, 2.6228312425887785, 2.661732775182972, 3.348692105366149, 4.871257993125308, 4.464827657652036, 7.146088400456735, 5.244763232721369, 6.975696491248189, 6.218798140275223, 2.869811090813712, 3.158093516403424) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.6911764705882353

ranked scaled k_tau (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68, 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647 calibrated k_tau (7.0, 4.3999999999999995, 2.5238095238095246, 8.913207547169812, 6.085271317829456, 3.0, 0.0, 3.5, 0.04761904761904766, 5.480620155038759, 2.761904761904762, 10.0, 5.930232558139535, 8.845283018867924, 7.9849056603773585, 5.0, 4.328571428571428) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647

weighted norm k_tau (7.73375, 5.875, 7.0183333333333335, 7.793333333333333, 5.9591666666666665, 5.34625, 5.1425, 5.205, 5.20695652173913, 5.9425, 5.815416666666667, 7.7941666666666665, 6.183333333333334, 7.769166666666667, 8.17125, 5.407916666666667, 5.335833333333333) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8382352941176471 calibrated k_tau (7.230462091065811, 4.168466637432468, 6.189414960216487, 7.470368270765096, 4.300410881532898, 3.50188667965537, 3.1950628663490654, 3.3781083774570995, 3.288182342794895, 4.220391879303791, 4.123169536875159, 7.485909549004787, 4.815368181010978, 7.399227345259091, 7.961157687087653, 3.616137782509102, 3.43724307883553) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8529411764705882
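
Each k_tau value above appears to be a Kendall tau between the per-model mean scores (first tuple) and the reference scores (second tuple). A minimal sketch reproducing this for the uncalibrated ranked scores, assuming tau-a with no ties (the helper name is hypothetical; the repository presumably calls a library routine instead):

```python
def _sign(v):
    return (v > 0) - (v < 0)

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += _sign(x[i] - x[j]) * _sign(y[i] - y[j])
    return total / (n * (n - 1) / 2)

# Values copied from the "ranked k_tau" line of the dump above.
ranked_means = (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68,
                4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85)
reference = (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989,
             1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102)

tau = kendall_tau_a(ranked_means, reference)
print(tau)  # 28 discordant pairs out of 136, i.e. 80/136
```

With 17 models there are 136 pairs, which is why every k_tau in the dump is a multiple of 1/136. A single swapped pair moves the value by about 0.015, so small differences between methods here correspond to only a handful of pair flips.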
