2025-07-23 23:57:26
using meta-llama_llama-3_2-3b-instruct
aggregated_score_raw Final Judgemark (raw) = 0.088 Final Judgemark (cal) = 0.089
aggregated_score_weighted Final Judgemark (raw) = 0.089 Final Judgemark (cal) = 0.102
aggregated_score_ranked Final Judgemark (raw) = 0.034 Final Judgemark (cal) = 0.030
Hm, maybe I should try without the p-value, or maybe with a restricted scale. Got to load the saved logp from the runfile.
2025-07-25 19:30:21
With deepseek
Normed logp Final Judgemark (raw) = 0.673 Final Judgemark (cal) = 0.736
Weighted Final Judgemark (raw) = 0.635 Final Judgemark (cal) = 0.660
argmax Final Judgemark (raw) = 0.635 Final Judgemark (cal) = 0.659
normal ranked (without stretching)
normed and weighted Final Judgemark (raw) = 0.624 Final Judgemark (cal) = 0.645
| name | score | score_norm |
|---|---|---|
| ranked_scaled | 0.67 | 0.79 |
| ranked_norm | 0.67 | 0.73 |
| weighted | 0.63 | 0.65 |
| raw | 0.63 | 0.65 |
| weighted_norm | 0.62 | 0.64 |
| ranked | 0.33 | 0.28 |
UPTO:
- Why is ranked low on human calib? It's using Kendall tau against reference scores, so I thought it would be good. It's only using relative order.
- Well, I guess ranking is smooth and stable but less pronounced... if we made it more pronounced (**10?, or squared, or softmax), would it be better on all fronts?
Analysis of Ranked Method Performance Issue
Key Implementation Detail - Correcting Misconceptions
Critical Clarification: All methods obtain the full 11-choice probability distribution in a single forward pass - there are no extra queries or argmax operations. The difference is only in how they collapse the 11-dimensional logprob vector:
- Single prompt is sent to the judge model
- Model sees 11 choice tokens ["0", "1", ..., "10"] (often in JSON format)
- All 11 logprobs are requested in the same forward pass: `logp_arr = [L₀, L₁, ... L₁₀]`
- Different methods collapse this vector differently:
  - ranked path: `corr = kendalltau(choices=[0…10], values=logp_arr).correlation`, then `final = (corr+1)*5`. Uses ordering of the 11 log-probs.
  - weighted path: softmax(logp_arr)-weighted mean: `score = Σ( i * exp(Lᵢ)/Σexp(L) )`
  - raw path: pick the token whose text content is parsed as score (greedy decoding)
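The three collapse paths listed above can be sketched as follows. This is a minimal illustration, not the actual Judgemark code: `kendall_tau` is a pure-Python tau-b written here as a stand-in for `scipy.stats.kendalltau`, the function names `collapse_ranked`, `collapse_weighted`, and `collapse_raw` are hypothetical, and the raw path is approximated as the argmax token, which is what greedy decoding would pick.

```python
import math
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-b in pure Python (stand-in for scipy.stats.kendalltau)."""
    xs, ys = list(xs), list(ys)
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            ties_x += 1
        if y1 == y2:
            ties_y += 1
        if x1 != x2 and y1 != y2:
            if (x1 < x2) == (y1 < y2):
                concordant += 1
            else:
                discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # Assumption: return 0.0 when tau is undefined (scipy would return nan).
    return (concordant - discordant) / denom if denom else 0.0

def collapse_ranked(logp_arr):
    """(tau + 1) * 5, where tau compares the choice index with its log-prob."""
    tau = kendall_tau(range(len(logp_arr)), logp_arr)
    return (tau + 1) * 5

def collapse_weighted(logp_arr):
    """Softmax-weighted mean of the choice indices."""
    m = max(logp_arr)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logp_arr]
    return sum(i * w for i, w in enumerate(weights)) / sum(weights)

def collapse_raw(logp_arr):
    """Index of the most likely token (assumed equal to the greedy pick)."""
    return max(range(len(logp_arr)), key=lambda i: logp_arr[i])

# Toy, unnormalised log-probs peaked at choice 7 (made-up numbers).
logp = [-abs(i - 7) for i in range(11)]
print(collapse_ranked(logp), collapse_weighted(logp), collapse_raw(logp))
```

On this toy vector the raw path returns 7, while the ranked path returns about 8.2: it measures how monotone the log-probs are in the choice index, not where the peak sits.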
Why Ranked Method Fails on Human Correlation
Looking at the full results:
DeepSeek-R1 (Strong Judge):
- ranked: stability_calib = 1.0, separability_calib = 0.81, but human_correlation_calib = 0.624
- raw: stability_calib = 0.896, separability_calib = 0.652, human_correlation_calib = 0.882
Llama-3.2-3B (Weak Judge):
- ranked: stability_calib = 1.0, separability_calib = 0.809, but human_correlation_calib = 0.0
- raw: stability_calib = 0.0, separability_calib = 0.081, human_correlation_calib = 0.212
Root Cause Analysis:
- Scale Compression: The `(correlation+1)*5` transformation maps Kendall's τ from [-1,1] to [0,10], but creates a compressed range around 5.0 for most cases.
- Loss of Discriminative Power: Most log probability rankings have moderate positive correlations with ideal [0,1,2,...,10], so scores cluster around 6-8, losing ability to distinguish between good/bad models.
- Efficiency vs. Accuracy Trade-off: While ranked method extracts full 11-choice distribution at zero extra cost, its ordering-based collapse may be too conservative for human preference alignment.
- Paradox: Ranked method is internally consistent (high stability, separability) but externally invalid (poor human correlation).
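One way to see what an ordering-only collapse throws away: two log-prob vectors with the same ordering but very different confidence get exactly the same ranked score, while the softmax-weighted mean moves. The sketch below uses made-up vectors and a pure-Python tau-b as a stand-in for `scipy.stats.kendalltau`. The helper names are hypothetical.

```python
import math
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-b in pure Python (stand-in for scipy.stats.kendalltau)."""
    xs, ys = list(xs), list(ys)
    conc = disc = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        ties_x += x1 == x2
        ties_y += y1 == y2
        if x1 != x2 and y1 != y2:
            if (x1 < x2) == (y1 < y2):
                conc += 1
            else:
                disc += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (conc - disc) / denom if denom else 0.0  # 0.0 fallback is an assumption

def ranked_score(logp_arr):
    return (kendall_tau(range(len(logp_arr)), logp_arr) + 1) * 5

def weighted_score(logp_arr):
    m = max(logp_arr)
    w = [math.exp(l - m) for l in logp_arr]
    return sum(i * wi for i, wi in enumerate(w)) / sum(w)

# Same ordering (peak at 7), different sharpness. Values are illustrative only.
flat_logp = [-0.1 * abs(i - 7) for i in range(11)]   # judge is nearly undecided
sharp_logp = [-3.0 * abs(i - 7) for i in range(11)]  # judge is confident in 7

print("ranked:  ", ranked_score(flat_logp), ranked_score(sharp_logp))
print("weighted:", weighted_score(flat_logp), weighted_score(sharp_logp))
```

The two ranked scores are identical because Kendall's τ only looks at pairwise order. The weighted scores differ by more than a point, since the flat vector pulls the mean toward the middle of the scale.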
Proposed Solutions
The intuition about making it "more pronounced" is correct:
- Squared Scaling: `decision = ((res.correlation+1)*5) ** 2 / 10`
- Power Transform: `decision = (res.correlation ** 3 + 1) * 5`
- Direct Tau Scaling: `decision = res.correlation * 10 + 5` (range [-5, 15], so it allows negative scores)
- Sigmoid Enhancement: Apply sigmoid to stretch middle range
The fundamental issue: linear scaling (τ+1)*5 doesn't preserve enough correlation signal to maintain human alignment while creating stable rankings.
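To compare the candidate transforms before rerunning anything, it helps to tabulate what each one does to a few values of τ. The sketch below copies the three formulas given above. The sigmoid variant is an assumption: the notes do not specify a form, so `sigmoid_enhancement` and its steepness `k` are hypothetical.

```python
import math

def linear_scaling(tau):
    """Current mapping: [-1, 1] -> [0, 10]."""
    return (tau + 1) * 5

def squared_scaling(tau):
    """((tau + 1) * 5) ** 2 / 10, range [0, 10]."""
    return ((tau + 1) * 5) ** 2 / 10

def power_transform(tau):
    """(tau ** 3 + 1) * 5, range [0, 10]."""
    return (tau ** 3 + 1) * 5

def direct_tau_scaling(tau):
    """tau * 10 + 5, range [-5, 15]."""
    return tau * 10 + 5

def sigmoid_enhancement(tau, k=4.0):
    """Hypothetical sigmoid form; k is an assumed steepness, not from the notes."""
    return 10 / (1 + math.exp(-k * tau))

transforms = [linear_scaling, squared_scaling, power_transform,
              direct_tau_scaling, sigmoid_enhancement]

print("tau   " + "  ".join(f"{t.__name__:>19}" for t in transforms))
for tau in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"{tau:+.1f}  " + "  ".join(f"{t(tau):19.3f}" for t in transforms))
```

Two things stand out. At τ = 0.5 the linear map gives 7.5, while the squared and cubed variants both give 5.625, so they pull moderate positive correlations down rather than spreading them out. And every transform here is monotone in τ on [-1, 1], so none of them changes the per-item ordering. Any change in the final model ranking can only come from how the transformed values average across items.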
| name | judgemark_score | judgemark_score_calib | stability | stability_calib | separability | separability_calib | human_correlation | human_correlation_calib |
|---|---|---|---|---|---|---|---|---|
| weighted | 0.668 | 0.716 | 0.898 | 0.879 | 0.558 | 0.635 | 0.881 | 0.876 |
| raw | 0.678 | 0.731 | 0.896 | 0.896 | 0.572 | 0.652 | 0.882 | 0.882 |
| ranked | 0.607 | 0.81 | 0.896 | 1.0 | 0.531 | 0.81 | 0.624 | 0.624 |
| ranked_norm | 0.575 | 0.588 | 0.897 | 0.677 | 0.435 | 0.513 | 0.814 | 0.801 |
| ranked_scaled | 0.607 | 0.81 | 0.896 | 1.0 | 0.531 | 0.81 | 0.624 | 0.624 |
| weighted_norm | 0.562 | 0.551 | 0.897 | 0.54 | 0.424 | 0.498 | 0.777 | 0.775 |
Run ID: 8b031c01-ac20-4f6d-8838-2d8481c61a55__deepseek_deepseek-r1
| name | judgemark_score | judgemark_score_calib | stability | stability_calib | separability | separability_calib | human_correlation | human_correlation_calib |
|---|---|---|---|---|---|---|---|---|
| weighted | 0.089 | 0.102 | 0.0 | 0.0 | 0.077 | 0.087 | 0.225 | 0.261 |
| raw | 0.088 | 0.089 | 0.0 | 0.0 | 0.074 | 0.081 | 0.229 | 0.212 |
| ranked | 0.262 | 0.706 | 0.0 | 1.0 | 0.393 | 0.809 | 0.0 | 0.0 |
| ranked_norm | 0.065 | 0.068 | 0.0 | 0.0 | 0.053 | 0.059 | 0.176 | 0.17 |
| ranked_scaled | 0.262 | 0.706 | 0.0 | 1.0 | 0.393 | 0.809 | 0.0 | 0.0 |
| weighted_norm | 0.028 | 0.046 | 0.0 | 0.0 | 0.042 | 0.061 | 0.0 | 0.033 |
Run ID: ab20b598-845b-4da7-9f4c-56ec05405e28__meta-llama_llama-3_2-3b-instruct
2025-08-20 05:52:06
So I'm still confused why rank is worse. And why norm and scaled perform better in some cases and worse in others.
Look at the raw scores going into tau:
Processing run 8b031c01-ac20-4f6d-8838-2d8481c61a55__deepseek_deepseek-r1...
RAW weighted k_tau (7.056521739130434, 5.692083333333334, 6.568333333333333, 6.972083333333333, 6.02625, 5.292916666666667, 4.985, 4.421666666666667, 3.9104166666666664, 5.91625, 5.632083333333333, 7.324166666666667, 5.82875, 7.610833333333333, 8.12625, 4.810416666666667, 5.170833333333333) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8823529411764705 calibrated k_tau (6.913741010558691, 4.745233729099284, 6.136213642220673, 6.867702086872284, 5.3350524958555905, 3.9637780431712675, 3.4473454755860238, 2.6185410443378423, 1.9714928350431529, 5.109234456943669, 4.540345540824161, 7.3509579707211765, 4.885496283271738, 7.700297215002195, 8.38722761924037, 3.1579257830331477, 3.8035150809926783) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8823529411764705
raw k_tau (7.2075, 5.891666666666667, 6.645833333333333, 7.17, 5.872083333333333, 5.295416666666667, 4.9225, 4.42375, 3.9130434782608696, 5.767916666666666, 5.505833333333333, 7.350416666666667, 5.902083333333334, 7.7004166666666665, 8.099166666666667, 4.839166666666666, 5.04625) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.9117647058823529 calibrated k_tau (7.116312056737589, 4.941891289390237, 6.232299054373523, 7.107176996608079, 4.875130409086237, 3.880235484714111, 3.2264773280337447, 2.6151016881173845, 1.9454358641574738, 4.691218682067615, 4.2111636112913695, 7.342314215232809, 4.98624091614085, 7.830943570767808, 8.34094973789701, 3.125561954719158, 3.4652997467373283) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.9117647058823529
ranked k_tau (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68, 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647 calibrated k_tau (7.0, 4.3999999999999995, 2.5238095238095246, 8.913207547169812, 6.085271317829456, 3.0, 0.0, 3.5, 0.04761904761904766, 5.480620155038759, 2.761904761904762, 10.0, 5.930232558139535, 8.845283018867924, 7.9849056603773585, 5.0, 4.328571428571428) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647
ranked norm k_tau (7.248333333333333, 5.461666666666667, 6.952083333333333, 7.414166666666667, 5.75625, 4.51375, 3.9595833333333332, 3.9333333333333336, 4.483478260869565, 5.414166666666667, 5.159166666666667, 6.91875, 5.63875, 6.787083333333333, 6.379583333333334, 4.099166666666666, 4.35) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.7058823529411764 calibrated k_tau (7.477241596803178, 4.977537889200114, 7.178479071511619, 7.701605757701165, 5.445191021992888, 3.600066247674374, 2.6228312425887785, 2.661732775182972, 3.348692105366149, 4.871257993125308, 4.464827657652036, 7.146088400456735, 5.244763232721369, 6.975696491248189, 6.218798140275223, 2.869811090813712, 3.158093516403424) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.6911764705882353
ranked scaled k_tau (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68, 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647 calibrated k_tau (7.0, 4.3999999999999995, 2.5238095238095246, 8.913207547169812, 6.085271317829456, 3.0, 0.0, 3.5, 0.04761904761904766, 5.480620155038759, 2.761904761904762, 10.0, 5.930232558139535, 8.845283018867924, 7.9849056603773585, 5.0, 4.328571428571428) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647
weighted norm k_tau (7.73375, 5.875, 7.0183333333333335, 7.793333333333333, 5.9591666666666665, 5.34625, 5.1425, 5.205, 5.20695652173913, 5.9425, 5.815416666666667, 7.7941666666666665, 6.183333333333334, 7.769166666666667, 8.17125, 5.407916666666667, 5.335833333333333) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8382352941176471 calibrated k_tau (7.230462091065811, 4.168466637432468, 6.189414960216487, 7.470368270765096, 4.300410881532898, 3.50188667965537, 3.1950628663490654, 3.3781083774570995, 3.288182342794895, 4.220391879303791, 4.123169536875159, 7.485909549004787, 4.815368181010978, 7.399227345259091, 7.961157687087653, 3.616137782509102, 3.43724307883553) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8529411764705882
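As a sanity check on the dump above, the ranked k_tau can be recomputed from the logged tuples. The sketch below uses a pure-Python Kendall tau as a stand-in for `scipy.stats.kendalltau` (with no ties, tau-a and tau-b coincide). The numbers are copied from the `ranked k_tau` line. Treating the second tuple as the reference scores is an assumption based on the notes above.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau for tie-free data: (concordant - discordant) / n_pairs."""
    pairs = list(combinations(zip(xs, ys), 2))
    signed = sum(1 if (x1 < x2) == (y1 < y2) else -1
                 for (x1, y1), (x2, y2) in pairs)
    return signed / len(pairs)

# Per-model mean scores from the "ranked k_tau" log line.
ranked = (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68,
          4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85)
# Second tuple from the same line (assumed to be the reference scores).
reference = (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989,
             1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102)

tau = kendall_tau(ranked, reference)
print(round(tau, 4))  # 0.5882, matching the logged 0.588235...

# A strictly increasing transform of the aggregated scores cannot move tau.
squared = [s ** 2 for s in ranked]
print(kendall_tau(squared, reference) == tau)  # True
```

The second check matters for the "make it more pronounced" idea: squaring or otherwise stretching the per-model means leaves this number at 0.588. A transform can only change human correlation if it is applied per item, before averaging.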