# 2025-07-23 23:57:26

Using meta-llama_llama-3_2-3b-instruct.

aggregated_score_raw
- Final Judgemark (raw) = 0.088
- Final Judgemark (cal) = 0.089

aggregated_score_weighted
- Final Judgemark (raw) = 0.089
- Final Judgemark (cal) = 0.102

aggregated_score_ranked
- Final Judgemark (raw) = 0.034
- Final Judgemark (cal) = 0.030

Hm, maybe I should try without the p value, maybe with a restricted scale. Got to load the saved logp from the runfile.

# 2025-07-25 19:30:21

With deepseek.

Normed logp
- Final Judgemark (raw) = 0.673
- Final Judgemark (cal) = 0.736

Weighted
- Final Judgemark (raw) = 0.635
- Final Judgemark (cal) = 0.660

argmax
- Final Judgemark (raw) = 0.635
- Final Judgemark (cal) = 0.659

normal ranked (without stretching)

normed and weighted
- Final Judgemark (raw) = 0.624
- Final Judgemark (cal) = 0.645

| name          | score | score_norm |
|---------------|-------|------------|
| ranked_scaled | 0.67  | 0.79       |
| ranked_norm   | 0.67  | 0.73       |
| weighted      | 0.63  | 0.65       |
| raw           | 0.63  | 0.65       |
| weighted_norm | 0.62  | 0.64       |
| ranked        | 0.33  | 0.28       |

UPTO:
- Why is ranked low on human calib? It's using Kendall tau against reference scores, so I thought it would be good. It's only using relative order.
- Well, I guess ranking is smooth and stable but less pronounced... if we made it more pronounced (`**10`?, or squared, or softmax), would it be better on all fronts?

## Analysis of Ranked Method Performance Issue

### **Key Implementation Detail - Correcting Misconceptions**

**Critical Clarification**: All methods obtain the full 11-choice probability distribution in a **single forward pass**; there are no extra queries or argmax operations. The difference is only in how they collapse the 11-dimensional logprob vector:

1. A **single prompt** is sent to the judge model.
2. The model sees 11 choice tokens ["0", "1", ..., "10"] (often in JSON format).
3. **All 11 logprobs** are requested in the same forward pass: `logp_arr = [L₀, L₁, ... L₁₀]`
4. Different methods collapse this vector differently:
   - **ranked** path: `corr = kendalltau([0, …, 10], logp_arr).correlation`, then `final = (corr+1)*5`. Uses only the **ordering** of the 11 log-probs.
   - **weighted** path: `softmax(logp_arr)`-weighted mean of the choice index: `score = Σᵢ i * exp(Lᵢ) / Σⱼ exp(Lⱼ)`
   - **raw** path: pick the token whose text content is parsed as the score (greedy decoding).

### **Why Ranked Method Fails on Human Correlation**

Looking at the full results:

**DeepSeek-R1 (Strong Judge):**
- `ranked`: stability_calib = **1.0**, separability_calib = **0.81**, but human_correlation_calib = **0.624**
- `raw`: stability_calib = 0.896, separability_calib = 0.652, human_correlation_calib = **0.882**

**Llama-3.2-3B (Weak Judge):**
- `ranked`: stability_calib = **1.0**, separability_calib = **0.809**, but human_correlation_calib = **0.0**
- `raw`: stability_calib = 0.0, separability_calib = 0.081, human_correlation_calib = **0.212**

**Root Cause Analysis:**

1. **Scale Compression**: The `(correlation+1)*5` transformation maps Kendall's τ from [-1, 1] to [0, 10], but creates a compressed range around 5.0 for most cases.
2. **Loss of Discriminative Power**: Most log-probability rankings have moderate positive correlations with the ideal [0, 1, 2, ..., 10], so scores cluster around 6-8 and lose the ability to distinguish between good and bad models.
3. **Efficiency vs. Accuracy Trade-off**: While the ranked method extracts the full 11-choice distribution at zero extra cost, its ordering-based collapse may be too conservative for human preference alignment.
4. **Paradox**: The ranked method is internally consistent (high stability and separability) but externally invalid (poor human correlation).

### **Proposed Solutions**

The intuition about making it "more pronounced" is correct:

1. **Quadratic Scaling**: `decision = ((res.correlation+1)*5) ** 2 / 10`
2. **Power Transform**: `decision = (res.correlation ** 3 + 1) * 5`
3. **Direct Tau Scaling**: `decision = res.correlation * 10 + 5` (range [-5, 15], so it allows negative scores). Note that this is an affine rescale of `(τ+1)*5`, so on its own it cannot change the ordering of mean scores.
4. **Sigmoid Enhancement**: Apply a sigmoid to stretch the middle range.

The fundamental issue: linear scaling `(τ+1)*5` doesn't preserve enough correlation signal to maintain human alignment while creating stable rankings.

| name          | judgemark_score | judgemark_score_calib | stability | stability_calib | separability | separability_calib | human_correlation | human_correlation_calib |
|---------------|-----------------|-----------------------|-----------|-----------------|--------------|--------------------|-------------------|-------------------------|
| weighted      | 0.668           | 0.716                 | 0.898     | 0.879           | 0.558        | 0.635              | 0.881             | 0.876                   |
| raw           | 0.678           | 0.731                 | 0.896     | 0.896           | 0.572        | 0.652              | 0.882             | 0.882                   |
| ranked        | 0.607           | 0.81                  | 0.896     | 1.0             | 0.531        | 0.81               | 0.624             | 0.624                   |
| ranked_norm   | 0.575           | 0.588                 | 0.897     | 0.677           | 0.435        | 0.513              | 0.814             | 0.801                   |
| ranked_scaled | 0.607           | 0.81                  | 0.896     | 1.0             | 0.531        | 0.81               | 0.624             | 0.624                   |
| weighted_norm | 0.562           | 0.551                 | 0.897     | 0.54            | 0.424        | 0.498              | 0.777             | 0.775                   |

Run ID: 8b031c01-ac20-4f6d-8838-2d8481c61a55__deepseek_deepseek-r1

| name          | judgemark_score | judgemark_score_calib | stability | stability_calib | separability | separability_calib | **human_correlation** | human_correlation_calib |
|---------------|-----------------|-----------------------|-----------|-----------------|--------------|--------------------|-----------------------|-------------------------|
| weighted      | 0.089           | 0.102                 | 0.0       | 0.0             | 0.077        | 0.087              | 0.225                 | 0.261                   |
| raw           | 0.088           | 0.089                 | 0.0       | 0.0             | 0.074        | 0.081              | 0.229                 | 0.212                   |
| ranked        | 0.262           | 0.706                 | 0.0       | 1.0             | 0.393        | 0.809              | 0.0                   | 0.0                     |
| ranked_norm   | 0.065           | 0.068                 | 0.0       | 0.0             | 0.053        | 0.059              | 0.176                 | 0.17                    |
| ranked_scaled | 0.262           | 0.706                 | 0.0       | 1.0             | 0.393        | 0.809              | 0.0                   | 0.0                     |
| weighted_norm | 0.028           | 0.046                 | 0.0       | 0.0             | 0.042        | 0.061              | 0.0                   | 0.033                   |

Run ID:
ab20b598-845b-4da7-9f4c-56ec05405e28__meta-llama_llama-3_2-3b-instruct

# 2025-08-20 05:52:06

So I'm still confused why rank is worse, and why norm and scaled perform better in some cases and worse in others.

Look at the raw scores going into tau.

Processing run 8b031c01-ac20-4f6d-8838-2d8481c61a55__deepseek_deepseek-r1...

Every block below printed a tuple of 17 scores, then a second tuple of 17 values, then the resulting k_tau. The second tuple was identical in every block, so it is listed once:

(1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102)

RAW

**weighted**
- k_tau = 0.8823529411764705, scores: (7.056521739130434, 5.692083333333334, 6.568333333333333, 6.972083333333333, 6.02625, 5.292916666666667, 4.985, 4.421666666666667, 3.9104166666666664, 5.91625, 5.632083333333333, 7.324166666666667, 5.82875, 7.610833333333333, 8.12625, 4.810416666666667, 5.170833333333333)
- calibrated k_tau = 0.8823529411764705, scores: (6.913741010558691, 4.745233729099284, 6.136213642220673, 6.867702086872284, 5.3350524958555905, 3.9637780431712675, 3.4473454755860238, 2.6185410443378423, 1.9714928350431529, 5.109234456943669, 4.540345540824161, 7.3509579707211765, 4.885496283271738, 7.700297215002195, 8.38722761924037, 3.1579257830331477, 3.8035150809926783)

**raw**
- k_tau = 0.9117647058823529, scores: (7.2075, 5.891666666666667, 6.645833333333333, 7.17, 5.872083333333333, 5.295416666666667, 4.9225, 4.42375, 3.9130434782608696, 5.767916666666666, 5.505833333333333, 7.350416666666667, 5.902083333333334, 7.7004166666666665, 8.099166666666667, 4.839166666666666, 5.04625)
- calibrated k_tau = 0.9117647058823529, scores: (7.116312056737589, 4.941891289390237, 6.232299054373523, 7.107176996608079, 4.875130409086237, 3.880235484714111, 3.2264773280337447, 2.6151016881173845, 1.9454358641574738, 4.691218682067615, 4.2111636112913695, 7.342314215232809, 4.98624091614085, 7.830943570767808, 8.34094973789701, 3.125561954719158, 3.4652997467373283)

**ranked**
- k_tau = 0.588235294117647, scores: (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68, 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85)
- calibrated k_tau = 0.588235294117647, scores: (7.0, 4.3999999999999995, 2.5238095238095246, 8.913207547169812, 6.085271317829456, 3.0, 0.0, 3.5, 0.04761904761904766, 5.480620155038759, 2.761904761904762, 10.0, 5.930232558139535, 8.845283018867924, 7.9849056603773585, 5.0, 4.328571428571428)

**ranked norm**
- k_tau = 0.7058823529411764, scores: (7.248333333333333, 5.461666666666667, 6.952083333333333, 7.414166666666667, 5.75625, 4.51375, 3.9595833333333332, 3.9333333333333336, 4.483478260869565, 5.414166666666667, 5.159166666666667, 6.91875, 5.63875, 6.787083333333333, 6.379583333333334, 4.099166666666666, 4.35)
- calibrated k_tau = 0.6911764705882353, scores: (7.477241596803178, 4.977537889200114, 7.178479071511619, 7.701605757701165, 5.445191021992888, 3.600066247674374, 2.6228312425887785, 2.661732775182972, 3.348692105366149, 4.871257993125308, 4.464827657652036, 7.146088400456735, 5.244763232721369, 6.975696491248189, 6.218798140275223, 2.869811090813712, 3.158093516403424)

**ranked scaled**
- k_tau = 0.588235294117647, scores: (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68, 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85)
- calibrated k_tau = 0.588235294117647, scores: (7.0, 4.3999999999999995, 2.5238095238095246, 8.913207547169812, 6.085271317829456, 3.0, 0.0, 3.5, 0.04761904761904766, 5.480620155038759, 2.761904761904762, 10.0, 5.930232558139535, 8.845283018867924, 7.9849056603773585, 5.0, 4.328571428571428)

**weighted norm**
- k_tau = 0.8382352941176471, scores: (7.73375, 5.875, 7.0183333333333335, 7.793333333333333, 5.9591666666666665, 5.34625, 5.1425, 5.205, 5.20695652173913, 5.9425, 5.815416666666667, 7.7941666666666665, 6.183333333333334, 7.769166666666667, 8.17125, 5.407916666666667, 5.335833333333333)
- calibrated k_tau = 0.8529411764705882, scores: (7.230462091065811, 4.168466637432468, 6.189414960216487, 7.470368270765096, 4.300410881532898, 3.50188667965537, 3.1950628663490654, 3.3781083774570995, 3.288182342794895, 4.220391879303791, 4.123169536875159, 7.485909549004787, 4.815368181010978, 7.399227345259091, 7.961157687087653, 3.616137782509102, 3.43724307883553)
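
To sanity-check the dump, here is a minimal sketch that recomputes k_tau for the ranked scores against the second tuple, and compares the ranked and weighted collapses on a toy logprob vector. It is not the scorer's actual code: the function names, the constants `REFERENCE` and `RANKED`, and the toy vectors `peaked` and `sharp` are made up for illustration, and a plain-Python tau-a stands in for `scipy.stats.kendalltau` (the two agree when there are no ties, as in these tuples).

```python
import math
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Equals tau-b when neither sequence contains ties.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    balance = 0
    n_pairs = 0
    for i, j in combinations(range(len(xs)), 2):
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        balance += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
        n_pairs += 1
    return balance / n_pairs


def ranked_score(logp_arr):
    """'ranked' collapse as described in the notes: (tau + 1) * 5."""
    return (kendall_tau(range(len(logp_arr)), logp_arr) + 1) * 5


def weighted_score(logp_arr):
    """'weighted' collapse: softmax-weighted mean of the choice index."""
    m = max(logp_arr)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in logp_arr]
    return sum(i * w for i, w in enumerate(weights)) / sum(weights)


# Copied from the DeepSeek-R1 dump: second tuple and the 'ranked' scores.
REFERENCE = (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989,
             1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102)
RANKED = (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68,
          4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85)

tau_ranked = kendall_tau(RANKED, REFERENCE)

# "Direct Tau Scaling" (tau*10 + 5) equals 2*((tau+1)*5) - 5, so mean scores
# under it are an affine map of the current ones and tau cannot change.
DIRECT = [2 * s - 5 for s in RANKED]
tau_direct = kendall_tau(DIRECT, REFERENCE)

# Toy logprob vectors (hypothetical): same ordering, different sharpness.
peaked = [-abs(i - 7) for i in range(11)]
sharp = [10 * lp for lp in peaked]

print("ranked k_tau:", tau_ranked)
print("direct-scaled k_tau:", tau_direct)
print("ranked collapse:", ranked_score(peaked), ranked_score(sharp))
print("weighted collapse:", weighted_score(peaked), weighted_score(sharp))
```

Two things this makes concrete: an affine rescale such as option 3 leaves k_tau exactly where it was, and the ranked collapse returns the same score for `peaked` and `sharp` because it discards how confident the judge is, while the weighted collapse moves toward 7 as the distribution sharpens. Only a nonlinear transform applied per item, before averaging, can reorder the model means.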