Judgemark-v2lp/research_log.md

2025-07-23 23:57:26

using meta-llama_llama-3_2-3b-instruct

aggregated_score_raw Final Judgemark (raw) = 0.088 Final Judgemark (cal) = 0.089

aggregated_score_weighted Final Judgemark (raw) = 0.089 Final Judgemark (cal) = 0.102

aggregated_score_ranked Final Judgemark (raw) = 0.034 Final Judgemark (cal) = 0.030

Hm, maybe I should try without the p-value, or maybe with a restricted scale. Need to load the saved logp from the run file.

2025-07-25 19:30:21

With deepseek

Normed logp Final Judgemark (raw) = 0.673 Final Judgemark (cal) = 0.736

Weighted Final Judgemark (raw) = 0.635 Final Judgemark (cal) = 0.660

argmax Final Judgemark (raw) = 0.635 Final Judgemark (cal) = 0.659

normal ranked (without stretching)

normed and weighted Final Judgemark (raw) = 0.624 Final Judgemark (cal) = 0.645

| name | score | score_norm |
| --- | --- | --- |
| ranked_scaled | 0.67 | 0.79 |
| ranked_norm | 0.67 | 0.73 |
| weighted | 0.63 | 0.65 |
| raw | 0.63 | 0.65 |
| weighted_norm | 0.62 | 0.64 |
| ranked | 0.33 | 0.28 |

UPTO:

  • why is ranked low on human calib? It uses Kendall's tau against the reference scores, so I thought it would do well. It only uses relative order.
  • well, I guess ranking is smooth and stable but less pronounced... if we made it more pronounced (**10?, squared, or softmax), would it be better on all fronts?

Analysis of Ranked Method Performance Issue

Key Implementation Detail - Correcting Misconceptions

Critical Clarification: All methods obtain the full 11-choice probability distribution in a single forward pass - there are no extra queries or argmax operations. The difference is only in how they collapse the 11-dimensional logprob vector:

  1. Single prompt is sent to the judge model
  2. Model sees 11 choice tokens ["0", "1", ..., "10"] (often in JSON format)
  3. All 11 logprobs are requested in the same forward pass: logp_arr = [L₀, L₁, ... L₁₀]
  4. Different methods collapse this vector differently:
  • ranked path: corr = kendalltau(choices=[0…10], values=logp_arr).correlation, then final = (corr+1)*5. Uses ordering of the 11 log-probs.
  • weighted path: softmax(logp_arr)-weighted mean: score = Σ( i * exp(Lᵢ)/Σexp(L) )
  • raw path: pick the token whose text content is parsed as score (greedy decoding)
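The three collapse rules above can be sketched as follows. This is a minimal sketch, not the actual Judgemark code: the function names are made up for illustration, and `argmax_score` is only an assumed stand-in for the raw path, which really parses the greedily decoded text.

```python
import numpy as np
from scipy.stats import kendalltau

CHOICES = np.arange(11)  # the 11 score tokens "0" .. "10"

def ranked_score(logp_arr):
    """Kendall tau between choice index and log-prob, mapped from [-1, 1] to [0, 10]."""
    tau, _ = kendalltau(CHOICES, logp_arr)
    return (tau + 1) * 5

def weighted_score(logp_arr):
    """Softmax-weighted mean of the choice values."""
    logp = np.asarray(logp_arr, dtype=float)
    p = np.exp(logp - logp.max())  # subtract the max for numerical stability
    p /= p.sum()
    return float(np.dot(CHOICES, p))

def argmax_score(logp_arr):
    """Most likely choice (assumed stand-in for the greedily decoded raw score)."""
    return int(np.argmax(logp_arr))

# Two hypothetical judges that both peak at 7, one confident and one diffuse.
sharp = -1.0 * np.abs(CHOICES - 7)
flat = -0.3 * np.abs(CHOICES - 7)

for name, logp in [("sharp", sharp), ("flat", flat)]:
    print(name, ranked_score(logp), weighted_score(logp), argmax_score(logp))
```

Both hypothetical vectors have the same ordering of log-probs, so the ranked path gives them the same score, while the weighted path moves with how concentrated the distribution is. The ranked path only sees the ordering and discards the magnitudes.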

Why Ranked Method Fails on Human Correlation

Looking at the full results:

DeepSeek-R1 (Strong Judge):

  • ranked: stability_calib = 1.0, separability_calib = 0.81, but human_correlation_calib = 0.624
  • raw: stability_calib = 0.896, separability_calib = 0.652, human_correlation_calib = 0.882

Llama-3.2-3B (Weak Judge):

  • ranked: stability_calib = 1.0, separability_calib = 0.809, but human_correlation_calib = 0.0
  • raw: stability_calib = 0.0, separability_calib = 0.081, human_correlation_calib = 0.212

Root Cause Analysis:

  1. Scale Compression: The (correlation+1)*5 transformation maps Kendall's τ linearly from [-1, 1] to [0, 10], with τ = 0 landing at 5.0, so moderate correlations all fall in a narrow band just above the midpoint.

  2. Loss of Discriminative Power: If most log-probability rankings have moderate positive correlations with the ideal order [0,1,2,...,10], scores cluster around 6-8, losing the ability to distinguish between good and bad models.

  3. Efficiency vs. Accuracy Trade-off: While ranked method extracts full 11-choice distribution at zero extra cost, its ordering-based collapse may be too conservative for human preference alignment.

  4. Paradox: Ranked method is internally consistent (high stability, separability) but externally invalid (poor human correlation).

Proposed Solutions

The intuition about making it "more pronounced" is correct:

  1. Quadratic Scaling: decision = ((res.correlation+1)*5) ** 2 / 10
  2. Power Transform: decision = (res.correlation ** 3 + 1) * 5
  3. Direct Tau Scaling: decision = res.correlation * 10 + 5 (range [-5, 15], so scores can be negative or exceed 10)
  4. Sigmoid Enhancement: Apply sigmoid to stretch middle range

The fundamental issue: linear scaling (τ+1)*5 doesn't preserve enough correlation signal to maintain human alignment while creating stable rankings.
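To compare the candidate transforms side by side, here is a small sketch that evaluates each on a few τ values. The function names are made up for illustration. The sigmoid variant is left out because the note does not say which parameters it would use.

```python
def linear(tau):
    """Current mapping: [-1, 1] -> [0, 10]."""
    return (tau + 1) * 5

def quadratic(tau):
    """Square the linear score and rescale: [-1, 1] -> [0, 10]."""
    return ((tau + 1) * 5) ** 2 / 10

def cubic(tau):
    """Cube tau before the linear mapping: [-1, 1] -> [0, 10]."""
    return (tau ** 3 + 1) * 5

def direct(tau):
    """Steeper affine mapping: [-1, 1] -> [-5, 15]."""
    return tau * 10 + 5

for tau in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(tau, linear(tau), quadratic(tau), cubic(tau), direct(tau))
```

Two things are worth checking before running any of these on the real logprobs. The direct scaling is an affine function of the current score (10τ + 5 = 2·(τ+1)·5 − 5), so on its own it cannot change the order of per-model mean scores. The cubic transform pulls moderate τ values toward 5 rather than away from it (τ = 0.5 gives 5.625 instead of 7.5), which is the opposite of "more pronounced" in the middle of the range.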

| name | judgemark_score | judgemark_score_calib | stability | stability_calib | separability | separability_calib | human_correlation | human_correlation_calib |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| weighted | 0.668 | 0.716 | 0.898 | 0.879 | 0.558 | 0.635 | 0.881 | 0.876 |
| raw | 0.678 | 0.731 | 0.896 | 0.896 | 0.572 | 0.652 | 0.882 | 0.882 |
| ranked | 0.607 | 0.81 | 0.896 | 1.0 | 0.531 | 0.81 | 0.624 | 0.624 |
| ranked_norm | 0.575 | 0.588 | 0.897 | 0.677 | 0.435 | 0.513 | 0.814 | 0.801 |
| ranked_scaled | 0.607 | 0.81 | 0.896 | 1.0 | 0.531 | 0.81 | 0.624 | 0.624 |
| weighted_norm | 0.562 | 0.551 | 0.897 | 0.54 | 0.424 | 0.498 | 0.777 | 0.775 |

Run ID: 8b031c01-ac20-4f6d-8838-2d8481c61a55__deepseek_deepseek-r1

| name | judgemark_score | judgemark_score_calib | stability | stability_calib | separability | separability_calib | human_correlation | human_correlation_calib |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| weighted | 0.089 | 0.102 | 0.0 | 0.0 | 0.077 | 0.087 | 0.225 | 0.261 |
| raw | 0.088 | 0.089 | 0.0 | 0.0 | 0.074 | 0.081 | 0.229 | 0.212 |
| ranked | 0.262 | 0.706 | 0.0 | 1.0 | 0.393 | 0.809 | 0.0 | 0.0 |
| ranked_norm | 0.065 | 0.068 | 0.0 | 0.0 | 0.053 | 0.059 | 0.176 | 0.17 |
| ranked_scaled | 0.262 | 0.706 | 0.0 | 1.0 | 0.393 | 0.809 | 0.0 | 0.0 |
| weighted_norm | 0.028 | 0.046 | 0.0 | 0.0 | 0.042 | 0.061 | 0.0 | 0.033 |

Run ID: ab20b598-845b-4da7-9f4c-56ec05405e28__meta-llama_llama-3_2-3b-instruct

2025-08-20 05:52:06

So I'm still confused why rank is worse. And why norm and scaled perform better in some cases and worse in others.

Look at the raw scores going into tau.

Processing run 8b031c01-ac20-4f6d-8838-2d8481c61a55__deepseek_deepseek-r1...

RAW weighted k_tau (7.056521739130434, 5.692083333333334, 6.568333333333333, 6.972083333333333, 6.02625, 5.292916666666667, 4.985, 4.421666666666667, 3.9104166666666664, 5.91625, 5.632083333333333, 7.324166666666667, 5.82875, 7.610833333333333, 8.12625, 4.810416666666667, 5.170833333333333) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8823529411764705 calibrated k_tau (6.913741010558691, 4.745233729099284, 6.136213642220673, 6.867702086872284, 5.3350524958555905, 3.9637780431712675, 3.4473454755860238, 2.6185410443378423, 1.9714928350431529, 5.109234456943669, 4.540345540824161, 7.3509579707211765, 4.885496283271738, 7.700297215002195, 8.38722761924037, 3.1579257830331477, 3.8035150809926783) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8823529411764705

raw k_tau (7.2075, 5.891666666666667, 6.645833333333333, 7.17, 5.872083333333333, 5.295416666666667, 4.9225, 4.42375, 3.9130434782608696, 5.767916666666666, 5.505833333333333, 7.350416666666667, 5.902083333333334, 7.7004166666666665, 8.099166666666667, 4.839166666666666, 5.04625) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.9117647058823529 calibrated k_tau (7.116312056737589, 4.941891289390237, 6.232299054373523, 7.107176996608079, 4.875130409086237, 3.880235484714111, 3.2264773280337447, 2.6151016881173845, 1.9454358641574738, 4.691218682067615, 4.2111636112913695, 7.342314215232809, 4.98624091614085, 7.830943570767808, 8.34094973789701, 3.125561954719158, 3.4652997467373283) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.9117647058823529

ranked k_tau (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68, 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647 calibrated k_tau (7.0, 4.3999999999999995, 2.5238095238095246, 8.913207547169812, 6.085271317829456, 3.0, 0.0, 3.5, 0.04761904761904766, 5.480620155038759, 2.761904761904762, 10.0, 5.930232558139535, 8.845283018867924, 7.9849056603773585, 5.0, 4.328571428571428) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647

ranked norm k_tau (7.248333333333333, 5.461666666666667, 6.952083333333333, 7.414166666666667, 5.75625, 4.51375, 3.9595833333333332, 3.9333333333333336, 4.483478260869565, 5.414166666666667, 5.159166666666667, 6.91875, 5.63875, 6.787083333333333, 6.379583333333334, 4.099166666666666, 4.35) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.7058823529411764 calibrated k_tau (7.477241596803178, 4.977537889200114, 7.178479071511619, 7.701605757701165, 5.445191021992888, 3.600066247674374, 2.6228312425887785, 2.661732775182972, 3.348692105366149, 4.871257993125308, 4.464827657652036, 7.146088400456735, 5.244763232721369, 6.975696491248189, 6.218798140275223, 2.869811090813712, 3.158093516403424) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.6911764705882353

ranked scaled k_tau (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68, 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647 calibrated k_tau (7.0, 4.3999999999999995, 2.5238095238095246, 8.913207547169812, 6.085271317829456, 3.0, 0.0, 3.5, 0.04761904761904766, 5.480620155038759, 2.761904761904762, 10.0, 5.930232558139535, 8.845283018867924, 7.9849056603773585, 5.0, 4.328571428571428) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647

weighted norm k_tau (7.73375, 5.875, 7.0183333333333335, 7.793333333333333, 5.9591666666666665, 5.34625, 5.1425, 5.205, 5.20695652173913, 5.9425, 5.815416666666667, 7.7941666666666665, 6.183333333333334, 7.769166666666667, 8.17125, 5.407916666666667, 5.335833333333333) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8382352941176471 calibrated k_tau (7.230462091065811, 4.168466637432468, 6.189414960216487, 7.470368270765096, 4.300410881532898, 3.50188667965537, 3.1950628663490654, 3.3781083774570995, 3.288182342794895, 4.220391879303791, 4.123169536875159, 7.485909549004787, 4.815368181010978, 7.399227345259091, 7.961157687087653, 3.616137782509102, 3.43724307883553) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8529411764705882
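As a sanity check, the k_tau numbers can be recomputed from the printed tuples. The sketch below does this for the uncalibrated ranked line, assuming the first tuple holds the per-model mean scores and the second tuple holds the reference scores (the variable names are mine, not from the script).

```python
from scipy.stats import kendalltau

# Per-model mean scores from the "ranked k_tau" line (assumed meaning of the first tuple).
ranked_means = [5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68,
                4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85]
# Reference scores, identical on every line (assumed meaning of the second tuple).
reference = [1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989,
             1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102]

tau, _ = kendalltau(ranked_means, reference)

# With no ties, tau = (concordant - discordant) / pairs, so the
# discordant count follows directly from tau.
n = len(reference)
pairs = n * (n - 1) // 2
discordant = round((1 - tau) * pairs / 2)

print(f"tau = {tau:.4f}, discordant pairs = {discordant} of {pairs}")
```

This reproduces the logged 0.588 and shows it corresponds to 28 of 136 model pairs being ordered differently from the reference. Listing those 28 pairs might be a more direct way to see where ranked goes wrong than staring at the aggregate.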