# 2025-07-23 23:57:26
using meta-llama_llama-3_2-3b-instruct

aggregated_score_raw
    Final Judgemark (raw)   = 0.088
    Final Judgemark (cal)  = 0.089

aggregated_score_weighted
    Final Judgemark (raw)   = 0.089
    Final Judgemark (cal)  = 0.102

aggregated_score_ranked
    Final Judgemark (raw)   = 0.034
    Final Judgemark (cal)  = 0.030

Hm, maybe I should try without p value, maybe with a restricted scale.
Need to load the saved logp from the runfile.

# 2025-07-25 19:30:21

With deepseek

Normed logp
Final Judgemark (raw)   = 0.673
Final Judgemark (cal)  = 0.736

Weighted
Final Judgemark (raw)   = 0.635
Final Judgemark (cal)  = 0.660

argmax
Final Judgemark (raw)   = 0.635
Final Judgemark (cal)  = 0.659

normal ranked (without stretching)

normed and weighted
Final Judgemark (raw)   = 0.624
Final Judgemark (cal)  = 0.645


| name          | score    | score_norm |
|---------------|----------|------------|
| ranked_scaled | 0.67     |     0.79   |
| ranked_norm   | 0.67     |     0.73   |
| weighted      | 0.63     |     0.65   |
| raw           | 0.63     |     0.65   |
| weighted_norm | 0.62     |     0.64   |
| ranked        | 0.33     |     0.28   |


UPTO:
- Why is ranked low on human calib? It uses Kendall tau against reference scores, so I thought it would be good. It only uses relative order.
- Well, I guess ranking is smooth and stable but less pronounced... if we made it more pronounced (`**10`?, or squared, or softmax), would it be better on all fronts?

## Analysis of Ranked Method Performance Issue

### **Key Implementation Detail - Correcting Misconceptions**

**Critical Clarification**: All methods obtain the full 11-choice probability distribution in a **single forward pass** - there are no extra queries or argmax operations. The difference is only in how they collapse the 11-dimensional logprob vector:

1. **Single prompt** is sent to the judge model
2. Model sees 11 choice tokens ["0", "1", ..., "10"] (often in JSON format)  
3. **All 11 logprobs** are requested in the same forward pass: `logp_arr = [L₀, L₁, ... L₁₀]`
4. Different methods collapse this vector differently:

- **ranked** path: `corr = kendalltau(range(11), logp_arr).correlation`, then `final = (corr+1)*5`. Uses only the **ordering** of the 11 log-probs.
- **weighted** path: `softmax(logp_arr)`-weighted mean: `score = Σ( i * exp(Lᵢ)/Σexp(L) )`
- **raw** path: take the greedily decoded token and parse its text content as the score
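
The three collapses can be sketched as follows. This is a minimal illustration, not the run code: the function names are made up here, the pure-Python `kendall_tau` is a stand-in for `scipy.stats.kendalltau` (the two agree only when there are no ties), the raw path is approximated as an argmax over the log-probs, and the example `logp_arr` is hypothetical.

```python
import math

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return s / (n * (n - 1) / 2)

def score_ranked(logp_arr):
    # Correlate choice index with log-prob, then map tau from [-1, 1] to [0, 10].
    choices = list(range(len(logp_arr)))
    return (kendall_tau(choices, logp_arr) + 1) * 5

def score_weighted(logp_arr):
    # Softmax-weighted mean of the choice indices (max-shifted for stability).
    m = max(logp_arr)
    weights = [math.exp(lp - m) for lp in logp_arr]
    return sum(i * w for i, w in enumerate(weights)) / sum(weights)

def score_raw(logp_arr):
    # Assumption: greedy decoding picks the most likely score token.
    return max(range(len(logp_arr)), key=lambda i: logp_arr[i])

# Hypothetical log-probs for tokens "0".."10", peaked at "7".
logp_arr = [-9.0, -8.5, -8.0, -7.0, -6.0, -4.0, -2.5, -0.3, -2.0, -5.0, -7.5]
sharper = [3 * lp for lp in logp_arr]  # same ordering, more peaked

print(score_ranked(logp_arr), score_weighted(logp_arr), score_raw(logp_arr))
print(score_ranked(sharper), score_weighted(sharper), score_raw(sharper))
```

Sharpening the distribution moves the weighted score but leaves the ranked score untouched, because the ranked path sees only the order of the 11 log-probs and not their magnitudes.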

### **Why Ranked Method Fails on Human Correlation**

Looking at the full results:

**DeepSeek-R1 (Strong Judge):**
- `ranked`: stability_calib = **1.0**, separability_calib = **0.81**, but human_correlation_calib = **0.624**
- `raw`: stability_calib = 0.896, separability_calib = 0.652, human_correlation_calib = **0.882**

**Llama-3.2-3B (Weak Judge):**
- `ranked`: stability_calib = **1.0**, separability_calib = **0.809**, but human_correlation_calib = **0.0**
- `raw`: stability_calib = 0.0, separability_calib = 0.081, human_correlation_calib = **0.212**

**Root Cause Analysis:**

1. **Scale Compression**: The `(correlation+1)*5` transformation maps Kendall's τ from [-1,1] to [0,10], but creates a compressed range around 5.0 for most cases.

2. **Loss of Discriminative Power**: Most log-probability rankings have moderate positive correlations with the ideal order [0,1,2,...,10], so scores cluster around 6-8, losing the ability to distinguish between good and bad models.

3. **Efficiency vs. Accuracy Trade-off**: While the ranked method extracts the full 11-choice distribution at zero extra cost, its ordering-based collapse may be too conservative for human preference alignment.

4. **Paradox**: The ranked method is internally consistent (high stability, separability) but externally invalid (poor human correlation).

### **Proposed Solutions**

The intuition about making it "more pronounced" is correct:

1. **Quadratic Scaling**: `decision = ((res.correlation+1)*5) ** 2 / 10`
2. **Power Transform**: `decision = (res.correlation ** 3 + 1) * 5`
3. **Direct Tau Scaling**: `decision = res.correlation * 10 + 5` (range [-5, 15], so negative scores are possible)
4. **Sigmoid Enhancement**: Apply sigmoid to stretch middle range

The fundamental issue: linear scaling `(τ+1)*5` doesn't preserve enough correlation signal to maintain human alignment while creating stable rankings.
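
To compare the candidate transforms side by side, here is a small sketch that evaluates each one on a grid of τ values. The function names are illustrative, and the sigmoid form with steepness `k=4.0` is an assumption, since the note above does not specify one.

```python
import math

def linear(tau):
    # Current mapping: [-1, 1] -> [0, 10].
    return (tau + 1) * 5

def squared(tau):
    # ((tau+1)*5)**2 / 10: squares the linear score, still in [0, 10].
    return linear(tau) ** 2 / 10

def cubed(tau):
    # (tau**3 + 1) * 5: cubes tau before the linear map.
    return (tau ** 3 + 1) * 5

def direct(tau):
    # tau*10 + 5: doubles the slope, range [-5, 15].
    return tau * 10 + 5

def sigmoid_stretch(tau, k=4.0):
    # Hypothetical sigmoid variant; k is an assumed steepness.
    return 10 / (1 + math.exp(-k * tau))

transforms = (linear, squared, cubed, direct, sigmoid_stretch)
print("tau   " + "  ".join(f"{f.__name__:>15}" for f in transforms))
for tau in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"{tau:+.1f}  " + "  ".join(f"{f(tau):15.3f}" for f in transforms))
```

Two things stand out from the arithmetic. Cubing pulls moderate τ toward 5 (τ = 0.5 gives 5.625 versus 7.5 for the linear map), so it spreads only the extremes. The direct and sigmoid variants steepen the middle of the range instead. Every variant is monotone in τ, so any change in model ranking would come from averaging the transformed per-item scores, not from reordering individual items.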


| name          | judgemark_score | judgemark_score_calib | stability | stability_calib | separability | separability_calib | human_correlation | human_correlation_calib |
|---------------|-----------------|-----------------------|-----------|-----------------|--------------|--------------------|-------------------|-------------------------|
| weighted      | 0.668           | 0.716                 | 0.898     | 0.879           | 0.558        | 0.635              | 0.881             | 0.876                   |
| raw           | 0.678           | 0.731                 | 0.896     | 0.896           | 0.572        | 0.652              | 0.882             | 0.882                   |
| ranked        | 0.607           | 0.81                  | 0.896     | 1.0             | 0.531        | 0.81               | 0.624             | 0.624                   |
| ranked_norm   | 0.575           | 0.588                 | 0.897     | 0.677           | 0.435        | 0.513              | 0.814             | 0.801                   |
| ranked_scaled | 0.607           | 0.81                  | 0.896     | 1.0             | 0.531        | 0.81               | 0.624             | 0.624                   |
| weighted_norm | 0.562           | 0.551                 | 0.897     | 0.54            | 0.424        | 0.498              | 0.777             | 0.775                   |

Run ID: 8b031c01-ac20-4f6d-8838-2d8481c61a55__deepseek_deepseek-r1


| name          | judgemark_score | judgemark_score_calib | stability | stability_calib | separability | separability_calib | human_correlation | human_correlation_calib |
|---------------|-----------------|-----------------------|-----------|-----------------|--------------|--------------------|-------------------|-------------------------|
| weighted      | 0.089           | 0.102                 | 0.0       | 0.0             | 0.077        | 0.087              | 0.225             | 0.261                   |
| raw           | 0.088           | 0.089                 | 0.0       | 0.0             | 0.074        | 0.081              | 0.229             | 0.212                   |
| ranked        | 0.262           | 0.706                 | 0.0       | 1.0             | 0.393        | 0.809              | 0.0               | 0.0                     |
| ranked_norm   | 0.065           | 0.068                 | 0.0       | 0.0             | 0.053        | 0.059              | 0.176             | 0.17                    |
| ranked_scaled | 0.262           | 0.706                 | 0.0       | 1.0             | 0.393        | 0.809              | 0.0               | 0.0                     |
| weighted_norm | 0.028           | 0.046                 | 0.0       | 0.0             | 0.042        | 0.061              | 0.0               | 0.033                   |

Run ID: ab20b598-845b-4da7-9f4c-56ec05405e28__meta-llama_llama-3_2-3b-instruct

# 2025-08-20 05:52:06

So I'm still confused why rank is worse. And why norm and scaled perform better in some cases and worse in others.


Look at the raw scores going into tau:
Processing run 8b031c01-ac20-4f6d-8838-2d8481c61a55__deepseek_deepseek-r1...

RAW 
  weighted
k_tau (7.056521739130434, 5.692083333333334, 6.568333333333333, 6.972083333333333, 6.02625, 5.292916666666667, 4.985, 4.421666666666667, 3.9104166666666664, 5.91625, 5.632083333333333, 7.324166666666667, 5.82875, 7.610833333333333, 8.12625, 4.810416666666667, 5.170833333333333) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8823529411764705
    calibrated
k_tau (6.913741010558691, 4.745233729099284, 6.136213642220673, 6.867702086872284, 5.3350524958555905, 3.9637780431712675, 3.4473454755860238, 2.6185410443378423, 1.9714928350431529, 5.109234456943669, 4.540345540824161, 7.3509579707211765, 4.885496283271738, 7.700297215002195, 8.38722761924037, 3.1579257830331477, 3.8035150809926783) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8823529411764705

raw
k_tau (7.2075, 5.891666666666667, 6.645833333333333, 7.17, 5.872083333333333, 5.295416666666667, 4.9225, 4.42375, 3.9130434782608696, 5.767916666666666, 5.505833333333333, 7.350416666666667, 5.902083333333334, 7.7004166666666665, 8.099166666666667, 4.839166666666666, 5.04625) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.9117647058823529
    calibrated
k_tau (7.116312056737589, 4.941891289390237, 6.232299054373523, 7.107176996608079, 4.875130409086237, 3.880235484714111, 3.2264773280337447, 2.6151016881173845, 1.9454358641574738, 4.691218682067615, 4.2111636112913695, 7.342314215232809, 4.98624091614085, 7.830943570767808, 8.34094973789701, 3.125561954719158, 3.4652997467373283) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.9117647058823529


ranked
k_tau (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68, 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647
    calibrated
k_tau (7.0, 4.3999999999999995, 2.5238095238095246, 8.913207547169812, 6.085271317829456, 3.0, 0.0, 3.5, 0.04761904761904766, 5.480620155038759, 2.761904761904762, 10.0, 5.930232558139535, 8.845283018867924, 7.9849056603773585, 5.0, 4.328571428571428) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647


ranked norm
k_tau (7.248333333333333, 5.461666666666667, 6.952083333333333, 7.414166666666667, 5.75625, 4.51375, 3.9595833333333332, 3.9333333333333336, 4.483478260869565, 5.414166666666667, 5.159166666666667, 6.91875, 5.63875, 6.787083333333333, 6.379583333333334, 4.099166666666666, 4.35) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.7058823529411764
    calibrated
k_tau (7.477241596803178, 4.977537889200114, 7.178479071511619, 7.701605757701165, 5.445191021992888, 3.600066247674374, 2.6228312425887785, 2.661732775182972, 3.348692105366149, 4.871257993125308, 4.464827657652036, 7.146088400456735, 5.244763232721369, 6.975696491248189, 6.218798140275223, 2.869811090813712, 3.158093516403424) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.6911764705882353


ranked scaled
k_tau (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68, 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647
    calibrated
k_tau (7.0, 4.3999999999999995, 2.5238095238095246, 8.913207547169812, 6.085271317829456, 3.0, 0.0, 3.5, 0.04761904761904766, 5.480620155038759, 2.761904761904762, 10.0, 5.930232558139535, 8.845283018867924, 7.9849056603773585, 5.0, 4.328571428571428) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.588235294117647


weighted norm
k_tau (7.73375, 5.875, 7.0183333333333335, 7.793333333333333, 5.9591666666666665, 5.34625, 5.1425, 5.205, 5.20695652173913, 5.9425, 5.815416666666667, 7.7941666666666665, 6.183333333333334, 7.769166666666667, 8.17125, 5.407916666666667, 5.335833333333333) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8382352941176471
    calibrated
k_tau (7.230462091065811, 4.168466637432468, 6.189414960216487, 7.470368270765096, 4.300410881532898, 3.50188667965537, 3.1950628663490654, 3.3781083774570995, 3.288182342794895, 4.220391879303791, 4.123169536875159, 7.485909549004787, 4.815368181010978, 7.399227345259091, 7.961157687087653, 3.616137782509102, 3.43724307883553) (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989, 1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102) 0.8529411764705882
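
The `k_tau` values can be re-derived from the dumped tuples. The sketch below assumes the first tuple is the judge's per-model mean score, the second is the reference score, and the trailing number is Kendall's tau between them. The pure-Python tau-a used here matches `scipy.stats.kendalltau` for this data because neither tuple contains ties.

```python
def kendall_tau(x, y):
    """Kendall tau-a; equals tau-b when neither input has ties."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                concordant += 1
            elif prod < 0:
                discordant += 1
    tau = (concordant - discordant) / (n * (n - 1) / 2)
    return tau, concordant, discordant

# Per-model mean scores from the uncalibrated "ranked" dump.
ranked_scores = (5.61, 3.9, 2.72, 7.3, 5.02, 2.92, 1.66, 3.27, 1.68,
                 4.63, 2.82, 8.26, 4.92, 7.24, 6.48, 4.32, 3.85)
# Reference scores, identical in every dump.
reference = (1243, 1163, 1240, 1276, 1214, 1114, 1050, 1029, 989,
             1147, 1159, 1333, 1246, 1402, 1430, 1099, 1102)

tau, concordant, discordant = kendall_tau(ranked_scores, reference)
print(f"tau={tau:.4f} concordant={concordant} discordant={discordant}")
# tau=0.5882 concordant=108 discordant=28
```

With 17 models there are 136 pairs, so ranked puts 28 pairs in the wrong order relative to the reference. By the same arithmetic, the reported 0.912 for raw corresponds to only 6 discordant pairs.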