readme updates

sam-paech
2025-01-31 20:31:38 +11:00
parent f46c3fc0bb
commit da8387bfa1
+6 -6
@@ -65,10 +65,10 @@ python judgemark_v2.py \
## How It Works
1. **Reading In Samples**
- The script loads `samples_file`, which contains creative snippets from multiple “writer models.”
+ The script loads `samples_file`, which contains completions to creative writing prompts from multiple “writer models.”
2. **Generating Judge Prompts**
- For each snippet, we load a partial “judge prompt from `prompts_file`. This typically includes instructions like:
+ For each completion, we load a judge prompt from `prompts_file`. This typically includes instructions like:
```
Please assign numeric scores (0-10) for these criteria:
- Nuanced Characters
@@ -79,18 +79,18 @@ python judgemark_v2.py \
```
3. **Sending Requests to the Judge Model**
- Each snippet + prompt is sent to the `--judge-model` via the functions in `utils/api.py`. We specify a moderate temperature (often `0.5`) and top-k for variability.
+ Each completion + prompt is sent to the `--judge-model` via the functions in `utils/api.py`. We specify a moderate temperature (often `0.5`) and top-k for variability.
4. **Parsing the Judge Output**
- The script captures lines like `Nuanced Characters: 8` or `Weak Dialogue: 3`, extracts the numeric scores, and aggregates them into a single raw score. Negative markers (like “Weak Dialogue”) are inverted so 10 = worst.
+ The script captures lines like `Nuanced Characters: 8` or `Weak Dialogue: 3`, extracts the numeric scores, and aggregates them into a single raw score. Negative criteria (like “Weak Dialogue”) are inverted so 10 = worst.
5. **Storing & Re-Trying**
- Results are saved in your designated `runs-file`. If a snippet fails or provides incomplete scores, the script can retry (in subsequent runs) without overwriting previous data.
+ Results are saved in your designated `runs-file`. If an item fails or provides incomplete scores, the script can retry (in subsequent runs) without overwriting previous data.
6. **Final Judgemark Scores**
Once all samples are scored:
- A *raw* Judgemark score is computed from the distribution of assigned scores.
- - A *calibrated* score is computed after normalizing each judges “score spread” to a standard range.
+ - A *calibrated* score is computed after normalizing each judge’s “score spread” to a standard distribution anchored to the mean, 25th & 75th percentile, upper & lower range. Calibration linearly transforms the distribution from these anchor points to match an ideal distribution of 0-10 range, 5 mean, and 25th & 75th percentile
- Additional metrics quantify how consistent (stable) and discriminative the judge is.
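The parsing step (step 4) can be illustrated with a minimal sketch. This is not the code in `judgemark_v2.py`: the names `NEGATIVE_CRITERIA`, `parse_scores`, and `aggregate`, the regular expression, and the choice of a plain mean are all assumptions made for illustration. It assumes negative criteria are flipped with `10 - score` before averaging, so that a higher raw score always means better writing.

```python
import re
from statistics import mean

# Hypothetical set of negative criteria; the real list comes from the judge prompt.
NEGATIVE_CRITERIA = {"Weak Dialogue"}

# Matches lines such as "Nuanced Characters: 8" or "- Weak Dialogue: 3".
SCORE_LINE = re.compile(
    r"^\s*[-*]?\s*(?P<name>[A-Za-z][A-Za-z '&/-]*?)\s*:\s*(?P<score>\d+(?:\.\d+)?)\s*$"
)


def parse_scores(judge_output: str) -> dict[str, float]:
    """Extract {criterion: score} pairs, ignoring lines that are not scores."""
    scores: dict[str, float] = {}
    for line in judge_output.splitlines():
        match = SCORE_LINE.match(line)
        if match is None:
            continue
        value = float(match.group("score"))
        if 0 <= value <= 10:  # drop out-of-range values rather than guessing
            scores[match.group("name")] = value
    return scores


def aggregate(scores: dict[str, float], negative=NEGATIVE_CRITERIA):
    """Average the scores after flipping negative criteria (assumed: 10 - score)."""
    if not scores:
        return None  # nothing parsed; the caller can mark the item for retry
    adjusted = [10 - v if name in negative else v for name, v in scores.items()]
    return mean(adjusted)


judge_output = "Nuanced Characters: 8\nWeak Dialogue: 3\nSome free-form commentary."
parsed = parse_scores(judge_output)
raw_score = aggregate(parsed)
```

Returning `None` for an empty parse gives the caller a simple signal that the item needs another attempt, which fits the retry behaviour described in step 5.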
## Interpreting the Results