diff --git a/README.md b/README.md
index 81d0f6f..3ce29b1 100644
--- a/README.md
+++ b/README.md
@@ -65,10 +65,10 @@ python judgemark_v2.py \
 ## How It Works
 
 1. **Reading In Samples**
-   The script loads `samples_file`, which contains creative snippets from multiple “writer models.”
+   The script loads `samples_file`, which contains completions to creative writing prompts from multiple “writer models.”
 
 2. **Generating Judge Prompts**
-   For each snippet, we load a partial “judge prompt” from `prompts_file`. This typically includes instructions like:
+   For each completion, we load a judge prompt from `prompts_file`. This typically includes instructions like:
    ```
    Please assign numeric scores (0-10) for these criteria:
    - Nuanced Characters
@@ -79,18 +79,18 @@ python judgemark_v2.py \
    ```
 
 3. **Sending Requests to the Judge Model**
-   Each snippet + prompt is sent to the `--judge-model` via the functions in `utils/api.py`. We specify a moderate temperature (often `0.5`) and top-k for variability.
+   Each completion + prompt is sent to the `--judge-model` via the functions in `utils/api.py`. We specify a moderate temperature (often `0.5`) and top-k for variability.
 
 4. **Parsing the Judge Output**
-   The script captures lines like `Nuanced Characters: 8` or `Weak Dialogue: 3`, extracts the numeric scores, and aggregates them into a single raw score. Negative markers (like “Weak Dialogue”) are inverted so 10 = worst.
+   The script captures lines like `Nuanced Characters: 8` or `Weak Dialogue: 3`, extracts the numeric scores, and aggregates them into a single raw score. Negative criteria (like “Weak Dialogue”) are inverted so 10 = worst.
 
 5. **Storing & Re-Trying**
-   Results are saved in your designated `runs-file`. If a snippet fails or provides incomplete scores, the script can retry (in subsequent runs) without overwriting previous data.
+   Results are saved in your designated `runs-file`. If an item fails or provides incomplete scores, the script can retry (in subsequent runs) without overwriting previous data.
 
 6. **Final Judgemark Scores**
    Once all samples are scored:
    - A *raw* Judgemark score is computed from the distribution of assigned scores.
-   - A *calibrated* score is computed after normalizing each judge’s “score spread” to a standard range.
+   - A *calibrated* score is computed after normalizing each judge’s “score spread” to a standard distribution anchored to the mean, 25th & 75th percentile, upper & lower range. Calibration linearly transforms the distribution from these anchor points to match an ideal distribution of 0-10 range, 5 mean, and 25th & 75th percentile
    - Additional metrics quantify how consistent (stable) and discriminative the judge is.
 
 ## Interpreting the Results
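
Steps 4 and 6 of the README text in this diff can be sketched in Python as follows. This is only an illustration, not the code in `judgemark_v2.py`. The function names `parse_raw_score` and `calibrate`, the `NEGATIVE_CRITERIA` set, and the use of a plain mean for aggregation are all assumptions. Reading "inverted" as `10 - score` is also an interpretation. The README text does not state the target values for the 25th and 75th percentiles, so the sketch takes them as required arguments instead of guessing.

```python
import re
from bisect import bisect_right
from statistics import mean, quantiles

# Hypothetical set of negative criteria; the real list comes from the judge prompt.
NEGATIVE_CRITERIA = {"Weak Dialogue"}

# Matches lines such as "Nuanced Characters: 8".
SCORE_LINE = re.compile(
    r"^[ \t]*([A-Za-z][^:\n]*?)[ \t]*:[ \t]*(\d+(?:\.\d+)?)[ \t]*$",
    re.MULTILINE,
)


def parse_raw_score(judge_output, negative=NEGATIVE_CRITERIA, max_score=10.0):
    """Average all 'Criterion: N' lines, flipping negative criteria (assumed rule)."""
    values = []
    for name, number in SCORE_LINE.findall(judge_output):
        score = float(number)
        if not 0 <= score <= max_score:
            continue  # skip numbers outside the 0-10 scale
        values.append(max_score - score if name in negative else score)
    if not values:
        raise ValueError("no scores found in judge output")
    return mean(values)


def calibrate(scores, target_q25, target_q75, target_mean=5.0, lo=0.0, hi=10.0):
    """Piecewise-linear map from a judge's own anchors to the target anchors."""
    q25, _, q75 = quantiles(scores, n=4, method="inclusive")
    src = [min(scores), q25, mean(scores), q75, max(scores)]
    dst = [lo, target_q25, target_mean, target_q75, hi]
    # Simplification of this sketch: a skewed distribution can put the mean
    # outside the interquartile range, which this version does not handle.
    if any(a >= b for a, b in zip(src, src[1:])):
        raise ValueError("anchors must be strictly increasing for this sketch")

    def transform(x):
        # Locate the segment containing x, then interpolate linearly within it.
        i = min(max(bisect_right(src, x) - 1, 0), len(src) - 2)
        frac = (x - src[i]) / (src[i + 1] - src[i])
        return dst[i] + frac * (dst[i + 1] - dst[i])

    return [transform(x) for x in scores]


if __name__ == "__main__":
    print(parse_raw_score("Nuanced Characters: 8\nWeak Dialogue: 3"))
    # The quartile targets below are placeholders for illustration only.
    print(calibrate([2, 4, 5, 6, 8], target_q25=3.0, target_q75=7.0))
```

Mapping each segment between adjacent anchors separately means a judge that compresses its scores into a narrow band is stretched across the full 0-10 range while its ordering of samples is preserved.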