Experiment using LLM's to detect BS writing

An experiment to see if I can detect "BS" using LLM's in a robust way.

It's impossible to detect AI generated text with 100% accuracy. But we will have more success detecting bad, empty, or contentless text.

Why might this work? As with Schmidhuber's definition of curiosity good writing should initially suprise the reader but become less suprising as they learn about it. Empty writing is not suprising. And confusing writing stays confusing even after you have read it. In other words it should have a high perplexity, which goes down after learning it.

The ideal way to do this would be by fine tuning. But that would be momory intensive so I will try and do it with prompts. I ask:

Is the text suprising (high perplexity)
Is less suprising when given a summary? (low perplexity)

If yes/yes then it's may be suprising new information. If either is not true, then it's proboably BS.

See main.ipynb for the code and results.

Results

When using microsoft/phi-2 we get this amount of perplexity reduction by including a summary of the key learnings

	sample	learning%
3	einsteins theory of general relativity	0.0751468
5	wikipedia on LK-99	0.0674738
8	Schmidhuber 2023 Subjective Novelty, Surprise	0.0396319
1	good_ml	0.0321225
0	bad_ml	-9.58801e-05
2	sokal hoax	-0.0168107
7	AI gen fake paper	-0.134864
4	lorem ipsum	-0.69694
6	I have a dream	-0.796421

As you can see, some of these are probobly in the training set