‘I can tell when it’s been written by AI’

Warning: the first two paragraphs might feel a bit like wading through treacle, but I think what follows is useful and probably necessary context for the activity linked at the end!

LLMs generate text using sophisticated prediction/probability models, and whilst I am no expert (so if you want proper, technical accounts please do go to an actual expert!) I think it useful to home in on three concepts that help explain how their outputs feel and read: temperature, perplexity and burstiness. Temperature sets how adventurous the word-by-word choices are: low values produce steady, highly predictable prose; high values (usually on a 0-1 scale, though some APIs stretch to 2) invite surprise and variation (supposedly more ‘creativity’ and certainly more hallucination). Perplexity measures how hard it is to predict the next word overall, and burstiness captures how unevenly those surprises cluster, like the mix of long and short sentences in some human writing, and maybe even a smattering of stretched metaphor and whimsy. Most early (I say early making it sound like mediaeval times but we’re talking 2-3 years ago!) AI writing felt ‘flat’ or ‘bland’, and therefore more detectable to human readers, because default temperatures were conservative and burstiness was low.
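For the technically curious (everyone else can safely skip the code), here is a toy sketch of what temperature actually does to those word-by-word choices. The candidate words and their probabilities below are entirely made up for illustration, and this is a simplification of how real models sample:

```python
# Toy illustration only: the words and probabilities are invented, and real
# models choose over tens of thousands of tokens, not four words.
import random

def sample_next_word(word_probs, temperature=0.7):
    """Rescale next-word probabilities by temperature, then pick one at random."""
    words = list(word_probs)
    # Low temperature sharpens the distribution (safe, predictable choices);
    # high temperature flattens it (more surprising, occasionally odd choices).
    weights = [p ** (1.0 / temperature) for p in word_probs.values()]
    total = sum(weights)
    return random.choices(words, weights=[w / total for w in weights])[0]

# Hypothetical probabilities for the word after "The results were ..."
candidates = {"positive": 0.6, "mixed": 0.25, "startling": 0.1, "treacly": 0.05}

print(sample_next_word(candidates, temperature=0.2))  # almost always "positive"
print(sample_next_word(candidates, temperature=1.0))  # "startling" and "treacly" now get a look in
```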

I imagine most ChatGPT (other tools are available) users do not think much about such things, given these are not visible choices in the main user interface. Funnily enough, I do recall these were options in the publicly available tools that pre-dated GPT-3.5 (the BIG release in November ’22). Like a lot of things, skilled use can make a difference (so a user might specify a style or tone in the prompt). Also, with money come better options: Pro-account custom GPTs, for example, can have precise built-in customisations. I also note that few seem to use the personalisation options that override some of the things many folk find irritating in LLM outputs (mine states, for example, that it should use British English by default, never use em dashes and use ‘no mark up’ as standard). I should also note that some tools still allow temperature manipulation in the main user interface (Google AI Studio, for example) or when using the API (as with ChatGPT’s underlying models). Google AI Studio also has a ‘top P’ setting, which limits word choices to the most probable candidates up to a cumulative probability threshold, so users can control how predictable the output is. These things can drive you to distraction, so it’s probably no wonder that most right-thinking, time-poor people have no time for experimental tweaking of this nature. But as models have evolved, developers have embedded dynamic temperature controls and other tuning methods that automatically vary these qualities. The result is that the claim ‘I can tell when it’s AI’ may be true of inexpert, unmodified outputs from free tools, but is much harder to sustain for more sophisticated use and paid-for tools. Interestingly, the same appears true for AI detectors. The early detectors’ reliance on low-temperature signatures now needs revisiting too, for those not already convinced of their vincibility.
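If you do fancy experimenting beyond the chat window, the sort of tweaking I mean looks something like this via the API. This is a rough sketch using the OpenAI Python SDK; the model name, instructions and numbers are illustrative rather than recommendations:

```python
# Rough sketch: assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": "Use British English. Never use em dashes. No markup."},
        {"role": "user", "content": "Summarise 'burstiness' in two sentences."},
    ],
    temperature=1.2,  # more adventurous word choices (this API runs 0-2)
    top_p=0.9,        # sample only from the most probable 90% of the distribution
)
print(response.choices[0].message.content)
```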

Evolutionary and embedded changes therefore have a humanising effect on LLM outputs. Modern systems can weave in natural fluctuations of rhythm and unexpected word choices, erasing much of the familiar ChatGPT blandness. Skilled (some would say ‘cynical’) users, whether through careful prompting or by passing text through paraphrasers and ‘humanisers’, can amplify this further. Early popular detectors such as GPTZero (at my work we are clear that colleagues should NEVER be uploading student work to such platforms, btw) leaned heavily on perplexity and burstiness patterns to spot machine-generated work, but this is increasingly a losing battle. Detector developers are responding with more complex model-based classifiers and watermarking ideas, yet the arms race remains uneven: every generation of LLMs makes it easier to sidestep statistical fingerprints and harder to prove authorship with certainty.
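To make that old detector logic concrete, here is a back-of-envelope sketch of the kind of signals they leaned on: perplexity scored by a small open model (GPT-2 via Hugging Face transformers, purely as an example) and a crude burstiness proxy based on sentence-length variation. Real detectors are, or were, considerably more involved, and, to repeat, none of this should ever be pointed at student work:

```python
# Back-of-envelope sketch of detector-style signals; GPT-2 is used purely as an
# example scorer, not as anything resembling a production detector.
import statistics
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    """How surprised the model is by the text (lower = more predictable prose)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # average next-token cross-entropy
    return torch.exp(loss).item()

def burstiness(text: str) -> float:
    """Crude proxy: how much sentence lengths vary across the text."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0

sample = "Low-temperature prose is tidy. It is also rather flat. Humans, bless us, ramble."
print(perplexity(sample), burstiness(sample))
```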

For fun I ran this article through GPTZero… Phew!

It is also worth reflecting on what kinds of writing we value. My own style, for instance, happily mixes a smorgasbord of metaphors in a dizzying (or maybe it’s nauseating) cocktail of overlong sentences, excessive comma use and dated cultural references (ooh, and sprinkles in frequent parentheses too). Others might genuinely prefer the neat, low-temperature clarity an AI can produce. And some humans write with such regularity that a detector might wrongly flag them as synthetic. I understand that these traits may often reflect the writing of neurodivergent or multilingual students.

To explore this phenomenon and your own thinking further, please try this short activity. I used my own text as a starting point and generated (in Perplexity) five AI variants at varying temperatures. The activity was built in Claude. The idea is that it reveals your own preferred ‘perplexity and burstiness combo’ and might prompt a fresh look at your writing preferences and the blurred boundaries between human and machine style. The temperature setting is revealed when you make your selection. Please try it out and let me know how I might improve it (or whether I should chuck it out of the window, i.e. DefenestrAIt it).

Obviously, as my job is to encourage thinking and reflection about what this means for those teaching, those studying and broadly the institution they work or study in, I’ll finish with a few questions to stimulate reflection or discussion:

For teaching: Do you think you can detect AI writing? How might you respond when you suspect AI use but cannot prove it with certainty? What happens to the teacher-student relationship when detection becomes guesswork rather than evidence?

For assignment design: Could you shift towards process-focused assessment or tasks requiring personal experience, local knowledge or novel data? What kinds of writing assignments become more meaningful when AI can handle the routine ones? Has that actually changed in your discipline or not?

For your students: How can understanding these technical concepts help students use AI tools more thoughtfully rather than simply trying to avoid detection? What might students learn about their own writing voice through activities that reveal their personal perplexity and burstiness patterns? What is it about AI outputs that students who use them value and what is it that so many teachers disdain?

For your institution: Should institutions invest in detection tools given this technological arms race, or focus resources elsewhere? How might academic integrity policies need updating as reliable detection becomes less feasible?

For equity: Are students with access to sophisticated prompting techniques or ‘humanising’ tools gaining unfair advantages? How do we ensure that AI developments don’t widen existing educational inequalities? Who might we be inadvertently discriminating against with blanket bans or no use policies?

For the bigger picture: What kinds of human writing and thinking do we most want to cultivate in an age when machines can produce increasingly convincing text? How do we help students develop authentic voice and critical thinking skills that remain distinctly valuable?

When you know the answer to the last question, let me know.

Blank canvases

Inspired by something I saw in a meeting yesterday morning, I returned today to Gemini Canvas and the Claude equivalent (still not sure what it’s called). Both these tools are designed to enable you to “go from a blank slate to dynamic previews to share-worthy creations, in minutes.”

The resource I used was The Renaissance of the Essay? (LSE Impact Blog) and the accompanying Manifesto, which Claire Gordon (LSE) and I led on with input from colleagues at LSE and here at King’s. I wondered how easily I could make the manifesto a little more dynamic and interactive. In the first instance I was thinking about activating engagement beyond the scroll, and secondly about text inputs and reflections.

The basic version in Gemini was a fourth-iteration output where, after an initial very basic prompt:

“turn this into an interactive web-based and shareable resource”

…I tweaked (using natural language) the underpinning code so that the boxes were better formatted for readability and to minimise scrolling, and the reflection component went from purely additional text to a clickable pop-up. I need to test it with a screen reader to see how that works, of course.

I then experimented with adding reflection boxes and an export-notes function. It took three or four tweaks (largely due to copy-text function limits in the browser) but this is the latest version. Obviously with work this could be made to look nicer, but I’m impressed with the initial output, the ability to iterate and the functionality achieved in a very short time (about 15 minutes total).

For the Claude one I thought I’d try having all those features, including in-text input interaction, from the start. Perhaps that was a mistake, because although the initial output looked great, the text input was buggy. Thirteen iterations later I got the input fixed. However, the export function that I’d added around version 3 had then stopped working, so I needed to do a lot more back and forth. In the end I ran out of time (about 40 minutes in and at version 19) and settled on this version with the inadequate copy/paste function.

It’s all still relatively new, and what’s weird about the whole thing is the continual release of beta tools, experimental spaces and things that in any other context would not be released to the world. Nevertheless, there is already utility visible here and no doubt they will continue to improve. I sometimes think that my biggest barrier to finding utility is my own limited imagination. I definitely vibe off seeing what others have done. This further underlines for me the gap between those releasing these tools and those expected to find uses for them, a significant problem going forward. ‘Here’s a thing,’ they say. ‘What’s it for?’ we ask. ‘I dunno,’ they shrug, ‘efficiency?’

My prompt for this was:
‘tech bros shrugging’