I’m one of those people whose knee is constantly jiggling, especially when I’m sat in ‘receive’ mode in a meeting or something. To reduce the jiggling I fiddle with things, and the thing I have been fiddling with will be familiar to anyone who likes to see what all the fuss is about with new tech. I’m always asking myself: novelty or utility? (I had my fingers burnt with interactive whiteboards and have been cautious ever since.) You may be interested in the output of Perplexity’s ‘Comet’, the browser-based AI agent whose outputs are littering LinkedIn right now, or the video below, which is a conversation between me and one of my AI avatars… if neither of these appeals, I’d stop reading now, tbh.
The image below links to what the agent produced from a simple prompt: “display this video in a window with some explanatory text about what it is and then have a self-marking multi choice quiz below it.” [youtube link]
It is a small web application that displays a YouTube video, provides some explanatory text, and then offers a self-marking multiple choice quiz beneath it.
Click on the image to see the artefact and try the quiz
The process was straightforward but illuminating. The agent prepared an interactive webpage from three generated files (index.html, style.css and app.js) and assembled them into a functioning app. It embedded the YouTube video (though it needed an additional prompt when the video did not initially display), added explanatory text about the focus of the video (AI in education at King’s College London), and then generated an eight-question multiple choice quiz based on the transcript.
The quiz has self-marking functionality, with immediate feedback, score tracking and final results. The design is clean and the layout works, in my view. The questions cover key points from the transcript: principles, the presenter’s role, policy considerations and recommendations for upskilling. The potential applications are pretty obvious, I think. Next steps would be to look at likely accessibility issues (a quick check highlights a number of heading and formatting problems), to find a better solution for hosting, and to see how easily the questions can be fine-tuned for level. But given I only needed to tweak one question for this example, even that basic functionality suggests this will be of use.
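For anyone curious what the self-marking functionality amounts to under the hood, here is a minimal sketch of the kind of logic an app.js like this might contain. To be clear, this is my own illustration rather than the agent’s actual code: the question text, the “quiz” element id and the styling are all assumptions.

```javascript
// Minimal sketch of a self-marking quiz (illustrative, not the agent's actual app.js).
// Assumes index.html contains an empty <div id="quiz"></div>.
const questions = [
  {
    text: "What is the main focus of the video?",
    options: ["AI in education at King's College London", "Campus catering", "Library opening hours"],
    answer: 0,
  },
  // ...further questions generated from the transcript would follow
];

let score = 0;
let answered = 0;

function renderQuiz() {
  const quiz = document.getElementById("quiz");
  questions.forEach((q) => {
    const block = document.createElement("div");
    block.innerHTML = `<p>${q.text}</p>`;
    q.options.forEach((opt, oi) => {
      const btn = document.createElement("button");
      btn.textContent = opt;
      btn.addEventListener("click", () => {
        // Lock the question, give immediate feedback, track the score
        block.querySelectorAll("button").forEach((b) => (b.disabled = true));
        btn.style.background = oi === q.answer ? "lightgreen" : "salmon";
        if (oi === q.answer) score++;
        answered++;
        if (answered === questions.length) {
          // Final results once every question has been attempted
          quiz.insertAdjacentHTML("beforeend", `<p>Final result: ${score} / ${questions.length}</p>`);
        }
      });
      block.appendChild(btn);
    });
    quiz.appendChild(block);
  });
}

renderQuiz();
```

The real file will be longer and styled via style.css, but the core pattern (render the questions, lock each one once answered, keep a running score, report at the end) is roughly this.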
The real novelty here is not just the browser but also the execution. I have tried a few side-by-side experiments with Claude, and in each case the fine-tuning needed for a satisfactory output was less here. The one failed experiment so far is converting all my saved links into a searchable, filterable dashboard: the dashboard looks good, but I think there were too many links and it kept failing to make all of them active. Where tools like NotebookLM offer a counter to the text-in, reams-out UX of LLMs of the ChatGPT variety, this offers a closer-to-seamless agent experience, and it is both ease of use and actual utility that will drive use, I think.
Warning: the next two paragraphs might feel a bit like wading through treacle, but I think what follows is useful and is probably necessary context for the activity linked at the end!
LLMs generate text using sophisticated prediction/probability models, and whilst I am no expert (so if you want proper, technical accounts please do go to an actual expert!) I think it useful to home in on three concepts that help explain how their outputs feel and read: temperature, perplexity and burstiness. Temperature sets how adventurous the word-by-word choices are: low values produce steady, highly predictable prose; high values (the scale typically runs from 0 up to 1 or 2, depending on the tool) invite surprise and variation (supposedly more ‘creativity’ and certainly more hallucination). Perplexity measures how hard it is to predict the next word overall, and burstiness captures how unevenly those surprises cluster, like the mix of long and short sentences in some human writing, and maybe even a smattering of stretched metaphor and whimsy. Most early (I say early making it sound like mediaeval times, but we’re talking 2-3 years ago!) AI writing felt ‘flat’ or ‘bland’, and therefore more detectable to human readers, because default temperatures were conservative and burstiness was low.
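To make those three ideas a little more concrete, here is a toy sketch (my own simplification, with invented numbers; real models work on tokens and logits at a very different scale). Temperature reshapes the probabilities before a word is picked, perplexity summarises how surprising a whole passage was, and burstiness is approximated here by how unevenly sentence lengths vary.

```javascript
// Toy illustration of temperature, perplexity and burstiness.
// All numbers are invented; this is not how any particular model implements them.

// Temperature: rescale a model's raw scores (logits) before sampling.
// Low temperature sharpens the distribution (predictable); high flattens it (adventurous).
function softmaxWithTemperature(logits, temperature) {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - max)); // subtract max for numerical stability
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

console.log(softmaxWithTemperature([2.0, 1.0, 0.5], 0.2)); // ~[0.99, ...] one near-certain choice
console.log(softmaxWithTemperature([2.0, 1.0, 0.5], 1.5)); // much flatter, more room for surprise

// Perplexity: how "surprised" a model is by a sequence, derived from the probabilities
// it assigned to each word that actually appeared.
function perplexity(tokenProbs) {
  const avgNegLog = tokenProbs.reduce((a, p) => a - Math.log(p), 0) / tokenProbs.length;
  return Math.exp(avgNegLog);
}

console.log(perplexity([0.9, 0.8, 0.85, 0.9])); // low: flat, predictable prose
console.log(perplexity([0.9, 0.1, 0.7, 0.05])); // higher: more surprising word choices

// Burstiness: one crude proxy is how unevenly sentence lengths vary.
function burstiness(sentenceLengths) {
  const mean = sentenceLengths.reduce((a, b) => a + b, 0) / sentenceLengths.length;
  const variance =
    sentenceLengths.reduce((a, l) => a + (l - mean) ** 2, 0) / sentenceLengths.length;
  return Math.sqrt(variance) / mean; // coefficient of variation
}

console.log(burstiness([12, 11, 13, 12])); // low: uniform, "flat" rhythm
console.log(burstiness([4, 28, 9, 35]));   // high: a mix of long and short sentences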
I imagine most ChatGPT (other tools are available) users do not think much about such things, given these are not visible choices in the main user interface. Funnily enough, I do recall these were options in the publicly available tools that pre-dated GPT-3.5 (the BIG release in November ’22). Like a lot of things, skilled use can make a difference (so a user might specify a style or tone in the prompt). Also, with money comes better options: Pro account custom GPTs, for example, can have precise built-in customisations. I also note that few seem to use the personalisation options that override some of the things many folk find irritating in LLM outputs (mine states, for example, that it should use British English as default, never use em dashes and use ‘no mark-up’ as default). I should also note that some tools still allow temperature manipulation in the main user interface (Google Gemini AI Studio, for example) or when using the API (ChatGPT). Google AI Studio also has a ‘top P’ setting, allowing users to control how narrowly word choices are drawn from the most probable candidates. These things can drive you to distraction, so it’s probably no wonder that most right-thinking, time-poor people have no time for experimental tweaking of this nature. But as models have evolved, developers have embedded dynamic temperature controls and other tuning methods that automatically vary these qualities. The result is that the claim ‘I can tell when it’s AI’ may be true of inexpert, unmodified outputs from free tools, but it is much harder to sustain against more sophisticated use and paid-for tools. Interestingly, the same appears true for AI detectors: the early detectors’ reliance on low-temperature signatures now needs revisiting too, for those not already convinced of their vincibility.
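For the curious, ‘top P’ generally refers to what is called nucleus sampling: only the most probable candidate words whose probabilities add up to P stay in the running, and the model samples from that reduced pool. A rough sketch, again with invented numbers:

```javascript
// Rough sketch of a 'top P' (nucleus sampling) filter. Illustrative only.
function topPFilter(probs, p) {
  // Sort candidate words by probability, keep the smallest set whose
  // cumulative probability reaches p, then renormalise.
  const indexed = probs
    .map((prob, i) => ({ i, prob }))
    .sort((a, b) => b.prob - a.prob);

  const kept = [];
  let cumulative = 0;
  for (const item of indexed) {
    kept.push(item);
    cumulative += item.prob;
    if (cumulative >= p) break;
  }

  const total = kept.reduce((a, item) => a + item.prob, 0);
  return kept.map((item) => ({ index: item.i, prob: item.prob / total }));
}

// With a low P only the single most predictable word survives;
// with a high P rarer, more surprising words stay in contention.
const nextWordProbs = [0.55, 0.2, 0.12, 0.08, 0.05];
console.log(topPFilter(nextWordProbs, 0.5));  // just the top candidate
console.log(topPFilter(nextWordProbs, 0.95)); // four candidates remain in play
```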
Evolutionary and embedded changes therefore have a humanising effect on LLM outputs. Modern systems can weave in natural fluctuations of rhythm and unexpected word choices, erasing much of the familiar ChatGPT blandness. Skilled (some would say ‘cynical’) users, whether through careful prompting or by passing text through paraphrasers and ‘humanisers’, can amplify this further. Early popular detectors such as GPTZero (at my work we are clear that colleagues should NEVER upload student work to such platforms, btw) leaned heavily on perplexity and burstiness patterns to spot machine-generated work, but this is increasingly a losing battle. Detector developers are responding with more complex model-based classifiers and watermarking ideas, yet the arms race remains uneven: every generation of LLMs makes it easier to sidestep statistical fingerprints and harder to prove authorship with certainty.
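As a caricature of how those early, statistics-led detectors worked (and emphatically not how GPTZero or any real product is implemented), imagine a rule as crude as the one below, built on the toy perplexity and burstiness measures sketched earlier. It is easy to see why raising temperature, prompting for varied rhythm or running text through a ‘humaniser’ defeats it.

```javascript
// Deliberately crude caricature of an early statistical detector.
// Thresholds are invented for illustration only: text that is both easy to
// predict and evenly paced gets flagged as machine-like.
function looksMachineGenerated(perplexityScore, burstinessScore) {
  const PERPLEXITY_THRESHOLD = 2.0;
  const BURSTINESS_THRESHOLD = 0.3;
  return perplexityScore < PERPLEXITY_THRESHOLD && burstinessScore < BURSTINESS_THRESHOLD;
}

console.log(looksMachineGenerated(1.2, 0.06)); // flat, predictable prose -> flagged
console.log(looksMachineGenerated(4.2, 0.68)); // varied, surprising prose -> passes
```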
For fun I ran this article through GPTZero… Phew!
It is also worth reflecting on what kinds of writing we value. My own style, for instance, happily mixes a smorgasbord of metaphors in a dizzying (or maybe it’s nauseating) cocktail of overlong sentences, excessive comma use and dated cultural references (ooh, and sprinkles in frequent parentheses too). Others might genuinely prefer the neat, low-temperature clarity an AI can produce. And some humans write with such regularity that a detector might wrongly flag them as synthetic. I understand that these traits may often reflect the writing of neurodivergent or multilingual students.
To explore this phenomenon and your own thinking further, please try this short activity. I used my own text as a starting point and generated (in Perplexity) five AI variants at varying temperatures. The activity itself was built in Claude. The idea is that it reveals your own preferred ‘perplexity and burstiness combo’ and might prompt a fresh look at your writing preferences and the blurred boundaries between human and machine style. The temperature setting is revealed when you make your selection. Please try it out and let me know how I might improve it (or whether I should chuck it out of the window, i.e. DefenestrAIt it).
Obviously, as my job is to encourage thinking and reflection about what this means for those teaching, those studying and, more broadly, the institutions they work or study in, I’ll finish with a few questions to stimulate reflection or discussion:
In teaching: Do you think you can detect AI writing? How might you respond when you suspect AI use but cannot prove it with certainty? What happens to the teacher-student relationship when detection becomes guesswork rather than evidence?
For assignment design: Could you shift towards process-focused assessment or tasks requiring personal experience, local knowledge or novel data? What kinds of writing assignments become more meaningful when AI can handle the routine ones? Has that actually changed in your discipline or not?
For your students: How can understanding these technical concepts help students use AI tools more thoughtfully rather than simply trying to avoid detection? What might students learn about their own writing voice through activities that reveal their personal perplexity and burstiness patterns? What is it about AI outputs that students who use them value and what is it that so many teachers disdain?
For your institution: Should institutions invest in detection tools given this technological arms race, or focus resources elsewhere? How might academic integrity policies need updating as reliable detection becomes less feasible?
For equity: Are students with access to sophisticated prompting techniques or ‘humanising’ tools gaining unfair advantages? How do we ensure that AI developments don’t widen existing educational inequalities? Who might we be inadvertently discriminating against with blanket bans or no use policies?
For the bigger picture: What kinds of human writing and thinking do we most want to cultivate in an age when machines can produce increasingly convincing text? How do we help students develop authentic voice and critical thinking skills that remain distinctly valuable?
When you know the answer to the last question, let me know.