The Clockwork Penguin

Daniel Binns is a media theorist and filmmaker tinkering with the weird edges of technology, storytelling, and screen culture. He is the author of Material Media-Making in the Digital Age and currently writes about posthuman poetics, glitchy machines, and speculative media worlds.

Tag: AI art

  • How I Read AI Images

    Image generated by Adobe Firefly, 3 September 2024; prompt unknown.

    AI-generated media sit somewhere between representational image and posthuman artefact: they are representations of data rather than of reality. This ambiguous nature suggests that we need methods that consider these images not just as cultural objects, but also as products of the systems that made them. I am following here in the wake of other pioneers who’ve bravely broken ground in this space.

    For Friedrich Kittler and Jussi Parikka, the technological, infrastructural and ecological dimensions of media are just as important as content, if not more so. They extend Marshall McLuhan’s notion that ‘the medium is the message’ beyond the affordances of a given media type, form or channel, into the very mechanisms and processes that shape content before and during its production or transmission.

    I take these ideas and extend them to the outputs themselves: a media-materialist analysis. Rather than dismissing AI media as mere ‘slop’, this method treats them as cultural-computational artefacts: assemblages compiled from layered systems, which I break down into data, model, interface, and prompt. Each step of the generative process leaves traces in visual outputs, and we might be able to train ourselves to read them.

    Data

    There is no media generation without training data. These datasets can be so vast as to feel unknowable, or so narrow that they feel constricting. LAION-5B, for example, from which Stable Diffusion’s original training sets were drawn, contains around 5.85 billion image-text pairs. Technically, you could train a model on a handful of images, or even just one, but the model would then be ‘remembering’ rather than ‘generating’. Video models tend to use comparatively smaller datasets, such as PANDA-70M, which contains over 70 million video-caption pairs: about 167,000 hours of footage.

    Training data for AI models is also hugely contentious, given that many proprietary tools are trained on data scraped from the open internet. Thus, when considering datasets, it’s important to ask what kinds of images and subjects are privileged. Social media posts? Stock photos? Vector graphics? Humans? Animals? Are diverse populations represented? Such patterns of inclusion/exclusion might reveal something about the dataset design, and the motivations of those who put it together.
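
    One practical way into these questions is to sample the public caption metadata and simply count what turns up. Below is a rough Python sketch, assuming the metadata is available as a streaming Hugging Face dataset; the dataset ID, the ‘TEXT’ column name, and the keyword list are my own illustrative choices, so check the schema of whichever release or slice you actually have access to.

    ```python
    from collections import Counter
    from datasets import load_dataset

    # Dataset ID and column name are illustrative; LAION metadata releases
    # change over time, so confirm the schema of the slice you are using.
    ds = load_dataset("laion/laion2B-en", split="train", streaming=True)

    keywords = ("stock photo", "wedding", "clipart", "woman", "man", "logo")
    counts = Counter()

    for i, row in enumerate(ds):
        caption = (row.get("TEXT") or "").lower()
        for kw in keywords:
            if kw in caption:
                counts[kw] += 1
        if i >= 100_000:  # sample a slice rather than crawling billions of rows
            break

    print(counts.most_common())
    ```

    Even a crude keyword count like this starts to show which subjects and image types dominate the captions, and by extension what a model trained on them is most fluent in.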

    A ‘slice’ of the LAION-Aesthetics dataset. The tool I used for this can be found/forked on GitHub.

    Some datasets are human-curated (e.g. COCO, ImageNet), and others are algorithmically scraped and compiled (e.g. LAION-Aesthetics). There may be readable differences in how these datasets shape images. You might consider:

    • Are the images coherent? Chaotic/glitched?
    • What kinds of prompts result in clearer, cleaner outputs, versus morphed or garbled material?

    The dataset is the first layer where cultural logics, assumptions, patterns of normativity or exclusion are encoded in the process of media generation. So: what can you read in an image or video about what training choices have been made?

    Model

    The model is a program: code and computation. The model determines what happens to the training data — how it’s mapped, clustered, and re-surfaced in the generation process. This re-surfacing can influence styles, coherence, and what kinds of images or videos are possible with a given model.

    If there are omissions or gaps in the training data, the model may fail to render coherent outputs around particular concepts, resulting in glitchy images, or errors in parts of a video.

    Midjourney’s early test models reportedly drew on Stable Diffusion, a model in active development by Stability AI since 2022. Stable Diffusion works via a process of iterative de-noising: each stage in the process brings the output closer to a viable, stable representation of what’s included in the user’s prompt. Leonardo.Ai’s newer Lucid models also operate via diffusion, but specialists are brought in at various stages to ‘steer’ the model in particular directions, e.g. to verify what counts as ‘photographic’, ‘artistic’, ‘vector graphic design’, and so on.
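
    To make the de-noising loop concrete, here is a minimal sketch of running an open Stable Diffusion checkpoint with the diffusers library. The model ID, prompt and settings are my own illustrative choices, not any platform’s defaults; the point is that num_inference_steps is literally the number of de-noising passes described above.

    ```python
    import torch
    from diffusers import StableDiffusionPipeline

    # Illustrative checkpoint; substitute whichever Stable Diffusion weights you have access to.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")  # assumes a CUDA GPU is available

    image = pipe(
        "wedding photo at golden hour, shallow depth of field",
        num_inference_steps=30,  # each step is one pass of iterative de-noising
        guidance_scale=7.5,      # how strongly the prompt steers each step
    ).images[0]
    image.save("output.png")
    ```

    Halving the step count, or dialling the guidance scale up and down, is an easy way to watch the model’s ‘best guess’ firm up or fall apart.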

    When considering the model’s imprint on images or videos, we might consider:

    • Are there recurring visual motifs, compositional structures, or aesthetic fingerprints?
    • Where do outputs break down or show glitches?
    • Does the model privilege certain patterns over others?
    • What does the model’s “best guess” reveal about its learned biases?

    Analysing AI-generated media with these considerations in mind may reveal the internal logics and constraints of the model. Importantly, though, these logics and constraints will always shape AI media, whether they are readable in the outputs or not.

    Interface

    The interface is what the user sees when they interact with any AI system. Interfaces shape user perceptions of control and creativity. They may guide users towards a particular kind of output by making some choices easier or more visible than others.

    Midjourney, for example, displays a simple text box with the option to open a sub-menu featuring some more customisation options. Leonardo.Ai’s interface is more of what I call a ‘studio suite’, with many controls visible initially, and plenty more available with a few menu clicks. Offline tools cover a similar spectrum, from the deliberately simple (DiffusionBee) to the densely configurable, node-based (ComfyUI).

    Midjourney’s web interface: ‘What will you imagine?’
    Leonardo.Ai’s ‘studio suite’ interface.

    When looking at interfaces, consider what controls, presets, switches or sliders are foregrounded, and what is either hidden in a sub-menu or not available at all. This will give a sense of what the platform encourages: technical mastery and fine control (lots of sliders, parameters), or exploration and chance (minimal controls). Does this attract a certain kind of user? What does this tell you about the ‘ideal’ use case for the platform?

    Interfaces, then, don’t just shape outputs. They also cultivate different user subjectivities: the tinkerer, the artist, the consumer.

    Reading interfaces in outputs can be tricky. If the model or platform is known, one can speak knowledgeably about how the interface may have pushed certain styles, compositions, or aesthetics. Even if the platform is not known, there are some elements to speak to. A coherent style may point to strong prompt adherence or to presets embedded in the interface; stable compositions, or more chaotic clusters of elements, may point to a slider that was available to the user.

    Whimsical or overly ‘aesthetic’ outputs often come from Midjourney. Increasingly, outputs from Kling and Leonardo are becoming much more realistic — and not in an uncanny way. But both Kling and Leonardo’s Lucid models put a plastic sheen on human figures that is recognisable.

    Prompt

    While some have speculated that other user input modes might be forthcoming — and others have suggested that such modes might be better — the prompt has remained the mainstay of the AI generation process, whether for text, image, video, software, or interactive environments. Some platforms state explicitly that their tools or models offer good ‘prompt adherence’, i.e. what you put in is what you’ll get, but this is contingent on your putting in plausible, coherent prompts.

    Prompts activate the model’s statistical associations, learned largely from the captions paired with images in the training data, but they are filtered through linguistic ambiguity and platform-specific ‘prompting grammars’.
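
    To see what those statistical associations look like in practice, here is a small sketch using an open CLIP text encoder, the same family of encoder Stable Diffusion uses, to embed a few prompts and compare them. The prompts and model ID are my own illustrative choices.

    ```python
    import torch
    from transformers import CLIPTokenizer, CLIPTextModelWithProjection

    model_id = "openai/clip-vit-base-patch32"  # illustrative encoder choice
    tokenizer = CLIPTokenizer.from_pretrained(model_id)
    encoder = CLIPTextModelWithProjection.from_pretrained(model_id)

    prompts = [
        "wedded bliss",
        "bride and groom on their wedding day",
        "a couple happily married for decades",
    ]
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        embeds = encoder(**inputs).text_embeds  # one vector per prompt

    embeds = embeds / embeds.norm(dim=-1, keepdim=True)  # normalise for cosine similarity
    print(embeds @ embeds.T)  # pairwise similarities: which prompts the encoder treats as close
    ```

    If ‘wedded bliss’ sits closer to the wedding-day prompt than to the long-marriage one, that is exactly the kind of caption-driven association the sample analysis below picks up on.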

    Tools or platforms may offer options for prompt adherence or enhancement. Enhancement typically pushes the user’s prompt through a pre-trained LLM designed to embellish it with extra descriptors and pointers before generation.
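
    As a rough sketch of what such an enhancement pass might look like behind the scenes (the system instruction and model name below are entirely my own guesses, not any platform’s actual pipeline):

    ```python
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def enhance_prompt(user_prompt: str) -> str:
        """Hypothetical enhancement pass: an LLM pads the prompt with extra descriptors."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Rewrite the user's image prompt with added detail about lighting, "
                        "composition, lens, and mood. Return only the rewritten prompt."
                    ),
                },
                {"role": "user", "content": user_prompt},
            ],
        )
        return response.choices[0].message.content

    print(enhance_prompt("wedded bliss"))
    ```

    Whatever descriptors the enhancer favours become another layer of authorship sitting between the user’s words and the final image.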

    If the prompt is known, one might consider the model’s interpretation of it in the output, in terms of how literal or metaphorical the model has been. There may be notable traces of prompt conventions, or community reuse and recycling of prompts. Are there any concepts from the prompt that are over- or under-represented? If you know the model as well as the prompt, you might consider how much the model has negotiated between user intention and known model bias or default.

    Even the clearest prompt is mediated by statistical mappings and platform grammars — reminding us that prompts are never direct commands, but negotiations. Thus, prompts inevitably reveal both the possibilities and limitations of natural language as an interface with generative AI systems.

    Sample Analysis

    Image generated by Leonardo.Ai, 29 September 2025; prompt by me.
    Prompt: ‘wedded bliss’
    Model: Lucid Origin
    Platform: Leonardo.Ai
    Prompt enhancement: off
    Style preset: off

    The human figures in this image are young, white, thin, able-bodied, and adhere to Western and mainstream conventions of health and wellness. The male figure has short trimmed hair and a short beard, and the female figure has long blonde hair. The male figure is taller than the female figure. They are pictured wearing traditional Western wedding garb: a suit for the man, and a white dress with veil for the woman. Notably, all of the above was true for each of the four generations that came out of Leonardo for this prompt. The only real difference was in setting/location, and in the distance of the subjects from the ‘camera’.

    By default, Lucid Origin appears to compose images with subjects in the centre of frame and in sharp focus, with details of the background tending towards soft focus or complete blur. This centred, symmetrical composition with selective focus seems characteristic of Leonardo’s house style, which leans toward professional photography aesthetics even when style presets are explicitly turned off.

    The model struggles a little with fine human details, such as eyes, lips, and mouths. Notably the number of fingers and their general proportionality are much improved from earlier image generators (fingernails may be a new problem zone!). However, if figures are touching, such as in this example where the human figures are kissing, or their faces are close, the model struggles to keep shadows, or facial features, consistent. Here, for instance, the man’s nose appears to disappear into the woman’s right eye. When the subjects are at a distance, inconsistencies and errors are more noticeable.

    Overall though, the clarity and confident composition of this image — and the others that came out of Leonardo with the same prompt — would suggest that a great many wedding photos, or images from commercial wedding products, are present in the training data.

    Interestingly, without prompt enhancement, the model defaulted to an image presumably from the couple’s wedding day, as opposed to interpreting ‘wedded bliss’ to mean some other happy time during a marriage. This literal interpretation suggests that training data captions likely associate ‘wedded bliss’ (or ‘wed*’ as a wildcard term) directly with wedding imagery rather than with the broader concept of happiness in marriage.

    This analysis shows how attention to all four layers — data biases, model behaviour, interface affordances, and prompt interpretation — reveals the ‘wedded bliss’ image as a cultural-computational artefact shaped by commercial wedding photography, heteronormative assumptions, and the technical characteristics of Leonardo’s Lucid Origin model.


    This analytic method is meant as an alternative to dismissing AI media outright. To read AI images and video as cultural-computational artefacts is to recognise them as products, processes, and infrastructural traces all at once. Such readings resist passive consumption, expose hidden assumptions, and offer practical tools for interpreting the visuals that generative systems produce.


    This is a summary of a journal article currently under review. Out of respect for the ethics of peer review, this version is much edited and heavily abridged, and the sample analysis is new, written specifically for this post. Once the article is published, I will link the full version here.

  • A Little Slop Music

    The AI experiment that turned my ick to 11 (now you can try it too!)

    When I sit at the piano I’m struck by a simple paradox: twelve repeating keys are both trivial and limitless. The layout is simple; mastery is not. A single key sets off a chain — lever, hammer, string, soundboard. The keyboard is the interface that controls an intricate deeper mechanism.

    The computer keyboard can be just as musical. You can sequence loops, dial patches, sample and resample, fold fragments into new textures, or plug an instrument in and hear it transformed a thousand ways. It’s a different kind of craft, but it’s still craft.

    Generative AI has given me more “magic” moments than any other technology I’ve tried: times when the interface fell away and something like intelligence answered my inputs. Images, text, sounds appearing that felt oddly new: the assemblage transcending its parts. Still, my critical brain knows it’s pattern-play: signal in noise.

    AI-generated music feels different, though.

    ‘Blåtimen’, by Lars Vintersholm & Triple L, from the album Just North of Midnight.

    In exploring AI, music, and ethics after the Velvet Sundown fallout, a colleague tasked students with building fictional bands: LLMs for lyrics and backstory, image and video generators for faces and promo, Suno for the music. Some students leaned into the paratexts; the musically inclined pulled stems apart and remixed them.

    Inspired, I tried it myself. And, wouldn’t you know, the experience produced a pile of Thoughts. And not insignificantly, a handful of Feelings.

    Lars Vintersholm, captured for a feature article in Scena Norge, 22 August 2025.

    Ritual-Technic: Conjuring a Fictional AI Band

    1. Start with the sound

    • Start with loose stylistic prompts: “lofi synth jazz beats,” “Scandi piano trio,” “psychedelic folk with sitar and strings,” or whatever genre-haunting vibe appeals.
    • Generate dozens (or hundreds) of tracks. Don’t worry if most are duds — part of the ritual is surfing the slop.
    • Keep a small handful that spark something: a riff, a texture, an atmosphere.

    2. Conjure the band

    • Imagine who could be behind this sound. A trio? A producer? A rotating collective?
    • Name them, sketch their backstories, even generate portraits if you like.
    • The band is a mask: it makes the output feel inhabited, not just spat out by a machine.

    3. Add the frame

    • Every band needs an album, EP, or concept. Pick a title that sets the mood (Just North of Midnight, Spectral Mixtape Vol. 1, Songs for an Abandoned Mall).
    • Create minimal visuals — a cover, a logo, a fake gig poster. The paratexts do heavy lifting in conjuring coherence.

    4. Curate the release

    • From the pile of generations, select a set that holds together. Think sequencing, flow, contrasts — enough to feel like an album, not a playlist.
    • Don’t be afraid to include misfires or weird divergences if they tell part of the story.

    5. Listen differently

    • Treat the result as both artefact and experiment. Notice where it feels joyous, uncanny, or empty.
    • Ask: what is my band teaching me about AI systems, creativity, and culture?

    Like many others, I’m sure, it took me a while to really appreciate jazz. For the longest time, for an ear tuned to consistent, unchanging monorhythms, clear structures, and simple chords and melodies, it just sounded like so much noise. It wasn’t until I became a little better at piano, but really until I saw jazz played live, and started following jazz musicians, composers, and theorists online, that I became fascinated by the endless inventiveness and ingenuity of these musicians and this music.

    This exploration, rightly, soon expanded into the origins, people, stories, and cultures of this music. This is a music born of pain, trauma, struggle, injustice. It is a music whose pioneers, masters, apprentices, advocates, have been pilloried, targeted, attacked, and abused, because of who they are, and what they were trying to express. Scandinavian jazz, and European jazz in general, is its own special problematic beast. At best, it is a form of cultural appropriation, at worst, it is an offensive cultural colonialism.

    Here I was, then, conjuring music from my imaginary Scandi jazz band in Suno, in the full knowledge that even this experiment, this act of play, brushes up against both a fraught musical history, as well as ongoing debates and court cases on creativity, intellectual property, and generative systems.

    Play is how I probe the edges of these systems, how I test what they reveal about creativity, culture, and myself. But for the first time, the baseline ‘ickiness’ I feel around the ethics of AI systems became almost emotional, even physiological. I wasn’t just testing outputs, but testing myself: the churn of affect, the strangeness in my body, the sick-fascinated thrill of watching the machine spit out something that felt like an already-loaded form of music, again and again. Addictive, uncanny, grotesque.

    It’s addictive, in part, because it’s so fast. You put in a few words, generate or enter some lyrics, and within two minutes you have a functional piece of music that sounds 80 or 90% produced and ready to do whatever you want with. Each generation is wildly different if you want it to be. You might also generate a couple of tracks in a particular style, enable the cover version feature, and hear those same songs in a completely different tone, instrumentation, genre. In the midst of generating songs, it felt like I was playing or using some kind of church organ-cum-starship enterprise-cum-dream materialiser… the true sensation of non-stop slop.

    What perhaps made it more interesting was the vague sense that I was generating something like an album, or something like a body of work within a particular genre and style. That meant that when I got a surprising result, I had to decide whether this divergence from that style was plausible for the spectral composer in my head.

    But behind this spectre-led exhilaration: the shadow of a growing unease.

    ‘Forever’, by Lars Vintersholm & Triple L (ft. Magnus LeClerq), from the album Just North of Midnight.

    AI-generated music used to survive only half-scrutiny: fine as background noise, easy to ignore. It still can be — but with the right prompts and tweaks, the outputs are now more complex, even if not always more musical or artistic.

    If all you want is a quick MP3 for a short film or TikTok, they’re perfect. If you’re a musician pulling stems apart for remixing or glitch experiments, they’re interesting too — but the illusion falls apart when you expect clean, studio-ready stems. Instead of crisp, isolated instruments, you hear the model’s best guesses: blobs of sound approximating piano, bass, trumpet. Like overhearing a whole track, snipping out pieces that sound instrument-like, and asking someone else to reassemble them. The seams show. Sometimes the stems are tidy, but when they wobble and smear, you catch a glimpse of how the machine is stitching its music together.

    The album Just North of Midnight only exists because I decided to make something out of the bizarre and queasy experience of generating a pile of AI songs. It exists because I needed a persona — an artist, a creative driver, a visionary — to make the tension and the weirdness feel bearable or justified. The composer, the trio, the album art, the biographies: all these extra elements, whether as worldbuilding or texture, lend (and only lend) a sense of legitimacy and authenticity to what is really just an illusion of a coherent, composed artefact.

    For me, music is an encounter and an entanglement — of performer and instrument, artist and audience, instrument and space, audience and space, hard notes and soft feel. Film, by contrast (at least for me), is an assemblage — sound and vision cut and layered for an audience. AI images or LLM outputs feel assemblage-like too: data, models, prompts, outputs, contexts stitched together. AI music may be built on the same mechanics, but I experience it differently. That gap — between how it’s made and how it feels — is why AI music strikes me as strange, eerie, magical, uncanny.

    ‘Seasonal Blend’, by Lars Vintersholm & Triple L, from the album Just North of Midnight.

    So what’s at stake here? AI music unsettled me because it plays at entanglement without ever truly achieving it. It mimics encounter while stitching together approximations. And in that gap, I — perhaps properly for the first time — glimpsed the promise and danger of all AI-generated media: a future where culture collapses into an endless assemblage of banal, plausible visuals, sounds, and words. This is a future that becomes more and more likely unless we insist on the messy, embodied entanglements that make art matter: the contexts and struggles it emerges from, the people and stories it carries, the collective acts of making and appreciating that bind histories of pain, joy, resistance, and creativity.


    Listen to the album Just North of Midnight in its complete strangeness on SoundCloud.

  • Alternate Spaces

    Alternate Spaces © 2024 by Daniel Binns is licensed under CC BY-SA 4.0.

    See more AI weirdness here.