April 28, 2026: This article has been substantially revised.

I’m a composer, properly trained. Pieces with a beginning and an end and something in between that leads from one to the other. Linear composition. Around 2014 that stopped holding me. A vision has been sitting in my head ever since: another form of music. One you can walk into. Something a listener can enter, move through, and shape by moving. Like a jam session among musicians: you step in, something living takes you in, your playing changes things, and that comes back at you. How any of this really works, I still don’t know. I’m only approaching it.

I’ve been trying to translate that into my work for years. Vergehen was where it started. Maya came later, in 2017, as a mixed-reality opera. Then the soundwalks with the Munich Philharmonic, and in 2019/20 Lure, an open studio on music theatre and AI in which my team and I got seriously into music AI. With each of these attempts I kept arriving at the same wall. The tools for interactive and adaptive music aren’t actually that complicated. But planning all the assets and all the rules for every possible eventuality is so demanding that you barely get to the point where it starts to feel alive. Where it would be a being you make contact with. By the time of Lure it was clear that such a being could, in the end, only be an AI model.

Training data

Lure also showed me where this fails: at the training data. There simply isn’t enough music to train such a thing properly. Even where there’s quantity, diversity is missing, meaning variety in genre and style. And the truly rich repositories that exist somewhere are all rights-bound, license-protected, locked. Text and images are different: we’ve all spent two decades willingly sharing content on social networks, and those corpora were then swept up for AI training, legal or not. In music this Wild-West data harvesting never happened, and I actually think that’s a good thing: there we still have a working legal system.

Incentives

That realization honestly discouraged me at the time. The results of music AIs hadn’t struck me as all that exciting anyway, and the data question seemed unsolvable. I kept doing my artistic work and turned alongside it to other things that excited me. Decentralized structures and crypto economies in particular. What fascinated me there was the mechanics of incentive systems. How collective endeavors function, how people get involved because the act of participating itself becomes valuable, because they hold a share in what gets built together. I also found the speculative dynamic that drives such networks interesting, precisely because it is exciting. The same mechanics turn up in network marketing, where they work surprisingly well. I find them entirely legitimate when used for something decent.

At some point I started putting the two together. I started wondering why there’s so much training data for text and images and so little for music. That clearly has to do with the social networks through which our self-exposure has been running for the last twenty years. There were attempts to build a comparable social network for music, but none of them really took off, because making music as an activity is too niche. There’s no engine. And the thought was: maybe that engine could come from the incentive systems of the crypto world.

That’s how the idea emerged for a protocol that invites the global community of music-makers to build a shared training corpus together. With samples, with recorded pieces, and with the at-least-as-important work of description and quality control. Whoever contributes content, description, or quality control gets a share of the corpus formally assigned to them. When that corpus is later licensed or otherwise generates value, that value flows through the protocol back to everyone who helped build it.
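To make that concrete, here is a minimal sketch of the share-and-payout mechanics in Python. It is not the actual CORPUS protocol; the names (Contribution, Ledger, distribute), the pro-rata rule, and the scoring are my own illustrative assumptions.

```python
# Minimal sketch of the share-and-payout idea, not the actual CORPUS
# protocol. Contribution, Ledger, and the pro-rata rule are illustrative
# assumptions; how scores are set is left open.
from dataclasses import dataclass, field

@dataclass
class Contribution:
    contributor: str
    kind: str       # "content", "description", or "quality_control"
    score: float    # protocol-assigned weight for this contribution

@dataclass
class Ledger:
    contributions: list[Contribution] = field(default_factory=list)

    def add(self, c: Contribution) -> None:
        self.contributions.append(c)

    def shares(self) -> dict[str, float]:
        """Each contributor's fraction of the corpus, pro rata by score."""
        total = sum(c.score for c in self.contributions)
        out: dict[str, float] = {}
        for c in self.contributions:
            out[c.contributor] = out.get(c.contributor, 0.0) + c.score / total
        return out

    def distribute(self, revenue: float) -> dict[str, float]:
        """Split licensing revenue according to the recorded shares."""
        return {who: revenue * share for who, share in self.shares().items()}

ledger = Ledger()
ledger.add(Contribution("ana", "content", 3.0))
ledger.add(Contribution("ben", "description", 1.0))
print(ledger.distribute(100.0))  # {'ana': 75.0, 'ben': 25.0}
```

The point of the design is that shares are assigned at contribution time, so later revenue can be split without any renegotiation.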

The protocol can do more than distribute. It can reward diversity by checking whether a new contribution occupies territory that the corpus currently underrepresents. The more original, the further from what’s already there, the higher the reward. It can weight musical and technical quality. In this way the incentives for quality and diversity are baked in long before any model is trained on the data.
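One way such a check could look, as a sketch: assume each contribution carries an audio embedding from some feature extractor, and measure how far a new embedding sits from its nearest neighbors in the corpus. The functions and parameters below are assumptions for illustration, not the protocol’s actual rule.

```python
# Hedged sketch of a diversity-weighted reward. Assumes every item in the
# corpus has an audio embedding from some feature extractor; novelty(),
# reward(), and their parameters are illustrative, not the protocol's rule.
import numpy as np

def novelty(embedding: np.ndarray, corpus: np.ndarray, k: int = 5) -> float:
    """Mean distance to the k nearest existing items: far = underrepresented."""
    if len(corpus) == 0:
        return 1.0  # the first contribution is maximally novel by definition
    dists = np.linalg.norm(corpus - embedding, axis=1)
    return float(np.sort(dists)[:k].mean())

def reward(embedding: np.ndarray, corpus: np.ndarray,
           quality: float, base: float = 1.0) -> float:
    """Grows with distance from the existing corpus, scaled by quality."""
    return base * quality * (1.0 + novelty(embedding, corpus))

corpus = np.random.rand(1000, 64)  # 1000 existing items, 64-dim embeddings
new = np.random.rand(64)           # embedding of a new contribution
print(reward(new, corpus, quality=0.9))
```

Measuring distance to the k nearest items, rather than to the corpus average, is what rewards filling underrepresented territory instead of merely being far from the mean.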

Jam

Money on its own won’t get musicians to take part, especially when the listener at the end is a machine. Making music is really about encounter, about playing together, about the third thing that arises between two people. And that’s missing entirely from today’s training corpora. So we came up with Jam, a platform that works like an open rehearsal. I record an idea, a guitar phrase, a few bars. A musician I don’t know hears it, feels something there, plays on top of it: a saxophone, a voice, a loop. A third person hears the saxophone and responds to that. From these connections, kinship relations emerge between contributions: first, second, third generation. For an AI model that’s one day supposed to even approximately handle something like inspiration and flow, such relations are, I think, among the most valuable kinds of data there are.
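For illustration, a minimal sketch of how such kinship relations could be recorded as data: each take points to the take it responds to, which makes generation and lineage trivially recoverable. Take and its fields are hypothetical, not Jam’s actual schema.

```python
# Sketch of the kinship data Jam could record: each take points to the take
# it responds to. Take and its fields are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Take:
    id: str
    author: str
    parent: Optional["Take"] = None  # the take this one responds to

    @property
    def generation(self) -> int:
        """0 for a seed idea, 1 for a direct response, and so on."""
        return 0 if self.parent is None else self.parent.generation + 1

    def lineage(self) -> list[str]:
        """Chain of ids from the seed down to this take."""
        return ([] if self.parent is None else self.parent.lineage()) + [self.id]

guitar = Take("t1", "composer")                # the seed phrase
sax = Take("t2", "sax_player", parent=guitar)  # responds to the guitar
voice = Take("t3", "singer", parent=sax)       # responds to the sax
print(voice.generation)  # 2
print(voice.lineage())   # ['t1', 't2', 't3']
```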

This idea kicked around in my head for a while. That was 2021, before the ChatGPT moment that then turned the whole AI world upside down. For some time I thought, well, the topic has slipped away from me, the big wave is rolling now, my idea from 2021 is obsolete. But the problem of legally clean training data hasn’t gone away. If anything, it’s become more visible by the month. About a year ago I came across a funding call from the European Union, and I used the occasion to sharpen the ideas, distill a concrete project, write a proposal. It’s now being funded, and CORPUS is being built.

With the talks I’m currently giving, I’m looking for partners, associations, alliances. I don’t want to develop the project as an island; even from a purely self-interested point of view, something like this only works as part of a broader movement.

My interest in AI is carried by a concrete artistic question: whether forms of musical experience can emerge that lie beyond linear composition. Spaces that respond. The AI models capable of that don’t yet exist. For them to exist, they have to be trained on data that’s fair, diverse, and licensable. CORPUS is meant to build that data foundation. Then the actual work can begin.

More on the project: corpus.music. Longer pieces in the journal: journal.corpus.music.