AI Content Generation, Part 4: What’s next
Software hasn't changed this much or this rapidly since the invention of the web.
This is Part 4 of a multi-part series on AI content generation. If you haven’t caught up on the series, you may want to do so before continuing. Part 1 covers machine learning basics, Part 2 covers ML tasks and gives an overview of different models, and Part 3 is a deep dive into Stable Diffusion.
If you’re building any of the stuff I describe in this piece or that I cover in this Substack, then you can use this form to tell me about it and I may be able to help you get connected with funding.
“A new primitive” — that’s how OpenAI’s Sam Altman referred to machine learning in a recent chat with Reid Hoffman. The term has also been applied lately to web3 tech like NFTs, and loosely translated it means something like:
A new capability we didn’t have before, and that we can build new things with. Not a technology most of us will use in isolation, like social media, but more of an enabling tech — like hypertext, the OSI network stack, or distributed ledger technology (a.k.a. cryptocurrencies) — that will support a combinatorial explosion of new user experiences and businesses.
I’ve now spent some time with this new primitive and with the communities that are rapidly growing up around it — Discords, Github pages, Google Colab notebooks, academic papers, YouTube — and I have some thoughts about what it is and what’s next.
This article is divided into two main sections:
Macro trends, i.e., things that are happening at the layer of models and tasks that will drive changes in the application layer.
Application trends, i.e., concrete applications that we’ll start seeing very soon.
The idea is to give a high-level overview of a handful of the forces that are shaping the machine learning space, and then try to game out how those forces will impact hardware and software in the near to medium term.
Neoauthorship and text generation
Optimization and tuning
Digital prosthetics and real-time, full-spectrum pseudonymity
On-demand, personalized retail
This list of macro trends driving big changes in AI is very incomplete and doesn’t even include many of the things ML researchers are talking about right now. But I selected these because they’re the trends that keep coming up for me as I think about user-facing applications and platforms.
The term “decentralized AI” can mean a few things, but I’m partial to the definition that Balaji Srinivasan and Stability AI’s Emad Mostaque both use, which is: AI that’s open-source, collaborative, and not controlled by any one entity.
In a world of decentralized AI, the middle part of this diagram from Part 2 of this series will blow up, as models and variants of models proliferate:
Yes, large companies are still going to keep their most state-of-the-art models either totally private (like Google’s Imagen) or walled off behind an API and a license agreement (like DALL-E from OpenAI). But we’ll start to see more models, like Stable Diffusion, that are open-source and freely available to modify and use.
The shift towards decentralized AI will ultimately have three main drivers:
Open-source-first companies and collectives (maybe DAOs) whose mission is to keep AI in the public’s hands.
Closed companies that release their trailing-edge models into the public domain.
Optimizations and innovations that combine with Moore’s Law and open datasets to make powerful models far cheaper to train, bringing customized models in reach of ever smaller businesses.
On this last point, we’re currently entering the era of the sub-$500K GPT-3 competitor, and the cost will continue to drop rapidly:
So the future of AI probably looks a bit like the past of relational databases, where even the biggest commercial users now happily run either PostgreSQL or MySQL (both free and open-source), while Oracle still makes money selling proprietary databases via lock-in and value-adds.
I’ve spent a fair amount of time in the weeds of how to customize Stable Diffusion by adding my own face to it, so that I can prompt it to generate pictures of “Jon Stokes, a person” in various styles. (I was going to write a how-to on this, but the tools are in such a state of flux that it’s best to wait until things settle down a bit.)
But this time has given me a glimpse of what’s coming really soon: generative models that know me, my family and pets, and my corpus of written work, and can spit out brand new family photos in different locations or new written works in my voice.
The kind of alternate reality image generation described above is already possible, but the tooling isn’t there to make it widely accessible yet. The tooling situation will change, though, as any big tech company that hosts photo albums where users are tagged in the photos — Apple, FB, Google — will be able to take all of those tagged photos and offer us customized models where we can take all our friends, relatives, and pets, and put them into new and interesting contexts. Or we could subtly tweak a photo by changing the angle or who’s in it. The possibilities are endless.
The alternate reality wave definitely won’t be limited to people and pets. Think about product photography for small-batch and handcrafted items, like Etsy shops and the like. Anyone who makes and sells things online will be able to generate an infinitely varied library of product photographs in a range of styles and contexts from a handful of awkward iPhone photos. (The reverse is also true — an Etsy seller will be able to generate photography for a product that doesn’t yet exist, then post it for sale to gauge demand, and then fill orders if enough customers buy it. More on this in the section on retail, below.)
Neoauthorship and text generation
Text will work similarly to the personalized image generation described in this section. Google’s NLP-based writing suggestions are a small taste of what’s to come, but there’s more in the works. A lot more. Writers like myself, who have a large volume of work online (I’ve archived most of mine with a paid service called Authory.com), will be able to dump that corpus into a model and get a customized text generation engine that sounds like us.
For a writer like myself, this AI-powered neoauthorship might look something like the following:
I dump all my research notes on a topic into an app that turns them into an article that reads and sounds like it was written by me because the generative model was trained on my work. And in fact, we could meaningfully argue that generated text will have actually been written by me. What I mean is, if the model was trained on my “voice,” and I personally use it to produce more work under my own name (which I then edit and prepare for publication), then it’s my work, right?
Authorship has always been kind of weird — unnamed editors often contribute quite materially to a published article, yet the byline doesn’t reflect that. So neoauthorship is going to be sort of like that, but even weirder.
Prompts: inference and personalization
There’s more to personalization in AI than just models that are trained on my personal data (images, text, audio, music, etc.). To understand what’s coming, consider how Google currently uses machine learning in the context of personalized search.
Google pulls in a ton of data about me and my interests in order to infer intent, so that the search results it offers me are tailored to my geography, past search history, purchase history, email archives, and so on. (At least, that’s how it’s supposed to work. Search quality has degraded so much lately that I’m not sure what the heck they’re actually doing right now.)
Now imagine this kind of personalization and intent inference, but for searches of latent space.
An app like PlaygroundAI, which not only has a history of my prompts but also a map of my social graph, can and probably will be tweaked to infer intent from a natural language text input (not an engineered prompt, but a more casual description of what I’m looking for) and offer me a set of results that reflect what I meant (if not what I literally said).
Imagine an online writing app that knows I like certain styles and subjects, and maybe it also has some data on my purchase history and location — in fact, maybe it knows even more about how I think than Google does because I’ve been curating training data for it by diligently labeling my photos and so on. Let’s say I’ve used this writing app to write most this essay of mine on The Can Opener Problem.
I click the “Add Image” button, and instead of an upload dialog box, I get a prompt window.
I start typing, “Can opener…,” and it uses the article context to begin autocompleting “…by Leonardo da Vinci, hand drawing, mechanical, detailed…”
I hit tab to accept the autocompletion, and it generates four different hand-drawn diagrams of a can opener in the style of one of da Vinci’s mechanical drawings.
I click the one I like best and it inserts it into the page…
…except it doesn’t actually insert the image into the page. Rather, it just inserts the prompt and seed combination into the page using an image tag like
<img prompt="A drawing of a can opener by Leonardo da Vinci, mechanical, detailed, trending on artstation" seed="45523043" />, and then the browser uses an optimized image generation model to generate the image on the client side on page load. (See the section below on optimization for more on how this would work.)
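To make the idea concrete, here’s a minimal sketch of how a client-side renderer might handle such a tag. Everything here is hypothetical: the tag format, the attribute names, and the `generate_image` call are all invented for illustration.

```python
import re

def parse_prompt_img(tag: str) -> dict:
    """Extract the prompt and seed from a hypothetical <img prompt=... seed=...> tag."""
    prompt = re.search(r'prompt="([^"]*)"', tag)
    seed = re.search(r'seed="(\d+)"', tag)
    if not (prompt and seed):
        raise ValueError("tag is missing a prompt or seed attribute")
    return {"prompt": prompt.group(1), "seed": int(seed.group(1))}

tag = '<img prompt="A drawing of a can opener by Leonardo da Vinci" seed="45523043" />'
spec = parse_prompt_img(tag)
# The renderer would then hand spec off to a local model, something like:
#   image = generate_image(spec["prompt"], seed=spec["seed"])  # hypothetical call
```

The key property is determinism: because diffusion models are deterministic for a fixed prompt, seed, and model version, every client that runs this tag through the same model renders pixel-identical output.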
There are a number of core improvements to machine learning that will unlock new levels of power and functionality for this class of software. The short list below is by no means comprehensive, and I invite knowledgeable readers to submit more examples in the comments.
Current machine learning models are run in two separate phases, which take place in this order:
Training: This is where the model is exposed to training data and where it does all its actual learning.
Inference: This is the phase where we, the end-users, are actually using the fully trained model to perform tasks like classification and content generation.
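The split between the two phases can be shown with a deliberately trivial sketch, using a toy nearest-centroid “model” rather than a real neural network:

```python
# Toy illustration of the two separate phases, with a nearest-centroid "model".
training_data = {"cat": [1.0, 1.2, 0.9], "dog": [4.0, 4.3, 3.8]}

# Phase 1 (training): all of the learning happens here, once, up front.
model = {label: sum(xs) / len(xs) for label, xs in training_data.items()}

# Phase 2 (inference): the frozen model is used over and over; nothing updates.
def classify(x: float) -> str:
    return min(model, key=lambda label: abs(model[label] - x))
```

Once training ends, the model is frozen; no matter how many inputs `classify` sees, it never revises its centroids. That rigidity is exactly what the next paragraphs are about.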
But of course, learning in the real world isn’t a one-off affair like this. Humans and animals are constantly alternating between “training” and “inference” phases, or maybe sometimes we’re essentially in both phases simultaneously.
At some point, machine learning models will move to continual learning, where they’re constantly going back and updating their impressions of the world as we use them in different contexts.
This is a lot harder than it sounds, as the current generation of models is so malleable that it’s easy to completely overwrite old information with new information — this is called catastrophic forgetting, and it’s a long-standing, fiendishly difficult problem in neural networks.
But at some point, we’re going to need models that can continuously learn new concepts and update old concepts with new information while still remembering everything they’ve ever learned.
Just to unpack this a bit more: not only can I update my mental model of the world, but I can also remember what my mental model was prior to the update. In other words, the history of the evolution of many of my internal concepts is accessible to me and often useful. Machine learning models don’t yet have anything like this kind of meta-cognition, or the ability to access and reason about prior but currently deprecated versions of concepts. We will need a combination of continual learning and meta-cognitive abilities if we’re ever going to get to AGI.
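Here’s a one-parameter caricature of catastrophic forgetting (not any real architecture, just the failure mode in miniature): plain gradient descent on task B simply overwrites the weight that task A needed, because nothing preserves old solutions.

```python
# One-parameter "model" w, trained by gradient descent on loss (w - target)^2,
# first for task A and then for task B, sharing the same weight.
def train(w: float, target: float, lr: float = 0.5, steps: int = 50) -> float:
    for _ in range(steps):
        w -= lr * 2 * (w - target)  # gradient step on (w - target)^2
    return w

w = train(0.0, target=1.0)      # task A: learn w ~ 1
loss_a_before = (w - 1.0) ** 2  # essentially zero
w = train(w, target=-1.0)       # task B: same weight, new objective
loss_a_after = (w - 1.0) ** 2   # task A performance has been wiped out
```

Real networks have billions of parameters rather than one, but the dynamic is the same: without some mechanism that protects or replays old knowledge, later training freely reuses the capacity that earlier training depended on.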
Why Continual Learning is the key towards Machine Intelligence
A multimodal model is a machine learning model that relates different types of input — text, images, video, audio, 3D models, gestures, physiological signals, LIDAR data, etc. — to one another, so that it can understand concepts across different domains.
If you’re using text-to-image generators, then you’re already using multimodal models. But there’s a lot more going on in the research space with these kinds of models.
Multimodal is good for more than things like text-to-image, voice-to-text, or voice-to-video. By combining different types of sensory input, models will be able to do things like detect emotions (see the diagram above). Combine this with the model personalization discussion above, and the implications for hyper-personalized computing experiences are amazing and also kinda scary.
Optimization and tuning
Since Stable Diffusion was released I’ve been watching local execution times drop as new tweaks and optimizations come out from the community. This kind of focused optimization of specific models will continue driving execution times down and bringing locally run models within reach of ever more compute-constrained platforms (tablets, phones, lower-end laptops, etc.).
There will also be fundamental advances in model design and algorithms that will give even more dramatic speedups. For instance, the new optimization below enables diffusion models to generate high-quality images with just 1-4 diffusion steps (as opposed to the 25-50 currently recommended for Stable Diffusion):
This would mean no more waiting around for a model to generate an image from a prompt — instead, it would happen instantaneously. This kind of speed has immediate implications far beyond just less time staring at a loading spinner:
Imagine a real-time version of Stable Diffusion that can generate a stream of images from a stream of prompt + seed combinations. This would be amazing for text-to-video applications, augmented reality, and virtual reality.
The diffusion steps are expensive in terms of computation and dollar cost on cloud platforms, so a 1-4-step diffusion model opens up options like:
A free (or nearly free) image generation service
A paid image generation service that gives you a page of results for a prompt, instead of the current standard of one to four results.
A browser that renders images on-the-fly from a prompt and seed combination given in a simple image tag (see the section above on “Prompt inference” for a hypothetical example of such a tag).
A multimodal voice-to-image model that changes its renders interactively in real time in response to voice commands.
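The economics behind that list are easy to sanity-check with back-of-envelope arithmetic, assuming per-step compute dominates the cost (illustrative numbers only):

```python
# If each diffusion step costs roughly the same, cutting 50 steps down to 2
# cuts per-image inference cost ~25x, so a full page of results costs about
# what a single image does today.
steps_today, steps_fast = 50, 2
speedup = steps_today / steps_fast      # 25.0x cheaper (and faster) per image
results_per_same_budget = int(speedup)  # ~25 images for the old price of one
```

That’s why “a page of results per prompt” and “free tier” stop being exotic business decisions and start being rounding errors.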
The above are just some random examples I thought of in a few minutes. A global marketplace of millions of active ML developers will come up with tens of thousands of such ideas every day, and then hundreds of millions of software developers downstream of them will iterate on every layer of the stack — from the open models up to the user interfaces. Wild times are ahead.
The model- and task-level developments described above are already giving rise to new classes of applications and to dramatic enhancements of existing applications. The brief survey that follows is by no means comprehensive but should give you a good feel for what to look out for in the next 12-24 months.
Here’s the bottom line for everything in this section: what I describe below is a world in which every creative job is essentially a product management job, where you manage and steer a suite of ML-powered tools to produce stunning finished products that would have required teams of humans just a few years earlier.
Machine learning will bring about the PMification of work in whatever area it touches.
The first category of apps to emerge in the AI content generation space have been what I’ll refer to as co-creation apps: these are apps where the model augments some creative activity that the user is already engaged in.
Many of us are already using ML-powered co-creation in Gmail, where the company’s models auto-suggest the rest of whatever sentence you start typing:
An even more powerful example of co-creation is AI-powered code generation. GitHub Copilot and Replit’s Generate Code are two popular code generators that use a specially tuned version of OpenAI’s GPT-3 text generation model to auto-suggest entire blocks of code.
Code generators eliminate a lot of the copypasta coding and boilerplate generation that takes up so much of a developer’s time, allowing coders to focus on solving higher-value problems. These tools will get a lot more powerful and move well beyond the elimination of boilerplate, but that’s a topic for a future post.
As generative models get more accessible, this kind of co-creation will be low-hanging fruit for many categories of apps, and we’ll start seeing even more of it shortly in some of the following forms:
APIs that enable any text editor at all (even vi and emacs) to offer Google-style suggestions, longer blocks of generated text, text transformation (e.g., notes and bullet points to polished paragraphs), and code completion.
Plug-ins for popular image editors and drawing tools, so that you can do in-painting, out-painting, image-to-image generation, and maybe even text-to-image generation from within apps like Photoshop, Procreate, Pixelmator, etc.
Plug-ins for digital audio workstations that generate and transform audio from within the tool.
One recently floated concept for how all these might come together is the $1000 blockbuster. We live in a world where Satoshi and others have launched global currencies from their laptops, so it’s not at all wild to speculate that a creative person could use the tools described in this article to produce an entire Hollywood-style movie from their laptop, without a single trained actor, musician, or writer in the mix anywhere.
The bigger picture here is an impending total collapse in the cost of many kinds of artistic and intellectual labor. But much more on this and its implications in a later piece.
The Twitter thread above shows a fascinating example of something we’re going to see a lot more of in the coming weeks and months: text-to-task. What I mean by this term is, “I type a prompt that describes some desired work product, and the model responds with a prototype of that work product.”
Software is the first place this is happening, but advances in instruction following will bring text-to-task into many other realms. Imagine giving Alexa not just voice commands, but actual high-level instructions. Examples of such prompts might be:
Alexa, I’m entertaining here at the house on Saturday. I’ll be hosting about fifteen people, and I need vegetarian options for the main course and the sides. The preparation time should be low, and ideally, I should be using up some of the rice I bought last week and still have plenty of.
Alexa, I need a new playlist for my car trip. It should be about three hours long and have a lot of deep cuts from the 90s bands I like.
Alexa, I just emailed you a web page with plans for a backyard chicken coop. I need you to add all the materials to my Amazon cart.
The general trend will be away from voice or text commands and prompts for carrying out discrete task steps and towards natural language descriptions of finished products where the model can then carry out the steps needed to make the product.
At first, this capability will be limited to work where the sequence of steps is known and can be represented in the training data somehow. This is because hierarchically arranging novel tasks in order of importance, and related reasoning about multi-step processes, is still a hard problem in AI. So we’ll go as far as we can with text-to-task without needing the model to reason on its own about how different steps fit together. But that will be pretty far, especially for repetitive, solved problems that we can pretty easily script in software.
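This near-term, scripted version of text-to-task can be sketched as routing an inferred intent to a canned plan. The intents, the keyword matching (standing in for a real intent-inference model), and the step lists below are all invented for illustration:

```python
# Hypothetical sketch: a crude intent router maps a natural-language request
# to a pre-scripted sequence of steps. No multi-step reasoning is required,
# because the plans were authored (or mined from training data) in advance.
PLANS = {
    "playlist": ["query listening history", "filter by era", "assemble ~3h playlist"],
    "shopping": ["parse materials list", "match items to catalog", "add items to cart"],
}

def text_to_task(request: str) -> list[str]:
    if "playlist" in request.lower():
        return PLANS["playlist"]
    if "cart" in request.lower():
        return PLANS["shopping"]
    raise ValueError("no scripted plan for this request")
```

The hard, unsolved part is what happens on that last line: generating a sensible plan for a request nobody scripted is exactly the multi-step reasoning problem described above.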
Eventually, models will be able to problem-solve in more unstructured and dynamic ways. But this starts to get into the realm of AGI, so we’ll save that for another day.
Ubiquitous machine learning will make even the highest quality, most elaborate digital artifacts — books, movies, TV shows, and music — easily and cheaply modifiable. We will soon be in the age of the infinite remaster.
Ever want to hear Johnny Cash cover John Prine’s classic, “Angel From Montgomery,” but can’t find any recording of it? No problem. Which era of Johnny Cash did you have in mind? A generative audio model will be able to synthesize a cover in the same way that Stable Diffusion can currently render Gandalf in the style of Studio Ghibli.
The culture wars will get even crazier when both studio and high-quality fan remasters of touchstone cultural properties start to emerge. Have you heard any of the media complaints about the lack of diversity in the Friends cast? I expect whoever owns it might eventually release a “patched” version on all the streaming services with the races of the cast altered.
This kind of remastering will cut the other way, as well. Get ready for a remaster of the new Amazon “Rings of Power” series that makes all the elves and hobbits white. Or a remaster of the upcoming “Little Mermaid” with a white AI-generated character subbed in for black actress Halle Bailey. (Someone on Twitter claimed to have already started on this and even had screencaps, but the account was deleted.)
I can easily imagine a world where Disney’s entire back catalog gets retroactively scrubbed of its heteronormativity, while trads with laptops not only preserve and redistribute the old stuff but scrub any and all hints of homosexuality from its current and future productions.
Some kids will read woke Dr. Seuss books that Theodore Geisel never wrote. Other kids will watch keklord8297’s trad remasters of the Netflix She-Ra cartoon. And on and on it’ll go.
It won’t all be culture wars, though. There’s a massive backlog of cultural materials that can be neutrally restored, colorized, and remastered to better suit the aesthetics and technology of the present times.
For instance, video games will be one of the earliest places to feel an immediate impact from AI models. Users are already using Stable Diffusion to upgrade the pixel art in older games to a modern, 3D-rendered-looking style.
There are so many classic games that could be redone this way very cheaply. In fact, if you think about the optimizations mentioned above, it’s quite realistic to imagine a world where such visual upgrades of in-game assets are done on the fly by a model embedded in a modified version of the game code — or even by a graphics driver that just works on the data from the frame buffer.
Digital prosthetics and real-time, full-spectrum pseudonymity
Instagram filters are a tiny little peek at a much bigger trend that will take hold in the age of ubiquitous AI: real-time, full-spectrum pseudonymity.
It’s real-time because you’ll be using Zoom, FaceTime, or some other video app to communicate, but your voice and appearance will be radically altered in a way that’s not visibly detectable by those on the other end of the link.
It’s full-spectrum because whatever digital prosthetic you produce can be used in every online venue where you show yourself to an audience — from Instagram, to Zoom, to Facebook, to TikTok.
To bring this back to the culture wars, again, think about DEI-driven hiring decisions in the era of 100% remote work and pseudonymity. Is your new colleague really a Chinese woman with slightly accented English? If you’ve never seen her outside of remote work, how can you know?
This kind of tech will obviously be used for psyops, and probably already is because it’s certainly accessible right now to sophisticated enough actors. (Innovation will bring the cost of the tools way down and make the workflows easy enough for even the least technical to master.) Are you sure that the Iranian woman TikToker who’s chronicling the present protests for your feed isn’t a Saudi man wearing digital makeup?
Of course, what technology takes away, tech can give back:
So at least we can hope that such manipulation will be flagged when it’s happening.
(In case you’re wondering, I plan on using this tech to remain my 27-year-old self in perpetuity. Not only was I in really good shape then, but this will also make me immune to Silicon Valley ageism.)
On-demand, personalized retail
The legacy retail workflow has been something like: design => build => sell => ship. The ML-driven workflow, in contrast, will look like: design => sell => build => ship.
The “sell” and “build” steps will swap places, as retailers sell a product first — complete with glossy product photos of the item being used and lavishly displayed — and then build it and ship it once the orders start coming in.
Of course, this “sell first, build later” thing has been going on for a long time in one form or another. But it will become the norm in more retail verticals as it gets easier for both retailers and their customers to do the design step with ML.
Imagine an interactive retail flow that goes as follows:
Alice clicks on a Facebook ad for Super Rad Hoodies.
She finds a hoodie with a style she likes, but she doesn’t like any of the colors or prints.
She customizes the hoodie by typing in a prompt, e.g., “Cartoon sunflower pattern.”
She selects the rendered version of the hoodie that most closely matches what she wants, and then with one click gets a gallery of the site’s model wearing the hoodie she just designed.
She uploads about ten pictures of herself from different angles and views a generated gallery of herself in different poses, lighting conditions, and environments wearing the hoodie.
She adds the hoodie to the cart and checks out.
Alice’s hoodie design is sent to a garment factory to be manufactured. (The design file includes the prompt and seed Alice used so that the factory can use a model to generate a higher-resolution version of the cartoon sunflower texture.)
She gets the hoodie in the mail.
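The hand-off to the factory in the flow above could be as simple as a small JSON design file carrying the prompt and seed; all of the field names here are invented for illustration:

```python
import json

# Hypothetical "design file" sent to the garment factory. Because the prompt
# and seed are included, the factory can deterministically re-generate the
# texture at print resolution rather than upscaling Alice's preview image.
design_file = {
    "garment": "hoodie",
    "size": "M",
    "texture_prompt": "Cartoon sunflower pattern",
    "seed": 45523043,
    "target_resolution": [4096, 4096],
}
payload = json.dumps(design_file)  # what actually goes over the wire
```

The prompt+seed pair acts as a tiny, lossless stand-in for a huge texture file, which is the same compression trick as the prompt-based image tag described earlier.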
All of the hard problems in the flow above have been solved, and all that remains is for someone to build it. (If I don’t finish this article quickly enough, something like the above will be live before this newsletter hits your inbox.)
We’ll also do this for entire homes: you work with a set of ML-powered tools to design your home, and then the parts are 3D printed and the house is assembled for you.
This article has been kind of a grab bag of trends, predictions, and suggestions, and I could’ve easily kept going, but at some point one has to start breaking things up into separate pieces.
There are so many things in this piece that I want to go deeper into, and even more things I haven’t touched on here but will cover in future installments. So here’s a quick look at what’s next for my “What’s Next” articles:
The growing text-to-3D-model scene is blowing my mind and is going to power a revolution in VR and gaming realism.
When it comes to code generation, I have many questions about how the models are currently being trained, and a ton of ideas for where in the social coding process we might use them (code review, TDD, transforming unidiomatic code into idiomatic code, etc.).
I’ve written previously about how machine learning is a form of compression, but I’ve yet to explore the technical and economic implications of a sudden, massive change in the “compute vs. storage” tradeoff dynamic across the entire computing ecosystem.
Following on the compression point, generative models will have a massive impact on the capabilities of blockchains where block space is at a premium.
I’ll be covering all of the above and a lot more in future posts, and I’ll also be doing interviews with content creators and those working in the field of ML. So if you haven’t subscribed, do it now so you don’t miss out.
Great overview, Jon. Exciting times for sure. I'd love it if you could comment on the following, either here or in follow-up pieces.
1. Is there space for privacy in these next steps? If open source keeps up this trend and there are further advances to bring computing requirements within consumer reach, could we keep custody of our own models and just plug them into a larger ecosystem with privacy guarantees? Otherwise, we'd be surrendering our full capacities and personalities to whatever large corp happens to own our models.
2. The embedded tag for generating an image from prompt+seed might fall short. It looks like some inpainting or variation over an initial model output is often necessary for a polished finish. What about model versioning, too? Would the tag then include prompt+seed+model_version+inpainting_post_processing?
3. There's been much said about virtual assistants. I wonder now about virtual doubles. If we train a good-enough model to act like us, will we be able to send an avatar to a meeting on our behalf? Will there be standards to ensure that there is transparency when we interact online, to disclose whether we are interacting with a real person or some abstracted representation of them?
4. Will generative models suffer from a long-tail problem similar to self-driving cars, where they kind of work but never quite fully? Faces in DALL-E are still ghostly sometimes, and hands are notoriously hard to get right. Is a fix for these shortcomings around the corner, or do they need a new, different approach that doesn't yet exist and might never?