Why I'm A Better Editor Than GPT-4 (& Probably GPT-5)
My one weird trick for staying ahead of the bots, according to science.
🏆 There’s a particular thing I’m good at as a writer and editor, and it goes by different names. Sometimes I and my peers call it “zeitgeisting,” or maybe “vibe reading.” But whatever name it goes by, this ability is easy to describe: A good editor like myself can perform better than chance at spotting which stories and angles have viral potential and which ones do not.
I have different ways of talking about this kind of thing with colleagues. I’ll look at a story or a pitch and say things like:
“Yeah, this has traffic juice if we get it out immediately, especially if we tweak the opener like this or put this in the title.”
“No, this is stale. The traffic window closed on that yesterday.”
“We should sit on this until there’s traffic in it. I think X or Y news is about to drop that’ll make this relevant.”
These are the kinds of things working editors say when they’ve been monitoring conversations on Twitter and other venues, and they have a sense of which stories or parts of stories have some sort of breakout potential and are therefore worth investing scarce editorial and promotional resources in.
🛑 GPT-4 cannot do this, and it’s likely that GPT-5 won’t be able to do it, either.
There’s a very straightforward technical reason why large language models (LLMs) are bad at zeitgeisting, and will continue to be bad at it pending some major technological advances.
jonstokes.com is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Here’s the TL;DR:
GPT-4’s cost per token is still insanely high. It costs a minimum of $0.24 to fill the model’s 8K-token window, and at least $1.97 to fill the 32K-token window. (I’ll explain all these terms, below.)
As a skilled human editor who can read and digest large volumes of text very rapidly, my personal cost per thousand tokens (even if I billed at a rate of, say, $500/hour) is insanely low.
Nailing the zeitgeist is all about using a high volume of fresh, up-to-the-nanosecond context tokens to steer the production of new content.
Ergo, I can steer content into the center of the zeitgeist at a far lower cost per finished piece than any LLM can in the near or probably even medium term.
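The cost figures in the TL;DR are easy to sanity-check. Here's a minimal sketch, assuming OpenAI's published GPT-4 prompt-token prices at the time of writing ($0.03 per 1K tokens for the 8K model, $0.06 per 1K for the 32K model; prices and window sizes may have changed since):

```python
# Rough cost to fill each GPT-4 variant's entire window with prompt
# tokens, at the assumed per-1K-token prompt prices.
PRICE_PER_1K = {"gpt-4-8k": 0.03, "gpt-4-32k": 0.06}
WINDOW_SIZE = {"gpt-4-8k": 8_192, "gpt-4-32k": 32_768}

def cost_to_fill_window(model: str) -> float:
    """Dollar cost of a maximally full prompt for the given model."""
    return WINDOW_SIZE[model] / 1_000 * PRICE_PER_1K[model]

for model in PRICE_PER_1K:
    print(f"{model}: ${cost_to_fill_window(model):.2f}")
```

The results land within a penny of the figures quoted above, and completion tokens cost twice as much again, so a full window plus a long completion is pricier still.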
I know what you’re going to say: Algorithms have been reading the zeitgeist and in fact have been creating the zeitgeist since the launch of Google News. Sentiment algos do stock trading successfully. Successful zeitgeist-reading algorithms are used in content planning to suggest topics to write on that have SEO potential or some other viral traffic potential. What I am saying can’t be done is already a solved problem.
🤨 But you’re wrong about this. It’s ok, though. It’s not your fault. If you knew exactly what I was talking about — exactly what I mean when I say the right editor matched with the right beat and the right writer can nail the viral sweet spot as it passes by in the feed — then you’d be able to do what I do.
(Of course, I realize that what I do — i.e., writing stuff that, on occasion, some meaningful number of people want to click, read, and share — is one of those things everyone thinks they could do if they just tried. But unpacking why it’s roughly the same type of challenge as seed-stage startup investing (but for content, not companies)… well, that’s one of those things where you can pay my consulting rate if you’re really curious to know more.)
Case Study: Podcast Transcription
To give further color on where GPT-4 falls short of my own editorial skills, let’s take a detailed look at a specific example of where I used GPT-4 to create an editorial product and saw it faceplant.
I recently used the following workflow to generate a cleaned-up transcript of a podcast episode in which I interviewed Chroma's Anton Troynikov:
Record the podcast interview on Riverside.
Use Riverside’s ML-powered transcription feature to produce a raw transcript of the episode. (Lots of “mmmm” and “yeah” and “like” in there.)
Dump the raw transcript into ChatGPT, using the GPT-4 model and the prompt: “Clean up this interview transcript without summarizing too much. Just make it more readable.” I did this step in roughly 2,000-word chunks.
(This is a really great podcast episode, by the way. All my readers will enjoy it, so head over to the episode post to read an overview and find links to both the audio and the video.)
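The chunking step in the workflow above can be sketched in a few lines. The API call itself is elided (`call_model` is a hypothetical helper, not a real client function), since the chunker is the part worth showing:

```python
# Split a raw transcript into ~2,000-word chunks, each of which gets
# sent to the model separately along with the cleanup prompt.
def chunk_words(text: str, max_words: int = 2_000) -> list[str]:
    """Break text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

CLEANUP_PROMPT = (
    "Clean up this interview transcript without summarizing too much. "
    "Just make it more readable."
)

# for chunk in chunk_words(raw_transcript):
#     cleaned = call_model(CLEANUP_PROMPT, chunk)  # hypothetical helper
```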
The resulting “transcript” is very readable, but I have to put “transcript” in scare quotes because it’s not really a true transcript — much of the content is slightly rephrased or summarized. But overall it presents an accurate, condensed picture of the conversation.
Sure, there was a little bit of muddiness introduced, as the AI took some of the points I made and put them in the interviewee’s mouth, and vice versa. But overall, GPT-4 did a way better job than I had been doing as I worked my way through the interview manually… with one exception.
🕵️ With my zeitgeist-reading editor glasses on, I could see that GPT-4 had nuked from its version the very parts of the conversation that had the most traffic juice. As in, some of the bits I’d have put in pull quotes and used to advertise the piece, GPT-4 had discarded as irrelevant.
Then it hit me: GPT-4 doesn’t have a big enough token window to read the room.
Let’s look at some examples.
On the left is the raw transcript, and on the right is GPT-4’s cleanup of it.
I'm very, very interested in this kind of intersection between like what, what does technology mean to the culture at large and how do those two things interact? Like these are really important questions for me. And I'm like genuinely terrified that we'll end up in this kind of kind of like valley of despair where people stop believing that technology And so they just react to it really strongly. And we've seen that. We've seen it start to happen. We saw it with SVP. Then the problem is, is that all the nerds got rich. And now we're Wall Street. And people fucking hate Wall Street. So that's what's happened. But the thing is, it's still genuinely true that this is the driver of the stuff that prevents us from basically the way I put it, is locking ourselves into an airtight phone booth and stabbing each other to death for their major So this is stuff that I think about
I'm very interested in the intersection of technology and culture, and I'm concerned about people's reactions to technology. That's a brief overview of my background.
This great point that Anton made, where he’s talking about how everyone hates Silicon Valley now (it’s the new Wall Street) but that’s tragic because it’s still a driver of human progress — GPT-4 just noped it. I ran that section through twice, too, tweaking the prompt and insisting that it leave as much of the original text as possible, and it just wasn’t going to leave it in there.
Here’s a less extreme example that still jumped out at me:
We have what people call a capabilities overhang, by which I mean, we could have stopped training these models maybe six months ago and we'd still be discovering new things we could do with them 12 months from now. time because it also means like if you're just kind of like tinkering with this thing and you're like making a little application you're like you're wondering oh can the model do this. You are at the research front right there with people who are like writing scientific papers about oh the model can do this. Like the reason the reason by step by step thing it's that that was a that was a paper it's a peer reviewed published scientific paper where it's just oh yeah we asked the model step by step and then we evaluated and saw that it worked better. So that's exciting.
We currently have a capabilities overhang, meaning we could have stopped training these models months ago and still be discovering new things they can do. If you're tinkering with these models and wondering what they can do, you're at the forefront of research along with those writing scientific papers. The "step by step" paper is an example of this.
In this part, Anton makes the point that the knowledge gap between hackers working on commercial, user-facing ML software and researchers working in ML is basically nonexistent. Both types of people are essentially at the forefront of poking and prodding at these models in order to understand what they can do and what they can’t. So this point, and its relationship to the capabilities overhang, is really important. But GPT-4 doesn’t know it’s important, so it gives it short shrift.
Here’s an analogy between access to compute and access to payment rails that GPT-4 threw out entirely:
So of course, you know, Google has Google, OpenAI has Azure, you know, stability, I like they sort of like try to work out all kinds of different compute partnerships patched together from something or like they own their own compute to a limited extent, but that's kind of the bottleneck, right? and various online sort of transactions, if you get cut off from the payments network, it doesn't really matter what you do. And so if you get cut off from compute, it also doesn't really matter what you do. You can't do research, you can't even run a model, you can't do inference for people. And so if people get cut off from compute, which by the way is again, like the compute owners are building these large models themselves, that's kind of a worry. And that's one of the reasons to like really watch out for regulatory pressure in this direction. of pushing the direction that you're hinting at here, there's a really big question, open question for me to like what extent like safety and quote unquote unalignment concerns are being used to push for that kind of regulatory capture.
Google has its own resources, and OpenAI relies on Azure. Stability is achieved through various compute partnerships or limited ownership of their own compute resources, but that remains a bottleneck. If you get cut off from compute, you can't do research, run a model, or perform inference for users. Large model builders who own compute resources may face concerns about being cut off from compute, which is why regulatory pressure in this direction is worth watching.
This payment rails analogy is on point right now, because of Chokepoint 2.0 and all the banking stuff that’s happening right now. If I had been editing this manually, I’d have really tried to highlight that analogy between compute access and payments access as a way of selling the episode by connecting it to the current thing.
Here’s another section that GPT-4 really drained all the sweet sweet traffic juice out of, leaving it a dry but informative husk:
Yeah, look, the only way to access GPT-4 right now outside of OpenAI is either because you're a partner, which they released with, and you know, actually some people have had this integrated into their products for several months now without telling anybody, which, you know, you kind of find out by being on the ground here in SF. see how it gets tailbroken because it will be, and I think OpenAI is very aware that it will be. The steerability piece is really interesting. I just think people haven't had enough time to experiment with this stuff. One thing that I noticed from the Twitter space is we've gone from, I don't know, like GPT-3 release what, a year ago, and people being freaking out. Is this conscious? Is this intelligent? Is it going to replace jobs in the Twitter space being like, oh, yeah, here's how we're going to use this in our enterprise case in here, we see it being useful and it's getting normalized so fast.
People haven't had much time to experiment with GPT-4 yet, but it's already becoming normalized. Some partners have had access to GPT-4 for months, integrating it into their products without publicizing it. The steerability aspect is interesting, but more experimentation is needed. The shift in conversation on Twitter from worrying about GPT-3's impact to discussing practical applications for GPT-4 in enterprise cases shows how quickly it's becoming normalized.
The original statement has a lot of great stuff in it:
“You still need to be in SF to do AI” is a big discourse topic, and Anton gives some color on that with his aside about how you hear GPT-4 chatter from people with early model access if you’re “on the ground” in SF.
The jailbreak topic is hot, but GPT-4 may not have figured out that “tailbroken” is a mistranscription for “jailbroken.” I don’t know, though, because it is really good at correcting stuff like this, so maybe it just ignored this topic.
This is such a great quote that GPT-4 just totally squashes flat: “One thing that I noticed from the Twitter space is we've gone from, I don't know, like GPT-3 release what, a year ago, and people freaking out. Is this conscious? Is this intelligent? Is it going to replace jobs in the Twitter space being like, oh, yeah, here's how we're going to use this in our enterprise case in here, we see it being useful and it's getting normalized so fast.”
🤦♂️ I could keep working my way through the transcript, but you get the idea. Our dialogue has all these little moments of resonance with The Discourse, but GPT-4 has no sense of The Discourse so it cannot spot them and highlight them. Instead, it just compresses them into lifelessness or drops them entirely.
The result is actually severe enough that I was conflicted about publishing the GPT-4 treatment unedited, which is what I did end up doing. On the one hand, I wanted readers to see what GPT-4 was capable of; on the other, it made the interview feel less fresh and zeitgeisty, and turned it more into an info dump.
So what would it take to fix this? There are two options, which aren’t mutually exclusive:
Massively increase the size of the token window while also decreasing the cost per token
Develop token budget optimization strategies
Some mix of both of these will probably take us to the next level of automated zeitgeist infusion.
Another look at the token window
🪦 Most of my readers are probably aware of the fact that an LLM doesn’t “know” about anything that happens after its training is finished. The model weights that emerge from the training phase are static, and unless you make some fine-tuning passes to alter them to reflect new facts, they represent the state of the world as it was mirrored in the training data.
Right now, the only way we can put new facts into an LLM in an ad hoc, on-the-fly manner is via the model’s token window.
I wrote in detail about the token window in my previous piece describing the CHAT stack, but I want to drill down on this topic because it’s very important.
First, a quick definition: tokens are the things the model takes in as a prompt and puts out as a prediction. For a text-to-text LLM like GPT-3.5, we can think of tokens as words and punctuation, so the model takes in words as input and produces words as output. (This is an oversimplification, since tokens are on average a bit smaller than words and many words get broken into multiple tokens, but it’s good enough for our purposes.)
In my earlier piece, I described the token window from the perspective of a user or software developer. In this way of looking at things, which is also the way OpenAI’s developer docs look at them, the token window contains two types of tokens:
Prompt tokens: These are the words you give ChatGPT, e.g., “Write a limerick about a man from Nantucket.”
Completion tokens: These are the words ChatGPT gives back to you in response to the prompt, e.g., “There once was a man from Nantucket, who…”
The token window, then, has to hold both the input tokens and the output tokens. So when you use the model to generate text, you have to leave room in the token window for the number of output tokens you want. This means that to use the token window optimally, you have to guesstimate the output size you’re looking for. This is hard because it involves not only making a guesstimate about your output needs but also balancing multiple types of input against that guesstimate.
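The core constraint here is simple arithmetic, and it's worth writing out:

```python
# Prompt and completion tokens share one window, so the maximum
# completion length is whatever room the prompt leaves behind.
def max_completion_tokens(window_size: int, prompt_tokens: int) -> int:
    """Tokens left for output after the prompt fills part of the window."""
    if prompt_tokens > window_size:
        raise ValueError("prompt alone overflows the token window")
    return window_size - prompt_tokens

# e.g. a 6,000-token prompt in an 8,192-token window leaves at most
# 2,192 tokens of room for the model's answer
```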
Model inputs & context compression
If you read my CHAT stack post, you know that prompt tokens — the inputs to an LLM — come in three flavors:
The user query or prompt.
Context tokens, which contain supplemental material (documents, web pages, chat histories, etc.) that we want the model to read, understand, and refer to as authoritative when responding to the user prompt.
System tokens, which instruct the model (e.g., “You are a helpful AI assistant.”) or otherwise prep the model in some way so that it’s more likely to respond in a way we find useful.
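For a chat-style API, these three flavors might be assembled like so. This is a sketch: the role names follow OpenAI's chat message format, but how you label and frame the context block is a design choice, not anything the API mandates:

```python
# Assemble the three kinds of prompt tokens into one message list:
# system tokens, context tokens, and the user's actual query.
def build_messages(system: str, context: str, user_prompt: str) -> list[dict]:
    return [
        {"role": "system", "content": system},  # system tokens
        {"role": "user",
         "content": f"Treat this background material as authoritative:\n{context}"},  # context tokens
        {"role": "user", "content": user_prompt},  # the query itself
    ]

msgs = build_messages(
    "You are a helpful AI assistant.",
    "<tweets, articles, chat histories...>",
    "Clean up this transcript chunk.",
)
```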
The context tokens, then, are where we put the fresh, post-training facts, vibes, or Discourse droppings we want the model to work with in responding to our prompt.
To return to an example from my podcast interview, in order for GPT-4 to understand that my audience would be engaged by Anton’s compute hardware <=> payment rails analogy, I would need to find a way to load enough tweets and articles from myself and my Twitter circles into the context window that GPT-4 could pattern-match that bit of text as a reference to a thing we’re all currently chattering about.
🌉 The “SF is back for AI” example is an even more subtle and difficult example because you’d really need to be a close follower of The Discourse to have seen the right handful of tweets from the right big accounts to smell the traffic juice in that part of the interview.
Now, some of you are already mentally solving this problem with a giant document archive, embeddings, scoring and ranking schemes, and word clouds, all assembled into a giant zeitgeist-o-meter that can sniff each complete sentence for traffic juice and, on finding a hit, load the relevant context tokens so the model knows to bring out the right notes and tones in the transcript.
💰 If so, that’s great. Keep doing that. What you’re imagining here is context compression, which is going to be a huge area of research and software development activity in the coming months and years. We’ll take the considerable engineering knowledge we as a society have built up around human attention hacking (i.e., adtech) and turn it toward the task of populating token windows in the most optimized manner possible. I imagine Google could crush this, but if they airball then some startup will nail it and become a context compression decacorn.
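To make the idea concrete, here's a toy version of that zeitgeist-o-meter. The embeddings are hand-made stand-ins (a real system would get them from an embedding model), and the greedy packing is the crudest possible strategy:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pack_context(query_vec, docs, token_budget: int) -> list[str]:
    """Greedily pack the docs most similar to the query into the budget.

    docs: list of (text, embedding, token_count) tuples.
    """
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    chosen, used = [], 0
    for text, _, n_tokens in ranked:
        if used + n_tokens <= token_budget:
            chosen.append(text)
            used += n_tokens
    return chosen
```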
But this compression is necessary because context windows are tiny and very expensive on a per-token basis.
To really understand just how expensive tokens still are and the scale of the challenge associated with context compression, we have to look in a little more detail at ChatGPT’s token window.
Filling the token window
Remember when I said that prompt tokens and completion tokens go into the same token window? You’re probably wondering what that’s all about. Like, why not just a window for input tokens and then some other type of structure for the output tokens?
Also, if you’ve looked at GPT-4’s pricing, you wonder why the completion tokens cost twice as much as the prompt tokens.
All of this has to do with the fact that GPT-4 operates iteratively, generating one token at a time based on the contents of the token window. Consider the simple starting prompt:

How much wood would a woodchuck chuck if a

The token window starts with just this prompt in it. The LLM reads the window and predicts the next token, say “woodchuck.” In the next step, the LLM looks at the token window again, sees the original prompt plus “woodchuck,” and based on that predicts another token. This process repeats, with each iteration adding a newly predicted token to the window, until either the window is full or the model has produced the number of tokens the user asked for.
When you look at the process laid out this way, you can start to see why the completion tokens are priced at twice the rate of the prompt tokens. The more tokens there are in the window on every iteration, the more compute it takes to predict the next token in the sequence, so as the window fills, the iterations get more expensive. OpenAI, then, has priced the tokens so that context is cheaper and predictions are more expensive, and the cost grows with the token count.
💡 Think about what the above, iterative token completion process might mean if you’re a user who’s trying to generate a juicy, zeitgeisty paragraph based on tons of imported vibes and Discourse context. You have a limited token budget to spend on all the different types of tokens in the token window:
System tokens that instruct the model
The prompt tokens you’re using to generate the paragraph
Context tokens that you’ve identified as relevant background for the prompt
Completion tokens that contain the text you’re trying to generate
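One crude way to make the budgeting problem concrete is to carve the window into fixed shares per token type. The fractions below are arbitrary illustrations, not recommendations; finding good ones is exactly the open question:

```python
# Split a token window across the four token types by fixed fractions.
def split_budget(window_size: int) -> dict[str, int]:
    shares = {"system": 0.05, "prompt": 0.10, "context": 0.55, "completion": 0.30}
    budget = {k: int(window_size * v) for k, v in shares.items()}
    assert sum(budget.values()) <= window_size  # rounding down keeps us safe
    return budget

# e.g. split_budget(8_192) reserves most of the window for context
# while holding back roughly 30% of it for the completion
```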
⚖️ How do you balance all of these? It’s a wide-open question — this is all new territory, so there’s no set of best practices yet. But as you begin to think about it, two hard problems immediately present themselves:
How do you know how much context is required to nail the zeitgeist with this particular paragraph? How do you even go about formulating rules or heuristics for this?
How do you estimate how much output text is enough to adequately convey the thought and hit all the culturally relevant notes you want to hit?
🤔 If you’re a writer of any experience and ability, you know these are two super deep questions. Furthermore, they’re not questions you’re used to thinking about in primarily quantitative, word-count terms, or in terms of process. I personally can’t say I have any insight into how I do either of the above. Before I started writing about and using generative AI, I never really thought about it.
Many ML types will jump to the conclusion that we should just train two more models, a context-length optimizer and an output length estimator. And maybe that’s the answer. But I suspect the best way to cut the Gordian knot is to just have a massive token window that’s big enough to contain a large chunk of the zeitgeist and let ‘er rip.
AI will eventually get there
It’s very early days, and right now we’re in the “128K RAM” era of LLMs, where even GPT-4’s larger 32K token window has barely enough size to handle many of the applications we can envision.
We need brand new compute architectures, dedicated hardware, and massive amounts of optimization in the inference phase to bring us into an era where token windows have the size to handle the kinds of nuances I’ve discussed in this post.
Until we get there, we’ll see a whole new software industry arise to tackle the problem of token window optimization — hundreds of billions of dollars worth of tooling needs to be built out in this area.
Or, it may turn out that there’s an architectural breakthrough around the corner — something beyond the LLM, or in addition to it. Something that lets models absorb and retain large volumes of fresh context the way a human brain does.
One way or another, though, I’m on borrowed time with this whole zeitgeisting thing. The models will eventually catch up to me, as they will to everyone else who manipulates symbols for fun and profit.
I'm just thinking out loud here, but...
If the token window is limited, and we want to have a larger input + output than the window allows...
Could we use a sort of rolling "convolution" over the input in order to get an output that is larger than the window might allow?
Window: 8k tokens
Input: 20k tokens
Parse the first 4k tokens of input, generate 2k tokens of output
Append 2k output to the next 4k tokens of input, generate 2k tokens of output
Repeat until the input has been fully parsed, or continue until the window is full with output tokens.
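That rolling scheme can be sketched like this, with a stand-in summarizer in place of the LLM call. (A real implementation would also have to ensure the carried-over output plus the new chunk still fits in the window alongside the prompt.)

```python
# Rolling pass over an input longer than the window: carry the tail of
# the output so far into each new chunk as context.
def rolling_pass(summarize, input_tokens: list, chunk_size: int, carry: int) -> list:
    output: list = []
    for i in range(0, len(input_tokens), chunk_size):
        carried = output[-carry:] if carry else []  # tail of prior output
        output.extend(summarize(carried + input_tokens[i : i + chunk_size]))
    return output
```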