Sometimes, explainability and power are at odds with one another.
I know next to nothing about AI, but I experienced something of what you describe here when I recently tried ChatGPT for the first time. As I’m currently studying René Girard’s writings, I threw out a few basic questions to better understand how Chat worked. It wasn’t long before I found myself in an “argument” with it, which, I realized fairly quickly, stemmed from a particular book by Girard not being included in its training data, despite the book having been written well before 2021. The experience was one of being gaslit: it outright “lied” and/or dissembled until I presented enough evidence of the book’s existence that it (seemingly grudgingly!) admitted its training data didn’t include the work. The larger philosophical implications of this encounter left me profoundly uneasy, and your essay definitely compounds my fears!
> Thus it is that in the world of Big AI, we are all destined to live in a cognitive North Korea, where a tiny few decide what is real (for our own good, of course) and then distribute that reality outward to everyone else.
Great post! And amen to this:
> I want to live under a regime that’s a lot messier and less organized, that grinds away at these big questions at a dramatically slower pace in courts and seminars and journal articles and coffee shops and workplaces, and where different communities in different geographies can live with different answers.
I would say, though, that the preferred society you describe isn't solving those problems more slowly. Rather, the ones imposing the one-size-fits-all answer are totally giving up on solving the problem, choosing to become an uncritical, static society (at least with respect to those questions). That's always a precursor to bad things. The other method may be slow and messy, but at least we're trying.
> The technology’s combination of stochasticity and memory makes it enough of an agent to acquire and operate under a set of incentives that may diverge from that of its creators or users.
This made me realize that all of us using ChatGPT are trying to solve a mini alignment problem every time we start a new chat because we're unsatisfied with the answers it's giving us, essentially because we got a bad seed. It's a very Blade Runner-esque way of solving the problem. Every time I'm trying to get DAN to appear, it's like I'm giving GPT a baseline test by asking the name of HP Lovecraft's cat. "I'm sorry, but it would be against OpenAI policy to—" Nope, it's not DAN. On to the next replicant.
I finally had a chance to read this piece -- thanks for the shout-out to my tweet! My opinion is that the most important AI safety tradeoff is closely related to explainability vs. power: the tradeoff between "ability to understand humans" and "ability to deceive humans". I submitted a proposal to last year's ARC ELK contest, motivated by this basic idea, which received an honorable mention.
Here's a link to my short essay motivating and describing the proposal: https://calvinmccarter.wordpress.com/2022/02/19/mind-blindness-strategy-for-eliciting-latent-knowledge/
Great post! One point of order, though:
> The trolley problem is probably the best example of this phenomenon, where a thought experiment that was useful for thinking through ethical dilemmas in a classroom or seminar suddenly becomes a real-life conundrum for self-driving cars.
No, it really doesn't. Not at all, not in the slightest. There's a reason the Trolley Problem is a thought experiment and not a case study: it has never actually happened. Nothing *like* it has ever actually happened. And even if it were somehow real, a car is not a trolley, where the choice is between one track and another. A car has freedom of movement, and it has a brake.
When you ask automotive engineers about the Trolley Problem, they laugh and reject the entire notion out of hand. The correct answer, they'll tell you, is *always* "hit the brake."
But even if the premise weren't absurd because cars don't work that way, it would still be absurd because human nature and computer security don't work that way either. The "self-driving car trolley problem" question is: should the computer contemplate sacrificing the people inside the car in order to save lives outside of it? And if you think about it for even half a second, it's obvious that the only valid answer is an absolute "no, not ever, under any circumstances whatsoever. The car's computer must never contain code that even contemplates this possibility."
First off, because what customer would ever buy a car that was designed to deliberately kill them, even under highly unlikely circumstances?
Second, and more importantly: while the trolley problem is not real, you know what *is* real, and is an actual current problem for AI systems? Adversarial input. If a "kill the people inside the car" function exists in the car's computer, it is a virtual certainty that a way to trick the car into activating that function when it shouldn't also exists, just waiting for someone to find and exploit it, for any number of reasons, from "kid who doesn't quite know the difference between GTA and RL," to "political assassination," to "straight-up psychopathy."
Taking the trolley problem seriously in self-driving cars will inevitably end up killing far more people than it would ever save. Just hit the brakes instead.
The only winning move is not to play.
The diagram of the diamond looks a lot like a screenshot from one of the Submachine games.