
@rbitr
Created November 24, 2023 12:07
Text of post

## Tempering expectations for generative AI

Originally posted here: https://www.linkedin.com/pulse/tempering-expectations-generative-ai-andrew-marble/

There's a software development concept called low-code programming. The idea is that you can automate away much of the boilerplate for common coding tasks, minimizing the code written, often by drawing a picture or writing some other kind of specification instead. It turns out this works great for about 95% of the programming tasks you'd want to do, but it makes the other 5% more complex or impossible, because they don't fit nicely in the low-code template and you have to hack around it. Unfortunately, that 5% is present in pretty much every real software project, and it's what takes all the work. That's why, despite the time that's passed since the first low-code tool was developed (it was called COBOL), most software is still written in fully featured programming languages, and low code is only used in niche areas.

Enter the new "low code" revolution. There's been a wave of hype around AI that's higher and faster moving than any of the past tech cycles we've seen. There was a mini-wave last summer with the text-to-image generative models that has now become a tsunami (lame metaphor, I know) with ChatGPT and other large language models. The technological progress is undeniably cool, but it's got nothing on the hype that's come with it. I'd like to share a few thoughts, specifically to assert that LLMs are more like past deep learning technologies in their strengths and weaknesses than most will admit, and that we're still firmly in the fun phase of looking at what they can do rather than realizing what they can't. When the dust settles, like with low-code programming, there are going to be limitations that make LLMs much narrower in their applicability than the initial hype predicts.

Hype aside, LLMs are an evolution of past deep learning models. And one challenge with deep learning - the key reason it didn't live up to its promises - is the long tail problem: just like low code, it's great at mostly solving your problem. Contrary to popular belief (and with some caveats that don't matter here), what we call AI doesn't extrapolate, it interpolates within the data it has been trained on. When something "new", in the technical sense of outside the training data distribution, comes along, the algorithm doesn't work. In real use cases, there are always outliers. These are the "hard" bits where a person working on the problem actually has to use their brain. What this means in practice is that AI is good at delivering 95% solutions, 99% solutions, 99.9% solutions, but it cannot deliver 100% solutions. And the small percentage it fails on are the actually interesting cases, the ones where you need some kind of intelligence. The long tail problem is why previous iterations of deep learning didn't live up to expectations. We got some very cool demos, but potential users rarely came to terms with the fact that it can't work flawlessly, and that ultimately some kind of decision deferral mechanism, where people handle the edge cases, is required for high value applications.
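The "interpolates, doesn't extrapolate" point is easy to see in a toy example. The sketch below is my own illustration, not something from the original post: a flexible curve fit tracks sin(x) nicely inside its training range and falls apart far outside it. The polynomial fit, the ranges, and the noise level are arbitrary choices made for illustration.

```python
# My own toy illustration (not from the post): a model that fits well inside
# its training range but fails badly when asked to extrapolate outside it.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 5, 200)                      # training inputs cover [0, 5]
y_train = np.sin(x_train) + 0.05 * rng.normal(size=200)

coeffs = np.polyfit(x_train, y_train, deg=7)          # a flexible curve fit
model = np.poly1d(coeffs)

print(model(2.5), np.sin(2.5))     # inside the training range: close to the truth
print(model(20.0), np.sin(20.0))   # far outside it: wildly wrong
```

Real systems obviously aren't polynomial fits, but the qualitative behavior is the same: smooth performance on inputs that look like the training data, unreliable behavior off-distribution.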
"So it generated a picture of person with six fingers or weird eyes or something. Still pretty good." Compare that to a classifier where every failure is clearly visible. The problem is, the mistakes are still there, and we're still firmly in the cool demo phase, not in the last mile where we try and cope with the errors to make something that can work in practice to save time or improve quality. We're mostly there for classical deep learning - through it's under appreciated, we can generally understand whether a deep learning classifier is qualified to make a prediction and assess how much weight we should put in it's output (if you're interested in this, I'm happy to talk). But we haven't even started to factor model accountability (in the technical sense - there's lots of social studies versions) into how we approach LLMs or generative models, and really have no idea how they're going to perform on real applications once the honeymoon ends. I'd add that now that we understand the limitations of classical deep learning, we've generally had to temper expectations about how much value it can add. Products have shifted from "automating jobs" to "working with humans" and the savings are much more modest, or in some cases nonexistent. If 95% of a human's time is spend dealing with the hard 5% of cases, automating the easy 95% may not be helpful - and this happens all too often. Until there's strong evidence to the contrary, my money is on generative AI's plateau of productivity shaking out the way low-code programming has. That is, it exists and fills many domain specific niches, but it's not market dominating and hasn't changed the structure of industry of society.
