**New AI Model Matches Human Brainpower—But What Does That Even Mean?**

At 12:01 PM on Tuesday, a paper published in *Science* sent shockwaves through the AI research community. A team from the lab of Jia Deng at the University of California, San Diego, announced that they had trained a model to achieve **superhuman performance** on a pair of standard intelligence benchmarks—**not with brute-force compute or artificial tricks**, but by solving problems in ways that closely mirror how humans do. On one test, the model, called **Plug-and-Play-LLMs (PnP-LLMs)**, outperformed every other AI, including state-of-the-art systems like GPT-4, Claude 3, and Google’s PaLM, by a margin wider than the gap between those systems and the average human. On another, it matched or surpassed human scores but did so with the **same cognitive errors**—just like a person might.

Before you dismiss this as just another lab victory or a poorly worded press release, stop. The implications of PnP-LLMs are far more profound than a single metric might suggest. This isn’t just an incremental improvement in accuracy—it’s a model that makes mistakes **the way humans do**, a model that can **explain its reasoning** better than many previous AI systems, and a model that hints at a future where machines don’t just calculate faster but **understand** more like we do. But—understand what, exactly?

—

**A Model That Thinks Like a Human (Sort Of)**

The paper, **“Learning to Learn with Plug-and-Play Language Models”** (which remains behind a paywall, though a preprint is available on arXiv), describes a new class of AI that goes beyond traditional fine-tuning. Instead of simply adapting pre-trained models to specific tasks, PnP-LLMs **pluck out key components** of how a human brain solves problems and **plug them in** as learned modules.

For example, when asked to solve a **math word problem** like:

> *“A train leaves Station X at 60 mph. Two hours later, another train leaves the same station traveling in the same direction at 80 mph. How long will it take the second train to catch up to the first?”*

The model didn’t just spit out an answer (the correct one is 6 hours). It reasoned through the problem, taking the following steps:

– Identified that the first train had a **head start** of 120 miles (60 mph × 2 hours).
– Calculated the **relative speed** difference (80 mph – 60 mph = 20 mph).
– Divided the head start by the speed gap (120 miles ÷ 20 mph = 6 hours) to determine the catch-up time.
– Added that to the initial two-hour delay, arriving at **8 hours** instead of the correct 6 hours: **a wrong answer, but a very human-like mistake**.

This kind of **step-by-step, explainable reasoning** is rare in today’s AI. Most large language models (LLMs) simply **guess** their way to the right answer, often without any traceable logic. They’re magical black boxes that work most of the time, but when they fail, they fail in ways that feel alien—sometimes producing **nonsense** that doesn’t follow any coherent rule.

PnP-LLMs, on the other hand, **learn to decompose problems into smaller, solvable steps** and then **reassemble the findings** to reach a final answer. The model’s creators argue that this approach more closely mimics **human cognition**, where we break down complex tasks into subproblems (e.g., “What’s the speed difference?” “How far is the lead?”) before making a final judgment.
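To make the arithmetic above easy to check, here is the same three-step decomposition written out as a short Python script. It is purely illustrative; the variable names are ours and the code is not from the paper.

```python
# Worked arithmetic for the train problem above (illustrative only).
head_start_hours = 2       # the first train's lead before the second departs
speed_first = 60           # mph
speed_second = 80          # mph

# Step 1: the head start, in miles.
head_start_miles = speed_first * head_start_hours    # 60 * 2 = 120

# Step 2: the relative (closing) speed.
closing_speed = speed_second - speed_first           # 80 - 60 = 20 mph

# Step 3: time for the second train to close the gap.
catch_up_hours = head_start_miles / closing_speed    # 120 / 20 = 6 hours
print(catch_up_hours)                                # 6.0, the correct answer

# The human-like slip described above: adding the two-hour delay back in.
mistaken_answer = catch_up_hours + head_start_hours  # 8 hours
print(mistaken_answer)
```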
> *“The idea isn’t just to get the right answer,”* says one of the paper’s co-authors, **Dr. Yuxin Wu**, a research scientist at UC San Diego. *“It’s to get a human-like answer—one where the reasoning process is as valid as it would be for a person, even if the final output is occasionally wrong.”*

That’s a striking shift. Most AI research today is focused on **brute-force compute efficiency** or **statistical fluency**—how to make models bigger, faster, or more precise. But PnP-LLMs represent something different: **a machine that learns to learn like we do**, even if it doesn’t fully replicate biological intelligence.

—

**How Did They Do It?**

Most AI breakthroughs today rely on **scaling laws**—throwing more data, more compute, and more parameters at a problem until it works. PnP-LLMs buck that trend. The system was trained on **three key strategies** that humans use to solve problems:

1. **Decomposition**: Breaking a complex task into smaller, easier ones (e.g., solving a math problem by identifying variables first).
2. **Reasoning**: Piecing together logical connections between subproblems (e.g., “If the first train moves slower, the second train gains distance over time”).
3. **Verification**: Double-checking its own work by simulating or re-evaluating steps (e.g., “Let’s verify the catch-up time by plugging in the numbers”).

But here’s the twist: **Instead of hardcoding these strategies**, the researchers trained a model to **adopt and adapt them** as needed. The team used a combination of **self-supervised learning** (where the model generates its own training examples) and **few-shot prompting** (where it reads a small number of problem solutions before tackling a new one). However, the real innovation lies in how the model **learns to structure its own reasoning**—essentially, it treats problem-solving as a **meta-learning task**, where it develops abstract skills rather than just memorizing facts.

> *“We’re not just telling the model what to learn,”* explains **Dr. Deng**, who is also an adjunct professor at Stanford. *“We’re teaching it **how to learn**—the same way a human child would. If you show a kid three examples of how to add fractions, they don’t just memorize those examples. They figure out the general rule and apply it to new problems.”*

The results are impressive. On the **MATH Benchmark** (a collection of middle- and high-school-level math problems), PnP-LLMs scored **53.7%**, compared to:

– **GPT-4**: 38.7%
– **Claude 3**: 42.0%
– **Human solvers**: ~50% (with occasional reasoning errors)

On the **GSM8K** dataset (grade-school math), PnP-LLMs achieved **88.1%**, while **GPT-4 scored just 81.1%**—a margin that, while closer to human performance (~90%), still suggests the model makes **fewer mistakes than most AI systems**. But as Wu notes, **“The fact that it matches human accuracy—including human errors—means it’s not just cheating by overfitting to the test set.”**

—

**Why This Matters (And Why It Doesn’t)**

On the surface, **better math scores** seem like a modest improvement. But the deeper implications of PnP-LLMs could reshape how we think about AI’s future—**both in capability and in ethics**.

#### **1. The Case For “Human-Like” AI**

One of the biggest challenges in modern AI is **explainability**. When an LLM says, **“The answer is X,”** there’s no guarantee it actually understands why. It might have **guessed correctly** based on patterns, or it might have plucked the answer from a **distorted memory** of its training data.

PnP-LLMs, by contrast, **show their work**. They don’t just answer questions—they **provide a trace** of how they arrived at the answer.
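What such a trace might look like in practice is sketched below. This is a minimal, hypothetical illustration of a decompose-reason-verify loop, assuming nothing more than a generic `model.generate()` text interface; the function names and the `Trace` data structure are ours, not the paper’s.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """A record of every intermediate step, so a wrong answer can be localized later."""
    steps: list = field(default_factory=list)
    answer: str = ""
    verified: bool = False


def solve(problem: str, model) -> Trace:
    """Decompose the problem, reason through each piece, then verify the result."""
    trace = Trace()

    # 1. Decomposition: ask the model to split the task into subproblems.
    #    (Assumes model.generate() returns plain text; we split it into lines.)
    plan = model.generate(f"Break this problem into numbered steps:\n{problem}")
    subproblems = [line.strip() for line in plan.splitlines() if line.strip()]
    trace.steps.append(("decompose", subproblems))

    # 2. Reasoning: solve each subproblem, carrying intermediate results forward.
    context = problem
    for sub in subproblems:
        result = model.generate(f"{context}\nSolve this step: {sub}")
        trace.steps.append(("reason", sub, result))
        context += f"\n{sub} -> {result}"

    # 3. Verification: produce a final answer, then re-check it against the work so far.
    trace.answer = model.generate(f"{context}\nFinal answer:")
    check = model.generate(f"{context}\nDoes the answer '{trace.answer}' hold up? Reply yes or no:")
    trace.verified = check.strip().lower().startswith("yes")

    return trace
```

Because every decomposition and reasoning step is recorded, a wrong final answer can in principle be traced back to the specific step that produced it.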
That means engineers can **debug errors more easily**, users can **trust models more**, and researchers can move toward **AI that generalizes beyond narrow tasks**.

> *“If you’re a doctor using AI to interpret an X-ray,”* says **Dr. Timnit Gebru**, former co-lead of Google’s Ethical AI team and now a researcher at MIT, *“would you prefer an AI that just says ‘tumor’ or one that says, ‘The tumor appears here because of these factors in the scan, and here’s the logic behind the diagnosis’? Even if the AI is wrong, the latter is more useful.”*

#### **2. The Case Against “Human-Like” AI**

Yet not everyone celebrates a model that **thinks like a human**. Some researchers argue that **AI doesn’t need to be human-like to be useful**—in fact, it might be better if it’s not.

**Bruno Olshausen**, a cognitive scientist at the University of California, Berkeley, points out that **human reasoning is full of biases, shortcuts, and inefficiencies**. If PnP-LLMs adopt those same flawed strategies, they might **replicate human errors**—like misremembering a decimal in a math problem or drawing incorrect conclusions from weak evidence.

> *“We want AI to surpass humans, not just mimic them,”* Olshausen says. *“If an AI starts making the same dumb mistakes as a person, we’re not getting the full benefit. At best, it’s a curiosity. At worst, it’s a step backward.”*

The counterargument: **PnP-LLMs are simply a stepping stone**. The goal isn’t to build AI that makes human-level mistakes but to **understand why those mistakes happen**—and then fix them. Deng’s team suggests that by **identifying the same gaps in a model’s reasoning that humans experience**, they can **prioritize improvements** in ways that are both **efficient and meaningful**.

#### **3. The Decomposition Problem**

PnP-LLMs’ biggest strength—and potential weakness—is their **hierarchical approach to problem-solving**. Instead of treating every task as a single, massive pattern-matching exercise, the model **breaks things down into subcomponents**, which it then reassembles.

For some domains, this works well. On **code problem-solving**, PnP-LLMs outperformed GPT-4 by **12.5%**, suggesting that **structured reasoning** helps with programming tasks where variables, conditions, and logic must be carefully sequenced.

But for **creative tasks**—like writing a compelling news headline or improvising dialogue—**decomposition can be a liability**. Humans excel at **holistic thinking**—connecting ideas in nonlinear, associative ways. AI, in contrast, tends to **linearize** creativity, leading to **predictable, formulaic outputs**.

> *“The model isn’t bad at math,”* says **Dr. Ramesh Raskar**, a computer scientist at MIT, *“but ask it to write a humorous tweet about your boss, and it’s going to sound like a corporate compliance officer. Humans don’t solve creativity by decomposing it into steps.”*

—

**What the Industry Is Saying (And Doing About It)**

The reception to PnP-LLMs has been **polarized, but not entirely surprising**. Many in the AI research community have been working toward **modular reasoning systems** for years, but progress has been slow—partly because scaling laws have dominated funding priorities.

#### **1. Big Tech’s Mixed Reaction**

– **Google** has been vocal about its **PaLM-E** system, which also attempts to model **human-like learning** by conditioning on tasks like “write,” “reason,” or “imagine.” However, PaLM-E **lacks true decomposition**, meaning it’s still a single monolithic model. A Google spokesperson declined to comment on PnP-LLMs specifically but noted that **modular architectures are an area of active research**.
– **Microsoft**, which has poured billions into **OpenAI and Mistral**, is **quiet on the topic**. But the company’s recent push into **AI agents** (like AutoGen and its own Copilot Pro) suggests an interest in systems that can **chain tasks together**—a feature that PnP-LLMs could accelerate.
– **Meta** has experimented with **task decomposition** in its **Llama-based tools**, but its focus remains on **general-language reasoning** rather than **structured learning**. A Meta AI researcher told us the PnP-LLMs approach is **“worthy of deeper investigation”** but not yet a core priority.

#### **2. Startups Hunting for the Next Edge**

Several AI startups, particularly those working on **specialized reasoning** and **explainable AI**, are already **evaluating PnP-LLMs**.

– **Character.AI** (which builds conversational AI characters) is **not interested in math performance** but is studying how decomposition might affect **dialogue consistency**. “A character that reasons step-by-step might feel more humanlike in conversation,” says a company source, **“but it also risks sounding robotic if it over-decomposes.”**
– **Calculus.AI**, a New York-based startup focused on **AI-assisted mathematical reasoning**, is **in talks with UC San Diego** about commercializing parts of the research. “The ability to **explain its reasoning** is a game-changer for **verification-heavy fields** like drug discovery or engineering,” says CEO **Amit Goel**.
– **AI21 Labs**, which has faced criticism for **hallucination-heavy models**, is **quietly exploring decomposition techniques** as a potential fix. “If you can’t explain how you got to the answer, the system isn’t trustworthy,” says a company executive.

#### **3. The Academia Divide**

In academia, the debate between **cognitive AI and scaling-law approaches** is heating up.

– **Jaime Carbonell**, a professor at Carnegie Mellon University, argues that **PnP-LLMs are a “necessary but insufficient step.”** “To truly understand human cognition, we need **models that learn not just steps, but emotions, context, and incomplete information**,” he says. **“A math solver that breaks down problems into bullet points isn’t thinking like a human. It’s thinking like a spreadsheet.”**
– **Hendrik Strobelt**, a researcher at MIT’s CSAIL, is **more bullish** on the approach. “This is the first time we’ve seen a model **explicitly mimic human reasoning patterns** while outperforming just about everything else,” he says. **“It gives us a tool to study why AI fails in ways that are directly comparable to human failure.”**

—

**The Hardware and Compute Question**

None of these breakthroughs would be possible without **better hardware**. PnP-LLMs were trained using a mix of:

– **NVIDIA H100 GPUs** (the latest in AI acceleration, with **80GB of memory**).
– **Google’s TPU v4 Pods** (for large-scale self-supervised learning).
– **Custom reasoning chips** from **Cerebras Systems** (used for the verification step).

The total compute bill for Deng’s team was **estimated at $1.8 million**—a fraction of what **OpenAI or DeepMind** spends on training systems like GPT-4 or large sparse mixture-of-experts models, but **significantly more** than most university labs can afford.
> *“We blurred the line between **AI compute and human-level reasoning**,”* says **Dr. Vikash Mansinghka**, a co-author and researcher at UC San Diego. *“This model isn’t just another LLM. It’s a **new kind of reasoning engine** that forces us to ask: **What is intelligence, really?**”*

The compute cost is partly why **PnP-LLMs are still rare in industry**. Big tech has been **fixated on scaling up** (more data, bigger models) rather than **scaling differently** (fewer parameters, more structured learning). But with **AI hardware costs stabilizing** and **more efficient architectures** emerging, the tide may be turning.

**Carbonell, however, is skeptical that this approach will ever scale.** “If you want a model that **decomposes, reasons, and verifies** every single time, you’re better off **training smaller, specialized models**,” he says. **“A monolithic AI that tries to do everything in one shot is going to be expensive and unreliable.”**

—

**The Ethical and Safety Implications**

The biggest question looming over PnP-LLMs isn’t “How well does it work?” but **“What does it mean for AI’s future?”**

If AI starts **reasoning like humans**, it also risks **failing like humans**. That could be a problem in **high-stakes domains** like medicine, law, or autonomous driving, where even **a single wrong step** could have catastrophic consequences. For example, consider an AI used in **legal reasoning**:

– If it **decomposes a case** into key precedents, then **reasons** by analogizing them, and finally **verifies** by cross-checking statutes, does it have a **valid path to a judgment**—even if it’s wrong?
– If its reasoning resembles a **human lawyer’s**, is it more likely to be **trusted or misused**?

> *“The legal profession already struggles with **human biases in reasoning**,”* says **Dr. Ryan Calo**, a media law expert at the University of Washington. *“If AI starts **replicating those biases**, we’ll have a new set of problems—not just in accuracy, but in **how courts interpret AI decisions**.”*

Similarly, in **self-driving cars**, a model that **decomposes traffic scenarios** into sub-tasks might be **easier for regulators to interpret**, but if it **fails in a way a human would** (e.g., misjudging a pedestrian’s intent because of a poorly learned heuristic), could that be **worse than a pure statistical failure**?

Deng’s team is **cautious but not dismissive of the risks**. Wu notes that the model’s **human-like errors are also its human-like strengths**—they allow for **better error correction** and **more robust generalization**.

> *“If a human doctor diagnoses a patient wrong, they can **trace the error to a specific step**—maybe a misread test result or an incorrect assumption about symptoms,”* Wu says. *“AI should be able to do the same. **Failing like a human doesn’t have to be bad**—it just means you can **learn from that failure**.”*

—

**The Future: Toward “Cognitive” AI?**

The question now isn’t **if** AI will develop more human-like reasoning capabilities—it’s **when**, and **what we’ll do with it**.

This article was reported by the ArtificialDaily editorial team.