When the Black Box Beats the Experts: A Catalog of Inexplicably Good AI Results

TL;DR

Across at least 18 well-documented cases spanning games, materials science, mathematics, chip design, antibiotics, fusion control, and aerospace, AI systems have produced solutions that domain experts call "alien," "unconventional," or "creative" — and that outperform the best human-designed alternatives despite resisting full mechanistic explanation.
The most credible recent examples are not artistic curiosities but production-grade engineering wins: AlphaChip now lays out Google's TPUs in hours instead of months; AlphaEvolve (May 2025) recovers ~0.7% of Google's worldwide compute and broke a 56-year-old matrix-multiplication record; AlphaFold (Nobel Prize, Chemistry 2024) solved a 50-year-old biology problem; AlphaDev's sorting routines are now in the LLVM C++ standard library.
Two consistent patterns explain why these results feel inexplicable: (a) the AI exploits the full physics or combinatorics of a system — analog FPGA glitches, sub-instruction CPU rewrites, sphere packings in 11 dimensions — that humans abstract away; and (b) the AI is unconstrained by historical "good style," so it finds locally counterintuitive moves (queen sacrifices, shoulder hits, ungainly antenna shapes) that turn out to be globally optimal.

Key Findings

Black-box wins are now routine in production. Three of the cases below — AlphaChip (Google TPU layouts), AlphaDev (LLVM libc++ sort), and AlphaEvolve (Borg scheduling and Gemini training kernels) — are running in real infrastructure used by billions of users, not laboratory demos.
"Inexplicable" rarely means uninterpretable forever. AlphaGo's Move 37 and AlphaZero's queen sacrifices have been partially absorbed into human theory; AlphaFold's mechanism is now understood as sophisticated fold recognition exploiting the completeness of the PDB. Inexplicability is often a temporary state.
The most genuinely mysterious cases involve evolved hardware/circuits. Adrian Thompson's 1996 FPGA tone discriminator and NASA's ST5 antenna remain the canonical examples of physical artifacts that work but whose internal logic resists clean human description.
AI now contributes verifiable mathematics. FunSearch (2023) delivered the first LLM-driven discovery on an open problem (larger cap sets than any previously known); AlphaTensor (2022) broke Strassen's 1969 record in modulo-2 arithmetic; AlphaEvolve (2025) improved 14 matrix-multiplication algorithms and improved best-known bounds on roughly 20% of the 50+ open math problems it was tested on.
Medical AI has surfaced signals that were hiding in plain sight. Google/Verily's retinal model predicts sex with AUC 0.97 from fundus photos, a signal ophthalmologists did not know existed. Halicin (MIT, 2020) is a structurally novel antibiotic whose discovery path through a neural net cannot be reduced to medicinal-chemistry rules.

Details

Game-Playing and Strategy

1. AlphaGo Move 37 (DeepMind, 2016). In Game 2 against Lee Sedol, AlphaGo's 37th move was a shoulder hit on the fifth line that AlphaGo itself estimated a human would play with a probability of only 1 in 10,000. European champion Fan Hui's reaction in the AlphaGo documentary: "When I see this move, for me, it's just a big shock. What? Normally, humans, we never play this one because it's bad. It's just bad. We don't know why, it's bad." Commentator Michael Redmond (9-dan) called it "creative" and "unique"; An Younggil (8-dan) called it "a rare and intriguing shoulder hit." Lee Sedol left the room and took 15 minutes to respond. The move's value was learned during self-play and could not be reduced to traditional Go aesthetics; it is now studied as a canonical example of machine creativity.

2. AlphaZero's queen sacrifices (DeepMind, 2017–2018). Trained from scratch in 9 hours with only the rules of chess, AlphaZero defeated Stockfish 8 over 100 games with 28 wins, 72 draws, and no losses. DeepMind co-founder Demis Hassabis called the play style "alien": "It sometimes wins by offering counterintuitive sacrifices, like offering up a queen and bishop to exploit a positional advantage. It's like chess from another dimension." Grandmaster Matthew Sadler said its style was "like discovering the secret notebooks of some great player from the past." The system searched only 80,000 positions per second compared to Stockfish's 70 million (roughly 875 times fewer), meaning the neural net's positional intuition, not brute calculation, was doing the work, and that intuition was not reverse-engineerable to known chess principles.

3. Libratus & Pluribus poker AIs (Carnegie Mellon, 2017 & 2019). Libratus beat four top heads-up no-limit Hold'em pros in 2017, and Pluribus beat five pros simultaneously in 2019. Their most inexplicable tactic was the massive over-bet (sometimes wagering $20,000 into a $100 pot), which initially looked like an error but was later analyzed and adopted by the human poker community. CMU's Frank Pfenning: "The computer can't win if it can't bluff. Developing an AI that can do that successfully is a tremendous step forward scientifically." Project lead Tuomas Sandholm's twist: rather than modeling opponents, Libratus algorithmically patched holes in its own strategy each night.

Scientific Discovery

4. AlphaFold 2 (DeepMind, 2020; Nobel Prize 2024). Predicted 3D protein structures from amino-acid sequences at near-experimental accuracy at CASP14, ending a 50-year grand challenge. By 2024 the AlphaFold database contained over 200 million predicted structures used by more than 2 million researchers in 190 countries. Demis Hassabis and John Jumper shared half of the 2024 Nobel Prize in Chemistry "for protein structure prediction." Even authors writing post-hoc analyses describe it as a "black box approach": "Input a protein sequence and by 'magic' the protein's three-dimensional structure appears" (Skolnick et al., 2021, Proteins). The mechanism is now partially understood (sophisticated fold recognition exploiting the completeness of the PDB), but its sudden, discontinuous jump in accuracy from CASP13 to CASP14 was unanticipated by the field.

5. GNoME materials discovery (DeepMind + Lawrence Berkeley Lab, 2023). A graph neural network predicted 2.2 million new crystal structures, of which 380,000 are predicted stable — equivalent, DeepMind says, to "nearly 800 years' worth of knowledge." Before GNoME, humans had identified roughly 48,000 inorganic crystals over decades. Independent labs have already physically synthesized 736 of the predicted structures. Per DeepMind's official GNoME blog, GNoME also identified "528 potential lithium ion conductors, 25 times more than a previous study" (the 25× figure compares candidate counts, not the conductivity level itself). Crucially, the chemical reasoning behind why specific candidates are stable is encoded in graph-neural-network weights, not in interpretable rules.

6. AlphaTensor matrix-multiplication algorithms (DeepMind, Nature 2022). DeepMind formulated matrix multiplication as a single-player game and trained an AlphaZero-style agent to discover algorithms. The agent rediscovered Strassen's 1969 algorithm and then surpassed it: for 4×4 matrices in modulo-2 arithmetic, it found a 47-multiplication algorithm, beating the 49 multiplications of recursively applied Strassen for the first time in 53 years. For multiplying a 4×5 matrix by a 5×5 matrix, it improved the best-known count from 80 to 76 multiplications. Tuned for an NVIDIA V100, its discovered algorithms ran a median 8.5% faster; on TPUs, 10.3% faster.
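
To make the multiplication counts above concrete, here is a minimal Python sketch of Strassen's 1969 scheme for 2×2 matrices, which uses 7 scalar multiplications instead of the naive 8. Treating a 4×4 matrix as a 2×2 matrix of 2×2 blocks and recursing gives 7 × 7 = 49 multiplications, the baseline that the 47-multiplication result improves on. The function names are illustrative, and this is the classical baseline, not AlphaTensor's discovered algorithm.

```python
def strassen_2x2(a, b):
    """Multiply two 2x2 matrices using 7 multiplications (Strassen, 1969)."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    # The seven products; everything else is addition and subtraction.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]


def naive_2x2(a, b):
    """Schoolbook 2x2 product (8 multiplications), used only as a reference."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def strassen_mult_count(n):
    """Scalar multiplications when Strassen is applied recursively to
    n x n matrices, for n a power of two."""
    return 1 if n == 1 else 7 * strassen_mult_count(n // 2)
```

Under this counting, `strassen_mult_count(4)` is 49 against 4³ = 64 for the schoolbook method, which is why shaving even two multiplications off the 4×4 case counts as a record.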

7. FunSearch on cap sets (DeepMind, Nature December 2023). Pairing a PaLM-2 LLM with an automated evaluator, FunSearch produced programs (not just numbers) that constructed the largest known cap sets in 8 dimensions — a size-512 set, improving on the previous best of 496, and the largest improvement in the asymptotic lower bound for the cap-set problem in 20 years. The cap-set problem had been described by Fields Medalist Terence Tao as one of his favorite open questions. DeepMind: "To the best of our knowledge, this shows the first scientific discovery — a new piece of verifiable knowledge about a notorious scientific problem — using an LLM."
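
What makes a result like this verifiable is that the cap-set property can be checked mechanically. The sketch below, in Python, assumes the standard definition (a set of vectors over the integers mod 3 with no three distinct points on a line, equivalently no three distinct points summing to zero mod 3) and pairs a checker with a naive greedy builder. The names `third_point`, `is_cap_set`, and `greedy_cap_set` are hypothetical, and the fixed lexicographic order merely stands in for whatever ordering a search procedure might propose; it is not FunSearch's program.

```python
from itertools import combinations, product


def third_point(x, y):
    """The unique z with x + y + z = 0 (mod 3): the third point on the
    line through x and y in F_3^n."""
    return tuple((-(a + b)) % 3 for a, b in zip(x, y))


def is_cap_set(points):
    """True if no three distinct points of `points` lie on a common line."""
    pts = set(points)
    return all(third_point(x, y) not in pts for x, y in combinations(pts, 2))


def greedy_cap_set(n):
    """Scan F_3^n in lexicographic order, keeping each vector that does not
    complete a line with two vectors already kept."""
    chosen = set()
    for v in product(range(3), repeat=n):
        if all(third_point(x, v) not in chosen for x in chosen):
            chosen.add(v)
    return chosen
```

In two dimensions the greedy scan keeps (0,0), (0,1), (1,0), and (1,1). The checker is the important half: any candidate set, however it was produced, either passes or fails.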

8. AlphaEvolve (DeepMind, May 2025). A Gemini-powered evolutionary coding agent that combines Gemini Flash (breadth) and Gemini Pro (depth) with automated evaluators. Key results, all from DeepMind's official announcement and white paper:

48-multiplication algorithm for 4×4 complex-valued matrices, improving on Strassen's 49-multiplication algorithm for the first time in this setting since 1969. The white paper reports AlphaEvolve "improves the state of the art for 14 matrix multiplication algorithms" across rectangular sizes.
Tested on over 50 open problems in mathematical analysis, geometry, combinatorics and number theory: rediscovered state-of-the-art solutions in ~75% of cases and improved them in ~20%. Named results include a new lower bound for the kissing number in 11 dimensions (593 spheres, up from 592) and a new upper bound for Erdős's minimum overlap problem.
Production deployment at Google: a scheduling heuristic for Borg "continuously recovers, on average, 0.7% of Google's worldwide compute resources." A Verilog rewrite "removed unnecessary bits in a key, highly optimized arithmetic circuit for matrix multiplication," now "integrated into an upcoming Tensor Processing Unit." A Pallas kernel for Gemini was sped up by 23%, reducing total Gemini training time by 1%. The FlashAttention kernel implementation was sped up by up to 32.5%.
DeepMind researcher Matej Balog (VentureBeat, May 14, 2025): "What we found, to our surprise, to be honest, is that AlphaEvolve, despite being a more general technology, obtained even better results than AlphaTensor. For these four by four matrices, AlphaEvolve found an algorithm that surpasses Strassen's algorithm from 1969 for the first time in that setting."
Terence Tao, who collaborated on follow-up math work and tested AlphaEvolve on 67 problems including the moving sofa problem, Sendov's conjecture, and Crouzeix's conjecture, wrote (Nov 5, 2025 blog post): "Perhaps unsurprisingly, AlphaEvolve was extremely good at locating 'exploits' in the verification code we provided … but were not in the spirit of the actual problem." And: "The stochastic nature of the LLM can actually work in one's favor in such an evolutionary environment: many 'hallucinations' will simply end up being pruned out of the pool of solutions being evolved due to poor performance, but a small number of such mutations can add enough diversity to the pool that one can break out of local extrema."
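
The kissing-number result in the list above is the kind of claim an automated evaluator can check exactly. As a minimal sketch in Python, assume the candidate is given as integer center coordinates of equal norm: equal spheres centered there all touch a central sphere of the same radius, and they do not overlap when every pair of centers is at least one norm apart, which reduces to 2(u·v) ≤ |u|². The function names are hypothetical, and the example uses the well-known D_n root vectors (12 spheres in 3 dimensions, 24 in 4), not AlphaEvolve's 11-dimensional configuration.

```python
from itertools import combinations, product


def is_kissing_configuration(centers):
    """Check, in exact integer arithmetic, that equal spheres centered at
    `centers` can all touch a central sphere without overlapping."""
    centers = [tuple(c) for c in centers]
    if not centers:
        return True
    norm2 = sum(x * x for x in centers[0])
    # Every center must sit at the same nonzero distance from the origin.
    if norm2 == 0 or any(sum(x * x for x in c) != norm2 for c in centers):
        return False
    # |u - v|^2 >= |u|^2 is equivalent to 2 * (u . v) <= |u|^2.
    return all(2 * sum(a * b for a, b in zip(u, v)) <= norm2
               for u, v in combinations(centers, 2))


def d_n_roots(n):
    """All vectors with exactly two nonzero entries, each +1 or -1."""
    roots = []
    for i, j in combinations(range(n), 2):
        for si, sj in product((1, -1), repeat=2):
            v = [0] * n
            v[i], v[j] = si, sj
            roots.append(tuple(v))
    return roots
```

Because the check uses only integers, there is no floating-point tolerance for a search process to exploit, which speaks to Tao's observation about exploits in verification code.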

Engineering and Design

9. NASA ST5 Evolved Antenna (NASA Ames, 2006). A genetic algorithm evolved an X-band antenna for NASA's three-satellite Space Technology 5 mission with a "wide beamwidth for a circularly-polarized wave and wide bandwidth." NASA Ames' own report (Hornby, Lohn, Linden) describes the result candidly: "It has an unusual organic looking structure, one that expert antenna designers would not likely produce." The evolved antenna outperformed the antenna contractor's hand-designed version on key metrics, was re-evolved in under a month when the orbit changed, and flew successfully on all three spacecraft from March 22, 2006 — the first computer-evolved hardware ever flown in space.

10. Adrian Thompson's evolved FPGA tone discriminator (University of Sussex, 1996). Thompson evolved a circuit on a Xilinx XC6216 FPGA to discriminate 1 kHz from 10 kHz tones using fewer than 40 logic gates and no clock. The final circuit exploited subthreshold analog effects, EMI between unconnected logic cells, and silicon process variations of that specific FPGA: moving the bitstream to a different region of the same chip degraded performance. Some logic blocks appeared disconnected but, if removed, broke the circuit. EE Times: "Such a circuit would have been impossible to design by hand and it was almost impossible to understand how it worked." This is arguably the purest example in the catalog of an AI artifact that genuinely cannot be reverse-engineered.

11. AlphaChip / TPU floorplanning (Google DeepMind, Nature 2021; updated 2024). A reinforcement-learning agent places macro blocks on chip floorplans in hours rather than weeks. DeepMind: "It generates superhuman or comparable chip layouts in hours, rather than taking weeks or months of human effort." Production deployed in Google's TPU v5e, v5p, and Trillium (6th gen). Whereas human designers lay out components in neat lines, AlphaChip uses a "more scattered approach" that traditional chip engineers find counterintuitive. Caveat: the 2021 Nature paper has been disputed — a UCSD reproduction (Andrew Kahng group) and former Google researcher Igor Markov raised concerns that the comparison baselines were not standardized and that some commercial EDA-tool preprocessing was not disclosed. DeepMind published a rebuttal in 2024.

12. Airbus A320 "Bionic Partition" (Autodesk + The Living + APWorks, 2015–2016). Generative-design software (Project Dreamcatcher) seeded with slime-mold and mammalian-bone-growth algorithms produced a 3D-printed cabin-galley partition that is ~45% lighter than the standard honeycomb partition while being structurally as strong. Airbus innovation manager Bastian Schaefer: "Our goal was to reduce the weight by 30 percent, and we altogether achieved weight reduction by 55 percent." Estimated CO₂ savings of up to 465,000 metric tonnes per year if rolled out across the A320 fleet. The 66,000-micro-lattice-bar geometry is too complex to reason about cell-by-cell.

13. DeepMind tokamak plasma control (DeepMind + EPFL Swiss Plasma Center, Nature 2022). A deep RL agent autonomously learned to control the 19 magnetic coils of the Tokamak à Configuration Variable (TCV, "variable-configuration tokamak") and sustain plasma in shapes previously requiring nested controllers, including elongated, "negative triangularity," and "snowflake" configurations, and even two separate "droplet" plasmas maintained simultaneously. MIT Technology Review: "The AI also learned how to control the plasma by adjusting magnets in a way that humans had not tried before, which suggests that there may be new reactor configurations to explore." Ambrogio Fasoli (director, Swiss Plasma Center): "We can take risks with this kind of control system that we wouldn't dare take otherwise." The work is now part of a partnership with Commonwealth Fusion Systems on the SPARC tokamak.

14. Karl Sims' Evolved Virtual Creatures (Thinking Machines / SIGGRAPH, 1994). A historical foundational case. Genetic algorithms running on a Connection Machine CM-5 co-evolved 3D-block creatures' morphology and control neural networks for swimming, walking, jumping, and competing for a cube. Many evolved gaits exploited simulator artifacts — "unexpected methods of cheating," as Sims himself put it — by leveraging numerical instabilities or unintended physics. This is the canonical early demonstration that optimization processes find solutions outside the designer's hypothesis space.
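
A toy example shows how a numerical instability can hand an optimizer free energy. The Python sketch below is not Sims's simulator: it assumes a frictionless unit mass on a unit spring and integrates it two ways. Explicit Euler multiplies the energy by (1 + dt²) on every step, so the energy grows without bound; the semi-implicit variant, which updates velocity first, keeps it bounded. Function names and the step size are illustrative.

```python
def explicit_euler_energy(steps, dt=0.1):
    """Energy of a unit mass-spring system after `steps` explicit Euler
    steps, starting from x = 1, v = 0 (initial energy 0.5)."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        # Both updates use the old state, which inflates x^2 + v^2 each step.
        x, v = x + dt * v, v - dt * x
    return 0.5 * (x * x + v * v)


def semi_implicit_euler_energy(steps, dt=0.1):
    """Same system, but velocity is updated first and position then uses
    the new velocity; the energy stays close to its initial value."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        v = v - dt * x
        x = x + dt * v
    return 0.5 * (x * x + v * v)
```

A creature scored on distance travelled inside the first integrator has every incentive to find the motion that pumps this artificial energy, a behavior the designer never intended to reward.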

Algorithms, Computing, and Forecasting

15. AlphaDev sorting algorithms (DeepMind, Nature June 2023). Treated sorting as an assembly-instruction game and discovered new sort routines for sequences of 3, 4, and 5 elements. Per DeepMind's official AlphaDev blog: "AlphaDev uncovered new sorting algorithms that led to improvements in the LLVM libc++ sorting library that were up to 70% faster for shorter sequences and about 1.7% faster for sequences exceeding 250,000 elements." AlphaDev's "swap" and "copy" moves are sequences of assembly instructions that save a single instruction each but were not in any human's library. The algorithms have been merged into the LLVM standard C++ sort library and are now called trillions of times per day.
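
For readers unfamiliar with fixed-length sort routines, the Python sketch below shows the general shape of the problem: a three-element sort written as a fixed sequence of compare-exchange steps with no data-dependent branching, plus an exhaustive correctness check. It is the textbook three-comparator sorting network, offered only as an illustration; AlphaDev worked at the assembly-instruction level, and its discovered sequences are not reproduced here. The function names are hypothetical.

```python
from itertools import product


def sort3(a, b, c):
    """Sort three values with a fixed sequence of three compare-exchanges."""
    a, b = min(a, b), max(a, b)   # order positions 0 and 1
    b, c = min(b, c), max(b, c)   # largest value is now in position 2
    a, b = min(a, b), max(a, b)   # order the remaining two
    return a, b, c


def verify_sort3():
    """Check sort3 against the built-in sort on every input drawn from
    {0, 1, 2}^3, which covers all orderings and all patterns of ties."""
    return all(sort3(*p) == tuple(sorted(p))
               for p in product(range(3), repeat=3))
```

Because the input space for such small routines can be enumerated, correctness is cheap to verify, leaving instruction count and latency as the quantities to optimize.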

16. GraphCast weather forecasting (DeepMind, Science November 2023). A graph neural network trained on 40 years of ECMWF ERA5 reanalysis data produces 10-day global forecasts at 0.25° resolution in under one minute — versus an hour on the ECMWF supercomputer. GraphCast outperforms ECMWF's HRES — the world's gold-standard deterministic forecast — on 89.3% of 2,760 target variables and lead times evaluated, and on >99% of troposphere variables. Per DeepMind's official GraphCast blog: "In September, a live version of our publicly available GraphCast model, deployed on the ECMWF website, accurately predicted about nine days in advance that Hurricane Lee would make landfall in Nova Scotia… traditional forecasts had greater variability in where and when landfall would occur, and only locked in on Nova Scotia about six days in advance." ECMWF's head of Earth system modeling Peter Dueben told MIT Technology Review (Melissa Heikkilä, Nov 14, 2023): "When Google DeepMind first debuted GraphCast last December, it felt like Christmas." GraphCast won the 2024 MacRobert Award, the UK's top engineering prize. Why it works at the physical level is poorly understood: it learns statistical relationships from data without solving the underlying physics equations.

Medical and Biological

17. Halicin antibiotic discovery (MIT Collins/Barzilay labs, Cell 2020). A deep neural network was trained on a labeled set of 2,335 molecules tested for E. coli inhibition, then deployed for screening. Per MIT News (Feb 20, 2020): "The computer model, which can screen more than a hundred million chemical compounds in a matter of days… After identifying halicin, the researchers also used their model to screen more than 100 million molecules selected from the ZINC15 database." Halicin itself was identified — in under two hours — from the ~6,000-compound Broad Drug Repurposing Hub; the >100-million-molecule ZINC15 screen was a follow-up. Halicin kills E. coli, Clostridium difficile, Acinetobacter baumannii, and Mycobacterium tuberculosis, including pan-resistant strains. E. coli developed no resistance to halicin in 30 days of laboratory evolution, versus 24–72 hours for ciprofloxacin. Collins: "Our approach revealed this amazing molecule which is arguably one of the more powerful antibiotics that has been discovered." The molecule is structurally unlike conventional antibiotics, and its membrane-depolarization mechanism was not predicted in advance — the neural network selected it on grounds that are not reducible to known medicinal-chemistry rules. A 2023 follow-up by the same group, using "explainable AI," found a new class of MRSA-killing compounds.

18. Sex/cardiovascular prediction from retinal photos (Google + Verily, Nature Biomedical Engineering 2018). Poplin et al. trained Inception-v3 on fundus images from 284,335 patients and predicted patient sex with AUC = 0.97, a signal ophthalmologists did not believe existed in retinal images. The model also predicted age (mean absolute error 3.26 years), smoking status (AUC 0.71), systolic BP (MAE 11.23 mmHg), and major adverse cardiac events (AUC 0.70). Saliency maps highlighted the optic disc, vessels, and "non-specific features"; that is, the network was using "a bit of everything but nothing in particular." Independent replications (Korot et al. 2021, Kim et al.) confirmed that the finding generalizes. Cardiologist commentary: "Until this day, we didn't think that there were any differences between male and female eyes. How can a neural network tell them apart from just a photo of the retina is yet to be elucidated."
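
Since several of the figures above are AUC values, a short Python sketch of what the metric measures may help. AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half; 0.5 is chance and 1.0 is perfect separation. The function name and the toy labels and scores are made up for illustration and have nothing to do with the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one positive (score 0.2) is outranked by one
# negative (score 0.3), so 3 of the 4 pairs are ordered correctly.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.1])  # 0.75
```

Read this way, an AUC of 0.97 for sex means the model ranks a randomly chosen pair of one male and one female image correctly about 97% of the time.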

Recommendations

For a researcher, engineer, or executive trying to use this landscape:

If you need a verifiable, deployable AI breakthrough today, prioritize problems with a cheap, automatic evaluator. AlphaEvolve, FunSearch, AlphaTensor, AlphaDev, and AlphaChip all share one feature: correctness or performance can be measured by a script. If your domain has such an evaluator (matrix multiplication, sorting, chip placement, mathematical proof verification, simulation reward), an evolutionary or RL coding agent is the strongest current bet. Threshold: if you can write a unit-test-like scorer for your problem, you are in scope.
If you cannot afford "alien" solutions you cannot audit, stay out of evolved-hardware territory. Thompson's FPGA and arguably AlphaChip's TPU layouts illustrate that AI optimization will happily exploit features humans abstract away. For safety-critical aerospace, medical implants, or financial systems, require the AI solution to be re-implementable in human-auditable form — as Airbus did with the bionic partition (final geometry was certified through standard FAA/EASA structural tests).
Treat "the AI discovered X" claims with calibrated skepticism, especially for in-house benchmarks. The AlphaChip dispute (Markov; Kahng/UCSD) shows that even Nature papers can have undisclosed pre-processing or unequal baselines. Demand: (a) open code, (b) third-party reproduction, (c) deployment receipts. AlphaFold, AlphaDev, GraphCast, and AlphaEvolve all clear this bar to varying degrees; many vendor demos do not.
Invest in interpretability after the win, not before. AlphaFold's mechanism is being reverse-engineered years after CASP14; halicin's mode of action was characterized after the screen, not before. The historical record suggests the right sequence is: deploy the black-box solution, then ask why it works. If you reverse the order, you forfeit most of the available wins.
Evidence that would change my recommendations: (i) a single instance of an evolved-hardware solution causing a production failure that humans could not diagnose; (ii) a peer-reviewed disproof of one of AlphaEvolve's claimed mathematical improvements; (iii) replication failure of GraphCast or AlphaDev on a fresh independent dataset. None of these has happened as of May 2026.
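
The "unit-test-like scorer" threshold in the first recommendation can be made concrete with a toy loop. The Python sketch below assumes only three ingredients: a `score` function that grades a candidate, a `mutate` function that proposes a variant, and a survivor pool. Everything here is hypothetical, including the hidden `TARGET` standing in for a real objective, and the random mutation standing in for whatever generator (an LLM, an RL policy) a production system would use.

```python
import random


def evolve(score, mutate, seed_candidate, generations=500, population=8, rng=None):
    """Keep the `population` best candidates seen so far, refilling the
    pool each generation with mutated copies of random survivors."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    pool = [seed_candidate]
    for _ in range(generations):
        children = [mutate(rng.choice(pool), rng) for _ in range(population)]
        # Elitist selection: the best candidate is never discarded.
        pool = sorted(pool + children, key=score, reverse=True)[:population]
    return pool[0]


TARGET = (3, -2, 5, 0, 1)  # hypothetical optimum, known only to the scorer


def score(candidate):
    """Scorer: negative squared distance to the target, so 0 is perfect."""
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))


def mutate(candidate, rng):
    """Nudge one randomly chosen coordinate up or down by one."""
    i = rng.randrange(len(candidate))
    step = rng.choice((-1, 1))
    return candidate[:i] + (candidate[i] + step,) + candidate[i + 1:]
```

Calling `evolve(score, mutate, (0, 0, 0, 0, 0))` climbs toward the target without the loop ever seeing it directly. The design point is that all domain knowledge lives in `score`, so the quality of the evaluator bounds the quality of the result.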

Caveats

Selection bias. This catalog is the AI hall of fame, not a representative sample. For every halicin there are thousands of AI-suggested compounds that failed in vivo; for every Move 37 there are bad moves that look creative until you analyze them. The base rate of "surprising-and-correct" remains low.
"Inexplicable" is partly a rhetorical convenience. AlphaFold's mechanism is now partially understood; AlphaZero's queen sacrifices are being taught in chess academies; AlphaGo's Move 37 has been studied move-by-move. The frontier of inexplicability moves as the field matures. A more precise term would be "discovered before it was understood."
Vendor-published numbers need independent replication. AlphaChip's superhuman claim is actively disputed by reproducibility researchers; AlphaEvolve's 0.7% / 23% / 32.5% production numbers are first-party Google measurements; GNoME's stability predictions have been challenged by some computational chemists who note that DFT-predicted stability does not guarantee synthesizability. Treat first-party benchmarks as upper bounds.
Some "inexplicable" results are actually shortcut learning. The retinal-photo sex predictor and many medical-imaging models may be picking up dataset artifacts (camera type, demographic correlates) rather than novel biology. The fact that humans can't see the signal does not always mean the signal is real biology; it sometimes means the signal is a confounder.
Two domains are notably absent from the strongest evidence: algorithmic trading (proprietary, unverifiable from outside) and AI-generated art (subjective evaluation; no clean "superhuman" benchmark exists). I deliberately did not include them in the list of strong cases despite the prominent role they play in popular accounts.