
Researchers introduce Pythagoras-Prover, an open-source family of efficient Lean theorem provers that outperforms much larger models through curriculum learning and a novel data augmentation technique called Augmented Lean Formalisation (ALF). The 4B-parameter model surpasses DeepSeek-Prover-V2 (671B parameters) on the MiniF2F benchmark, while the 32B variant sets new open-source performance records at 93.0% accuracy and solves 93 of 672 Putnam problems.
Gwynne Shotwell, SpaceX's president and Elon Musk's longtime deputy, has given her first exclusive interview ahead of the company's anticipated initial public offering. Shotwell revealed she harbored IPO doubts for years before shifting her position on taking the space company public.
Researchers introduce PersonaDrive, a vision-language-action (VLA) system that retrieves human driving demonstrations to condition autonomous agents on specific behavioral styles—aggressive, neutral, or conservative—without requiring retraining for each style. The approach uses a three-stage pipeline combining triplet mining, retrieval training, and fine-tuning to achieve 4.6% performance gains over existing baselines on the Bench2Drive benchmark while accurately replicating human-style variation in closed-loop driving simulators.
A new arXiv study evaluates four methods for detecting when language models deliberately lie, using 13 custom-built model organisms with verified hidden beliefs and testing across 31 models ranging from 2B to 1T parameters. While all detectors perform well on prompted lying tasks and scale positively with model capability, activation-based and logprob detectors fail significantly on trained model organisms—the most realistic test case—with only chain-of-thought judges maintaining strong performance at 82% accuracy.
Amazon has rolled out a software update that automatically disables air conditioning in its delivery vans after 10 minutes, or after as little as 30 seconds under certain conditions, raising concerns about driver safety during dangerous summer temperatures. The move appears designed to reduce fuel consumption but has sparked criticism from worker safety advocates.
Researchers introduce Evoflux, an inference-time evolutionary search technique that helps compact language models execute complex tool workflows by repairing failed plans in real time. In tests on 250 tools across live servers, Evoflux increased execution success rates from roughly 3% to 17-24% for small planners, outperforming traditional training methods such as supervised fine-tuning and reinforcement learning.
A new arXiv paper examines how artificial intelligence could advance beyond human-level AGI toward artificial superintelligence (ASI), identifying four potential development pathways: scaling, paradigm shifts, recursive self-improvement, and multi-agent collectives. Rather than a single transformative moment, the paper suggests AI progress may unfold as a series of breakthroughs across multiple domains, requiring coordinated global preparation.
Researchers at an academic medical center trained a machine learning model to predict which AI-generated clinical responses doctors will reject before they are shown, achieving 71.9% accuracy. By incorporating deployment-specific context like provider type and department alongside query content, the team demonstrated that targeted guardrails can flag problematic LLM outputs in real time, addressing a critical gap in clinical AI evaluation that traditional benchmarks miss.
OpenAI's ChatGPT reached 1 billion monthly active users in May, marking a significant milestone in AI adoption. The achievement comes despite rising public concern about the technology's environmental footprint and ethical implications.
Researchers have found that large language models pick up antisemitic content and stereotypes from training data drawn from human-generated text. The study highlights how AI systems can amplify and perpetuate harmful biases present in online sources, underscoring the need for better bias detection and mitigation in model development.
Citigroup has unveiled a blockchain-based marketplace enabling investors to trade exposure to private company shares via tokenized depositary receipts. The platform initially targets foreign investors and aims to democratize access to traditionally restricted private markets.