In a research lab somewhere between theory and application, researchers have been quietly working on a problem that has stumped the AI community for years. This week, they published results that could change how we think about uncertainty in machine-learning-based assessment.

“The AI landscape is shifting faster than most organizations can adapt. What we’re seeing from this work represents a meaningful step forward in how these technologies are being developed and deployed.” — Industry Analyst

Inside the Breakthrough

arXiv:2602.16039v1 (Announce Type: new)

Abstract: The rapid rise of large language models (LLMs) is reshaping the landscape of automatic assessment in education. While these systems demonstrate substantial advantages in adaptability to diverse question types and flexibility in output formats, they also introduce new challenges related to output uncertainty, stemming from the inherently probabilistic nature of LLMs. Output uncertainty is an inescapable challenge in automatic assessment, as assessment results often play a critical role in informing subsequent pedagogical actions, such as providing feedback to students or guiding instructional decisions. Unreliable or poorly calibrated uncertainty estimates can lead to unstable downstream interventions, potentially disrupting students’ learning processes and resulting in unintended negative consequences. To systematically understand this challenge and inform future research, we benchmark a broad range of uncertainty quantification methods in the context of LLM-based automatic assessment. Although the effectiveness of these methods has been demonstrated in many tasks across other domains, their applicability and reliability in educational settings, particularly for automatic grading, remain underexplored. Through comprehensive analyses of uncertainty behaviors across multiple assessment datasets, LLM families, and generation control settings, we characterize the uncertainty patterns exhibited by LLMs in grading scenarios. Based on these findings, we evaluate the strengths and limitations of different uncertainty metrics and analyze the influence of key factors, including model families, assessment tasks, and decoding strategies, on uncertainty estimates. Our study provides actionable insights into the characteristics of uncertainty in LLM-based automatic assessment and lays the groundwork for developing more reliable and effective uncertainty-aware grading systems in the future.
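The abstract benchmarks uncertainty quantification methods for LLM graders without detailing any single one here, so the sketch below is only meant to make the general idea concrete. It illustrates one widely used family of approaches, sampling-based disagreement: the same student answer is graded several times by a stochastic (nonzero-temperature) grader, and the entropy of the resulting grade distribution serves as the uncertainty score. Everything in the snippet, including the `grade_uncertainty` helper, the label set, and the random stand-in for an LLM call, is an illustrative assumption and is not taken from the paper or its benchmark.

```python
import math
import random
from collections import Counter
from typing import Callable, Dict, Tuple


def grade_uncertainty(
    grade_fn: Callable[[], str],
    n_samples: int = 10,
) -> Tuple[str, float, Dict[str, float]]:
    """Sample a stochastic grader repeatedly and summarize its disagreement.

    grade_fn is any callable returning a discrete grade label (e.g. "correct",
    "partial", "incorrect"); in practice it would wrap an LLM call made at a
    nonzero temperature. Returns the majority-vote grade, the normalized
    entropy of the sampled grade distribution (0 = full agreement, 1 = maximal
    disagreement), and the empirical distribution itself.
    """
    counts = Counter(grade_fn() for _ in range(n_samples))
    dist = {grade: c / n_samples for grade, c in counts.items()}

    # Predictive entropy over the sampled grades, normalized by log(#labels)
    # actually observed so that scores are comparable across rubrics.
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    normalized = entropy / math.log(len(dist)) if len(dist) > 1 else 0.0

    majority = counts.most_common(1)[0][0]
    return majority, normalized, dist


if __name__ == "__main__":
    # Stand-in for an LLM grader: a weighted random choice over three labels.
    rng = random.Random(0)
    fake_llm_grader = lambda: rng.choices(
        ["correct", "partial", "incorrect"], weights=[0.2, 0.6, 0.2]
    )[0]

    grade, uncertainty, dist = grade_uncertainty(fake_llm_grader, n_samples=20)
    print(f"grade={grade}  uncertainty={uncertainty:.2f}  distribution={dist}")
```

In a real grading pipeline, one plausible use of such a score is routing: answers whose normalized entropy exceeds a chosen threshold would be escalated to a human grader rather than acted on automatically, which is one way the downstream interventions the abstract worries about could be stabilized.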
The development comes at a pivotal moment for the AI industry. Companies across the sector are racing to differentiate their offerings while navigating an increasingly complex regulatory environment. For the researchers behind this work, the moment presents both an opportunity and a challenge.

From Lab to Real World

Market positioning has become increasingly critical as the AI sector matures. The team behind this study is clearly signaling its intent to compete at the highest level, investing effort in capabilities that could define the next phase of the field’s evolution.

Competitive dynamics are also shifting. Rival groups will likely need to respond with results of their own, potentially triggering a wave of activity across the sector. The question is not whether others will follow, but how quickly and at what scale.

Enterprise adoption remains the ultimate test. As organizations move beyond experimental phases to production deployments, they are demanding concrete returns on AI investments. This latest work appears designed to address exactly that demand.

“We’re past the hype cycle now. Companies that can demonstrate real value—measurable, repeatable, scalable value—are the ones that will define the next decade of AI.” — Venture Capital Partner

What Comes Next

Industry observers are watching closely to see how this line of research plays out. Several key questions remain unanswered: How will competing groups respond? What does this mean for pricing and accessibility in the research space? Will this accelerate enterprise adoption?

The coming months will reveal whether the researchers can deliver on the work’s promise. In a market where announcements often outpace execution, the real test will be what happens after the initial buzz fades. For now, one thing is clear: the team has made its move. The rest of the field is watching to see what happens next.

This article was reported by the ArtificialDaily editorial team. For more information, visit ArXiv CS.AI.