The world of Artificial Intelligence is racing ahead at an astonishing pace. A new model arrives every few months, breaking benchmark records and stirring up headlines with claims of superhuman performance on tests for language, reasoning, and coding. But beneath the buzz, one vital question remains overlooked: how long can these AI systems stay competent when tasked with real-world, multi-step challenges requiring sustained effort?
Sure, today’s AI can ace a math problem or write a few lines of code, but can it tackle a task that takes a human 30 minutes? An hour? A full workday?
This blog explores that very question through a fascinating new lens introduced by researchers at METR: the 50% task completion time horizon. It’s a metric designed to measure not just whether an AI can complete a task, but how long a task it can handle before it starts to fail. In other words, the clock is ticking for AI!
Most AI models today are evaluated using standard benchmarks, and while these tests are useful, they’re often limited to short, isolated tasks. Think about answering a trivia question, translating a sentence, or completing a snippet of code. What they don’t measure well is agency: the ability to plan, execute a sequence of actions, handle tools, recover from errors, and stay focused on a larger goal over time.
But what happens when we ask AI to do something more involved, something that would take a skilled human 15, 30, or even 60 minutes to complete?
That’s exactly the question tackled in a new research paper from the Model Evaluation & Threat Research (METR) team. The paper introduces a bold, intuitive new metric to measure real-world AI performance: the 50% task completion time horizon, a way to track how long an AI can work before it fails.
To move beyond short, synthetic benchmarks, the METR team proposes a much more meaningful way to evaluate AI: the task completion time horizon.
Think of it this way: if an AI model has a time horizon of 30 minutes, it can autonomously complete tasks (like writing code, fixing bugs, or analyzing data) that a human expert would spend 30 minutes on, succeeding half the time.
This shift in evaluation grounds AI performance in human-relevant units of work, making it far easier to understand the real-world value and limitations of today’s most advanced models.
To calculate the 50% task completion time horizon, the METR team designed a robust methodology using three key elements. Let’s understand each one of them:
The first step was creating a comprehensive set of 169 tasks from various domains, such as software engineering, cybersecurity, general reasoning, and machine learning (ML) research. This diverse mix ensures the methodology captures AI’s ability to handle tasks across different complexity levels. Spanning durations from seconds to hours, the suite helps form a well-rounded picture of AI’s capabilities across different task types and durations.
To benchmark AI performance, the team first needed to establish a human baseline or the “ground truth.” Skilled professionals with domain expertise (such as software engineers for coding tasks) were timed performing the tasks, providing essential data on how long humans typically take to complete each task.
Next, the researchers evaluated AI models, configured as autonomous agents, on the same tasks. These models were provided with task descriptions and necessary tools (like code execution environments) to complete the tasks. The performance of models such as GPT-2, DaVinci-002 (GPT-3), gpt-3.5-turbo-instruct, multiple versions of GPT-4, and several iterations of Claude were tracked to assess their success rates.
By comparing AI performance against human baseline completion times, the researchers could determine, for each model, the human task duration at which it achieved a 50% success rate. That duration is the model’s time horizon.
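The crossover described above can be sketched in code. The snippet below is a minimal illustration, not the paper’s actual methodology: it assumes hypothetical per-task records of (human completion time, AI success/failure), fits a simple logistic curve of success probability against log task length by gradient descent, and solves for the time at which predicted success is exactly 50%.

```python
import math

# Hypothetical per-task records: (human_minutes, ai_succeeded).
# These numbers are illustrative, not data from the METR paper.
records = [
    (0.5, 1), (1, 1), (2, 1), (4, 1), (8, 1),
    (8, 0), (15, 1), (15, 0), (30, 1), (30, 0),
    (60, 0), (60, 0), (120, 0), (240, 0),
]

def fit_horizon(records, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a - b * log2(minutes)) by gradient
    descent on the log-loss, then solve for the 50% crossover."""
    a, b = 0.0, 1.0
    for _ in range(steps):
        ga = gb = 0.0
        for minutes, y in records:
            x = math.log2(minutes)
            p = 1 / (1 + math.exp(-(a - b * x)))
            ga += (p - y)          # d(loss)/da
            gb += (p - y) * (-x)   # d(loss)/db
        a -= lr * ga / len(records)
        b -= lr * gb / len(records)
    # P = 0.5 exactly when a - b*log2(t) = 0, i.e. t = 2^(a/b)
    return 2 ** (a / b)

horizon = fit_horizon(records)
print(f"50% time horizon: about {horizon:.1f} human-minutes")
```

Because success is modeled against the logarithm of task length, the fitted curve declines smoothly from near-certain success on second-scale tasks to near-certain failure on hour-scale ones, and the horizon is simply where that curve crosses one half.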
One of the most striking findings in the METR paper is the exponential increase in AI’s ability to complete longer tasks. The 50% task completion time horizon, the paper’s key performance metric, has been doubling approximately every seven months since 2019. This finding emphasizes how quickly AI models are advancing, not just in handling simple tasks but in managing increasingly complex ones.
Exponential growth is not the same as linear improvement. Instead of AI making small, steady gains over time, we are seeing a rapid acceleration in its capabilities. In simple terms, AI systems are evolving quickly. As time passes, they are handling longer and more complex tasks much faster than ever before.
Doubling Time: The term “doubling time” refers to the period over which the length of tasks AI models can complete doubles; per the paper, that period is roughly seven months.
Current Frontier: As of early 2025, the best AI models, such as Claude 3.7 Sonnet, have reached a 50% success rate for tasks that would typically take a skilled human about 50 minutes to complete.
This exponential trend is visualized in the above graph, which highlights how quickly the 50% task completion time horizon has grown. The graph tracks the performance of various models released between 2019 and 2025, showing a consistent upward trend. The exponential fit is remarkably tight, with an R² value of 0.98, indicating that the growth pattern is both consistent and predictable.
From GPT-2 to GPT-4: Back in 2019, models like GPT-2 could only handle tasks that took mere seconds to complete. Fast-forward to 2025, and we see models like GPT-4 and Claude 3.7 Sonnet nearing the one-hour mark for task completion, demonstrating just how much AI’s task horizon has expanded.
This possibility is exciting because it indicates that we may soon see AI models capable of managing tasks that would traditionally take humans several hours or even days. If this trend holds, AI could soon be autonomously handling larger, more time-consuming work, with significant impact on fields such as research, software development, and operations.
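The arithmetic behind such projections is simple compound doubling. The sketch below takes the article’s two stated figures, a roughly 50-minute frontier horizon in early 2025 and a roughly 7-month doubling time, and naively extrapolates forward; the projected values are illustrations of the trend, not claims from the paper.

```python
# Extrapolate the 50% time horizon under continued doubling.
# Starting point and doubling time are taken from the article;
# everything printed is a naive extrapolation.
DOUBLING_MONTHS = 7
START_HORIZON_MIN = 50  # early-2025 frontier horizon, in minutes

def horizon_after(months: float) -> float:
    """Horizon in minutes after `months` of continued doubling."""
    return START_HORIZON_MIN * 2 ** (months / DOUBLING_MONTHS)

WORKDAY_MIN = 480  # 8-hour workday, for readability
for months in (0, 7, 14, 28, 42):
    h = horizon_after(months)
    label = f"{h:.0f} min" if h < WORKDAY_MIN else f"{h / WORKDAY_MIN:.1f} workdays"
    print(f"+{months:2d} months: ~{label}")
```

Under these assumptions the horizon passes a full workday within roughly two to three years, which is exactly why the trend, if it continues, matters so much.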
The answer isn’t just about learning more information; it’s about key advances in AI’s fundamental capabilities. The METR paper identifies three core drivers behind this rapid improvement:
1. Greater Reliability and Error Correction
Newer AI models are less error-prone than their predecessors. Crucially, they are now better at recognizing and correcting mistakes when they happen. This ability is critical for long tasks, which involve multiple steps and the potential for mistakes. Older models might derail after a single error, but today’s models can often get back on track, minimizing disruptions to task completion.
2. Enhanced Logical Reasoning
Complex tasks require more than just following instructions. They demand the ability to break down problems, plan steps logically, and adapt the plan when needed. The latest frontier models exhibit stronger logical reasoning, enabling them to handle intricate, multi-step processes more effectively. This improvement means that AI can tackle challenges requiring careful thought, much like a human expert.
3. Improved Tool Use
Many real-world tasks require AI to interact with external tools, such as searching the web, running code, accessing files, or using APIs. Recent models have shown significant improvement in their ability to use these tools reliably and effectively. This ability is crucial for completing complex tasks that involve many different resources.
In essence, today’s AI models are becoming more robust, adaptable, and skillful. They are no longer mere pattern matchers but autonomous agents capable of maintaining focus and pursuing goals over longer sequences of actions, which is why they can increasingly handle tasks of greater length and complexity.
While AI’s overall progress is impressive, the METR paper highlights several key nuances that shape performance: task length, model performance, task messiness, cost, etc.
AI’s success rate tends to decline as task length increases. For tasks that take only seconds, AI performs well, but as tasks extend into minutes or hours, success rates drop significantly. The 50% task completion time horizon captures the point at which AI succeeds half the time, showing how task duration shapes performance.
Different models show significant variations in their ability to handle tasks. Older models like GPT-3.5 and GPT-4 0314 fall well behind the latest frontier models on long-duration tasks. Additionally, even within the same family, different fine-tuned versions of a model (like variations of Claude 3.5 Sonnet) can exhibit distinct differences in their time horizon, demonstrating the model’s evolution over time.
A significant factor affecting AI’s performance is a task’s ambiguity or messiness. Task messiness refers to how ill-defined, ambiguous, or unexpected a task is.
While AI models are typically more cost-effective than human labor for shorter tasks, the cost ratio changes for longer, more complex tasks.
The authors of the METR paper acknowledge several limitations in their study, which are important to consider when interpreting the findings.
Despite these limitations, the 50% task completion time horizon provides a valuable and interpretable snapshot of AI’s ability to handle complex, time-consuming tasks.
The fact that AI’s ability to handle long-duration tasks is doubling every seven months has far-reaching implications.
The METR paper introduces a new way to measure AI’s progress by focusing on its ability to handle complex, long-duration tasks. The 50% task completion time horizon gives us an intuitive, human-centric way to evaluate AI’s capabilities. The doubling time of approximately seven months highlights the rapid pace at which AI is advancing, particularly in terms of its agency and ability to handle tasks over extended periods.
While there are still uncertainties, the trend is clear: AI is rapidly becoming more capable of tackling the kinds of tasks that define much of human work. Watching how this time horizon evolves will be crucial for understanding the future development of AI, offering a new lens through which we can track the unfolding of AI’s potential.
Note: All images in this article are taken from the METR research paper.
A. This metric measures how long an AI can effectively work on complex, multi-step tasks. It’s specifically defined as the typical time a skilled human would need to complete tasks that the AI can succeed at 50% of the time. It helps gauge AI’s ability to sustain effort grounded in human work durations.
A. Traditional benchmarks often use short, isolated tasks (like answering one question). They fail to measure an AI’s “agency”—its critical ability to plan sequences, use tools, handle errors, and maintain focus over time, which is essential for most real-world work.
A. AI’s ability to manage longer tasks is growing exponentially. According to the research, the 50% task completion time horizon has been doubling approximately every seven months since 2019, showing rapid advancement in tackling more time-consuming challenges.
A. Three core drivers identified are:
1. Greater Reliability/Error Correction: Newer AIs are better at spotting and fixing mistakes, keeping them on track longer.
2. Enhanced Logical Reasoning: Improved ability to break down problems, plan steps, and adapt plans.
3. Improved Tool Use: More effective interaction with necessary tools like code interpreters or web searches.
A. As of early 2025, leading models such as Claude 3.7 Sonnet and advanced versions of GPT-4 have reached a time horizon of about 50 minutes. This means they achieve 50% success on tasks that typically take skilled humans nearly an hour to complete.