AI’s Time Horizon: Can AI Complete Long Tasks?

Anu Madan | Last Updated: 02 Apr, 2025

The world of Artificial Intelligence is racing ahead at an astonishing pace. A new model arrives every few months, breaking benchmark records and stirring up headlines with claims of superhuman performance on tests for language, reasoning, and coding. But beneath the buzz, one vital question remains overlooked: how long can these AI systems stay competent when tasked with real-world, multi-step challenges requiring sustained effort?

Sure, today’s AI can ace a math problem or write a few lines of code, but can it tackle a task that takes a human 30 minutes? An hour? A full workday?

This blog explores that very question through a fascinating new lens introduced by researchers at METR: the 50% task completion time horizon. It’s a metric that measures not just whether AI can complete a task, but how long a task (in human working time) the AI can handle before it starts to fail. In other words, the clock is ticking for AI!

Why Do Traditional Benchmarks Fall Short?

Most AI models today are evaluated using standard benchmarks, and while these tests are useful, they’re often limited to short, isolated tasks. Think about answering a trivia question, translating a sentence, or completing a snippet of code. What they don’t measure well is agency: the ability to plan, execute a sequence of actions, handle tools, recover from errors, and stay focused on a larger goal over time.

But what happens when we ask AI to do something more involved, something that would take a skilled human 15, 30, or even 60 minutes to complete?

That’s exactly the question tackled in a new research paper from the Model Evaluation & Threat Research (METR) team. The paper introduces a bold, intuitive new metric to measure real-world AI performance: the 50% task completion time horizon, a way to track how long an AI can work before it fails.

Introducing AI’s Time Horizon: A Better Way to Measure Real-World Performance

To move beyond short, synthetic benchmarks, the METR team proposes a much more meaningful way to evaluate AI: the task completion time horizon.

  • Rather than simply asking if an AI can succeed at a task, this metric asks: how long can a task be (measured by the time a skilled human would take to complete it) before the AI starts to fail?
  • They define the 50% task completion time horizon as “the time it takes a skilled human to complete tasks that AI can succeed at 50% of the time.”
METR's "50% task completion time horizon" metric checks if an AI model can handle long tasks & monitors its performance over time

Think of it this way: if an AI model has a time horizon of 30 minutes, that means it can autonomously complete tasks like writing code, fixing bugs, or analyzing data, that a human expert would spend 30 minutes on and succeed half the time.

This shift in evaluation grounds AI performance in human-relevant units of work, making it far easier to understand the real-world value and limitations of today’s most advanced models.

Also Read: 12 Important Model Evaluation Metrics for Machine Learning Everyone Should Know

Building the Measuring Stick: How AI’s Task Horizon Is Calculated

To calculate the 50% task completion time horizon, the METR team designed a robust methodology using three key elements. Let’s understand each one of them:

1. The Diverse Task Suite: Capturing a Range of Human Work

The first step was creating a comprehensive set of 169 tasks from various domains, such as software engineering, cybersecurity, general reasoning, and machine learning (ML) research. This diverse mix ensures the methodology captures AI’s ability to handle tasks across different complexity levels:

  • HCAST (Human-Calibrated Autonomy Software Tasks): A set of 97 tasks requiring agency, with human completion times ranging from about 1 minute to 30 hours. These tasks simulate real-world situations where the agent needs to plan steps, interact with tools (like code interpreters or file systems), and adjust its approach as needed.
  • SWAA (Software Atomic Actions) Suite: A collection of 66 quick, single-step software engineering tasks, each taking humans between 1 and 30 seconds. These tasks anchor the lower end of the time scale.
  • RE-Bench: A set of 7 complex research engineering tasks, each taking humans about 8 hours. These challenges test AI capabilities at the longer end of the time horizon.

This diverse suite, spanning tasks from seconds to hours, helps form a well-rounded picture of AI’s capabilities across different task types and durations.

2. Timing the Humans: Establishing a Ground Truth

To benchmark AI performance, the team first needed to establish a human baseline or the “ground truth.” Skilled professionals with domain expertise (such as software engineers for coding tasks) were timed performing the tasks, providing essential data on how long humans typically take to complete each task.

3. Evaluating the AI Agents: Testing Real-World Performance

Next, the researchers evaluated AI models, configured as autonomous agents, on the same tasks. These models were provided with task descriptions and necessary tools (like code execution environments) to complete the tasks. The performance of models such as GPT-2, DaVinci-002 (GPT-3), gpt-3.5-turbo-instruct, multiple versions of GPT-4, and several iterations of Claude were tracked to assess their success rates.

By comparing AI performance against human baseline completion times, the researchers could determine, for each model, the human task duration at which it achieved a 50% success rate: that duration is the model’s time horizon.
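The idea behind this last step can be sketched in a few lines of code: fit a logistic curve of success probability against the log of human task time, then read off where that curve crosses 50%. The sketch below is a minimal illustration under that assumption, using made-up (human_minutes, succeeded) observations rather than METR's actual data:

```python
import math

# Hypothetical observations for one model: (human minutes, 1 = success).
# Illustrative numbers only -- not METR's actual data.
data = [(0.5, 1), (1, 1), (2, 1), (4, 1), (8, 1), (8, 0),
        (15, 1), (15, 0), (30, 1), (30, 0), (60, 0), (120, 0), (240, 0)]

def fit_logistic(data, lr=0.1, steps=5000):
    """Fit p(success) = sigmoid(a + b * log2(minutes)) by gradient descent."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for t, y in data:
            x = math.log2(t)
            p = 1 / (1 + math.exp(-(a + b * x)))
            ga += p - y          # gradient of log-loss w.r.t. intercept
            gb += (p - y) * x    # gradient w.r.t. slope
        a -= lr * ga / len(data)
        b -= lr * gb / len(data)
    return a, b

a, b = fit_logistic(data)
# p = 0.5 exactly where a + b * log2(t) = 0, so the 50% time horizon is:
horizon_minutes = 2 ** (-a / b)
print(f"50% time horizon ~ {horizon_minutes:.1f} human-minutes")
```

With this toy data the fitted slope is negative (success falls as tasks get longer) and the horizon lands in the mixed-outcome region between a few minutes and half an hour, which is the behavior the metric is designed to capture.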

The Exponential Growth of AI Time Horizons: Doubling Every 7 Months

One of the most striking findings in the METR paper is the exponential increase in AI’s ability to complete longer tasks. The 50% task completion time horizon, the paper’s key metric, has been doubling approximately every seven months since 2019. This finding emphasizes how quickly AI models are advancing, not just in handling simple tasks but in managing increasingly complex ones.

What Does Exponential Growth Mean for AI?

Exponential growth is not the same as linear improvement. Instead of AI making small, steady gains over time, each year’s improvement is larger than the last. In practical terms, AI systems are handling longer and more complex tasks far sooner than linear progress would predict.

Exponential growth of AI

Doubling Time: The term “doubling time” refers to how long it takes for the length of tasks AI models can complete to double.

  • Over the past six years, this period has been consistently about seven months.
  • In other words, approximately every seven months, the length of tasks that AI models can handle with 50% success doubles, allowing AI to take on more challenging work.

Current Frontier: As of early 2025, the best AI models, such as Claude 3.7 Sonnet, have reached a 50% success rate for tasks that would typically take a skilled human about 50 minutes to complete.

  • This means that AI can now autonomously handle tasks that, just a few years ago, would have been too complex for any AI to manage reliably.
  • The key point here is that AI can succeed in these tasks about half of the time, offering real-world practical utility in fields like software engineering, cybersecurity, and research.
METR's "50% task completion time horizon" metric

This exponential trend is visualized in the above graph, which highlights how quickly the 50% task completion time horizon has grown. The graph tracks the performance of various models released between 2019 and 2025, showing a consistent upward trend. The log-linear fit has an R² value of 0.98, indicating that the growth pattern has been remarkably consistent and, so far, predictable.

AI’s Growth Over Time

From GPT-2 to GPT-4: Back in 2019, models like GPT-2 could only handle tasks that took mere seconds to complete. Fast-forward to 2025, and we see models like GPT-4 and Claude 3.7 Sonnet nearing the one-hour mark for task completion, demonstrating just how much AI’s task horizon has expanded.

AI's growth over time
  • Interestingly, the paper also points out that this exponential growth may be accelerating even further.
  • The doubling time seems to have shortened between 2023 and 2024, suggesting that AI’s ability to handle longer tasks might continue to grow at a faster pace.
  • However, the paper also notes that more data points are needed to fully confirm whether this acceleration is a sustained trend or just a temporary spike.

This possibility is exciting because it indicates that we may soon see AI models capable of managing tasks that would traditionally take several hours or even days for humans. If this trend holds, it would mean that AI could soon be autonomously handling more significant, time-consuming tasks, significantly impacting industries such as research, development, and operations.
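Under the paper’s trend, the horizon grows as h(t) = h₀ · 2^(months/7). The sketch below runs that extrapolation from the reported ~50-minute early-2025 frontier; whether the trend actually holds is, of course, an assumption:

```python
def projected_horizon_minutes(current_minutes, months_ahead, doubling_months=7):
    """Extrapolate the 50% time horizon, assuming the ~7-month doubling holds."""
    return current_minutes * 2 ** (months_ahead / doubling_months)

# Starting from the ~50-minute horizon reported for early 2025:
for months in (7, 14, 28):
    h = projected_horizon_minutes(50, months)
    print(f"+{months} months: ~{h:.0f} minutes (~{h / 60:.1f} hours)")
```

One doubling (+7 months) takes the horizon to ~100 minutes; four doublings (+28 months) would already reach ~800 minutes, i.e. multi-day human work, which is exactly the kind of extrapolation the paper cautions is uncertain.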

How is AI Beating the Clock?

The answer isn’t just about learning more information; it’s about key advances in AI’s fundamental capabilities. The METR paper identifies three core drivers behind this rapid improvement:

1. Greater Reliability and Error Correction

Newer AI models are less error-prone than their predecessors. Crucially, they are now better at recognizing and correcting mistakes when they happen. This ability is critical for long tasks, which involve multiple steps and the potential for mistakes. Older models might derail after a single error, but today’s models can often get back on track, minimizing disruptions to task completion.

2. Enhanced Logical Reasoning

Complex tasks require more than just following instructions. They demand the ability to break down problems, plan steps logically, and adapt the plan when needed. The latest frontier models exhibit stronger logical reasoning, enabling them to handle intricate, multi-step processes more effectively. This improvement means that AI can tackle challenges requiring careful thought, much like a human expert.

3. Improved Tool Use

Many real-world tasks require AI to interact with external tools, such as searching the web, running code, accessing files, or using APIs. Recent models have shown significant improvement in their ability to use these tools reliably and effectively. This ability is crucial for completing complex tasks that involve many different resources.

In essence, today’s AI models are becoming more robust, adaptable, and skillful. They are not merely pattern matchers anymore but autonomous agents capable of maintaining focus and pursuing goals over longer sequences of actions, which is why they are increasingly able to handle tasks of greater length and complexity.

Nuances in AI’s Task Performance

While AI’s overall progress is impressive, the METR paper highlights several key nuances that shape performance: task length, model performance, task messiness, cost, etc.

1. Task Length vs. Success Rate

AI’s success rate tends to decline as the task length increases. For tasks that take only seconds, AI can perform well, but as tasks extend into minutes or hours, success rates drop significantly. The 50% task completion time horizon captures the point where AI can complete tasks half the time and shows how task duration impacts performance.

2. Variations in Model Performance

Different models show significant variations in their ability to handle tasks. For example:

  • Claude 3.7 Sonnet: A newer model by Anthropic, Claude 3.7 Sonnet is known for its strong reasoning and ability to handle complex, multi-step tasks more consistently than its predecessors.
  • GPT-4o: This version of OpenAI’s GPT-4 is an upgraded, more efficient model that excels at handling longer tasks with improved coherence and reduced error rates.
  • Claude 3 Opus: This version of Claude builds on its predecessors, showing a marked improvement in task completion over extended periods, with more sophisticated understanding and reasoning capabilities.

In comparison, older models like GPT-3.5 and GPT-4 0314 fall behind in handling long-duration tasks. Additionally, even within the same family, different fine-tuned versions of a model (like variations of Claude 3.5 Sonnet) can exhibit distinct differences in their time horizon, demonstrating the model’s evolution over time.

3. Task “Messiness” and AI Performance

A significant factor affecting AI’s performance is a task’s ambiguity or messiness. Task messiness refers to how ill-defined, ambiguous, or unexpected a task is.

task messiness and performance
  • The paper shows that tasks with high messiness scores tend to result in lower AI performance, especially for longer-duration tasks.
  • Tasks requiring more interpretation or dealing with vague requirements are harder for AI, causing slower improvements in these areas compared to well-defined tasks.
  • This indicates that robustness to ambiguity is a critical area for further AI development.

4. The Cost of Running AI Models

While AI models are typically more cost-effective than human labor for shorter tasks, the cost ratio changes for longer, more complex tasks.

  • The computational cost of running these AI agents increases as the tasks become longer and more involved, particularly when the models require multiple attempts to complete the task.
  • For many tasks, AI is still significantly cheaper than human work, but this difference diminishes as the tasks become more intricate and time-consuming.
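One way to see why the gap narrows: if success rates fall with task length, the expected number of retries (and hence compute cost) grows. The sketch below uses entirely hypothetical prices and a toy success-rate curve pinned to a 50-minute horizon; none of these numbers come from the paper:

```python
def success_rate(task_minutes, horizon_minutes=50):
    """Toy model: 50% success at the horizon, declining for longer tasks."""
    return 1 / (1 + task_minutes / horizon_minutes)

def expected_ai_cost(task_minutes, dollars_per_minute=0.10):
    """Expected cost of retrying the agent until it succeeds.

    With independent attempts, expected attempts = 1 / p (geometric
    distribution), so cost scales with both task length and failure rate.
    """
    attempts = 1 / success_rate(task_minutes)
    return task_minutes * dollars_per_minute * attempts

def human_cost(task_minutes, hourly_rate=100):
    return task_minutes * hourly_rate / 60

for minutes in (5, 50, 500):
    ratio = human_cost(minutes) / expected_ai_cost(minutes)
    print(f"{minutes:>4}-minute task: human costs {ratio:.1f}x the AI agent")
```

With these illustrative numbers, the AI’s cost advantage shrinks from roughly 15x on a 5-minute task to under 2x on a 500-minute one, mirroring the paper’s qualitative observation that the economics tighten as tasks get longer.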

Limitations in AI Time Horizon Research

The authors of the METR paper acknowledge several limitations in their study, which are important to consider when interpreting the findings:

  1. Task Set Specificity: The study’s results are based on a specific set of 169 tasks. While these tasks are diverse, they may not fully represent all real-world scenarios. For example, tasks requiring physical interaction, emotional understanding, or creative thinking might yield different results.
  2. Human Baseline Variation: Human performance varies from person to person. Although the researchers used experts and averaged completion times, these baselines are still estimates, which could introduce variability in the results.
  3. Agent Setup: The configuration of the AI models, such as prompting strategy and tool access, can influence performance. Different setups might produce different results, making it essential to account for how models are implemented during testing.
  4. Extrapolation Uncertainty: Although the trend of AI’s improvement is clear, predicting future growth is inherently uncertain. Factors like data limitations, potential algorithmic breakthroughs, or unforeseen bottlenecks could alter the trajectory.
  5. Definition of “Success”: The study uses a binary success/failure criterion, which may not capture partial successes or solutions that are mostly correct but contain minor flaws.

Despite these limitations, the 50% task completion time horizon provides a valuable and interpretable snapshot of AI’s ability to handle complex, time-consuming tasks.

What Does AI’s Rapid Growth Mean for the World?

The fact that AI’s ability to handle long-duration tasks is doubling every 7 months has far-reaching implications:

  1. Economic Impact: AI’s improving ability to automate long tasks will reduce labor costs and increase efficiency, enabling automation of tasks that currently take hours, potentially spanning entire workflows.
  2. AI Safety and Alignment: As AI handles more complex, long-term tasks, aligning these systems with human values becomes critical to ensure safe and ethical autonomy.
  3. Benchmarking the Future: The time horizon metric offers a new way to assess AI’s progress by focusing on task duration and agency, helping evaluate its real-world capabilities.
  4. Near-Term AI Capabilities: While AGI is not yet realized, AI systems capable of handling multi-hour tasks are emerging quickly, signaling the potential for highly useful, disruptive AI capabilities.

Conclusion

The METR paper introduces a new way to measure AI’s progress by focusing on its ability to handle complex, long-duration tasks. The 50% task completion time horizon gives us an intuitive, human-centric way to evaluate AI’s capabilities. The doubling time of approximately seven months highlights the rapid pace at which AI is advancing, particularly in terms of its agency and ability to handle tasks over extended periods.

While there are still uncertainties, the trend is clear: AI is rapidly becoming more capable of tackling the kinds of tasks that define much of human work. Watching how this time horizon evolves will be crucial for understanding the future development of AI, offering a new lens through which we can track the unfolding of AI’s potential.

Note: All images in this article are taken from the METR research paper.

Frequently Asked Questions

Q1. What is the “50% task completion time horizon” for AI?

A. This metric measures how long an AI can effectively work on complex, multi-step tasks. It’s specifically defined as the typical time a skilled human would need to complete tasks that the AI can succeed at 50% of the time. It helps gauge AI’s ability to sustain effort grounded in human work durations.

Q2. Why are traditional AI benchmarks not enough to measure real-world capabilities?

A. Traditional benchmarks often use short, isolated tasks (like answering one question). They fail to measure an AI’s “agency”—its critical ability to plan sequences, use tools, handle errors, and maintain focus over time, which is essential for most real-world work.

Q3. How quickly is AI improving at handling longer tasks?

A. AI’s ability to manage longer tasks is growing exponentially. According to the research, the 50% task completion time horizon has been doubling approximately every seven months since 2019, showing rapid advancement in tackling more time-consuming challenges.

Q4. What factors are driving this rapid improvement in AI’s task duration capability?

A. Three core drivers identified are:
1. Greater Reliability/Error Correction: Newer AIs are better at spotting and fixing mistakes, keeping them on track longer.
2. Enhanced Logical Reasoning: Improved ability to break down problems, plan steps, and adapt plans.
3. Improved Tool Use: More effective interaction with necessary tools like code interpreters or web searches.

Q5. What is the current capability of the best AI models in terms of task duration?

A. As of early 2025, leading models such as Claude 3.7 Sonnet and advanced versions of GPT-4 have reached a time horizon of about 50 minutes. This means they achieve 50% success on tasks that typically take skilled humans nearly an hour to complete.

Anu Madan is an expert in instructional design, content writing, and B2B marketing, with a talent for transforming complex ideas into impactful narratives. With her focus on Generative AI, she crafts insightful, innovative content that educates, inspires, and drives meaningful engagement.
