The world of Artificial Intelligence is racing ahead at an astonishing pace. A new model arrives every few months, breaking benchmark records and stirring up headlines with claims of superhuman performance on tests for language, reasoning, and coding. But beneath the buzz, one vital question remains overlooked: how long can these AI systems stay competent when tasked with real-world, multi-step challenges requiring sustained effort?
Sure, today’s AI can ace a math problem or write a few lines of code, but can it tackle a task that takes a human 30 minutes? An hour? A full workday?
This blog explores that very question through a fascinating new lens introduced by researchers at METR: the 50% task completion time horizon. It’s a metric designed to measure not just whether an AI can complete a task, but how long a task it can handle before it starts to fail. In other words, the clock is ticking for AI!
Most AI models today are evaluated using standard benchmarks, and while these tests are useful, they’re often limited to short, isolated tasks. Think about answering a trivia question, translating a sentence, or completing a snippet of code. What they don’t measure well is agency: the ability to plan, execute a sequence of actions, handle tools, recover from errors, and stay focused on a larger goal over time.
But what happens when we ask AI to do something more involved, something that would take a skilled human 15, 30, or even 60 minutes to complete?
That’s exactly the question tackled in a new research paper from the Model Evaluation & Threat Research (METR) team. The paper introduces a bold, intuitive new metric to measure real-world AI performance: the 50% task completion time horizon, a way to track how long an AI can work before it fails.
To move beyond short, synthetic benchmarks, the METR team proposes a much more meaningful way to evaluate AI: the task completion time horizon.
Think of it this way: if an AI model has a time horizon of 30 minutes, it can autonomously complete tasks (like writing code, fixing bugs, or analyzing data) that a human expert would spend 30 minutes on, succeeding half the time.
This shift in evaluation grounds AI performance in human-relevant units of work, making it far easier to understand the real-world value and limitations of today’s most advanced models.
To calculate the 50% task completion time horizon, the METR team designed a robust methodology using three key elements. Let’s understand each one of them:
The first step was creating a comprehensive set of 169 tasks from various domains, such as software engineering, cybersecurity, general reasoning, and machine learning (ML) research. This diverse mix ensures the methodology captures AI’s ability to handle tasks across different complexity levels. Spanning durations from seconds to hours, the suite helps form a well-rounded picture of AI’s capabilities across different task types and durations.
To benchmark AI performance, the team first needed to establish a human baseline or the “ground truth.” Skilled professionals with domain expertise (such as software engineers for coding tasks) were timed performing the tasks, providing essential data on how long humans typically take to complete each task.
Next, the researchers evaluated AI models, configured as autonomous agents, on the same tasks. These models were provided with task descriptions and necessary tools (like code execution environments) to complete the tasks. The performance of models such as GPT-2, DaVinci-002 (GPT-3), gpt-3.5-turbo-instruct, multiple versions of GPT-4, and several iterations of Claude were tracked to assess their success rates.
By comparing AI performance against human baseline completion times, the researchers could determine, for each model, the human task duration at which it achieved a 50% success rate. That duration is the model’s time horizon.
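The crossover described above can be sketched in code. The snippet below is a minimal illustration, not the paper’s actual methodology: it assumes hypothetical per-task records of (human completion time, AI success/failure), fits a simple logistic curve of success probability against log task length by gradient descent, and solves for the time at which predicted success is exactly 50%.

```python
import math

# Hypothetical per-task records: (human_minutes, ai_succeeded).
# These numbers are illustrative, not data from the METR paper.
records = [
    (0.5, 1), (1, 1), (2, 1), (4, 1), (8, 1),
    (8, 0), (15, 1), (15, 0), (30, 1), (30, 0),
    (60, 0), (60, 0), (120, 0), (240, 0),
]

def fit_horizon(records, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a - b * log2(minutes)) by gradient
    descent on the log-loss, then solve for the 50% crossover."""
    a, b = 0.0, 1.0
    for _ in range(steps):
        ga = gb = 0.0
        for minutes, y in records:
            x = math.log2(minutes)
            p = 1 / (1 + math.exp(-(a - b * x)))
            ga += (p - y)          # d(loss)/da
            gb += (p - y) * (-x)   # d(loss)/db
        a -= lr * ga / len(records)
        b -= lr * gb / len(records)
    # P = 0.5 exactly when a - b*log2(t) = 0, i.e. t = 2^(a/b)
    return 2 ** (a / b)

horizon = fit_horizon(records)
print(f"50% time horizon: about {horizon:.1f} human-minutes")
```

Because success is modeled against the logarithm of task length, the fitted curve declines smoothly from near-certain success on second-scale tasks to near-certain failure on hour-scale ones, and the horizon is simply where that curve crosses one half.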
One of the most striking findings in the METR paper is the exponential increase in AI’s ability to complete longer tasks. The 50% task completion time horizon, the paper’s key performance metric, has been doubling approximately every seven months since 2019. This finding emphasizes how quickly AI models are advancing, not just in handling simple tasks but in managing increasingly complex ones.
Exponential growth is not the same as linear improvement. Instead of AI making small, steady gains over time, we are seeing a rapid acceleration in its capabilities. In simple terms, AI systems are evolving quickly. As time passes, they are handling longer and more complex tasks much faster than ever before.
Doubling Time: The term “doubling time” refers to the period over which the length of tasks AI models can complete doubles; per the paper, that period is roughly seven months.
Current Frontier: As of early 2025, the best AI models, such as Claude 3.7 Sonnet, have reached a 50% success rate for tasks that would typically take a skilled human about 50 minutes to complete.
This exponential trend is visualized in the above graph, which highlights how quickly the 50% task completion time horizon has grown. The graph tracks the performance of various models released between 2019 and 2025, showing a consistent upward trend. The exponential fit is remarkably tight, with an R² value of 0.98, indicating that the growth pattern is both consistent and predictable.
From GPT-2 to GPT-4: Back in 2019, models like GPT-2 could only handle tasks that took mere seconds to complete. Fast-forward to 2025, and we see models like GPT-4 and Claude 3.7 Sonnet nearing the one-hour mark for task completion, demonstrating just how much AI’s task horizon has expanded.
This possibility is exciting because it indicates that we may soon see AI models capable of managing tasks that would traditionally take humans several hours or even days. If this trend holds, AI could soon be autonomously handling larger, more time-consuming work, with significant impact on fields such as research, software development, and operations.
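The arithmetic behind such projections is simple compound doubling. The sketch below takes the article’s two stated figures, a roughly 50-minute frontier horizon in early 2025 and a roughly 7-month doubling time, and naively extrapolates forward; the projected values are illustrations of the trend, not claims from the paper.

```python
# Extrapolate the 50% time horizon under continued doubling.
# Starting point and doubling time are taken from the article;
# everything printed is a naive extrapolation.
DOUBLING_MONTHS = 7
START_HORIZON_MIN = 50  # early-2025 frontier horizon, in minutes

def horizon_after(months: float) -> float:
    """Horizon in minutes after `months` of continued doubling."""
    return START_HORIZON_MIN * 2 ** (months / DOUBLING_MONTHS)

WORKDAY_MIN = 480  # 8-hour workday, for readability
for months in (0, 7, 14, 28, 42):
    h = horizon_after(months)
    label = f"{h:.0f} min" if h < WORKDAY_MIN else f"{h / WORKDAY_MIN:.1f} workdays"
    print(f"+{months:2d} months: ~{label}")
```

Under these assumptions the horizon passes a full workday within roughly two to three years, which is exactly why the trend, if it continues, matters so much.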
The answer isn’t just about learning more information; it’s about key advances in AI’s fundamental capabilities. The METR paper identifies three core drivers behind this rapid improvement:
1. Greater Reliability and Error Correction
Newer AI models are less error-prone than their predecessors. Crucially, they are now better at recognizing and correcting mistakes when they happen. This ability is critical for long tasks, which involve multiple steps and the potential for mistakes. Older models might derail after a single error, but today’s models can often get back on track, minimizing disruptions to task completion.
2. Enhanced Logical Reasoning
Complex tasks require more than just following instructions. They demand the ability to break down problems, plan steps logically, and adapt the plan when needed. The latest frontier models exhibit stronger logical reasoning, enabling them to handle intricate, multi-step processes more effectively. This improvement means that AI can tackle challenges requiring careful thought, much like a human expert.
3. Improved Tool Use
Many real-world tasks require AI to interact with external tools, such as searching the web, running code, accessing files, or using APIs. Recent models have shown significant improvement in their ability to use these tools reliably and effectively. This ability is crucial for completing complex tasks that involve many different resources.
In essence, today’s AI models are becoming more robust, adaptable, and skillful. They are no longer mere pattern matchers but autonomous agents capable of maintaining focus and pursuing goals over longer sequences of actions, which is why they can increasingly handle tasks of greater length and complexity.
While AI’s overall progress is impressive, the METR paper highlights several key nuances that shape performance: task length, model performance, task messiness, cost, etc.
AI’s success rate tends to decline as task length increases. For tasks that take only seconds, AI performs well, but as tasks extend into minutes or hours, success rates drop significantly. The 50% task completion time horizon captures the point at which AI succeeds half the time, showing how task duration shapes performance.
Different models show significant variations in their ability to handle tasks. Older models like GPT-3.5 and GPT-4 0314 fall well behind the latest frontier models on long-duration tasks. Additionally, even within the same family, different fine-tuned versions of a model (like variations of Claude 3.5 Sonnet) can exhibit distinct differences in their time horizon, demonstrating the model’s evolution over time.
A significant factor affecting AI’s performance is a task’s ambiguity or messiness. Task messiness refers to how ill-defined, ambiguous, or unexpected a task is.
While AI models are typically more cost-effective than human labor for shorter tasks, the cost ratio changes for longer, more complex tasks.
The authors of the METR paper acknowledge several limitations in their study, which are important to consider when interpreting the findings.
Despite these limitations, the 50% task completion time horizon provides a valuable and interpretable snapshot of AI’s ability to handle complex, time-consuming tasks.
The fact that AI’s ability to handle long-duration tasks is doubling every seven months has far-reaching implications.
The METR paper introduces a new way to measure AI’s progress by focusing on its ability to handle complex, long-duration tasks. The 50% task completion time horizon gives us an intuitive, human-centric way to evaluate AI’s capabilities. The doubling time of approximately seven months highlights the rapid pace at which AI is advancing, particularly in terms of its agency and ability to handle tasks over extended periods.
While there are still uncertainties, the trend is clear: AI is rapidly becoming more capable of tackling the kinds of tasks that define much of human work. Watching how this time horizon evolves will be crucial for understanding the future development of AI, offering a new lens through which we can track the unfolding of AI’s potential.
Note: All images in this article are taken from the METR research paper.
A. This metric measures how long an AI can effectively work on complex, multi-step tasks. It’s specifically defined as the typical time a skilled human would need to complete tasks that the AI can succeed at 50% of the time. It helps gauge AI’s ability to sustain effort grounded in human work durations.
A. Traditional benchmarks often use short, isolated tasks (like answering one question). They fail to measure an AI’s “agency”—its critical ability to plan sequences, use tools, handle errors, and maintain focus over time, which is essential for most real-world work.
A. AI’s ability to manage longer tasks is growing exponentially. According to the research, the 50% task completion time horizon has been doubling approximately every seven months since 2019, showing rapid advancement in tackling more time-consuming challenges.
A. Three core drivers identified are:
1. Greater Reliability/Error Correction: Newer AIs are better at spotting and fixing mistakes, keeping them on track longer.
2. Enhanced Logical Reasoning: Improved ability to break down problems, plan steps, and adapt plans.
3. Improved Tool Use: More effective interaction with necessary tools like code interpreters or web searches.
A. As of early 2025, leading models such as Claude 3.7 Sonnet and advanced versions of GPT-4 have reached a time horizon of about 50 minutes. This means they achieve 50% success on tasks that typically take skilled humans nearly an hour to complete.