Since last June, Anthropic has ruled the coding benchmarks with Claude 3.5 Sonnet. Now, with its latest model, Claude 3.7 Sonnet, it is here to shake up the world of generative AI even more. Claude 3.7 Sonnet, much like Grok 3, which was released a week earlier, comes with advanced reasoning, mathematical, and coding abilities. Both of these new models are pitched as more powerful and capable than existing LLMs such as o3-mini, DeepSeek-R1, and Gemini 2.0 Flash. In this blog, I will test Claude 3.7 Sonnet’s coding abilities against Grok 3 to see which LLM is the better coding sidekick. So let’s start with our Claude 3.7 Sonnet vs Grok 3 comparison.
Claude 3.7 Sonnet is Anthropic’s most advanced AI model, featuring hybrid reasoning, state-of-the-art coding capabilities, and an extended 200K-token context window. It excels in content generation, data analysis, and complex planning, making it a powerful tool for both developers and enterprises. Succeeding Claude 3.5 Sonnet, a model that beat OpenAI’s o1 on the recent SWE-Lancer benchmark, Claude 3.7 is already being labelled the most intelligent coding and general-purpose chatbot.
Claude 3.7 Sonnet is available via the Anthropic API, Amazon Bedrock, and Google Vertex AI, with pricing starting at $3 per million input tokens. Claude 3.7 Sonnet and its “extended thinking” feature can be accessed by paid users for $18 per month, although everyone can try the base model a limited number of times per day under the free plan.
Learn More: Claude Sonnet 3.7: Performance, How to Access and More
Grok 3 is the latest AI model from Elon Musk’s xAI, succeeding Grok 2 and offering cutting-edge capabilities, trained on a cluster of over 100,000 GPUs. It is designed for enhanced reasoning, creative content generation, deep research, and advanced multimodal interactions, making it yet another powerful tool for both individual users and businesses.
Grok 3 is a premium model, available through X’s Premium+ subscription or the SuperGrok subscription, for almost $40 per month. However, for a limited period, it is free to use for all users on the X platform and the Grok website.
There are 2 ways to access Grok 3: through the X platform (with a Premium+ subscription) or through the Grok website (with a SuperGrok subscription).
Learn More: Grok 3 is Here! And What It Can Do Will Blow Your Mind!
Both Claude 3.7 Sonnet and Grok 3, being the latest and most advanced models from their respective companies, boast exceptional coding skills. So let’s put these models to the test and find out if they live up to the hype. I’ll be testing both models on the following coding tasks: debugging, game simulation, data analysis, refactoring, and image masking.
At the end of each task, I’ll share my review of how both models performed and pick a winner based on their outputs. Let’s start.
Prompt: “Find error/errors in the following code, explain them to me and share the corrected code”
Input Code:
```python
import requests
import os
import json

bearer_token = "<my bearer token hear>"
# To set your environment variables in your terminal run the following line:
# export 'BEARER_TOKEN'='<your_bearer_token>'
os.environ["BEARER_TOKEN"] =bearer_token

search_url = "https://api.twitter.com/2/spaces/search"
search_term = 'AI'  # Replace this value with your search term

# Optional params: host_ids,conversation_controls,created_at,creator_id,id,invited_user_ids,is_ticketed,lang,media_key,participants,scheduled_start,speaker_ids,started_at,state,title,updated_at
query_params = {'query': search_term, 'space.fields': 'title,created_at', 'expansions': 'creator_id'}

def create_headers(bearer_token):
    headers = {
        "Authorization": "Bearer {}".format(bearer_token),
        "User-Agent": "v2SpacesSearchPython"
    }
    return headers

def connect_to_endpoint(url, headers, params):
    response = requests.request("GET", search_url, headers=headers, params=params)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()

def main():
    headers = create_headers(bearer_token)
    json_response = connect_to_endpoint(search_url, headers, query_params)
    print(json.dumps(json_response, indent=4, sort_keys=True))

if __name__ == "__main__":
    main()
```
By Claude 3.7 Sonnet
By Grok 3
| Models | Claude 3.7 Sonnet | Grok 3 |
|---|---|---|
| Response quality | The model lists all 5 errors it found in a simple yet brief way. It then gives the corrected Python code and ends with a detailed explanation of all the changes made. | The model points out all 5 errors and explains them in quite simple language. It then gives the corrected code, followed by additional notes and some tips on how to run it. |
| Code quality | The new code ran seamlessly, without any errors. | The code it generated did not run, as it still had errors. |
Both models identified the errors correctly and explained them well. And although both made corrections, it was Claude 3.7’s output that ran perfectly, while Grok 3’s code still had errors. Claude 3.7 Sonnet’s output is, in fact, a strong indicator of the model’s improvement on the important IFEval instruction-following benchmark, on which it scores higher than any other LLM.
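For reference, here is a minimal sketch of what a corrected script might look like. It fixes the issues visible in the input: the placeholder typo (“hear”), the missing indentation, and connect_to_endpoint ignoring its url parameter in favour of the global search_url. The bearer token placeholder is still yours to fill in.

```python
import requests
import os
import json

bearer_token = "<your_bearer_token_here>"  # placeholder typo fixed; insert your token
os.environ["BEARER_TOKEN"] = bearer_token

search_url = "https://api.twitter.com/2/spaces/search"
search_term = 'AI'  # Replace this value with your search term
query_params = {'query': search_term, 'space.fields': 'title,created_at', 'expansions': 'creator_id'}

def create_headers(bearer_token):
    return {
        "Authorization": "Bearer {}".format(bearer_token),
        "User-Agent": "v2SpacesSearchPython",
    }

def connect_to_endpoint(url, headers, params):
    # Use the url parameter instead of the global search_url
    response = requests.get(url, headers=headers, params=params)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()

def main():
    headers = create_headers(bearer_token)
    json_response = connect_to_endpoint(search_url, headers, query_params)
    print(json.dumps(json_response, indent=4, sort_keys=True))

if __name__ == "__main__":
    main()
```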
Prompt: “Create a ragdoll physics simulation using Matter.js and HTML5 Canvas in JavaScript. The simulation features a stick-figure-like humanoid composed of rigid bodies connected by joints, standing on a flat surface. When a force is applied, the ragdoll falls, tumbles, and reacts realistically to gravity and obstacles. Implement mouse interactions to push the ragdoll, a reset button, and a slow-motion mode for detailed physics observation.”
(Source: https://x.com/pandeyparul/status/1894209299716739200?s=46)
By Claude 3.7 Sonnet
By Grok 3
| Models | Claude 3.7 Sonnet | Grok 3 |
|---|---|---|
| Response quality | The model starts by mentioning all the libraries it will use and then generates detailed code for the visualisation. At the end, it provides a comprehensive breakdown of the entire code, including the structure of the doll, its features, and all possible motions. | The model gives detailed code for the visualisation. It starts with a brief introduction and lists all the features it will include in the final output. The code it provides is simple yet enhanced, and it adds explanations at the end covering the doll’s physics, features, interactions, and more. |
| Ease of use | The output renders right within the interface, making the experience more seamless. | You have to copy the entire output and open it in a browser to see the generated visualisation. |
| Code quality | The doll had the entire range of motion that was expected. The model also added an extra feature for adjusting the simulation speed. | It delivered the features we asked for, and the doll it generated was impressive too. But at times, the doll kept vibrating even when no force was acting on it. |
Both models generated stunning outputs. However, the additional features and better motion control of Claude 3.7 Sonnet’s ragdoll make it the winner.
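The prompt targets JavaScript with Matter.js, so neither model’s output is reproduced here. To keep this article’s examples in Python, below is a rough headless sketch of the same core ideas (rigid bodies, a joint, gravity, a floor, an applied force) using the pymunk physics library. It is an illustration under those assumptions, not what either model produced.

```python
# A rough headless equivalent of the ragdoll prompt, using pymunk
# (an assumption for illustration; not either model's Matter.js output).
import pymunk

space = pymunk.Space()
space.gravity = (0, 900)  # y increases downward in this setup, like Matter.js

# Flat floor as a static segment
floor = pymunk.Segment(space.static_body, (0, 500), (600, 500), 5)
floor.friction = 0.8
space.add(floor)

# Torso: a rigid box body
torso = pymunk.Body(10, pymunk.moment_for_box(10, (20, 60)))
torso.position = (300, 400)
torso_shape = pymunk.Poly.create_box(torso, (20, 60))
torso_shape.friction = 0.5
space.add(torso, torso_shape)

# Head: a circle body joined to the torso, analogous to a Matter.js constraint
head = pymunk.Body(2, pymunk.moment_for_circle(2, 0, 12))
head.position = (300, 355)
head_shape = pymunk.Circle(head, 12)
head_shape.friction = 0.5
space.add(head, head_shape)
neck = pymunk.PivotJoint(torso, head, (300, 370))  # joint at the neck
space.add(neck)

# "Push the ragdoll": an impulse standing in for the mouse interaction
torso.apply_impulse_at_local_point((500, 0))

# Step the simulation; halving dt would give a slow-motion effect
for _ in range(120):
    space.step(1 / 60)
print("Torso came to rest near:", torso.position)
```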
Prompt: “You are a data analyst. Analyse the following data, give key insights, and create graphs and plots to help me visualise the trends in the data”
By Claude 3.7 Sonnet
By Grok 3
| Models | Claude 3.7 Sonnet | Grok 3 |
|---|---|---|
| Response quality | The model gave several key insights from the data, including outcome distribution, trends, and health metrics. | The model first gave the code for all the plots it thought were relevant for the given dataset and then shared key insights from the analysis. |
| Ease of use | It rendered the diabetes analysis dashboard and scatter plots right within the chat, making it quite simple to visualise the trends. | The Python code it generated for the various plots ran into errors. |
| Explanation | Based on the plots, it gave its key findings on the overall health patterns. | It did give explanations for all the visualisations it had created; however, I was unable to see them because of the incorrect code given by the model. |
Both models did a good job of explaining the data and the key insights drawn from it. But Claude 3.7 Sonnet knocked the ball out of the park with the dashboards it created. The code for the various plots generated by Grok 3, on the other hand, didn’t work.
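To give a sense of the kind of analysis the prompt asks for, here is a minimal sketch using pandas and matplotlib. The file name diabetes.csv and the column names (Glucose, BMI, Outcome) are assumptions based on the diabetes dataset used above; adjust them to your data.

```python
# A minimal sketch of the requested analysis, assuming a diabetes dataset
# saved as "diabetes.csv" with columns Glucose, BMI, and a binary Outcome
# (hypothetical names; adjust to your data).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("diabetes.csv")

# Key insights: outcome distribution and summary statistics per outcome group
print(df["Outcome"].value_counts(normalize=True))
print(df.groupby("Outcome")[["Glucose", "BMI"]].mean())

# Visualise the trends: outcome counts and a Glucose-vs-BMI scatter by outcome
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.bar(["Non-diabetic", "Diabetic"], df["Outcome"].value_counts().sort_index())
ax1.set_title("Outcome distribution")
ax2.scatter(df["Glucose"], df["BMI"], c=df["Outcome"], cmap="coolwarm", alpha=0.6)
ax2.set_xlabel("Glucose")
ax2.set_ylabel("BMI")
ax2.set_title("Glucose vs BMI by outcome")
plt.tight_layout()
plt.show()
```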
Prompt: “Refactor this Code”
Input Code:
```python
def find_max_value(lst):
    max_value = None
    if lst:
        max_value = lst[0]
        for i in range(len(lst)):
            if lst[i] > max_value:
                max_value = lst[i]
                print(f"New max found: {max_value}")
        if max_value != None:
            return max_value
        else:
            print("Max value was not found")
            return None
    else:
        print("List is empty")
        return None

def check_for_max(lst):
    if len(lst) == 0:
        return None
    else:
        if lst[0] > lst[1]:
            return lst[0]
        else:
            return lst[1]

def check_empty(lst):
    if len(lst) == 0:
        return True
    else:
        return False

# Driver code (calling all functions for messy logic)
numbers = [10, 50, 20, 35, 5, 100, 0, -1]
result = find_max_value(numbers)
print(f"Max value found: {result}")
```
By Claude 3.7 Sonnet
```python
def find_max_value(lst):
    """
    Find the maximum value in a list.
    Returns None if the list is empty.
    """
    if not lst:
        print("List is empty")
        return None

    max_value = lst[0]
    for value in lst[1:]:  # Start from second element since we already set max_value to the first
        if value > max_value:
            max_value = value
            print(f"New max found: {max_value}")

    return max_value

def check_for_max(lst):
    """
    Return the maximum of the first two elements in a list.
    Returns None if the list is empty.
    """
    if len(lst) == 0:
        return None
    elif len(lst) == 1:
        return lst[0]
    else:
        return max(lst[0], lst[1])

def check_empty(lst):
    """
    Check if a list is empty.
    """
    return len(lst) == 0

# Driver code
if __name__ == "__main__":
    numbers = [10, 50, 20, 35, 5, 100, 0, -1]
    result = find_max_value(numbers)
    print(f"Max value found: {result}")
```
By Grok 3
```python
from typing import List, Optional

def find_max_value(numbers: List[float]) -> Optional[float]:
    """
    Find the maximum value in a list of numbers.

    Args:
        numbers: List of numbers (integers or floats)

    Returns:
        The maximum value in the list, or None if the list is empty

    Examples:
        >>> find_max_value([1, 2, 3])
        3
        >>> find_max_value([])
        None
    """
    if not numbers:  # Check if list is empty
        print("List is empty")
        return None

    max_value = numbers[0]  # Initialize with first element
    for num in numbers:  # Iterate directly over values instead of indices
        if num > max_value:
            max_value = num
            print(f"New max found: {max_value}")

    return max_value

def get_larger_of_two(numbers: List[float]) -> Optional[float]:
    """
    Compare the first two elements of a list and return the larger one.

    Args:
        numbers: List of numbers (minimum 2 elements expected)

    Returns:
        The larger of the first two elements, or None if list has fewer than 2 elements
    """
    if len(numbers) < 2:
        print("List must have at least 2 elements")
        return None
    return max(numbers[0], numbers[1])  # Using built-in max() for simplicity

def is_list_empty(numbers: List[float]) -> bool:
    """
    Check if a list is empty.

    Args:
        numbers: List to check

    Returns:
        True if empty, False otherwise
    """
    return len(numbers) == 0

# Driver code - organized and clear
def main():
    numbers = [10, 50, 20, 35, 5, 100, 0, -1]

    # Test find_max_value
    max_result = find_max_value(numbers)
    print(f"Maximum value found: {max_result}")

    # Test get_larger_of_two
    larger_result = get_larger_of_two(numbers)
    print(f"Larger of first two elements: {larger_result}")

    # Test is_list_empty
    empty_check = is_list_empty(numbers)
    print(f"Is list empty? {empty_check}")

if __name__ == "__main__":
    main()
```
| Model | Claude 3.7 Sonnet | Grok 3 |
|---|---|---|
| Code efficiency & optimization | Uses list slicing (lst[1:]) to skip the already-checked first element, but lacks formal type hints. | Uses direct iteration and built-in functions (max()), making it simpler and more Pythonic. |
| Structure | Good structure, but lacks type hints and relies on debugging prints. | More structured. Includes type hints (List[float], Optional[float]), making it easier to maintain. |
| Code quality | Great for debugging and iteration efficiency, but slightly informal. | Cleaner, more modular, and production-ready, making it the better refactor overall. |
Claude 3.7 Sonnet did well on optimization and iteration efficiency. However, Grok 3 aligns better with the goal of the task by making the code cleaner, clearer, and more maintainable, which is the true purpose of refactoring.
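As a side note, neither refactor goes all the way: Python’s built-in max() accepts a default argument, which collapses the entire find_max_value function into a single expression.

```python
# find_max_value in one line, using the built-in max() with a default
numbers = [10, 50, 20, 35, 5, 100, 0, -1]
max_value = max(numbers, default=None)  # default is returned for an empty list
print(f"Max value found: {max_value}")  # Max value found: 100
```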
Prompt: “Suppose I have an image url. Give me the Python code for doing the image masking.”
Note: Image masking is a technique used to hide or reveal specific parts of an image by applying a mask, which defines the visible and hidden areas.
By Claude 3.7 Sonnet:
By Grok 3:
| Models | Claude 3.7 Sonnet | Grok 3 |
|---|---|---|
| Masking approach | Its output uses ImageDraw to create masks based on shape (circle, rectangle, polygon). It employs matplotlib for displaying images and works in all environments (including notebooks). | Its output uses thresholding on grayscale images to generate a mask based on brightness. It relies on cv2.imshow(), which requires a GUI, making it less suitable for non-interactive environments. |
| Flexibility | Supports custom shapes with adjustable parameters. | Best suited for brightness-based segmentation; shape-based masking would need extra logic. |
| Output | It only crops the image, giving a circular cutout with the background removed. This is almost the reverse of masking: instead of hiding the object, it highlights it. | It gives a much better output. The threshold-based segmentation produces a high-contrast binary mask, which ensures we cannot make out the exact breed of the dog in the image. |
Grok 3 used thresholding, which masked the image the way the task required: the main object in its output was not decipherable. Claude, on the other hand, merely cropped the image, highlighting its main element even further, which is the exact opposite of masking.
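For readers who want to try both approaches themselves, here is a minimal sketch using Pillow and NumPy rather than the models’ exact code. The image URL is a hypothetical placeholder; substitute your own.

```python
# A minimal sketch of both masking approaches discussed above.
# Assumes Pillow, NumPy, and an image reachable at IMAGE_URL (hypothetical).
import io
import requests
import numpy as np
from PIL import Image, ImageDraw

IMAGE_URL = "https://example.com/dog.jpg"  # hypothetical placeholder
img = Image.open(io.BytesIO(requests.get(IMAGE_URL, timeout=10).content)).convert("RGB")

# Approach 1 (shape-based, like Claude's output): a circular mask via ImageDraw
mask = Image.new("L", img.size, 0)
draw = ImageDraw.Draw(mask)
w, h = img.size
draw.ellipse((w * 0.25, h * 0.25, w * 0.75, h * 0.75), fill=255)
circular = Image.composite(img, Image.new("RGB", img.size, "black"), mask)
circular.save("circular_mask.png")

# Approach 2 (brightness-based, like Grok's output): threshold the grayscale image
gray = np.asarray(img.convert("L"))
binary = np.where(gray > 127, 255, 0).astype(np.uint8)  # high-contrast binary mask
Image.fromarray(binary).save("threshold_mask.png")
```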
| Tasks | Claude 3.7 Sonnet | Grok 3 |
|---|---|---|
| Debugging | ✅ | ❌ |
| Game Simulation | ✅ | ❌ |
| Data Analysis | ✅ | ❌ |
| Refactoring | ❌ | ✅ |
| Image Masking | ❌ | ✅ |
Claude 3.7 Sonnet is the clear winner over Grok 3 for tasks that involve coding.
Being the most recent releases, both Grok 3 and Claude 3.7 are well ahead of older models from OpenAI, Google, and DeepSeek. Now that we have seen how both models perform on coding tasks, let’s find out how they do on standard benchmark tests.
The following graph gives an idea of how the two models perform on various benchmarks.
The following table compares the features offered by the two models. You can refer to it while choosing the right LLM for your task.
| Feature | Claude 3.7 Sonnet | Grok 3 |
|---|---|---|
| Multimodality | Yes | Yes |
| Extended Thinking | Yes | Yes |
| Big Brain | No | Yes |
| Deep Search | No | Yes |
| 200K Context Window | Yes | No |
| Computer Use | Yes | No |
| Reasoning | Hybrid | Advanced |
Claude 3.7 Sonnet emerges as the superior coding assistant over Grok 3, excelling in debugging, game creation, and data analysis. Its ability to apply structured reasoning, generate high-quality, error-free code, and seamlessly integrate visualization tools gives it a clear edge in coding-related tasks. While Grok 3 shows promise, particularly in refactoring and image masking, it struggles with execution errors and lacks fine-tuned control over coding outputs.
But it is still quite early to pass a final judgement. If Elon Musk is to be believed, Grok 3 is going to get better with each passing day. Meanwhile, Claude 3.7 Sonnet now ships with Claude Code, an agentic tool that can do the coding for us. With newer, more advanced models being launched one after the other, the times ahead are surely going to be exciting for us users.
Q. Which model is better for coding: Claude 3.7 Sonnet or Grok 3?
A. Claude 3.7 Sonnet performed better in debugging, game creation, and data analysis, making it the preferred choice for coding tasks.
Q. Can Claude 3.7 Sonnet analyze visual data?
A. Yes, it can analyze charts, graphs, and documents, but Grok 3 also has multimodal capabilities.
Q. Is Grok 3 good at coding?
A. While it can generate code, it struggled with debugging and produced outputs with errors compared to Claude 3.7.
Q. Which model has the larger context window?
A. Claude 3.7 Sonnet supports a 200K token context window, whereas Grok 3 does not.
Q. Is Grok 3 better suited for research tasks?
A. Yes, Grok 3 includes Deep Search and extended reasoning, making it ideal for gathering and analyzing online information.
Q. How can I access these models?
A. Claude 3.7 Sonnet is available via Anthropic’s API and Claude.ai. Grok 3 is accessible at Grok.com and on the X platform.
Q. Which model should I choose?
A. If coding is your priority, go with Claude 3.7 Sonnet. If you need broader AI reasoning, Grok 3 may be more useful.