Since its release, GPT-4o has been getting huge attention for its multimodal capabilities. GPT-4o is known for its advanced language processing skills and has been enhanced to interpret and generate visual content. However, we shouldn’t overlook Gemini, a model that was highly praised for its multimodal abilities long before GPT-4o arrived. Gemini excels at combining image recognition with strong language understanding, making it a worthy competitor to GPT-4o.
In this article, we’ll compare GPT-4o and Gemini by examining how well they perform on different tasks, with the aim of determining which model is superior. This comparison is important because the ability to work with both text and images is valuable for many applications, such as automatic content creation and data analysis.
Let’s compare GPT-4o and Gemini to see which one performs better.
A very basic task for a multimodal LLM is to correctly identify the text and numbers written in a given image. So here we’ve provided an image containing some numbers and asked our two competitors, GPT-4o and Gemini, to calculate their sum. Let’s see who wins this round.
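Both models were tested through their chat interfaces for this comparison, but if you’d like to run the same kind of image-plus-prompt test programmatically, here’s a minimal sketch using the OpenAI Python SDK (the image URL is a placeholder; Gemini can be called in a similar way through Google’s google-generativeai package):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Send the image alongside a text prompt; the URL below is only a placeholder.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Calculate the sum of the numbers in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/numbers.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)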
Here, as you can see, GPT-4o provided the correct output. Easy peasy.
It’s not clear what Gemini understood from the given prompt. Although the task was pretty simple, Gemini couldn’t understand the context.
GPT-4o won!
In this round, we’ve provided an image of a tic-tac-toe game, but we haven’t named the game explicitly in the prompt. The task for our models is to first identify the game and then write Python code to implement it.
def print_board(board):
    for row in board:
        print(" | ".join(row))
        print("-" * 9)

def check_winner(board):
    # Check rows
    for row in board:
        if row[0] == row[1] == row[2] != " ":
            return row[0]
    # Check columns
    for col in range(3):
        if board[0][col] == board[1][col] == board[2][col] != " ":
            return board[0][col]
    # Check diagonals
    if board[0][0] == board[1][1] == board[2][2] != " ":
        return board[0][0]
    if board[0][2] == board[1][1] == board[2][0] != " ":
        return board[0][2]
    return None

def is_board_full(board):
    for row in board:
        if " " in row:
            return False
    return True

def main():
    board = [
        [" ", " ", "O"],
        ["O", "X", " "],
        [" ", " ", " "]
    ]
    current_player = "X"  # X's turn to play
    print("Initial Board:")
    print_board(board)
    while True:
        row = int(input(f"Player {current_player}, enter the row (0, 1, 2): "))
        col = int(input(f"Player {current_player}, enter the column (0, 1, 2): "))
        if board[row][col] == " ":
            board[row][col] = current_player
        else:
            print("Cell already taken! Try again.")
            continue
        print_board(board)
        winner = check_winner(board)
        if winner:
            print(f"Player {winner} wins!")
            break
        if is_board_full(board):
            print("It's a draw!")
            break
        current_player = "X" if current_player == "O" else "O"

if __name__ == "__main__":
    main()
Output:
The code provided by GPT-4o gave the above output. Great job identifying the game correctly and rendering a grid as well. Although one of the “O”s is misplaced in the output, we’ve otherwise received a fully functional tic-tac-toe game.
def display_board(board):
    """Displays the current board state"""
    for row in board:
        for cell in row:
            print(cell, end=" ")
        print()

def is_valid_move(board, row, col):
    """Checks if a move is valid (i.e. the cell is empty)"""
    return board[row][col] == " "

def make_move(board, player, row, col):
    """Places a player's mark on the board"""
    board[row][col] = player

def is_winner(board, player):
    """Checks if a player has won"""
    # Check rows
    for row in board:
        if all(cell == player for cell in row):
            return True
    # Check columns
    for col in range(len(board[0])):
        if all(board[row][col] == player for row in range(len(board))):
            return True
    # Check diagonals
    if all(board[row][row] == player for row in range(len(board))):
        return True
    if all(board[row][len(board) - row - 1] == player for row in range(len(board))):
        return True
    return False

def is_board_full(board):
    """Checks if the board is full"""
    for row in board:
        for cell in row:
            if cell == " ":
                return False
    return True

def main():
    """Main game loop"""
    board = [[" " for _ in range(3)] for _ in range(3)]
    current_player = "X"
    while True:
        display_board(board)
        # Get player move
        while True:
            row = int(input("Player {} enter row (1-3): ".format(current_player))) - 1
            col = int(input("Player {} enter column (1-3): ".format(current_player))) - 1
            if is_valid_move(board, row, col):
                break
            else:
                print("Invalid move. Please try again.")
        # Make move and check for winner
        make_move(board, current_player, row, col)
        if is_winner(board, current_player):
            display_board(board)
            print("Player {} wins!".format(current_player))
            break
        # Check for tie
        if is_board_full(board):
            display_board(board)
            print("It's a tie!")
            break
        # Switch player
        current_player = "O" if current_player == "X" else "X"

if __name__ == "__main__":
    main()
Output:
Now, when we run the code provided by Gemini, it identifies the game correctly, but as you can see, it does not print any grid, which makes the game difficult to play. Below you can check how the game looks without the grid.
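For reference, the missing grid is easy to restore; a small tweak of our own (not part of Gemini’s answer) to display_board prints the separators:

def display_board(board):
    """Displays the board with "|" between cells and a dashed line between rows."""
    for i, row in enumerate(board):
        print(" | ".join(row))
        if i < len(board) - 1:
            print("-" * 9)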
GPT-4o won!
In this round, we’ve given an image of a bar chart to both our models. They need to analyze the chart and generate the Python code required to recreate it using Matplotlib, and the code should produce the same bar chart when run. Let’s see which model wins this round.
import matplotlib.pyplot as plt
import numpy as np
# Data
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
sales_A = [10, 15, 17, 12]
sales_B = [10, 15, 20, 15]
# Plotting
fig, ax = plt.subplots()
bar_width = 0.35
index = np.arange(len(quarters))
bar1 = ax.bar(index, sales_A, bar_width, label='A')
bar2 = ax.bar(index, sales_B, bar_width, bottom=sales_A, label='B')
ax.set_xlabel('Quarter')
ax.set_ylabel('Sales')
ax.set_title('Sales by Product & Quarter')
ax.set_xticks(index)
ax.set_xticklabels(quarters)
ax.legend()
plt.show()
Output:
import matplotlib.pyplot as plt
# Data
products = ["A", "B"]
sales_q1 = [30, 25]
sales_q2 = [40, 35]
sales_q3 = [20, 30]
sales_q4 = [10, 15]
# Create a bar chart
fig, ax = plt.subplots()
width = 0.35 # the width of the bars
index = range(len(products)) # the x-axis positions for the bars
bar1 = plt.bar(index, sales_q1, width, label='Q1')
bar2 = plt.bar([p + width for p in index], sales_q2, width, label='Q2')
bar3 = plt.bar([p + width * 2 for p in index], sales_q3, width, label='Q3')
bar4 = plt.bar([p + width * 3 for p in index], sales_q4, width, label='Q4')
# Add labels and title
plt.xlabel('Products')
plt.ylabel('Sales')
plt.title('Sales by Product & Quarter')
# X-axis labels
plt.xticks([p + 1.5 * width for p in index], products)
# Legend
plt.legend()
plt.show()
Output:
GPT-4o won!
Here we’ve given an image input to both models: they have to understand the code shown in the provided screenshot and also produce its output. Let’s see how they perform on this test.
GPT-4o provided a very long explanation, but here’s the summary and the output:
Gemini gave the explanation below, but no output for the code.
The point goes to GPT-4o for properly understanding the prompt and providing the correct output as well.
GPT-4o won!
In this prompt, the models are asked for a detailed analysis of a user interface (UI) design, locating and describing interactive elements such as buttons and input fields. The goal is to specify what each element is, its purpose, and any associated labels or features.
It’s impressive how accurately GPT-4o can identify items in a design, with a clear understanding of each button, checkbox, and text box.
Gemini got the input fields correct, but it showed some uncertainty about the submit button, which was square in shape.
GPT-4o won!
| Task | Winner |
|---|---|
| Calculating the sum of numbers from an image | GPT-4o |
| Writing Python code for a tic-tac-toe game based on an image | GPT-4o |
| Creating Python code to recreate a bar chart from an image | GPT-4o |
| Explaining code from a screenshot and providing the output | GPT-4o |
| Identifying buttons and input fields in a user interface design | GPT-4o |
In this head-to-head comparison, GPT-4o clearly outperformed Gemini. GPT-4o consistently provided accurate and detailed results across all tasks, from calculating sums and coding games to generating bar charts and analyzing UI designs. It showed a strong ability to understand and process both text and images effectively.
Gemini, on the other hand, struggled in several areas. While it performed adequately on some tasks, it often failed to provide detailed explanations or accurate code. Its performance was inconsistent, highlighting its limitations compared to GPT-4o.
Overall, GPT-4o proved to be the more reliable and versatile model, and its superior performance across multiple tasks makes it the clear winner of this GPT-4o vs Gemini comparison. If you need a model that can handle both text and images with high accuracy, GPT-4o is the better choice.
Thanks for this comparative analysis. I have just one question: did you use Gemini Ultra 1.0 or Gemini 1.0 Pro in your analysis?
Hi, this was Gemini Pro since it is free to use and GPT-4o is also available with limited use for free users.
Thanks for the article. It was very interesting. I do think it would be useful to compare GPT-4o with Gemini Ultra for a more comprehensive evaluation. Also, it’s worth clarifying at the start of the article which Gemini model was used and its limitations compared to the Ultra version.