40 Data Science Coding Questions and Answers for 2024

Chirag Goyal Last Updated : 20 Jun, 2024
21 min read

Introduction

The field of data science is ever-evolving, with new tools and techniques emerging every day. In the current job scenario, particularly in 2024, professionals are expected to keep up with these changes. Businesses of all kinds are seeking skilled data scientists who can help them make sense of their data and keep pace with the competition. Whether you are experienced or a novice, acing coding interview questions plays a major role in securing that dream data science job. We are here to help you get through these new-age interviews of 2024 with this comprehensive guide of data science coding questions and answers.

Also Read: How to Prepare for Data Science Interview in 2024?

Data Science Coding Questions and Answers


Today’s data science coding interviews are designed to evaluate your problem-solving capabilities, your coding efficiency, and your grasp of algorithms and data structures. The questions typically mirror real-life scenarios, allowing evaluators to assess more than just your technical skills: they also test your capacity for critical thinking and how practically you can apply your knowledge in real-life situations.

We’ve compiled a list of the 40 most-asked and most educational data science coding questions and answers that you may come across in interviews in 2024. If you’re getting ready for an interview or simply looking to enhance your abilities, this list will provide you with a strong base to approach the hurdles of data science coding.

If you are wondering how knowing these coding questions and practicing them will help you, let me explain. Firstly, it prepares you for difficult interviews with major tech companies, where you can stand out if you know common problems and patterns well in advance. Secondly, working through such problems sharpens your analytical skills, helping you become a more effective data scientist in your day-to-day work. Thirdly, these coding questions will improve the cleanliness and efficiency of your code, an important advantage in any data-related position.

So let’s get started and code our way to success in the field of data science!

Also Read: Top 100 Data Science Interview Questions & Answers 2024

Python Coding Questions


Q1. Write a Python function to reverse a string.

Ans. To reverse a string in Python, you can use slicing. Here’s how you can do it:

def reverse_string(s):
    return s[::-1]

The slicing notation s[::-1] starts from the end of the string and moves to the beginning, effectively reversing it. It’s a concise and efficient way to achieve this.
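For instance, a quick sanity check of the function:

print(reverse_string("data science"))  # Output: ecneics atad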

Q2. Explain the difference between a list and a tuple in Python.

Ans. The main difference between a list and a tuple in Python is mutability. A list is mutable, meaning you can change its content after it’s created. You can add, remove, or modify elements. Here’s an example:

my_list = [1, 2, 3]
my_list.append(4)  # Now my_list is [1, 2, 3, 4]

On the other hand, a tuple is immutable. Once it’s created, you can’t change its content. Tuples are defined using parentheses. Here’s an example:

my_tuple = (1, 2, 3)

# my_tuple.append(4) would raise an AttributeError because tuples have no append method and can't be modified

Choosing between a list and a tuple depends on whether you need to modify the data. Tuples can also be slightly faster and are often used when the data should not change.
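Because tuples are immutable (and hashable when their elements are hashable), they can also serve as dictionary keys, something lists cannot do. Here’s a small illustrative sketch:

locations = {(40.7128, -74.0060): "New York", (51.5074, -0.1278): "London"}
print(locations[(40.7128, -74.0060)])  # New York
# A list key like [40.7128, -74.0060] would raise TypeError: unhashable type: 'list'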

Q3. Write a Python function to check if a given number is prime.

Ans. To check if a number is prime, you need to test if it’s only divisible by 1 and itself. Here’s a simple function to do that:

def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

This function first checks if the number is less than or equal to 1, since such numbers are not prime. It then checks divisibility from 2 up to the square root of the number; if any of these values divides it evenly, the number is not prime.
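For example, calling the function on a few values:

print(is_prime(7))   # True
print(is_prime(10))  # False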

Q4. Explain the difference between == and is in Python.

Ans. In Python, == checks for value equality, meaning it checks whether the values of two variables are the same. For example:

a = [1, 2, 3]
b = [1, 2, 3]
print(a == b)  # True, because the values are the same

On the other hand, is checks for identity, meaning it checks whether two variables point to the same object in memory. For example:

a = [1, 2, 3]
b = [1, 2, 3]
print(a is b)  # False, because they are different objects in memory
c = a
print(a is c)  # True, because c points to the same object as a

This distinction is important when dealing with mutable objects like lists.

Q5. Write a Python function to calculate the factorial of a number.

Ans. Calculating the factorial of a number can be done using either a loop or recursion. Here’s an example using a loop:

def factorial(n):
    if n < 0:
        return "Invalid input"
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result

This function initializes the result to 1 and multiplies it by each integer up to n. It’s straightforward and avoids the risk of stack overflow with large numbers that recursion might encounter.
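A quick usage example:

print(factorial(5))  # 120
print(factorial(0))  # 1, since by convention 0! = 1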

Q6. What is a generator in Python? Provide an example.

Ans. Generators are a special type of iterator in Python that allows you to iterate through a sequence of values lazily, meaning they generate values on the fly and save memory. You create a generator using a function and the yield keyword. Here’s a simple example:

def my_generator():
    for i in range(1, 4):
        yield i

gen = my_generator()
print(next(gen))  # 1
print(next(gen))  # 2
print(next(gen))  # 3

Using yield instead of return allows the function to produce a series of values over time, pausing and resuming as needed. This is very useful for handling large datasets or streams of data.

Q7. Explain the difference between map and filter functions in Python.

Ans. Both map and filter are built-in functions in Python used for functional programming, but they serve different purposes. The map function applies a given function to all items in an input list (or any iterable) and returns a new list of results. For example:

def square(x):
    return x * x

numbers = [1, 2, 3, 4]
squared = map(square, numbers)
print(list(squared))  # [1, 4, 9, 16]

On the other hand, the filter function applies a given function to all items in an input list and returns only the items for which the function returns True. Here’s an example:

def is_even(x):
    return x % 2 == 0

numbers = [1, 2, 3, 4]
evens = filter(is_even, numbers)
print(list(evens))  # [2, 4]

So, map transforms each item, while filter selects items based on a condition. Both are very powerful tools for processing data efficiently.

Check out more Python interview questions.

Data Structures and Algorithms Coding Questions


Q8. Implement a binary search algorithm in Python.

Ans. Binary search is an efficient algorithm for finding an item from a sorted list of items. It works by repeatedly dividing the search interval in half. If the value of the search key is less than the item in the middle of the interval, narrow the interval to the lower half. Otherwise, narrow it to the upper half. Here’s how you can implement it in Python:

def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1  # Target not found

In this function, we initialize two pointers, left and right, to the start and end of the list, respectively. We then repeatedly check the middle element and adjust the pointers based on the comparison with the target value.
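For example, searching a small sorted list (a minimal usage sketch):

sorted_list = [1, 3, 5, 7, 9, 11]
print(binary_search(sorted_list, 7))  # 3 (index of the target)
print(binary_search(sorted_list, 4))  # -1 (target not found)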

Q9. Explain how a hash table works. Provide an example.

Ans. A hash table is a data structure that stores key-value pairs. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found. The main advantage of hash tables is their efficient data retrieval, as they allow for average-case constant-time complexity, O(1), for lookups, insertions, and deletions.

Here’s a simple example in Python using a dictionary, which is essentially a hash table:

# Creating a hash table (dictionary)
hash_table = {}

# Adding key-value pairs
hash_table["name"] = "Alice"
hash_table["age"] = 25
hash_table["city"] = "New York"

# Retrieving values
print(hash_table["name"])  # Output: Alice
print(hash_table["age"])   # Output: 25
print(hash_table["city"])  # Output: New York

In this example, the hash function is handled implicitly by Python’s dictionary implementation. Each key is hashed to produce an index where the corresponding value is stored, and any collisions are resolved internally.

Q10. Implement a bubble sort algorithm in Python.

Ans. Bubble sort is a simple sorting algorithm that repeatedly steps through the list, compares adjacent elements, and swaps them if they are in the wrong order. The pass through the list is repeated until the list is sorted. Here’s a Python implementation:

def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]

# Example usage
arr = [64, 34, 25, 12, 22, 11, 90]
bubble_sort(arr)
print("Sorted array:", arr)

In this function, we have two nested loops. The inner loop performs the comparisons and swaps, and the outer loop ensures that the process is repeated until the entire list is sorted.

Q11. Explain the difference between depth-first search (DFS) and breadth-first search (BFS).

Ans. Depth-first search (DFS) and breadth-first search (BFS) are two fundamental algorithms for traversing or searching through a graph or tree data structure.

DFS (Depth-First Search): This algorithm starts at the root (or an arbitrary node) and explores as far as possible along each branch before backtracking. It uses a stack data structure, either implicitly with recursion or explicitly with an iterative approach.

def dfs(graph, start, visited=None):
    if visited is None:
        visited = set()
    visited.add(start)
    for next_node in graph[start] - visited:
        dfs(graph, next_node, visited)
    return visited

BFS (Breadth-First Search): This algorithm starts at the root (or an arbitrary node) and explores the neighbor nodes at the present depth prior to moving on to nodes at the next depth level. It uses a queue data structure.

from collections import deque

def bfs(graph, start):
    visited = set()
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex not in visited:
            visited.add(vertex)
            queue.extend(graph[vertex] - visited)
    return visited

The primary difference is in their approach: DFS goes deep into the graph first, while BFS explores all neighbors at the current depth before going deeper. DFS can be useful for pathfinding and connectivity checking, while BFS is often used for finding the shortest path in an unweighted graph.
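Both sketches above assume the graph is stored as a dictionary mapping each node to a set of its neighbors, which is what makes expressions like graph[start] - visited work. A hypothetical usage example:

graph = {
    'A': {'B', 'C'},
    'B': {'A', 'D'},
    'C': {'A', 'D'},
    'D': {'B', 'C'}
}
print(dfs(graph, 'A'))  # {'A', 'B', 'C', 'D'} (a set, so print order may vary)
print(bfs(graph, 'A'))  # {'A', 'B', 'C', 'D'}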

Q12. Implement a linked list in Python.

Ans. A linked list is a data structure in which elements are stored in nodes, and each node points to the next node in the sequence. Here’s how you can implement a simple singly linked list in Python:

class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


class LinkedList:
    def __init__(self):
        self.head = None

    def append(self, data):
        new_node = Node(data)
        if not self.head:
            self.head = new_node
            return
        last_node = self.head
        while last_node.next:
            last_node = last_node.next
        last_node.next = new_node

    def print_list(self):
        current = self.head
        while current:
            print(current.data, end=" -> ")
            current = current.next
        print("None")

# Example usage
ll = LinkedList()
ll.append(1)
ll.append(2)
ll.append(3)
ll.print_list()  # Output: 1 -> 2 -> 3 -> None

In this implementation, we have a Node class to represent each element in the list and a LinkedList class to manage the nodes. The append method adds a new node to the end of the list, and the print_list method prints all elements.

Q13. Write a function to find the nth Fibonacci number using recursion.

Ans. The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones, usually starting with 0 and 1. Here’s a recursive function to find the nth Fibonacci number:

def fibonacci(n):
    if n <= 0:
        return "Invalid input"
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# Example usage
print(fibonacci(10))  # Output: 34

This function uses recursion to compute the Fibonacci number. The base cases handle the first two Fibonacci numbers (0 and 1), and the recursive case sums the previous two Fibonacci numbers.

Q14. Explain time complexity and space complexity.

Ans. Time complexity and space complexity are used to describe the efficiency of an algorithm.

Time Complexity: This measures the amount of time an algorithm takes to complete as a function of the length of the input. It’s typically expressed using Big O notation, which describes the upper bound of the running time. For example, a linear search has a time complexity of O(n), meaning its running time increases linearly with the size of the input.

# Example of O(n) time complexity
def linear_search(arr, target):
    for i in range(len(arr)):
        if arr[i] == target:
            return i
    return -1

Space Complexity: This measures the amount of memory an algorithm uses as a function of the length of the input. It’s also expressed using Big O notation. For example, the space complexity of an algorithm that uses a constant amount of extra memory is O(1).

# Example of O(1) space complexity
def example_function(arr):
    total = 0
    for i in arr:
        total += i
    return total

Understanding these concepts helps you choose the most efficient algorithm for a given problem, especially when dealing with large datasets or constrained resources.

Check out more interview questions on data structures.

Pandas Coding Questions


Q15. Given a dataset of retail transactions, write a Pandas script to perform the following tasks:

  1. Load the dataset from a CSV file named retail_data.csv.
  2. Display the first 5 rows of the dataset.
  3. Clean the data by removing any rows with missing values.
  4. Create a new column named TotalPrice that is the product of Quantity and UnitPrice.
  5. Group the data by Country and calculate the total TotalPrice for each country.
  6. Sort the resulting grouped data by TotalPrice in descending order and display the top 10 countries.

Assume the dataset has the following columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country

Ans. Here’s how you can do it:

import pandas as pd

# Step 1: Load the dataset from a CSV file named 'retail_data.csv'
df = pd.read_csv('retail_data.csv')

# Step 2: Display the first 5 rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())

# Step 3: Clean the data by removing any rows with missing values
df_cleaned = df.dropna()

# Step 4: Create a new column named 'TotalPrice' that is the product of 'Quantity' and 'UnitPrice'
df_cleaned['TotalPrice'] = df_cleaned['Quantity'] * df_cleaned['UnitPrice']

# Step 5: Group the data by 'Country' and calculate the total 'TotalPrice' for each country
country_totals = df_cleaned.groupby('Country')['TotalPrice'].sum().reset_index()

# Step 6: Sort the resulting grouped data by 'TotalPrice' in descending order and display the top 10 countries
top_countries = country_totals.sort_values(by='TotalPrice', ascending=False).head(10)
print("Top 10 countries by total sales:")
print(top_countries)

Q16. How do you read a CSV file into a DataFrame in Pandas?

Ans. Reading a CSV file into a DataFrame is straightforward with Pandas. You use the read_csv function. Here’s how you can do it:

import pandas as pd
# Reading a CSV file into a DataFrame
df = pd.read_csv('path_to_file.csv')
# Displaying the first few rows of the DataFrame
print(df.head())

This function reads the CSV file from the specified path and loads it into a DataFrame, which is a powerful data structure for data manipulation and analysis.

Q17. How do you select specific rows and columns in a DataFrame?

Ans. Selecting specific rows and columns in a DataFrame can be done using various methods. Here are a few examples:

1. Selecting columns:

# Select a single column
column = df['column_name']
# Select multiple columns
columns = df[['column1', 'column2']]

2. Selecting rows:

# Select rows by index
rows = df[0:5]  # First 5 rows

3. Selecting rows and columns:

# Select specific rows and columns
subset = df.loc[0:5, ['column1', 'column2']]  # Using labels
subset_iloc = df.iloc[0:5, [0, 1]]  # Using integer positions

These methods allow you to access and manipulate specific parts of your data efficiently.

Q18. What is the difference between loc and iloc in Pandas?

Ans. The main difference between loc and iloc lies in how you select data from a DataFrame:

loc: Uses labels or boolean arrays to select data. It is label-based.

# Select rows and columns by label
df.loc[0:5, ['column1', 'column2']]

iloc: Uses integer positions to select data. It is position-based.

# Select rows and columns by integer position
df.iloc[0:5, [0, 1]]

Essentially, loc is used when you know the labels of your data, and iloc is used when you know the index positions.
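One subtlety worth remembering: slicing with loc is inclusive of the end label, while iloc follows Python’s usual end-exclusive convention. A small sketch, assuming a DataFrame df with a default integer index:

df.loc[0:2, 'column1']   # rows with labels 0, 1 and 2 (end label included)
df.iloc[0:2, 0]          # rows at positions 0 and 1 only (end position excluded)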

Q19. How do you handle missing values in a DataFrame?

Ans. Handling missing values is crucial for data analysis. Pandas provides several methods to deal with missing data.

Detecting missing values:

# Detect missing values
missing_values = df.isnull()

Dropping missing values:

# Drop rows with missing values
df_cleaned = df.dropna()
# Drop columns with missing values
df_cleaned = df.dropna(axis=1)

Filling missing values:

# Fill missing values with a specific value
df_filled = df.fillna(0)
# Fill missing values with the mean of each numeric column
df_filled = df.fillna(df.mean(numeric_only=True))

These methods allow you to clean your data, making it ready for analysis.

Q20. How do you merge two DataFrames in Pandas?

Ans. To merge two DataFrames, you can use the merge function, which is similar to SQL joins. Here’s an example:

# Creating two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
# Merging DataFrames on the 'key' column
merged_df = pd.merge(df1, df2, on='key', how='inner')
# Displaying the merged DataFrame
print(merged_df)

In this example, how='inner' specifies an inner join. You can also use 'left', 'right', or 'outer' for different types of joins.
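With the two small DataFrames above, only the keys present in both ('A' and 'B') survive the inner join, so the printed result should look like this:

  key  value1  value2
0   A       1       4
1   B       2       5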

Q21. What is groupby in Pandas? Provide an example.

Ans. The groupby function in Pandas is used to split the data into groups based on some criteria, apply a function to each group, and then combine the results. Here’s a simple example:

# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B'], 'Values': [10, 20, 30, 40]}
df = pd.DataFrame(data)
# Grouping by 'Category' and calculating the sum of 'Values'
grouped = df.groupby('Category').sum()
# Displaying the grouped DataFrame
print(grouped)

In this example, the DataFrame is grouped by the ‘Category’ column, and the sum of the ‘Values’ column is calculated for each group. Grouping data is very powerful for aggregation and summary statistics.
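With the sample data above, the sums are 10 + 30 = 40 for category A and 20 + 40 = 60 for category B, so the printed result looks roughly like this:

          Values
Category
A             40
B             60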

Learn more about Pandas with this comprehensive course from Analytics Vidhya.

NumPy Coding Questions


Q22. Given a 2D array, write a NumPy script to perform the following tasks:

  1. Create a 5×5 matrix with values ranging from 1 to 25.
  2. Reshape the matrix to 1×25 and then back to 5×5.
  3. Compute the sum of all elements in the matrix.
  4. Calculate the mean of each row.
  5. Replace all values greater than 10 with 10.
  6. Transpose the matrix.

Ans. Here’s how you can do it:

import numpy as np

# Step 1: Create a 5x5 matrix with values ranging from 1 to 25
matrix = np.arange(1, 26).reshape(5, 5)
print("Original 5x5 matrix:")
print(matrix)

# Step 2: Reshape the matrix to 1x25 and then back to 5x5
matrix_reshaped = matrix.reshape(1, 25)
print("Reshaped to 1x25:")
print(matrix_reshaped)
matrix_back_to_5x5 = matrix_reshaped.reshape(5, 5)
print("Reshaped back to 5x5:")
print(matrix_back_to_5x5)

# Step 3: Compute the sum of all elements in the matrix
sum_of_elements = np.sum(matrix)
print("Sum of all elements:")
print(sum_of_elements)

# Step 4: Calculate the mean of each row
mean_of_rows = np.mean(matrix, axis=1)
print("Mean of each row:")
print(mean_of_rows)

# Step 5: Replace all values greater than 10 with 10
matrix_clipped = np.clip(matrix, None, 10)
print("Matrix with values greater than 10 replaced with 10:")
print(matrix_clipped)

# Step 6: Transpose the matrix
matrix_transposed = np.transpose(matrix)
print("Transposed matrix:")
print(matrix_transposed)

Q23. How do you create a NumPy array?

Ans. Creating a NumPy array is straightforward. You can use the array function from the NumPy library. Here’s an example:

import numpy as np
# Creating a NumPy array from a list
my_array = np.array([1, 2, 3, 4, 5])
# Displaying the array
print(my_array)

This code converts a Python list into a NumPy array. You can also create arrays with specific shapes and values using functions like np.zeros, np.ones, and np.arange.
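For instance, a minimal sketch of a few of these constructors:

zeros = np.zeros((2, 3))        # 2x3 array filled with 0.0
ones = np.ones(4)               # array([1., 1., 1., 1.])
sequence = np.arange(0, 10, 2)  # array([0, 2, 4, 6, 8])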

Q24. Explain the difference between a Python list and a NumPy array with an example.

Ans. While both Python lists and NumPy arrays can store collections of items, there are key differences between them:

  • Homogeneity: NumPy arrays require all elements to be of the same data type, which makes them more efficient for numerical operations. Python lists can contain elements of different data types.
  • Performance: NumPy arrays are more memory efficient and faster due to their homogeneous nature and the underlying implementation in C.
  • Functionality: NumPy provides a vast array of functions and methods for mathematical and statistical operations that are optimized for arrays, which are not available with Python lists.

Here’s an example comparing a Python list and a NumPy array:

import numpy as np

# Python list
py_list = [1, 2, 3, 4, 5]

# NumPy array
np_array = np.array([1, 2, 3, 4, 5])

# Element-wise addition
np_array += 1

# Python list requires a loop or comprehension for the same operation
py_list = [x + 1 for x in py_list]

NumPy arrays are the go-to for performance-critical applications, especially in data science and numerical computing.

Q25. How do you perform element-wise operations in NumPy?

Ans. Element-wise operations in NumPy are straightforward and efficient. NumPy allows you to perform operations directly on arrays without the need for explicit loops. Here’s an example:

import numpy as np
# Creating two NumPy arrays
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
# Element-wise addition
result_add = array1 + array2
# Element-wise multiplication
result_mul = array1 * array2
# Displaying the results
print("Addition:", result_add)  # [5, 7, 9]
print("Multiplication:", result_mul)  # [4, 10, 18]

In this example, addition and multiplication are performed element-wise, meaning each element of array1 is added to the corresponding element of array2, and the same for multiplication.

Q26. What is broadcasting in NumPy? Provide an example.

Ans. Broadcasting is a powerful feature in NumPy that allows you to perform operations on arrays of different shapes. NumPy automatically expands the smaller array to match the shape of the larger array without making copies of data. Here’s an example:

import numpy as np
# Creating a 1D array
array1 = np.array([1, 2, 3])
# Creating a 2D array
array2 = np.array([[4], [5], [6]])
# Broadcasting array1 across array2
result = array1 + array2
# Displaying the result
print(result)

The output will be:

[[5 6 7]
[6 7 8]
[7 8 9]]

In this example, array1 is broadcasted across array2 to perform element-wise addition. Broadcasting simplifies code and improves efficiency.

Q27. How do you transpose a NumPy array?

Ans. Transposing an array means swapping its rows and columns. You can use the transpose method or the .T attribute. Here’s how you can do it:

import numpy as np
# Creating a 2D array
array = np.array([[1, 2, 3], [4, 5, 6]])
# Transposing the array
transposed_array = array.T
# Displaying the transposed array
print(transposed_array)

The output will be:

[[1 4]
[2 5]
[3 6]]

This operation is particularly useful in linear algebra and data manipulation.

Q28. How do you perform matrix multiplication in NumPy?

Ans. Matrix multiplication in NumPy can be performed using the dot function or the @ operator. Here’s an example:

import numpy as np
# Creating two matrices
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])
# Performing matrix multiplication
result = np.dot(matrix1, matrix2)
# Alternatively, using the @ operator
result_alt = matrix1 @ matrix2
# Displaying the result
print(result)

The output will be:

[[19 22]
[43 50]]

Matrix multiplication combines rows of the first matrix with columns of the second matrix, which is a common operation in various numerical and machine-learning applications.

SQL Coding Questions


Q29. Write a SQL query that finds all customers who placed an order with a total amount greater than $100 in the last month (from today’s date). Assume the database has the following tables: 

  • customers: Contains customer information like customer_id, name, email
  • orders: Contains order details like order_id, customer_id, order_date, total_amount

Ans: Here’s how you write the query for it:

SELECT customers.name, orders.order_date, orders.total_amount
FROM customers
INNER JOIN orders ON customers.customer_id = orders.customer_id
WHERE orders.order_date >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
  AND orders.total_amount > 100;

Q30. Write an SQL query to select all records from a table.

Ans. To select all records from a table, you use the SELECT statement with the asterisk (*) wildcard, which means ‘all columns’. Here’s the syntax:

SELECT * FROM table_name;

For example, if you have a table named employees, the query would be:

SELECT * FROM employees;

This query retrieves all columns and rows from the employees table.

Q31. Explain the difference between GROUP BY and HAVING clauses in SQL.

Ans. Both GROUP BY and HAVING are used in SQL to organize and filter data, but they serve different purposes:

GROUP BY: This clause is used to group rows that have the same values in specified columns into aggregated data. It is often used with aggregate functions like COUNT, SUM, AVG, etc.

SELECT department, COUNT(*)
FROM employees
GROUP BY department;

HAVING: This clause is used to filter groups created by the GROUP BY clause. It acts like a WHERE clause, but is used after the aggregation.

SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

In summary, GROUP BY creates the groups, and HAVING filters those groups based on a condition.

Q32. Write an SQL query to find the second-highest salary from an Employee table.

Ans. To find the second-highest salary, you can use a subquery with the MAX function (an alternative using LIMIT and OFFSET is shown below). Here’s one way to do it:

SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

This query first finds the highest salary and then uses it to find the maximum salary that is less than this highest salary, effectively giving you the second-highest salary.
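If your database supports LIMIT and OFFSET (for example MySQL or PostgreSQL), an alternative sketch is to sort the distinct salaries in descending order and skip the highest one:

SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;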

Q33. Explain the difference between INNER JOIN, LEFT JOIN, and RIGHT JOIN.

Ans. These JOIN operations are used to combine rows from two or more tables based on a related column between them:

INNER JOIN: Returns only the rows that have matching values in both tables.

SELECT a.column1, b.column2
FROM table1 a
INNER JOIN table2 b ON a.common_column = b.common_column;

LEFT JOIN (or LEFT OUTER JOIN): Returns all rows from the left table, and the matched rows from the right table. If no match is found, NULL values are returned for columns from the right table.

SELECT a.column1, b.column2
FROM table1 a
LEFT JOIN table2 b ON a.common_column = b.common_column;

RIGHT JOIN (or RIGHT OUTER JOIN): Returns all rows from the right table, and the matched rows from the left table. If no match is found, NULL values are returned for columns from the left table.

SELECT a.column1, b.column2
FROM table1 a
RIGHT JOIN table2 b ON a.common_column = b.common_column;

These different JOIN types help in retrieving the data as per the specific needs of the query.

Q34. Write an SQL query to count the number of employees in each department.

Ans. To count the number of employees in each department, you can use the GROUP BY clause along with the COUNT function. Here’s how:

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;

This query groups the employees by their department and counts the number of employees in each group.

Q35. What is a subquery in SQL? Provide an example.

Ans. A subquery, or inner query, is a query nested within another query. It can be used in various places like the SELECT, INSERT, UPDATE, and DELETE statements, or inside other subqueries. Here’s an example:

SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

In this example, the subquery (SELECT AVG(salary) FROM employees) calculates the average salary of all employees. The outer query then selects the names and salaries of employees who earn more than this average salary.

Check out more SQL coding questions.

Machine Learning Coding Questions


Q36. What is overfitting? How do you prevent it?

Ans. Overfitting occurs when a machine learning model learns not only the underlying patterns in the training data but also the noise and outliers. This results in excellent performance on the training data but poor generalization to new, unseen data. Here are a few strategies to prevent overfitting:

  • Cross-Validation: Use techniques like k-fold cross-validation to ensure the model performs well on different subsets of the data.
  • Regularization: Add a penalty for larger coefficients (L1 or L2 regularization) to simplify the model.
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
  • Pruning (for decision trees): Trim the branches of a tree that have little importance.
  • Early Stopping: Stop training when the model performance on a validation set starts to degrade.
  • Dropout (for neural networks): Randomly drop neurons during training to prevent co-adaptation.
from tensorflow.keras.layers import Dropout
model.add(Dropout(0.5))
  • More Data: Increasing the size of the training dataset can help the model generalize better.

Preventing overfitting is crucial for building robust models that perform well on new data.

Q37. Explain the difference between supervised and unsupervised learning. Give an example.

Ans. Supervised and unsupervised learning are two fundamental types of machine learning.

Supervised Learning: In this approach, the model is trained on labeled data, meaning that each training example comes with an associated output label. The goal is to learn a mapping from inputs to outputs. Common tasks include classification and regression.

# Example: Supervised learning with a classifier
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)

Unsupervised Learning: In this approach, the model is trained on data without labeled responses. The goal is to find hidden patterns or intrinsic structures in the input data. Common tasks include clustering and dimensionality reduction.

# Example: Unsupervised learning with a clustering algorithm
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(X_train)

The main difference lies in the presence or absence of labeled outputs during training. Supervised learning is used when the goal is prediction, while unsupervised learning is used for discovering patterns.

Q38. What is the difference between classification and regression?

Ans. Classification and regression are both types of supervised learning tasks, but they serve different purposes.

Classification: This involves predicting a categorical outcome. The goal is to assign inputs to one of a set of predefined classes.

# Example: Classification
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

Regression: This involves predicting a continuous outcome. The goal is to predict a numeric value based on input features.

# Example: Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

In summary, classification predicts discrete labels, while regression predicts continuous values.

Q39. Write a Python script to perform Principal Component Analysis (PCA) on a dataset and plot the first two principal components.

Ans. In this example, we build a small DataFrame df with three features, use PCA from sklearn to reduce the dimensionality to two components, and plot the first two principal components with matplotlib. Here’s how you can do it:

import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Example DataFrame
df = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'feature2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'feature3': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})

X = df[['feature1', 'feature2', 'feature3']]

# Step 1: Apply PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X)
principal_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

# Step 2: Plot the first two principal components
plt.scatter(principal_df['PC1'], principal_df['PC2'])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Dataset')
plt.show()

Q40. How do you evaluate a machine learning model?

Ans. Evaluating a machine learning model involves several metrics and techniques to ensure its performance. Here are some common methods:

Train-Test Split: Divide the dataset into a training set and a test set to evaluate how well the model generalizes to unseen data.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Cross-Validation: Use k-fold cross-validation to assess the model’s performance on different subsets of the data.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)

Confusion Matrix: For classification problems, a confusion matrix helps visualize the performance by showing true vs. predicted values.

from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

ROC-AUC Curve: For binary classification, the ROC-AUC curve helps evaluate the model’s ability to distinguish between classes.

from sklearn.metrics import roc_auc_score
y_proba = model.predict_proba(X_test)[:, 1]  # probability estimates for the positive class
auc = roc_auc_score(y_test, y_proba)

Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE): For regression problems, these metrics help quantify the prediction errors.

from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)

Evaluating a model comprehensively ensures that it performs well not just on training data but also on new, unseen data, making it robust and reliable.

Check out more machine learning interview questions.

Conclusion

Mastering coding questions in data science is essential to get the job you want in this ever-changing industry. These questions measure not only your technical skills but also your critical thinking and problem solving skills. Through consistent practice and understanding of key concepts, you can establish a solid foundation that will help you in interviews and on your career journey.

The field of data science is competitive, but with proper preparation, you can emerge as a candidate ready to tackle real-world issues. Upgrade your skills, stay abreast of the latest techniques and technologies, and constantly expand your knowledge base. Solving every coding problem gets you closer to becoming a competent and effective data scientist.

We believe this collection of top data science coding questions and answers has given you valuable insights and a structured approach to preparing yourself. Good luck with your interview and may you achieve all your career aspirations in the exciting world of data science!

Frequently Asked Questions

Q1. What are the most important skills to have for a data science interview?

A. Key skills include proficiency in Python or R, a strong understanding of statistics and probability, experience with data manipulation using Pandas and NumPy, knowledge of machine learning algorithms, and problem-solving abilities. Soft skills like communication and teamwork are also important.

Q2. How can I improve my coding skills for data science interviews?

A. Practice on coding platforms like LeetCode and HackerRank, focus on data structures and algorithms, work on real-world projects, review others’ code, participate in coding competitions, and take online courses.

Q3. What is the best way to prepare for data science interviews at top tech companies?

A. Combine technical and non-technical preparation: study common questions, do mock interviews, understand the company, brush up on algorithms and machine learning, and practice explaining your solutions clearly.

Q4. How important are projects and portfolios in data science interviews?

A. Projects and portfolios are crucial as they demonstrate your practical skills, creativity, and experience. A well-documented portfolio with diverse projects can significantly boost your chances and serve as discussion points in interviews.

Q5. What should I focus on during the last week of interview preparation?

A. Review core concepts and common questions, practice coding and mock interviews, revisit your projects, research the company, prepare questions for the interviewers, ensure you get enough rest, and manage stress effectively.

I, Chirag Goyal, am a Content Editor Intern in Analytics Vidhya. I am currently pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology Jodhpur(IITJ). I am very enthusiastic about Machine learning, Deep Learning, and Artificial Intelligence.
