Navigating Data Formats with Pandas for Beginners

Satyajit Chaudhuri Last Updated : 17 Aug, 2023

Introduction

Pandas is more than just a name: it is derived from “panel data.” What exactly does that mean? The term comes from economics and statistics, where it refers to structured data sets that hold observations across multiple periods for different entities or subjects.

Nowadays, people often store data in diverse file formats and must convert it into accessible formats for loading. This is the first step of any data science project and will be this article’s main subject of discussion.

This article was published as a part of the Data Science Blogathon.

Table of contents

Introduction
The Pillars Behind Pandas’ Data Science Stardom
How do Pandas Keep the Data Neatly Organized?
Installing and Importing Pandas
CSV Files
Table
Excel Files
JSON
SQL
HTML
Clipboard
HDF
Parquet
Feather
Conclusion
Frequently Asked Questions

The Pillars Behind Pandas’ Data Science Stardom

Effortless Data Handling: One standout feature of pandas is its ability to handle complex data tasks easily. What once required complex code is now streamlined through pandas’ concise functions.
Perfect Data Harmony: Pandas fits seamlessly into the bigger data science picture. It works efficiently with NumPy, Matplotlib, SciPy, scikit-learn, and other advanced libraries.
Adaptability in Data Gathering: Pandas has the flexibility to gather data from various sources. Whether it’s CSV files, Excel sheets, JSON, or SQL databases, pandas handles it all. This adaptability streamlines data import, saving us from the headache of format conversions.

In a nutshell, pandas’ success comes from its user-friendly structures, its knack for managing data, its integration with other tools, and its ability to read from various data sources. It empowers data enthusiasts to unlock the potential hidden within their datasets, reshaping the data science landscape.

How do Pandas Keep the Data Neatly Organized?

Imagine pandas as your data organizer. Pandas handles data using two incredible structures: “Series” and “DataFrame.” Imagine them as the superheroes of data storage!

Series: Think of a Series as a straight path where data can sit. It’s like a line of information, holding anything from numbers to words. Each piece of data has a special label called an index. Imagine it as a name tag – it helps you find data easily. Series are super handy when dealing with just one column of data. You can do tricks like calculations and analysis with them.

DataFrame: A DataFrame is like a mini-spreadsheet or a fancy table. It’s got rows and columns like you’d see in Excel. Each column is a Series. So, you can have a “Numbers” column, a “Names” column, and so on. DataFrames are like the full package. They’re amazing at handling all sorts of data – numbers, words, you name it. Plus, you can do cool things like finding stuff, organizing data, and changing things around. Remember, each column in a DataFrame is a series!

Having covered the theoretical groundwork, it’s time to roll up our sleeves and dive into practical coding.

Installing and Importing Pandas

# Type the following in your command prompt or terminal
# to install pandas on your system
pip install pandas

# To install pandas into your conda environment, use the following
conda install pandas

# Jupyter notebook users can use the following
!pip install pandas

Just as NumPy is abbreviated as np and seaborn as sns, pandas is conventionally abbreviated as pd. You are free to use any other alias of your choice, and this won’t affect code execution in any way. However, ‘pd’ is the common alias, and using it improves your code’s readability. In this article, we will use the standard alias ‘pd’ when importing pandas.

#import pandas
import pandas as pd

The following code shows how to create a series with pandas.

ages = pd.Series([22, 35, 58], name="Age")

print(ages)

A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.
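
To make the idea of row labels more concrete, here is a minimal sketch that gives the same Series explicit labels. The labels "p1", "p2", and "p3" are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Same ages as above, with explicit (hypothetical) row labels
ages = pd.Series([22, 35, 58], index=["p1", "p2", "p3"], name="Age")

# Look up a single value by its row label
second_age = ages.loc["p2"]

# A Series supports calculations over all of its values
oldest = ages.max()
mean_age = ages.mean()

print(ages.index.tolist())
print(second_age, oldest, mean_age)
```

If you do not pass an index, pandas falls back to the default integer labels 0, 1, 2, and so on.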

Now let’s learn how to use a DataFrame with an example. I want to store passenger data from the Titanic. For a number of passengers, I know the name (characters), age (integers), and sex (male/female).

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
print(df)

This returns the dictionary as a DataFrame object. Also, note the capitalization of DataFrame when calling the constructor.
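
As a rough illustration of the point that every DataFrame column is a Series, the sketch below rebuilds the same passenger table and inspects it. The variable names `age_column` and `older_than_30` are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Selecting a single column returns a Series
age_column = df["Age"]
print(type(age_column))

# shape gives (number of rows, number of columns)
print(df.shape)

# Keep only the rows where the condition is True
older_than_30 = df[df["Age"] > 30]
print(older_than_30["Name"].tolist())
```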

Now you know how to load pandas and start working with it. But hold on: in real-world projects, your data will likely have far more rows and columns. Manually typing them out as dictionaries would be a marathon. That’s where the magic of importing data into pandas comes in. And guess what? This leads us right into the spotlight of our discussion today!

CSV Files

Comma-Separated Values or CSV files hold significant importance in data science due to their widespread use for storing tabular data. They’re a universal choice for data exchange due to their simplicity and compatibility across various platforms and applications. Pandas in Python offers an elegant solution to read CSV files, empowering data scientists to load, manipulate, and analyze data effortlessly. This streamlined process saves time, ensures data integrity, and provides a structured framework for exploration and insights extraction, making pandas a vital tool in the data scientist’s toolkit. As you embark on mastering data science, CSV files will be your trusty companions.

To read CSV files, we use the read_csv() function, which takes a path to a CSV file and reads the data into a pandas DataFrame object.

# Import pandas library
import pandas as pd

# Read data from a CSV file named 'sampledata.csv'
# The resulting DataFrame will be stored in the variable 'df_csv'
df_csv = pd.read_csv('sampledata.csv')
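
Real CSV files are not always comma-separated, and you rarely need every column. The sketch below shows two commonly used read_csv() parameters, sep and usecols. The sample rows are made up, and an in-memory text buffer stands in for a file so the example runs on its own.

```python
import io

import pandas as pd

# Inline text stands in for a file on disk (made-up sample rows)
csv_text = """name;age;city
Asha;29;Pune
Ben;41;Leeds
Chen;35;Xiamen
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                  # this sample uses semicolons, not commas
    usecols=["name", "age"],  # load only the columns we need
)

print(df_csv.head())
print(df_csv.dtypes)
```

With a real file, you would pass its path in place of the io.StringIO object.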

Table

A table is a structured arrangement within the data context where information is organized into rows and columns. Each row represents a distinct record or entry, while each column corresponds to a specific attribute or characteristic.

The read_table() function in pandas converts tabular data from a text file into a pandas DataFrame. When invoked, it opens the designated file in read mode and processes each line as a row of data. By default, the function assumes that tab characters separate the columns. However, you can customize the delimiter using the sep parameter to accommodate different formats. The function maps each line of the file to a DataFrame row and each delimited value to the corresponding DataFrame column. Once it has processed the entire file, it returns the resulting DataFrame containing the imported tabular data.

# Import pandas library
import pandas as pd

# Read tabular data from 'data.txt' using tab as the delimiter
df = pd.read_table('data.txt', sep='\t')

# Print the resulting DataFrame
print(df)
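
To illustrate the default tab delimiter and the sep parameter side by side, here is a small sketch. The two text snippets are made up and are read from in-memory buffers rather than from files.

```python
import io

import pandas as pd

# The same made-up table, once tab-separated and once pipe-separated
tab_text = "id\tscore\n1\t9.5\n2\t7.0\n"
pipe_text = "id|score\n1|9.5\n2|7.0\n"

# Tab is the default delimiter for read_table()
df_tab = pd.read_table(io.StringIO(tab_text))

# Any other delimiter has to be passed through sep
df_pipe = pd.read_table(io.StringIO(pipe_text), sep="|")

# Both calls should produce the same DataFrame
print(df_tab.equals(df_pipe))
```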

Excel Files

Excel is one of the most common data storage formats. Excel files can be quickly loaded into a pandas DataFrame with the read_excel() function. It supports various file formats, from xls to xlsx, and even the xlsb, odf, ods, and odt extensions. Whether the file is on your local drive or at a URL, this function lets you read one sheet or a list of sheets.

# Import pandas library
import pandas as pd

# Read data from an Excel file named 'sample.xlsx' and from 'Sheet1'
# The resulting DataFrame will be stored in the variable 'df_excel'
df_excel = pd.read_excel('sample.xlsx', sheet_name='Sheet1')

The sheet_name parameter lets you choose which sheet of the Excel file to load.

sheet_name: str, int, list, or None, default 0

Strings are used for sheet names. Integers are used in zero-indexed sheet positions (chart sheets do not count as a sheet position). Lists of strings/integers are used to request multiple sheets. Specify None to get all worksheets.

Available cases:

Defaults to 0: 1st sheet as a DataFrame

1: 2nd sheet as a DataFrame

"Sheet1": Load the sheet with the name "Sheet1"

[0, 1, "Sheet5"]: Load the first sheet, the second sheet, and the sheet named "Sheet5" as a dict of DataFrames

None: All worksheets
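
The sketch below tries out the sheet_name=None case. It assumes the optional openpyxl package is installed, writes two made-up sheets ("Sales" and "Costs") to an in-memory workbook, and reads them back. If openpyxl is missing, the example simply skips itself.

```python
import io

import pandas as pd

# Two small made-up tables to store as separate worksheets
sales = pd.DataFrame({"month": ["Jan", "Feb"], "units": [10, 12]})
costs = pd.DataFrame({"month": ["Jan", "Feb"], "amount": [4.5, 5.0]})

try:
    # Write both tables into one in-memory workbook
    buffer = io.BytesIO()
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        sales.to_excel(writer, sheet_name="Sales", index=False)
        costs.to_excel(writer, sheet_name="Costs", index=False)
    buffer.seek(0)

    # sheet_name=None returns a dict that maps sheet names to DataFrames
    sheets = pd.read_excel(buffer, sheet_name=None)
    print(list(sheets))
    print(sheets["Sales"])
except ImportError:
    # openpyxl is an optional dependency of pandas
    sheets = None
    print("openpyxl is not installed; skipping the Excel example")
```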

JSON

JSON (JavaScript Object Notation) is widely used as a file format for storing and exchanging data. It is designed to be readable by humans and machines alike, which makes it versatile for various applications. Due to its simplicity and flexibility, JSON is frequently used in web APIs, configuration files, and data storage. In data science, JSON is valuable for handling structured data and integrating it into various analytical tools, including pandas. A JSON file can be read into a pandas DataFrame using the read_json() function.

# Import pandas library
import pandas as pd

# Read data from a JSON file named 'data.json'
# The resulting DataFrame will be stored in the variable 'df_json'
df_json = pd.read_json('data.json')
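
JSON data often arrives either as a list of records or as line-delimited JSON, with one object per line. The sketch below reads both layouts. The records are made up, and in-memory buffers stand in for files.

```python
import io

import pandas as pd

# A JSON array of records (made-up data)
json_text = '[{"id": 1, "lang": "Python"}, {"id": 2, "lang": "R"}]'
df_json = pd.read_json(io.StringIO(json_text))

# Line-delimited JSON: one object per line, read with lines=True
jsonl_text = '{"id": 1, "lang": "Python"}\n{"id": 2, "lang": "R"}\n'
df_jsonl = pd.read_json(io.StringIO(jsonl_text), lines=True)

# Both layouts should give the same DataFrame
print(df_json.equals(df_jsonl))
```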

SQL

Connecting pandas to a SQL database unlocks a world of possibilities. It enables you to seamlessly bridge the gap between structured data stored in databases and pandas’ powerful data manipulation capabilities. By establishing this connection, you can leverage SQL queries to extract the needed data and then effortlessly transform it into pandas DataFrames. This synergy empowers you to analyze, visualize, and manipulate SQL data using pandas’ user-friendly syntax and the vast array of data manipulation functions. Connecting pandas to SQL bridges the gap between database querying and exploratory data analysis, allowing for more efficient and insightful data-driven decisions.

Different enterprise-level databases need different connection setups. In this article, you will work with SQLite through Python’s built-in sqlite3 module, and the data can be read using the read_sql() function.

# Import required libraries
import sqlite3
import pandas as pd

# Establish a connection to an SQLite database named 'database.db'
conn = sqlite3.connect('database.db')

# Define an SQL query to select all data from a table named 'table_name'
query = "SELECT * FROM table_name"

# Read data using the SQL query from the connection, creating a DataFrame 'df_sql'
df_sql = pd.read_sql(query, conn)

# Close the connection once the data is loaded
conn.close()
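
The snippet above needs an existing database file. For a version you can run as-is, here is a minimal sketch that builds a throwaway in-memory SQLite database first. The passengers table and its rows are made up for illustration. The query also uses a placeholder with the params argument instead of formatting values into the SQL string.

```python
import sqlite3

import pandas as pd

# Create a temporary in-memory database with a made-up table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)",
    [("Owen", 22), ("William", 35), ("Elizabeth", 58)],
)
conn.commit()

# The ? placeholder is filled in from params
query = "SELECT name, age FROM passengers WHERE age > ? ORDER BY age"
df_sql = pd.read_sql(query, conn, params=(30,))

conn.close()
print(df_sql)
```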

HTML

Reading HTML tables into pandas opens a gateway to extract structured data from web pages. It’s like tapping into a wellspring of information. Often, websites present data in tabular form, perfect for analysis. By utilizing pandas’ read_html() function, you effortlessly scrape and convert these tables into DataFrames, bypassing manual extraction. This dynamic duo of web scraping and data conversion equips you to seamlessly integrate web data into your analytical workflows, expanding your data universe beyond local files and databases.

# Import pandas library
import pandas as pd

# Define the URL containing an HTML table
url = 'https://example.com/table.html'

# Read HTML tables from the provided URL
# The resulting list of tables is stored in the variable 'tables'
tables = pd.read_html(url)

# Extract the first table from the list and store it in the DataFrame 'df_html'
df_html = tables[0]
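
Because read_html() returns a list, it helps to see it on a small, self-contained page. The sketch below parses a made-up HTML snippet from memory instead of a URL. It assumes an HTML parser such as lxml is installed, since read_html() depends on one, and skips itself otherwise.

```python
import io

import pandas as pd

# A made-up page fragment containing a single table
html_text = """
<table>
  <thead>
    <tr><th>country</th><th>capital</th></tr>
  </thead>
  <tbody>
    <tr><td>India</td><td>New Delhi</td></tr>
    <tr><td>Japan</td><td>Tokyo</td></tr>
  </tbody>
</table>
"""

try:
    # One DataFrame is returned per table found in the markup
    tables = pd.read_html(io.StringIO(html_text))
    df_html = tables[0]
    print(len(tables))
    print(df_html)
except ImportError:
    # Raised when no supported HTML parser is installed
    df_html = None
    print("No HTML parser available; skipping the HTML example")
```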

Clipboard

Pandas’ read_clipboard() function offers a convenient way to import data directly from your computer’s clipboard into a DataFrame. To use it, you copy the desired tabular data into your clipboard from sources like webpages or spreadsheets. Then, in your Python environment (such as a Jupyter Notebook or interactive shell), you call the read_clipboard() function. Pandas interprets the copied data as a table and converts it into a DataFrame, which you can then work with and analyze. This function is handy when you want to import data swiftly without saving it as a file first. However, the copied data must be correctly structured in a tabular format for pandas to interpret it accurately.

# Import pandas library
import pandas as pd

# Read tabular data from the clipboard and create a DataFrame 'df_clipboard'
df_clipboard = pd.read_clipboard()

HDF

Reading HDF5 files in pandas enhances data management for large and intricate datasets. HDF5’s efficient storage, hierarchical structure, and support for diverse data types make it a robust choice. Its compression capabilities optimize storage space, which is crucial for extensive data. By integrating HDF5 with pandas, you tap into the latter’s powerful data manipulation tools, enabling seamless analysis and exploration of complex, big datasets. This synergy empowers data scientists and researchers to handle large-scale data efficiently, opening avenues for sophisticated analysis and insightful discoveries. Use the read_hdf() function here, which relies on the optional PyTables package (installed as tables).

# Import pandas library
import pandas as pd

# Read the object stored under the key 'mydata' from an HDF5 file named 'data.h5'
df_hdf = pd.read_hdf('data.h5', key='mydata')

# Print the resulting DataFrame
print(df_hdf)

Parquet

Reading Parquet files in pandas offers a versatile data storage and analysis solution. Parquet’s columnar storage format optimizes data compression, minimizing storage space and enhancing data access speed. With pandas’ capability to read Parquet files, you can efficiently process large datasets, making it ideal for big data scenarios. The format’s compatibility with various data types ensures data integrity and enables seamless integration with pandas’ data manipulation tools. By harnessing Parquet’s benefits through the read_parquet() function, you can handle extensive data efficiently, accelerating analytical workflows and uncovering insights within vast datasets.

# Import pandas library
import pandas as pd

# Read data from a Parquet file named 'data.parquet'
df_parquet = pd.read_parquet('data.parquet')

# Print the resulting DataFrame
print(df_parquet)
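
One practical benefit of columnar storage is that you can load only the columns you need. The sketch below writes a small made-up DataFrame to a temporary Parquet file and reads back two of its three columns. It assumes a Parquet engine such as pyarrow is installed and skips itself otherwise.

```python
import os
import tempfile

import pandas as pd

# A small made-up table to round-trip through Parquet
df = pd.DataFrame(
    {"id": [1, 2, 3], "score": [9.5, 7.0, 8.25], "tag": ["a", "b", "c"]}
)

try:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "data.parquet")
        df.to_parquet(path, index=False)

        # columns limits the read to the listed columns
        df_subset = pd.read_parquet(path, columns=["id", "score"])
    print(df_subset)
except ImportError:
    # Raised when neither pyarrow nor fastparquet is installed
    df_subset = None
    print("No Parquet engine available; skipping the Parquet example")
```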

Feather

Reading Feather files in pandas presents a high-performance avenue for data exchange and analysis. Feather’s binary storage format optimizes both read and write speeds, which is ideal for interchanging data between languages. By using the read_feather() function to read Feather files, you can rapidly load and analyze large datasets, making it suitable for data-intensive tasks. The format’s simplicity and focus on speed enable seamless integration with pandas’ data manipulation functions. Through this combination, you can swiftly handle substantial datasets, accelerating your data analysis workflows and making it faster to extract insights.

# Import pandas library
import pandas as pd

# Read data from a Feather file named 'data.feather'
df_feather = pd.read_feather('data.feather')

# Print the resulting DataFrame
print(df_feather)

Conclusion

Proficiency in loading various file formats into pandas is crucial for data analysts and scientists. It enables effortless data integration from sources like CSV, Excel, JSON, and SQL, amplifying the capacity to extract insights. This versatility in data collection broadens the scope of comprehensive analyses, providing a vital skill set for informed and effective data-driven decision-making. And this is what you have learned in this article.

The key takeaways from this article on data formats with pandas include:

An introduction to pandas and why it is so popular in the data science industry.
The pandas data structures: “Series” and “DataFrame”.
How to load data from file formats like CSV, Excel, JSON, etc., into a pandas DataFrame.

There are other file formats like fixed-width files (fwf) and SAS, and they can also be effectively read in pandas. I highly recommend you review the documentation for a better understanding.

Frequently Asked Questions

Q1. What is pandas in Python? Share its uses.

A. Pandas is an open-source data manipulation and analysis library in Python. It provides powerful data structures like Series and DataFrames, designed to handle and manipulate structured data efficiently. Pandas is popular in data science due to its user-friendly interface, data alignment, and versatile data manipulation capabilities, making it essential for data preprocessing, analysis, and exploration.

Q2. Mention the different types of Data Structures in Pandas.

A. Pandas offers two fundamental data structures:
Series: A Series is a one-dimensional labeled array that can hold various data types, such as numbers, strings, and more. It’s akin to a column in a table and includes an index that labels each data point. Series provides a versatile way to perform operations on individual elements.
DataFrame: A DataFrame is a two-dimensional tabular structure resembling a spreadsheet or a SQL table. It consists of rows and columns, where each column can hold different data types. DataFrames are highly versatile, suitable for complex data manipulations, filtering, grouping, and merging. They provide a comprehensive view of data relations and patterns.

These two structures lay the foundation for pandas’ data manipulation capabilities, empowering data scientists to work with and analyze diverse datasets efficiently.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Satyajit Chaudhuri

Greetings,
I am a Data Scientist, currently employed with NTT Data, with a strong academic foundation that includes an MS in Machine Learning & AI from LJMU, an M.Tech(IIEST Shibpur), and a B.Tech (NIT Agartala). I am also a top voice in Linkedin and like to talk about new and innovative developments in the field of AI & ML.
