Pandas 101: A Practical Guide for Absolute Beginners

Yashmeet Singh · Mar 14, 2023 · 15 minute read

In a hurry to learn Pandas? Here's all you need to know to get started!

Introduction

Pandas is the most popular Python library for data analysis. It can easily handle large, messy datasets you’ll find in the real world. You can use its numerous functions to load and manipulate data and even create graphical plots.

Today, I’ll introduce Pandas basics and its most frequently used features. You’ll learn how to:

Load tabular data (e.g., CSV file).
Get summary statistics for numerical and categorical data.
Manipulate columns - rename, remove, or add new columns.
Sort and filter data.
Group data and calculate aggregated results.
Visualize data using Pandas (ex. bar, pie charts).

The Retail Sales Dataset

Before we start, let’s look at the dataset we’ll use in this article.

The retail sales dataset contains a fictitious store’s sales and profits for two consecutive quarters (Q3 and Q4). It breaks down this data by store departments and categories.

Here’s a screenshot of the entire dataset:

Pandas Beginners Guide: Retail Sales Dataset Screenshot

The first row contains the column labels. The store departments listed in the first column are unique. So we would like to use them as the index.

The data is stored in a CSV file named retail_sales_profits.csv. Let’s load and analyze it using Pandas.

Load the Dataset

First, we need to import Pandas. We usually do that using the alias pd:

import pandas as pd

You can use Pandas to load data from a variety of sources (ex., CSV, Excel, SQL database, HTML table, etc.)¹.

Since our dataset is a CSV file, we’ll use the method read_csv(). It’ll return a DataFrame object. DataFrame is a 2-dimensional array Pandas uses to store tabular data.

# Read the CSV file into the 'sales_df' object (Sales DataFrame)
sales_df = pd.read_csv('retail_sales_profits.csv')

Let’s print the DataFrame object:²

sales_df

	Store Department	Store Category	Q3 Sales ($)	Q3 Profit ($)	Q4 Sales ($)	Q4 Profit ($)
0	Haircare	Beauty	1529	395	2254	460
1	Kitchen	Home	4638	1099	5792	1594
2	TVs	Electronics	9270	1394	11215	1651
3	Football	Sports	8915	2280	5694	1864
4	Men	Clothing	4497	1210	5906	1902
...	...	...	...	...	...	...
20	Baseball	Sports	8290	3358	6531	2400
21	Decor	Home	3280	694	4592	907
22	Skincare	Beauty	2385	640	3506	875
23	Bedding	Home	2799	921	3886	1180
24	Women	Clothing	9796	3159	12290	4519

25 rows × 6 columns

Pandas used the first row from the CSV file for column labels (in blue). However, it automatically added an index (in red). Let’s print it:

# Print the DataFrame index
sales_df.index

RangeIndex(start=0, stop=25, step=1)

It’s a numerical index ranging from 0 to 24. It’s not going to be helpful in our analysis. Instead, we want to set the first column, Store Department, as the index.

Let’s look at two ways to do that.

Set Index While Loading Data

You can use read_csv()’s parameter index_col to set a column as index:

sales_df = pd.read_csv(
    'retail_sales_profits.csv', # file to load dataset
    index_col='Store Department' # column to set as the index
)
sales_df

	Store Category	Q3 Sales ($)	Q3 Profit ($)	Q4 Sales ($)	Q4 Profit ($)
Store Department
Haircare	Beauty	1529	395	2254	460
Kitchen	Home	4638	1099	5792	1594
TVs	Electronics	9270	1394	11215	1651
Football	Sports	8915	2280	5694	1864
Men	Clothing	4497	1210	5906	1902
...	...	...	...	...	...
Baseball	Sports	8290	3358	6531	2400
Decor	Home	3280	694	4592	907
Skincare	Beauty	2385	640	3506	875
Bedding	Home	2799	921	3886	1180
Women	Clothing	9796	3159	12290	4519

25 rows × 5 columns

Change Index After Data is Loaded

You can also change the index after you’ve already read the dataset.

The below code reloads the CSV file. That’ll create the integer column that we don’t want:

sales_df = pd.read_csv('retail_sales_profits.csv')
# print the index
sales_df.index

RangeIndex(start=0, stop=25, step=1)

We can use the DataFrame method set_index() to change the index to the correct column:

sales_df = sales_df.set_index('Store Department')
# print the index again
sales_df.index

Index(['Haircare', 'Kitchen', 'TVs', 'Football', 'Men', 'Refurbished',
       'Appliances', 'Headphones', 'Tennis', 'Fragrances', 'Basketball',
       'Children', 'Golf', 'Laptops', 'Tablets', 'Shoes', 'Computers',
       'Furniture', 'Makeup', 'Nails', 'Baseball', 'Decor', 'Skincare',
       'Bedding', 'Women'],
      dtype='object', name='Store Department')

The column Store Department is now set as the index.

Access Rows By Index

The regular Python and NumPy arrays provide you with only a numerical index. Pandas goes one step further. As we saw in the last section, it allows you to set columns with string data as the index.

You can use these string index labels to query the DataFrame using loc[].

For example, the below code snippets use loc[] to access specific Store Departments:

# Access a row (department) using the string index label
sales_df.loc['Haircare']

Store Category    Beauty
Q3 Sales ($)        1529
Q3 Profit ($)        395
Q4 Sales ($)        2254
Q4 Profit ($)        460
Name: Haircare, dtype: object

# Access rows for multiple departments
# Notice the use of double square brackets
sales_df.loc[['Tennis', 'Laptops']]

	Store Category	Q3 Sales ($)	Q3 Profit ($)	Q4 Sales ($)	Q4 Profit ($)
Store Department
Tennis	Sports	4605	952	2270	544
Laptops	Electronics	12815	3710	14602	4605

Head and Tail

If you want to take a quick peek at the DataFrame, you can use the methods head() and tail(). They’ll print the first and last five DataFrame rows, respectively:

# print first 5 rows
sales_df.head()

	Category	Q3_Sales	Q3 Profit ($)	Q4 Sales ($)	Q4 Profit ($)
Store Department
Haircare	Beauty	1529	395	2254	460
Kitchen	Home	4638	1099	5792	1594
TVs	Electronics	9270	1394	11215	1651
Football	Sports	8915	2280	5694	1864
Men	Clothing	4497	1210	5906	1902

# print last 5 rows
sales_df.tail()

	Category	Q3_Sales	Q3_Profit	Q4_Sales	Q4_Profit
Store Department
Baseball	Sports	8290	3358	6531	2400
Decor	Home	3280	694	4592	907
Skincare	Beauty	2385	640	3506	875
Bedding	Home	2799	921	3886	1180
Women	Clothing	9796	3159	12290	4519

Print DataFrame Information

We can use the method info() to get useful information about the DataFrame. It prints the index type, columns & their datatypes, and memory usage:

sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25 entries, Haircare to Women
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Store Category  25 non-null     object
 1   Q3 Sales ($)    25 non-null     int64
 2   Q3 Profit ($)   25 non-null     int64
 3   Q4 Sales ($)    25 non-null     int64
 4   Q4 Profit ($)   25 non-null     int64
dtypes: int64(4), object(1)
memory usage: 1.7+ KB

Here are a couple of things to note:

The Store Category column contains text data. So Pandas stores it as the type object.
The other columns contain numerical data, specifically integers. Pandas automatically detects that and stores them as int64 type.

Summarize Numerical Data

The DataFrame method describe() generates basic statistics such as mean, standard deviation, and percentiles for all the numerical columns.

We can use it to print this information for our sales and profit figures:

sales_df.describe()

	Q3 Sales ($)	Q3 Profit ($)	Q4 Sales ($)	Q4 Profit ($)
count	25.00	25.00	25.00	25.00
mean	5,674.00	1,711.28	6,676.68	2,303.84
std	3,881.24	1,627.80	4,454.73	2,292.45
min	597.00	78.00	775.00	121.00
25%	2,572.00	640.00	3,506.00	855.00
50%	4,605.00	1,210.00	5,792.00	1,651.00
75%	8,704.00	2,280.00	9,772.00	2,680.00
max	15,570.00	7,408.00	18,690.00	10,215.00

Rename Column Labels

Let’s review the column labels. We can print them using the DataFrame attribute columns:

sales_df.columns

Index(['Store Category', 'Q3 Sales ($)', 'Q3 Profit ($)',
        'Q4 Sales ($)', 'Q4 Profit ($)'],
      dtype='object')

The column labels are too long and contain special characters ($). More importantly, the labels have spaces. That could cause problems when we query data by specific columns.

Luckily, we can rename the columns to anything we like. Let’s see our options.

Relabel Specific Columns

The method rename() allows us to change labels for a subset of columns. The input argument columns takes a dictionary with the current column label as the key and the new column label as the value.

The below code will rename two of the columns:

## Specific columns using a dictionary
sales_df = sales_df.rename(columns={
    ## "current column label": "new column label"
    "Store Category": "Category",
    "Q3 Sales ($)": "Q3_Sales"
})
# show the first five rows
sales_df.head()

	Category	Q3_Sales	Q3 Profit ($)	Q4 Sales ($)	Q4 Profit ($)
Store Department
Haircare	Beauty	1529	395	2254	460
Kitchen	Home	4638	1099	5792	1594
TVs	Electronics	9270	1394	11215	1651
Football	Sports	8915	2280	5694	1864
Men	Clothing	4497	1210	5906	1902

As expected, only two columns (highlighted) have the updated labels. All other column labels remained unchanged.

Change Labels for All Columns

You might want to update all of the column labels in one shot. You can do that using the set_axis() method with the below inputs:

The list of new column names
The parameter axis set to columns

Let’s update the labels for all the sales_df columns:

new_column_labels = [
    'Category', 'Q3_Sales', 'Q3_Profit',
    'Q4_Sales', 'Q4_Profit'
]
# rename all column labels
sales_df = sales_df.set_axis(new_column_labels, axis='columns')
# print first five rows
sales_df.head()

	Category	Q3_Sales	Q3_Profit	Q4_Sales	Q4_Profit
Store Department
Haircare	Beauty	1529	395	2254	460
Kitchen	Home	4638	1099	5792	1594
TVs	Electronics	9270	1394	11215	1651
Football	Sports	8915	2280	5694	1864
Men	Clothing	4497	1210	5906	1902

All the column labels (highlighted) look better now. They are shorter and don’t have blank spaces anymore.

Update Index Name

We’ve fixed the column labels, but the index name, Store Department, is still too long and contains space. We can update it using the DataFrame property index.name:

# Update the DataFrame index name
sales_df.index.name = 'Department'
 
# Print the bottom five rows
sales_df.tail()

	Category	Q3_Sales	Q3_Profit	Q4_Sales	Q4_Profit
Department
Baseball	Sports	8290	3358	6531	2400
Decor	Home	3280	694	4592	907
Skincare	Beauty	2385	640	3506	875
Bedding	Home	2799	921	3886	1180
Women	Clothing	9796	3159	12290	4519

Select Columns

Python and its libraries (ex., NumPy) widely use square brackets, [ ], to fetch items from collections. Pandas carries on this tradition. Let’s see how it allows you to select one or multiple columns.

Get a Single Column

You can retrieve a specific column by passing the column label in the square brackets. The below code gets the Category column from sales_df:

# Get a single column
sales_df['Category']

Department
Haircare         Beauty
Kitchen            Home
TVs         Electronics
Football         Sports
Men            Clothing
               ...
Baseball         Sports
Decor              Home
Skincare         Beauty
Bedding            Home
Women          Clothing
Name: Category, Length: 25, dtype: object

Retrieve Multiple Columns

You’ll need to use double square brackets, [[]], to fetch multiple columns. Why is that?

The inner square brackets create a list of columns. Then we pass that list to outer square brackets. That tells Pandas to retrieve all the columns from the list.

Let’s use this technique to get two columns from the sales_df:

# Get Q3 and Q4 sales
# - inner brackets create column list - ['Q3_Sales', 'Q4_Sales']
# - outer brackets instruct Pandas to fetch content for
#   all columns from the list ['Q3_Sales', 'Q4_Sales']
sales_df[['Q3_Sales', 'Q4_Sales']]

	Q3_Sales	Q4_Sales
Department
Haircare	1529	2254
Kitchen	4638	5792
TVs	9270	11215
Football	8915	5694
Men	4497	5906
...	...	...
Baseball	8290	6531
Decor	3280	4592
Skincare	2385	3506
Bedding	2799	3886
Women	9796	12290

25 rows × 2 columns

Summarize Categorical Columns

Recall that we used the describe() method to get basic statistics for numerical columns. How can you similarly summarize columns with categorical data?

Pandas provides two methods to help us out.

Get Unique Values

You can use unique() method to get all the unique values for a categorical column.

Suppose you want to know all the product categories for our retail store. Here’s how you can do it using unique():

# Get all the unique product categories
# Select the column 'Category' using square bracket notation
# and then call the unique() method
sales_df['Category'].unique()

array(['Beauty', 'Home', 'Electronics', 'Sports', 'Clothing'],
      dtype=object)

Count Unique Values

The unique() tells us that there are five product categories. What if you want to know how many times each category occurred in the retail dataset?

We can use the method value_counts() to get that information:

# How many times does each product category occur?
# Select the 'Category' column and invoke value_counts()
sales_df['Category'].value_counts()

Electronics    6
Beauty         5
Home           5
Sports         5
Clothing       4
Name: Category, dtype: int64

Drop Columns

Let’s take a look at our sales DataFrame again:

# print first five rows
sales_df.head()

	Category	Q3_Sales	Q3_Profit	Q4_Sales	Q4_Profit
Department
Haircare	Beauty	1529	395	2254	460
Kitchen	Home	4638	1099	5792	1594
TVs	Electronics	9270	1394	11215	1651
Football	Sports	8915	2280	5694	1864
Men	Clothing	4497	1210	5906	1902

We’ll examine only the sales for the rest of the article. So there’s no need to keep the profit columns around. We can remove them from the DataFrame using the drop() method.

You’ll need to set the parameter columns to the list of columns you want to remove:

# remove profit columns
sales_df = sales_df.drop(columns=['Q3_Profit', 'Q4_Profit'])
# print first rows
sales_df.head()

	Category	Q3_Sales	Q4_Sales
Department
Haircare	Beauty	1529	2254
Kitchen	Home	4638	5792
TVs	Electronics	9270	11215
Football	Sports	8915	5694
Men	Clothing	4497	5906

Looks good. Now we have only the columns we need.

Filter Data

Pandas provides numerous ways to filter DataFrame data. I’ll introduce the query() method because it’s the easiest to understand and results in cleaner code.

The query() method accepts filter conditions built using comparison operators such as >, >=, <, <=, ==, !=. You can combine multiple conditions using logical operators - and, or, not.³

Allow me to show some examples to clarify.

Single Condition

Let’s say you want to find all underperforming Q3 sales. You can search for rows with Q3 sales below $2,000 using the below code:

# find all the entries with Q3 sales less than 2000
sales_df.query('Q3_Sales < 2000')

	Category	Q3_Sales	Q4_Sales
Department
Haircare	Beauty	1529	2254
Refurbished	Electronics	1334	1297
Fragrances	Beauty	1392	2087
Nails	Beauty	597	775

The above query filtered the data based on numerical values (sales). We can also search the columns containing string values.

For example, the below code will get all the rows where the product category equals ‘Electronics’:

# Fetch all the rows from the electronics category
# You'll need to pass the 'Electronics' in single quotes
sales_df.query("Category == 'Electronics'")

	Category	Q3_Sales	Q4_Sales
Department
TVs	Electronics	9270	11215
Refurbished	Electronics	1334	1297
Headphones	Electronics	2572	4017
Laptops	Electronics	12815	14602
Tablets	Electronics	3506	4759
Computers	Electronics	9850	11503

Multiple Conditions

Suppose you want to query the products from the ‘Clothing’ category with Q3 sales exceeding $5,000. We can combine these two conditions using the logical operator and:

# Search rows that match both conditions:
# 1. product category is Clothing
# 2. Q3 sales over $5,000
sales_df.query("Category == 'Clothing' and Q3_Sales > 5000")

	Category	Q3_Sales	Q4_Sales
Department
Children	Clothing	5608	7502
Shoes	Clothing	7452	9772
Women	Clothing	9796	12290

Sort Data

You can sort DataFrame by column values or index. Let’s see examples of both approaches.

Sort by Columns

Pandas provides the method sort_values() to order the DataFrame rows by column values. It has two key parameters:

by: the column or the list of columns to use for sorting.
ascending: controls the sort direction. It’s set to True by default (low to high). But you can pass False to sort in descending (high to low) order.

Let’s use this method to sort sales_df by Q3 sales in descending order:

# Sort by Q3 sales, from high to low
sales_df.sort_values(by='Q3_Sales', ascending=False)

	Category	Q3_Sales	Q4_Sales
Department
Furniture	Home	15570	18690
Laptops	Electronics	12815	14602
Computers	Electronics	9850	11503
Women	Clothing	9796	12290
TVs	Electronics	9270	11215
...	...	...	...
Skincare	Beauty	2385	3506
Haircare	Beauty	1529	2254
Fragrances	Beauty	1392	2087
Refurbished	Electronics	1334	1297
Nails	Beauty	597	775

25 rows × 4 columns

We can also use multiple columns to sort DataFrames.

Below code sorts sales_df using two columns. First, it orders all the rows using the product category. Then within the same category, it sorts the rows based on the Q4 sales:

# Order based on the product category and Q3 sales
# Set parameter 'by' to a list of columns
sales_df.sort_values(by=['Category', 'Q4_Sales'])

	Category	Q3_Sales	Q4_Sales
Department
Nails	Beauty	597	775
Fragrances	Beauty	1392	2087
Haircare	Beauty	1529	2254
Skincare	Beauty	2385	3506
Makeup	Beauty	4240	6803
...	...	...	...
Tennis	Sports	4605	2270
Golf	Sports	2415	3408
Football	Sports	8915	5694
Baseball	Sports	8290	6531
Basketball	Sports	5791	7482

25 rows × 4 columns

Sort by Index

We can order the DataFrame by the index using the sort_index() method.

Below code sorts sales_df by the index. It uses the ascending parameter to order rows from high to low index value:

# Order by index from high to low
sales_df.sort_index(ascending=False)

	Category	Q3_Sales	Q4_Sales
Department
Women	Clothing	9796	12290
Tennis	Sports	4605	2270
Tablets	Electronics	3506	4759
TVs	Electronics	9270	11215
Skincare	Beauty	2385	3506
...	...	...	...
Children	Clothing	5608	7502
Bedding	Home	2799	3886
Basketball	Sports	5791	7482
Baseball	Sports	8290	6531
Appliances	Home	8704	10284

25 rows × 4 columns

Method Chaining

Method Chaining is a standard Python technique that invokes method calls sequentially. We can use this technique to string DataFrame operations as well.

Let’s see this in action.

Suppose you want to perform two operations on sales_df in sequence:

Find the rows where the product category is ‘Home’ or ‘Clothing’ and Q4 sales are over $7,500.
Then sort the filtered rows by the Q4 sales in descending order.

We can do this by invoking sort_values() directly on the output of query():

# Filter Criteria
## Category is either Home or Clothing
## and Q4 Sales exceed $7500
fc = "Category == ['Home', 'Clothing'] and Q4_Sales > 7500"
 
# Method Chaining
## Filter and then sort the filter output
sales_df.query(fc).sort_values('Q4_Sales', ascending=False)

	Category	Q3_Sales	Q4_Sales
Department
Furniture	Home	15570	18690
Women	Clothing	9796	12290
Appliances	Home	8704	10284
Shoes	Clothing	7452	9772
Children	Clothing	5608	7502

Group and Aggregate Data

So far, we’ve been manipulating data at the row level. However, sometimes we need to perform operations on groups of data.

The DataFrame method groupby() can help with that. It splits the DataFrame into groups based on certain criteria. We can then perform aggregate operations on each group.

Suppose we want to know - what are the total quarterly sales for each product category?

We can answer this question using the below code. Here’s what it does:

Use groupby() to split sales_df into multiple groups, one for each product category.
Call the aggregate method sum() to add up the sales for each group.
Save the results in a new variable, category_sales.

# Split by 'Category' and add sales for each group
category_sales = sales_df.groupby('Category').sum()
category_sales

	Q3_Sales	Q4_Sales
Category
Beauty	10143	15425
Clothing	27353	35470
Electronics	39347	47393
Home	34991	43244
Sports	30016	25385

Visualize DataFrame

Pandas provides various plotting methods to visualize DataFrames directly. You don’t need to load any visualization library (e.g., Matplotlib). Pandas does that for you behind the scenes.

Plot Bar Chart

Let’s say you want to plot the total quarterly sales for each product category.

That’s easy. We can invoke the bar() method on the DataFrame category_sales we created in the last section:

# Plot total quarterly sales by category
category_sales.plot.bar()

Pandas Beginners Guide: Plot bar chart using pandas.DataFrame.plot.bar()

That’s not a bad-looking plot. Especially considering we wrote just one line of code to generate it!

We can also customize this plot using the formatting parameters. Let’s add the figure title and the axis labels. And make product categories easier to read:

category_sales.plot.bar(
    xlabel='Product Category',
    ylabel='Sales (in dollars)',
    title='FIG 1: Sales by Category',
    rot=45 # rotate xticks by 45 degrees
)

Pandas Beginners Guide: Plot bar chart using pandas.DataFrame.plot.bar() and customize using formatting parameters - xlabel, ylbale, title, rot

The total sales went up from Q3 to Q4 for all the categories except Sports. Let’s look at the data from another angle for a different perspective.

Plot Pie Chart

The above bar plot compares the Q3 and Q4 sales for each category. We could also look at how much each category contributed within each quarter.

Pie charts would be the perfect visualization tool for that. Let’s draw them by invoking the pie() method on the category_sales DataFrame.

We’ll use two key parameters⁴ to get the desired result:

subplots: draw separate pie charts for Q3 and Q4 sales.
autopct: print percentage of sales contributed by each category.

axes = category_sales.plot.pie(
    figsize=(12,6),
    subplots=True, # show Q3 and Q4 side by side
    autopct='%1.2f%%', # show percent of each slice
    legend=None,
    title='FIG 2: Sales by Category - Q3 vs. Q4'
)
# Add x and y axis label for both charts
axes[0].set_xlabel('Q3 Sales')
axes[0].set_ylabel('')
axes[1].set_xlabel('Q4 Sales')
axes[1].set_ylabel('')

Pandas Beginners Guide: Plot bar chart using pandas.DataFrame.plot.pie() and customize using formatting parameters - figsize, subplots, autopct, legend, title

The share of the Sports category went down from 21.16% in Q3 to 15.21% in Q4. That’s a significant drop! Store management should look into it.

Create A New Column

The DataFrame sales_df has the Q3 and Q4 sales data. We might also want to know the percent change in sales from Q3 to Q4.

We can calculate that using the below formula:

\displaystyle Percent \ Change = \left( \frac{Q4 \ Sales - Q3 \ Sales }{Q3 \ Sales} \right) *100

Let’s apply this formula on the sales_df⁵ and store the results in a new column Pct_Change:

# Find the change in sales from Q3 to Q4
diff = sales_df['Q4_Sales'] - sales_df['Q3_Sales']
# Calculate percentage change and add a new column for it
sales_df['Pct_Change'] = (diff*100) / sales_df['Q3_Sales']

And print the first five rows to confirm that sales_df now contains the new column:

sales_df.head()

	Category	Q3_Sales	Q4_Sales	Pct_Change
Department
Haircare	Beauty	1529	2254	47.42
Kitchen	Home	4638	5792	24.88
TVs	Electronics	9270	11215	20.98
Football	Sports	8915	5694	-36.13
Men	Clothing	4497	5906	31.33

Let’s filter the rows where the percent change is negative and sort them:

# Find entries where percent change is negative
# and sorted by low to high
sales_df.query('Pct_Change < 0').sort_values('Pct_Change')

	Category	Q3_Sales	Q4_Sales	Pct_Change
Department
Tennis	Sports	4605	2270	-50.71
Football	Sports	8915	5694	-36.13
Baseball	Sports	8290	6531	-21.22
Refurbished	Electronics	1334	1297	-2.77

These are the worst-performing departments. Three of them are from the Sports category.

That doesn’t surprise us. In the last section, we already saw that the total Sports sales shrunk significantly compared to other categories.

Save DataFrame to CSV File

We’ve made quite a few changes to sales_df since we loaded it from the original CSV file. We can save the updated DataFrame to a new file using the to_csv() method:

# Persist the sales_df DataFrame to a CSV file
# in your current local directory
sales_df.to_csv('retail_sales_profits_v2.csv')

By default, this method will persist the index as well. You can set the parameter index to False if you want to exclude the index.

Summary

Pandas is an essential tool for performing data analysis in Python. This tutorial has provided you with the foundational knowledge of Pandas and its most important features and methods.

Here’s a quick recap of what you learned today:

read_csv(): Load CSV dataset as a DataFrame.
head() & tail(): Take a peek at the DataFrame.
set_index(): Set a column as the index.
info(): Print useful information about the DataFrame.
describe(): Get summary statistics for the numerical data.
unique() & value_counts(): Summarize categorical columns.
rename() & set_axis(): Update column labels.
Select columns using square bracket notation.
drop(): Remove unnecessary columns.
Create new columns based on the existing data.
query(): Filter data using logical expressions.
sort_values() & sort_index(): Sort data using columns or index.
Method chaining: Perform multiple operations sequentially.
groupby(): Group and aggregate data.
Visualize data using Pandas’ built-in functions (e.g., bar() and pie() )
to_csv(): Save DataFrame to a CSV file.

You should now have enough practical skills in Pandas to explore simple datasets on your own.

Check out Pandas read methods to find the details. ↩
Your output may differ from what you see here. That’s because I’ve customized the Pandas’ output using its display options. Here are the options I used:
pd.set_option('display.max_rows', 10)
pd.set_option('display.float_format', '{:,.2f}'.format) ↩
You can learn more about logical expressions and operators here. ↩
Check out this article if you want to customize pie charts further. ↩
Pandas will perform these operations on all the rows in one shot. Learn more about it here. ↩

Title Image by Stone Wang

Pandas Visualization

Comment anonymously

M ↓ Markdown

Commento

Share this article:

What's Popular

Explore More