Title Image - Pandas: A Step by Step Guide for Absolute Beginners

Pandas 101: A Practical Guide for Absolute Beginners

Yashmeet Singh · · 15 minute read
In a hurry to learn Pandas? Here's all you need to know to get started!

Introduction

Pandas is the most popular Python library for data analysis. It can easily handle large, messy datasets you’ll find in the real world. You can use its numerous functions to load and manipulate data and even create graphical plots.

Today, I’ll introduce Pandas basics and its most frequently used features. You’ll learn how to:

  • Load tabular data (e.g., CSV file).
  • Get summary statistics for numerical and categorical data.
  • Manipulate columns - rename, remove, or add new columns.
  • Sort and filter data.
  • Group data and calculate aggregated results.
  • Visualize data using Pandas (e.g., bar and pie charts).

The Retail Sales Dataset

Before we start, let’s look at the dataset we’ll use in this article.

The retail sales dataset contains a fictitious store’s sales and profits for two consecutive quarters (Q3 and Q4). It breaks down this data by store departments and categories.

Here’s a screenshot of the entire dataset:

Pandas Beginners Guide: Retail Sales Dataset Screenshot

The first row contains the column labels. The store departments listed in the first column are unique. So we would like to use them as the index.

The data is stored in a CSV file named retail_sales_profits.csv. Let’s load and analyze it using Pandas.

Load the Dataset

First, we need to import Pandas. We usually do that using the alias pd:

import pandas as pd

You can use Pandas to load data from a variety of sources (e.g., CSV, Excel, SQL databases, HTML tables)1.
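
For example, the equivalent read calls for a few other formats look like this (the file name, URL, and connection below are hypothetical placeholders, not part of this article's dataset):

# Hypothetical sources - replace with your own file, URL, or connection
excel_df = pd.read_excel('retail_sales_profits.xlsx')         # Excel workbook
html_dfs = pd.read_html('https://example.com/sales-table')    # list of tables found on an HTML page
sql_df = pd.read_sql('SELECT * FROM sales', db_connection)    # SQL query via a database connection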

Since our dataset is a CSV file, we’ll use the method read_csv(). It’ll return a DataFrame object. A DataFrame is the 2-dimensional, labeled data structure Pandas uses to store tabular data.

# Read the CSV file into the 'sales_df' object (Sales DataFrame)
sales_df = pd.read_csv('retail_sales_profits.csv')

Let’s print the DataFrame object:2

sales_df
Store Department Store Category Q3 Sales ($) Q3 Profit ($) Q4 Sales ($) Q4 Profit ($)
0 Haircare Beauty 1529 395 2254 460
1 Kitchen Home 4638 1099 5792 1594
2 TVs Electronics 9270 1394 11215 1651
3 Football Sports 8915 2280 5694 1864
4 Men Clothing 4497 1210 5906 1902
... ... ... ... ... ... ...
20 Baseball Sports 8290 3358 6531 2400
21 Decor Home 3280 694 4592 907
22 Skincare Beauty 2385 640 3506 875
23 Bedding Home 2799 921 3886 1180
24 Women Clothing 9796 3159 12290 4519

25 rows × 6 columns

Pandas used the first row from the CSV file for column labels (in blue). However, it automatically added an index (in red). Let’s print it:

# Print the DataFrame index
sales_df.index
RangeIndex(start=0, stop=25, step=1)

It’s a numerical index ranging from 0 to 24. It’s not going to be helpful in our analysis. Instead, we want to set the first column, Store Department, as the index.

Let’s look at two ways to do that.

Set Index While Loading Data

You can use read_csv()’s parameter index_col to set a column as index:

sales_df = pd.read_csv(
    'retail_sales_profits.csv', # the CSV file to load
    index_col='Store Department' # column to set as the index
)
sales_df
Store Category Q3 Sales ($) Q3 Profit ($) Q4 Sales ($) Q4 Profit ($)
Store Department
Haircare Beauty 1529 395 2254 460
Kitchen Home 4638 1099 5792 1594
TVs Electronics 9270 1394 11215 1651
Football Sports 8915 2280 5694 1864
Men Clothing 4497 1210 5906 1902
... ... ... ... ... ...
Baseball Sports 8290 3358 6531 2400
Decor Home 3280 694 4592 907
Skincare Beauty 2385 640 3506 875
Bedding Home 2799 921 3886 1180
Women Clothing 9796 3159 12290 4519

25 rows × 5 columns

Change Index After Data is Loaded

You can also change the index after you’ve already read the dataset.

The below code reloads the CSV file. That’ll recreate the default integer index that we don’t want:

sales_df = pd.read_csv('retail_sales_profits.csv')
# print the index
sales_df.index
RangeIndex(start=0, stop=25, step=1)

We can use the DataFrame method set_index() to change the index to the correct column:

sales_df = sales_df.set_index('Store Department')
# print the index again
sales_df.index
Index(['Haircare', 'Kitchen', 'TVs', 'Football', 'Men', 'Refurbished',
       'Appliances', 'Headphones', 'Tennis', 'Fragrances', 'Basketball',
       'Children', 'Golf', 'Laptops', 'Tablets', 'Shoes', 'Computers',
       'Furniture', 'Makeup', 'Nails', 'Baseball', 'Decor', 'Skincare',
       'Bedding', 'Women'],
      dtype='object', name='Store Department')

The column Store Department is now set as the index.

Access Rows By Index

Regular Python lists and NumPy arrays provide only a numerical index. Pandas goes one step further. As we saw in the last section, it allows you to set columns with string data as the index.

You can use these string index labels to query the DataFrame using loc[].

For example, the below code snippets use loc[] to access specific Store Departments:

# Access a row (department) using the string index label
sales_df.loc['Haircare']
Store Category    Beauty
Q3 Sales ($)        1529
Q3 Profit ($)        395
Q4 Sales ($)        2254
Q4 Profit ($)        460
Name: Haircare, dtype: object
# Access rows for multiple departments
# Notice the use of double square brackets
sales_df.loc[['Tennis', 'Laptops']]
Store Category Q3 Sales ($) Q3 Profit ($) Q4 Sales ($) Q4 Profit ($)
Store Department
Tennis Sports 4605 952 2270 544
Laptops Electronics 12815 3710 14602 4605
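
If you still prefer positional access, the way plain Python lists work, Pandas also provides iloc[], which takes integer positions instead of labels. A quick sketch:

# Positional access with iloc[]
sales_df.iloc[0]        # first row (Haircare)
sales_df.iloc[[0, 3]]   # first and fourth rows (Haircare and Football)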

Head and Tail

If you want to take a quick peek at the DataFrame, you can use the methods head() and tail(). They’ll print the first and last five DataFrame rows, respectively:

# print first 5 rows
sales_df.head()
Store Category Q3 Sales ($) Q3 Profit ($) Q4 Sales ($) Q4 Profit ($)
Store Department
Haircare Beauty 1529 395 2254 460
Kitchen Home 4638 1099 5792 1594
TVs Electronics 9270 1394 11215 1651
Football Sports 8915 2280 5694 1864
Men Clothing 4497 1210 5906 1902
# print last 5 rows
sales_df.tail()
Store Category Q3 Sales ($) Q3 Profit ($) Q4 Sales ($) Q4 Profit ($)
Store Department
Baseball Sports 8290 3358 6531 2400
Decor Home 3280 694 4592 907
Skincare Beauty 2385 640 3506 875
Bedding Home 2799 921 3886 1180
Women Clothing 9796 3159 12290 4519
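
Both methods also accept an optional number of rows if five isn’t what you want:

# Peek at a custom number of rows
sales_df.head(3)   # first 3 rows
sales_df.tail(2)   # last 2 rows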

We can use the method info() to get useful information about the DataFrame. It prints the index type, columns & their datatypes, and memory usage:

sales_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25 entries, Haircare to Women
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Store Category  25 non-null     object
 1   Q3 Sales ($)    25 non-null     int64
 2   Q3 Profit ($)   25 non-null     int64
 3   Q4 Sales ($)    25 non-null     int64
 4   Q4 Profit ($)   25 non-null     int64
dtypes: int64(4), object(1)
memory usage: 1.7+ KB

Here are a couple of things to note:

  • The Store Category column contains text data. So Pandas stores it as the type object.
  • The other columns contain numerical data, specifically integers. Pandas automatically detects that and stores them as int64 type.

Summarize Numerical Data

The DataFrame method describe() generates basic statistics such as mean, standard deviation, and percentiles for all the numerical columns.

We can use it to print this information for our sales and profit figures:

sales_df.describe()
Q3 Sales ($) Q3 Profit ($) Q4 Sales ($) Q4 Profit ($)
count 25.00 25.00 25.00 25.00
mean 5,674.00 1,711.28 6,676.68 2,303.84
std 3,881.24 1,627.80 4,454.73 2,292.45
min 597.00 78.00 775.00 121.00
25% 2,572.00 640.00 3,506.00 855.00
50% 4,605.00 1,210.00 5,792.00 1,651.00
75% 8,704.00 2,280.00 9,772.00 2,680.00
max 15,570.00 7,408.00 18,690.00 10,215.00

Rename Column Labels

Let’s review the column labels. We can print them using the DataFrame attribute columns:

sales_df.columns
Index(['Store Category', 'Q3 Sales ($)', 'Q3 Profit ($)',
        'Q4 Sales ($)', 'Q4 Profit ($)'],
      dtype='object')

The column labels are too long and contain special characters ($). More importantly, the labels have spaces. That could cause problems when we query data by specific columns.
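
To see why spaces are awkward, consider filtering on one of the current labels. In recent Pandas versions you’d have to escape it with backticks inside query(), which is easy to get wrong (a sketch, assuming backtick support in your Pandas version):

# With the original labels, the column name must be escaped in backticks
sales_df.query("`Q3 Sales ($)` > 5000")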

Luckily, we can rename the columns to anything we like. Let’s see our options.

Relabel Specific Columns

The method rename() allows us to change labels for a subset of columns. The input argument columns takes a dictionary with the current column label as the key and the new column label as the value.

The below code will rename two of the columns:

## Rename specific columns using a dictionary
sales_df = sales_df.rename(columns={
    ## "current column label": "new column label"
    "Store Category": "Category",
    "Q3 Sales ($)": "Q3_Sales"
})
# show the first five rows
sales_df.head()
Category Q3_Sales Q3 Profit ($) Q4 Sales ($) Q4 Profit ($)
Store Department
Haircare Beauty 1529 395 2254 460
Kitchen Home 4638 1099 5792 1594
TVs Electronics 9270 1394 11215 1651
Football Sports 8915 2280 5694 1864
Men Clothing 4497 1210 5906 1902

As expected, only two columns (highlighted) have the updated labels. All other column labels remained unchanged.

Change Labels for All Columns

You might want to update all of the column labels in one shot. You can do that using the set_axis() method with the below inputs:

  • The list of new column names
  • The parameter axis set to 'columns'

Let’s update the labels for all the sales_df columns:

new_column_labels = [
    'Category', 'Q3_Sales', 'Q3_Profit',
    'Q4_Sales', 'Q4_Profit'
]
# rename all column labels
sales_df = sales_df.set_axis(new_column_labels, axis='columns')
# print first five rows
sales_df.head()
Category Q3_Sales Q3_Profit Q4_Sales Q4_Profit
Store Department
Haircare Beauty 1529 395 2254 460
Kitchen Home 4638 1099 5792 1594
TVs Electronics 9270 1394 11215 1651
Football Sports 8915 2280 5694 1864
Men Clothing 4497 1210 5906 1902

All the column labels (highlighted) look better now. They are shorter and don’t have blank spaces anymore.
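
Another common approach, not used in this article, is to assign the list of new labels directly to the columns attribute, which renames the columns in place:

# Alternative: assign directly to the columns attribute (modifies sales_df in place)
sales_df.columns = new_column_labels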

Update Index Name

We’ve fixed the column labels, but the index name, Store Department, is still too long and contains a space. We can update it using the DataFrame property index.name:

# Update the DataFrame index name
sales_df.index.name = 'Department'
 
# Print the bottom five rows
sales_df.tail()
Category Q3_Sales Q3_Profit Q4_Sales Q4_Profit
Department
Baseball Sports 8290 3358 6531 2400
Decor Home 3280 694 4592 907
Skincare Beauty 2385 640 3506 875
Bedding Home 2799 921 3886 1180
Women Clothing 9796 3159 12290 4519

Select Columns

Python and its libraries (e.g., NumPy) widely use square brackets, [ ], to fetch items from collections. Pandas carries on this tradition. Let’s see how it allows you to select one or multiple columns.

Get a Single Column

You can retrieve a specific column by passing the column label in the square brackets. The below code gets the Category column from sales_df:

# Get a single column
sales_df['Category']
Department
Haircare         Beauty
Kitchen            Home
TVs         Electronics
Football         Sports
Men            Clothing
               ...
Baseball         Sports
Decor              Home
Skincare         Beauty
Bedding            Home
Women          Clothing
Name: Category, Length: 25, dtype: object

Retrieve Multiple Columns

You’ll need to use double square brackets, [[]], to fetch multiple columns. Why is that?

The inner square brackets create a list of columns. Then we pass that list to the outer square brackets. That tells Pandas to retrieve all the columns from the list.

Let’s use this technique to get two columns from sales_df:

# Get Q3 and Q4 sales
# - inner brackets create column list - ['Q3_Sales', 'Q4_Sales']
# - outer brackets instruct Pandas to fetch content for
#   all columns from the list ['Q3_Sales', 'Q4_Sales']
sales_df[['Q3_Sales', 'Q4_Sales']]
Q3_Sales Q4_Sales
Department
Haircare 1529 2254
Kitchen 4638 5792
TVs 9270 11215
Football 8915 5694
Men 4497 5906
... ... ...
Baseball 8290 6531
Decor 3280 4592
Skincare 2385 3506
Bedding 2799 3886
Women 9796 12290

25 rows × 2 columns

Summarize Categorical Columns

Recall that we used the describe() method to get basic statistics for numerical columns. How can you similarly summarize columns with categorical data?

Pandas provides two methods to help us out.

Get Unique Values

You can use the unique() method to get all the unique values in a categorical column.

Suppose you want to know all the product categories for our retail store. Here’s how you can do it using unique():

# Get all the unique product categories
# Select the column 'Category' using square bracket notation
# and then call the unique() method
sales_df['Category'].unique()
array(['Beauty', 'Home', 'Electronics', 'Sports', 'Clothing'],
      dtype=object)

Count Unique Values

The unique() method tells us that there are five product categories. What if you want to know how many times each category occurs in the retail dataset?

We can use the method value_counts() to get that information:

# How many times does each product category occur?
# Select the 'Category' column and invoke value_counts()
sales_df['Category'].value_counts()
Electronics    6
Beauty         5
Home           5
Sports         5
Clothing       4
Name: Category, dtype: int64
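
A couple of related helpers you may find handy: nunique() returns just the number of distinct values, and value_counts(normalize=True) returns proportions instead of raw counts:

# Number of distinct categories (just the count, not the values)
sales_df['Category'].nunique()
# Share of rows in each category (fractions instead of counts)
sales_df['Category'].value_counts(normalize=True)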

Drop Columns

Let’s take a look at our sales DataFrame again:

# print first five rows
sales_df.head()
Category Q3_Sales Q3_Profit Q4_Sales Q4_Profit
Department
Haircare Beauty 1529 395 2254 460
Kitchen Home 4638 1099 5792 1594
TVs Electronics 9270 1394 11215 1651
Football Sports 8915 2280 5694 1864
Men Clothing 4497 1210 5906 1902

We’ll examine only the sales for the rest of the article. So there’s no need to keep the profit columns around. We can remove them from the DataFrame using the drop() method.

You’ll need to set the parameter columns to the list of columns you want to remove:

# remove profit columns
sales_df = sales_df.drop(columns=['Q3_Profit', 'Q4_Profit'])
# print first five rows
sales_df.head()
Category Q3_Sales Q4_Sales
Department
Haircare Beauty 1529 2254
Kitchen Home 4638 5792
TVs Electronics 9270 11215
Football Sports 8915 5694
Men Clothing 4497 5906

Looks good. Now we have only the columns we need.

Filter Data

Pandas provides numerous ways to filter DataFrame data. I’ll introduce the query() method because it’s the easiest to understand and results in cleaner code.

The query() method accepts filter conditions built using comparison operators such as >, >=, <, <=, ==, !=. You can combine multiple conditions using logical operators - and, or, not.3

Allow me to show some examples to clarify.

Single Condition

Let’s say you want to find all underperforming Q3 sales. You can search for rows with Q3 sales below $2,000 using the below code:

# find all the entries with Q3 sales less than 2000
sales_df.query('Q3_Sales < 2000')
Category Q3_Sales Q4_Sales
Department
Haircare Beauty 1529 2254
Refurbished Electronics 1334 1297
Fragrances Beauty 1392 2087
Nails Beauty 597 775

The above query filtered the data based on numerical values (sales). We can also filter on columns containing string values.

For example, the below code will get all the rows where the product category equals ‘Electronics’:

# Fetch all the rows from the electronics category
# Note that 'Electronics' must be wrapped in single quotes inside the query string
sales_df.query("Category == 'Electronics'")
Category Q3_Sales Q4_Sales
Department
TVs Electronics 9270 11215
Refurbished Electronics 1334 1297
Headphones Electronics 2572 4017
Laptops Electronics 12815 14602
Tablets Electronics 3506 4759
Computers Electronics 9850 11503

Multiple Conditions

Suppose you want to query the products from the ‘Clothing’ category with Q3 sales exceeding $5,000. We can combine these two conditions using the logical operator and:

# Search rows that match both conditions:
# 1. product category is Clothing
# 2. Q3 sales over $5,000
sales_df.query("Category == 'Clothing' and Q3_Sales > 5000")
Category Q3_Sales Q4_Sales
Department
Children Clothing 5608 7502
Shoes Clothing 7452 9772
Women Clothing 9796 12290
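
query() also understands the or and not operators. For instance, here’s a quick sketch that finds rows matching either of two conditions:

# Rows that are in the Sports category OR had Q4 sales under $2,000
sales_df.query("Category == 'Sports' or Q4_Sales < 2000")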

Sort Data

You can sort DataFrame by column values or index. Let’s see examples of both approaches.

Sort by Columns

Pandas provides the method sort_values() to order the DataFrame rows by column values. It has two key parameters:

  • by: the column or the list of columns to use for sorting.
  • ascending: controls the sort direction. It’s set to True by default (low to high). But you can pass False to sort in descending (high to low) order.

Let’s use this method to sort sales_df by Q3 sales in descending order:

# Sort by Q3 sales, from high to low
sales_df.sort_values(by='Q3_Sales', ascending=False)
Category Q3_Sales Q4_Sales
Department
Furniture Home 15570 18690
Laptops Electronics 12815 14602
Computers Electronics 9850 11503
Women Clothing 9796 12290
TVs Electronics 9270 11215
... ... ... ...
Skincare Beauty 2385 3506
Haircare Beauty 1529 2254
Fragrances Beauty 1392 2087
Refurbished Electronics 1334 1297
Nails Beauty 597 775

25 rows × 4 columns

We can also use multiple columns to sort DataFrames.

The below code sorts sales_df using two columns. First, it orders all the rows by the product category. Then, within the same category, it sorts the rows based on the Q4 sales:

# Order based on the product category and Q4 sales
# Set parameter 'by' to a list of columns
sales_df.sort_values(by=['Category', 'Q4_Sales'])
Category Q3_Sales Q4_Sales
Department
Nails Beauty 597 775
Fragrances Beauty 1392 2087
Haircare Beauty 1529 2254
Skincare Beauty 2385 3506
Makeup Beauty 4240 6803
... ... ... ...
Tennis Sports 4605 2270
Golf Sports 2415 3408
Football Sports 8915 5694
Baseball Sports 8290 6531
Basketball Sports 5791 7482

25 rows × 4 columns
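
You can also mix sort directions when sorting by multiple columns. Pass a list of booleans to ascending, one per column:

# Categories A to Z, but Q4 sales high to low within each category
sales_df.sort_values(by=['Category', 'Q4_Sales'], ascending=[True, False])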

Sort by Index

We can order the DataFrame by the index using the sort_index() method.

The below code sorts sales_df by the index. It uses the ascending parameter to order the rows from high to low index value:

# Order by index from high to low
sales_df.sort_index(ascending=False)
Category Q3_Sales Q4_Sales
Department
Women Clothing 9796 12290
Tennis Sports 4605 2270
Tablets Electronics 3506 4759
TVs Electronics 9270 11215
Skincare Beauty 2385 3506
... ... ... ...
Children Clothing 5608 7502
Bedding Home 2799 3886
Basketball Sports 5791 7482
Baseball Sports 8290 6531
Appliances Home 8704 10284

25 rows × 4 columns

Method Chaining

Method chaining is a standard Python technique that invokes method calls sequentially, with each call operating on the result of the previous one. We can use this technique to string together DataFrame operations as well.

Let’s see this in action.

Suppose you want to perform two operations on sales_df in sequence:

  • Find the rows where the product category is ‘Home’ or ‘Clothing’ and Q4 sales are over $7,500.
  • Then sort the filtered rows by the Q4 sales in descending order.

We can do this by invoking sort_values() directly on the output of query():

# Filter Criteria
## Category is either Home or Clothing
## and Q4 Sales exceed $7500
fc = "Category == ['Home', 'Clothing'] and Q4_Sales > 7500"
 
# Method Chaining
## Filter and then sort the filter output
sales_df.query(fc).sort_values('Q4_Sales', ascending=False)
Category Q3_Sales Q4_Sales
Department
Furniture Home 15570 18690
Women Clothing 9796 12290
Appliances Home 8704 10284
Shoes Clothing 7452 9772
Children Clothing 5608 7502
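
The chain can keep growing as long as each call returns a DataFrame. For example, appending head() would keep only the top three results:

# Filter, sort, then keep the top 3 rows
sales_df.query(fc).sort_values('Q4_Sales', ascending=False).head(3)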

Group and Aggregate Data

So far, we’ve been manipulating data at the row level. However, sometimes we need to perform operations on groups of data.

The DataFrame method groupby() can help with that. It splits the DataFrame into groups based on certain criteria. We can then perform aggregate operations on each group.

Suppose we want to answer this question: what are the total quarterly sales for each product category?

We can answer this question using the below code. Here’s what it does:

  • Use groupby() to split sales_df into multiple groups, one for each product category.
  • Call the aggregate method sum() to add up the sales for each group.
  • Save the results in a new variable, category_sales.

# Split by 'Category' and add sales for each group
category_sales = sales_df.groupby('Category').sum()
category_sales
Q3_Sales Q4_Sales
Category
Beauty 10143 15425
Clothing 27353 35470
Electronics 39347 47393
Home 34991 43244
Sports 30016 25385
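
sum() is just one of many aggregations. If you wanted, say, both the total and the average sales per category, you could pass a list of functions to agg() (a quick sketch):

# Multiple aggregations per category
sales_df.groupby('Category')[['Q3_Sales', 'Q4_Sales']].agg(['sum', 'mean'])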

Visualize DataFrame

Pandas provides various plotting methods to visualize DataFrames directly. You don’t need to import a visualization library (e.g., Matplotlib) yourself. Pandas does that for you behind the scenes.

Plot Bar Chart

Let’s say you want to plot the total quarterly sales for each product category.

That’s easy. We can invoke the plot.bar() method on the DataFrame category_sales we created in the last section:

# Plot total quarterly sales by category
category_sales.plot.bar()

Pandas Beginners Guide: Plot bar chart using pandas.DataFrame.plot.bar()

That’s not a bad-looking plot. Especially considering we wrote just one line of code to generate it!

We can also customize this plot using the formatting parameters. Let’s add the figure title and the axis labels. And make product categories easier to read:

category_sales.plot.bar(
    xlabel='Product Category',
    ylabel='Sales (in dollars)',
    title='FIG 1: Sales by Category',
    rot=45 # rotate xticks by 45 degrees
)

Pandas Beginners Guide: Plot bar chart using pandas.DataFrame.plot.bar() and customize using formatting parameters - xlabel, ylabel, title, rot

The total sales went up from Q3 to Q4 for all the categories except Sports. Let’s look at the data from another angle.

Plot Pie Chart

The above bar plot compares the Q3 and Q4 sales for each category. We could also look at how much each category contributed within each quarter.

Pie charts would be the perfect visualization tool for that. Let’s draw them by invoking the plot.pie() method on the category_sales DataFrame.

We’ll use two key parameters4 to get the desired result:

  • subplots: draw separate pie charts for Q3 and Q4 sales.
  • autopct: print percentage of sales contributed by each category.

axes = category_sales.plot.pie(
    figsize=(12,6),
    subplots=True, # show Q3 and Q4 side by side
    autopct='%1.2f%%', # show percent of each slice
    legend=None,
    title='FIG 2: Sales by Category - Q3 vs. Q4'
)
# Add x and y axis label for both charts
axes[0].set_xlabel('Q3 Sales')
axes[0].set_ylabel('')
axes[1].set_xlabel('Q4 Sales')
axes[1].set_ylabel('')

Pandas Beginners Guide: Plot pie chart using pandas.DataFrame.plot.pie() and customize using formatting parameters - figsize, subplots, autopct, legend, title

The share of the Sports category went down from 21.16% in Q3 to 15.21% in Q4. That’s a significant drop! Store management should look into it.

Create A New Column

The DataFrame sales_df has the Q3 and Q4 sales data. We might also want to know the percent change in sales from Q3 to Q4.

We can calculate that using the below formula:

$$\text{Percent Change} = \left( \frac{\text{Q4 Sales} - \text{Q3 Sales}}{\text{Q3 Sales}} \right) \times 100$$

Let’s apply this formula to sales_df5 and store the results in a new column, Pct_Change:

# Find the change in sales from Q3 to Q4
diff = sales_df['Q4_Sales'] - sales_df['Q3_Sales']
# Calculate percentage change and add a new column for it
sales_df['Pct_Change'] = (diff*100) / sales_df['Q3_Sales']

And print the first five rows to confirm that sales_df now contains the new column:

sales_df.head()
Category Q3_Sales Q4_Sales Pct_Change
Department
Haircare Beauty 1529 2254 47.42
Kitchen Home 4638 5792 24.88
TVs Electronics 9270 11215 20.98
Football Sports 8915 5694 -36.13
Men Clothing 4497 5906 31.33

Let’s filter the rows where the percent change is negative and sort them:

# Find entries where percent change is negative
# and sort them from low to high
sales_df.query('Pct_Change < 0').sort_values('Pct_Change')
Category Q3_Sales Q4_Sales Pct_Change
Department
Tennis Sports 4605 2270 -50.71
Football Sports 8915 5694 -36.13
Baseball Sports 8290 6531 -21.22
Refurbished Electronics 1334 1297 -2.77

These are the worst-performing departments. Three of them are from the Sports category.

That doesn’t surprise us. In the last section, we already saw that the total Sports sales shrank significantly compared to other categories.

Save DataFrame to CSV File

We’ve made quite a few changes to sales_df since we loaded it from the original CSV file. We can save the updated DataFrame to a new file using the to_csv() method:

# Persist the sales_df DataFrame to a CSV file
# in your current local directory
sales_df.to_csv('retail_sales_profits_v2.csv')

By default, this method will persist the index as well. You can set the parameter index to False if you want to exclude the index.
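
For example, here’s a sketch that saves the data without the index, and also reloads the file we just wrote, restoring Department as the index (the first file name is just an illustration):

# Save without the index (hypothetical file name)
sales_df.to_csv('retail_sales_no_index.csv', index=False)
# Reload the file we saved above, restoring the 'Department' index
reloaded_df = pd.read_csv('retail_sales_profits_v2.csv', index_col='Department')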

Summary

Pandas is an essential tool for performing data analysis in Python. This tutorial has provided you with the foundational knowledge of Pandas and its most important features and methods.

Here’s a quick recap of what you learned today:

  • read_csv(): Load CSV dataset as a DataFrame.
  • head() & tail(): Take a peek at the DataFrame.
  • set_index(): Set a column as the index.
  • info(): Print useful information about the DataFrame.
  • describe(): Get summary statistics for the numerical data.
  • unique() & value_counts(): Summarize categorical columns.
  • rename() & set_axis(): Update column labels.
  • Select columns using square bracket notation.
  • drop(): Remove unnecessary columns.
  • Create new columns based on the existing data.
  • query(): Filter data using logical expressions.
  • sort_values() & sort_index(): Sort data using columns or index.
  • Method chaining: Perform multiple operations sequentially.
  • groupby(): Group and aggregate data.
  • Visualize data using Pandas’ built-in functions (e.g., bar() and pie()).
  • to_csv(): Save DataFrame to a CSV file.

You should now have enough practical skills in Pandas to explore simple datasets on your own.

Footnotes

  1. Check out Pandas read methods to find the details.

  2. Your output may differ from what you see here. That’s because I’ve customized Pandas’ output using its display options. Here are the options I used:
    pd.set_option('display.max_rows', 10)
    pd.set_option('display.float_format', '{:,.2f}'.format)

  3. You can learn more about logical expressions and operators here.

  4. Check out this article if you want to customize pie charts further.

  5. Pandas will perform these operations on all the rows in one shot. Learn more about it here.

Title Image by Stone Wang