Introduction ๐
We recently covered the basics of Pandas. Today, we’ll learn how to work with Excel spreadsheets using Pandas.
Here’s what this article will cover:
- Read Excel file into a DataFrame.
- Load selected columns and skip blank rows or columns.
- Read multiple worksheets from an Excel file.
- Write a DataFrame to an Excel file.
- Write multiple worksheets to an Excel file.
Datasets ๐
We’ll use below Excel files in today’s tutorial:
- largest_cities.xlsx: contains data on the 200 largest US cities.
- ranked_cities.xlsx: has three worksheets with rankings of US cities.
Please download them from here and copy them to the same directory as your Jupyter Notebook (or Python script).
Install OpenPyXL ๐
Pandas uses the OpenPyXL library to interact with Excel files. You can install it using the below pip command:
# Run this from the command line
pip install openpyxl
Now you can launch your notebook and import Pandas as usual:
# import Pandas library
# It'll import openpyxl automatically when required
import pandas as pd
Read Excel File ๐
We’ll load the Excel file largest_cities.xlsx in this section. Here’s a screenshot of this file:
A few important things to notice:
- The first row contains the column headers (City, State, 2023 Population, etc.).
- The first column, City, has the names of the largest US cities. We can use it as an index to identify rows uniquely.
- Excel also assigns alphabetical labels (A, B, C, etc.) to each column (red rectangle). These labels will come in handy, as we’ll see later.
Let’s use Pandas' method read_excel() to load this file. The method will return a Pandas DataFrame:
# Read the Excel file and print the DataFrame
# Pass the full file path as the input
# we have the file in the same directory as the code
# So, below will work
pd.read_excel('largest_cities.xlsx')
City | State | 2023 Population | 2020 Census | Annual Change |
Area mi2 |
Density mi2 |
|
---|---|---|---|---|---|---|---|
0 | New York City | New York | 8,992,908 | 8,804,190 | 0.7% | 300.38 | 29,938 |
1 | Los Angeles | California | 3,930,586 | 3,898,747 | 0.27% | 468.96 | 8,382 |
2 | Chicago | Illinois | 2,761,625 | 2,746,388 | 0.18% | 227.37 | 12,146 |
3 | Houston | Texas | 2,366,119 | 2,304,580 | 0.88% | 640.19 | 3,696 |
4 | Phoenix | Arizona | 1,656,892 | 1,608,139 | 0.99% | 517.67 | 3,201 |
... | ... | ... | ... | ... | ... | ... | ... |
195 | Torrance | California | 147,556 | 147,067 | 0.11% | 20.52 | 7,190 |
196 | Rockford | Illinois | 147,389 | 148,655 | -0.29% | 64.38 | 2,289 |
197 | Visalia | California | 146,466 | 141,384 | 1.17% | 37.92 | 3,862 |
198 | Fullerton | California | 146,155 | 143,617 | 0.58% | 22.43 | 6,516 |
199 | Gainesville | Florida | 146,104 | 141,085 | 1.16% | 63.07 | 2,317 |
200 rows ร 7 columns
Pandas used the first row for column labels (in blue). However, it automatically created an index (in red). Let’s print the index to confirm:
# print DataFrame index
largest_cities.index
RangeIndex(start=0, stop=200, step=1)
It is a numerical index ranging from 0 to 199. We don’t want that. Instead, we should use the City column (in green) as the index.
.
Set Column as Index ๐
We can set a column (or a list of columns) as the index using the DataFrame method set_index():
# set City as the index
largest_cities = largest_cities.set_index('City')
# print the top 5 rows
largest_cities.head()
State | 2023 Population | 2020 Census | Annual Change |
Area mi2 |
Density mi2 |
|
---|---|---|---|---|---|---|
City | ||||||
New York City | New York | 8,992,908 | 8,804,190 | 0.7% | 300.38 | 29,938 |
Los Angeles | California | 3,930,586 | 3,898,747 | 0.27% | 468.96 | 8,382 |
Chicago | Illinois | 2,761,625 | 2,746,388 | 0.18% | 227.37 | 12,146 |
Houston | Texas | 2,366,119 | 2,304,580 | 0.88% | 640.19 | 3,696 |
Phoenix | Arizona | 1,656,892 | 1,608,139 | 0.99% | 517.67 | 3,201 |
The DataFrame looks better. The list of cities is set as the index. Let’s print the index again to confirm:
largest_cities.index
Index(['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',
'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose',
...
'Kent', 'Syracuse', 'Thornton', 'Denton', 'Jackson', 'Torrance',
'Rockford', 'Visalia', 'Fullerton', 'Gainesville'],
dtype='object', name='City', length=200)
You can now use Pandas loc[] to access specific cities using the index labels:
# Get data for a specific city
# using loc and single square brackets
largest_cities.loc['Anchorage']
State Alaska
2023 Population 291,073
2020 Census 291,247
Annual Change -0.02%
Area mi2 1,706.80
Density mi2 171
Name: Anchorage, dtype: object
# Access data for multiple cities
# Notice the use of double square brackets
largest_cities.loc[['Austin', 'Denver', 'Nashville']]
State | 2023 Population | 2020 Census | Annual Change |
Area mi2 |
Density mi2 |
|
---|---|---|---|---|---|---|
City | ||||||
Austin | Texas | 1,013,293 | 961,855 | 1.72% | 319.94 | 3,167 |
Denver | Colorado | 750,130 | 715,522 | 1.56% | 153.29 | 4,893 |
Nashville | Tennessee | 715,913 | 689,447 | 1.25% | 475.54 | 1,505 |
Load Selected Columns ๐
Recall that Excel assigns letter labels (A, B, C, etc.) to each column. The read_excel() method allows you to use these labels to load a specific set of columns.
Let’s see some examples.
List of Columns ๐
The parameter usecols
takes in a list of comma-separated letter labels to load specific columns.
Suppose you only want to load four specific columns - City, State, 2020 Census, and Density mi2. You can do that by passing their corresponding letter labels to the parameter usecols
:
largest_cities = pd.read_excel(
'largest_cities.xlsx',
# load specific columns using Excel letter labels
usecols='A,B,D,G',
# Use the first loaded column as the index
index_col=0
)
largest_cities.head()
State | 2020 Census | Density mi2 | |
---|---|---|---|
City | |||
New York City | New York | 8,804,190 | 29,938 |
Los Angeles | California | 3,898,747 | 8,382 |
Chicago | Illinois | 2,746,388 | 12,146 |
Houston | Texas | 2,304,580 | 3,696 |
Phoenix | Arizona | 1,608,139 | 3,201 |
Consecutive Columns ๐
Suppose you want to load a range of columns (e.g., from A to D). We can use usecols
to specify the start and end columns using a colon (:) like below.
largest_cities = pd.read_excel(
'largest_cities.xlsx',
# Load columns A to D (both inclusive)
usecols='A:D',
# Set the first loaded column as the index
index_col=0
)
# Print the last five rows from the DataFrame
largest_cities.tail()
State | 2023 Population | 2020 Census | |
---|---|---|---|
City | |||
Torrance | California | 147,556 | 147,067 |
Rockford | Illinois | 147,389 | 148,655 |
Visalia | California | 146,466 | 141,384 |
Fullerton | California | 146,155 | 143,617 |
Gainesville | Florida | 146,104 | 141,085 |
Mixed Columns ๐
We can also read specific columns and a range of columns together. The below code loads a range (C:E) and two individual columns using usecols
:
largest_cities = pd.read_excel(
'largest_cities.xlsx',
# load a range and the individual columns
usecols='A,C:E,G',
# set the first loaded column as the index
index_col=0
)
largest_cities.head()
2023 Population | 2020 Census | Annual Change |
Density mi2 |
|
---|---|---|---|---|
City | ||||
New York City | 8,992,908 | 8,804,190 | 0.7% | 29,938 |
Los Angeles | 3,930,586 | 3,898,747 | 0.27% | 8,382 |
Chicago | 2,761,625 | 2,746,388 | 0.18% | 12,146 |
Houston | 2,366,119 | 2,304,580 | 0.88% | 3,696 |
Phoenix | 1,656,892 | 1,608,139 | 0.99% | 3,201 |
Read Excel With Blank Rows & Columns ๐
So far, we’ve worked with a perfect Excel file. However, it’s common to get messy excel files with empty rows and columns.
Let’s see an example - largest_cities_blank_rc.xlsx has the same data we saw in the last section, but the first column (A) and the top four rows are blank:
If you load this file using read_excel(), you’ll see a DataFrame with lots of missing values (NaN / Unnamed):
# Load a file with blank rows and columns
pd.read_excel('largest_cities_blank_rc.xlsx')
Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | |
---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | City | State | 2023 Population | 2020 Census | Annual Change | Area mi2 | Density mi2 |
4 | NaN | New York City | New York | 8,992,908 | 8,804,190 | 0.7% | 300.38 | 29,938 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
199 | NaN | Torrance | California | 147,556 | 147,067 | 0.11% | 20.52 | 7,190 |
200 | NaN | Rockford | Illinois | 147,389 | 148,655 | -0.29% | 64.38 | 2,289 |
201 | NaN | Visalia | California | 146,466 | 141,384 | 1.17% | 37.92 | 3,862 |
202 | NaN | Fullerton | California | 146,155 | 143,617 | 0.58% | 22.43 | 6,516 |
203 | NaN | Gainesville | Florida | 146,104 | 141,085 | 1.16% | 63.07 | 2,317 |
204 rows ร 8 columns
You can skip these empty rows and columns using the below parameters:
- usecols: We already covered this parameter in the last section. Since columns B to H contain the data, we’ll set it to ‘B:H’.
- skiprows: The number of lines to skip at the start of the file. We know the file begins with four blank rows, so let’s set it to 4.
pd.read_excel(
'largest_cities_blank_rc.xlsx',
# Skip the first four rows
skiprows=4,
# Load columns B to H (thus skip A)
usecols='B:H',
# Use the first loaded column as the index
index_col=0
)
State | 2023 Population | 2020 Census | Annual Change |
Area mi2 |
Density mi2 |
|
---|---|---|---|---|---|---|
City | ||||||
New York City | New York | 8,992,908 | 8,804,190 | 0.7% | 300.38 | 29,938 |
Los Angeles | California | 3,930,586 | 3,898,747 | 0.27% | 468.96 | 8,382 |
Chicago | Illinois | 2,761,625 | 2,746,388 | 0.18% | 227.37 | 12,146 |
Houston | Texas | 2,366,119 | 2,304,580 | 0.88% | 640.19 | 3,696 |
Phoenix | Arizona | 1,656,892 | 1,608,139 | 0.99% | 517.67 | 3,201 |
... | ... | ... | ... | ... | ... | ... |
Torrance | California | 147,556 | 147,067 | 0.11% | 20.52 | 7,190 |
Rockford | Illinois | 147,389 | 148,655 | -0.29% | 64.38 | 2,289 |
Visalia | California | 146,466 | 141,384 | 1.17% | 37.92 | 3,862 |
Fullerton | California | 146,155 | 143,617 | 0.58% | 22.43 | 6,516 |
Gainesville | Florida | 146,104 | 141,085 | 1.16% | 63.07 | 2,317 |
200 rows ร 7 columns
The DataFrame looks good now. It doesn’t have any empty rows or columns, and it uses City as the index.
Read Excel With Multiple Sheets ๐
Excel files can have multiple worksheets, each containing a different data set.
Pandas works great with such files. It helps you to:
- Get the names of all the sheets in a file.
- Load a specific sheet or a selected set of sheets.
- Load all sheets.
Let’s see these operations in action. We’ll use the excel file ranked_cities.xlsx, which contains three sheets:
Get Sheet Names ๐
How can you get the names of all the sheets in an Excel file?
Easy! Load the file using the ExcelFile class. Then get the value of the sheet_names
attribute:
# Get the names of all the sheets in an Excel spreadsheet
pd.ExcelFile('ranked_cities.xlsx').sheet_names
['Best Tech Jobs', 'Fastest Growing', 'Worst Traffic']
Load Specific Sheet ๐
By default, the read_excel() method loads the first sheet from the Excel file:
# by default, read_excel() loads the first sheet
ranked_cities = pd.read_excel(
# load the first sheet - 'Best Tech Jobs'
'ranked_cities.xlsx',
# Use the loaded sheet's first column
# as the index
index_col=0
)
ranked_cities.head()
City | State | |
---|---|---|
Rank | ||
1 | Austin | TX |
2 | Raleigh | NC |
3 | San Jose | CA |
4 | Seattle | WA |
5 | San Francisco | CA |
You can load a specific sheet by using the parameter sheet_name
:
ranked_cities = pd.read_excel(
'ranked_cities.xlsx',
# load a specific sheet
sheet_name='Fastest Growing',
# Set the first column as the index
index_col=0
)
ranked_cities.head() # Fastest Growing
City | State | |
---|---|---|
Rank | ||
1 | Seattle | WA |
2 | Austin | TX |
3 | Forth Worth | TX |
4 | Miami | FL |
5 | Denver | CO |
Load selected sheets ๐
The parameter sheet_name
is quite versatile. It also accepts a list of sheet names (or their integer positions).
For example, if you want to load the first and third sheets, you can set sheet_name
to ['Best Tech Jobs', 'Worst Traffic']
, or [0, 2]
. You’ll get back a dictionary containing data from both sheets.
ranked_cities = pd.read_excel(
'ranked_cities.xlsx',
# sheet_name - Sheets to load
# Accepts integer index as well
# so sheet_name=[0, 2] will load the same sheets
sheet_name=['Best Tech Jobs', 'Worst Traffic'],
# Set the first column as the index
# for all loaded sheets
index_col=0
)
# Check return type - it must be a dictionary
type(ranked_cities)
dict
Let’s print the dictionary keys. It’ll be the same sheet names (or integer positions) that you passed in the above read_excel() call.
ranked_cities.keys()
dict_keys(['Best Tech Jobs', 'Worst Traffic'])
The dictionary values will be the DataFrame objects for each sheet. Let’s retrieve them:
# Get DataFrame objects from the dictionary
tech_df = ranked_cities['Best Tech Jobs']
traffic_df = ranked_cities['Worst Traffic']
And print the top 5 entries from both DataFrames:
# Top 5 cities for tech jobs
tech_df.head()
City | State | |
---|---|---|
Rank | ||
1 | Austin | TX |
2 | Raleigh | NC |
3 | San Jose | CA |
4 | Seattle | WA |
5 | San Francisco | CA |
# Top 5 cities with the worst traffic
traffic_df.head()
City | State | Hours | |
---|---|---|---|
Rank | |||
1 | Boston | MA | 164 |
2 | Washington | D.C. | 155 |
3 | Chicago | IL | 138 |
4 | New York City | NY | 133 |
5 | Los Angeles | CA | 128 |
Los Angeles in the fifth position? That’s shocking ๐ฒ. It’s impossible to imagine a city with worse traffic than LA!
Load All Sheets ๐
You can read all the sheets at once by setting sheet_name
to None
:
ranked_cities = pd.read_excel(
'ranked_cities.xlsx',
# Load all the sheets
sheet_name=None,
# Set the first column as
# the index for all sheets
index_col=0
)
The returned dictionary will now have all three sheets:
ranked_cities.keys()
dict_keys(['Best Tech Jobs', 'Fastest Growing', 'Worst Traffic'])
Let’s print the DataFrame containing the fastest growing cities:
ranked_cities['Fastest Growing'].head()
City | State | |
---|---|---|
Rank | ||
1 | Seattle | WA |
2 | Austin | TX |
3 | Forth Worth | TX |
4 | Miami | FL |
5 | Denver | CO |
Write DataFrame to Excel File ๐
Pandas makes it easy to save DataFrames as Excel files. Let’s see it in action.
Suppose you’ve compiled a list of your favorite restaurants in Seattle. And you prepare a DataFrame with their cuisine type and customer ratings:
seattle_restaurants = [
['Bakery Nouveau', 'French', 4.6],
['Pizzeria Credo', 'Italian', 4.6],
['Chan Seattle', 'Korean', 4.4],
['Tilikum Place Cafe', 'European', 4.6],
['Ba Bar Capitol Hill', 'Vietnamese', 4.5]
]
seattle_restaurants_df = pd.DataFrame(
data=seattle_restaurants,
columns=['Restaurant', 'Cuisine', 'Rating']
)
# Print DataFrame contents
seattle_restaurants_df
Restaurant | Cuisine | Rating | |
---|---|---|---|
0 | Bakery Nouveau | French | 4.6 |
1 | Pizzeria Credo | Italian | 4.6 |
2 | Chan Seattle | Korean | 4.4 |
3 | Tilikum Place Cafe | European | 4.6 |
4 | Ba Bar Capitol Hill | Vietnamese | 4.5 |
You can write this DataFrame to an Excel file using the method to_excel().
It’ll persist the index as well. We don’t want to save the numeric index. So we’ll set the parameter index
to False
:
seattle_restaurants_df.to_excel(
'seattle_restaurants.xlsx',
# Don't save the auto-generated numeric index
index=False
)
You should now see the file seattle_restaurants.xlsx in the same directory where you’re running this notebook.
Write Multiple sheets ๐
Suppose you also compile a list of restaurants in New York City:
nyc_restaurants = [
['Piccola Cucina Estiatorio', 'Italian', 4.6],
['Boucherie West Village', 'French', 4.6],
['Mei Jin Ramen', 'Japanese', 4.5],
['Spice Symphony', 'Indian', 4.4],
['Bua Thai Ramen & Robata Grill', 'Thai', 4.5]
]
nyc_restaurants_df = pd.DataFrame(
data=nyc_restaurants,
columns=['Restaurant', 'Cuisine', 'Rating']
)
nyc_restaurants_df
Restaurant | Cuisine | Rating | |
---|---|---|---|
0 | Piccola Cucina Estiatorio | Italian | 4.6 |
1 | Boucherie West Village | French | 4.6 |
2 | Mei Jin Ramen | Japanese | 4.5 |
3 | Spice Symphony | Indian | 4.4 |
4 | Bua Thai Ramen & Robata Grill | Thai | 4.5 |
Now you have two DataFrames with lists of your favorite restaurants in two cities - seattle_restaurants_df
and nyc_restaurants_df
.
You would like to create an Excel file with two sheets - it should have one sheet for each DataFrame. You can do that by using Pandas' class ExcelWriter and DataFrame method to_excel():
# Create an ExcelWriter instance
# and use it as a context manager
with pd.ExcelWriter('cities_restaurants.xlsx') as writer:
# Write Seattle restaurants to a sheet named 'Seattle'
seattle_restaurants_df.to_excel(
writer,
sheet_name='Seattle',
index=False
)
# Write NYC restaurants to a sheet named 'New York City'
nyc_restaurants_df.to_excel(
writer,
sheet_name='New York City',
index=False
)
You’ll now see a new Excel file, cities_restaurants.xlsx, in your current directory. And it’ll have two sheets - Seattle and New York City.
Add Sheet to Existing File ๐
Let’s say you gather a list of restaurants in New Orleans and create a DataFrame for them:
# New Orleans - Favorite restaurants
nola_restaurants = [
['Olde Nola Cookery', 'Cajun & Creole', 4.4],
['Atchafalaya', 'Contemporary', 4.7],
['Southern Candymakers', 'Dessert', 4.7],
['Maรฏs Arepas', 'Colombian', 4.5],
['Katie\'s Restaurant & Bar', 'Cajun & Creole', 4.6]
]
nola_restaurants_df = pd.DataFrame(
data=nola_restaurants,
columns=['Restaurant', 'Cuisine', 'Rating'])
nola_restaurants_df
Restaurant | Cuisine | Rating | |
---|---|---|---|
0 | Olde Nola Cookery | Cajun & Creole | 4.4 |
1 | Atchafalaya | Contemporary | 4.7 |
2 | Southern Candymakers | Dessert | 4.7 |
3 | Maรฏs Arepas | Colombian | 4.5 |
4 | Katie's Restaurant & Bar | Cajun & Creole | 4.6 |
And you want to save this DataFrame as a new sheet in the file we created in the last section - cities_restaurants.xlsx.
To do that, you’ll need to use the class ExcelWriter in append mode (parameter mode='a'
):
# Add a new sheet to the existing file
# Create an instance of ExcelWriter in append mode
with pd.ExcelWriter('cities_restaurants.xlsx', mode='a') as writer:
# Append new sheet
nola_restaurants_df.to_excel(
writer,
sheet_name='New Orleans',
index=False
)
Let’s confirm that the Excel file has the new sheet by printing the sheet names:
# Print the names of all the sheets in the Excel file
pd.ExcelFile('cities_restaurants.xlsx').sheet_names
['Seattle', 'New York City', 'New Orleans']
As expected, the file now contains the new sheet - New Orleans.
Summary ๐
This article showed you how to interact with Excel files using Python and Pandas. Here’s a quick recap of what we learned today:
- Read Excel files using read_excel().
- Load selected columns using the parameter
usecols
. - Skip blank columns and rows with the parameters
usecols
andskiprows
. - Read selected sheets using the parameter
sheet_name
.
- Load selected columns using the parameter
- Write DataFrame to Excel file using to_excel().
- Create an Excel file with multiple sheets with the class ExcelWriter.