Apart from selecting data from row/column labels or integer location, Pandas also has a very useful feature that allows selecting data based on boolean index, i.e. True or False. This is boolean indexing in Pandas. It is one of the most useful feature that quickly filters out useless data from dataframe.
Believe me, it has lots of use cases and is helpful to select subset of data based on actual value of data instead of its index or location.
Use Case
Let’s say that you have a vast set of an ecommerce data. And you want answers for couple of queries like:
- List of all userid and the total payment received from them in the year 2019.
- All userids which have bought for more than 10,000 Rs. of value in 2018 but not bought at all in 2019.
- List of all userids who have bought on sales day.
- List of all purchaser from ‘California’.
Each of the above queries have a logical criteria that must be checked one row at a time.
If you manually answer the above queries, you would need to scan each row and determine whether the row meets the criteria or not. If the row meets the criteria, then it is kept and if not, then it is discarded.
But by using Boolean indexing in Pandas it is so easy to answer
Introduction to Boolean Indexing in Pandas
The Boolean values like ‘True’ and ‘False’ can be used as index in Pandas DataFrame. It can also be used to filter out the required records.
In this indexing, instead of column/row labels, we use a Boolean vector to filter the data.
There are 4 ways to filter the data:
- Accessing a DataFrame with a Boolean index.
- Applying a Boolean mask to a DataFrame.
- Masking data based on column value.
- Masking data based on index value.
Accessing Pandas DataFrame with a Boolean Index
To access a DataFrame with a Boolean index, we need to create a DataFrame in which index contains a Boolean values ‘True’ or ‘False’.
#Creating a dataframe with dictionary
import pandas as pd
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
'ID':[90, 40, 80, 98, 50],
'Salary':[90000, 65000, 50000, 92000, 50000]}
# giving a index to a DataFrame
df = pd.DataFrame(dict, index = [True, False, True, False, True])
df
#Output
Name Dept ID Salary
True Joe Manager 90 90000
False Parth Dev 40 65000
True Sunny Admin 80 50000
False Lauren Manager 98 92000
True Lara Admin 50 50000
We can access a DataFrame using three functions .loc[]
, .iloc[]
, .ix[]
.
#Accessing a Dataframe with a boolean index using .loc[] function
df.loc[True]
#Output
Name Dept ID Salary
True Joe Manager 90 90000
True Sunny Admin 80 50000
True Lara Admin 50 50000
Now let’s try to access the DataFrame with .iloc[] function. .iloc[] is used to select rows and columns by number, in the order that they appear in the DataFrame.
.iloc[] function accepts only integer value as argument, so if we try to pass Boolean value [‘True’, or ‘False’] then it will throw a TypeError.
#Accessing a Dataframe with a boolean index using .iloc[] function
df.iloc[True]
#Output
TypeError
#Accessing a Dataframe with a boolean index using .iloc[] function
df.iloc[1]
#Output
Name Parth
Dept Dev
ID 40
Salary 65000
dtype: object
.ix[] function is a hybrid of .loc and .iloc, but this function indexer is deprecated and will be removed someday future, but not on the next release of pandas.
Applying a Boolean mask to Pandas DataFrame
We can apply a Boolean mask by giving list of True and False of the same length as contain in a DataFrame.
On applying a Boolean mask it will print only that DataFrame in which we pass a Boolean value True.
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
'ID':[90, 40, 80, 98, 50],
'Salary':[90000, 65000, 50000, 92000, 50000]}
# giving a index to a dataframe
df = pd.DataFrame(dict, index = [0,1,2,3,4])
df[[True,False,True,False,True]]
#Output
Name Dept ID Salary
0 Joe Manager 90 90000
2 Sunny Admin 80 50000
4 Lara Admin 50 50000
Masking Data Based on Column Value
By this way, we can filter a data based on a column value by applying certain condition on dataframe using different operator like ==
, >
, <
, <=
, >=
.
When we apply these operators on the data frame, then it produces a Series of True and False.
Now, let’s take a condition where we will filter out all the data in which the Department is ‘Manager’.
#Creating a dataframe with dictionary
import pandas as pd
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
'ID':[90, 40, 80, 98, 50],
'Salary':[90000, 65000, 50000, 92000, 50000]}
df = pd.DataFrame(dict)
# Filtering the data whose department is 'Manager'
df.Dept == 'Manager'
#Output
0 True
1 False
2 False
3 True
4 False
Name: Dept, dtype: bool
Here, we are getting the result dataset in which it returns True if Department is Manager. Otherwise, it returns False.
Based on Department, we can select all the other columns data or even complete DataFrame.
Let’s see how we can display this.
#Wrapping the data in the DataFrame
df[df.Dept == 'Manager']
#Output
Name Dept ID Salary
0 Joe Manager 90 90000
3 Lauren Manager 98 92000
As we can see that the complete DataFrame containing only information in which the Department is ‘Manager’.
We can also use Multiple conditions to filter the data.
For example, getting the data of an employee whose department is manager and salary is greater than 90000.
df[(df.Dept == 'Manager') & (df.Salary > 90000)]
#Output
Name Dept ID Salary
3 Lauren Manager 98 92000
Here, we have added an & operator to add multiple conditions.
Masking Data Based on Index Value
In a DataFrame to filter a data based on a column value, we can create a mask based on the index values using different operator like ==
, >
, <
, etc..
#Creating a DataFrame with dictionary
import pandas as pd
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
'ID':[90, 40, 80, 98, 50],
'Salary':[90000, 65000, 50000, 92000, 50000]}
# giving a index to a dataframe
df = pd.DataFrame(dict, index = [0,1,2,3,4])
mask = df.index == 2
df[mask]
#Output
Name Dept ID Salary
2 Sunny Admin 80 50000
#Another Example
dict= {'Name':["Joe", "Parth", "Sunny", "Lauren", "Lara"],
'Dept': ["Manager", "Dev", "Admin", "Manager", "Admin"],
'ID':[90, 40, 80, 98, 50],
'Salary':[90000, 65000, 50000, 92000, 50000]}
# giving a index to a DataFrame
df = pd.DataFrame(dict, index = [0,1,2,3,4])
# filtering data on index value
mask = df.index > 1
df[mask]
#Output
Name Dept ID Salary
2 Sunny Admin 80 50000
3 Lauren Manager 98 92000
4 Lara Admin 50 50000
Summary
In this tutorial, we learned 4 ways to filter the data. We have also learned how to index a DataFrame with both the loc and iloc methods and how we can use boolean arrays to index or specify our rows and columns.
You may also like
C P Gupta is a YouTuber and Blogger. He is expert in Microsoft Word, Excel and PowerPoint. His YouTube channel @pickupbrain is very popular and has crossed 9.9 Million Views.