
Python Pandas - Ordering & Sorting Categorical Data
In data analysis, we often work with categorical data, especially in columns with repeated string values such as country names, gender, or ratings. Categorical data can take only a limited number of distinct values. For instance, 'India' and 'Australia' in a country column, or "male" and "female" in a gender column, are categorical values. These values can also be given an order, which allows logical sorting.
Categorical is one of the data types in Pandas. It is used to handle variables with a fixed number of possible values, known as "categories." This type of data is common in statistical analysis. In this tutorial, we will learn how to order and sort categorical data using Pandas.
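As a quick illustration of the "fixed number of possible values" idea, the sketch below builds a small categorical series and inspects its categories and the integer codes Pandas stores for each value. The series name and sample values are chosen only for illustration.

```python
import pandas as pd

# A small categorical series; the values are illustrative
countries = pd.Series(
    ["India", "Australia", "India", "India", "Australia"], dtype="category"
)

# The dtype is "category", and only the distinct values are kept as categories
print(countries.dtype)
print(list(countries.cat.categories))

# Each value is stored as an integer code pointing into the categories
print(countries.cat.codes.tolist())
```

When categories are inferred like this, they are stored in sorted order and the series is unordered by default.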
Ordering Categorical Data
In ordered categorical data, the order of the categories has a meaning, which allows you to perform operations such as sorting, min(), max(), and comparisons. Pandas raises a TypeError when you try to apply min() or max() to unordered categorical data. The Pandas .cat accessor provides the as_ordered() method to convert an unordered categorical into an ordered one.
Example
The following example demonstrates how to create an ordered categorical series using the .cat.as_ordered() method and perform operations such as finding the minimum and maximum values on the ordered categorical series.
    import pandas as pd

    # Create a categorical series
    s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"]).astype(pd.CategoricalDtype())

    # Convert the categorical series into ordered using the .cat.as_ordered() method
    s = s.cat.as_ordered()

    # Display the ordered categorical series
    print('Ordered Categorical Series:\n', s)

    # Perform the minimum and maximum operation on ordered categorical series
    print('Minimum value of the categorical series:', s.min())
    print('Maximum value of the categorical series:', s.max())
Following is the output of the above code −
    Ordered Categorical Series:
     0    a
    1    b
    2    c
    3    a
    4    a
    5    a
    6    b
    7    b
    dtype: category
    Categories (3, object): ['a' < 'b' < 'c']
    Minimum value of the categorical series: a
    Maximum value of the categorical series: c
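The TypeError for unordered data can be seen directly. The following is a minimal sketch: the series values and the names unordered, size_type, and ordered are illustrative. It also shows an alternative to as_ordered(), declaring the order up front with pd.CategoricalDtype.

```python
import pandas as pd

# An unordered categorical series (the default for dtype="category")
unordered = pd.Series(["low", "high", "medium", "low"], dtype="category")
print("Is ordered:", unordered.cat.ordered)

# min()/max() are not defined without an order, so Pandas raises TypeError
try:
    unordered.min()
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Declare an explicit order with CategoricalDtype and convert
size_type = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
ordered = unordered.astype(size_type)

# With an order defined, min() and max() follow the category order
print("Minimum:", ordered.min())
print("Maximum:", ordered.max())
```

Unlike as_ordered(), which keeps the existing (here alphabetical) category order, this approach lets you state the intended order explicitly.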
Reordering Categories
Pandas allows you to reorder or reset the categories in your categorical data using the .cat.reorder_categories() and .cat.set_categories() methods.
reorder_categories(): This method reorders the existing categories according to the specified new_categories, which must contain exactly the existing categories.
set_categories(): This method allows you to define a new set of categories, which may involve adding new categories or removing existing ones. Values whose category is removed become NaN.
Example
The following example demonstrates how to reorder categories using both reorder_categories() and set_categories() methods.
    import pandas as pd

    # Create a categorical series (categories are inferred and unordered)
    s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

    # Reorder categories using reorder_categories
    s_reordered = s.cat.reorder_categories(["b", "a", "c"], ordered=True)
    print("Reordered Categories:\n", s_reordered)

    # Set new categories using set_categories
    s_new_categories = s.cat.set_categories(["d", "b", "a", "c"], ordered=True)
    print("\nNew Categories Set:\n", s_new_categories)
Following is the output of the above code −
    Reordered Categories:
     0    b
    1    a
    2    c
    3    a
    4    b
    dtype: category
    Categories (3, object): ['b' < 'a' < 'c']

    New Categories Set:
     0    b
    1    a
    2    c
    3    a
    4    b
    dtype: category
    Categories (4, object): ['d' < 'b' < 'a' < 'c']
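The difference between the two methods shows up when the new list does not match the existing categories. The sketch below, with illustrative variable names, passes an incomplete list to each method: reorder_categories() rejects it, while set_categories() accepts it and turns the values of the dropped category into NaN.

```python
import pandas as pd

s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

# reorder_categories() must receive exactly the existing categories
try:
    s.cat.reorder_categories(["b", "a"], ordered=True)  # "c" is missing
    reorder_failed = False
except ValueError as exc:
    reorder_failed = True
    print("ValueError:", exc)

# set_categories() accepts a different set of categories;
# values whose category was dropped ("c") become NaN
s_dropped = s.cat.set_categories(["b", "a"], ordered=True)
print(s_dropped)
print("Missing values:", s_dropped.isna().sum())
```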
Sorting the Categorical Data
Sorting categorical data arranges the values according to the defined order of the categories rather than the values themselves. For example, if the categories are defined as ["c", "a", "b"], sorting places all "c" values first, then "a", then "b". If you do not specify the categories explicitly, Pandas infers them in sorted order, so sorting appears lexical (alphabetical or numerical).
Example
The following example demonstrates how the sorting behaves in Pandas with both unordered and ordered categorical data.
    import pandas as pd

    # Create a categorical series without any specific order
    s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"], dtype="category")

    # Sort the categorical series without any predefined order (lexical sorting)
    print("Lexical Sorting:\n", s.sort_values())

    # Define a custom order for the categories
    s = s.cat.set_categories(['c', 'a', 'b'], ordered=True)

    # Sort the categorical series with the defined order
    print("\nSorted with Defined Category Order:\n", s.sort_values())
Following is the output of the above code −
    Lexical Sorting:
     0    a
    3    a
    4    a
    5    a
    1    b
    6    b
    7    b
    2    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']

    Sorted with Defined Category Order:
     2    c
    0    a
    3    a
    4    a
    5    a
    1    b
    6    b
    7    b
    dtype: category
    Categories (3, object): ['c' < 'a' < 'b']
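A few variations on the sorting behavior are sketched below, using illustrative variable names: sorting in descending category order, sorting an unordered categorical that has custom categories, and sorting the same values after converting them to plain strings.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a", "b"], dtype="category")
s = s.cat.set_categories(["c", "a", "b"], ordered=True)

# Descending sort reverses the category order: b, then a, then c
desc = s.sort_values(ascending=False)
print(desc.tolist())

# Sorting uses the category order even when ordered=False
unordered = pd.Series(
    pd.Categorical(["a", "b", "c", "a", "b"], categories=["c", "a", "b"])
)
print(unordered.sort_values().tolist())

# After converting to plain strings, the category order is ignored
plain = s.astype(str).sort_values()
print(plain.tolist())
```

The ordered flag therefore matters for min(), max(), and comparisons, while sort_values() relies on the position of each category in the categories list.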
Multi-Column Sorting with Categorical Data
If your DataFrame contains a categorical column, you can sort by it together with other columns. The categorical column is sorted according to the order of its defined categories.
Example
In this example, a DataFrame is created with an ordered categorical column "A" and an integer column "B". The DataFrame is then sorted first by column "A" based on its categorical order, and then by column "B".
    import pandas as pd

    # Create a DataFrame with a categorical column
    dfs = pd.DataFrame({
        "A": pd.Categorical(["X", "X", "Y", "Y", "X", "Z", "Z", "X"],
                            categories=["Y", "Z", "X"], ordered=True),
        "B": [1, 2, 1, 2, 2, 1, 2, 1]
    })

    # Sort by multiple columns
    sorted_dfs = dfs.sort_values(by=["A", "B"])
    print("Sorted DataFrame:\n", sorted_dfs)
Following is the output of the above code −
Sorted DataFrame:
|   | A | B |
|---|---|---|
| 2 | Y | 1 |
| 3 | Y | 2 |
| 5 | Z | 1 |
| 6 | Z | 2 |
| 0 | X | 1 |
| 7 | X | 1 |
| 1 | X | 2 |
| 4 | X | 2 |
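Each column can also be given its own sort direction. The sketch below reuses the DataFrame from the example and passes a list to the ascending parameter of sort_values(), so that "A" follows its category order while "B" is sorted from high to low. The variable name result is illustrative.

```python
import pandas as pd

dfs = pd.DataFrame({
    "A": pd.Categorical(["X", "X", "Y", "Y", "X", "Z", "Z", "X"],
                        categories=["Y", "Z", "X"], ordered=True),
    "B": [1, 2, 1, 2, 2, 1, 2, 1]
})

# "A" ascending in category order (Y, Z, X); "B" descending within each group
result = dfs.sort_values(by=["A", "B"], ascending=[True, False])
print(result)
```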