Python Pandas - Ordering & Sorting Categorical Data



In data analysis, we often work with categorical data, especially in columns with repeated string values such as country names, gender, or ratings. Categorical data is data that can take only a limited number of distinct values. For instance, values like 'India' and 'Australia' in a country column, or "male" and "female" in a gender column, are categorical. These values can also be given an order, which allows for logical sorting.

Categorical is one of the data types in Pandas. It is used to handle variables with a fixed number of possible values, also known as "categories." This type of data is commonly used in statistical analysis. In this tutorial, we will learn how to order and sort categorical data using Pandas.
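As a minimal sketch of what "a fixed number of possible values" looks like in practice, the snippet below builds a small categorical series and inspects it. The ratings data and the variable name are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: a column with a few repeated string values
ratings = pd.Series(["good", "bad", "good", "average", "bad"], dtype="category")

# The distinct values are stored once as categories (inferred in sorted order)
print(list(ratings.cat.categories))   # ['average', 'bad', 'good']

# Each element is stored as an integer code pointing into the categories
print(ratings.cat.codes.tolist())     # [2, 1, 2, 0, 1]

# A categorical is unordered unless an order is requested
print(ratings.cat.ordered)            # False
```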

Ordering Categorical Data

An ordered categorical in Pandas carries a ranking among its categories. This ranking makes operations such as min(), max(), and comparisons meaningful, and sorting follows it. Pandas raises a TypeError if you call min() or max() on unordered categorical data. The Pandas .cat accessor provides the as_ordered() method to convert an unordered categorical into an ordered one.

Example

The following example demonstrates how to create an ordered categorical series using the .cat.as_ordered() method and perform operations such as finding the minimum and maximum values on the ordered categorical series.

import pandas as pd

# Create a categorical series
s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"]).astype(pd.CategoricalDtype())

# Convert the categorical series into ordered using the .cat.as_ordered() method 
s = s.cat.as_ordered()

# Display the ordered categorical series
print('Ordered Categorical Series:\n',s)

# Perform the minimum and maximum operation on ordered categorical series
print('Minimum value of the categorical series:',s.min())
print('Maximum value of the categorical series:', s.max())

Following is the output of the above code −

Ordered Categorical Series: 
0    a
1    b
2    c
3    a
4    a
5    a
6    b
7    b
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

Minimum value of the categorical series: a
Maximum value of the categorical series: c
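The difference between unordered and ordered data can be sketched as follows. This is an illustrative snippet with hypothetical variable names: it shows min() being rejected on an unordered series, and min/max-style operations plus a comparison working after as_ordered().

```python
import pandas as pd

# An unordered categorical series (no ranking between categories)
unordered = pd.Series(["a", "b", "c", "a"], dtype="category")

try:
    unordered.min()
    raised = False
except TypeError as exc:
    raised = True
    print("min() failed on unordered data:", exc)

# After as_ordered(), max() and comparisons against a category are allowed
ordered = unordered.cat.as_ordered()
print(ordered.max())               # c
print((ordered > "a").tolist())    # [False, True, True, False]
```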

Reordering Categories

Pandas allows you to reorder or reset the categories in your categorical data using .cat.reorder_categories() and .cat.set_categories() methods.

  • reorder_categories(): This method reorders the existing categories according to the specified new_categories. The new list must contain exactly the same categories as the old one.

  • set_categories(): This method allows you to define a new set of categories, which may involve adding new categories or removing existing ones. Values whose category is removed become NaN.

Example

The following example demonstrates how to reorder categories using both reorder_categories() and set_categories() methods.

import pandas as pd

# Create a categorical series (categories are inferred as ['a', 'b', 'c'])
s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

# Reorder categories using reorder_categories
s_reordered = s.cat.reorder_categories(["b", "a", "c"], ordered=True)

print("Reordered Categories:\n", s_reordered)

# Set new categories using set_categories
s_new_categories = s.cat.set_categories(["d", "b", "a", "c"], ordered=True)

print("\nNew Categories Set:\n", s_new_categories)

Following is the output of the above code −

Reordered Categories:
0    b
1    a
2    c
3    a
4    b
dtype: category
Categories (3, object): ['b' < 'a' < 'c']

New Categories Set:
0    b
1    a
2    c
3    a
4    b
dtype: category
Categories (4, object): ['d' < 'b' < 'a' < 'c']
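The two methods differ in how strict they are about the category list. The following sketch, using hypothetical variable names, shows set_categories() turning values into NaN when their category is dropped, and reorder_categories() raising a ValueError when the new list does not match the existing categories.

```python
import pandas as pd

s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

# set_categories: values whose category is dropped become NaN
s_dropped = s.cat.set_categories(["a", "b"])
print(s_dropped.isna().tolist())   # [False, False, True, False, False]

# reorder_categories: the new list must contain exactly the existing categories
try:
    s.cat.reorder_categories(["a", "b"])
    reorder_failed = False
except ValueError as exc:
    reorder_failed = True
    print("reorder_categories rejected the list:", exc)
```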

Sorting the Categorical Data

Sorting categorical data arranges the values by the order of their categories rather than by the values themselves. For example, if the categories are defined as ["c", "a", "b"], sorting places all "c" values first, then "a", then "b". If you do not specify the categories explicitly, Pandas infers them in sorted order, so the result matches ordinary lexical (alphabetical or numerical) sorting.

Example

The following example demonstrates how the sorting behaves in Pandas with both unordered and ordered categorical data.

import pandas as pd

# Create a categorical series without any specific order
s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"], dtype="category")

# Sort the categorical series without any predefined order (lexical sorting)
print("Lexical Sorting:\n", s.sort_values())

# Define a custom order for the categories
s = s.cat.set_categories(['c', 'a', 'b'], ordered=True)

# Sort the categorical series with the defined order
print("\nSorted with Defined Category Order:\n", s.sort_values())

Following is the output of the above code −

Lexical Sorting:
0    a
3    a
4    a
5    a
1    b
6    b
7    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Sorted with Defined Category Order:
2    c
0    a
3    a
4    a
5    a
1    b
6    b
7    b
dtype: category
Categories (3, object): ['c' < 'a' < 'b']
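As a further sketch, sort_values() follows the sequence of the categories even when the categorical is not marked as ordered, and ascending=False reverses that sequence. The data below is hypothetical and kept small so the result is easy to verify by hand.

```python
import pandas as pd

# Categories given in a custom sequence, but NOT marked as ordered
s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "a", "b"]))

# sort_values() follows the category sequence, not alphabetical order
print(s.sort_values().tolist())                  # ['c', 'a', 'a', 'b']

# ascending=False reverses that sequence
print(s.sort_values(ascending=False).tolist())   # ['b', 'a', 'a', 'c']
```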

Multi-Column Sorting with Categorical Data

When you sort a DataFrame by several columns, a categorical column is sorted according to its defined category order, while the other columns follow their usual sort order.

Example

In this example, a DataFrame is created with an ordered categorical column "A" and an integer column "B". The DataFrame is sorted first by column "A" according to its category order (Y < Z < X), and then by column "B" within each category.

import pandas as pd

# Create a DataFrame with categorical columns
dfs = pd.DataFrame({
    "A": pd.Categorical(["X", "X", "Y", "Y", "X", "Z", "Z", "X"],
                        categories=["Y", "Z", "X"], ordered=True),
    "B": [1, 2, 1, 2, 2, 1, 2, 1]
})

# Sort by multiple columns
sorted_dfs = dfs.sort_values(by=["A", "B"])

print("Sorted DataFrame:\n", sorted_dfs)

Following is the output of the above code −

Sorted DataFrame:
   A  B
2  Y  1
3  Y  2
5  Z  1
6  Z  2
0  X  1
7  X  1
1  X  2
4  X  2
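Each column in a multi-column sort can also have its own direction. The sketch below, on a small hypothetical DataFrame, passes a list to the ascending parameter of sort_values() so that "A" is sorted ascending by category order while "B" is sorted descending within each category.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Categorical(["X", "Y", "Z", "Y", "X"],
                        categories=["Y", "Z", "X"], ordered=True),
    "B": [1, 2, 1, 1, 2],
})

# Column "A" ascending by category order (Y < Z < X), column "B" descending
result = df.sort_values(by=["A", "B"], ascending=[True, False])
print(result)
```

The row labels of the result come out as 1, 3, 2, 4, 0: the Y rows first (B = 2 before B = 1), then Z, then the X rows.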