Advanced Indexing & Data Wrangling in Pandas

December 15, 2025 · 6 min read

Overview

Real-world data is messy and comes in every shape imaginable. Once you’ve imported your data, you need to reshape, pivot, merge, and transform it into something useful. These advanced pandas techniques are what separate beginners from professionals.

Selecting & Indexing with isin()

Filter your data to focus on what matters.

Basic Filtering

import pandas as pd
import numpy as np

# Select rows based on conditions
df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasília', 'Brasília'],
    'Population': [11190846, 1303171035, 207847528, 207847528]
})
# Find rows where Country is in a list
selected = df[df.Country.isin(['Belgium', 'Brazil'])]

# The column selections below use df3, a numeric DataFrame
# built in the Pivoting section further down
# Select columns with any values > 1
df3.loc[:, (df3 > 1).any()]
# Select columns with all values > 1
df3.loc[:, (df3 > 1).all()]
# Select columns with NaN values
df3.loc[:, df3.isnull().any()]
# Select columns without NaN values
df3.loc[:, df3.notnull().all()]

Advanced Selection

# Keep only the listed columns (no error if some are missing)
df.filter(items=['Country', 'Capital'])
# Select rows with a callable (df.select() was removed in pandas 1.0)
df.loc[lambda d: d.index % 5 == 0]  # Select every 5th row
# Query the DataFrame (df6, built in the MultiIndexing section below,
# has index levels 'first' and 'second')
df6.query('second > first')

Where & Query

Subset your data with powerful querying.

# Sample Series to subset
s = pd.Series([-2, 1, 7, 3], index=['a', 'b', 'c', 'd'])
# Using where() to subset data (non-matching entries become NaN)
s.where(s > 0)
# Query with string expressions
result = df.query('Population > 200000000')
# Multiple conditions
large_countries = df.query('Population > 200000000 and Country != "India"')
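
A quick sketch of the difference in behavior: boolean indexing drops the non-matching rows, while where() keeps the original shape and masks non-matches with NaN (using the sample s defined above).

s[s > 0]        # only the matching rows: b, c, d
s.where(s > 0)  # same length as s; 'a' becomes NaN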

Reshaping Data

Transform data between long and wide formats—critical for analysis and visualization.

Setting & Resetting Index

# Set a column as the index
df.set_index('Country')
# Reset index to get default integer index
df4 = df.reset_index()
# Rename index and columns (index=str casts index labels to strings)
df = df.rename(
    index=str,
    columns={
        'Country': 'cntry',
        'Capital': 'cptl',
        'Population': 'ppltn'
    }
)

Pivoting

Spread rows into columns for cross-tabulation.

# Create sample data ('ME' = month end; 'M' before pandas 2.2)
df2 = pd.DataFrame({
    'Date': pd.date_range('2000-01-01', periods=12, freq='ME'),
    'Type': ['A', 'B', 'C'] * 4,
    'Value': np.random.randn(12)
})
# Pivot the DataFrame
df3 = df2.pivot(
    index='Date',
    columns='Type',
    values='Value'
)
# Pivot table with aggregation
df4 = pd.pivot_table(
    df2,
    values='Value',
    index='Date',
    columns='Type',
    aggfunc='mean'  # 'mean' is the default
)
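
One practical difference worth knowing: pivot() raises a ValueError if any (index, columns) pair occurs more than once, while pivot_table() resolves duplicates by aggregating. A minimal sketch, where dup is a hypothetical frame with a repeated pair:

dup = pd.DataFrame({
    'Date': ['2000-01', '2000-01'],
    'Type': ['A', 'A'],
    'Value': [1.0, 3.0]
})
# dup.pivot(index='Date', columns='Type', values='Value')  # ValueError: duplicate entries
pd.pivot_table(dup, index='Date', columns='Type',
               values='Value', aggfunc='mean')  # A = 2.0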

Stack & Unstack

# Stack: pivot a level of the column labels into the index
# (df5 is the MultiIndex DataFrame built in the MultiIndexing section)
stacked = df5.stack()
# Unstack: pivot a level of the index labels back into columns
unstacked = stacked.unstack()
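
For a self-contained illustration, here is a small round trip; wide is a hypothetical 2x2 frame, not from the examples above.

wide = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
long = wide.stack()          # MultiIndex Series: (x, A), (x, B), (y, A), (y, B)
round_trip = long.unstack()  # back to the original 2x2 shape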

Melt: Gather Columns into Rows

Transform wide format to long format.

# Melt the DataFrame: 'Type' and 'Value' are gathered into
# variable/Observations pairs, with 'Date' repeated for each
melted = pd.melt(
    df2,
    id_vars=['Date'],
    value_vars=['Type', 'Value'],
    value_name='Observations'
)
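
A melt that may feel more natural is undoing the pivot from earlier: taking the wide df3 back to the long layout df2 started with. A sketch:

long_again = df3.reset_index().melt(
    id_vars='Date',
    var_name='Type',
    value_name='Value'
)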

MultiIndexing

Create hierarchical indexes for complex data.

# Create MultiIndex from arrays
arrays = [np.array([1, 2, 3]), np.array([5, 4, 3])]
df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)
# Create MultiIndex from tuples
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(
    tuples,
    names=['first', 'second']
)
df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
# Set multiple columns as index
df2.set_index(['Date', 'Type'])
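
Once the hierarchical index exists, selection works level by level. A short sketch using df6 from above:

df6.loc[1]                   # rows whose 'first' level equals 1
df6.xs(4, level='second')    # cross-section on the 'second' level
df6.query('second > first')  # level names work in query expressions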

Combining Data

Merge, join, and concatenate datasets—essential for real-world analysis.

Merge

Combine DataFrames on common columns.

# Prepare sample data
data1 = pd.DataFrame({'X1': ['A', 'B', 'C'], 'Y1': [1, 2, 3]})
data2 = pd.DataFrame({'X1': ['A', 'C', 'D'], 'Z1': [10, 20, 30]})
# Left merge (keep all from left)
result = pd.merge(data1, data2, how='left', on='X1')
# Right merge (keep all from right)
result = pd.merge(data1, data2, how='right', on='X1')
# Inner merge (keep only common)
result = pd.merge(data1, data2, how='inner', on='X1')
# Outer merge (keep all from both)
result = pd.merge(data1, data2, how='outer', on='X1')
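
When a merge misbehaves, it helps to see where each row came from; pd.merge accepts an indicator flag that adds a _merge column for exactly this. A sketch with the frames above (audit is just an illustrative name):

audit = pd.merge(data1, data2, how='outer', on='X1', indicator=True)
# audit['_merge'] is 'left_only', 'right_only', or 'both' for each row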

Join

# join() aligns on the index, so set the shared key as index first
# (data1 and data2 both have an 'X1' column, which plain join() would reject)
joined = data1.set_index('X1').join(data2.set_index('X1'), how='right')

Concatenate

# Append one Series to another (Series.append() was removed in pandas 2.0)
pd.concat([s, s2])
# Concatenate along columns (s2 is created in the Reindexing section below)
result = pd.concat([s, s2], axis=1, keys=['One', 'Two'])
# Concatenate with a specific join method
result = pd.concat([data1, data2], axis=1, join='inner')

Reindexing

Reorder or add new labels to your data.

# Reindex with new labels (labels not present become NaN)
s2 = s.reindex(['a', 'c', 'd', 'e', 'b'])
# Forward fill missing values
df.reindex(range(4), method='ffill')
# Backward fill missing values (filling needs a monotonic, comparable index)
s_num = pd.Series([10, 20, 30], index=[0, 2, 4])
s3 = s_num.reindex(range(5), method='bfill')

Missing Data

Handle missing values like a pro.

# Drop NaN values
df.dropna()
# Fill NaN with a precomputed value (here, each column's mean)
df3.fillna(df3.mean())
# Fill forward (copy previous value; fillna(method='ffill') is deprecated)
df.ffill()
# Fill backward (copy next value)
df.bfill()
# Replace values
df2.replace('a', 'f')
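
For numeric series, interpolation is often a better fit than blunt filling; interpolate() defaults to linear interpolation between known points. A minimal sketch with a hypothetical gaps Series:

gaps = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
gaps.interpolate()  # fills the gaps with 2.0 and 4.0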

Duplicate Data

Identify and remove duplicates efficiently.

# Return unique values
s3.unique()
# Check for duplicates
df2.duplicated('Type')
# Drop duplicates
df2.drop_duplicates('Type', keep='last')
# Check index duplicates
df.index.duplicated()

Date Handling

Work with dates and time series data.

# Convert column to datetime
df2['Date'] = pd.to_datetime(df2['Date'])
# Create date range ('ME' = month end; 'M' before pandas 2.2)
df2['Date'] = pd.date_range('2000-01-01', periods=12, freq='ME')
# Create DatetimeIndex
from datetime import datetime
dates = [datetime(2012, 5, 1), datetime(2012, 5, 2)]
index = pd.DatetimeIndex(dates)
# Create date range with an explicit end ('BME' = business month end)
end = datetime(2012, 12, 31)
index = pd.date_range(datetime(2012, 2, 1), end, freq='BME')
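
Once a Series has a DatetimeIndex, resampling comes along for free; for instance, downsampling hypothetical daily data to monthly means:

daily = pd.Series(
    np.arange(60, dtype=float),
    index=pd.date_range('2024-01-01', periods=60, freq='D')
)
monthly = daily.resample('ME').mean()  # one mean per calendar month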

Grouping Data

Aggregate, transform, and apply operations on groups.

Aggregation

# Group by columns and aggregate
df2.groupby(by=['Date', 'Type']).mean()
# Group by index level
df4.groupby(level=0).sum()
# Multiple aggregation functions (df4 has columns 'A', 'B', 'C')
df4.groupby(level=0).agg({
    'A': lambda x: sum(x) / len(x),
    'B': 'sum'  # string aliases replace the deprecated np.sum
})
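
Since pandas 0.25 you can also use named aggregation, which names the output columns directly. A sketch on a hypothetical tidy frame:

tidy = pd.DataFrame({'Type': ['A', 'A', 'B'], 'Value': [1.0, 3.0, 5.0]})
tidy.groupby('Type').agg(
    avg_value=('Value', 'mean'),
    total=('Value', 'sum')
)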

Transformation

# Apply a custom function group-wise; transform() keeps the original shape
custom_sum = lambda x: (x + x % 2)
df4.groupby(level=0).transform(custom_sum)
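
The practical payoff: agg() collapses each group to one row, while transform() broadcasts the group result back onto every original row, so it can be assigned as a new column. A sketch on a hypothetical frame g:

g = pd.DataFrame({'Type': ['A', 'A', 'B'], 'Value': [1.0, 3.0, 5.0]})
g['group_mean'] = g.groupby('Type')['Value'].transform('mean')
# g keeps all 3 rows, each annotated with its group's mean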

Visualization

Create quick visualizations for exploration.

import matplotlib.pyplot as plt
# Plot Series
s.plot()
plt.show()
# Plot DataFrame
df2.plot()
plt.show()

Forward & Backward Filling

Fill missing values based on adjacent data.

Forward Filling

# Propagate last valid observation forward
df.reindex(range(4), method='ffill')

Backward Filling

# Use next valid observation to fill the gap (requires a monotonic index)
s3 = s_num.reindex(range(5), method='bfill')

Iteration

Iterate over DataFrames (use sparingly—vectorization is better!).

# Iterate over columns (iteritems() was removed in pandas 2.0; use items())
for column_name, series in df.items():
    # (column name, Series) pairs
    pass
# Iterate over rows
for row_index, series in df.iterrows():
    # (row index, Series) pairs
    pass
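
If you genuinely must loop over rows, itertuples() is usually much faster than iterrows() because it yields lightweight namedtuples instead of building a Series per row:

for row in df.itertuples():
    # namedtuple: row.Index plus one attribute per column
    pass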

Real-World Examples

Example 1: Pivoting Sales Data

# Import sales data
sales = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=12, freq='ME'),
    'Product': ['Widget A', 'Widget B', 'Widget C'] * 4,
    'Revenue': np.random.randint(1000, 10000, 12)
})
# Pivot to see revenue by product over time
revenue_pivot = sales.pivot(
    index='Date',
    columns='Product',
    values='Revenue'
)
# Or use a pivot table for aggregation
revenue_table = pd.pivot_table(
    sales,
    values='Revenue',
    index='Date',
    columns='Product',
    aggfunc='sum'
)

Example 2: Merging Customer Data

# Customer basic info
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
# Customer orders
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 4],
    'order_amount': [100, 200, 150, 75]
})
# Merge with different join types
# Inner: Only customers who have orders
inner = pd.merge(customers, orders, how='inner', on='customer_id')
# Left: All customers, NaN for no orders
left = pd.merge(customers, orders, how='left', on='customer_id')
# Outer: All customers and all orders
outer = pd.merge(customers, orders, how='outer', on='customer_id')

Example 3: Handling Time Series with Missing Data

# Create time series with missing data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
values = [1, 2, np.nan, 4, 5, np.nan, np.nan, 8, 9, 10]
ts = pd.Series(values, index=dates)
# Forward fill (fillna(method='ffill') is deprecated since pandas 2.1)
ts_ffill = ts.ffill()
# Backward fill
ts_bfill = ts.bfill()
# Fill with mean
ts_mean = ts.fillna(ts.mean())
# Drop NaN
ts_clean = ts.dropna()

Quick Reference

Most Common Operations

# Reshape
df.pivot(index=..., columns=..., values=...)
pd.pivot_table(df, index=..., columns=..., values=...)
df.melt(id_vars=..., value_vars=...)
df.stack() / df.unstack()
# Combine
pd.merge(df1, df2, how='left', on='key')
pd.concat([df1, df2], axis=1, join='inner')
df.join(df2, how='outer')
# Group
df.groupby('column').mean()
df.groupby(level=0).sum()
df.groupby('col').transform(func)
# Select
df.loc[condition]
df.query('condition')
df.where(condition)
# Handle missing
df.dropna()
df.fillna(value)
df.ffill()
# Dates
pd.to_datetime(df['Date'])
pd.date_range(start, end, freq='ME')

Pro Tips

Use pivot tables instead of pivot when you need aggregation
Inner joins are fastest—pre-filter before merging large datasets
Forward fill for time series where data flows forward naturally
Backward fill for sensor data or reverse-chronological patterns
MultiIndex unlocks powerful hierarchical analysis
Vectorize operations instead of iterating whenever possible (see the sketch below)
Groupby + transform when you need to preserve original shape
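
To make the vectorization tip concrete, here is a small sketch comparing a row loop with the equivalent column-wise expression; nums is a hypothetical frame, and the vectorized form is typically orders of magnitude faster at scale.

nums = pd.DataFrame({'x': range(1_000), 'y': range(1_000)})
# Slow: Python-level loop over rows
total_loop = sum(row.x * row.y for row in nums.itertuples())
# Fast: one vectorized operation over whole columns
total_vec = (nums['x'] * nums['y']).sum()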