Advanced Indexing & Data Wrangling in Pandas

December 15, 2025 · 6 min read

Overview

Real-world data is messy and comes in every shape imaginable. Once you’ve imported your data, you need to reshape, pivot, merge, and transform it into something useful. These advanced pandas techniques are what separate beginners from professionals.

Selecting & Indexing with isin()

Filter your data to focus on what matters.

Basic Filtering

import pandas as pd
import numpy as np

# Select rows based on conditions
df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasília', 'Brasília'],
    'Population': [11190846, 1303171035, 207847528, 207847528]
})
# Find rows where Country is in a list
selected = df[df.Country.isin(['Belgium', 'Brazil'])]

# The column selections below use df3, a numeric DataFrame
# built in the Pivoting section further down
# Select columns with any values > 1
df3.loc[:, (df3 > 1).any()]
# Select columns with all values > 1
df3.loc[:, (df3 > 1).all()]
# Select columns with NaN values
df3.loc[:, df3.isnull().any()]
# Select columns without NaN values
df3.loc[:, df3.notnull().all()]

Advanced Selection

# Keep only the listed columns (no error if some are missing)
df.filter(items=['Country', 'Capital'])
# Select rows with a callable (df.select() was removed in pandas 1.0)
df.loc[lambda d: d.index % 5 == 0]  # Select every 5th row
# Query the DataFrame (df6, built in the MultiIndexing section below,
# has index levels 'first' and 'second')
df6.query('second > first')

Where & Query

Subset your data with powerful querying.

# Sample Series to subset
s = pd.Series([-2, 1, 7, 3], index=['a', 'b', 'c', 'd'])
# Using where() to subset data (non-matching entries become NaN)
s.where(s > 0)
# Query with string expressions
result = df.query('Population > 200000000')
# Multiple conditions
large_countries = df.query('Population > 200000000 and Country != "India"')
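
A quick sketch of the difference in behavior: boolean indexing drops the non-matching rows, while where() keeps the original shape and masks non-matches with NaN (using the sample s defined above).

s[s > 0]        # only the matching rows: b, c, d
s.where(s > 0)  # same length as s; 'a' becomes NaN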

Reshaping Data

Transform data between long and wide formats—critical for analysis and visualization.

Setting & Resetting Index

# Set a column as the index
df.set_index('Country')
# Reset index to get default integer index
df4 = df.reset_index()
# Rename index and columns (index=str casts index labels to strings)
df = df.rename(
    index=str,
    columns={
        'Country': 'cntry',
        'Capital': 'cptl',
        'Population': 'ppltn'
    }
)

Pivoting

Spread rows into columns for cross-tabulation.

# Create sample data ('ME' = month end; 'M' before pandas 2.2)
df2 = pd.DataFrame({
    'Date': pd.date_range('2000-01-01', periods=12, freq='ME'),
    'Type': ['A', 'B', 'C'] * 4,
    'Value': np.random.randn(12)
})
# Pivot the DataFrame
df3 = df2.pivot(
    index='Date',
    columns='Type',
    values='Value'
)
# Pivot table with aggregation
df4 = pd.pivot_table(
    df2,
    values='Value',
    index='Date',
    columns='Type',
    aggfunc='mean'  # 'mean' is the default
)
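
One practical difference worth knowing: pivot() raises a ValueError if any (index, columns) pair occurs more than once, while pivot_table() resolves duplicates by aggregating. A minimal sketch, where dup is a hypothetical frame with a repeated pair:

dup = pd.DataFrame({
    'Date': ['2000-01', '2000-01'],
    'Type': ['A', 'A'],
    'Value': [1.0, 3.0]
})
# dup.pivot(index='Date', columns='Type', values='Value')  # ValueError: duplicate entries
pd.pivot_table(dup, index='Date', columns='Type',
               values='Value', aggfunc='mean')  # A = 2.0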

Stack & Unstack

# Stack: pivot a level of the column labels into the index
# (df5 is the MultiIndex DataFrame built in the MultiIndexing section)
stacked = df5.stack()
# Unstack: pivot a level of the index labels back into columns
unstacked = stacked.unstack()
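
For a self-contained illustration, here is a small round trip; wide is a hypothetical 2x2 frame, not from the examples above.

wide = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
long = wide.stack()          # MultiIndex Series: (x, A), (x, B), (y, A), (y, B)
round_trip = long.unstack()  # back to the original 2x2 shape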

Melt: Gather Columns into Rows

Transform wide format to long format.

# Melt the DataFrame: 'Type' and 'Value' are gathered into
# variable/Observations pairs, with 'Date' repeated for each
melted = pd.melt(
    df2,
    id_vars=['Date'],
    value_vars=['Type', 'Value'],
    value_name='Observations'
)
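
A melt that may feel more natural is undoing the pivot from earlier: taking the wide df3 back to the long layout df2 started with. A sketch:

long_again = df3.reset_index().melt(
    id_vars='Date',
    var_name='Type',
    value_name='Value'
)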

MultiIndexing

Create hierarchical indexes for complex data.

# Create MultiIndex from arrays
arrays = [np.array([1, 2, 3]), np.array([5, 4, 3])]
df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)
# Create MultiIndex from tuples
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(
    tuples,
    names=['first', 'second']
)
df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
# Set multiple columns as index
df2.set_index(['Date', 'Type'])
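
Once the hierarchical index exists, selection works level by level. A short sketch using df6 from above:

df6.loc[1]                   # rows whose 'first' level equals 1
df6.xs(4, level='second')    # cross-section on the 'second' level
df6.query('second > first')  # level names work in query expressions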

Combining Data

Merge, join, and concatenate datasets—essential for real-world analysis.

Merge

Combine DataFrames on common columns.

# Prepare sample data
data1 = pd.DataFrame({'X1': ['A', 'B', 'C'], 'Y1': [1, 2, 3]})
data2 = pd.DataFrame({'X1': ['A', 'C', 'D'], 'Z1': [10, 20, 30]})
# Left merge (keep all from left)
result = pd.merge(data1, data2, how='left', on='X1')
# Right merge (keep all from right)
result = pd.merge(data1, data2, how='right', on='X1')
# Inner merge (keep only common)
result = pd.merge(data1, data2, how='inner', on='X1')
# Outer merge (keep all from both)
result = pd.merge(data1, data2, how='outer', on='X1')
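
When a merge misbehaves, it helps to see where each row came from; pd.merge accepts an indicator flag that adds a _merge column for exactly this. A sketch with the frames above (audit is just an illustrative name):

audit = pd.merge(data1, data2, how='outer', on='X1', indicator=True)
# audit['_merge'] is 'left_only', 'right_only', or 'both' for each row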

Join

# join() aligns on the index, so set the shared key as index first
# (data1 and data2 both have an 'X1' column, which plain join() would reject)
joined = data1.set_index('X1').join(data2.set_index('X1'), how='right')

Concatenate

# Append one Series to another (Series.append() was removed in pandas 2.0)
pd.concat([s, s2])
# Concatenate along columns (s2 is created in the Reindexing section below)
result = pd.concat([s, s2], axis=1, keys=['One', 'Two'])
# Concatenate with a specific join method
result = pd.concat([data1, data2], axis=1, join='inner')

Reindexing

Reorder or add new labels to your data.

# Reindex with new labels (labels not present become NaN)
s2 = s.reindex(['a', 'c', 'd', 'e', 'b'])
# Forward fill missing values
df.reindex(range(4), method='ffill')
# Backward fill missing values (filling needs a monotonic, comparable index)
s_num = pd.Series([10, 20, 30], index=[0, 2, 4])
s3 = s_num.reindex(range(5), method='bfill')

Missing Data

Handle missing values like a pro.

# Drop NaN values
df.dropna()
# Fill NaN with a precomputed value (here, each column's mean)
df3.fillna(df3.mean())
# Fill forward (copy previous value; fillna(method='ffill') is deprecated)
df.ffill()
# Fill backward (copy next value)
df.bfill()
# Replace values
df2.replace('a', 'f')
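
For numeric series, interpolation is often a better fit than blunt filling; interpolate() defaults to linear interpolation between known points. A minimal sketch with a hypothetical gaps Series:

gaps = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
gaps.interpolate()  # fills the gaps with 2.0 and 4.0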

Duplicate Data

Identify and remove duplicates efficiently.

# Return unique values
s3.unique()
# Check for duplicates
df2.duplicated('Type')
# Drop duplicates
df2.drop_duplicates('Type', keep='last')
# Check index duplicates
df.index.duplicated()

Date Handling

Work with dates and time series data.

# Convert column to datetime
df2['Date'] = pd.to_datetime(df2['Date'])
# Create date range ('ME' = month end; 'M' before pandas 2.2)
df2['Date'] = pd.date_range('2000-01-01', periods=12, freq='ME')
# Create DatetimeIndex
from datetime import datetime
dates = [datetime(2012, 5, 1), datetime(2012, 5, 2)]
index = pd.DatetimeIndex(dates)
# Create date range with an explicit end ('BME' = business month end)
end = datetime(2012, 12, 31)
index = pd.date_range(datetime(2012, 2, 1), end, freq='BME')
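
Once a Series has a DatetimeIndex, resampling comes along for free; for instance, downsampling hypothetical daily data to monthly means:

daily = pd.Series(
    np.arange(60, dtype=float),
    index=pd.date_range('2024-01-01', periods=60, freq='D')
)
monthly = daily.resample('ME').mean()  # one mean per calendar month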

Grouping Data

Aggregate, transform, and apply operations on groups.

Aggregation

# Group by columns and aggregate
df2.groupby(by=['Date', 'Type']).mean()
# Group by index level
df4.groupby(level=0).sum()
# Multiple aggregation functions (df4 has columns 'A', 'B', 'C')
df4.groupby(level=0).agg({
    'A': lambda x: sum(x) / len(x),
    'B': 'sum'  # string aliases replace the deprecated np.sum
})
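
Since pandas 0.25 you can also use named aggregation, which names the output columns directly. A sketch on a hypothetical tidy frame:

tidy = pd.DataFrame({'Type': ['A', 'A', 'B'], 'Value': [1.0, 3.0, 5.0]})
tidy.groupby('Type').agg(
    avg_value=('Value', 'mean'),
    total=('Value', 'sum')
)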

Transformation

# Apply a custom function group-wise; transform() keeps the original shape
custom_sum = lambda x: (x + x % 2)
df4.groupby(level=0).transform(custom_sum)
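
The practical payoff: agg() collapses each group to one row, while transform() broadcasts the group result back onto every original row, so it can be assigned as a new column. A sketch on a hypothetical frame g:

g = pd.DataFrame({'Type': ['A', 'A', 'B'], 'Value': [1.0, 3.0, 5.0]})
g['group_mean'] = g.groupby('Type')['Value'].transform('mean')
# g keeps all 3 rows, each annotated with its group's mean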

Visualization

Create quick visualizations for exploration.

import matplotlib.pyplot as plt
# Plot Series
s.plot()
plt.show()
# Plot DataFrame
df2.plot()
plt.show()

Forward & Backward Filling

Fill missing values based on adjacent data.

Forward Filling

# Propagate last valid observation forward
df.reindex(range(4), method='ffill')

Backward Filling

# Use next valid observation to fill the gap (requires a monotonic index)
s3 = s_num.reindex(range(5), method='bfill')

Iteration

Iterate over DataFrames (use sparingly—vectorization is better!).

# Iterate over columns (iteritems() was removed in pandas 2.0; use items())
for column_name, series in df.items():
    # (column name, Series) pairs
    pass
# Iterate over rows
for row_index, series in df.iterrows():
    # (row index, Series) pairs
    pass
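
If you genuinely must loop over rows, itertuples() is usually much faster than iterrows() because it yields lightweight namedtuples instead of building a Series per row:

for row in df.itertuples():
    # namedtuple: row.Index plus one attribute per column
    pass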

Real-World Examples

Example 1: Pivoting Sales Data

# Import sales data
sales = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=12, freq='ME'),
    'Product': ['Widget A', 'Widget B', 'Widget C'] * 4,
    'Revenue': np.random.randint(1000, 10000, 12)
})
# Pivot to see revenue by product over time
revenue_pivot = sales.pivot(
    index='Date',
    columns='Product',
    values='Revenue'
)
# Or use a pivot table for aggregation
revenue_table = pd.pivot_table(
    sales,
    values='Revenue',
    index='Date',
    columns='Product',
    aggfunc='sum'
)

Example 2: Merging Customer Data

# Customer basic info
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
# Customer orders
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 4],
    'order_amount': [100, 200, 150, 75]
})
# Merge with different join types
# Inner: Only customers who have orders
inner = pd.merge(customers, orders, how='inner', on='customer_id')
# Left: All customers, NaN for no orders
left = pd.merge(customers, orders, how='left', on='customer_id')
# Outer: All customers and all orders
outer = pd.merge(customers, orders, how='outer', on='customer_id')

Example 3: Handling Time Series with Missing Data

# Create time series with missing data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
values = [1, 2, np.nan, 4, 5, np.nan, np.nan, 8, 9, 10]
ts = pd.Series(values, index=dates)
# Forward fill (fillna(method='ffill') is deprecated since pandas 2.1)
ts_ffill = ts.ffill()
# Backward fill
ts_bfill = ts.bfill()
# Fill with mean
ts_mean = ts.fillna(ts.mean())
# Drop NaN
ts_clean = ts.dropna()

Quick Reference

Most Common Operations

# Reshape
df.pivot(index=..., columns=..., values=...)
pd.pivot_table(df, index=..., columns=..., values=...)
df.melt(id_vars=..., value_vars=...)
df.stack() / df.unstack()
# Combine
pd.merge(df1, df2, how='left', on='key')
pd.concat([df1, df2], axis=1, join='inner')
df.join(df2, how='outer')
# Group
df.groupby('column').mean()
df.groupby(level=0).sum()
df.groupby('col').transform(func)
# Select
df.loc[condition]
df.query('condition')
df.where(condition)
# Handle missing
df.dropna()
df.fillna(value)
df.ffill()
# Dates
pd.to_datetime(df['Date'])
pd.date_range(start, end, freq='ME')

Pro Tips

Use pivot tables instead of pivot when you need aggregation
Inner joins are fastest—pre-filter before merging large datasets
Forward fill for time series where data flows forward naturally
Backward fill for sensor data or reverse-chronological patterns
MultiIndex unlocks powerful hierarchical analysis
Vectorize operations instead of iterating whenever possible (see the sketch below)
Groupby + transform when you need to preserve original shape
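
To make the vectorization tip concrete, here is a small sketch comparing a row loop with the equivalent column-wise expression; nums is a hypothetical frame, and the vectorized form is typically orders of magnitude faster at scale.

nums = pd.DataFrame({'x': range(1_000), 'y': range(1_000)})
# Slow: Python-level loop over rows
total_loop = sum(row.x * row.y for row in nums.itertuples())
# Fast: one vectorized operation over whole columns
total_vec = (nums['x'] * nums['y']).sum()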