Advanced Indexing & Data Wrangling in Pandas

Pandas techniques for reshaping, pivoting, merging, and transforming data

Oct 27, 2025 · 6 min read

Real-world data is messy and comes in every shape imaginable. Once you've imported your data, you need to reshape, pivot, merge, and transform it into something useful. These advanced pandas techniques are what separate beginners from professionals.

Pro tip: Bookmark this page! You'll reference these patterns constantly as you work with real datasets.

Selecting & Indexing with isin()

Filter your data to focus on what matters.

Basic Filtering

import pandas as pd
import numpy as np

# Select rows based on conditions
df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasília', 'Brasília'],
    'Population': [11190846, 1303171035, 207847528, 207847528]
})

# Find rows where Country is in a list
selected = df[df.Country.isin(['Belgium', 'Brazil'])]

# df3 is a numeric DataFrame (the pivoted frame built in the Pivoting section below)

# Select columns where any value > 1
df3.loc[:, (df3 > 1).any()]

# Select columns where all values > 1
df3.loc[:, (df3 > 1).all()]

# Select columns that contain NaN values
df3.loc[:, df3.isnull().any()]

# Select columns without any NaN values
df3.loc[:, df3.notnull().all()]

Advanced Selection

# Keep only the listed columns
df.filter(items=['a', 'b'])

# Select every 5th row by position
# (DataFrame.select was removed from pandas; use iloc instead)
df.iloc[::5]

# Query the DataFrame (df6 is the MultiIndexed frame created below)
df6.query('second > first')

Where & Query

Subset your data with powerful querying.

# Using where() to subset data (s is any numeric Series;
# non-matching values become NaN, so the shape is preserved)
s.where(s > 0)

# Query with string expressions
result = df.query('Population > 200000000')

# Multiple conditions
large_countries = df.query('Population > 200000000 and Country != "India"')
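A distinction worth remembering: `where()` keeps the original shape and masks non-matching entries with NaN, while boolean indexing drops them. A minimal self-contained sketch:

```python
import pandas as pd

s = pd.Series([-1, 0, 2, 5])

masked = s.where(s > 0)    # same length, NaN where the condition fails
filtered = s[s > 0]        # shorter: non-matching rows are dropped

print(len(masked))    # 4
print(len(filtered))  # 2
```

Use `where()` when downstream code needs the index to stay aligned, and boolean indexing when you simply want fewer rows.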

Reshaping Data

Transform data between long and wide formats—critical for analysis and visualization.

Setting & Resetting Index

# Set a column as the index
df.set_index('Country')

# Reset index to get the default integer index back
df4 = df.reset_index()

# Rename index and columns
df = df.rename(
    index=str,
    columns={
        'Country': 'cntry',
        'Capital': 'cptl',
        'Population': 'ppltn'
    }
)

Pivoting

Spread rows into columns for cross-tabulation.

# Create sample data
df2 = pd.DataFrame({
    'Date': pd.date_range('2000-01-01', periods=12, freq='M'),
    'Type': ['A', 'B', 'C'] * 4,
    'Value': np.random.randn(12)
})

# Pivot the DataFrame (raises if any index/column pair is duplicated)
df3 = df2.pivot(
    index='Date',
    columns='Type',
    values='Value'
)

# Pivot table with aggregation (handles duplicate pairs)
df4 = pd.pivot_table(
    df2,
    values='Value',
    index='Date',
    columns='Type',
    aggfunc='mean'  # 'mean' is the default
)

Stack & Unstack

# Stack: pivot a level of the column labels into the index (wide -> long)
stacked = df5.stack()

# Unstack: pivot a level of the index labels into columns (long -> wide)
unstacked = stacked.unstack()
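Since df5 is only defined later in the article, here is a self-contained sketch of the stack/unstack round trip on a tiny frame:

```python
import pandas as pd

wide = pd.DataFrame(
    {'A': [1, 2], 'B': [3, 4]},
    index=['x', 'y']
)

# stack() turns the columns into an inner index level,
# producing a Series with a (row, column) MultiIndex
stacked = wide.stack()

# unstack() reverses it, restoring the original wide frame
round_trip = stacked.unstack()

print(stacked.index.nlevels)  # 2
```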

Melt: Gather Columns into Rows

Transform wide format to long format.

# Melt the DataFrame: gather the Type and Value columns into rows
melted = pd.melt(
    df2,
    id_vars=['Date'],
    value_vars=['Type', 'Value'],
    value_name='Observations'
)

MultiIndexing

Create hierarchical indexes for complex data.

# Create MultiIndex from arrays
arrays = [np.array([1, 2, 3]), np.array([5, 4, 3])]
df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)

# Create MultiIndex from tuples
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(
    tuples,
    names=['first', 'second']
)

df6 = pd.DataFrame(np.random.rand(3, 2), index=index)

# Set multiple columns as index
df2.set_index(['Date', 'Type'])

Combining Data

Merge, join, and concatenate datasets—essential for real-world analysis.

Merge

Combine DataFrames on common columns.

# Prepare sample data
data1 = pd.DataFrame({'X1': ['A', 'B', 'C'], 'Y1': [1, 2, 3]})
data2 = pd.DataFrame({'X1': ['A', 'C', 'D'], 'Z1': [10, 20, 30]})

# Left merge (keep all rows from the left)
result = pd.merge(data1, data2, how='left', on='X1')

# Right merge (keep all rows from the right)
result = pd.merge(data1, data2, how='right', on='X1')

# Inner merge (keep only keys present in both)
result = pd.merge(data1, data2, how='inner', on='X1')

# Outer merge (keep all keys from both)
result = pd.merge(data1, data2, how='outer', on='X1')
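When auditing a merge, `pd.merge` also accepts an `indicator` parameter that adds a `_merge` column recording where each row came from. A quick sketch with the same two frames:

```python
import pandas as pd

data1 = pd.DataFrame({'X1': ['A', 'B', 'C'], 'Y1': [1, 2, 3]})
data2 = pd.DataFrame({'X1': ['A', 'C', 'D'], 'Z1': [10, 20, 30]})

# _merge is one of 'both', 'left_only', 'right_only'
audit = pd.merge(data1, data2, how='outer', on='X1', indicator=True)

print(audit['_merge'].value_counts())
```

This is a cheap way to check how many keys actually matched before trusting a join.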

Join

# join() aligns on the index, so set the shared key as the index first
# (joining frames with overlapping column names would otherwise need lsuffix/rsuffix)
joined = data1.set_index('X1').join(data2.set_index('X1'), how='right')

Concatenate

# Append Series (Series.append was removed in pandas 2.0; use concat)
pd.concat([s, s2])

# Concatenate along columns
result = pd.concat([s, s2], axis=1, keys=['One', 'Two'])

# Concatenate with a specific join method
result = pd.concat([data1, data2], axis=1, join='inner')

Reindexing

Reorder or add new labels to your data.

# Reindex with new labels
s2 = s.reindex(['a', 'c', 'd', 'e', 'b'])

# Forward fill missing values while reindexing
df.reindex(range(4), method='ffill')

# Backward fill missing values while reindexing
s3 = s.reindex(range(5), method='bfill')

Missing Data

Handle missing values like a pro.

# Drop rows containing NaN values
df.dropna()

# Fill NaN with a predetermined value (here, the column means)
df3.fillna(df3.mean())

# Fill forward (copy the previous value;
# fillna(method=...) is deprecated, use ffill/bfill)
df.ffill()

# Fill backward (copy the next value)
df.bfill()

# Replace values
df2.replace('a', 'f')

Duplicate Data

Identify and remove duplicates efficiently.

# Return unique values
s3.unique()

# Flag duplicated values in a column
df2.duplicated('Type')

# Drop duplicates, keeping the last occurrence
df2.drop_duplicates('Type', keep='last')

# Flag duplicated index labels
df.index.duplicated()
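A concrete sketch of what `keep` changes, on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'Type': ['A', 'B', 'A'], 'Value': [1, 2, 3]})

first = df.drop_duplicates('Type')               # keeps rows 0 and 1
last = df.drop_duplicates('Type', keep='last')   # keeps rows 1 and 2

print(last['Value'].tolist())  # [2, 3]
```

`keep='last'` matters when later rows are more up to date, e.g. the latest record per key.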

Date Handling

Work with dates and time series data.

from datetime import datetime

# Convert column to datetime
df2['Date'] = pd.to_datetime(df2['Date'])

# Create a date range (month-end frequency, matching df2's 12 rows)
df2['Date'] = pd.date_range('2000-01-01', periods=12, freq='M')

# Create a DatetimeIndex from datetime objects
dates = [datetime(2012, 5, 1), datetime(2012, 5, 2)]
index = pd.DatetimeIndex(dates)

# Create a business-month-end range up to a chosen end date
end = datetime(2012, 12, 31)  # any end date
index = pd.date_range(datetime(2012, 2, 1), end, freq='BM')

Grouping Data

Aggregate, transform, and apply operations on groups.

Aggregation

# Group by columns and aggregate
df2.groupby(by=['Date', 'Type']).mean()

# Group by index level
df4.groupby(level=0).sum()

# Different aggregation functions per column
df4.groupby(level=0).agg({
    'a': lambda x: sum(x) / len(x),
    'b': 'sum'
})

Transformation

# Apply a custom transformation, keeping the original shape
custom_sum = lambda x: (x + x % 2)
df4.groupby(level=0).transform(custom_sum)

Visualization

Create quick visualizations for exploration.

import matplotlib.pyplot as plt

# Plot a Series
s.plot()
plt.show()

# Plot a DataFrame (one line per column)
df2.plot()
plt.show()

Forward & Backward Filling

Fill missing values based on adjacent data.

Forward Filling

# Propagate the last valid observation forward
df.reindex(range(4), method='ffill')

Backward Filling

# Use the next valid observation to fill the gap
s3 = s.reindex(range(5), method='bfill')

Iteration

Iterate over DataFrames (use sparingly—vectorization is better!).

# Iterate over columns: (column label, Series) pairs
# (iteritems() was removed in pandas 2.0; use items())
for column_label, series in df.items():
    pass

# Iterate over rows: (row label, Series) pairs
for row_label, series in df.iterrows():
    pass

Performance warning: Iteration is slow in pandas. Whenever possible, use vectorized operations instead!
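To make the warning concrete, here is a sketch comparing the two approaches on the same task (doubling a column); the vectorized form runs as a single NumPy operation and is typically orders of magnitude faster:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': np.arange(1_000)})

# Slow: explicit row-by-row iteration
doubled_loop = [row['x'] * 2 for _, row in df.iterrows()]

# Fast: one vectorized expression over the whole column
doubled_vec = df['x'] * 2

# Both produce the same values
assert doubled_loop == doubled_vec.tolist()
```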


Real-World Examples

Example 1: Pivoting Sales Data

# Sample sales data
sales = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=12, freq='M'),
    'Product': ['Widget A', 'Widget B', 'Widget C'] * 4,
    'Revenue': np.random.randint(1000, 10000, 12)
})

# Pivot to see revenue by product over time
revenue_pivot = sales.pivot(
    index='Date',
    columns='Product',
    values='Revenue'
)

# Or use a pivot table for aggregation
revenue_table = pd.pivot_table(
    sales,
    values='Revenue',
    index='Date',
    columns='Product',
    aggfunc='sum'
)

Example 2: Merging Customer Data

# Customer basic info
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

# Customer orders
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 4],
    'order_amount': [100, 200, 150, 75]
})

# Merge with different join types
# Inner: only customers who have orders
inner = pd.merge(customers, orders, how='inner', on='customer_id')

# Left: all customers, NaN where there are no orders
left = pd.merge(customers, orders, how='left', on='customer_id')

# Outer: all customers and all orders
outer = pd.merge(customers, orders, how='outer', on='customer_id')

Example 3: Handling Time Series with Missing Data

# Create a time series with missing data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
values = [1, 2, np.nan, 4, 5, np.nan, np.nan, 8, 9, 10]
ts = pd.Series(values, index=dates)

# Forward fill (fillna(method=...) is deprecated; use ffill/bfill)
ts_ffill = ts.ffill()

# Backward fill
ts_bfill = ts.bfill()

# Fill with the mean
ts_mean = ts.fillna(ts.mean())

# Drop NaN
ts_clean = ts.dropna()

Quick Reference

Most Common Operations

# Reshape
df.pivot(index=..., columns=..., values=...)
pd.pivot_table(df, index=..., columns=..., values=...)
df.melt(id_vars=..., value_vars=...)
df.stack() / df.unstack()

# Combine
pd.merge(df1, df2, how='left', on='key')
pd.concat([df1, df2], axis=1, join='inner')
df.join(df2, how='outer')

# Group
df.groupby('column').mean()
df.groupby(level=0).sum()
df.groupby('col').transform(func)

# Select
df.loc[condition]
df.query('condition')
df.where(condition)

# Handle missing
df.dropna()
df.fillna(value)
df.ffill()

# Dates
pd.to_datetime(df['Date'])
pd.date_range(start, end, freq='M')

Pro Tips

Use pivot_table instead of pivot when you need aggregation or have duplicate index/column pairs
Inner joins are fastest; pre-filter before merging large datasets
Forward fill for time series where data flows forward naturally
Backward fill for sensor data or reverse-chronological patterns
MultiIndex unlocks powerful hierarchical analysis
Vectorize operations; avoid iteration whenever possible
Groupby + transform when you need to preserve the original shape
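The last tip in the list above can be sketched concretely: transform returns a result aligned with the original rows, so you can attach a per-group statistic as a new column without collapsing the frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Type': ['A', 'A', 'B', 'B'],
    'Value': [1.0, 3.0, 10.0, 30.0]
})

# agg would collapse each group to one row;
# transform broadcasts the group mean back to every original row
df['group_mean'] = df.groupby('Type')['Value'].transform('mean')

print(df['group_mean'].tolist())  # [2.0, 2.0, 20.0, 20.0]
```

This pattern is handy for normalizing values within groups, e.g. `df['Value'] - df['group_mean']`.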