Real-world data is messy and comes in every shape imaginable. Once you’ve imported your data, you need to reshape, pivot, merge, and transform it into something useful. These advanced pandas techniques are what separate beginners from professionals.
Selecting & Indexing with isin()
Filter your data to focus on what matters.
Basic Filtering
```python
import pandas as pd
import numpy as np

# Select rows based on conditions
df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasília', 'Brasília'],
    'Population': [11190846, 1303171035, 207847528, 207847528]
})

# Find rows where Country is in a list
selected = df[df.Country.isin(['Belgium', 'Brazil'])]

# (df3 below stands for any numeric DataFrame)
# Select columns with any values > 1
df3.loc[:, (df3 > 1).any()]

# Select columns with all values > 1
df3.loc[:, (df3 > 1).all()]

# Select columns with NaN values
df3.loc[:, df3.isnull().any()]

# Select columns without NaN values
df3.loc[:, df3.notnull().all()]
```

Advanced Selection
```python
# Keep only the listed columns
df.filter(items=['a', 'b'])

# Select rows with a callable (df.select() was removed from pandas)
df.loc[lambda d: d.index % 5 == 0]  # select every 5th row

# Query the DataFrame
df6.query('second > first')
```

Where & Query
Subset your data with powerful querying.
```python
# Using where() to subset data
s.where(s > 0)

# Query with string expressions
result = df.query('Population > 200000000')

# Multiple conditions
large_countries = df.query('Population > 200000000 and Country != "India"')
```

Reshaping Data
Transform data between long and wide formats, a step that is critical for analysis and visualization.
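As a self-contained sketch of the full long-to-wide-and-back round trip (the column names here are illustrative):

```python
import pandas as pd

# Long-format data: one row per (date, metric) observation
long_df = pd.DataFrame({
    'date': ['2024-01', '2024-01', '2024-02', '2024-02'],
    'metric': ['sales', 'costs', 'sales', 'costs'],
    'value': [100, 60, 120, 70],
})

# Long -> wide: one column per metric
wide_df = long_df.pivot(index='date', columns='metric', values='value')

# Wide -> long again: melt the columns back into rows
back = wide_df.reset_index().melt(
    id_vars='date', var_name='metric', value_name='value'
)
```

Pivot and melt are inverses of each other, which is why they appear together so often in cleaning pipelines.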
Setting & Resetting Index
```python
# Set a column as the index
df.set_index('Country')

# Reset index to get default integer index
df4 = df.reset_index()

# Rename index and columns
df = df.rename(
    index=str,
    columns={'Country': 'cntry', 'Capital': 'cptl', 'Population': 'ppltn'}
)
```

Pivoting
Spread rows into columns for cross-tabulation.
```python
# Create sample data
df2 = pd.DataFrame({
    'Date': pd.date_range('2000-01-01', periods=12, freq='M'),
    'Type': ['A', 'B', 'C'] * 4,
    'Value': np.random.randn(12)
})

# Pivot the DataFrame
df3 = df2.pivot(index='Date', columns='Type', values='Value')

# Pivot table with aggregation
df4 = pd.pivot_table(
    df2,
    values='Value',
    index='Date',
    columns='Type',
    aggfunc='mean'  # 'mean' is the default
)
```

Stack & Unstack
```python
# Stack: pivot a level of the column labels into the index
stacked = df5.stack()

# Unstack: pivot a level of the index labels back into columns
unstacked = stacked.unstack()
```

Melt: Gather Columns into Rows
Transform wide format to long format.
```python
# Melt the DataFrame from wide to long
melted = pd.melt(
    df2,
    id_vars=['Date'],
    value_vars=['Type', 'Value'],
    value_name='Observations'
)
```

MultiIndexing
Create hierarchical indexes for complex data.
```python
# Create MultiIndex from arrays
arrays = [np.array([1, 2, 3]), np.array([5, 4, 3])]
df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)

# Create MultiIndex from tuples
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df6 = pd.DataFrame(np.random.rand(3, 2), index=index)

# Set multiple columns as index
df2.set_index(['Date', 'Type'])
```

Combining Data
Merge, join, and concatenate datasets: essential skills for real-world analysis.
Merge
Combine DataFrames on common columns.
```python
# Prepare sample data
data1 = pd.DataFrame({'X1': ['A', 'B', 'C'], 'Y1': [1, 2, 3]})
data2 = pd.DataFrame({'X1': ['A', 'C', 'D'], 'Z1': [10, 20, 30]})

# Left merge (keep all rows from the left)
result = pd.merge(data1, data2, how='left', on='X1')

# Right merge (keep all rows from the right)
result = pd.merge(data1, data2, how='right', on='X1')

# Inner merge (keep only keys common to both)
result = pd.merge(data1, data2, how='inner', on='X1')

# Outer merge (keep all keys from both)
result = pd.merge(data1, data2, how='outer', on='X1')
```

Join
```python
# Join on index; suffixes disambiguate the overlapping 'X1' column
joined = data1.join(data2, how='right', lsuffix='_1', rsuffix='_2')
```

Concatenate
```python
# Append one Series to another (Series.append was removed in pandas 2.0)
pd.concat([s, s2])

# Concatenate along columns
result = pd.concat([s, s2], axis=1, keys=['One', 'Two'])

# Concatenate with a specific join method
result = pd.concat([data1, data2], axis=1, join='inner')
```

Reindexing
Reorder or add new labels to your data.
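A minimal self-contained illustration (the labels and values are made up):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Reorder existing labels and introduce a new one; missing labels get NaN
s2 = s.reindex(['c', 'a', 'd'])

# Provide a default for labels that did not exist before
s3 = s.reindex(['c', 'a', 'd'], fill_value=0)
```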
```python
# Reindex with new labels
s2 = s.reindex(['a', 'c', 'd', 'e', 'b'])

# Forward fill missing values
df.reindex(range(4), method='ffill')

# Backward fill missing values
s3 = s.reindex(range(5), method='bfill')
```

Missing Data
Handle missing values like a pro.
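For instance, on a small frame with holes in it (the values are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

# Drop any row containing NaN (only the first row survives here)
dropped = df.dropna()

# Fill each column's NaN with that column's mean
filled = df.fillna(df.mean())
```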
```python
# Drop NaN values
df.dropna()

# Fill NaN with a computed value
df3.fillna(df3.mean())

# Fill forward (copy the previous value; fillna(method=...) is deprecated)
df.ffill()

# Fill backward (copy the next value)
df.bfill()

# Replace values
df2.replace('a', 'f')
```

Duplicate Data
Identify and remove duplicates efficiently.
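A quick self-contained sketch of both calls on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'Type': ['A', 'B', 'A'], 'Value': [1, 2, 3]})

# duplicated() marks repeats of earlier rows (judged by 'Type' here)
mask = df.duplicated('Type')

# keep='last' retains the final occurrence instead of the first
deduped = df.drop_duplicates('Type', keep='last')
```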
```python
# Return unique values
s3.unique()

# Check for duplicates
df2.duplicated('Type')

# Drop duplicates
df2.drop_duplicates('Type', keep='last')

# Check for duplicate index labels
df.index.duplicated()
```

Date Handling
Work with dates and time series data.
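As a quick sketch, parsing date strings and pulling out components via the `.dt` accessor (dates chosen arbitrarily):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2024-01-15', '2024-02-20']})

# Parse strings into datetime64 values
df['Date'] = pd.to_datetime(df['Date'])

# The .dt accessor exposes the date components
df['month'] = df['Date'].dt.month
df['weekday'] = df['Date'].dt.day_name()
```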
```python
from datetime import datetime

# Convert a column to datetime
df2['Date'] = pd.to_datetime(df2['Date'])

# Create a date range
df2['Date'] = pd.date_range('2000-01-01', periods=6, freq='M')

# Create a DatetimeIndex
dates = [datetime(2012, 5, 1), datetime(2012, 5, 2)]
index = pd.DatetimeIndex(dates)

# Create a date range up to an explicit end date (business month end)
end = datetime(2012, 12, 31)
index = pd.date_range(datetime(2012, 2, 1), end, freq='BM')
```

Grouping Data
Aggregate, transform, and apply operations on groups.
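The aggregate-versus-transform distinction in one self-contained sketch (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['x', 'x', 'y', 'y'],
    'score': [1, 3, 2, 6],
})

# Aggregate: one row per group
totals = df.groupby('team')['score'].sum()

# Transform: same shape as the input, group value broadcast to each row
df['team_mean'] = df.groupby('team')['score'].transform('mean')
```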
Aggregation
```python
# Group by columns and aggregate
df2.groupby(by=['Date', 'Type']).mean()

# Group by index level
df4.groupby(level=0).sum()

# Multiple aggregation functions
df4.groupby(level=0).agg({
    'a': lambda x: sum(x) / len(x),
    'b': 'sum'
})
```

Transformation
```python
# Apply a custom transformation function to each group
custom_sum = lambda x: (x + x % 2)
df4.groupby(level=0).transform(custom_sum)
```

Visualization
Create quick visualizations for exploration.
```python
import matplotlib.pyplot as plt

# Plot a Series
s.plot()
plt.show()

# Plot a DataFrame
df2.plot()
plt.show()
```

Forward & Backward Filling
Fill missing values based on adjacent data.
Forward Filling
```python
# Propagate the last valid observation forward
df.reindex(range(4), method='ffill')
```

Backward Filling
```python
# Use the next valid observation to fill the gap
s3 = s.reindex(range(5), method='bfill')
```

Iteration
Iterate over DataFrames (use sparingly; vectorization is almost always faster!).
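To see why vectorization wins, compare a row loop against the equivalent column expression (toy data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Slow: Python-level loop over rows
loop_total = 0
for _, row in df.iterrows():
    loop_total += row['a'] * row['b']

# Fast: one vectorized expression over whole columns
vec_total = (df['a'] * df['b']).sum()
```

Both compute the same dot product, but the vectorized version runs in C inside NumPy rather than in the Python interpreter.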
```python
# Iterate over columns (iteritems() was removed; use items())
for column_label, series in df.items():
    # (column label, Series) pairs
    pass

# Iterate over rows
for row_index, series in df.iterrows():
    # (row index, Series) pairs
    pass
```

Real-World Examples
Example 1: Pivoting Sales Data
```python
# Import sales data
sales = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=12, freq='M'),
    'Product': ['Widget A', 'Widget B', 'Widget C'] * 4,
    'Revenue': np.random.randint(1000, 10000, 12)
})

# Pivot to see revenue by product over time
revenue_pivot = sales.pivot(index='Date', columns='Product', values='Revenue')

# Or use a pivot table for aggregation
revenue_table = pd.pivot_table(
    sales, values='Revenue', index='Date', columns='Product', aggfunc='sum'
)
```

Example 2: Merging Customer Data
```python
# Customer basic info
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

# Customer orders
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 4],
    'order_amount': [100, 200, 150, 75]
})

# Merge with different join types
# Inner: only customers who have orders
inner = pd.merge(customers, orders, how='inner', on='customer_id')

# Left: all customers, NaN for those with no orders
left = pd.merge(customers, orders, how='left', on='customer_id')

# Outer: all customers and all orders
outer = pd.merge(customers, orders, how='outer', on='customer_id')
```

Example 3: Handling Time Series with Missing Data
```python
# Create a time series with missing data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
values = [1, 2, np.nan, 4, 5, np.nan, np.nan, 8, 9, 10]
ts = pd.Series(values, index=dates)

# Forward fill (fillna(method=...) is deprecated)
ts_ffill = ts.ffill()

# Backward fill
ts_bfill = ts.bfill()

# Fill with the mean
ts_mean = ts.fillna(ts.mean())

# Drop NaN
ts_clean = ts.dropna()
```

Quick Reference
Most Common Operations
```python
# Reshape
df.pivot(index=..., columns=..., values=...)
pd.pivot_table(df, index=..., columns=..., values=...)
df.melt(id_vars=..., value_vars=...)
df.stack() / df.unstack()

# Combine
pd.merge(df1, df2, how='left', on='key')
pd.concat([df1, df2], axis=1, join='inner')
df.join(df2, how='outer')

# Group
df.groupby('column').mean()
df.groupby(level=0).sum()
df.groupby('col').transform(func)

# Select
df.loc[condition]
df.query('condition')
df.where(condition)

# Handle missing
df.dropna()
df.fillna(value)
df.ffill()

# Dates
pd.to_datetime(df['Date'])
pd.date_range(start, end, freq='M')
```

Pro Tips
✅ Use pivot tables instead of pivot when you need aggregation
✅ Inner joins are typically fastest; pre-filter before merging large datasets
✅ Forward fill for time series where data flows forward naturally
✅ Backward fill for sensor data or reverse-chronological patterns
✅ MultiIndex unlocks powerful hierarchical analysis
✅ Vectorize operations; avoid explicit iteration whenever possible
✅ Groupby + transform when you need to preserve original shape
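The last tip in action: because `transform` returns a result aligned to the original rows, you can filter each row against its own group's statistic (the names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'sales': [100, 300, 50, 150],
})

# Per-row group mean: same length as df, so it aligns row-for-row
group_mean = df.groupby('store')['sales'].transform('mean')

# Keep only rows above their own store's average
above_avg = df[df['sales'] > group_mean]
```

An aggregation would collapse each store to one row; `transform` preserves the original shape, which is exactly what row-level comparisons need.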