Python and Data Wrangling Study Guide, Study notes of Economics

It has information regarding dataframes, pandas, python, and other fundamental topics in data analysis for research purposes within the economic environment.

Typology: Study notes

2025/2026

Uploaded on 03/10/2026

arin-mahapatra
arin-mahapatra 🇺🇸

1 document

1 / 18

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Python for Data Science
MASTER CHEATSHEET & EXAM GUIDE
Pandas • Regex • Data Wrangling • Error Handling • API Calls • Time Series • Regression
This sheet synthesises ALL midterm topics, complete Q1–Q5 answers with explanations, and key debugging lessons
from practice sessions.
What's Inside
Section Contents
Q1 A–I Pandas DataFrame fundamentals — create, copy, transform, save, delete
Q2 A–C Regex & string methods — Titanic dataset
Q3 A–E Data wrangling — SkyKilometers (unit conversion, filter, classify, plot)
Q4 Loops & error handling — word counter program
Q5 A–D API calls — OpenWeatherMap & U.S. Census Bureau
§6 Regression, time series, EDA quick reference (course topics V–IV)
§7 Common bugs & fixes (from practice sessions)
§8 Full syntax quick-reference tables
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12

Partial preview of the text

Download Python and Data Wrangling Study Guide and more Study notes Economics in PDF only on Docsity!

Python for Data Science

MASTER CHEATSHEET & EXAM GUIDE

Pandas • Regex • Data Wrangling • Error Handling • API Calls • Time Series • Regression

ℹ This sheet synthesises ALL midterm topics, complete Q1–Q5 answers with explanations, and key debugging lessons

from practice sessions.

What's Inside

Section Contents

Q1 A–I (^) Pandas DataFrame fundamentals — create, copy, transform, save, delete Q2 A–C (^) Regex & string methods — Titanic dataset Q3 A–E (^) Data wrangling — SkyKilometers (unit conversion, filter, classify, plot) Q4 (^) Loops & error handling — word counter program Q5 A–D (^) API calls — OpenWeatherMap & U.S. Census Bureau §6 (^) Regression, time series, EDA quick reference (course topics V–IV) §7 (^) Common bugs & fixes (from practice sessions) §8 (^) Full syntax quick-reference tables

Q1 — Pandas DataFrame Fundamentals | Parts A–I | 11 points The Data (reference throughout Q1)

Color (index) Month Age Experienced

Blue (^) December 10 True Red (^) January 12 True Yellow (^) February 10 True Green (^) March 12 False

Part A — Create a DataFrame with a custom index (1 pt)

ℹ Pass a dict of {column: [values]} to pd.DataFrame(). Use index= for row labels.

import pandas as pd new_table = pd.DataFrame( { 'Month': ['December', 'January', 'February', 'March'], 'Age': [10, 12, 10, 12], 'Experienced': [True, True, True, False] }, index=['Blue', 'Red', 'Yellow', 'Green'] # sets row labels )

Keys become column names. Lists must all be the same length. index= replaces default 0,1,2,3 with the color strings.

Part B — Deep copy (1 pt)

⚠ Always use deep=True when you want a fully independent copy.

extra_table = new_table.copy(deep=True)

Method Behaviour

.copy(deep=True) (^) Fully independent — changes to extra_table do NOT affect new_table .copy(deep=False) (^) Shares underlying data — changes in one WILL affect the other extra_table = new_table (^) NOT a copy — just a second pointer to the same object

Part C — Conditional integer column 'is_ten' (1 pt)

ℹ np.where is the cleanest approach: np.where(condition, value_if_true, value_if_false)

import numpy as np

Option 1 — np.where (preferred, vectorised)

extra_table['is_ten'] = np.where(extra_table['Age'] == 10, 1, 0)

Option 2 — lambda with .apply()

extra_table['is_ten'] = extra_table['Age'].apply(lambda x: 1 if x == 10 else 0)

Both produce identical results. np.where is faster on large DataFrames because it avoids Python-level looping.

Part D — Lowercase the 'Month' column (1 pt)

✓ The Month column contains real strings, so .str.lower() works directly — no .astype(str) needed.

ℹ del only removes the name binding. Python's garbage collector frees the actual memory. Always confirm with dir() or

%whos.

Q1 — Final State of extra_table

Index Color Month Age Tenure is_ten Notes

(inde x) BLUE december 10 True 1 Tenure (was Experienced) (^1) RED january 12 True 0 (^2) YELLOW february 10 True 1 (^3) GREEN march 12 False 0

Q2 — Regex & String Methods | Titanic Dataset | 6 points Regex Fundamentals

Pattern Meaning

. (^) Any single character

  • (^) Zero or more of the preceding
  • (^) One or more of the preceding ? (^) Zero or one of the preceding (optional) ^ (^) Start of string $ (^) End of string [A-Z] (^) Any single uppercase letter [a-z]+ (^) One or more lowercase letters [^abc] (^) Any character EXCEPT a, b, or c (group) (^) Capture group — extract this part with .str.extract() \b (^) Word boundary (edge of a word) \s (^) Any whitespace character \S+ (^) One or more non-whitespace characters \d (^) Any digit 0– \w+ (^) One or more word characters (letters, digits, underscore) a|b (^) Match a OR b (?:...) (^) Non-capturing group — groups without extracting Pandas String/Regex Methods

Method Purpose

.str.replace(r'pattern', 'new', regex=True) Replace every match of pattern with new string .str.extract(r'(pattern)') (^) Extract first capture group → new Series .str.extractall(r'(pattern)') (^) Extract ALL matches → MultiIndex DataFrame .str.findall(r'pattern') (^) Find all matches → list per row .str.contains(r'pattern') (^) Boolean mask — True where pattern found .str.match(r'pattern') (^) True if pattern matches at START of string .str.split('delim', expand=True) (^) Split into multiple columns .str.lower() / .str.upper() (^) Case conversion .str.strip() (^) Remove leading/trailing whitespace .str.len() (^) Character length of each string

Part A — Replace 'Mr.' with 'Mister' using regex (2 pts)

⚠ The dot in Mr. is a regex wildcard — escape it with . to match a literal period.

import pandas as pd

Q3 — Data Wrangling | SkyKilometers Dataset | 18 points

Part A — Import, convert miles → km, rename columns (4 pts)

ℹ Vectorised multiplication ( * 1.6 ) applies to every row instantly — no loop needed.

import pandas as pd SkyKilometers = pd.read_csv('SkyKilometers.csv')

Convert miles to kilometers (1 mile = 1.6 km)

SkyKilometers['QualKilometers'] = SkyKilometers['QualMiles'] * 1. SkyKilometers['BonusKilometers'] = SkyKilometers['BonusMiles'] * 1. SkyKilometers['FlightKilometers'] = SkyKilometers['FlightMiles'] * 1.

Remove the original miles columns

SkyKilometers = SkyKilometers.drop(columns=['QualMiles', 'BonusMiles', 'FlightMiles'])

Part B — Drop rows where DaysSinceEnroll < 30, print stats (4 pts)

Keep only rows where DaysSinceEnroll >= 30

SkyKilometers = SkyKilometers[SkyKilometers['DaysSinceEnroll'] >= 30]

Print summary statistics

print(SkyKilometers['DaysSinceEnroll'].min()) print(SkyKilometers['DaysSinceEnroll'].mean()) print(SkyKilometers['DaysSinceEnroll'].max())

ℹ Boolean indexing: df[df['col'] condition] keeps only rows where condition is True. This is the standard pandas filter

pattern.

Part C — Create 'MemberClass' column (4 pts)

ℹ np.where(condition, value_if_true, value_if_false) — the essential binary classification tool.

import numpy as np SkyKilometers['MemberClass'] = np.where( SkyKilometers['DaysSinceEnroll'] <= 365, 'New_Member', 'Longtime_Member' )

Part D — Print count of each MemberClass in a sentence (4 pts)

counts = SkyKilometers['MemberClass'].value_counts() print(f"There are {counts['New_Member']} New Members in the program") print(f"There are {counts['Longtime_Member']} Longtime Members in the program")

value_counts() returns a Series indexed by unique values. f-strings embed variables directly inside curly braces.

Part E — Box-and-whisker plot (2 pts)

import matplotlib.pyplot as plt SkyKilometers.boxplot(column='DaysSinceEnroll') plt.title('Distribution of DaysSinceEnroll') plt.ylabel('Days') plt.show()

Box Plot Anatomy

Feature Meaning

Center line (^) Median (50th percentile) Box edges (^) Q1 (25th) and Q3 (75th) — this range is the IQR Whiskers (^) Extend to 1.5 × IQR beyond Q1/Q Dots/circles (^) Outliers — values beyond the whiskers

try / except Flow Diagram

ℹ Code in try runs normally. If ANY exception occurs, Python jumps to the matching except block, runs it, and continues

AFTER the try/except — it does not crash.

Scenario What happens

try block runs, no error (^) Code executes normally, except block is skipped entirely try block runs, error raised (^) Python jumps to except, runs it, then continues after the block except catches the specific type Only catches the named error (ValueError here), others still crash except Exception as e (^) Catches ALL exceptions — use sparingly, hides bugs

Q5 — API Calls | OpenWeatherMap & U.S. Census | 4 points API Anatomy

Every API call has this structure:

BASE_URL + ENDPOINT + ?param1=value1¶m2=value2&key=YOUR_KEY

import requests url = 'https://api.example.com/data?city=Chicago&units=metric&appid=abc123' response = requests.get(url) # makes the HTTP GET request data = response.json() # parses JSON response into a Python dict print(data) # inspect structure before drilling in

Concept Meaning

requests.get(url) (^) Sends an HTTP GET request; returns a response object response.json() (^) Parses the JSON body into a Python dict or list response.status_code (^) 200 = OK | 401 = Bad API key | 404 = Not found response.text (^) Raw response as a string (useful for debugging) ?key=value&key2=value2 (^) Query string — parameters appended to the URL after? f'...{variable}...' (^) f-string — easiest way to build URLs with variables

Part A — Current temperature Boise, ID in Celsius (1 pt)

api_key = 'YOUR_OPENWEATHERMAP_KEY' url = (f'http://api.openweathermap.org/data/2.5/weather' f'?q=Boise,ID,US&units=metric&appid={api_key}') response = requests.get(url) data = response.json() temp_celsius = data['main']['temp'] print(f'Current temperature in Boise: {temp_celsius}°C')

ℹ units=metric returns Celsius. units=imperial returns Fahrenheit. Omit units for Kelvin (default).

Part B — Max temperature Detroit, MI (1 pt)

url = (f'http://api.openweathermap.org/data/2.5/weather' f'?q=Detroit,MI,US&units=metric&appid={api_key}') data = requests.get(url).json() print(data['main']['temp_max']) # max temperature

Part C — 2016 ACS statewide population: Wisconsin & Rhode Island (1 pt)

ℹ Census FIPS codes: Wisconsin = 55, Rhode Island = 44. B01003_001E is the total population variable.

census_key = 'YOUR_CENSUS_KEY' url = (f'https://api.census.gov/data/2016/acs/acs1' f'?get=NAME,B01003_001E&for=state:55,44&key={census_key}') response = requests.get(url) data = response.json() # returns a list of lists: [header_row, data_row, ...]

§6 — Regression, EDA & Time Series | Course Topics III– V Exploratory Data Analysis

Code Purpose

df.corr() (^) Correlation matrix — spot linear relationships and multicollinearity df['col'].value_counts() (^) Count each unique value — great for categorical columns df.describe() (^) Min, max, mean, std, quartiles for all numeric columns df.info() (^) Column names, dtypes, non-null counts, memory usage plt.hist(df['col'], bins=20) (^) Distribution — check for skew, normality, outliers plt.scatter(df['x'], df['y']) (^) Bivariate relationship between two variables sns.heatmap(df.corr(), annot=True) (^) Visual correlation matrix — annotated with values sns.regplot(x='x', y='y', data=df) (^) Scatter with regression line overlaid OLS Regression (statsmodels) import statsmodels.formula.api as smf

Estimate the model

model = smf.ols('y ~ x1 + x2 + x3', data=df) result = model.fit()

View full output

print(result.summary())

Key attributes

result.params # coefficients result.rsquared # R² result.rsquared_adj # Adjusted R² (penalises for extra variables) result.aic # AIC — lower is better result.bic # BIC — lower is better result.resid # residuals (use for diagnostic plots)

Robust standard errors (heteroskedasticity-consistent)

result_robust = result.get_robustcov_results(cov_type='HC3') Regression Interpretation Guide

Term Meaning

Coefficient (β) (^) For each 1-unit increase in X, Y changes by β units, holding all else constant p-value < 0.05 (^) The predictor is statistically significant at the 5% level R² (^) Proportion of variance in Y explained by the model (0–1) Adjusted R² (^) R² penalised for number of predictors — use for model comparison AIC / BIC (^) Lower = better model fit, penalised for complexity — use for selection F-statistic (^) Tests whether the model as a whole explains significant variance

Residuals (^) Difference between actual and predicted Y — should be random (no pattern) Time Series from statsmodels.tsa.stattools import adfuller from statsmodels.graphics.tsaplots import plot_acf, plot_pacf from statsmodels.tsa.arima.model import ARIMA

Set datetime index

df = df.set_index(pd.to_datetime(df['date']))

Plot the series

df['value'].plot(title='Time Series') plt.show()

Stationarity test (Augmented Dickey-Fuller)

result = adfuller(df['value'].dropna()) print(f'ADF Statistic: {result[0]:.4f}') print(f'p-value: {result[1]:.4f}') # p < 0.05 → stationary

ACF / PACF to find AR and MA order

plot_acf(df['value'].dropna(), lags=20) plot_pacf(df['value'].dropna(), lags=20) plt.show()

Fit ARIMA(p, d, q)

model = ARIMA(df['value'], order=(1, 1, 1)) result = model.fit() print(result.summary())

Resample daily data to monthly means

monthly = df['value'].resample('M').mean() ARIMA Order Selection

Parameter Meaning

p (AR order) (^) Number of lag terms. Read from PACF — where it cuts off d (differencing) (^) Number of times to difference to achieve stationarity. ADF test guides this. q (MA order) (^) Number of lagged error terms. Read from ACF — where it cuts off p-value > 0.05 on ADF (^) Series is non-stationary — apply differencing (d ≥ 1) p-value < 0.05 on ADF (^) Series is stationary — proceed with d = 0

§8 — Full Syntax Quick Reference DataFrame Creation & Inspection

Code Purpose

pd.DataFrame({'A':[1,2],'B':[3,4]}) (^) Create from dict pd.read_csv('file.csv') (^) Import CSV pd.read_excel('file.xlsx') (^) Import Excel pd.read_stata('file.dta') (^) Import Stata file df.head(10) (^) First 10 rows df.tail() (^) Last 5 rows df.shape (^) (rows, columns) df.info() (^) Types, nulls, memory df.describe() (^) Summary stats df.dtypes (^) Column types df.columns.tolist() (^) Column names as list df.index (^) Row index Selecting & Filtering

Code Purpose

df['col'] (^) Select one column → Series df[['col1','col2']] (^) Select multiple columns → DataFrame df.loc['row_label', 'col'] (^) Select by label df.iloc[0, 1] (^) Select by integer position df[df['col'] > 5] (^) Filter rows by condition df[(df['a']>1) & (df['b']<10)] (^) Multiple conditions — use & not 'and' df[df['col'].isin(['a','b'])] (^) Filter by list of values df[df['col'].notna()] (^) Drop NaN rows in one column Creating & Modifying Columns

Code Purpose

df['new'] = df['a'] + df['b'] (^) Vectorised arithmetic df['new'] = np.where(cond, t, f) (^) Binary conditional column df['new'] = df['col'].apply(func) (^) Apply any function row-by-row df['new'] = np.log(df['col']) (^) Natural log transformation df['new'] = (df['col']- df['col'].mean())/df['col'].std() Standardize (z-score) df = df.drop(columns=['col']) (^) Remove column(s) df = df.rename(columns={'old':'new'}) (^) Rename column

df = df.sort_values('col', ascending=False) Sort descending Aggregation & Reshaping

Code Purpose

df.groupby('col').mean() (^) Mean by group df.groupby('col').agg({'a':'sum','b':'m ean'}) Multiple aggregations df.pivot_table(values='v', index='r', columns='c', aggfunc='sum') Pivot table df.melt(id_vars='id', value_vars=['a','b']) Wide → long format pd.concat([df1, df2], ignore_index=True) Stack DataFrames vertically pd.merge(df1, df2, on='key', how='left') SQL-style join Missing Data

Code Purpose

df.isna().sum() (^) Count missing values per column df.dropna() (^) Drop rows with ANY missing value df.dropna(subset=['col']) (^) Drop rows where a specific column is NaN df.fillna(0) (^) Fill all NaN with 0 df['col'].fillna(df['col'].mean()) (^) Fill NaN with column mean Python Fundamentals

Code Meaning

for item in iterable: (^) Iterate over a list, range, dict, etc. for i, val in enumerate(lst): (^) Loop with index and value simultaneously [x**2 for x in range(10)] (^) List comprehension — compact loop {k: v for k, v in dict.items()} (^) Dict comprehension lambda x: x * (^2) Anonymous one-line function def func(x, y=10): return x + y (^) Named function with default argument try: ... except TypeError as e: (^) Error handling — catch specific exception raise ValueError('message') (^) Manually raise an exception f'Value is {var:.2f}' (^) f-string with 2 decimal format isinstance(x, int) (^) Check if x is an integer str.isalpha() (^) True if all characters are letters str.isdigit() (^) True if all characters are digits