Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Python and Data Wrangling Study Guide, Study notes of Economics

University of Wisconsin (UW) - Madison Economics

It has information regarding dataframes, pandas, python, and other fundamental topics in data analysis for research purposes within the economic environment.

Typology: Study notes

2025/2026

Uploaded on 03/10/2026

arin-mahapatra 🇺🇸

1 document

1 / 18

This page cannot be seen from the preview

Don't miss anything!

Python for Data Science

MASTER CHEATSHEET & EXAM GUIDE

Pandas • Regex • Data Wrangling • Error Handling • API Calls • Time Series • Regression

This sheet synthesises ALL midterm topics, complete Q1–Q5 answers with explanations, and key debugging lessons

ℹ

from practice sessions.

What's Inside

Section Contents

Q1 A–I Pandas DataFrame fundamentals — create, copy, transform, save, delete

Q2 A–C Regex & string methods — Titanic dataset

Q3 A–E Data wrangling — SkyKilometers (unit conversion, filter, classify, plot)

Q4 Loops & error handling — word counter program

Q5 A–D API calls — OpenWeatherMap & U.S. Census Bureau

§6 Regression, time series, EDA quick reference (course topics V–IV)

§7 Common bugs & fixes (from practice sessions)

§8 Full syntax quick-reference tables

Discover Study notes of Economics University of Wisconsin (UW) - Madison

Partial preview of the text

Download Python and Data Wrangling Study Guide and more Study notes Economics in PDF only on Docsity!

Python for Data Science

MASTER CHEATSHEET & EXAM GUIDE

Pandas • Regex • Data Wrangling • Error Handling • API Calls • Time Series • Regression

ℹ This sheet synthesises ALL midterm topics, complete Q1–Q5 answers with explanations, and key debugging lessons

from practice sessions.

What's Inside

Section Contents

Q1 A–I (^) Pandas DataFrame fundamentals — create, copy, transform, save, delete Q2 A–C (^) Regex & string methods — Titanic dataset Q3 A–E (^) Data wrangling — SkyKilometers (unit conversion, filter, classify, plot) Q4 (^) Loops & error handling — word counter program Q5 A–D (^) API calls — OpenWeatherMap & U.S. Census Bureau §6 (^) Regression, time series, EDA quick reference (course topics V–IV) §7 (^) Common bugs & fixes (from practice sessions) §8 (^) Full syntax quick-reference tables

Q1 — Pandas DataFrame Fundamentals | Parts A–I | 11 points The Data (reference throughout Q1)

Color (index) Month Age Experienced

Blue (^) December 10 True Red (^) January 12 True Yellow (^) February 10 True Green (^) March 12 False

Part A — Create a DataFrame with a custom index (1 pt)

ℹ Pass a dict of {column: [values]} to pd.DataFrame(). Use index= for row labels.

import pandas as pd new_table = pd.DataFrame( { 'Month': ['December', 'January', 'February', 'March'], 'Age': [10, 12, 10, 12], 'Experienced': [True, True, True, False] }, index=['Blue', 'Red', 'Yellow', 'Green'] # sets row labels )

Keys become column names. Lists must all be the same length. index= replaces default 0,1,2,3 with the color strings.

Part B — Deep copy (1 pt)

⚠ Always use deep=True when you want a fully independent copy.

extra_table = new_table.copy(deep=True)

Method Behaviour

.copy(deep=True) (^) Fully independent — changes to extra_table do NOT affect new_table .copy(deep=False) (^) Shares underlying data — changes in one WILL affect the other extra_table = new_table (^) NOT a copy — just a second pointer to the same object

Part C — Conditional integer column 'is_ten' (1 pt)

ℹ np.where is the cleanest approach: np.where(condition, value_if_true, value_if_false)

import numpy as np

Option 1 — np.where (preferred, vectorised)

extra_table['is_ten'] = np.where(extra_table['Age'] == 10, 1, 0)

Option 2 — lambda with .apply()

extra_table['is_ten'] = extra_table['Age'].apply(lambda x: 1 if x == 10 else 0)

Both produce identical results. np.where is faster on large DataFrames because it avoids Python-level looping.

Part D — Lowercase the 'Month' column (1 pt)

✓ The Month column contains real strings, so .str.lower() works directly — no .astype(str) needed.

ℹ del only removes the name binding. Python's garbage collector frees the actual memory. Always confirm with dir() or

%whos.

Q1 — Final State of extra_table

Index Color Month Age Tenure is_ten Notes

(inde x) BLUE december 10 True 1 Tenure (was Experienced) (^1) RED january 12 True 0 (^2) YELLOW february 10 True 1 (^3) GREEN march 12 False 0

Q2 — Regex & String Methods | Titanic Dataset | 6 points Regex Fundamentals

Pattern Meaning

. (^) Any single character

(^) Zero or more of the preceding

(^) One or more of the preceding ? (^) Zero or one of the preceding (optional) ^ (^) Start of string $ (^) End of string [A-Z] (^) Any single uppercase letter [a-z]+ (^) One or more lowercase letters [^abc] (^) Any character EXCEPT a, b, or c (group) (^) Capture group — extract this part with .str.extract() \b (^) Word boundary (edge of a word) \s (^) Any whitespace character \S+ (^) One or more non-whitespace characters \d (^) Any digit 0– \w+ (^) One or more word characters (letters, digits, underscore) a|b (^) Match a OR b (?:...) (^) Non-capturing group — groups without extracting Pandas String/Regex Methods

Method Purpose

.str.replace(r'pattern', 'new', regex=True) Replace every match of pattern with new string .str.extract(r'(pattern)') (^) Extract first capture group → new Series .str.extractall(r'(pattern)') (^) Extract ALL matches → MultiIndex DataFrame .str.findall(r'pattern') (^) Find all matches → list per row .str.contains(r'pattern') (^) Boolean mask — True where pattern found .str.match(r'pattern') (^) True if pattern matches at START of string .str.split('delim', expand=True) (^) Split into multiple columns .str.lower() / .str.upper() (^) Case conversion .str.strip() (^) Remove leading/trailing whitespace .str.len() (^) Character length of each string

Part A — Replace 'Mr.' with 'Mister' using regex (2 pts)

⚠ The dot in Mr. is a regex wildcard — escape it with . to match a literal period.

import pandas as pd

Q3 — Data Wrangling | SkyKilometers Dataset | 18 points

Part A — Import, convert miles → km, rename columns (4 pts)

ℹ Vectorised multiplication ( * 1.6 ) applies to every row instantly — no loop needed.

import pandas as pd SkyKilometers = pd.read_csv('SkyKilometers.csv')

Convert miles to kilometers (1 mile = 1.6 km)

SkyKilometers['QualKilometers'] = SkyKilometers['QualMiles'] * 1. SkyKilometers['BonusKilometers'] = SkyKilometers['BonusMiles'] * 1. SkyKilometers['FlightKilometers'] = SkyKilometers['FlightMiles'] * 1.

Remove the original miles columns

SkyKilometers = SkyKilometers.drop(columns=['QualMiles', 'BonusMiles', 'FlightMiles'])

Part B — Drop rows where DaysSinceEnroll < 30, print stats (4 pts)

Keep only rows where DaysSinceEnroll >= 30

SkyKilometers = SkyKilometers[SkyKilometers['DaysSinceEnroll'] >= 30]

Print summary statistics

print(SkyKilometers['DaysSinceEnroll'].min()) print(SkyKilometers['DaysSinceEnroll'].mean()) print(SkyKilometers['DaysSinceEnroll'].max())

ℹ Boolean indexing: df[df['col'] condition] keeps only rows where condition is True. This is the standard pandas filter

pattern.

Part C — Create 'MemberClass' column (4 pts)

ℹ np.where(condition, value_if_true, value_if_false) — the essential binary classification tool.

import numpy as np SkyKilometers['MemberClass'] = np.where( SkyKilometers['DaysSinceEnroll'] <= 365, 'New_Member', 'Longtime_Member' )

Part D — Print count of each MemberClass in a sentence (4 pts)

counts = SkyKilometers['MemberClass'].value_counts() print(f"There are {counts['New_Member']} New Members in the program") print(f"There are {counts['Longtime_Member']} Longtime Members in the program")

value_counts() returns a Series indexed by unique values. f-strings embed variables directly inside curly braces.

Part E — Box-and-whisker plot (2 pts)

import matplotlib.pyplot as plt SkyKilometers.boxplot(column='DaysSinceEnroll') plt.title('Distribution of DaysSinceEnroll') plt.ylabel('Days') plt.show()

Box Plot Anatomy

Feature Meaning

Center line (^) Median (50th percentile) Box edges (^) Q1 (25th) and Q3 (75th) — this range is the IQR Whiskers (^) Extend to 1.5 × IQR beyond Q1/Q Dots/circles (^) Outliers — values beyond the whiskers

try / except Flow Diagram

ℹ Code in try runs normally. If ANY exception occurs, Python jumps to the matching except block, runs it, and continues

AFTER the try/except — it does not crash.

Scenario What happens

try block runs, no error (^) Code executes normally, except block is skipped entirely try block runs, error raised (^) Python jumps to except, runs it, then continues after the block except catches the specific type Only catches the named error (ValueError here), others still crash except Exception as e (^) Catches ALL exceptions — use sparingly, hides bugs

Q5 — API Calls | OpenWeatherMap & U.S. Census | 4 points API Anatomy

Every API call has this structure:

BASE_URL + ENDPOINT + ?param1=value1¶m2=value2&key=YOUR_KEY

import requests url = 'https://api.example.com/data?city=Chicago&units=metric&appid=abc123' response = requests.get(url) # makes the HTTP GET request data = response.json() # parses JSON response into a Python dict print(data) # inspect structure before drilling in

Concept Meaning

requests.get(url) (^) Sends an HTTP GET request; returns a response object response.json() (^) Parses the JSON body into a Python dict or list response.status_code (^) 200 = OK | 401 = Bad API key | 404 = Not found response.text (^) Raw response as a string (useful for debugging) ?key=value&key2=value2 (^) Query string — parameters appended to the URL after? f'...{variable}...' (^) f-string — easiest way to build URLs with variables

Part A — Current temperature Boise, ID in Celsius (1 pt)

api_key = 'YOUR_OPENWEATHERMAP_KEY' url = (f'http://api.openweathermap.org/data/2.5/weather' f'?q=Boise,ID,US&units=metric&appid={api_key}') response = requests.get(url) data = response.json() temp_celsius = data['main']['temp'] print(f'Current temperature in Boise: {temp_celsius}°C')

ℹ units=metric returns Celsius. units=imperial returns Fahrenheit. Omit units for Kelvin (default).

Part B — Max temperature Detroit, MI (1 pt)

url = (f'http://api.openweathermap.org/data/2.5/weather' f'?q=Detroit,MI,US&units=metric&appid={api_key}') data = requests.get(url).json() print(data['main']['temp_max']) # max temperature

Part C — 2016 ACS statewide population: Wisconsin & Rhode Island (1 pt)

ℹ Census FIPS codes: Wisconsin = 55, Rhode Island = 44. B01003_001E is the total population variable.

census_key = 'YOUR_CENSUS_KEY' url = (f'https://api.census.gov/data/2016/acs/acs1' f'?get=NAME,B01003_001E&for=state:55,44&key={census_key}') response = requests.get(url) data = response.json() # returns a list of lists: [header_row, data_row, ...]

§6 — Regression, EDA & Time Series | Course Topics III– V Exploratory Data Analysis

Code Purpose

df.corr() (^) Correlation matrix — spot linear relationships and multicollinearity df['col'].value_counts() (^) Count each unique value — great for categorical columns df.describe() (^) Min, max, mean, std, quartiles for all numeric columns df.info() (^) Column names, dtypes, non-null counts, memory usage plt.hist(df['col'], bins=20) (^) Distribution — check for skew, normality, outliers plt.scatter(df['x'], df['y']) (^) Bivariate relationship between two variables sns.heatmap(df.corr(), annot=True) (^) Visual correlation matrix — annotated with values sns.regplot(x='x', y='y', data=df) (^) Scatter with regression line overlaid OLS Regression (statsmodels) import statsmodels.formula.api as smf

Estimate the model

model = smf.ols('y ~ x1 + x2 + x3', data=df) result = model.fit()

View full output

print(result.summary())

Key attributes

result.params # coefficients result.rsquared # R² result.rsquared_adj # Adjusted R² (penalises for extra variables) result.aic # AIC — lower is better result.bic # BIC — lower is better result.resid # residuals (use for diagnostic plots)

Robust standard errors (heteroskedasticity-consistent)

result_robust = result.get_robustcov_results(cov_type='HC3') Regression Interpretation Guide

Term Meaning

Coefficient (β) (^) For each 1-unit increase in X, Y changes by β units, holding all else constant p-value < 0.05 (^) The predictor is statistically significant at the 5% level R² (^) Proportion of variance in Y explained by the model (0–1) Adjusted R² (^) R² penalised for number of predictors — use for model comparison AIC / BIC (^) Lower = better model fit, penalised for complexity — use for selection F-statistic (^) Tests whether the model as a whole explains significant variance

Residuals (^) Difference between actual and predicted Y — should be random (no pattern) Time Series from statsmodels.tsa.stattools import adfuller from statsmodels.graphics.tsaplots import plot_acf, plot_pacf from statsmodels.tsa.arima.model import ARIMA

Set datetime index

df = df.set_index(pd.to_datetime(df['date']))

Plot the series

df['value'].plot(title='Time Series') plt.show()

Stationarity test (Augmented Dickey-Fuller)

result = adfuller(df['value'].dropna()) print(f'ADF Statistic: {result[0]:.4f}') print(f'p-value: {result[1]:.4f}') # p < 0.05 → stationary

ACF / PACF to find AR and MA order

plot_acf(df['value'].dropna(), lags=20) plot_pacf(df['value'].dropna(), lags=20) plt.show()

Fit ARIMA(p, d, q)

model = ARIMA(df['value'], order=(1, 1, 1)) result = model.fit() print(result.summary())

Resample daily data to monthly means

monthly = df['value'].resample('M').mean() ARIMA Order Selection

Parameter Meaning

p (AR order) (^) Number of lag terms. Read from PACF — where it cuts off d (differencing) (^) Number of times to difference to achieve stationarity. ADF test guides this. q (MA order) (^) Number of lagged error terms. Read from ACF — where it cuts off p-value > 0.05 on ADF (^) Series is non-stationary — apply differencing (d ≥ 1) p-value < 0.05 on ADF (^) Series is stationary — proceed with d = 0

§8 — Full Syntax Quick Reference DataFrame Creation & Inspection

Code Purpose

pd.DataFrame({'A':[1,2],'B':[3,4]}) (^) Create from dict pd.read_csv('file.csv') (^) Import CSV pd.read_excel('file.xlsx') (^) Import Excel pd.read_stata('file.dta') (^) Import Stata file df.head(10) (^) First 10 rows df.tail() (^) Last 5 rows df.shape (^) (rows, columns) df.info() (^) Types, nulls, memory df.describe() (^) Summary stats df.dtypes (^) Column types df.columns.tolist() (^) Column names as list df.index (^) Row index Selecting & Filtering

Code Purpose

df['col'] (^) Select one column → Series df[['col1','col2']] (^) Select multiple columns → DataFrame df.loc['row_label', 'col'] (^) Select by label df.iloc[0, 1] (^) Select by integer position df[df['col'] > 5] (^) Filter rows by condition df[(df['a']>1) & (df['b']<10)] (^) Multiple conditions — use & not 'and' df[df['col'].isin(['a','b'])] (^) Filter by list of values df[df['col'].notna()] (^) Drop NaN rows in one column Creating & Modifying Columns

Code Purpose

df['new'] = df['a'] + df['b'] (^) Vectorised arithmetic df['new'] = np.where(cond, t, f) (^) Binary conditional column df['new'] = df['col'].apply(func) (^) Apply any function row-by-row df['new'] = np.log(df['col']) (^) Natural log transformation df['new'] = (df['col']- df['col'].mean())/df['col'].std() Standardize (z-score) df = df.drop(columns=['col']) (^) Remove column(s) df = df.rename(columns={'old':'new'}) (^) Rename column

df = df.sort_values('col', ascending=False) Sort descending Aggregation & Reshaping

Code Purpose

df.groupby('col').mean() (^) Mean by group df.groupby('col').agg({'a':'sum','b':'m ean'}) Multiple aggregations df.pivot_table(values='v', index='r', columns='c', aggfunc='sum') Pivot table df.melt(id_vars='id', value_vars=['a','b']) Wide → long format pd.concat([df1, df2], ignore_index=True) Stack DataFrames vertically pd.merge(df1, df2, on='key', how='left') SQL-style join Missing Data

Code Purpose

df.isna().sum() (^) Count missing values per column df.dropna() (^) Drop rows with ANY missing value df.dropna(subset=['col']) (^) Drop rows where a specific column is NaN df.fillna(0) (^) Fill all NaN with 0 df['col'].fillna(df['col'].mean()) (^) Fill NaN with column mean Python Fundamentals

Code Meaning

for item in iterable: (^) Iterate over a list, range, dict, etc. for i, val in enumerate(lst): (^) Loop with index and value simultaneously [x**2 for x in range(10)] (^) List comprehension — compact loop {k: v for k, v in dict.items()} (^) Dict comprehension lambda x: x * (^2) Anonymous one-line function def func(x, y=10): return x + y (^) Named function with default argument try: ... except TypeError as e: (^) Error handling — catch specific exception raise ValueError('message') (^) Manually raise an exception f'Value is {var:.2f}' (^) f-string with 2 decimal format isinstance(x, int) (^) Check if x is an integer str.isalpha() (^) True if all characters are letters str.isdigit() (^) True if all characters are digits

Python and Data Wrangling Study Guide, Study notes of Economics

Related documents

Partial preview of the text

Download Python and Data Wrangling Study Guide and more Study notes Economics in PDF only on Docsity!

Python for Data Science

MASTER CHEATSHEET & EXAM GUIDE

Pandas • Regex • Data Wrangling • Error Handling • API Calls • Time Series • Regression

ℹ This sheet synthesises ALL midterm topics, complete Q1–Q5 answers with explanations, and key debugging lessons

from practice sessions.

Section Contents

Color (index) Month Age Experienced

Part A — Create a DataFrame with a custom index (1 pt)

ℹ Pass a dict of {column: [values]} to pd.DataFrame(). Use index= for row labels.

Keys become column names. Lists must all be the same length. index= replaces default 0,1,2,3 with the color strings.

Part B — Deep copy (1 pt)

⚠ Always use deep=True when you want a fully independent copy.

Method Behaviour

Part C — Conditional integer column 'is_ten' (1 pt)

ℹ np.where is the cleanest approach: np.where(condition, value_if_true, value_if_false)

Option 1 — np.where (preferred, vectorised)

Option 2 — lambda with .apply()

Both produce identical results. np.where is faster on large DataFrames because it avoids Python-level looping.

Part D — Lowercase the 'Month' column (1 pt)

✓ The Month column contains real strings, so .str.lower() works directly — no .astype(str) needed.

ℹ del only removes the name binding. Python's garbage collector frees the actual memory. Always confirm with dir() or

%whos.

Index Color Month Age Tenure is_ten Notes

Pattern Meaning

Method Purpose

Part A — Replace 'Mr.' with 'Mister' using regex (2 pts)

⚠ The dot in Mr. is a regex wildcard — escape it with . to match a literal period.

Part A — Import, convert miles → km, rename columns (4 pts)

ℹ Vectorised multiplication ( * 1.6 ) applies to every row instantly — no loop needed.

Convert miles to kilometers (1 mile = 1.6 km)

Remove the original miles columns

Part B — Drop rows where DaysSinceEnroll < 30, print stats (4 pts)

Keep only rows where DaysSinceEnroll >= 30

Print summary statistics

ℹ Boolean indexing: df[df['col'] condition] keeps only rows where condition is True. This is the standard pandas filter

pattern.

Part C — Create 'MemberClass' column (4 pts)

ℹ np.where(condition, value_if_true, value_if_false) — the essential binary classification tool.

Part D — Print count of each MemberClass in a sentence (4 pts)

value_counts() returns a Series indexed by unique values. f-strings embed variables directly inside curly braces.

Part E — Box-and-whisker plot (2 pts)

Feature Meaning

ℹ Code in try runs normally. If ANY exception occurs, Python jumps to the matching except block, runs it, and continues

AFTER the try/except — it does not crash.

Scenario What happens

Every API call has this structure:

BASE_URL + ENDPOINT + ?param1=value1¶m2=value2&key=YOUR_KEY

Concept Meaning

Part A — Current temperature Boise, ID in Celsius (1 pt)

ℹ units=metric returns Celsius. units=imperial returns Fahrenheit. Omit units for Kelvin (default).

Part B — Max temperature Detroit, MI (1 pt)

Part C — 2016 ACS statewide population: Wisconsin & Rhode Island (1 pt)

ℹ Census FIPS codes: Wisconsin = 55, Rhode Island = 44. B01003_001E is the total population variable.

Code Purpose

Estimate the model

View full output

Key attributes

Robust standard errors (heteroskedasticity-consistent)

Term Meaning

Set datetime index

Plot the series

Stationarity test (Augmented Dickey-Fuller)

ACF / PACF to find AR and MA order

Fit ARIMA(p, d, q)

Resample daily data to monthly means

Parameter Meaning

Code Purpose

Code Purpose

Code Purpose

Code Purpose

Code Purpose

Code Meaning