










Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
It has information regarding dataframes, pandas, python, and other fundamental topics in data analysis for research purposes within the economic environment.
Typology: Study notes
1 / 18
This page cannot be seen from the preview
Don't miss anything!











What's Inside
Q1 A–I (^) Pandas DataFrame fundamentals — create, copy, transform, save, delete Q2 A–C (^) Regex & string methods — Titanic dataset Q3 A–E (^) Data wrangling — SkyKilometers (unit conversion, filter, classify, plot) Q4 (^) Loops & error handling — word counter program Q5 A–D (^) API calls — OpenWeatherMap & U.S. Census Bureau §6 (^) Regression, time series, EDA quick reference (course topics V–IV) §7 (^) Common bugs & fixes (from practice sessions) §8 (^) Full syntax quick-reference tables
Q1 — Pandas DataFrame Fundamentals | Parts A–I | 11 points The Data (reference throughout Q1)
Blue (^) December 10 True Red (^) January 12 True Yellow (^) February 10 True Green (^) March 12 False
import pandas as pd new_table = pd.DataFrame( { 'Month': ['December', 'January', 'February', 'March'], 'Age': [10, 12, 10, 12], 'Experienced': [True, True, True, False] }, index=['Blue', 'Red', 'Yellow', 'Green'] # sets row labels )
extra_table = new_table.copy(deep=True)
.copy(deep=True) (^) Fully independent — changes to extra_table do NOT affect new_table .copy(deep=False) (^) Shares underlying data — changes in one WILL affect the other extra_table = new_table (^) NOT a copy — just a second pointer to the same object
import numpy as np
extra_table['is_ten'] = np.where(extra_table['Age'] == 10, 1, 0)
extra_table['is_ten'] = extra_table['Age'].apply(lambda x: 1 if x == 10 else 0)
Q1 — Final State of extra_table
(inde x) BLUE december 10 True 1 Tenure (was Experienced) (^1) RED january 12 True 0 (^2) YELLOW february 10 True 1 (^3) GREEN march 12 False 0
Q2 — Regex & String Methods | Titanic Dataset | 6 points Regex Fundamentals
. (^) Any single character
.str.replace(r'pattern', 'new', regex=True) Replace every match of pattern with new string .str.extract(r'(pattern)') (^) Extract first capture group → new Series .str.extractall(r'(pattern)') (^) Extract ALL matches → MultiIndex DataFrame .str.findall(r'pattern') (^) Find all matches → list per row .str.contains(r'pattern') (^) Boolean mask — True where pattern found .str.match(r'pattern') (^) True if pattern matches at START of string .str.split('delim', expand=True) (^) Split into multiple columns .str.lower() / .str.upper() (^) Case conversion .str.strip() (^) Remove leading/trailing whitespace .str.len() (^) Character length of each string
import pandas as pd
Q3 — Data Wrangling | SkyKilometers Dataset | 18 points
import pandas as pd SkyKilometers = pd.read_csv('SkyKilometers.csv')
SkyKilometers['QualKilometers'] = SkyKilometers['QualMiles'] * 1. SkyKilometers['BonusKilometers'] = SkyKilometers['BonusMiles'] * 1. SkyKilometers['FlightKilometers'] = SkyKilometers['FlightMiles'] * 1.
SkyKilometers = SkyKilometers.drop(columns=['QualMiles', 'BonusMiles', 'FlightMiles'])
SkyKilometers = SkyKilometers[SkyKilometers['DaysSinceEnroll'] >= 30]
print(SkyKilometers['DaysSinceEnroll'].min()) print(SkyKilometers['DaysSinceEnroll'].mean()) print(SkyKilometers['DaysSinceEnroll'].max())
import numpy as np SkyKilometers['MemberClass'] = np.where( SkyKilometers['DaysSinceEnroll'] <= 365, 'New_Member', 'Longtime_Member' )
counts = SkyKilometers['MemberClass'].value_counts() print(f"There are {counts['New_Member']} New Members in the program") print(f"There are {counts['Longtime_Member']} Longtime Members in the program")
import matplotlib.pyplot as plt SkyKilometers.boxplot(column='DaysSinceEnroll') plt.title('Distribution of DaysSinceEnroll') plt.ylabel('Days') plt.show()
Box Plot Anatomy
Center line (^) Median (50th percentile) Box edges (^) Q1 (25th) and Q3 (75th) — this range is the IQR Whiskers (^) Extend to 1.5 × IQR beyond Q1/Q Dots/circles (^) Outliers — values beyond the whiskers
try / except Flow Diagram
try block runs, no error (^) Code executes normally, except block is skipped entirely try block runs, error raised (^) Python jumps to except, runs it, then continues after the block except catches the specific type Only catches the named error (ValueError here), others still crash except Exception as e (^) Catches ALL exceptions — use sparingly, hides bugs
Q5 — API Calls | OpenWeatherMap & U.S. Census | 4 points API Anatomy
import requests url = 'https://api.example.com/data?city=Chicago&units=metric&appid=abc123' response = requests.get(url) # makes the HTTP GET request data = response.json() # parses JSON response into a Python dict print(data) # inspect structure before drilling in
requests.get(url) (^) Sends an HTTP GET request; returns a response object response.json() (^) Parses the JSON body into a Python dict or list response.status_code (^) 200 = OK | 401 = Bad API key | 404 = Not found response.text (^) Raw response as a string (useful for debugging) ?key=value&key2=value2 (^) Query string — parameters appended to the URL after? f'...{variable}...' (^) f-string — easiest way to build URLs with variables
api_key = 'YOUR_OPENWEATHERMAP_KEY' url = (f'http://api.openweathermap.org/data/2.5/weather' f'?q=Boise,ID,US&units=metric&appid={api_key}') response = requests.get(url) data = response.json() temp_celsius = data['main']['temp'] print(f'Current temperature in Boise: {temp_celsius}°C')
url = (f'http://api.openweathermap.org/data/2.5/weather' f'?q=Detroit,MI,US&units=metric&appid={api_key}') data = requests.get(url).json() print(data['main']['temp_max']) # max temperature
census_key = 'YOUR_CENSUS_KEY' url = (f'https://api.census.gov/data/2016/acs/acs1' f'?get=NAME,B01003_001E&for=state:55,44&key={census_key}') response = requests.get(url) data = response.json() # returns a list of lists: [header_row, data_row, ...]
§6 — Regression, EDA & Time Series | Course Topics III– V Exploratory Data Analysis
df.corr() (^) Correlation matrix — spot linear relationships and multicollinearity df['col'].value_counts() (^) Count each unique value — great for categorical columns df.describe() (^) Min, max, mean, std, quartiles for all numeric columns df.info() (^) Column names, dtypes, non-null counts, memory usage plt.hist(df['col'], bins=20) (^) Distribution — check for skew, normality, outliers plt.scatter(df['x'], df['y']) (^) Bivariate relationship between two variables sns.heatmap(df.corr(), annot=True) (^) Visual correlation matrix — annotated with values sns.regplot(x='x', y='y', data=df) (^) Scatter with regression line overlaid OLS Regression (statsmodels) import statsmodels.formula.api as smf
model = smf.ols('y ~ x1 + x2 + x3', data=df) result = model.fit()
print(result.summary())
result.params # coefficients result.rsquared # R² result.rsquared_adj # Adjusted R² (penalises for extra variables) result.aic # AIC — lower is better result.bic # BIC — lower is better result.resid # residuals (use for diagnostic plots)
result_robust = result.get_robustcov_results(cov_type='HC3') Regression Interpretation Guide
Coefficient (β) (^) For each 1-unit increase in X, Y changes by β units, holding all else constant p-value < 0.05 (^) The predictor is statistically significant at the 5% level R² (^) Proportion of variance in Y explained by the model (0–1) Adjusted R² (^) R² penalised for number of predictors — use for model comparison AIC / BIC (^) Lower = better model fit, penalised for complexity — use for selection F-statistic (^) Tests whether the model as a whole explains significant variance
Residuals (^) Difference between actual and predicted Y — should be random (no pattern) Time Series from statsmodels.tsa.stattools import adfuller from statsmodels.graphics.tsaplots import plot_acf, plot_pacf from statsmodels.tsa.arima.model import ARIMA
df = df.set_index(pd.to_datetime(df['date']))
df['value'].plot(title='Time Series') plt.show()
result = adfuller(df['value'].dropna()) print(f'ADF Statistic: {result[0]:.4f}') print(f'p-value: {result[1]:.4f}') # p < 0.05 → stationary
plot_acf(df['value'].dropna(), lags=20) plot_pacf(df['value'].dropna(), lags=20) plt.show()
model = ARIMA(df['value'], order=(1, 1, 1)) result = model.fit() print(result.summary())
monthly = df['value'].resample('M').mean() ARIMA Order Selection
p (AR order) (^) Number of lag terms. Read from PACF — where it cuts off d (differencing) (^) Number of times to difference to achieve stationarity. ADF test guides this. q (MA order) (^) Number of lagged error terms. Read from ACF — where it cuts off p-value > 0.05 on ADF (^) Series is non-stationary — apply differencing (d ≥ 1) p-value < 0.05 on ADF (^) Series is stationary — proceed with d = 0
§8 — Full Syntax Quick Reference DataFrame Creation & Inspection
pd.DataFrame({'A':[1,2],'B':[3,4]}) (^) Create from dict pd.read_csv('file.csv') (^) Import CSV pd.read_excel('file.xlsx') (^) Import Excel pd.read_stata('file.dta') (^) Import Stata file df.head(10) (^) First 10 rows df.tail() (^) Last 5 rows df.shape (^) (rows, columns) df.info() (^) Types, nulls, memory df.describe() (^) Summary stats df.dtypes (^) Column types df.columns.tolist() (^) Column names as list df.index (^) Row index Selecting & Filtering
df['col'] (^) Select one column → Series df[['col1','col2']] (^) Select multiple columns → DataFrame df.loc['row_label', 'col'] (^) Select by label df.iloc[0, 1] (^) Select by integer position df[df['col'] > 5] (^) Filter rows by condition df[(df['a']>1) & (df['b']<10)] (^) Multiple conditions — use & not 'and' df[df['col'].isin(['a','b'])] (^) Filter by list of values df[df['col'].notna()] (^) Drop NaN rows in one column Creating & Modifying Columns
df['new'] = df['a'] + df['b'] (^) Vectorised arithmetic df['new'] = np.where(cond, t, f) (^) Binary conditional column df['new'] = df['col'].apply(func) (^) Apply any function row-by-row df['new'] = np.log(df['col']) (^) Natural log transformation df['new'] = (df['col']- df['col'].mean())/df['col'].std() Standardize (z-score) df = df.drop(columns=['col']) (^) Remove column(s) df = df.rename(columns={'old':'new'}) (^) Rename column
df = df.sort_values('col', ascending=False) Sort descending Aggregation & Reshaping
df.groupby('col').mean() (^) Mean by group df.groupby('col').agg({'a':'sum','b':'m ean'}) Multiple aggregations df.pivot_table(values='v', index='r', columns='c', aggfunc='sum') Pivot table df.melt(id_vars='id', value_vars=['a','b']) Wide → long format pd.concat([df1, df2], ignore_index=True) Stack DataFrames vertically pd.merge(df1, df2, on='key', how='left') SQL-style join Missing Data
df.isna().sum() (^) Count missing values per column df.dropna() (^) Drop rows with ANY missing value df.dropna(subset=['col']) (^) Drop rows where a specific column is NaN df.fillna(0) (^) Fill all NaN with 0 df['col'].fillna(df['col'].mean()) (^) Fill NaN with column mean Python Fundamentals
for item in iterable: (^) Iterate over a list, range, dict, etc. for i, val in enumerate(lst): (^) Loop with index and value simultaneously [x**2 for x in range(10)] (^) List comprehension — compact loop {k: v for k, v in dict.items()} (^) Dict comprehension lambda x: x * (^2) Anonymous one-line function def func(x, y=10): return x + y (^) Named function with default argument try: ... except TypeError as e: (^) Error handling — catch specific exception raise ValueError('message') (^) Manually raise an exception f'Value is {var:.2f}' (^) f-string with 2 decimal format isinstance(x, int) (^) Check if x is an integer str.isalpha() (^) True if all characters are letters str.isdigit() (^) True if all characters are digits