Machine Learning for Data Analysis, Lecture notes of Machine Learning

These lecture notes by Jesús Fernández-Villaverde and Galo Nuño, dated September 1, 2022, cover the role of data in economics and the importance of causality, and give examples of unsupervised learning and clustering.

Machine Learning for Data Analysis
Jesús Fernández-Villaverde (University of Pennsylvania) and Galo Nuño (Banco de España)
September 1, 2022

New data

  • Most important lesson for economists from data science: Everything is data.
  • Unstructured data: newspaper articles, business reports, congressional speeches, FOMC meeting transcripts, satellite data, photographs, audio, mobility, ...

Parish and probate data

Satellite imagery

Cell use

Table I. Most partisan phrases from the 2005 Congressional Record
Panel A: Phrases used more often by Democrats

Two-word phrases: private accounts; Rosa Parks; workers rights; trade agreement; President budget; poor people; American people; Republican party; Republican leader; tax breaks; change the rules; Arctic refuge; trade deficit; minimum wage; cut funding; oil companies; budget deficit; American workers; credit card; Republican senators; living in poverty; nuclear option; privatization plan; Senate Republicans; war in Iraq; wildlife refuge; fuel efficiency; middle class; card companies; national wildlife.

Three-word phrases: veterans health care; corporation for public broadcasting; cut health care; congressional black caucus; civil rights movement; VA health care; additional tax cuts; cuts to child support; billion in tax cuts; pay for tax cuts; drilling in the Arctic National; credit card companies; tax cuts for people; victims of gun violence; security trust fund; oil and gas companies; solvency of social security; social security trust; prescription drug bill; Voting Rights Act; privatize social security; caliber sniper rifles; war in Iraq and Afghanistan; American free trade; increase in the minimum wage; civil rights protections; central American free; system of checks and balances; credit card debt; middle class families.

(Continues)

Figure 1. Language-based and reader-submitted ratings of slant. The slant index (y axis) is shown against the average Mondo Times user rating of newspaper conservativeness (x axis), which ranges from 1 (liberal) to 5 (conservative). Included are all papers rated by at least two users on Mondo Times, with at least 25,000 mentions of our 1000 phrases in 2005. The line is predicted slant from an OLS regression of slant on Mondo Times rating. The correlation coefficient is 0.40 (p = 0.0114).

Footnote 8: We wish to thank Eric Kallgren of Mondo Code for graciously providing these data.

Economics and machine learning II

  • A more general point ⇒ role of causality in economics:
    1. Counterfactuals.
    2. Welfare.
    3. General equilibrium effects.
    4. New changes.
    5. Less data.
  • Another example by Athey (2017): hotel prices and occupancy rates. In the data, prices and occupancy rates are strongly positively correlated, but what is the expected impact of a hotel raising its prices on a given day?
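The hotel example can be illustrated with a small simulation. This is only a sketch under invented assumptions: the demand shock, the coefficients, and the helper names `simulate_hotel` and `ols` are all hypothetical and do not come from Athey (2017). An unobserved demand shock raises both prices and occupancy, so the two are positively correlated even though the assumed causal effect of a price increase on occupancy is negative.

```python
import numpy as np

def simulate_hotel(n_days=5000, price_effect=-0.002, seed=0):
    """Simulate daily prices and occupancy driven by a common demand shock.

    All numbers are illustrative assumptions, not estimates from any data set.
    """
    rng = np.random.default_rng(seed)
    demand = rng.normal(0.0, 1.0, n_days)  # demand shock, unobserved in practice
    # Hotels charge more on high-demand days.
    price = 100.0 + 20.0 * demand + rng.normal(0.0, 5.0, n_days)
    # Occupancy rises with demand and falls with price (assumed causal effect).
    occupancy = (0.6 + 0.10 * demand + price_effect * (price - 100.0)
                 + rng.normal(0.0, 0.02, n_days))
    return demand, price, occupancy

def ols(y, *regressors):
    """Least-squares coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

demand, price, occupancy = simulate_hotel()
correlation = np.corrcoef(price, occupancy)[0, 1]
naive_slope = ols(occupancy, price)[1]               # predictive slope
controlled_slope = ols(occupancy, price, demand)[1]  # slope holding demand fixed

print(f"correlation(price, occupancy): {correlation:.2f}")
print(f"slope without controlling for demand: {naive_slope:.4f}")
print(f"slope controlling for demand: {controlled_slope:.4f}")
```

In this toy setting the slope from regressing occupancy on price alone is positive, while the slope conditional on demand is close to the assumed effect of -0.002. A purely predictive model would therefore give the wrong answer to the counterfactual question. In real data the demand shock is not observed, which is why the causal question is hard.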

Unsupervised learning

  • Use a sample $D = \{x_i\}_{i=1}^{N}$ to:
    1. Group observations in interesting patterns.
    2. Describe the most important sources of variation in the data.
    3. Reduce the dimensionality of the data.
  • Example: what can we learn about the loan book of a bank without imposing too much a priori structure?
  • More concretely, we search for $p(x_i \mid \theta)$.
  • Clustering and association rules.
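As one possible illustration of describing the main sources of variation and reducing dimensionality, the sketch below runs principal component analysis through a singular value decomposition. The data are synthetic: three hypothetical loan features driven by a single latent factor, with made-up loadings. The function name `pca` is an assumption for this example.

```python
import numpy as np

def pca(X, n_components):
    """Principal component analysis via the SVD of the centered data matrix."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    scores = X_centered @ components.T  # low-dimensional representation
    return scores, components, ratio[:n_components]

# Synthetic example: three features generated by one latent factor plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
loadings = np.array([1.0, 0.8, -0.5])  # hypothetical factor loadings
X = latent[:, None] * loadings + 0.1 * rng.normal(size=(500, 3))

scores, components, ratio = pca(X, n_components=1)
print("share of variance explained by the first component:", round(ratio[0], 3))
```

Because the synthetic features share one factor, a single component captures almost all of the variance, so the three columns can be summarized by one score per observation.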

Cluster discovery

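  1. Select the number of clusters K:

     $$K^* = \arg\max_{K} \, p(K \mid D)$$

  2. Assign each observation to a cluster:

     $$z_i^* = \arg\max_{k} \, p(z_i = k \mid x_i, D)$$

  3. A common method to pin down K is the silhouette. For each observation i, we compute:

     $$s_i = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

     where a(i) is the average distance between i and all other members of its own cluster, while b(i) is the smallest average distance between i and the members of any other cluster. Values of $s_i$ lie in $[-1, 1]$; values near 1 indicate a well-assigned observation, and a common rule picks the K with the highest average silhouette.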
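The silhouette can be computed directly from pairwise distances. The sketch below is a minimal NumPy version under stated assumptions: Euclidean distance, at least two clusters, and b(i) taken as the smallest average distance from i to the members of another cluster (the usual convention). The function name `silhouette_values` and the four-point data set are made up for illustration.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette s_i for each observation, using Euclidean distances.

    Assumes at least two distinct cluster labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a singleton cluster gets s_i = 0
        # Average distance to the other members of i's cluster (dist[i, i] = 0).
        a = dist[i, own].sum() / (n_own - 1)
        # Smallest average distance to the members of any other cluster.
        b = min(dist[i, labels == k].mean() for k in clusters if k != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s

# Two tight pairs of points, far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
good = silhouette_values(X, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette_values(X, [0, 1, 0, 1])   # groups distant points together

print("average silhouette, sensible clustering:", good.mean())
print("average silhouette, poor clustering:", bad.mean())
```

The sensible assignment gets an average silhouette close to 1, and the poor one gets a negative value. The same comparison can be run across different values of K.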

K-means

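  • K-means clustering by Steinhaus (1957):

    $$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

    where $S = \{S_1, \dots, S_k\}$ is a partition of the observations into k clusters and $\mu_i$ is the mean of the points in $S_i$.
  • Implementation requires an iterative algorithm (Lloyd, 1957).
  • Related variations:
    1. k-medians ⇒ replaces means with medians, which minimizes distances in the taxicab ($L_1$) geometry.
    2. k-medoids ⇒ minimizes a sum of pairwise dissimilarities.
    3. K-SVD.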
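The iterative scheme usually attributed to Lloyd alternates two steps: assign each point to its nearest center, then move each center to the mean of its assigned points. Each step weakly lowers the within-cluster sum of squares. The sketch below is a minimal NumPy version, not a production implementation. The function name `lloyd_kmeans`, its `init_centers` argument, and the four-point data set are assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, init_centers=None, n_iter=100, seed=0):
    """Lloyd's iteration for K-means.

    Returns (labels, centers, objective), where objective is the
    within-cluster sum of squared distances.
    """
    X = np.asarray(X, dtype=float)
    if init_centers is None:
        # Assumed default: start from k distinct data points chosen at random.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centers = np.asarray(init_centers, dtype=float)
        assert centers.shape[0] == k
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    objective = ((X - centers[labels]) ** 2).sum()
    return labels, centers, objective

# Four corners of a wide rectangle: the natural clusters are left and right.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

_, good_centers, good_obj = lloyd_kmeans(X, 2, init_centers=[[0, 0], [10, 0]])
_, bad_centers, bad_obj = lloyd_kmeans(X, 2, init_centers=[[0, 0], [0, 2]])

print("objective from a start with one center per side:", good_obj)
print("objective from a start with both centers on the left:", bad_obj)
```

On these four points, the first start converges to the left/right split with an objective of 4. The second start stops at a top/bottom split with an objective of 100. Lloyd's iteration only guarantees a local optimum, so in practice it is common to run it from several starting points and keep the best result.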

Other algorithms

  • Other clustering methods:
    1. Agglomerative clustering.
    2. DBSCAN.
    3. BIRCH.
  • Principal component analysis.
  • Density estimation.
  • Gaussian mixture models.
  • Association rules and the Apriori algorithm (Agrawal and Srikant, 1994).