Web technologies to improve historical research (Financial Accounting course notes)

Subject: Humanities; Degree: Finance and Accounting; University: UC3M

Type: Notes

2017/2018

Uploaded on 24/05/2018

felicity11

Web technologies to improve
historical research
1st Session
HUMANITIES COURSES
2nd ed.
2017/18

Ask questions

  • Items from Europeana, the CU WW1 Collection and Out of the Trenches relating to events that happened in West Flanders
  • Population change in Belgian provinces during the war years as compared to the number of atrocities as well as total events that occurred there

These questions were suggested to show the usefulness of Operation War Diary and Out of the Trenches for the WW1 centenary commemoration (http://www.ldf.fi/project.html)

Who was Francisco Sanchez "el escéptico" (the Sceptic)? (http://www.larramendi.es/francisco_sanchez/es/micrositios/inicio.do)


1.1. Lifecycles


Digital Humanities Life Cycle

A similar trend can be seen in Digital Humanities (DH):

  1. Objectives & Planning
  2. Acquisition (OCR...): recording, extraction
  3. Cleaning
  4. Enrichment (merging and LOD)
  5. Aggregation (i.e. KOS)
  6. Analysis & Interpretation
  7. Publication


1.2. Acquisition


Information sources

There are multiple sources, usually integrated, that can be classified according to:

Provenance:

  • Public data: poorly structured
  • Internal data: retrieved from inside the company to support decision-making

How they are created:

  • Manually created
  • Automatic data ingestion, e.g. sensor networks

Degree of formalization:

  • Unstructured (about 80% of enterprise data), e.g. natural language
  • Semi-structured: XML
  • Structured: known data types, with a schema and data constraints

Data formats. Serializations

Formats and examples:

Spreadsheets (Excel, …)

TSV (tab-separated values)

    NAME	NATIONALITY	WEIGHT
    Alan	Spanish	55
    John	French	129

CSV (comma-separated values)

    NAME,NATIONALITY,WEIGHT
    Alan,Spanish,55
    John,French,129

XML

    <person ID="1"><name>Alan</name><nationality>Spanish</nationality><weight>55</weight></person>
    <person ID="2"><name>John</name><….>
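As a minimal sketch, the same two records can be read from each of these serializations with Python's standard library. The XML child element names (`name`, `nationality`, `weight`) and the `<people>` wrapper are assumptions borrowed from the column headers, not something the slide specifies.

```python
import csv
import io
import xml.etree.ElementTree as ET

TSV_TEXT = "NAME\tNATIONALITY\tWEIGHT\nAlan\tSpanish\t55\nJohn\tFrench\t129\n"
CSV_TEXT = "NAME,NATIONALITY,WEIGHT\nAlan,Spanish,55\nJohn,French,129\n"
# Hypothetical XML layout: element names are assumed from the column headers.
XML_TEXT = (
    "<people>"
    '<person ID="1"><name>Alan</name><nationality>Spanish</nationality>'
    "<weight>55</weight></person>"
    '<person ID="2"><name>John</name><nationality>French</nationality>'
    "<weight>129</weight></person>"
    "</people>"
)


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def read_xml(text):
    """Turn each <person> element into a dict with the same keys as the tables."""
    rows = []
    for person in ET.fromstring(text).findall("person"):
        rows.append({
            "NAME": person.findtext("name"),
            "NATIONALITY": person.findtext("nationality"),
            "WEIGHT": person.findtext("weight"),
        })
    return rows


tsv_rows = read_delimited(TSV_TEXT, "\t")
csv_rows = read_delimited(CSV_TEXT, ",")
xml_rows = read_xml(XML_TEXT)
```

All three readers return the values as strings; converting `WEIGHT` to a number is a separate cleaning step.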

Data formats. Serializations

Formats and examples:

JSON

    {"example": [
      {"name": "Alan", "nationality": "Spanish", "phone": {"work_ph": "25255", "cell_ph": "45433"}, "weight": 51},
      {other_record}
    ]}

JSON-based

BSON (Binary JSON) is a binary format that is more efficient to process than JSON. BSON includes data types (string, integer, double, date, array or boolean), document size and field length in large elements. Other JSON-related serializations are HOCON, Candle, Smile and YAML.

YAML

    Data:
      given: Alan
      nationality: Spanish
      weight: 51
      age: 26
      Phone:
        - Work: 25255
        - cellular: 45433
      Address: 8 St.Paul Av. Quebec
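A short sketch of reading the JSON record with Python's `json` module follows. It assumes the phone numbers are stored as a JSON object (key/value pairs), since an array cannot hold named entries; `{other_record}` is left out because it is only a placeholder.

```python
import json

JSON_TEXT = """
{"example": [
  {"name": "Alan",
   "nationality": "Spanish",
   "phone": {"work_ph": "25255", "cell_ph": "45433"},
   "weight": 51}
]}
"""

data = json.loads(JSON_TEXT)
record = data["example"][0]

# Nested values are reached by chaining keys and list indexes.
work_phone = record["phone"]["work_ph"]

# Types survive parsing: weight is an int, the phone numbers stay strings.
weight_type = type(record["weight"]).__name__

# Serialize the record back to compact JSON text.
round_trip = json.dumps(record, separators=(",", ":"))
```

Unlike CSV, the nesting and the number type are carried by the format itself, so no header row or later type conversion is needed.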

Data acquisition.

Data biased by poor handling

▫ Reinhart and Rogoff (2010) backed global austerity cutbacks with flawed data

Claim: rising levels of government debt are associated with much weaker rates of economic growth.

Cause: Reinhart and Rogoff did not select the entire row of an Excel spreadsheet when averaging growth figures; the file also had coding errors.

http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf
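The effect of averaging over an incomplete range can be sketched in a few lines. The growth figures below are made up for illustration and are not the Reinhart and Rogoff data; `checked_mean` is a hypothetical guard, not a method taken from the critique.

```python
from statistics import mean

# Hypothetical growth rates (percent) for seven countries; NOT the real data.
growth_by_country = {
    "A": 2.0, "B": 3.0, "C": -1.0, "D": 1.0,
    "E": 4.0, "F": 3.5, "G": 1.5,
}

values = list(growth_by_country.values())

# Correct: average over every row.
full_mean = mean(values)

# Faulty: the selected range stops two rows early, as if the
# spreadsheet formula covered only the first five rows.
truncated_mean = mean(values[:5])


def checked_mean(rows, expected_count):
    """Average rows, refusing to run if some rows are missing from the range."""
    if len(rows) != expected_count:
        raise ValueError(f"averaged {len(rows)} rows, expected {expected_count}")
    return mean(rows)
```

With these invented numbers the truncated range shifts the average from 2.0 to 1.8; comparing the row count with the expected count catches the slip before the figure is reported.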


1.2.1 Acquisition tools


Data acquisition: Web Scraping

  • Also known as Web Scraper, Web Harvesting or Web Data Extraction
  • A tool to extract structured data from websites whose data are unstructured
  • Process
    ▫ Visit the page/site
    ▫ Select the data you want to download
    ▫ Get the data with the tool
  • E.g. camelcamelcamel, which tracks price oscillations on Amazon
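The three-step process can be sketched with Python's built-in `html.parser`. The page fragment, its `date` and `price` class names and the `PriceParser` class are all hypothetical; a real scraper would first download the page and would need selectors matching that site's markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real run would first download the HTML.
PAGE = """
<table id="prices">
  <tr><td class="date">2018-01-01</td><td class="price">19.99</td></tr>
  <tr><td class="date">2018-02-01</td><td class="price">17.49</td></tr>
</table>
"""


class PriceParser(HTMLParser):
    """Collect (date, price) pairs from td cells with class 'date' or 'price'."""

    def __init__(self):
        super().__init__()
        self.current = None  # class of the cell being read, if any
        self.row = {}
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.current = dict(attrs).get("class")

    def handle_data(self, data):
        # Keep text only while inside one of the cells we selected.
        if self.current in ("date", "price") and data.strip():
            self.row[self.current] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self.current = None
        elif tag == "tr" and self.row:
            self.rows.append((self.row["date"], float(self.row["price"])))
            self.row = {}


parser = PriceParser()
parser.feed(PAGE)
prices = parser.rows
```

The result is a structured list of dated prices that can be written to CSV or plotted, which is the kind of output the tools listed in this section produce without programming.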

Web Scraping: tools

  • Easy to use, little or no programming knowledge required
    ▫ Import.io (https://www.import.io)
    ▫ Octoparse (http://www.octoparse.com)
    ▫ Screen Scraper (www.screen-scraper.com)
    ▫ Mozenda (http://www.mozenda.com/)
    ▫ Web Scraper (http://webscraper.io/)
    ▫ ParseHub (https://www.parsehub.com/)
    ▫ Portia (https://scrapinghub.com/portia/)
    ▫ DataScraping.co (https://www.datascraping.co/)
  • Browser plug-ins
    ▫ Scraper
    ▫ Web Scraper
    ▫ Extracty
  • Frameworks, with programming capabilities
    ▫ Scrapy (https://scrapy.org/), Python
    ▫ Jsoup (https://jsoup.org/), Java

Google Fusion Tables

  • Go to https://fusiontables.google.com/ (you need a Google account) and click on Create a Fusion Table.
  • If you don't have data yet, go to the free repository, choose Export to Fusion Tables and then Open.
  • In the File menu you can merge the table with another dataset.
  • To carry out geolocation, edit a column and switch its type from Text to Location.
  • More functionality is available if you go to Help and select the classic look.

Gather data collaboratively: https://support.google.com/fusiontables/answer/2584135?hl=en

Data acquisition.

SPARQL on DBpedia

  • DBpedia is a community effort to extract structured information from Wikipedia (infoboxes).
  • It is interlinked with many datasets, such as GeoNames.
  • Different datasets are merged through SKOS vocabulary and OWL (sameAs) elements. Different groups may link datasets in different ways.
  • It uses three schemata to classify concepts: Wikipedia categories, the YAGO classification and WordNet.
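As a hedged sketch, a SPARQL request to DBpedia can be prepared as below without sending it. The endpoint address, the `format` parameter and the `West_Flanders` resource IRI are assumptions to check against the current DBpedia documentation; `build_request_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed public endpoint; check the DBpedia site for the current address.
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: resources linked to West Flanders through owl:sameAs.
# The resource IRI is an assumption.
QUERY = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?same WHERE {
  <http://dbpedia.org/resource/West_Flanders> owl:sameAs ?same .
}
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Encode the query as a GET request that asks for JSON results."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",
    }
    return f"{endpoint}?{urlencode(params)}"


url = build_request_url(ENDPOINT, QUERY)
```

Opening `url` with any HTTP client should return the bindings for `?same`, that is, the identifiers other datasets use for the same place, which is how the sameAs links mentioned above are followed in practice.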