Web technologies to improve
historical research
st
Session
HUMANITIES COURSES
2nd ed.
Ask questions
- Items from Europeana, the CU WW1 Collection and Out of the Trenches relating to events that happened in West Flanders
- Population change in Belgian provinces during the war years as compared to the number of atrocities as well as total events that occurred there
These questions were suggested to show the usefulness of Operation War Diary and Out of the Trenches for the WW1 centenary commemoration (http://www.ldf.fi/project.html)
Who was Francisco Sanchez “el escéptico”? (http://www.larramendi.es/francisco_sanchez/es/micrositios/inicio.do)
1.1. Lifecycles
Digital Humanities Life Cycle
A similar trend could be seen in Digital Humanities (DH)
1. Objectives & Planning
2. Acquisition (OCR...): recording, extraction
3. Cleaning
4. Enrichment (merging and LOD)
5. Aggregation (e.g. KOS)
6. Analysis & Interpretation
7. Publication
1.2. Acquisition
There are multiple sources that are usually integrated and
that can be classified according to:
Provenance
- Public data: poorly structured
- Internal data retrieved from inside the company to make
decisions.
How they are created:
- Manually created
- Automatic data ingestion, e.g. sensor networks
Degree of formalization:
- Unstructured (80% of enterprise data), e.g. natural language
- Semi-structured: XML
- Structured: known data types, with a schema and data constraints
Information sources
Data formats. Serializations
Format: Spreadsheets (Excel, …)
Format: TSV (tab-separated values)
Example:
    NAME    NATIONALITY    WEIGHT
    Alan    Spanish        55
    John    French         129
Format: CSV (comma-separated values)
Example:
    NAME,NATIONALITY,WEIGHT
    Alan,Spanish,55
    John,French,129
Format: XML
Example:
    <person ID="1"><name>Alan</name><nationality>Spanish</nationality><weight>55</weight></person>
    <person ID="2"><name>John</name><….>
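To make the comparison concrete, here is a minimal Python sketch that reads the same two records from CSV, TSV and XML using only the standard library. The XML element names (`people`, `name`, `nationality`, `weight`) and the helper names `read_delimited` and `read_xml` are assumptions made for illustration; they are not taken from the slides.

```python
import csv
import io
import xml.etree.ElementTree as ET

CSV_TEXT = "NAME,NATIONALITY,WEIGHT\nAlan,Spanish,55\nJohn,French,129\n"
TSV_TEXT = "NAME\tNATIONALITY\tWEIGHT\nAlan\tSpanish\t55\nJohn\tFrench\t129\n"
# Element names below are assumed for illustration only.
XML_TEXT = """<people>
  <person ID="1"><name>Alan</name><nationality>Spanish</nationality><weight>55</weight></person>
  <person ID="2"><name>John</name><nationality>French</nationality><weight>129</weight></person>
</people>"""


def read_delimited(text, delimiter):
    """Parse delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [
        {
            "name": row["NAME"],
            "nationality": row["NATIONALITY"],
            "weight": int(row["WEIGHT"]),
        }
        for row in reader
    ]


def read_xml(text):
    """Parse <person> elements into the same list-of-dicts shape."""
    root = ET.fromstring(text)
    return [
        {
            "name": person.findtext("name"),
            "nationality": person.findtext("nationality"),
            "weight": int(person.findtext("weight")),
        }
        for person in root.findall("person")
    ]


csv_records = read_delimited(CSV_TEXT, ",")
tsv_records = read_delimited(TSV_TEXT, "\t")
xml_records = read_xml(XML_TEXT)

# All three serializations carry the same information.
print(csv_records == tsv_records == xml_records)
```

The point of the sketch is that the serialization is only a carrier: once parsed, the records are identical regardless of the format they arrived in.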
Data formats. Serializations
Formats Example
JSON {"example":[
{"name":"Alan", "nationality":"Spanish", "phone":{"work_ph":"25255", "cell_ph":"45433"}, "weight":51}, {other_record} ]}
JSON-based
BSON (Binary JSON) is a binary encoding of JSON-like documents. It is generally faster to traverse than text JSON, though not always smaller.
BSON adds data types (string, integer, double, date, array or boolean) and stores the document size and, for large elements, the field length.
Other serializations related to JSON include HOCON, Candle, Smile and YAML.
YAML
Data:
    given: Alan
    nationality: Spanish
    weight: 51
    age: 26
    Phone:
        - Work: 25255
        - cellular: 45433
    Address: 8 St.Paul Av. Quebec
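As a rough illustration of JSON serialization, the sketch below builds the "Alan" record as a Python dictionary, writes it out as JSON text and reads it back with the standard `json` module. The helper `is_valid_json` is a hypothetical name introduced here; it shows that key/value pairs must sit inside an object (`{ }`), since an array (`[ ]`) can only hold plain values.

```python
import json

record = {
    "name": "Alan",
    "nationality": "Spanish",
    "phone": {"work_ph": "25255", "cell_ph": "45433"},
    "weight": 51,
}
document = {"example": [record]}

# Serialize to JSON text, then parse the text back into Python objects.
text = json.dumps(document, indent=2)
restored = json.loads(text)
cell_phone = restored["example"][0]["phone"]["cell_ph"]


def is_valid_json(candidate):
    """Return True if the string parses as JSON, False otherwise."""
    try:
        json.loads(candidate)
        return True
    except json.JSONDecodeError:
        return False


# Key/value pairs inside an array are rejected by the parser.
broken = '{"phone": ["work_ph": "25255"]}'
print(is_valid_json(text), is_valid_json(broken))
```

YAML is not covered by the Python standard library, so reading the YAML variant would need a third-party parser.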
Data acquisition.
Biased data caused by poor handling
▫ Reinhart and Rogoff's (2010) findings, used to argue for austerity cutbacks worldwide, were based on mishandled data
Claim: rising levels of government debt are associated with much weaker rates of economic growth
Cause: Reinhart and Rogoff did not select the full range of rows in an Excel spreadsheet when averaging growth figures; the file also had coding errors.
http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf
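The effect of averaging over an incomplete range can be sketched in a few lines of Python. The growth figures below are invented for illustration and are not the study's data; `average_range` is a hypothetical helper that mimics a spreadsheet AVERAGE over a range of rows.

```python
from statistics import mean

# Hypothetical growth rates in percent, one per country (NOT the study's data).
growth = [3.0, 2.0, 1.0, -1.0, 4.0, 5.0]


def average_range(values, start, stop):
    """Average values[start:stop], like AVERAGE over a spreadsheet row range."""
    return mean(values[start:stop])


full = average_range(growth, 0, len(growth))  # all six rows
truncated = average_range(growth, 0, 4)       # last two rows left out by mistake

print(f"full range: {full:.2f}  truncated range: {truncated:.2f}")
```

With these made-up numbers, dropping two rows nearly halves the average, which is why checking the number of rows that actually enter a calculation is a cheap and worthwhile safeguard.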
1.2.1 Acquisition tools
Data acquisition: Web Scraping
• Also known as web harvesting or web data extraction
• A web scraper is a tool that extracts structured data from websites whose content is unstructured
• Process
▫ Visit the page/site
▫ Select the data you want to download
▫ Get the data with the tool
• E.g.: camelcamelcamel tracks price oscillation on Amazon
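The three-step process above can be sketched with Python's standard `html.parser` module. To keep the example self-contained, the "page" is a hypothetical HTML snippet embedded in the code rather than a real download; the table layout, the class names `date` and `price`, and the class `PriceParser` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Step 1 (visit the page): a real scraper would download this; here it is inlined.
PAGE = """
<html><body>
  <table id="prices">
    <tr><td class="date">2017-01-01</td><td class="price">19.99</td></tr>
    <tr><td class="date">2017-02-01</td><td class="price">17.49</td></tr>
  </table>
</body></html>
"""


class PriceParser(HTMLParser):
    """Step 2 (select the data): keep only the date and price cells."""

    def __init__(self):
        super().__init__()
        self.current = None  # class attribute of the <td> being read
        self.row = {}
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.current = dict(attrs).get("class")

    def handle_data(self, data):
        if self.current in ("date", "price"):
            self.row[self.current] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self.current = None
        elif tag == "tr" and self.row:
            self.rows.append(self.row)
            self.row = {}


# Step 3 (get the data): run the parser and build structured records.
parser = PriceParser()
parser.feed(PAGE)
prices = [(row["date"], float(row["price"])) for row in parser.rows]
print(prices)
```

Frameworks such as Scrapy wrap these same steps (download, select, export) and add crawling across many pages.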
Web Scraping: tools
- Easy to use, little or no programming knowledge required
▫ Import.io (https://www.import.io)
▫ Octoparse (http://www.octoparse.com)
▫ Screen Scraper (www.screen-scraper.com)
▫ Mozenda (http://www.mozenda.com/)
▫ Web Scraper (http://webscraper.io/)
▫ ParseHub (https://www.parsehub.com/)
▫ Portia (https://scrapinghub.com/portia/)
▫ DataScraping.co (https://www.datascraping.co/)
- Browser plug-ins
▫ Scraper
▫ Web Scraper
▫ Extracty
- Frameworks, for users with programming skills
▫ Scrapy (https://scrapy.org/), Python
▫ Jsoup (https://jsoup.org/), Java
Google Fusion Tables
- Go to https://fusiontables.google.com/ (you need a Google account) and click on Create a Fusion Table.
- If you don't have data yet, you can go to the free repository, choose Export to Fusion Tables, and then Open.
- From the File menu you can merge the table with another dataset.
- To geolocate the data, edit a column and switch its type from text to location.
- More functionality is available if you go to Help and select the classic look.
Gather data collaboratively [https://support.google.com/fusiontables/answer/2584135?hl=en]
Data Acquisition.
SPARQL on DBpedia
- Community effort to extract structured information from Wikipedia (infoboxes)
- It is interlinked with many datasets, such as GeoNames
- Datasets are linked through the SKOS vocabulary and OWL (sameAs) elements. Different groups may link datasets in different ways.
- It uses three schemata to classify concepts: Wikipedia categories, the YAGO classification and WordNet
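A minimal sketch of how a SPARQL query could be sent to DBpedia from Python, using only the standard library. The public endpoint is https://dbpedia.org/sparql; the query itself (people whose birthplace is recorded as West Flanders) and the helper names `build_url` and `run_query` are assumptions chosen for illustration, and the results depend on whatever DBpedia currently holds.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: the resource and property choice are assumptions.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?name WHERE {
  ?person dbo:birthPlace dbr:West_Flanders ;
          rdfs:label ?name .
  FILTER (lang(?name) = "en")
} LIMIT 10
"""


def build_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def run_query(query):
    """Send the query and return the result bindings (needs network access)."""
    with urlopen(build_url(query), timeout=30) as response:
        data = json.load(response)
    return data["results"]["bindings"]


url = build_url(QUERY)
print(url[:60])
# To execute against the live endpoint:
# for row in run_query(QUERY):
#     print(row["name"]["value"])
```

Building the URL is kept separate from sending it so that the query can be inspected, or pasted into the endpoint's web form, before any request is made.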