Web Scraping Technology, Cheat Sheet of Web Programming and Technologies

An explanation of web scraping technology

Typology: Cheat Sheet

2020/2021

Uploaded on 07/23/2021

Priyam1

1. INTRODUCTION

The web is a major source of information for many professionals in various sectors. It contains useful and useless, structured and unstructured information, in different formats and from various sources. However, collecting this information is a complex, time- and resource-consuming task, especially when it is carried out manually, and the complexity increases depending on the data and the websites it is collected from. Many techniques have been used to retrieve content from a web page: cut/paste, HTTP, query languages for semi-structured data, DOM, or even web scraping. More advanced techniques are also used to collect data from the web, among them APIs, computer languages, robots, intelligent agents, and web scraping. A web scraper is, therefore, a piece of software that simulates human browsing on the web to collect detailed information from different websites. The advantage of a scraper lies in its speed and its capacity to be automated and/or programmed. However, no matter what technique is used, the approach and the objectives remain the same: capture web data and present it in a more structured format.

Web scraping, also known as web extraction or harvesting, is a technique to extract data from the World Wide Web (WWW) and save it to a file system or database for later retrieval or analysis. Commonly, web data is scraped using the Hypertext Transfer Protocol (HTTP) or through a web browser. This is accomplished either manually by a user or automatically by a bot or web crawler. Because an enormous amount of heterogeneous data is constantly generated on the WWW, web scraping is widely acknowledged as an efficient and powerful technique for collecting big data (Mooney et al. 2015; Bar-Ilan 2001). To adapt to a variety of scenarios, current web scraping techniques range from small, ad hoc, human-aided procedures to fully automated systems that are able to convert entire websites into well-organized data sets. State-of-the-art web scraping tools are not only capable of parsing markup languages or JSON files but can also integrate with computer visual analytics (Butler 2007) and natural language processing to simulate how human users browse web content (Yi et al. 2003).

2. BACKGROUND OF WEB SCRAPING

Since the emergence of the WWW, the landscape of internet users and data exchange has been changing rapidly. As ordinary people joined the internet and started to use it, many new techniques were promoted to boost the network. At the same time, new technologies were introduced to enhance computers and network facilities, which in turn reduced the cost of hardware and website-related costs. Because of all these changes, a large number of users joined and now use internet facilities. Daily use of the internet has made a tremendous amount of data available online. Businesses, academics, and researchers all share their advertisements and information on the internet so that they can reach people quickly and easily. As a result of exchanging, sharing, and storing data on the internet, a new problem has arisen: how to handle such data overload, and how users can access the best information with the least effort.

To solve this issue, researchers identified a technique called web scraping. Web scraping is an important technique used to generate structured data from the unstructured data available on the web. The structured data produced by scraping is then stored in a central database and analyzed in spreadsheets. Some of the common techniques used for data scraping are traditional copy-and-paste, text grepping and regular expression matching, HTTP programming, HTML parsing, DOM parsing, web scraping software, vertical aggregation platforms, semantic annotation recognition, and computer vision web-page analyzers. Previously, most users relied on the common copy-paste technique for gathering and analyzing data on the internet, but it is a tedious technique in which the user copies a lot of data and stores it in computer files. Compared with this technique, web scraping software is the easiest scraping technique, and nowadays a lot of web scraping software is available on the market. This paper gives an overview of the information extraction technique, i.e. web scraping, the different web scraping techniques, and some of the recent tools used for web scraping.

4. IMPLEMENTATION

When you run web scraping code, a request is sent to the URL that you specified. In response, the server returns the data, which lets you read the HTML or XML page. The code then parses the HTML or XML page, finds the data, and extracts it. To extract data using web scraping with Python, follow these basic steps:

❖ Find the URL that you want to scrape
❖ Inspect the page
❖ Find the data you want to extract
❖ Write the code
❖ Run the code and extract the data
❖ Store the data in the required format
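
The steps above can be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not a definitive implementation: the function and class names (fetch_html, TitleExtractor, extract_titles, to_csv), the sample markup, and the assumption that the wanted data sits in <h2 class="title"> elements are all hypothetical choices made for the example. On a real page, the "inspect the page" step tells you which tags and attributes to look for.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Send an HTTP GET request to the URL and return the page as text."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleExtractor(HTMLParser):
    """Collect the text of every <h2 class="title"> element (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) is appended to the same title.
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Parse the HTML and return the extracted titles as a list of strings."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


def to_csv(titles):
    """Store the extracted data in the required format (here: CSV text)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title"])
    writer.writerows([title] for title in titles)
    return buffer.getvalue()


# Invented sample page, so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First article</h2>
  <p>Intro text</p>
  <h2 class="title">Second <em>article</em></h2>
  <h2>Untitled sidebar</h2>
</body></html>
"""

if __name__ == "__main__":
    # For a live page, pass fetch_html(<your URL>) instead of SAMPLE_HTML.
    print(to_csv(extract_titles(SAMPLE_HTML)))
```

Keeping the fetch, parse, and store steps in separate functions means the parsing logic can be tried on a saved copy of a page before any request is sent to the live site.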

5. ADVANTAGES

Automation

The first and most important benefit of web scraping is automation: tools have been developed that simplify data retrieval from different websites to only a few clicks. Data could be extracted before this approach, but it was a tedious and time-consuming process. Imagine having to copy and paste text, images, or other data by hand every day. Web scraping tools now make extracting large volumes of data both simple and quick.

Easy Implementation

When a website scraping service begins gathering data, you can be confident that you are obtaining data from various websites, not just a single page. A small investment can yield a large volume of data, which helps you get the most out of that data.

Low Maintenance

Maintenance cost is often overlooked when new services are installed. Fortunately, web scraping technologies need little to no maintenance over time, so in the long run, services and budgets will not change drastically because of maintenance.

Speed

Another feature worth mentioning is the speed with which web scraping services complete actions. A scraping project that would typically take weeks can be completed in a matter of hours, although this depends on the complexity of the project and on the resources and tools used.

7. CONCLUSION

This report presents the state of the art in web scraping. We have focused on the background and future of web scraping, and have also covered its areas of application and its advantages. At the end of this study, we note that one sector in particular needs web scraping: journalism, even though it remains the sector with the fewest specialized tools.