Web Scraping Technology, Cheat Sheet of Web Programming and Technologies

Visvesvaraya Technological University Web Programming and Technologies

Explained about the web scraping technology

Typology: Cheat Sheet

2020/2021

Uploaded on 07/23/2021

Priyam1 🇮🇳

1. INTRODUCTION

The web is a major source of information for professionals in many sectors. It contains useful and useless, structured and unstructured information, in different formats and from various sources. Collecting this information is a complex activity, and it is time- and resource-consuming, especially when carried out manually. The complexity increases with the nature of the data and of the websites it is collected from. Many techniques have been used to retrieve content from a web page: cut-and-paste, HTTP, query languages for semi-structured data, DOM parsing, or web scraping itself. More advanced techniques are also used to collect data from the web, among them APIs, computer languages, robots, intelligent agents and web scraping. A web scraper is software that simulates human browsing on the web to collect detailed information from different websites. The advantage of a scraper lies in its speed and its capacity to be automated or programmed. Whatever technique is used, the approach and the objective remain the same: capture web data and present it in a more structured format.

Web scraping, also known as web extraction or harvesting, is a technique for extracting data from the World Wide Web (WWW) and saving it to a file system or database for later retrieval or analysis. Commonly, web data is scraped using the Hypertext Transfer Protocol (HTTP) or through a web browser, either manually by a user or automatically by a bot or web crawler. Because an enormous amount of heterogeneous data is constantly generated on the WWW, web scraping is widely acknowledged as an efficient and powerful technique for collecting big data (Mooney et al. 2015; Bar-Ilan 2001). To adapt to a variety of scenarios, current web scraping techniques range from small ad hoc, human-aided procedures to fully automated systems that can convert entire websites into well-organized data sets. State-of-the-art web scraping tools can not only parse markup languages or JSON files but also integrate with computer visual analytics (Butler 2007) and natural language processing to simulate how human users browse web content (Yi et al. 2003).
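As a rough illustration of parsing a JSON file and saving the result to a database for later analysis, the sketch below uses only the Python standard library. The JSON payload, the table name `posts` and the function `store_records` are hypothetical and chosen for this example; they do not come from the report.

```python
import json
import sqlite3

# Hypothetical JSON payload, standing in for data returned by a website or API.
SAMPLE_JSON = '[{"title": "Post A", "views": 120}, {"title": "Post B", "views": 45}]'


def store_records(conn, payload):
    """Parse a JSON array of objects and save each object as a table row."""
    records = json.loads(payload)
    conn.execute("CREATE TABLE IF NOT EXISTS posts (title TEXT, views INTEGER)")
    # Named placeholders are filled from the keys of each parsed dictionary.
    conn.executemany(
        "INSERT INTO posts (title, views) VALUES (:title, :views)", records
    )
    conn.commit()
    return len(records)


# An in-memory database keeps the example self-contained;
# pass a file path instead to keep the data between runs.
conn = sqlite3.connect(":memory:")
stored = store_records(conn, SAMPLE_JSON)
rows = conn.execute("SELECT title, views FROM posts ORDER BY views DESC").fetchall()
print(stored, rows)
```

Once the data is in a table, later retrieval or analysis is an ordinary SQL query, as the final `SELECT` shows.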


2. BACKGROUND OF WEB SCRAPING

Since the emergence of the WWW, the way people use the internet and exchange data has changed rapidly. As ordinary users came online, many new techniques were introduced to improve network performance. At the same time, new computing and networking technologies drove down the cost of hardware and of running websites. As a result, a large number of users joined the internet, and their daily use has made a tremendous amount of data available online. Businesses, academics and researchers all share their advertisements and information on the internet so that they can reach people quickly and easily. This exchange, sharing and storage of data has created a new problem: how to handle the data overload, and how a user can get the best information with the least effort. To address this issue, researchers turned to a technique called web scraping. Web scraping generates structured data from the unstructured data available on the web. The structured data is then stored in a central database and analyzed, for example in spreadsheets. Common techniques for data scraping include traditional copy-and-paste, text grepping and regular expression matching, HTTP programming, HTML parsing, DOM parsing, web scraping software, vertical aggregation platforms, semantic annotation recognition and computer vision web-page analyzers. Previously, most users gathered and analyzed data by copy-and-paste, but this is tedious: the user copies large amounts of data by hand and stores it in files on a computer. By comparison, web scraping software is the easiest technique, and many such software packages are now available on the market. This paper gives an overview of web scraping as an information extraction technique, the different techniques of web scraping, and some of the recent tools used for it.
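Of the techniques listed above, regular expression matching is the simplest to show in a few lines. The following is a minimal sketch in Python; the HTML fragment, the pattern `ITEM_PATTERN` and the function `extract_items` are invented for illustration and assume the markup has exactly this shape.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li class="item">Laptop - $899.00</li>
  <li class="item">Mouse - $19.50</li>
</ul>
"""

# Capture the product name and the price from each list item.
ITEM_PATTERN = re.compile(
    r'<li class="item">\s*(.+?)\s*-\s*\$([0-9]+(?:\.[0-9]{2})?)\s*</li>'
)


def extract_items(html):
    """Return (name, price) pairs for every list item matching the pattern."""
    return [(name, float(price)) for name, price in ITEM_PATTERN.findall(html)]


items = extract_items(SAMPLE_HTML)
print(items)
```

A pattern like this stops matching as soon as the markup changes (an extra attribute, a different price format), which is one reason HTML and DOM parsing are usually preferred for anything beyond small, fixed pages.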

4. IMPLEMENTATION

When you run web scraping code, a request is sent to the URL that you have specified. In response, the server sends the data and allows you to read the HTML or XML page. The code then parses the HTML or XML page, finds the data and extracts it. To extract data using web scraping with Python, follow these basic steps:
❖ Find the URL that you want to scrape
❖ Inspect the page
❖ Find the data you want to extract
❖ Write the code
❖ Run the code and extract the data
❖ Store the data in the required format
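The steps above can be sketched with the Python standard library alone. This is one possible arrangement, not the report's own code: the helper names (`fetch_html`, `HeadingParser`, `extract_headings`, `to_csv`), the choice of `<h2>` elements as the target data, the `User-Agent` string and the sample page are all assumptions made for the example. The fetch function is defined but not called, so the sketch runs without network access.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url):
    """Send a request to the chosen URL and return the page as text."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class HeadingParser(HTMLParser):
    """Find the data to extract: here, the text of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.headings[-1] += data


def extract_headings(html):
    """Parse the page and return the stripped heading texts in page order."""
    parser = HeadingParser()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


def to_csv(headings):
    """Store the data in the required format: here, one-column CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["heading"])
    writer.writerows([heading] for heading in headings)
    return buffer.getvalue()


# Hypothetical page content; in practice this would be fetch_html(some_url).
SAMPLE_PAGE = "<html><body><h2>Automation</h2><p>text</p><h2> Speed </h2></body></html>"
headings = extract_headings(SAMPLE_PAGE)
csv_text = to_csv(headings)
print(csv_text)
```

Inspecting the page in a browser's developer tools is what tells you which tag or attribute to look for; only the parser class needs to change when the target data changes.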

5. ADVANTAGES

Automation

The first and most important benefit of web scraping is that tools now exist that reduce data retrieval from different websites to only a few clicks. Data could be extracted before such tools existed, but it was a tedious and time-consuming process: someone had to copy and paste text, images or other data every day. Web scraping tools now make the extraction of large volumes of data both simple and quick.

Easy Implementation

Once a web scraping service starts gathering data, it collects from many websites rather than from a single page. A small investment can therefore yield a large volume of data to work with.

Low Maintenance

Maintenance cost is often overlooked when new services are installed. Fortunately, web scraping technologies generally need little maintenance over time, so services and budgets are unlikely to change drastically in the long run.

Speed

Another feature worth mentioning is the speed with which web scraping services complete tasks. A scraping project that would typically take weeks can be completed in a matter of hours, although this depends on the complexity of the project and on the resources and tools used.

7. CONCLUSION

This report presents the state of the art in web scraping. We have focused on the background and future of web scraping, and have also dealt with its areas of application and its advantages. At the end of this study, we note that journalism is the sector where web scraping is most needed, although it remains the one with the fewest specialized tools.
