



An explanation of web scraping technology




The web is a major source of information for professionals in many sectors. It contains useful and useless, structured and unstructured information, in different formats and from various sources. However, web scraping is a complex activity that is also time- and resource-consuming, especially when carried out manually, and its complexity increases with the data and the websites it is collected from. Many techniques have been used to retrieve content from a web page: cut and paste, HTTP requests, query languages for semi-structured data, and DOM parsing. More advanced techniques are also used to collect data from the web, including APIs and programming languages, robots, intelligent agents, and web scraping itself. A web scraper is therefore a piece of software that simulates human browsing on the web to collect detailed information from different websites. Its advantage lies in its speed and its capacity to be automated or programmed. Whatever technique is used, the approach and the objective remain the same: capture web data and present it in a more structured format.
Web scraping, also known as web extraction or harvesting, is a technique for extracting data from the World Wide Web (WWW) and saving it to a file system or database for later retrieval or analysis. Web data is commonly scraped using the Hypertext Transfer Protocol (HTTP) or through a web browser, either manually by a user or automatically by a bot or web crawler. Because an enormous amount of heterogeneous data is constantly generated on the WWW, web scraping is widely acknowledged as an efficient and powerful technique for collecting big data (Mooney et al. 2015; Bar-Ilan 2001).
To adapt to a variety of scenarios, current web scraping techniques have evolved from small, ad hoc, human-aided procedures to fully automated systems that can convert entire websites into well-organized data sets. State-of-the-art web scraping tools can not only parse markup languages and JSON files but also integrate with computer visual analytics (Butler 2007) and natural language processing to simulate how human users browse web content (Yi et al. 2003).
With the evolution of the WWW, the landscape of internet use and data exchange has changed rapidly. As ordinary people joined the internet and began to use it, many new techniques were promoted to improve the network. At the same time, new technologies were introduced to enhance computers and network facilities, which in turn reduced the cost of hardware and of running websites. As a result of these changes, a large number of users joined the internet and began using its facilities, and this daily use has made a tremendous amount of data available online. Businesses, academics, and researchers all share their advertisements and information on the internet so that they can reach people quickly and easily. This exchange, sharing, and storage of data has created a new problem: how to handle the data overload, and how users can find the best information with the least effort. To address this issue, researchers identified a technique called web scraping. Web scraping is an essential technique for generating structured data from the unstructured data available on the web. The structured data it produces is then stored in a central database and analyzed in spreadsheets. Common data scraping techniques include traditional copy-and-paste, text grepping and regular expression matching, HTTP programming, HTML parsing, DOM parsing, web scraping software, vertical aggregation platforms, semantic annotation recognition, and computer vision web-page analyzers. Previously, most users gathered and analyzed internet data by copy-and-paste, a tedious technique in which large amounts of data are copied by hand and stored in files on a computer. Compared with this, web scraping software is the easiest scraping technique, and many such software packages are now available on the market. This paper gives an overview of the information extraction technique known as web scraping, the different techniques of web scraping, and some of the recent tools used for it.
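Two of the techniques named above, regular expression matching and HTML parsing, can be contrasted with a short sketch. This is only an illustration under stated assumptions: the sample page, the `price` class name, and the function names are hypothetical and do not come from any particular tool. It uses only the Python standard library.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; in practice this text would come from an HTTP response.
SAMPLE_HTML = """
<html><body>
  <h1>Price list</h1>
  <ul>
    <li class="item">Notebook <span class="price">$4.50</span></li>
    <li class="item">Pen <span class="price">$1.20</span></li>
  </ul>
</body></html>
"""


def prices_by_regex(html):
    """Regular expression matching: find any text that looks like a dollar price."""
    return re.findall(r"\$\d+\.\d{2}", html)


class PriceParser(HTMLParser):
    """HTML parsing: collect the text inside <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


def prices_by_parser(html):
    """Run the parser over the page and return the collected prices."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices
```

The regular expression depends only on what the text looks like, so it would also match a price appearing anywhere else on the page. The parser depends on the page structure, so it keeps working if the price format changes but breaks if the markup does.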
When you run the code for web scraping, a request is sent to the URL that you have specified. In response, the server sends the data and allows you to read the HTML or XML page. The code then parses the HTML or XML page, finds the data, and extracts it. To extract data using web scraping with Python, you need to follow these basic steps:
❖ Find the URL that you want to scrape
❖ Inspect the page
❖ Find the data you want to extract
❖ Write the code
❖ Run the code and extract the data
❖ Store the data in the required format
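The request, parse, extract, and store steps described above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function names (`fetch`, `extract_links`, `to_csv`), the choice of links as the data to extract, and CSV as the output format are all assumptions made for illustration. Only the standard library is used.

```python
import csv
import io
import urllib.request
from html.parser import HTMLParser


def fetch(url):
    """Send an HTTP request to the URL and return the page body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Collect a (text, href) pair for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Parse the page and return the extracted (text, href) pairs."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def to_csv(rows):
    """Store the extracted data in CSV format and return it as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text", "href"])
    writer.writerows(rows)
    return buffer.getvalue()
```

With a real URL of your choosing, the whole pipeline is `to_csv(extract_links(fetch(url)))`. Before running it against a site, check that the site permits automated access.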
The first and most important benefit of web scraping is that its tools have reduced data retrieval from different websites to only a few clicks. Data could be extracted before these tools existed, but it was a tedious and time-consuming process: someone had to copy and paste text, images, or other data every day. Web scraping tools now make extracting data in large volumes both simple and quick.
When a web scraping service gathers data, you can be confident that it is collecting from many websites, not just a single page. It is possible to obtain a large volume of data for a small investment, which helps you get the most out of that data.
Maintenance cost is often overlooked when new services are installed. Fortunately, web scraping technologies need little to no maintenance over time, so in the long run services and budgets do not undergo drastic changes because of maintenance.
Another feature worth mentioning is the speed with which web scraping services complete their tasks, often in a matter of hours. Of course, that depends on the complexity of the project and on the resources and tools used.
This report presents the state of the art in web scraping. We have focused on the background and future of web scraping, and have also dealt with its areas of application and its advantages. At the end of this study, we have noticed that web scraping is particularly needed in one sector, journalism, although it remains the sector with the fewest specialized tools.