





Typology: Exercises






with crawler4j.

A web crawler (also known as an ant, automatic indexer, bot, web spider, web robot or web scutter) is an automated program or script that methodically scans, or “crawls”, through web pages to build an index of the data it is set to look for. This process is called web crawling or spidering.

Crawler4j is an open source Java library that provides a simple interface for crawling the Web and greatly simplifies writing a crawler. Using it, you can set up a multi-threaded web crawler in a few minutes.

https://code.google.com/p/crawler4j/downloads/list

Basically, you create the crawler by extending the WebCrawler class and then create a controller for it. In the crawler you override two basic methods:

shouldVisit - called for a given URL to determine whether it should be visited or not.
visit - called when the content of the given URL has been downloaded successfully. You can easily access the URL and the contents of the page from this method.

Below is a simple implementation of shouldVisit. It accepts only pages in the same domain as the added seed and skips CSS, JS and media files. First, create a pattern that matches the file types to skip:

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

Now override the shouldVisit method:

    @Override
    public boolean shouldVisit(WebURL url) {
        // Lower-case the URL so the extension check is case-insensitive.
        String href = url.getURL().toLowerCase();
        return !filters.matcher(href).matches()
                && href.startsWith("http://www.lankadeepa.lk/");
    }
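The filtering rule used in shouldVisit can be tried on its own, without crawler4j on the classpath. The following is a minimal sketch using only java.util.regex. The class name UrlFilterSketch, the String-based shouldVisit helper and the sample paths are hypothetical and exist only for illustration. The pattern and the seed prefix are the ones from the text above.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone helper; it is not part of crawler4j.
final class UrlFilterSketch {

    // Same extension list as the tutorial's filter. The dot is written as
    // "\\." so that it survives as an escaped dot in a Java string literal.
    private static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    // Seed prefix taken from the tutorial's shouldVisit method.
    private static final String SEED_PREFIX = "http://www.lankadeepa.lk/";

    // Mirrors the tutorial's logic, but takes a plain String instead of a WebURL.
    static boolean shouldVisit(String rawUrl) {
        String href = rawUrl.toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith(SEED_PREFIX);
    }

    public static void main(String[] args) {
        // Sample URLs are made up for the demonstration.
        String[] samples = {
            "http://www.lankadeepa.lk/index.php",
            "http://www.lankadeepa.lk/css/style.css",
            "http://www.lankadeepa.lk/images/Logo.PNG",
            "http://www.example.com/index.html"
        };
        for (String sample : samples) {
            System.out.println(shouldVisit(sample) + "  " + sample);
        }
    }
}
```

Only the first sample is accepted. The second and third are rejected by the extension filter, and the fourth by the prefix check. Because the pattern is anchored at the end of the string, a URL with a query string after the extension (for example style.css?v=2) would not be filtered out by this rule.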
You can override the visit method to print the details of the pages accessed:

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("Visited: " + url);
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            List
QUESTION#04:
For more: https://studylib.net/doc/18715310/aima-chapter-3--solving-problems-by-searching