Gnutella Peer Crawler: Implementing BFS Algorithm and Parsing Responses

CPSC 463-500: Networks and Distributed Processing

Homework #1 due 2/24/09 (100 pts)

1. Purpose

This homework builds an understanding of application-layer protocols and Windows sockets.

2. Description

Using Winsock and Visual Studio .NET 2005, your goal is to create a Gnutella crawler that discovers all currently present peers in the system. Your program will first contact a seed web server to acquire a set of initial ultrapeers, traverse the entire Gnutella network in BFS order, and then record the identities of the ultrapeers found and their children (i.e., leaf nodes) in a text file. Using this information, you will then analyze the collected data to answer several questions about the geographic and domain diversity of peers as well as the popularity of individual user agents (see below).

Requirements for the implementation:

1. Must be able to connect to a GWebCache (specified at the command prompt using a URL string host[:port][/path], where parts in [] are optional) and download a list of active seed ultrapeers. Make sure to check that the status code of the response is 200 OK and that the protocol in the first line of the response is indeed HTTP.

2. Must be able to use BFS to crawl the entire Gnutella network of ultrapeers starting from the seed list (each ultrapeer must be contacted no more than once; leaf nodes must not be contacted at all). Make sure to check that the response begins with the correct string compliant with the protocol (i.e., GNUTELLA/version statusCode statusText).

3. During the crawl, the program must record all found ultrapeers and their leaves in a set and write that set to disk at the end of the crawl (the set must contain unique elements only).

4. The final version must support operation with N threads and crawl up to M contacted ultrapeers, where both N and M are specified by the user at the command prompt (e.g., crawler.exe gwc1c.olden.ch.3557.nyud.net:8080/gwc/ 200 300000). For simplicity, count each ultrapeer pulled from the BFS queue as “contacted.”
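As a rough illustration of the checks in requirements 1 and 2, the sketch below splits a host[:port][/path] string and validates the first line of a response. It uses only the standard library, so the socket code is left out. The names SeedUrl, parseSeedUrl, and statusCode are hypothetical, and the defaults of port 80 and path "/" are assumptions of this sketch rather than something the assignment specifies.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical holder for the three parts of "host[:port][/path]".
struct SeedUrl {
    std::string host;
    int port;          // assumed default: 80 when the port is omitted
    std::string path;  // assumed default: "/" when the path is omitted
};

// Split "host[:port][/path]" into host, port, and path.
SeedUrl parseSeedUrl(const std::string& url) {
    SeedUrl out;
    out.port = 80;
    out.path = "/";

    // Everything from the first '/' onward is the path.
    std::string hostPort = url;
    std::string::size_type slash = url.find('/');
    if (slash != std::string::npos) {
        hostPort = url.substr(0, slash);
        out.path = url.substr(slash);
    }

    // An optional ":port" follows the host name.
    std::string::size_type colon = hostPort.find(':');
    if (colon != std::string::npos) {
        out.host = hostPort.substr(0, colon);
        out.port = std::atoi(hostPort.substr(colon + 1).c_str());
    } else {
        out.host = hostPort;
    }
    return out;
}

// Return the status code from a first line such as "HTTP/1.1 200 OK" or
// "GNUTELLA/0.6 200 OK". Returns -1 when the line does not begin with
// "<protocol>/" or when no numeric code follows the version.
int statusCode(const std::string& firstLine, const std::string& protocol) {
    const std::string prefix = protocol + "/";
    if (firstLine.compare(0, prefix.size(), prefix) != 0) return -1;

    std::istringstream in(firstLine.substr(prefix.size()));
    std::string version;
    int code = 0;
    if (!(in >> version >> code)) return -1;
    return code;
}
```

With these helpers, a GWebCache reply would be accepted only when statusCode(line, "HTTP") returns 200, and an ultrapeer reply only when statusCode(line, "GNUTELLA") returns a valid code.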
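The BFS bookkeeping in requirements 2 through 4 can be sketched without any networking. The single-threaded outline below is one possible arrangement, not a reference solution: the types PeerReply and CrawlResult, the function crawl, and the in-memory fakeFetch topology are all hypothetical. A real crawler would replace fakeFetch with socket code and guard the queue and sets with a lock when running N threads.

```cpp
#include <cstddef>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical reply from one ultrapeer: its neighbors and its leaves.
struct PeerReply {
    bool responsive;
    std::vector<std::string> ultrapeers;
    std::vector<std::string> leaves;
};

struct CrawlResult {
    std::set<std::string> ultrapeers;  // unique ultrapeers discovered
    std::set<std::string> leaves;      // unique leaves discovered
    std::size_t contacted;             // ultrapeers pulled from the BFS queue
    std::size_t responsive;            // ultrapeers that answered
};

// Fetch is any callable that maps a peer address to a PeerReply.
template <typename Fetch>
CrawlResult crawl(const std::vector<std::string>& seeds,
                  std::size_t maxContacted, Fetch fetch) {
    CrawlResult r;
    r.contacted = 0;
    r.responsive = 0;
    std::queue<std::string> frontier;

    // Marking a peer as seen when it is enqueued guarantees that it is
    // enqueued, and therefore contacted, at most once.
    for (std::size_t i = 0; i < seeds.size(); ++i)
        if (r.ultrapeers.insert(seeds[i]).second) frontier.push(seeds[i]);

    while (!frontier.empty() && r.contacted < maxContacted) {
        std::string peer = frontier.front();
        frontier.pop();
        ++r.contacted;  // counted as contacted even if it never answers

        PeerReply reply = fetch(peer);
        if (!reply.responsive) continue;
        ++r.responsive;

        for (std::size_t i = 0; i < reply.ultrapeers.size(); ++i)
            if (r.ultrapeers.insert(reply.ultrapeers[i]).second)
                frontier.push(reply.ultrapeers[i]);

        // Leaves are recorded but never enqueued, so they are never contacted.
        r.leaves.insert(reply.leaves.begin(), reply.leaves.end());
    }
    return r;
}

// Made-up three-ultrapeer topology used in place of real sockets.
PeerReply fakeFetch(const std::string& peer) {
    PeerReply reply;
    reply.responsive = false;  // "C" and unknown peers never answer
    if (peer == "A") {
        reply.responsive = true;
        reply.ultrapeers.push_back("B");
        reply.ultrapeers.push_back("C");
        reply.leaves.push_back("L1");
        reply.leaves.push_back("L2");
    } else if (peer == "B") {
        reply.responsive = true;
        reply.ultrapeers.push_back("A");
        reply.ultrapeers.push_back("C");
        reply.leaves.push_back("L2");
        reply.leaves.push_back("L3");
    }
    return reply;
}

// Crawl the fake topology from seed "A", stopping after maxContacted peers.
CrawlResult demoCrawl(std::size_t maxContacted) {
    std::vector<std::string> seeds(1, "A");
    return crawl(seeds, maxContacted, fakeFetch);
}
```

Storing peers in std::set handles the uniqueness requirement directly, and the limit M maps onto maxContacted because the counter is incremented at the moment a peer leaves the queue.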

Requirements for the report:

1. Explain the structure of your code and what the crawler does. Document the statistics of your crawl: the total number of ultrapeers found, how many of them were responsive (accepted a connection), how many leaves were found, how long the crawl took, and the average number of responsive ultrapeers crawled per second.

2. Experiment with 1-5000 threads and document the performance improvement from using multiple threads (e.g., using metrics such as the time needed to crawl a certain number of ultrapeers or the average number of peers found per second).

3. Perform reverse DNS lookups on all obtained IP addresses and build a distribution of the number of users per domain (e.g., the domain of host.network.cox.com is cox.com; that of network.amazon.co.uk is amazon.co.uk). The distribution should be sorted from the most popular to the least popular domain. Discuss how many domains you ob-
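For the per-domain distribution, one possible approach is sketched below. It assumes the reverse DNS lookups have already produced host names, and it reduces each name with a simplified heuristic: keep the last two labels, or the last three when the name ends in one of a few two-label suffixes. The suffix list, the names domainOf, DomainCount, moreUsers, and domainDistribution, and the alphabetical tie-break are all assumptions of this sketch. A complete solution would need a fuller suffix list.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Reduce a host name to its domain, e.g. host.network.cox.com -> cox.com
// and network.amazon.co.uk -> amazon.co.uk.
std::string domainOf(const std::string& host) {
    // Illustrative only: a real crawler needs a far longer suffix list.
    static const char* const twoLabelSuffixes[] = {
        ".co.uk", ".ac.uk", ".com.au", ".co.jp"
    };
    const std::size_t count =
        sizeof(twoLabelSuffixes) / sizeof(twoLabelSuffixes[0]);

    std::size_t keep = 2;  // labels to keep, counted from the right
    for (std::size_t i = 0; i < count; ++i) {
        const std::string suffix = twoLabelSuffixes[i];
        if (host.size() > suffix.size() &&
            host.compare(host.size() - suffix.size(), suffix.size(),
                         suffix) == 0) {
            keep = 3;
            break;
        }
    }

    // Walk backwards over 'keep' dots; if there are fewer, keep the whole name.
    std::string::size_type pos = host.size();
    for (std::size_t i = 0; i < keep; ++i) {
        if (pos == 0) return host;
        pos = host.rfind('.', pos - 1);
        if (pos == std::string::npos) return host;
    }
    return host.substr(pos + 1);
}

typedef std::pair<std::string, std::size_t> DomainCount;

// Order by descending user count; break ties alphabetically by domain.
bool moreUsers(const DomainCount& a, const DomainCount& b) {
    if (a.second != b.second) return a.second > b.second;
    return a.first < b.first;
}

// Count hosts per domain and return the counts, most popular domain first.
std::vector<DomainCount> domainDistribution(
        const std::vector<std::string>& hosts) {
    std::map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i < hosts.size(); ++i)
        ++counts[domainOf(hosts[i])];

    std::vector<DomainCount> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(), moreUsers);
    return sorted;
}
```

Counting in a std::map and then sorting a copy keeps the two steps separate, so the same counts could also be written out unsorted if needed.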

3. Details

3.2. Neighbor Lists

3.3. Parsing Replies

3.4. Performance Issues

3.6. Socket Issues

3.7. Thread Issues

3.8. Helpful Commands

3.9. Visual Studio STL

3.10. DNS Lookup Issues

3.11. Wireshark

3.12. Debug vs. Release Mode

Homework #1 Notes on Grading

Part-

Part-

Part 3