Gnutella Peer Crawler: Implementing the BFS Algorithm and Parsing Responses

This assignment covers the implementation of a Gnutella peer crawler that uses the BFS algorithm to traverse the network and parses responses to extract ultrapeer and leaf information. The crawler supports a user-specified number of threads and contacted ultrapeers, records statistics, and is used to experiment with thread performance. Ultrapeers and leaves are distinguished by their roles in the network, and peers communicate using standardized handshake messages.


Uploaded on 02/10/2009


CPSC 463-500: Networks and Distributed Processing

Homework #1 due 2/24/09 (100 pts)

1. Purpose

This homework builds an understanding of application-layer protocols and Windows sockets.

2. Description

Using Winsock and Visual Studio .NET 2005, your goal is to create a Gnutella crawler that discovers all currently present peers in the system. Your program will first contact a seed web server to acquire a set of initial ultrapeers, traverse the entire Gnutella network in BFS order, and then record the identities of found ultrapeers and their children (i.e., leaf nodes) in a text file. Using this information, you will then analyze the collected data to answer several questions about the geographic and domain diversity of peers as well as the popularity of individual user agents (see below).

Requirements for the implementation:

  1. Must be able to connect to a GWebCache (specified at the command prompt using a URL string host[:port][/path] where parts in [] are optional) and download a list of active seed ultrapeers. Make sure to check that the status code of the response is 200 OK and the protocol type in the first line of the response is indeed HTTP.
  2. Must be able to use BFS to crawl the entire Gnutella network of ultrapeers starting from the seed list (each ultrapeer must be contacted no more than once, leaf nodes must not be contacted at all). Make sure to check that the response begins with the correct string compliant with the protocol (i.e., GNUTELLA/version statusCode statusText).
  3. During the crawl, the program must record all found ultrapeers and their leaves into a set and then write it to disk at the end of the crawl (this set needs to contain unique elements only).
  4. The final version must support operation with N threads and crawl up to M contacted ultrapeers, where both N and M are specified by the user at the command prompt (e.g., crawler.exe gwc1c.olden.ch.3557.nyud.net:8080/gwc/ 200 300000). For simplicity, count each ultrapeer pulled from the BFS queue as “contacted.”

Requirements for the report:

  1. Explain the structure of your code and what the crawler does. Document the various statistics of your crawl: the total number of ultrapeers found, how many of them were responsive (accepted a connection), how many leaves were found, how long it took to perform the crawl, and the average rate of crawling responsive ultrapeers per second.
  2. Experiment with 1-5000 threads and document the performance improvement arising from using multiple threads (e.g., using such metrics as the delay needed to crawl a certain number of ultrapeers or the average number of peers found per second).
  3. Perform reverse DNS lookups on all obtained IP addresses and build a distribution of the number of users per domain (e.g., the domain of host.network.cox.com is cox.com; that of network.amazon.co.uk is amazon.co.uk). The distribution should be sorted from the most popular to the least popular domain. Discuss how many domains you observed, list several top domains, and show the number of users in each. If there are many domains, you can assign sequence numbers in the plot instead of using their full names. See Figure 1(a) for an example.
  4. Map all users to their countries (e.g., domain blah.fr belongs to France) and build a distribution of how many users from each country are represented in your crawl. See http://www.iana.org/cctld/cctld-whois.htm for a list of all valid country codes. Sort the countries by the number of found peers. See Figure 1(b) for an example. Note: you can treat 3-letter and 4-letter domains (e.g., .com, .net, .edu, .biz, .info) as country codes.
  5. Plot the distribution of user-agents used by the contacted ultrapeers using a pie chart. You should treat different versions of the same agent (e.g., LimeWire 5.22 and 5.21) as different agents. Figure 1(c) serves as a general guideline.
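
For report items 3 and 4, the mapping from a reverse-DNS hostname to a domain and a country code can be sketched as follows. This is only a heuristic under stated assumptions: the function names (`registeredDomain`, `topLevelLabel`, `domainDistribution`) and the short list of generic second-level labels are hypothetical and not part of the handout, hostnames are assumed to be lowercase already, and a real crawl may need a more complete suffix list.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Splits "a.b.c" into {"a", "b", "c"}, skipping empty labels.
static std::vector<std::string> splitLabels(const std::string& host) {
    std::vector<std::string> labels;
    std::string::size_type start = 0;
    while (start <= host.size()) {
        std::string::size_type dot = host.find('.', start);
        if (dot == std::string::npos) dot = host.size();
        if (dot > start) labels.push_back(host.substr(start, dot - start));
        start = dot + 1;
    }
    return labels;
}

// HEURISTIC (assumption): keep the last two labels, or the last three when
// the name ends in a 2-letter country code preceded by a generic label
// such as "co" (e.g., amazon.co.uk). The label list is illustrative only.
std::string registeredDomain(const std::string& host) {
    static const std::set<std::string> generic =
        {"co", "com", "net", "org", "ac", "edu", "gov"};
    std::vector<std::string> l = splitLabels(host);
    if (l.size() < 2) return host;
    std::size_t n = l.size();
    bool useThree = n >= 3 && l[n - 1].size() == 2 && generic.count(l[n - 2]) > 0;
    std::string domain = l[n - 2] + "." + l[n - 1];
    if (useThree) domain = l[n - 3] + "." + domain;
    return domain;
}

// Returns the last label, which the handout treats as the "country code"
// (including 3- and 4-letter labels such as com or info).
std::string topLevelLabel(const std::string& host) {
    std::string::size_type dot = host.rfind('.');
    return dot == std::string::npos ? host : host.substr(dot + 1);
}

// Counts users per domain and sorts from most to least popular.
std::vector<std::pair<std::string, int> >
domainDistribution(const std::vector<std::string>& hosts) {
    std::map<std::string, int> counts;
    for (const std::string& h : hosts) ++counts[registeredDomain(h)];
    std::vector<std::pair<std::string, int> > sorted(counts.begin(), counts.end());
    std::stable_sort(sorted.begin(), sorted.end(),
        [](const std::pair<std::string, int>& a,
           const std::pair<std::string, int>& b) { return a.second > b.second; });
    return sorted;
}
```

The same tally-and-sort pattern could be reused for the country and user-agent distributions by swapping `registeredDomain` for `topLevelLabel` or for the raw User-Agent string.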

[Figure 1: (a) users found vs. domain sequence number, with sample legend coxcable.com 90000, verizon.net 87000, sbc.com 46700, att.com 22000, tamu.edu 1800; (b) users found vs. country code (fr, de, be, us, pl, ru); (c) pie chart of user agents (BearShare, LimeWire, Morph, Gnucleous).]

Figure 1. Sample homework graphs.

Delivery schedule:

  1. The first part is due on 1/29/09 (25 pts): your code must be able to connect to Gnutella cache webservers specified in the command line, download the initial seed, parse it, and print the entire seed on the screen.
  2. The second part is due on 2/10/09 (25 pts): your code must be able to crawl the network using one thread. You must have all proper data structures in place (i.e., the BFS queue, the set of all found peers, and the set of visited ultrapeers) and be able to write the results into a file. Print all debugging information on screen: where you are sending connection requests, entire response headers, what you have parsed from the response and action taken for each returned peer.
  3. The third part is due on 2/24/09 (50 pts): a complete multithreaded implementation accompanied by a report.

3. Details

The current Gnutella network has a two-tier structure briefly explained in class. Ultrapeers are long-lived (i.e., reliable) nodes that form a communication network (i.e., graph) among themselves for routing requests and aggregating information about the files shared by their children. Ultrapeers usually maintain up to 30 neighbors and up to 30 children. Leaves, on the other hand, only attach to 1-2 ultrapeers and do not allow search requests originating from other nodes to pass through them. For more information, see http://en.wikipedia.org/wiki/Gnutella or section 2.6 of the textbook.

Gnutella peers communicate with each other using standardized handshake messages described in http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html. A summary of the relevant sections of the protocol is provided below.

67.180.41.229:6346\r\n

See section 2.2.3 of the textbook for a detailed description of HTTP messages.

Notes: 1) depending on the server you contact, the line break \r\n sometimes appears without the carriage return \r (i.e., as a bare \n); 2) some GWebCaches rate-limit requests and will not respond if contacted too frequently. For more details about bootstrapping in Gnutella, see http://www.pam2004.org/papers/247.pdf.
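
As a hedged illustration of note 1 and of the status check required in the implementation section, the sketch below validates the first line of a GWebCache reply and collects the seed entries while tolerating both \r\n and bare \n line endings. The function names `nextLine` and `parseSeedResponse` are hypothetical, and the sketch assumes the body holds one IP:port entry per line after the blank line that ends the HTTP header, as the sample line above suggests.

```cpp
#include <string>
#include <vector>

// Returns the line starting at pos without its terminator (\r\n or \n)
// and advances pos to the start of the following line.
static std::string nextLine(const std::string& text, std::string::size_type& pos) {
    std::string::size_type eol = text.find('\n', pos);
    if (eol == std::string::npos) eol = text.size();
    std::string line = text.substr(pos, eol - pos);
    if (!line.empty() && line[line.size() - 1] == '\r') line.erase(line.size() - 1);
    pos = (eol < text.size()) ? eol + 1 : text.size();
    return line;
}

// Returns false unless the reply starts with "HTTP/<version> 200 ...".
// On success, appends every non-empty body line to seeds.
bool parseSeedResponse(const std::string& response, std::vector<std::string>& seeds) {
    std::string::size_type pos = 0;
    std::string status = nextLine(response, pos);
    if (status.compare(0, 5, "HTTP/") != 0) return false;        // wrong protocol
    std::string::size_type sp = status.find(' ');
    if (sp == std::string::npos || status.compare(sp + 1, 3, "200") != 0)
        return false;                                             // not 200 OK
    // Skip the remaining header lines up to the blank separator line.
    while (pos < response.size() && !nextLine(response, pos).empty()) {}
    // Assumption: each remaining non-empty line is one IP:port seed.
    while (pos < response.size()) {
        std::string line = nextLine(response, pos);
        if (!line.empty()) seeds.push_back(line);
    }
    return true;
}
```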

3.2. Neighbor Lists

The latest Gnutella protocol provides a convenient interface for crawling. Each live peer must accept crawl requests and reply with a list of its neighbors' IDs. After a connection to a Gnutella user is established, you can send a crawl request to the peer using a standard Gnutella handshake:

    GNUTELLA CONNECT/0.6\r\n
    User-Agent: TAMU_CS_CRAWLER/1.0\r\n
    X-Ultrapeer: False\r\n
    Crawler: 0.1\r\n\r\n

Here, GNUTELLA CONNECT/0.6 is the Gnutella connection request specifying protocol version 0.6, User-Agent: TAMU_CS_CRAWLER/1.0 is the name of your client (can be an arbitrary string in place of TAMU_CS_CRAWLER/1.0), X-Ultrapeer: False indicates that your client wants to be viewed as a leaf node, and Crawler: 0.1 marks this message as a crawl request of version 0.1. Each field ends with a Windows-style line break \r\n and the whole message ends with an empty line (i.e., double break \r\n\r\n). After sending a crawl request, your code needs to wait for a reply from the crawled peer and then parse this response to find new peers.
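
A minimal sketch of assembling this request as a single string is shown below. The helper name `buildCrawlRequest` is hypothetical; the header names and values are the ones listed above.

```cpp
#include <string>

// Builds the crawl handshake described above. Every field ends with \r\n
// and the message is terminated by an empty line.
std::string buildCrawlRequest(const std::string& userAgent) {
    return "GNUTELLA CONNECT/0.6\r\n"
           "User-Agent: " + userAgent + "\r\n"
           "X-Ultrapeer: False\r\n"
           "Crawler: 0.1\r\n"
           "\r\n";
}
```

The resulting string can then be passed to send() together with its length; because the length is given explicitly, no NULL terminator needs to be transmitted.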

3.3. Parsing Replies

Replies to Gnutella crawl requests are similar to requests you transmit:

    GNUTELLA/0.6 200 OK\r\n
    User-Agent: BearShare 5.2.1.2\r\n
    Peers: 66.171.38.246:6346,70.61.68.76:6346\r\n
    Leaves: 128.108.111.162:3559,128.108.111.148:2679\r\n
    X-Try-Ultrapeers:72.199.169.34:6346,71.160.45.142:6346\r\n\r\n

In the first line, the peer replies with its Gnutella protocol version number and status code, where 200 means OK, 503 means busy, and 593 means OK to be crawled. Each line of the remaining reply is in the form of field_name: field_value\r\n. Field User-Agent: specifies the name and version of the client software (which is BearShare 5.2.1.2 in this case). Field Peers: provides a list of neighboring ultrapeers, which are separated by commas (the same separator applies to fields Leaves and X-Try-Ultrapeers). Field Leaves: contains a list of children attached to this peer, which may be an empty set if the crawled peer is a leaf, and field X-Try-Ultrapeers: is an optional field that contains additional ultrapeers that may be contacted (include these in your crawl).

Note that usually there is a space after the colon, but this is not always the case (e.g., in the above example, X-Try-Ultrapeers: is not followed by a space). To accommodate multiple spaces and other random formats, it is recommended that you skip all spaces and possibly tabs following the colon. Also note that field names may be capitalized or occur in a different order from the one shown above. Similarly, the reply message may contain fields unrelated to your crawl, which you should simply ignore. To illustrate these points, additional examples follow below. The first one is from a LimeWire ultrapeer:

    GNUTELLA/0.6 593 Hi\r\n
    User-Agent: LimeWire/4.9.23 (Pro)\r\n
    X-Ultrapeer: true\r\n
    Peers: 68.52.201.178:9812,69.244.250.162:6346\r\n
    Leaves: 216.218.168.49:6346,128.108.151.7:6346\r\n\r\n

Here is a response from a LimeWire leaf:

    GNUTELLA/0.6 593 Hi\r\n
    User-Agent: LimeWire/4.9.7\r\n
    X-Ultrapeer: false\r\n
    Peers: 128.108.151.7:6346\r\n
    Leaves:\r\n\r\n

It is also possible to receive the following reply with timestamps appended to each X-Try-Hubs entry:

    GNUTELLA/0.6 503 Crawling a Leaf\r\n
    User-Agent: morph500 5.2.2.1015 (GnucDNA 1.1.1.4)\r\n
    Remote-IP: 128.194.135.83\r\n
    X-Try-Ultrapeers: 24.124.51.119:3047\r\n
    X-Try-Hubs: 81.248.122.225:20805 2006-08-24T13:33Z,81.242.228.37: 2006-08-24T13:33Z\r\n\r\n

Note that you should parse all responses (including the busy code) and use the returned list of Peers and X-Try-Ultrapeers. If you intend to parse X-Try-Hubs: or other user-agent specific fields, document the details of this step in your report.
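
The parsing rules in this section (status-line check, field names in any case and any order, optional whitespace after the colon, comma-separated lists, unknown fields ignored) can be sketched as below. The struct `CrawlReply` and the functions `trim`, `splitList`, and `parseCrawlReply` are hypothetical names chosen for illustration, and X-Try-Hubs is deliberately left unparsed; treat this as one possible approach rather than the expected solution.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

struct CrawlReply {
    int status = 0;                       // e.g., 200, 503, 593
    std::string userAgent;
    std::vector<std::string> peers, leaves, tryUltrapeers;
};

// Strips leading and trailing spaces, tabs, and carriage returns.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Splits a comma-separated list, dropping empty items (e.g., "Leaves:").
static std::vector<std::string> splitList(const std::string& value) {
    std::vector<std::string> out;
    std::string::size_type start = 0;
    while (start <= value.size()) {
        std::string::size_type comma = value.find(',', start);
        if (comma == std::string::npos) comma = value.size();
        std::string item = trim(value.substr(start, comma - start));
        if (!item.empty()) out.push_back(item);
        start = comma + 1;
    }
    return out;
}

// Returns false if the first line is not "GNUTELLA/version statusCode ...".
bool parseCrawlReply(const std::string& raw, CrawlReply& reply) {
    std::string::size_type pos = 0;
    bool first = true;
    while (pos < raw.size()) {
        std::string::size_type eol = raw.find('\n', pos);   // tolerates bare \n
        if (eol == std::string::npos) eol = raw.size();
        std::string line = trim(raw.substr(pos, eol - pos));
        pos = eol + 1;
        if (first) {
            first = false;
            if (line.compare(0, 9, "GNUTELLA/") != 0) return false;
            std::string::size_type sp = line.find(' ');
            if (sp == std::string::npos) return false;
            reply.status = std::atoi(line.c_str() + sp + 1);
            continue;
        }
        if (line.empty()) break;                             // end of header
        std::string::size_type colon = line.find(':');       // first colon only
        if (colon == std::string::npos) continue;
        std::string name = trim(line.substr(0, colon));
        for (char& c : name)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        std::string value = trim(line.substr(colon + 1));    // skips spaces/tabs
        if (name == "user-agent") reply.userAgent = value;
        else if (name == "peers") reply.peers = splitList(value);
        else if (name == "leaves") reply.leaves = splitList(value);
        else if (name == "x-try-ultrapeers") reply.tryUltrapeers = splitList(value);
        // Any other field is ignored.
    }
    return !first;
}
```

Splitting on the first colon matters because every IP:port value contains a colon of its own.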

3.4. Performance Issues

You may soon find that some peers are dead or behind firewalls, in which case your connection will fail after a lengthy timeout. To speed up the crawl, this homework requires that you use multiple threads to work on the BFS queue. Each thread will pull one IP:port pair from the queue and attempt to crawl the corresponding peer. After obtaining its neighbors and children, the thread will have to check uniqueness of the found ultrapeers and then insert those that are unique (i.e., not seen before) back in the BFS queue. You will thus need to use mutexes to synchronize on updating shared data structures. The course website contains a sample file that uses two threads and inter-thread synchronization during access to a shared array. Here is an example of creating and using mutexes/semaphores:

#include <windows.h>
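
As a portable illustration of the synchronization described in section 3.4, the sketch below guards the BFS queue and the peer sets with a C++11 `std::mutex` instead of the Win32 mutex calls used in the course sample. The class name `CrawlFrontier` and its methods are hypothetical, and the sketch assumes peers are identified by their "IP:port" strings.

```cpp
#include <cstddef>
#include <mutex>
#include <queue>
#include <set>
#include <string>

// Shared crawl state. Every method takes the same mutex, so worker threads
// never observe the queue and the sets in an inconsistent state.
class CrawlFrontier {
public:
    explicit CrawlFrontier(std::size_t maxContacted)
        : maxContacted_(maxContacted), contacted_(0) {}

    // Records an ultrapeer and enqueues it only if it was never enqueued
    // before. Returns true if the peer was new to the queue.
    bool addUltrapeer(const std::string& peer) {
        std::lock_guard<std::mutex> lock(mutex_);
        found_.insert(peer);
        bool isNew = enqueued_.insert(peer).second;
        if (isNew) queue_.push(peer);
        return isNew;
    }

    // Records a leaf. Leaves are saved but never contacted.
    void addLeaf(const std::string& leaf) {
        std::lock_guard<std::mutex> lock(mutex_);
        found_.insert(leaf);
    }

    // Pops the next ultrapeer to crawl. Returns false if the queue is
    // currently empty or M ultrapeers have already been pulled.
    bool next(std::string& peer) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty() || contacted_ >= maxContacted_) return false;
        peer = queue_.front();
        queue_.pop();
        ++contacted_;                     // pulled from the queue = "contacted"
        return true;
    }

    std::size_t foundCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return found_.size();
    }

private:
    std::mutex mutex_;
    std::queue<std::string> queue_;       // BFS queue of ultrapeers
    std::set<std::string> enqueued_;      // ultrapeers ever placed in the queue
    std::set<std::string> found_;         // all unique peers (ultrapeers + leaves)
    std::size_t maxContacted_;
    std::size_t contacted_;
};
```

Each worker thread would loop on `next()`, crawl the returned peer, and feed the parsed lists back through `addUltrapeer()` and `addLeaf()`. One caveat of this simple design: an empty queue does not by itself mean the crawl is finished, since another thread may still be waiting on a reply that will add more peers, so a full implementation also needs a termination condition.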

3.6. Socket Issues

By default, Windows XP SP2 does not allow more than 5000 open sockets. If you experience a problem with opening new sockets (i.e., calls to socket() fail), you can increase this limit using the registry. There is a link on the course webpage to the documentation from Microsoft. Note: this issue does not arise in Windows 2003/2008 Server.

Another feature of Windows XP SP2 is that it limits the number of outbound connections to 10/second. There is a fix that modifies tcpip.sys to allow more connections per second (the link is also provided on the course website). To achieve good performance, you will need to raise this limit to about 500/second or more depending on how fast your code runs. Note: this issue does not arise in Windows 2003/2008 Server.

Reading from sockets is accomplished using this general algorithm:

    char buf[512];
    string s = "";                      // empty initial string
    do {
        // wait to see if the socket has any data (see MSDN for description)
        // set the timeout to 30 seconds
        if (select(sock, ..., timeout) > 0) {
            // leave one byte for NULL termination and get the data
            bytes = recv(sock, buf, 511, 0);

            if (bytes == 0)             // connection closed, all data received
                break;

            if (bytes == SOCKET_ERROR) {
                closesocket(sock);
                continue to the next peer;
            }

            buf[bytes] = 0;             // NULL terminate buffer
            s += buf;                   // append to the string
        } else {
            // timeout or select() error: give up on this peer
            closesocket(sock);
            continue to the next peer;
        }
    } while (true);

Finally, you may experience deadlocks inside recv() when the peer would not provide any data and would not close the connection in response to your request. To avoid this possibility, always use select() before attempting to read from a socket. See MSDN or a winsock tutorial on the usage of this function.
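
A compilable variant of the read loop, with the select() arguments written out, might look like the sketch below. The function name `readAll`, the `socket_t` typedef, and the non-Windows branch (added only so the same code builds outside Winsock) are assumptions for illustration; on Windows the first argument of select() is ignored, while POSIX systems need the highest descriptor plus one.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET socket_t;
#else
#include <sys/select.h>
#include <sys/socket.h>
typedef int socket_t;
#endif

// Reads from sock until the peer closes the connection, appending to out.
// Returns false if no data arrives within timeoutSec or an error occurs.
bool readAll(socket_t sock, int timeoutSec, std::string& out) {
    char buf[512];
    for (;;) {
        // select() may modify both the set and the timeout, so rebuild them
        // before every call.
        fd_set readSet;
        FD_ZERO(&readSet);
        FD_SET(sock, &readSet);
        timeval tv;
        tv.tv_sec = timeoutSec;
        tv.tv_usec = 0;
        int ready = select(static_cast<int>(sock) + 1, &readSet,
                           nullptr, nullptr, &tv);
        if (ready <= 0) return false;            // timeout (0) or error (<0)

        int bytes = static_cast<int>(
            recv(sock, buf, static_cast<int>(sizeof(buf)), 0));
        if (bytes == 0) return true;             // connection closed cleanly
        if (bytes < 0) return false;             // SOCKET_ERROR on Winsock
        // Appending with an explicit length avoids NULL-terminating buf.
        out.append(buf, static_cast<std::size_t>(bytes));
    }
}
```

The caller remains responsible for closing the socket in both the success and failure cases.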

3.7. Thread Issues

Starting too many threads may be difficult in 32-bit operating systems due to the large space needed in the kernel to handle thread control data. It is recommended that you use a system with at least 2 GB of RAM and a 64-bit operating system. Here are several suggestions that will overcome problems with running out of thread memory in the kernel (which usually manifests itself as bad_alloc exceptions with out-of-memory errors in Debug mode). First, reduce the reserve stack size in the project using Visual Studio .NET 2008:

Project Properties->Linker->System->Stack Reserve Size = 65536

Second, use Windows Task Manager to see the number of threads actually running and make your code report any errors returned from CreateThread to the user. To see the thread count per process, use View->Select Columns in Task Manager.

3.8. Helpful Commands

For debugging purposes, use telnet to establish a TCP connection and test HTTP or Gnutella commands manually. For example:

telnet www.google.com 80

opens an HTTP connection to www.google.com on port 80. You can use a similar technique to connect to Gnutella peers if their IP address and port number are known. To see what you are typing on the screen, set the local echo to true by pressing Ctrl-] and typing set localecho followed by two ENTER keys.

For information about your network configuration, use ipconfig at the command prompt (to see the DNS servers, use ipconfig /all). To manually perform DNS lookups, use nslookup host or nslookup IP at the command prompt.

3.9. Visual Studio STL

Learn how to use sets and queues in STL using MSDN. For example, you can use sets as an easy way of storing unique elements and verifying whether additional elements belong to a set (e.g., based on the observation that new elements increase set size, while duplicates do not):

    #include <set>
    using namespace std;

    // Create an empty set s0 of key type integer
    set<int> s0;

    s0.insert(10); // set size 1
    s0.insert(20); // set size 2
    s0.insert(30); // set size 3
    s0.insert(10); // set size 3, contains (10, 20, 30)

STL queues are similar:

    #include <queue>
    using namespace std;

    queue<int> q;
    q.push(10);
    q.push(20);
    q.push(30);

while (!q.empty()) {
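
Putting the two containers together, a single-threaded BFS loop can be sketched as follows. The function `bfsCrawl` and the `NeighborFn` callback are hypothetical: the callback stands in for the network step (connect, send the crawl request, parse the reply) so that the traversal logic can be exercised without sockets. The sketch relies on `set::insert` returning a pair whose second member is true only for new elements, which is an alternative to comparing set sizes.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Callback that returns the ultrapeers reported by a given peer.
typedef std::function<std::vector<std::string>(const std::string&)> NeighborFn;

// Crawls in BFS order starting from seeds and returns the peers in the
// order they were pulled from the queue, stopping after maxContacted.
std::vector<std::string> bfsCrawl(const std::vector<std::string>& seeds,
                                  const NeighborFn& neighborsOf,
                                  std::size_t maxContacted) {
    std::queue<std::string> q;           // BFS queue
    std::set<std::string> seen;          // every ultrapeer ever enqueued
    std::vector<std::string> contacted;  // ultrapeers pulled from the queue

    for (const std::string& s : seeds)
        if (seen.insert(s).second) q.push(s);

    while (!q.empty() && contacted.size() < maxContacted) {
        std::string peer = q.front();
        q.pop();
        contacted.push_back(peer);
        // Enqueue only neighbors that have not been seen before, so each
        // ultrapeer is contacted at most once.
        for (const std::string& n : neighborsOf(peer))
            if (seen.insert(n).second) q.push(n);
    }
    return contacted;
}
```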

3.10. DNS Lookup Issues

During the final phase of converting IPs into hostnames, you may generate a huge amount of traffic to your local DNS server and potentially crash it. This may lead to suspicion that you are performing malicious activity and purposely trying to compromise network security. To avoid complications, you should either run the lookups slowly (space them out over a 2-3 day period), or better yet install a local DNS server on your computer. The course website has a link explaining how to install an open-source DNS server called BIND and a free trial version of Simple DNS Plus. If this is not convenient, you can use a server in the IRL lab called s6.irl.cs.tamu.edu (128.194.135.85) for DNS lookups. Note: do not use CS or TAMU DNS servers as they have been crashed by students in this class during previous years.

To manually set the DNS server of your computer to a given IP address (i.e., 127.0.0.1 if your own computer runs DNS or 128.194.135.85 if you are using the IRL server), go into Network Connections->Local Area Connection (Properties)->Internet Protocol (TCP/IP) (Properties) and modify the field called “Preferred DNS server.” If you are using DHCP, move the radio button from “Obtain DNS server address automatically” to “Use the following DNS server addresses.” To verify that DNS is working as intended, run nslookup at the command prompt without any arguments. The program will tell you the DNS server that is currently being used.

3.11. Wireshark

Wireshark (http://www.wireshark.org/) is a software package that allows you to intercept all packets sent and received by your computer. Wireshark allows you to diagnose implementation problems encountered in this and other homework.

3.12. Debug vs. Release Mode

When using a large number of threads, always run your code in Release mode as it runs 50 times faster in STL functions and occupies 50% less memory. For scalar classes inserted into STL sets, you can roughly estimate 60 bytes per set element in Debug mode and 30 bytes in Release mode. If you insert other STL objects (such as strings) into a set, then count a minimum of 90 bytes per entry in Debug mode and 55 bytes in Release mode (plus the length of the string). Given roughly 6 million Gnutella peers concurrently online, Debug mode requires 360 MB of RAM (6,000,000 x 60 bytes) just for the set of visited peers; however, if written sloppily this can easily exceed 1 GB. Furthermore, to avoid swapping to disk and showing unacceptably low performance in your report, check that the total memory usage in Task Manager is well below your physical RAM size. A sign that something is wrong is when increasing the number of threads beyond some threshold (such as 2500) leads to significantly lower performance.

CPSC 463-500: Networks and Distributed Processing

Homework #1 Notes on Grading

If your code does not compile, you will receive 0 points.

Part 1

The following problems will result in deduction of points from a total of 25:

  1. (up to 5 pts) Program does not correctly connect or form a connection request to the server (e.g., port number not passed through htons(), sin_addr incorrectly obtained, no DNS lookup, socket is not correctly opened).
  2. (up to 5 pts) Program does not form a correct send request (e.g., the syntax of the GET request is not correct, host is missing from the request, incorrect appending of "?client=&hostfile=1").
  3. (up to 5 pts) Incorrect recv() loop (e.g., only one call to recv(), buffer is overwritten between calls, incorrect advancement of pointer, incorrect sizing of the remaining buffer, incorrect copying of buffer into a string without first NULL-terminating the buffer).
  4. (up to 5 pts) Incorrect printout of seed ultrapeer IPs and port numbers (e.g., the code prints the HTTP header in addition to other things, prints nothing at all, exceeds buffer limits while printing).
  5. (up to 5 pts) The program assumes a hard-coded IP address or DNS name of the cache. The code must accept user-specified arguments “host:port/directory” in the command line, where both port and directory are optional. The default port number is 80 and the default directory is “/”.
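
Item 5 can be illustrated with a small parser for the host[:port][/path] argument that applies the stated defaults (port 80, directory "/"). The struct `CacheUrl` and the function `parseCacheUrl` are hypothetical names, and the sketch assumes the argument carries no "http://" prefix, as in the command-line example earlier in the handout.

```cpp
#include <cstdlib>
#include <string>

struct CacheUrl {
    std::string host;
    int port;
    std::string path;
};

// Parses "host[:port][/path]". Returns false on an empty host or a port
// outside 1..65535.
bool parseCacheUrl(const std::string& arg, CacheUrl& url) {
    url.port = 80;                                   // default port
    url.path = "/";                                  // default directory

    // Everything from the first '/' onward is the path.
    std::string::size_type slash = arg.find('/');
    std::string authority = arg.substr(0, slash);
    if (slash != std::string::npos) url.path = arg.substr(slash);

    // An optional ":port" follows the host name.
    std::string::size_type colon = authority.find(':');
    url.host = authority.substr(0, colon);
    if (colon != std::string::npos) {
        url.port = std::atoi(authority.c_str() + colon + 1);
        if (url.port <= 0 || url.port > 65535) return false;
    }
    return !url.host.empty();
}
```

The parsed port would still need to pass through htons() before being stored in the socket address, as item 1 points out.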

Part 2

Additional deductions (point distribution will be decided later):

  1. Incorrect socket operations (e.g., problems with sin_addr, port number, recv() loop, requests to Gnutella, NULL-termination of buffer – whatever has not been fixed from part 1).
  2. Incorrect identification of ultrapeers to be inserted into the BFS queue (e.g., incorrect parsing of response lists, no check for uniqueness), attempts to connect to leaves, incorrect operations on queues (front, push, pop).
  3. Your code does not check for status code 200 OK in the receive buffer when connecting to cache nodes or does not check for the correct format of Gnutella responses (i.e., the status line of the response must be compliant with Gnutella). The parser must also accept Gnutella fields in any order, skip over extra spaces between words, and ignore unknown/irrelevant fields.
  4. Missing printout of debugging information, discovered peers are not saved into a file, saved peers do not include leaves, or there is repetition in the list of saved users.

Part 3

Final deductions (point distribution will be decided later):