






Implementation of a Gnutella peer crawler that uses the BFS algorithm to traverse the network and parses responses to extract ultrapeer and leaf information. The crawler supports user-specified threads and contacts, records statistics, and is used to experiment with thread performance. Ultrapeers and leaves are distinguished by their roles in the network, and communication relies on standardized handshake messages.







This homework builds an understanding of application-layer protocols and Windows sockets.
Using Winsock and Visual Studio .NET 2005, your goal is to create a Gnutella crawler that discovers all peers currently present in the system. Your program will first contact a seed web server to acquire a set of initial ultrapeers, traverse the entire Gnutella network in BFS order, and then record the identities of the ultrapeers found and their children (i.e., leaf nodes) in a text file. You will then analyze the collected data to answer several questions about the geographic and domain diversity of peers as well as the popularity of individual user agents (see below).
Requirements for the implementation:
Requirements for the report:
served, list several top domains, and show the number of users in each. If there are many domains, you can assign sequence numbers in the plot instead of using their full names. See Figure 1(a) for an example.
[Figure 1 plots: users found per country code (fr, de, be, us, pl, ru); users found per user agent (BearShare, LimeWire, Morph, Gnucleous); users found per domain sequence number (1 to 5), with legend coxcable.com 90000, verizon.net 87000, sbc.com 46700, att.com 22000, tamu.edu 1800.]
Figure 1. Sample homework graphs.
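For the domain plot, one possible way to tally users per domain after IPs have been resolved to hostnames is sketched below. The helper names lastLabels and topDomains are hypothetical, and treating the last two dot-separated labels as "the domain" is an assumption of this sketch (it would mis-group names such as example.co.uk). Calling lastLabels(host, 1) gives the top-level label, which can serve as a country code when it is two letters long.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> DomainCount;

// Hypothetical helper: returns the last `labels` dot-separated labels of a
// hostname. Hostnames with fewer labels are returned unchanged.
std::string lastLabels(const std::string& host, int labels) {
    std::size_t pos = host.size();
    for (int i = 0; i < labels; ++i) {
        if (pos == 0) return host;
        pos = host.rfind('.', pos - 1);
        if (pos == std::string::npos) return host;
    }
    return host.substr(pos + 1);
}

// Hypothetical helper: counts hosts per two-label domain (an assumption of
// this sketch) and returns the n most frequent, ties broken alphabetically.
std::vector<DomainCount> topDomains(const std::vector<std::string>& hosts,
                                    std::size_t n) {
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i < hosts.size(); ++i)
        ++counts[lastLabels(hosts[i], 2)];

    std::vector<DomainCount> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const DomainCount& a, const DomainCount& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    if (ranked.size() > n) ranked.resize(n);
    return ranked;
}
```

The ranked list can then be written to a text file and plotted with domain sequence numbers in place of full names.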
Delivery schedule:
The current Gnutella network has a two-tier structure briefly explained in class. Ultrapeers are long-lived (i.e., reliable) nodes that form a communication network (i.e., graph) among themselves for routing requests and aggregating information about the files shared by their children. Ultrapeers usually maintain up to 30 neighbors and up to 30 children. Leaves, on the other hand, attach to only 1-2 ultrapeers and do not allow search requests originating from other nodes to pass through them. For more information, see http://en.wikipedia.org/wiki/Gnutella or section 2.6 of the textbook. Gnutella peers communicate with each other using standardized handshake messages described in http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html. A summary of the relevant sections of the protocol is provided below.
67.180.41.229:6346\r\n
See section 2.2.3 of the textbook for a detailed description of HTTP messages.
Notes: 1) depending on the server you contact, the line break \r\n sometimes appears without the carriage return \r; 2) some GWebCaches rate-limit requests and will not respond if contacted too frequently. For more details about bootstrapping in Gnutella, see http://www.pam2004.org/papers/247.pdf.
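A minimal sketch of splitting a GWebCache reply into IP:port entries while tolerating both \r\n and bare \n line breaks is shown below. The function name parseHostList is hypothetical, and the sketch assumes its input is the HTTP body only (headers already removed) with one IP:port pair per line.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: extracts "ip:port" lines from the body of a seed
// server reply. Assumes the HTTP headers have already been stripped.
std::vector<std::string> parseHostList(const std::string& body) {
    std::vector<std::string> peers;
    std::istringstream in(body);
    std::string line;
    while (std::getline(in, line)) {              // splits on '\n'
        if (!line.empty() && line.back() == '\r')
            line.pop_back();                      // drop the optional '\r'
        std::size_t colon = line.find(':');
        // Skip blank lines and anything without text on both sides of ':'.
        if (colon == std::string::npos || colon == 0 ||
            colon + 1 == line.size())
            continue;
        peers.push_back(line);
    }
    return peers;
}
```

Each returned entry can then be split at the colon into an address and a port before being placed in the BFS queue.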
The latest Gnutella protocol provides a convenient interface for crawling. Each live peer must accept crawl requests and reply with a list of its neighbors' IDs. After a connection to a Gnutella user is established, you can send a crawl request to the peer using a standard Gnutella handshake:
GNUTELLA CONNECT/0.6\r\n
User-Agent: TAMU_CS_CRAWLER/1.0\r\n
X-Ultrapeer: False\r\n
Crawler: 0.1\r\n\r\n
Here, GNUTELLA CONNECT/0.6 is the Gnutella connection request specifying protocol version 0.6, User-Agent: TAMU_CS_CRAWLER/1.0 is the name of your client (can be an arbitrary string in place of TAMU_CS_CRAWLER/1.0), X-Ultrapeer: False indicates that your client wants to be viewed as a leaf node, and Crawler: 0.1 marks this message as a crawl request of version 0.1. Each field ends with a Windows-style line break \r\n and the whole message ends with an empty line (i.e., double break \r\n\r\n). After sending a crawl request, your code needs to wait for a reply from the crawled peer and then parse this response to find new peers.
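As an illustration, the request described above could be assembled as follows. The function name buildCrawlRequest is hypothetical; the header lines themselves are taken from the example handshake.

```cpp
#include <string>

// Hypothetical helper: builds the crawl request shown above. userAgent may be
// any string, e.g. "TAMU_CS_CRAWLER/1.0".
std::string buildCrawlRequest(const std::string& userAgent) {
    std::string request;
    request += "GNUTELLA CONNECT/0.6\r\n";            // connection request, protocol 0.6
    request += "User-Agent: " + userAgent + "\r\n";   // name of this client
    request += "X-Ultrapeer: False\r\n";              // ask to be treated as a leaf
    request += "Crawler: 0.1\r\n";                    // marks a crawl request, version 0.1
    request += "\r\n";                                // empty line ends the message
    return request;
}
```

When passing the result to send(), use request.size() as the length so that no terminating NULL byte is transmitted after the final \r\n.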
Replies to Gnutella crawl requests are similar to requests you transmit:
GNUTELLA/0.6 200 OK\r\n
User-Agent: BearShare 5.2.1.2\r\n
Peers: 66.171.38.246:6346,70.61.68.76:6346\r\n
Leaves: 128.108.111.162:3559,128.108.111.148:2679\r\n
X-Try-Ultrapeers:72.199.169.34:6346,71.160.45.142:6346\r\n\r\n
In the first line, the peer replies with its Gnutella protocol version number and status code, where 200 means OK, 503 means busy, and 593 means OK to be crawled. Each remaining line of the reply has the form field_name: field_value\r\n. Field User-Agent: specifies the name and version of the client software (BearShare 5.2.1.2 in this case). Field Peers: provides a comma-separated list of neighboring ultrapeers (the same separator applies to fields Leaves and X-Try-Ultrapeers). Field Leaves: contains a list of children attached to this peer, which may be empty if the crawled peer is a leaf, and field X-Try-Ultrapeers: is an optional field that contains additional ultrapeers that may be contacted (include these in your crawl).
Note that usually there is a space after the colon, but this is not always the case (e.g., in the above example, X-Try-Ultrapeers: is not followed by a space). To accommodate multiple spaces and other random formats, it is recommended that you skip all spaces and possibly tabs following the colon. Also note that field names may be capitalized or occur in a different order from the one shown above. Similarly, the reply message may contain fields unrelated to your crawl, which you should simply ignore. To illustrate these points, additional examples follow below. The first one is from a LimeWire ultrapeer:
GNUTELLA/0.6 593 Hi\r\n
User-Agent: LimeWire/4.9.23 (Pro)\r\n
X-Ultrapeer: true\r\n
Peers: 68.52.201.178:9812,69.244.250.162:6346\r\n
Leaves: 216.218.168.49:6346,128.108.151.7:6346\r\n\r\n
Here is a response from a LimeWire leaf:
GNUTELLA/0.6 593 Hi\r\n
User-Agent: LimeWire/4.9.7\r\n
X-Ultrapeer: false\r\n
Peers: 128.108.151.7:6346\r\n
Leaves:\r\n\r\n
It is also possible to receive the following reply with timestamps appended to each X-Try-Hubs ultrapeer:
GNUTELLA/0.6 503 Crawling a Leaf\r\n
User-Agent: morph500 5.2.2.1015 (GnucDNA 1.1.1.4)\r\n
Remote-IP: 128.194.135.83\r\n
X-Try-Ultrapeers: 24.124.51.119:3047\r\n
X-Try-Hubs: 81.248.122.225:20805 2006-08-24T13:33Z,81.242.228.37: 2006-08-24T13:33Z\r\n\r\n
Note that you should parse all responses (including the busy code) and use the returned list of Peers and X-Try-Ultrapeers. If you intend to parse X-Try-Hubs: or other user-agent specific fields, document the details of this step in your report.
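The parsing rules above can be sketched roughly as follows. The names CrawlReply, splitAddresses, and parseCrawlReply are hypothetical, and the sketch makes a few assumptions: the whole reply is already in one string, field names are compared case-insensitively, spaces and tabs after the colon are skipped, unknown fields (including X-Try-Hubs) are ignored, and anything after the first whitespace inside a list entry is dropped.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct CrawlReply {
    int status = 0;                       // e.g. 200, 503, 593
    std::string userAgent;
    std::vector<std::string> peers, leaves, tryUltrapeers;
};

static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Splits a comma-separated list of IP:port entries, trimming whitespace.
static std::vector<std::string> splitAddresses(const std::string& value) {
    std::vector<std::string> out;
    std::istringstream in(value);
    std::string item;
    while (std::getline(in, item, ',')) {
        std::size_t b = item.find_first_not_of(" \t");
        if (b == std::string::npos) continue;            // empty entry
        std::size_t e = item.find_first_of(" \t", b);    // cut trailing text
        out.push_back(item.substr(b, e == std::string::npos ? e : e - b));
    }
    return out;
}

CrawlReply parseCrawlReply(const std::string& raw) {
    CrawlReply reply;
    std::istringstream in(raw);
    std::string line;
    bool firstLine = true;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (firstLine) {                   // "GNUTELLA/0.6 <code> <text>"
            std::istringstream statusLine(line);
            std::string protocol;
            statusLine >> protocol >> reply.status;
            firstLine = false;
            continue;
        }
        std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::string name = toLower(line.substr(0, colon));
        std::size_t start = line.find_first_not_of(" \t", colon + 1);
        std::string value =
            (start == std::string::npos) ? std::string() : line.substr(start);
        if (name == "user-agent") reply.userAgent = value;
        else if (name == "peers") reply.peers = splitAddresses(value);
        else if (name == "leaves") reply.leaves = splitAddresses(value);
        else if (name == "x-try-ultrapeers") reply.tryUltrapeers = splitAddresses(value);
    }
    return reply;
}
```

Note that the status code is recorded but does not stop parsing, so lists returned with a 503 reply are still collected.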
You may soon find that some peers are dead or behind firewalls, in which case your connection will fail after a lengthy timeout. To speed up the crawl, this homework requires that you use multiple threads to work on the BFS queue. Each thread will pull one IP:port pair from the queue and attempt to crawl the corresponding peer. After obtaining its neighbors and children, the thread will have to check the uniqueness of the ultrapeers found and then insert those that are unique (i.e., not seen before) back into the BFS queue. You will thus need to use mutexes to synchronize updates to shared data structures. The course website contains a sample file that uses two threads and inter-thread synchronization during access to a shared array. Here is an example of creating and using mutexes/semaphores:
#include <windows.h>
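As a portable illustration of the synchronization described above, the sketch below guards the set of seen peers and the BFS queue with a single lock. It uses the standard C++ std::mutex rather than the Win32 mutex API of the course sample, and the class name Frontier and its methods are hypothetical.

```cpp
#include <cstddef>
#include <mutex>
#include <queue>
#include <set>
#include <string>

// Hypothetical shared structure: the BFS queue plus the set of peers seen so
// far, protected by one mutex so worker threads can use it concurrently.
class Frontier {
public:
    // Enqueues addr only if it was never seen before; returns true if added.
    bool pushIfNew(const std::string& addr) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!seen_.insert(addr).second) return false;   // duplicate
        pending_.push(addr);
        return true;
    }

    // Moves the next address into out; returns false if the queue is
    // currently empty (other threads may still add more later).
    bool tryPop(std::string& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_.empty()) return false;
        out = pending_.front();
        pending_.pop();
        return true;
    }

    std::size_t seenCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return seen_.size();
    }

private:
    mutable std::mutex mutex_;
    std::set<std::string> seen_;
    std::queue<std::string> pending_;
};
```

Each worker thread would loop on tryPop(), crawl the returned peer without holding the lock, and call pushIfNew() for every ultrapeer in the reply. An empty queue does not by itself mean the crawl is finished, since other threads may still be waiting on replies, so a real crawler also needs to track how many threads are busy.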
By default, Windows XP SP2 does not allow more than 5000 open sockets. If you experience a problem with opening new sockets (i.e., calls to socket() fail), you can increase this limit using the registry. There is a link on the course webpage to the documentation from Microsoft. Note: this issue does not arise in Windows 2003/2008 Server.
Another feature of Windows XP SP2 is that it limits the number of outbound connections to 10/second. There is a fix that modifies tcpip.sys to allow more connections per second (the link is also provided on the course website). To achieve good performance, you will need to raise this limit to about 500/second or more depending on how fast your code runs. Note: this issue does not arise in Windows 2003/2008 Server.
Reading from sockets is accomplished using this general algorithm:
char buf[512];
string response = "";                 // empty initial string
do {
    // wait to see if the socket has any data (see MSDN for description)
    // set the timeout to 30 seconds
    fd_set readSet;
    FD_ZERO(&readSet);
    FD_SET(sock, &readSet);
    timeval timeout = { 30, 0 };
    if (select(0, &readSet, NULL, NULL, &timeout) > 0) {
        // leave one byte for NULL termination and get the data
        int bytes = recv(sock, buf, 511, 0);
        if (bytes == 0)               // connection closed, all data received
            break;
        if (bytes == SOCKET_ERROR) {
            closesocket(sock);
            break;                    // continue to the next peer
        }
        buf[bytes] = 0;               // NULL terminate buffer
        response += buf;              // append to the string
    } else {
        // select() timed out or failed: give up on this peer
        closesocket(sock);
        break;                        // continue to the next peer
    }
} while (true);
Finally, you may experience deadlocks inside recv() when the peer would not provide any data and would not close the connection in response to your request. To avoid this possibility, always use select() before attempting to read from a socket. See MSDN or a winsock tutorial on the usage of this function.
Starting too many threads may be difficult in 32-bit operating systems due to the large space needed in the kernel to handle thread control data. It is recommended that you use a system with at least 2 GB of RAM and a 64-bit operating system. Here are several suggestions that will overcome problems with running out of thread memory in the kernel (which usually manifests itself as bad_alloc exceptions with out-of-memory errors in Debug mode). First, reduce the reserve stack size in the project using Visual Studio .NET 2008:
Project Properties->Linker->System->Stack Reserve Size = 65536
Second, use Windows Task Manager to see the number of threads actually running and make your code report any errors returned from CreateThread to the user. To see the thread count per process, use View->Select Columns in Task Manager.
For debugging purposes, use telnet. For example, the command
telnet www.google.com 80
opens an HTTP connection to www.google.com on port 80. You can use a similar technique to connect to Gnutella peers if their IP address and port number are known. To see what you are typing on the screen, set the local echo to true by pressing Ctrl-] and typing set localecho followed by two ENTER keys.
For information about your network configuration, use ipconfig at the command prompt (to see the DNS servers, use ipconfig /all). To manually perform DNS lookups, use nslookup host or nslookup IP at the command prompt.
Learn how to use sets and queues in STL using MSDN. For example, you can use sets as an easy way of storing unique elements and verifying whether additional elements belong to a set (e.g., based on the observation that new elements increase set size, while duplicates do not):
#include <set>
using namespace std;
// Create an empty set s0 of key type integer
set<int> s0;
s0.insert(10); // set size 1
s0.insert(20); // set size 2
s0.insert(30); // set size 3
s0.insert(10); // set size 3, contains (10, 20, 30)
STL queues are similar:
#include <queue>
queue
while (!q.empty()) {
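To show how a set and a queue fit together in a BFS traversal, here is a minimal single-threaded sketch. The network is replaced by an in-memory adjacency map, which is purely an assumption for illustration, and the names Graph and bfsOrder are hypothetical; in the crawler, the map lookup would be a crawl request to the peer.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Stand-in for the network: each peer maps to the neighbors it would report.
typedef std::map<std::string, std::vector<std::string> > Graph;

// Hypothetical helper: visits peers in BFS order starting from the seeds and
// returns them in the order they were dequeued.
std::vector<std::string> bfsOrder(const Graph& g,
                                  const std::vector<std::string>& seeds) {
    std::set<std::string> seen;      // every peer ever enqueued
    std::queue<std::string> q;       // peers waiting to be crawled
    std::vector<std::string> order;

    for (std::size_t i = 0; i < seeds.size(); ++i)
        if (seen.insert(seeds[i]).second)   // .second is true for new elements
            q.push(seeds[i]);

    while (!q.empty()) {
        std::string current = q.front();
        q.pop();
        order.push_back(current);

        Graph::const_iterator it = g.find(current);
        if (it == g.end()) continue;        // peer gave no reply
        for (std::size_t i = 0; i < it->second.size(); ++i)
            if (seen.insert(it->second[i]).second)
                q.push(it->second[i]);
    }
    return order;
}
```

Checking the bool returned by insert() avoids a separate lookup before each insertion, which is equivalent to the set-size observation mentioned earlier.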
During the final phase of converting IPs into hostnames, you may generate a huge amount of traffic to your local DNS server and potentially crash it. This may lead to suspicion that you are performing malicious activity and purposely trying to compromise network security. To avoid complications, you should either run the lookups slowly (space them out over a 2-3 day period) or, better yet, install a local DNS server on your computer. The course website has a link explaining how to install an open-source DNS server called BIND and a free trial version of Simple DNS Plus. If this is not convenient, you can use a server in the IRL lab called s6.irl.cs.tamu.edu (128.194.135.85) for DNS lookups. Note: do not use CS or TAMU DNS servers as they have been crashed by students in this class during previous years.
To manually set the DNS server of your computer to a given IP address (i.e., 127.0.0.1 if your own computer runs DNS or 128.194.135.85 if you are using the IRL server), go into Network Connections->Local Area Connection (Properties)->Internet Protocol (TCP/IP) (Properties) and modify the field called “Preferred DNS server.” If you are using DHCP, move the radio button from “Obtain DNS server address automatically” to “Use the following DNS server addresses.” To verify that DNS is working as intended, run nslookup at the command prompt without any arguments. The program will tell you the DNS server that is currently being used.
Wireshark (http://www.wireshark.org/) is a software package that allows you to intercept all packets sent and received by your computer. Wireshark allows you to diagnose implementation problems encountered in this and other homework.
When using a large number of threads, always run your code in Release mode as it runs 50 times faster in STL functions and occupies 50% less memory. For scalar classes inserted into STL sets, you can roughly estimate 60 bytes per set element in Debug mode and 30 bytes in Release mode. If you insert other STL objects (such as strings) into a set, then count a minimum of 90 bytes per entry in Debug mode and 55 bytes in Release mode (plus the length of the string). Given roughly 6 million Gnutella peers concurrently online, Debug mode requires 360 MB of RAM (6 million entries at 60 bytes each) just for the set of visited peers; however, if written sloppily this can easily exceed 1 GB. Furthermore, to avoid swapping to disk and showing unacceptably low performance in your report, check that the total memory usage in Task Manager is well below your physical RAM size. A sign that something is wrong is that increasing the number of threads beyond some threshold (such as 2500) leads to significantly lower performance.
If your code does not compile, you will receive 0 points.
The following problems will result in deduction of points from a total of 25:
Additional deductions (point distribution will be decided later):
Final deductions (point distribution will be decided later):