Simulating a File-Sharing P2P Network

Mario T. Schlosser, Tyson E. Condie, and Sepandar D. Kamvar
Department of Computer Science, Stanford University, Stanford, CA 94305, USA

Abstract. Assessing the performance of peer-to-peer algorithms is impossible without simulations since testing new algorithms by deploying them in an existing P2P network is prohibitively expensive. However, some P2P algorithms are sensitive to the network and traffic models that are used in the simulations. In order to produce realistic results, we therefore require simulations that resemble real-world P2P networks as closely as possible. We describe the Query-Cycle Simulator, a simulator for file-sharing P2P networks. We link the Query-Cycle Simulator to measurements on existing P2P networks and discuss some open issues in simulating these networks.

1 Introduction

Peer-to-peer research has encompassed promising work on algorithms in a variety of directions, including distributed protocols to construct efficient P2P network topologies, search algorithms for unstructured P2P networks, incentives to combat freeriding on P2P networks, and algorithms to determine reputation of peers in a network, among others. Due to the decentralized nature and fast growth of today’s P2P networks, testing such algorithms in a real-world environment by simply deploying them on an existing P2P network and collecting data on their performance is a daunting task. In some cases, measurements are easier to carry out due to some easily accessible central control entity in the network that manages node joins and departures [14]. Also, some algorithms may be tested by deploying them on one or a few controlled nodes in the network (as in [15]). However, for a wide range of P2P-related algorithms and protocols, simply deploying and testing them on existing P2P networks is not possible. For example, most algorithms require each peer in the network to implement the algorithm. Today’s popular peer-to-peer networks [5] have over 20,000 nodes. Performing a software update for each of these nodes in order to test each novel P2P algorithm is impractical. As another example, security protocols require testing under different threat scenarios such as an attack on the network by a coordinated group of malicious peers. Testing such protocols would require introducing malicious peers into the network, which is also not practical. Thus, P2P algorithms and protocols are tested by simulation, under network models

that attempt to mimic typical node interconnections, traffic patterns, and so on. Since algorithms and protocols are often sensitive to the traffic and network behavior, there is a clear need for accurate P2P network models.

Work in this area has mainly been done on the fly to test novel algorithms. Because of this, most P2P simulators have used simple models. For example, [1] assumes entirely random interactions among peers in a P2P network to test a P2P reputation management protocol. In simulating a distributed search algorithm, [4] simply assumes that files are placed at peers, and queries are generated by peers, uniformly at random.

In this paper, we present the Query-Cycle Simulator, a P2P file-sharing network simulator based on the query-cycle model described in Section 2, and discuss the issues that arise in the accurate modeling of a P2P network. We focus on modeling a file-sharing network such as Gnutella [5].

2 The Query-Cycle Model

We consider a typical P2P network: interconnected, file-sharing peers are able to issue queries for files, peers can respond to queries, and files can be transferred between two peers to conclude a search process. When a query is issued by a peer, it is propagated by broadcast with a hop-count horizon throughout the network (in the usual Gnutella way); peers that receive the query forward it and check whether they are able to respond to it.

We suggest a simulation process that proceeds in query cycles. In each query cycle, a peer i in the network may be actively issuing a query, inactive, or even down and not responding to queries passing by. Upon issuing a query, a peer waits for incoming responses, selects a download source among those nodes that responded, and starts downloading the file. The query cycle finishes when all peers who have issued queries have downloaded a satisfactory response. Statistics may be collected at each peer, such as the number of downloads and uploads of the peer.
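As a minimal sketch of one query cycle as described above: the names `flood` and `run_query_cycle`, the adjacency-list representation, and the instantaneous download are illustrative assumptions, not the simulator's actual code. Down peers neither forward nor answer queries, and the download source is picked uniformly at random among responders, which is only one possible selection rule.

```python
import random
from collections import deque


def flood(neighbors, source, ttl, up):
    """Peers reached from `source` within `ttl` hops; down peers do not forward."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == ttl:
            continue  # hop-count horizon reached, do not forward further
        for nxt in neighbors[node]:
            if nxt not in seen and up[nxt]:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    seen.discard(source)
    return seen


def run_query_cycle(neighbors, files, queries, up, ttl, rng, stats):
    """Run one cycle; `queries` maps each querying peer to the file it wants."""
    for peer, wanted in queries.items():
        if not up[peer]:
            continue
        reached = flood(neighbors, peer, ttl, up)
        responders = sorted(p for p in reached if wanted in files[p])
        if not responders:
            continue  # no response within the horizon
        source = rng.choice(responders)  # assumed selection rule
        files[peer].add(wanted)  # download modeled as instantaneous
        stats[peer]["downloads"] += 1
        stats[source]["uploads"] += 1


# Hypothetical four-peer line topology 0 - 1 - 2 - 3.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
up = {p: True for p in neighbors}
files = {0: set(), 1: set(), 2: {"x"}, 3: {"x", "y"}}
stats = {p: {"downloads": 0, "uploads": 0} for p in neighbors}
run_query_cycle(neighbors, files, {0: "x", 1: "y"}, up, 2, random.Random(1), stats)
```

Per-peer counters in `stats` correspond to the upload and download statistics mentioned in the text.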

3 Peer-Level Properties

The system-level dynamics of a P2P network are highly dependent on local, peer-dependent properties, such as the activity level or file-sharing behavior of each peer. In [8], different convergence behavior and different characteristic path lengths are observed in simulating a novel P2P network topology construction algorithm under two different models, one assigning bandwidth capacities to nodes based on a Zipf distribution, the other one based on a real-world distribution measured in [13].

4.2 Content Type

In this section, we describe how we model the individual files each peer chooses to share. It is important to model this accurately because it determines the patterns of peers that interact with one another. A model in which the files peers share are chosen randomly is insufficient, as it will fail to produce clusters of peers that interact with one another, as has been observed [3]. Such properties affect the performance of many algorithms, including search algorithms [2] and reputation algorithms [6].

Real-world observations. In [3], it is observed that peers in a P2P network are in general interested in a subset of the total available content on the network. Furthermore, it is also observed in [3] that peers are often interested only in files from a few content categories. For example, in the domain of educational resources [10], users have a certain affinity towards learning materials related to the course of study they undertake.

It has also been observed in [7] that many document storage systems, including the WWW, exhibit Zipf distributions on the popularity of documents. This reflects the fact that some popular documents are very widely copied and held, while most documents are held by far fewer peers. The same can be said of content categories: there are some content categories (such as “Top 40 Hits” in the music domain) which are very popular and widely held, while most other categories (such as “Acid Jazz”) are less widely held.

Model. We model the properties described above as follows. Briefly, peers are assumed to be interested in a subset of the total available content in the network, i.e., each peer initially picks a number of content categories and shares files only in these categories. Furthermore, we assume that files with different popularities exist within each content category, governed by a Zipf distribution. Files are assigned to peers at initialization in the following manner.
According to the probabilistic model described below, each peer $i$ is assigned some content categories $C_i$. Then, peer $i$ is given an interest level for each content category $c \in C_i$. Finally, peer $i$ is assigned files $F$ according to its content categories and interest levels in those categories. In this model, each distinct file $f_{c,r}$ may be uniquely identified by the content category $c$ to which it belongs and its popularity ranking $r$ within that category. The probabilistic model is based on empirical observations of file distributions in [13] and [7].

Assigning content categories. We assume $n$ content categories $C = \{c_1, \ldots, c_n\}$. Some content categories are more popular than others. That is, the files in some content categories are more widely held than the files in other categories. We characterize a content category completely by its popularity rank. That is, $c_1 = 1$, $c_2 = 2$, and so on. We model this popularity by a Zipf distribution: when a peer is initialized, it is set to be interested in content category $c \in C$ with probability $p(c)$ given by

$$p(c) = \frac{1/c}{\sum_{i=1}^{n} 1/i}.$$

We require a peer to be interested in at least $C_{min}$ content categories, repeating the peer’s interest test until $C_{min}$ categories have been chosen. The set $C_i$ is the set of content categories that interest peer $i$.

Modeling interest level. A peer $i$ interested in content categories $C_i$ is probably not equally interested in all categories $c \in C_i$. Rather, peer $i$ is likely to be more interested in some categories than others. We model this by assigning an interest value $w_{ci}$ to each content category $c \in C_i$ of interest to peer $i$. This interest value is determined uniformly at random for each content category for each peer $i$. The fraction of files shared by peer $i$ that are in category $c$ is given by

$$p(c|i) = \frac{w_{ci}}{\sum_{c' \in C_i} w_{c'i}}.$$

The number of files shared by peer $i$ that are in category $c$ is given by $F_{ci} = p(c|i) F_i$. Note that the interest value is not correlated with the general popularity of content category $c$. This reflects the fact that, while a certain category may be of interest to many peers (e.g., Top 40 hits), that category is not necessarily the main interest of those peers. Also note that since we assume a steady-state network, we assume that the interests of peers do not change over time.

Modeling files. We now wish to model the individual files held by each peer. Each distinct file may be uniquely identified by the tuple $\{c, r\}$, where $c$ represents the content category to which the file belongs, and $r$ represents its popularity rank within content category $c$. We denote this file $f_{c,r}$. Within each content category there are some files that are very popular, and some that are held by few people. We model this by a Zipf distribution as well. The fraction of files in content category $c$ that are copies of file $f_{c,r}$ is given by:

$$p(f_{c,r}|c) = \frac{1/r}{\sum_{i=1}^{F_c} 1/i}$$

where $F_c$ is the number of distinct files in category $c$. Notice that in order to evaluate $p(f_{c,r}|c)$, we need to model the number of distinct files in each content category (see below). The probability that a file $f$ shared by peer $i$ is a copy of file $f_{c,r}$ is given by the level of interest $p(c|i)$ that peer $i$ has in category $c$ times the popularity $p(f_{c,r}|c)$ of file $f_{c,r}$ within category $c$:

$$p(f_{c,r}|i) = p(c|i)\, p(f_{c,r}|c).$$

At initialization, we assign files to each peer $i$ based on this distribution and the number of files $F_{ci}$ shared by peer $i$ in each category. Each peer stores the $\{c, r\}$ values for the files that it shares.

Modeling the number of distinct files per category. If there is maximum replication going on in the network, there are at most $F_{ca}$ files of content category $c$ in the network, where $F_{ca}$ represents the number of files in category $c$ shared by peer $a$, the peer who shares the most files in category $c$. On the other hand, if every single file on the network is distinct, then there are $p(c)F$ distinct
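The initialization steps of the content model can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the simulator's code: the function names are hypothetical, the interest test is read as one independent trial per category that is repeated in full passes until at least C_min categories are chosen, the per-category file count is rounded to an integer, and the number of distinct files per category is simply passed in as `distinct_files`. A peer may draw the same (c, r) pair more than once; the sketch does not deduplicate.

```python
import random


def zipf_weights(n):
    """p(k) = (1/k) / sum_{i=1..n} 1/i for ranks k = 1..n."""
    norm = sum(1.0 / i for i in range(1, n + 1))
    return [(1.0 / k) / norm for k in range(1, n + 1)]


def pick_categories(n, c_min, rng):
    """Repeat the interest test until at least c_min categories are chosen."""
    if c_min > n:
        raise ValueError("c_min must not exceed the number of categories")
    p = zipf_weights(n)
    chosen = set()
    while True:
        for c in range(1, n + 1):  # category c is identified by its rank
            if rng.random() < p[c - 1]:
                chosen.add(c)
        if len(chosen) >= c_min:
            return chosen


def interest_shares(categories, rng):
    """Uniform interest values w_ci, normalized to the shares p(c|i)."""
    w = {c: 1.0 - rng.random() for c in sorted(categories)}  # values in (0, 1]
    total = sum(w.values())
    return {c: value / total for c, value in w.items()}


def assign_files(shares, num_files, distinct_files, rng):
    """Draw the (c, r) pairs a peer shares, category by category."""
    files = []
    for c, share in shares.items():
        count = round(share * num_files)  # F_ci = p(c|i) * F_i, rounded
        ranks = list(range(1, distinct_files[c] + 1))
        weights = zipf_weights(distinct_files[c])  # p(f_{c,r}|c)
        files.extend((c, r) for r in rng.choices(ranks, weights=weights, k=count))
    return files


# Hypothetical parameters: 20 categories, C_min = 3, 50 files, 100 distinct files each.
rng = random.Random(7)
categories = pick_categories(n=20, c_min=3, rng=rng)
shares = interest_shares(categories, rng)
peer_files = assign_files(shares, 50, {c: 100 for c in categories}, rng)
```

Because of the rounding, the total number of files assigned can differ slightly from the requested 50.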

Query activity level is another peer behavior that is of particular interest to P2P research, as the query behavior of peers (in conjunction with the network’s content distribution patterns) determines which peers interact with each other. These interaction patterns are of importance to the effective design of P2P algorithms ranging from search algorithms to file indexing protocols.

5.1 Uptime and Session Duration

Participating nodes frequently leave and re-join a P2P network, and we define a peer’s uptime to be the fraction of an observation period during which the peer is participating in the P2P network, i.e., issuing, responding to, and forwarding queries.

Real-world observations. Uptime and session duration of peers have been measured in [13]. Observations on the MojoNation P2P network [14] have revealed that up to 84% of peers enter the network only once, and for less than one hour. At the moment, we do not consider these peers in our simulations. They probably do not contribute much to the shared data, and they probably do not issue many queries.

Model. We assume a pool of N peers, and each peer has a certain probability of being online, assigned based on the uptime distribution in [13]. At each query cycle, it is determined for each peer, based on its probability of being up, whether it enters the network; if it does, it stays there for a period of time drawn from the session duration distribution in [13].
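The session model can be sketched as a per-cycle update. The measured distributions from [13] are not reproduced here, so both the per-peer uptime probabilities and the session-length sampler are hypothetical inputs, and session lengths are counted in query cycles as a simplifying assumption.

```python
import random


def step_sessions(remaining, uptime_prob, draw_session_length, rng):
    """Advance one query cycle and return the set of peers that are online.

    remaining[i] is the number of cycles peer i will still stay online
    (0 means the peer is currently offline).
    """
    online = set()
    for peer in range(len(remaining)):
        # An offline peer enters the network with its assigned probability.
        if remaining[peer] == 0 and rng.random() < uptime_prob[peer]:
            remaining[peer] = draw_session_length(rng)
        if remaining[peer] > 0:
            online.add(peer)
            remaining[peer] -= 1
    return online


# Hypothetical pool of three peers with placeholder parameters.
rng = random.Random(3)
remaining = [0, 0, 0]
uptime_prob = [0.9, 0.5, 0.1]
for _ in range(5):
    online_peers = step_sessions(remaining, uptime_prob, lambda r: r.randint(1, 4), rng)
```

Note that under this update rule the long-run fraction of cycles a peer spends online depends on both inputs, so it is generally not equal to `uptime_prob` itself.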

5.2 Query Activity

Peers in a P2P network issue queries to search for downloadable files that match their interests. A peer’s query activity determines the rate at which it issues queries when it is up.

Real-world observations. So far, we are not aware of measurements on query rates of peers in a P2P network. An empirical study on the distribution of query rates in real-world P2P networks would be straightforward and very useful for accurate modeling of the network.

Model. In our model, nodes generate queries based on a Poisson process. The query rate of each node is set upon initialization and is picked uniformly at random from an interval $[rate_{min}, rate_{max}]$. In each query cycle, the equation

$$p(\#\text{queries} = x) = e^{-\lambda} \frac{\lambda^x}{x!}$$

gives the probability that a node issues $x$ queries, where $\lambda$ is the node’s query rate.
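A short sketch of the query-rate model, using only the Python standard library. The function names and the example interval are assumptions; the sampler uses Knuth's multiplication method, which is adequate for the small rates one would expect per query cycle.

```python
import math
import random


def poisson_pmf(x, lam):
    """Probability that a node with query rate lam issues x queries in a cycle."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def draw_query_rate(rate_min, rate_max, rng):
    """Query rate picked uniformly at random when the node is initialized."""
    return rng.uniform(rate_min, rate_max)


def sample_query_count(lam, rng):
    """Draw a Poisson-distributed query count (Knuth's method)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


# Hypothetical interval [0.5, 3.0] for the per-cycle query rate.
rng = random.Random(11)
lam = draw_query_rate(0.5, 3.0, rng)
queries_this_cycle = sample_query_count(lam, rng)
```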

5.3 Queries

In the query cycle model, each active peer issues a query at each query cycle. The specific query that peer i issues is given by the model described below.

Real-world observations. Peers in general query for files that exist on the network and are in the content categories of their interest. The former is true in large and diverse P2P networks; the latter we claim to be true for the majority of queries a peer issues, although it is yet to be shown by empirical studies on the query behavior in P2P networks.

Model. In our model, a query $q_{c,r}$ represents a query for the file $f_{c,r}$. We say that a peer only issues queries in the content categories in which it is interested. The probability that a peer $i$ generates a query $q_{c,r}$ is given by its interest level in category $c$ times the popularity of file $r$ in $c$:

$$p(q_{c,r}|i) = p(c|i)\, p(f_{c,r}|c) \qquad (2)$$

(We say here that the popularity $p(q_{c,r}|c)$ of a query $q_{c,r}$ is equal to the popularity $p(f_{c,r}|c)$ of its corresponding file $f_{c,r}$.) We also suggest that a peer will not issue a query for a file that it already owns.
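Query generation according to equation (2) can be sketched as follows. The function names are hypothetical, and the way owned files are excluded (dropping them and renormalizing the remaining probabilities) is an assumption; the text only states that a peer does not query for files it already owns.

```python
import random


def query_distribution(shares, distinct_files, owned):
    """p(q_{c,r}|i) = p(c|i) * p(f_{c,r}|c), restricted to files not owned."""
    dist = {}
    for c, share in shares.items():
        norm = sum(1.0 / i for i in range(1, distinct_files[c] + 1))
        for r in range(1, distinct_files[c] + 1):
            if (c, r) not in owned:
                dist[(c, r)] = share * (1.0 / r) / norm
    total = sum(dist.values())
    # Renormalize over the remaining files (assumed exclusion rule).
    return {q: p / total for q, p in dist.items()} if total > 0 else {}


def draw_query(shares, distinct_files, owned, rng):
    """Return one query (c, r), or None if the peer owns every candidate file."""
    dist = query_distribution(shares, distinct_files, owned)
    if not dist:
        return None
    queries = list(dist)
    return rng.choices(queries, weights=[dist[q] for q in queries])[0]


# Hypothetical peer: interest shares p(c|i) for two categories.
shares = {1: 0.75, 2: 0.25}
distinct_files = {1: 2, 2: 1}
query = draw_query(shares, distinct_files, owned={(1, 1)}, rng=random.Random(2))
```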

5.4 Query Responses

In this framework, modeling query responses is straightforward: if peer $i$ receives a query $q_{c,r}$ and owns a copy of the corresponding file $f_{c,r}$, it responds to the peer that has issued the query and offers to upload the file.

[Figure 1: plot of share (vertical axis) per peer (horizontal axis, peers 1 to 20), with two series: uploads fraction and shared files fraction.]

Fig. 1. Share of uploads and files in a P2P network.

try to locate and to connect to highly connected peers in the network to increase their chances of responding to many queries, a threat scenario to be considered for reputation algorithms.

6.2 Bandwidth

We currently have a simple understanding of a peer’s bandwidth in our simulations: bandwidth at a peer is consumed only while uploading or downloading files. Bandwidth is assigned to a peer upon the creation of the peer based on measurements in [13]. When uploading or downloading a file, peers always try to use their full bandwidth (the actual transfer rate is limited by the peer with less bandwidth). If a peer has several uploads and downloads going on, the available bandwidth is split up equally.
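The bandwidth rule can be illustrated with a small sketch. It assumes a single bandwidth value per peer shared by uploads and downloads, and it interprets the limit as the smaller of the two endpoints' equal shares; both points, like the function name, are assumptions made for illustration.

```python
def transfer_rates(bandwidth, transfers):
    """Rate of each (uploader, downloader) transfer under equal splitting."""
    active = {}
    for uploader, downloader in transfers:
        active[uploader] = active.get(uploader, 0) + 1
        active[downloader] = active.get(downloader, 0) + 1
    rates = []
    for uploader, downloader in transfers:
        up_share = bandwidth[uploader] / active[uploader]
        down_share = bandwidth[downloader] / active[downloader]
        rates.append(min(up_share, down_share))  # slower endpoint limits the rate
    return rates


# Hypothetical peers: "a" uploads to "b" and "c" at the same time.
rates = transfer_rates({"a": 100, "b": 40, "c": 10}, [("a", "b"), ("a", "c")])
```

In this example peer "a" splits its bandwidth into two equal shares of 50, so the transfers are limited by the downloaders and run at 40 and 10 respectively.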

7 Discussion and Conclusion

Figure 1 depicts the load share in a sample network of 20 peers that was simulated based on the considerations above. One graph shows the number of uploads at a particular peer versus the total number of uploads in the system after 300 query cycles; the other graph shows the number of files shared by a peer versus the total number of files shared by all peers. Although the distribution of files is highly imbalanced (a property observed in real-world P2P networks [13]), all peers participate in responding to queries, since even peers with only a few files have a fair likelihood of responding to queries for very popular files. This is a property that can also be observed on real-world P2P networks and provides a first indication that our model is somewhat accurate.

We have described first ideas and approaches for a P2P network simulator. The efficiency of algorithms can only be compared if they can be run on commonly accepted problem sets or simulated on widely accepted models, an insight accepted in many other research domains such as Internet research [12]. We believe the same to be true for P2P algorithms, and we believe it is important for the community to engage in a discussion of P2P modeling in order to develop some standards by which to simulate P2P networks.

References

  1. K. Aberer and Z. Despotovic. Managing trust in a peer-2-peer information system. In Proceedings of the 10th International Conference on Information and Knowledge Management (ACM CIKM), New York, USA, 2001.
  2. A. Crespo and H. Garcia-Molina. Routing Indices for P2P Systems. In Proceedings of the 28th Conference on Distributed Computing Systems, July 2002.
  3. A. Crespo and H. Garcia-Molina. Semantic Overlay Networks. Submitted for publication, October 2002.
  4. M. Freedman and R. Vingralek. Efficient P2P Lookup for a Distributed Trie. In First International Workshop on P2P Systems, 2002.
  5. Gnutella website. www.gnutella.com. 2002.
  6. S. Kamvar, M. Schlosser, and H. Garcia-Molina. The EigenTrust Algorithm for Reputation Management in P2P Networks. In WWW 2003, 2003.
  7. R. Korfhage. Information Storage and Retrieval. John Wiley, 1997.
  8. Q. Lv, S. Ratnasamy, and S. Shenker. Can Heterogeneity Make Gnutella Scalable? In First International Workshop on P2P Systems, 2002.
  9. A. Medina, I. Matta, and J. Byers. On the origin of power laws in internet topologies. Technical report, Boston University Computer Science Department, April 2000.
  10. W. Nejdl et al. EDUTELLA: A P2P Networking Infrastructure based on RDF. In Proceedings of the 11th World Wide Web Conference, May 2002.
  11. M. Ripeanu and I. Foster. Mapping the Gnutella Network - Macroscopic Properties of Large-scale P2P Networks. IEEE Internet Computing Journal, 6(1), 2002.
  12. S. Floyd and E. Kohler. Internet Research Needs Better Models. In Proceedings of HotNets-I, October 2002.
  13. S. Saroiu, P. K. Gummadi, and S. D. Gribble. A measurement study of peer-to-peer file sharing systems. In Proceedings of Multimedia Computing and Networking 2002 (MMCN ’02), San Jose, CA, USA, January 2002.
  14. B. Wilcox-O’Hearn. Experiences deploying a large-scale emergent network. In 1st International Workshop on P2P Systems, 2002.
  15. B. Yang and H. Garcia-Molina. Improving efficiency of p2p search. In Proceedings of the 28th Conference on Distributed Computing Systems, July 2002.