Survey of Research Towards Robust Peer-to-Peer Networks:
Search Methods
1. Introcution:
P2P research directions: Search, Storage, security, application.
1.1 Related Disciplines:
Database research
1.2 Structured and unstructured Routing
+: guarantee location of a target within a bounded number of hops.
-: (1) highly transient peers are not well supported by DHTS.
(2): don’t support keyword searches and complex query
(“churn”: peers are contnualsly arriving and departing.)
*Unstructured and structured proposals are complementary, not competing. Structella, a hybrid of Gnutella built on top of Pastry.
ref1. Should we build Gnutella on a structured overlay? SIGCOMM
ref2. The case for a Hybrid P2P search Infrastructure.
*Structure vs. unstructured is less useful, because:
1. Most unstructured have evolved and incorporated structures.
Eg. Gunutella: use hierarchy: peers are either ultrapeers or leaf nodes, ultrapeers receive a hashed summary of the resource names available at leaf-nodes. Between ultrapeers, simple query broadcast is still used, though methods to reduce the query load here have been considered. ref3 Ultrapeers: Another step towards Gnutella Scalability.
2. There are emerging schema-based P2P designs, with super-node hierarchies and structure within documents ref4: Design Issues and Challenges for RDF and schema-based peer-to-peer systems.
*The structure is determined by the type of index: semantic-free index (DHT), Semantic index (Gnutella like schema-based).
1.3 Indexing and Searching
ref.Ad-hoc, self-supervising Peer-to-Peer networks. ref Studying search networks with SIL.
2. Index Types:
Indexes can be classified as non-forwarding and forwarding. Non-forwarding, queries jump to target data in a single hop. Semantic and semantic free one-hop schemes. These schemes have been extended to two-hops.
ref. The design of a robust peer-to-peer system. ref. One hop lookups for peer-to-peer overlays. ref. Evaluating GUESS and non-forwarding peer-to-peer search. ref. Structured superpeers: Leveraging heterogeneity to provide constant-time lookup. ref. Efficient Routing for peer-to-peer overlays. Gupta,A.
2.1 Local index
Peers only index their own content and queries are flooded. Enable rich queries, but generate a large volume of query traffic. Eg Gnutella. To improve scalability, use TTL. Expanding ring/iterative deepening. Successively larger TTL counters. Random walk, reduce network load but massively increases the search latency.
Other approaches: Forwards queries within a subset of peers selected according to heuristics on previous performance. ref. Efficient search in peer-to-peer networks. Yang, B. Probabilistic flooding, using percolation theory. ref Criticality based analysis and design of unstructured peer-to-peer networks as complex systems.
2.2 Central index
Napater
2.3 Distributed index
Freenet. Use hash map a file to a key, query is forwarded to node closet to the key. Query response traverse the successful query path in reverse, positing a new routing table entry at each peer. The insert message similarly steps towards the target node, updating routing table entries as it goes, and finally stores the file there. +: avoid flooding, -: exact file name needed, exactly one match is returned. New Freenet, each node gathers response times, connection times and proportion of successful requests for each entry in the query routing table. When searching for a key not in its routing table, it estimate response times from the routing metrics for the nearest known keys and consequently choose the node that can retrieve the data fastest.
DHTs.
3. Semantic-Free Index:
DHTs provide semantic-free, data centric references. They are designed for: low degree, low diameter, greedy routing, robustness.
3.1 Plaxton Trees
It can locate objects using fixed-length routing tables, Object and nodes are assigned a semantic-free address, e.g. A 160 bit key. Every node is effectively the root of a spanning tree. A message routines toward an object by matching longer address suffixes, until it encounters either the object’s root node or anther node with a nearby copy.
*Tapestry ref.
*Pastry: ref. Maintain O(log N) neighbors, rout in O(log N) hops. Maintain a leaf set and a routing table.
3.2 Rings
Chord ref. peer track its predecessor, a list of successors and a finger table ( each hop is at least half the remaining distance around the ring to the target)
3.3 Troi
CAN ref. virtual d-dimensional coordinate space on a d-torus. Each node is responsible for a zone in this coordinate space. O(d) neighbors per node and O(dn1/d) hop count.
3.4 Butterflies
Viceroy ref. Viceroy: a scalable and dynamic emulation of the butterfly. Constant degree, logarithmic diameter. Maintains general ring pointers (predecessor and successor), level ring pointers (nextonlevel and prevonlevel) and butterfly pointers (left, right, and up)
3.5 de Bruijin Graphs
de Bruijn graph of degree k achieve an asymptotically optimal diameter logkn. D2B ref., Koorde, ref Distance Halving, ref Optimal Diameter Routing Infrastructure (ODRI) ref
3.6 Skip Graphs
Skip list is probabilistic, insert and delete operations do not require tree rearrangements and so are faster. The skip list consists of layers of ordered linked lists. All nodes participate in the bottom layer 0 list. Some of these nodes participate in the layer 1 list with some fixed probability. A subset of layer 1 nodes participate in the layer 2 list, and so on. A lookup can proceed quickly through the list by traversing the sparse upper layers until it is close to the target. Unfortunately, nodes in the upper layers of a skip list are potential hot spots and single points of failure. Unlike skip lists, skip graphs provide multiple lists at each level for redundancy and every node participates in one of the lists at each level. Each node in a skip graph has θ(logn) neighbors on average, support prefix and proximity search. Facilitate range searches. SkipNet is a scalable overlay network that provides controlled data placement and guaranteed routing locality by organizing data primarily by string names. ref. SkipNet: A scalable overlay network with practical locality properties.
4. Semantic Index:
Design is often driven by heuristics, maybe not guarantee scarce items will be found.
4.1 Keyword Lookup
*Gnutella based. improvement: index of filename keywords can be forwarded from leaf nodes to ultrapeers. Ultrapeers can ensure leaves only receive queries for which they have a match. ref. Ultrapeers: Another step towards Gnutella scalability. Recently, there has been a proposal to distributed aggregated QRTs amongst ultrapeers. QRTs are compressed by hashing according to the Query Routing Protocol (QRP). ref. Gnutella Ultrapeer Query Routing. A known shortcoming of QRP is the extent of query propagation was independent of the popularity of the search terms. The Dynamic Query Routing Protocol addressed this. It require leaf nodes to send single queries to high-degree ultrapeers which adjust the queries TTL bounds according toe the number of received query results. ref. Gnutella Dynamic Query Routing. An earlier proposal called GUESS, similarly aimed to reduce the number of queries for wildely distributed files. GUESS reuse the non-forwarding idea. A GUESS peer repeatedly queries single ultrapeers with a TTL of q. It chooses the number of iterations and selects ultrapeers so as to satisfy its search needs. ref. Gnutella UDP Extension for Scalable searches GUESS.
*Gnutella-related. Characterized by a bias for high degree peers and very short directed query paths, a disdain for flooding and concern about excessive load on the better peers.
ref. (Can heterogeneity make Gnutella scalable?) use powerful peers, and random walks.
ref. (Making Gnutella-like P2P systems scalable.) devise a topology adaptation algorithm, most peers are attached to high-degree peers. Use random walk search algorithm, one-hop replication, nodes keep pointers to content on adjacent peers. A receiver-controlled token-based flow control.
ref. (should we build Gnutella on a structured overlay?) build Gnutella on top of structured overlay.
ref. (Evaluation GUESS and non-forwarding peer-to-peer search) various peer selection policies. Most recently used, most files shared: good peers could be overloaded victims of their own success.
ref. (Efficient search in unstructured peer-to-peer networks. Cholvi ) describe how similar least recently used and most often used heuristics can be sued by a peer to select peer acquaintances. Limit the number of acquaintances to 25. Decrement a query’s TTL multiple times when it traverses interested peers.
*Index be partitioned by document or by keyword? By document, e.g. Gnutella. Peer may responsible for a set of keywords. Uses an inverted list to find a matching document. ref. (Making peer-to-peer keyword searching feasible using multi-level partitioning.) Multi-level partition (MLP) scheme. Arrange nodes into a group hierarchy, with the same nodes sub-divided into k logical subgroups on level 1, the subgroups are again divided, level by level, until level l.
How can exhaustive searches be achieved without flooding queries to every peer. ref. (Bit Zipper rendezvous, optimal data placement for general p2p queries.) Partitioned by document. ref. (A keyword set search system for peer-to-peer networks) a keyword-set search system, the index is partitioned by sets of keywords.
4.2 Peer Information Retrieval
Information Retrieval (IR) consists 4 elements: a representation of documents in a collection; a representation of user queries; a framework describing relationships between document representation and queries; a ranking function that quantifies an ordering amongst documents for a particular query.
Recall: the fraction of the relevant documents which has been retrieved. Precision: The fraction of the retrieved documents which is relevant.
Ref (Decentralized Meta-data strategies: effective peer-to-peer search)
4.2.1 Vector Model
Vector model represents both documents and queries as term vectors. The similarity of the document and query vectors gives an indication of how well a document matches a particular query. The weighting calculation is critical. Weighting factors: term frequency, inverse document frequency. TFIDF.
Ref (PlanetP: using gossiping to build content addressable peer-to-peer information sharing communities.). Peers are ranked for the probability that they have matching documents. Higher priority peers are contacted and the matching documents are ranked. Each PlanetP peer has a global index with a list of all other peers, their IP addresses and their Bloom filters.
Ref (FASD: a fault-tolerant, adaptive scalable distributed search engine”) extend the Freenet design for richer queries. Eg (apples And oranges Not Bananas), it use TFIDF weighting scheme to build a document’s term vector.
Ref (Hybrid global-local indexing for efficient peer-to-peer information retrieval) . Ref (Peer-to-peer information retrieval using self-organizing semantic overly networks).
4.2.2 Latent Semantic Indexing
Latent Semantic Indexing (LSI). Key idea is to map both the document and query vectors to a concept space with lower dimensions. pSearch incorporated LSI, which ensure that similar documents are clustered near each other, thereby optimizing the network search costs. Ref (On scaling latent semantic indexing fro large peer-to-peer systems)
4.3 Peer Data Management
Motivated by a rich database heritage, Peer data management systems. Peer is associated with a schema that represents the peer’s domain of interest, and semantic relationships between peers are provided locally between pairs of peers. By traversing semantic paths of mappings, a query over one peer can obtain relevant data from any reachable peer in the network.
*Edutella Ref (Design issues and challenges for RDF and schema based peer-to-peer systems). Two facets to information integrating: 1Distributed query processing, where the type of query and the extent of clustering can determine the optimum location to cmbine query results. Mapping from Edutella’s RDF based Query Exchange Language (GEL) to numerous other query languages. 2. Mediation. Use of explicit mediation peers and rule-based mediation in super-peers. Two-pronged approach to mediation. When queries can be answered by one peer, use simple didiatoin model. When results are to be retrieved from several peers, there are more complicated integration mediators or bubs.
*Local Relational model Ref (Data management for peer-to-peer computing: a vision.) Peers, acquaintances, coordination formulas (semantic dependencies btw a peer and its acquaintances), and domain relations (data translation rules btw a peer and its acquaintances). Objective is to establish and evolve acquaintances btw dynamic peers without a global registry or schema and without significant effort by a database administrator.
*Piazza Ref (Piazza: data management infrastructure for semantic web applications) Ref (Schema mediation in peer data management systems) Ref (The Piazza peer data management project) Ref (Efficient Query Reformulation in Peer Data Management Systems) Two kinds of schema mappings: peer descriptions map between each peer’s view of the world and are used to route queries; storage descriptions map a peer’s view of the world to the specific data at the peer.
*Chatty Web (The Chatty Web approach for global semantic agreements) Syntactic similarity, semantic similarity.
* Hyperion Ref ( The Hyperion project: from data integration to data collection)
*PeerDB Ref (PeerDB: a pp passed system for distributed data sharing) Use of mobile agents: relation matching agent and data retrieval agents.
5. Search
Ref (Open problems in data-sharing peer-to-peer systems)
Ref (Towards a unifying framework for complex query processing over structured peer-to-peer data networks)
5.1 Range Queries
*Rely on underlying DHT:
Ref (Approximate range selection queries in peer-to-peer systems). Rely on locality sensitive hashing to ensure that with high probability similar ranges are mapped to the same node. Chord.
Ref (Range queries in DHTs. Ratnasamy) Ref (Brief announcement: prefix hash tree) The prefix Hash Tree is a trie in which prefixes are hashed onto any DHT.
Ref (Scalable efficient range queries for grid information services.) Use one particular space filling curve the Hilbert curve, over a CAN construction. Nearby ranges map to nearby CAN zones.
*Rely on Skip Graphs. Do not necessitate randomizing hash functions and are therefore capable of range searches, but not load balanced.
Ref (Skip graphs/ Aspnes) Ref (Efficient recovery from organization disconnects in skipnet)
*Avoid both DHT and Skip Graph.
Ref (Mercury: supporting scalable multi-attribute range queries). Support for multi-attributerange queries ant load balanceing.
Ref (P-grid: a self-organizing structured P2P system)
*Scalable Distributed Data Structures (SDDSs)
5.2 Multi-attribute Queries
*Scalable Distributed Data Structures (SDDSs)
Ref (Towards a unifying framework for complex query processing over structure peer-to-peer data networks.)
Ref (An architecture for secure wide area service discovery) Nodes push proactive summaries of their data rather than waiting for a query. Summaries are aggregated and stored throughout a server hierarchy to guide subsequent queries.
Ref (MAAN: a multi-attribute addressable network for grid information services) Ref ( RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network) built on Chord to provide both multi-attribute and range queries.
5.3 Join Queries
5.2 Aggregation Queries