# [Networkit] Open Problem: Distance Profile via Sampling

Matteo Riondato matteo at cs.brown.edu
Sat Aug 15 17:07:24 CEST 2015

```> On Aug 14, 2015, at 11:33 PM, Christian Staudt <christian.staudt at kit.edu> wrote:
>
> Hi all, can I get some suggestions on an open problem?
>
> Imagine the following “distance profile” of a graph: For all pairs of nodes (u,v), compute the distance d(u,v) and plot the resulting distribution of distances. This is also sometimes called a “hop plot”.
>
> Now, doing this for all pairs clearly doesn’t scale, but if we are just interested in the shape of the distribution, we do not need to know all distances. I want to sample distances d(u,v) in a complex network so that
>
> a) the result is representative (i.e. the sampled distribution of distances is qualitatively the same as the full distribution)
> b) the computation is efficient (i.e. the number of graph traversals performed is small)
>
> Algorithm idea: x times pick some node u uniformly at random, perform a single-source shortest path search, then y times pick some node v and store distance d(u,v). Then, what should x and y be?
>
> Would be thankful for any suggestions. By the way, seems like there is some connection to approximation of betweenness.

Yes, indeed it is =)

The distribution you want to approximate is also known as the neighborhood function. A similar problem (or is it actually the same? I don’t remember the details) is that of transitive closure. There’s quite a bit of work about approximating it (e.g., a work by Edith Cohen in JCSS, "Size-estimation framework with applications to transitive closure and reachability”, 1997)

Sampling can work, and I have a technical report about using sampling for this and other network measures.
This is not surprising, as we know that by sampling we can approximate a probability distribution in the L1-norm with relatively few samples (there was a paper in ICTS’15 about this and much more).

What is not clear though (and it’s the reason why I still only have a technical report) is whether sampling is the right solution. It probably isn’t “in isolation”, in the sense that for this specific problem, other approaches may be much faster. For example, the HyperANF approach by Boldi, Rosa, and Vigna (WWW’11) seems to be much faster (also because they’re damn good hackers). I actually have ideas about improving the sampling result and this can lead to comparable performances. The thing is, inside a framework, the approach by BRS’11 may still be too complex to implement (they use very elegant techniques in their papers but not so easy to implement well, imho).

I can polish the TR a bit and send it to you in private, let me know.

Best,
Matteo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 163 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <https://lists.ira.uni-karlsruhe.de/mailman/private/networkit/attachments/20150816/2d72150a/attachment.sig>
```