[Networkit] Interface conventions for analytics algorithms

Michael Hamann michael.hamann at kit.edu
Thu Oct 30 12:45:15 CET 2014


Hi,

Am Donnerstag, 30. Oktober 2014, 07:28:56 schrieb Henning Meyerhenke:
> > Returning a value directly is usually the by far fastest way that also
> > doesn't have the problem of wasting space in the class, the BIG problem
> > of returning a non default-constructable type (you need some kind of
> > optional-type for that, which can certainly be done, but further
> > increases complexity), and dealing with early calls to `getResult()`
> > (before the algorithm has executed, it is not obvious how to handle that
> > case, while for returnvalues that problem just doesn't exist).
> 
> How often do we need very complex return values that could/should not be
> the return value of the run method? I'd rather avoid the getters.

Let's consider two algorithms: AlgebraicDistance [0] and EdmondsKarp [1]. Both 
follow mostly the conventions that Christian wants to introduce (or at least 
how I would interpret them) with the only difference that in AlgebraicDistance 
"run" is named "preprocess".

I couldn't imagine AlgebraicDistance with a run-method that returns a value. 
Either this would need to be the array of loads, but then you couldn't easily 
get the distances, or a pre-computed n*n matrix or we would need to restrict 
algebraic distances to connected vertices (which is again not necessary or 
useful).

Also in the case of the EdmondsKarp algorithm I wouldn't like a run method 
that returns the maximum flow value, the vector of flows and the source set. In 
many cases you are primarily interested either in the flow value or the source 
set. Then you would always need to remember in which order these values are 
returned (or always look this up in the documentation).

I can understand concerns concerning returning (copies of) big objects in 
getters. I see two solutions for this:
- fast (inline) getter methods for the values in the big objects as e.g. 
implemented in EdmondsKarp so in most cases the big object doesn't need to be 
returned at all
- moving the actual object out of the algorithm and thus resetting the state 
of the algorithm. This could be optional in certain cases with a boolean 
parameter that determines if the object shall be copied or moved.

Concerning the complexity of the Python interface: I think we can make this a 
lot easier by returning "self" in the run-method in Python. That way you could 
use

myResult = MyAlgorithm(parameters).run().getResult()

in Python if you are just interested in one result.

Another argument against returning tuples of results is that even though it 
might be relatively easy to ignore some of them in C++ it is not that 
straightforward to convert them to Python objects if some of these values need 
to be wrapped.

> > Concerning the function-arguments: You are certainly right with the
> > approach to pass several of them in the ctor, I would however recommend
> > to leave some room there, since sometimes the benefits of being able to
> > call the same functor twice with slightly different arguments might be
> > significant: Just think of Dijkstra: If we need to get the distances of
> > different places to the same starting-point on the same graph, we would
> > profit enormously from being able to pass end-note as argument and
> > reusing the structures build so far. OTOH we wouldn't profit in any way
> > from reseting the starting-node or the graph, so those should be
> > ctor-args.
> 
> This is very important. At quite a few places we have some preprocessing
> in the ctor. Then we need the flexibility in the run method to invoke it
> for different query parameters. Such a separation of preprocessing/query
> must be possible and efficient.

As shown in the case of AlgebraicDistance I think actually the preprocessing 
should happen in the run()-method and the query in the getter if it is 
sufficiently simple.

However I think all of this shouldn't be a strict rule that is enforced via a 
strict interface. I agree that there are cases where you want to pass a 
stopping criteria (Dijkstra) or a number of iterations to perform 
(DynamicForestFireGenerator for example) as argument to the run method.

However for example taking the example of DynamicForestFireGenerator class: 
Actually it generates two things: a static graph and a stream of graph events. 
Currently you can only get the stream of events which again needs to be 
converted into a graph with a probably relatively expensive Python roundtrip 
(each event is wrapped into a Python object in a Python list). An additional 
getter could be used in order to directly get the generated graph and if the 
return value from the run method was replaced by a getter, too, you could also 
avoid the Python conversion of the list of events if you only need the static 
graph.

Concerning the question of run-method vs. operator(): I favor the run method, 
in my opinion MyAlgorithm(parameters)() doesn't look right. Otherwise I think 
we could directly use global functions which is probably not what we want.

So all in all I'm in favor of what Christian proposes with some additions for 
returning big objects and some flexibility for the run-method to accept some 
iteration count or stopping criteria as parameter.

Michael

[0]: 
https://algohub.iti.kit.edu/parco/NetworKit/NetworKit/files/35b2350f5335a7651694ee1f4efddf3c4ca71b41/networkit/cpp/distmeasures/AlgebraicDistance.h
[1]: 
https://algohub.iti.kit.edu/parco/NetworKit/NetworKit/files/35b2350f5335a7651694ee1f4efddf3c4ca71b41/networkit/cpp/flow/EdmondsKarp.cpp



More information about the NetworKit mailing list