[Networkit] Matrix classes & separation of algorithm and data structure
Michael Hamann
michael.hamann at kit.edu
Wed Jan 21 18:29:27 CET 2015
Hi Kolja,
On Wednesday 21 January 2015 17:17:15, Kolja Esders wrote:
> During this effort I encountered that the regular Matrix class is not well
> suited to store the resulting scores for all algorithms as the similarity
> matrices are oftentimes very dense or need only the storage space of an
> triangular matrix due to symmetry.
> Consequently I've implemented a vector-based SymmetricSimilarityMatrix
> which stores a triangular-matrix in a vector allowing for fast access and
> less memory usage.
> This approach collides with the Matrix classes already present because the
> new class can't inherit from the current Matrix class which is based on a
> graph-implementation.
> To solve this problem I would propose a common interface "Matrix" which
> simply defines a set of pure virtual functions which could then be
> implemented from a SparseMatrix class (the current Matrix class) and the
> SymmetricSimilarityMatrix as well.
> What do you guys think about this approach? Are there better ways to solve
> this?
This sounds good in theory, I have only one concern: performance. Calling
purely virtual functions isn't particularly fast compared to e.g. directly
accessing an element of a vector using a method that can be inlined (if it is
in the header only afaik).
I don't know if this is a problem in your/our case so my suggestion would be
to try if this is a problem or not.
If inheritance cannot be used, templates are usually a good alternative in the
C++ world, i.e. whenever a class should accept a matrix that can be of
different types that share the same, non-inherited interface, just make this a
template class. This is definitely not good from a object-oriented design
perspective but it's usually better for the performance.
> Another problem is the separation of algorithm and the used data structure.
> Right now every algorithm takes a graph as an input and outputs a matrix
> containing the scores for every node-pair. The matrix is stored in the
> class itself and gets allocated during construction.
> Now in order to make use of one algorithm (e.g. common neighbors [CN]) in
> another algorithm (e.g. Jaccard's coefficent using CN) this would mean to
> allocate twice the needed memory (2 matrices, one per class).
> What would be the best approach to separate algorithm and data structure? I
> would suggest that the constructor (or maybe the run method) takes a
> reference to the output matrix to fill with scores. Thus the class wouldn't
> own the matrix.
A different solution would be to optionally return the matrix as rvalue
reference, e.g. with a separate move-getter method. Then the matrix in the
algorithm would be empty again and the algorithm could reject further getter
calls by setting its state back into the original state. I would actually
suggest this for every place in NetworKit where larger objects may be
returned.
A different idea that probably does not work in your case but might work in
other cases is to provide a getter method for individual values of the
container. For example in the connected components algorithm [0] this is done
for the components of individual nodes so you don't need to copy the whole
Partition instance if you just want a single value.
What I like about these getters is that they hide the implementation detail
which data structure is used internally. For example the already mentioned
connected components algorithm might use a union find data structure internally
and generate that Partition on the fly in the getter. However if we want this
kind of encapsulation we can't really avoid the memory duplication, even with
move-getters this would be visible (at least if we do not provide them
everywhere).
I think all in all this is always a tradeoff between performance and elegance.
Passing the matrix in the constructor might work in this case though the
Python transformation won't work if it is one of the automatically transformed
types like a vector, map, set or pair as the values are copied from Python to
C++ and changes are thus not visible in the Python layer. In general it's also
not the most elegant style so if it is not necessary I would advise against
it.
Best regards,
Michael
[0]:
https://algohub.iti.kit.edu/parco/NetworKit/NetworKit/files/1ec89de8cce2ff28885f5d27f8ca81e3974e0453/networkit/cpp/properties/ConnectedComponents.h
More information about the NetworKit
mailing list