[Networkit] Matrix classes & separation of algorithm and data structure

Michael Hamann michael.hamann at kit.edu
Wed Jan 21 18:29:27 CET 2015


Hi Kolja,

On Wednesday 21 January 2015 17:17:15, Kolja Esders wrote:
> During this effort I encountered that the regular Matrix class is not well
> suited to store the resulting scores for all algorithms as the similarity
> matrices are oftentimes very dense or need only the storage space of an
> triangular matrix due to symmetry.
> Consequently I've implemented a vector-based SymmetricSimilarityMatrix
> which stores a triangular-matrix in a vector allowing for fast access and
> less memory usage.
> This approach collides with the Matrix classes already present because the
> new class can't inherit from the current Matrix class which is based on a
> graph-implementation.
> To solve this problem I would propose a common interface "Matrix" which
> simply defines a set of pure virtual functions which could then be
> implemented from a SparseMatrix class (the current Matrix class) and the
> SymmetricSimilarityMatrix as well.
> What do you guys think about this approach? Are there better ways to solve
> this?

This sounds good in theory, I have only one concern: performance. Calling 
purely virtual functions isn't particularly fast compared to e.g. directly 
accessing an element of a vector using a method that can be inlined (if it is 
in the header only afaik).

I don't know if this is a problem in your/our case so my suggestion would be 
to try if this is a problem or not.

If inheritance cannot be used, templates are usually a good alternative in the 
C++ world, i.e. whenever a class should accept a matrix that can be of 
different types that share the same, non-inherited interface, just make this a 
template class. This is definitely not good from a object-oriented design 
perspective but it's usually better for the performance.

> Another problem is the separation of algorithm and the used data structure.
> Right now every algorithm takes a graph as an input and outputs a matrix
> containing the scores for every node-pair. The matrix is stored in the
> class itself and gets allocated during construction.
> Now in order to make use of one algorithm (e.g. common neighbors [CN]) in
> another algorithm (e.g. Jaccard's coefficent using CN) this would mean to
> allocate twice the needed memory (2 matrices, one per class).
> What would be the best approach to separate algorithm and data structure? I
> would suggest that the constructor (or maybe the run method) takes a
> reference to the output matrix to fill with scores. Thus the class wouldn't
> own the matrix.

A different solution would be to optionally return the matrix as rvalue 
reference, e.g. with a separate move-getter method. Then the matrix in the 
algorithm would be empty again and the algorithm could reject further getter 
calls by setting its state back into the original state. I would actually 
suggest this for every place in NetworKit where larger objects may be 
returned.

A different idea that probably does not work in your case but might work in 
other cases is to provide a getter method for individual values of the 
container. For example in the connected components algorithm [0] this is done 
for the components of individual nodes so you don't need to copy the whole 
Partition instance if you just want a single value.

What I like about these getters is that they hide the implementation detail 
which data structure is used internally. For example the already mentioned 
connected components algorithm might use a union find data structure internally 
and generate that Partition on the fly in the getter. However if we want this 
kind of encapsulation we can't really avoid the memory duplication, even with 
move-getters this would be visible (at least if we do not provide them 
everywhere).

I think all in all this is always a tradeoff between performance and elegance.

Passing the matrix in the constructor might work in this case though the 
Python transformation won't work if it is one of the automatically transformed 
types like a vector, map, set or pair as the values are copied from Python to 
C++ and changes are thus not visible in the Python layer. In general it's also 
not the most elegant style so if it is not necessary I would advise against 
it.

Best regards,
Michael

[0]: 
https://algohub.iti.kit.edu/parco/NetworKit/NetworKit/files/1ec89de8cce2ff28885f5d27f8ca81e3974e0453/networkit/cpp/properties/ConnectedComponents.h



More information about the NetworKit mailing list