[Networkit] How to read in a large graph (and output a sparse matrix)

Obada Mahdi omahdi at gmail.com
Sun Aug 14 21:00:39 CEST 2016


NetworKit's internal graph representation is not aimed at being the most
compact representation for sparse graphs: it trades space for
flexibility (e.g. efficient updating). The adjacency list structure used
by networkit.Graph adds considerable overhead, which makes it a
questionable choice for a simple scenario like serving as an
intermediate representation when converting graph formats.

The easiest solution may be to just circumvent NetworKit altogether and
directly populate a scipy representation which can then be saved out. I
will include an example script below.

Looking at NetworKit's source, it seems that it does indeed use quite a
lot of memory. While there may be some room for optimization, much of
what could be considered "overhead" for a simple conversion task stems
from design decisions in NetworKit, which opts for an all-purpose
implementation that allows efficient update operations in the context
of dynamic graphs.

On 11.08.2016 11:22, Maximilian Vogel wrote:
[reading an edge list with non-contiguous IDs; m=62500000, n~m/2]
> On 02.08.2016 17:26, Raphael C wrote:
> Well, that's actually what the reader should do when passing
> continuous=False. I'm quite surprised that 8GB are not sufficient.

Actually, it may not be such a surprise after all. The adjacency list
representation used by networkit.Graph uses a number of *separate*
arrays of arrays (represented as instances of std::vector), where each
top-level array is indexed by node ID. The memory required for
representing a weighted, undirected graph with m edges and n nodes includes:

- edge weights: (3*n + 2*m)*8 bytes

Edge weights are stored in a std::vector of std::vector [1], which
requires three pointers (3x64bit on x86_64) per node just for managing
the dynamic lists, plus 2x64bit (two "edgeweight", which is defined as
"double" [2]) per edge, effectively storing an undirected edge as two
reciprocal directed arcs.

Note: The current implementation ("Dev" branch) seems to be a bit buggy,
or at least inconsistent. In Graph::addNode() [3], the std::vector
inEdgeWeights is extended by a copy of an empty vector for every node
even for undirected weighted graphs, which means it uses up an
additional (3*n)*8 bytes that should never be accessed (the vectors
remain empty; only outEdgeWeights seems to be used for undirected
graphs).

- node IDs: (3*n + 2*m)*8 bytes

Adjacent node IDs corresponding to edge weights are stored like edge
weights, with the value indicating the ID of the neighboring node.
Again, one undirected edge is treated as two directed arcs.

- node degrees: (n)*8 bytes

The (outgoing) node degrees are stored separately. I assume this
supports dealing with dynamic graphs without having to compact the
adjacency arrays each time an edge is removed, while still allowing node
degrees to be determined in constant time. Counts are 64-bit integers.

- boolean indicators for node IDs: (n/8)*1 bytes

This amounts to a total of (7*n + 4*m)*8 bytes + (n/8)*1 bytes plus
constant overhead. For m=62500000 and n=m/2, this already amounts to 3.5
GB of memory. Considering the additional memory consumption caused by
the bug mentioned above, we arrive at (10*n + 4*m)*8 bytes + (n/8)*1
bytes plus a constant, totaling about 4.2 GB. If there are no edge
weights, they can be represented implicitly via the adjacent node IDs,
in which case the current implementation still requires (10*n + 2*m)*8
bytes (about 3.3 GB) if I am not mistaken.
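As a sanity check, the formulas above can be evaluated directly with a
few lines of Python (the GB figures in this post are binary GiB):

```python
# Plug the figures from this post into the byte-count formulas above.
m = 62_500_000  # undirected edges
n = m // 2      # nodes (n ~ m/2)
GiB = 2**30

# Weighted graph, without counting the inEdgeWeights issue:
weighted = (7 * n + 4 * m) * 8 + n // 8
# Weighted graph, including the extra (3*n)*8 bytes for inEdgeWeights:
weighted_bug = (10 * n + 4 * m) * 8 + n // 8
# Unweighted graph (weights implicit), same extra term:
unweighted = (10 * n + 2 * m) * 8 + n // 8

print("%.1f / %.1f / %.1f GiB"
      % (weighted / GiB, weighted_bug / GiB, unweighted / GiB))
```

which reproduces the 3.5, ~4.2 and 3.3 GiB figures quoted above.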

More things to consider: while the graph is being built, the std::vector
arrays resize themselves dynamically according to a growth heuristic
built into the STL. This means that each vector may "overshoot" its
final memory requirements; for the most commonly used heuristic
(doubling the capacity), this can be as much as twice the final size in
the worst case. Only after the graph has been read completely does
networkit.readGraph shrink the memory reserved by each std::vector [4].
In addition, the memory required for mapping non-continuous to
continuous node IDs has to be considered, unless the edge list has been
converted to use continuous node IDs beforehand.
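For completeness, such an ID mapping can be built on the fly with an
ordinary dict; this is just a sketch (the function name is made up), and
note that a Python dict with tens of millions of entries carries
substantial memory overhead of its own:

```python
def remap_edges(lines):
    """Map arbitrary (non-continuous) node IDs to continuous 0-based IDs.

    Takes an iterable of whitespace-separated edge lines and returns the
    remapped edge list together with the original-to-continuous ID map.
    """
    ids = {}
    edges = []
    for line in lines:
        u, v = line.split()
        # setdefault assigns the next free continuous ID on first sight
        cu = ids.setdefault(u, len(ids))
        cv = ids.setdefault(v, len(ids))
        edges.append((cu, cv))
    return edges, ids

# e.g. remap_edges(["10 42", "42 7"]) -> ([(0, 1), (1, 2)], {...})
```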

>> Is there anything else I can try? I simply want to read in a graph and
>> output it in any sparse adjacency matrix which scipy can read.
> graphio.writeMat relies on the algebraic module. I've checked the source
> and it appears to use scipy for the adjacency matrix representation[1]
> as well as the file output[2]. The conversion process definitely
> requires more RAM as you'll end up with the NetworKit graph object as
> well as a scipy.sparse.lil_matrix object. [...]

As Maximilian has pointed out, increased memory consumption is to be
expected because of the intermediate representations involved. I suggest
combining the implementations of networkit.graphio.writeMat and
networkit.algebraic.adjacencyMatrix: build an instance of
scipy.sparse.lil_matrix by setting non-zero matrix elements edge by
edge, convert it to CSR format and save it with scipy.io.savemat.

The following script should do the job, assuming that node IDs are
continuous---it is my understanding that you have already managed to
convert your edge list into a form that uses continuous node IDs. I have
not had the time to test it with a larger example graph, let us know
whether this works for you.

#!/usr/bin/env python3
# Simple script to convert an (unweighted) edge list with 0-based continuous
# node IDs to a scipy-compatible sparse matrix, which can be loaded via
# scipy.io.loadmat().
# Written by Obada Mahdi <omahdi at gmail.com> 2016/08/14
import sys
import scipy, scipy.sparse, scipy.io

if len(sys.argv) != 3:
    sys.stdout.write("Usage: %s <edgelist_input> <output>\n" % (sys.argv[0],))
    sys.exit(1)

n = 0
sys.stdout.write("First pass over edge list %s\n" % (sys.argv[1],))
fd = open(sys.argv[1])
for line in fd:
    u, v = map(int, line.split())
    n = max(n, u, v)
sys.stdout.write("Maximum node ID is %d\n" % (n,))
# Adjust for 0-based indexing
n = n+1
# Rewind and populate lil_matrix
fd.seek(0, 0)
m = scipy.sparse.lil_matrix((n, n))
for line in fd:
    u, v = map(int, line.split())
    m[u, v] = 1
    m[v, u] = 1
sys.stdout.write("Writing output to %s\n" % (sys.argv[2],))
scipy.io.savemat(sys.argv[2], {"key": m.tocsr()})
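To sanity-check the output, the .mat file can be read back with
scipy.io.loadmat; "key" is just the placeholder name used in the script
above, and scipy returns the stored sparse matrix in CSC format. The
round trip below goes through an in-memory buffer for brevity (pass a
filename instead for real files):

```python
import io
import scipy.sparse, scipy.io

# Build a tiny symmetric adjacency matrix, as the script above does.
m = scipy.sparse.lil_matrix((3, 3))
m[0, 1] = 1
m[1, 0] = 1

# Round-trip through savemat/loadmat.
buf = io.BytesIO()
scipy.io.savemat(buf, {"key": m.tocsr()})
buf.seek(0)
a = scipy.io.loadmat(buf)["key"]   # comes back as a sparse matrix

# No differing entries means the round trip preserved the matrix.
assert (a != m.tocsr()).nnz == 0
```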

As far as NetworKit is concerned, it might be interesting to look into
the current implementation of networkit.Graph and, for example, combine
some of the arrays into compound structs to reduce overhead, at least
for data that is never optional (e.g. store outEdgeWeights together with
the corresponding node IDs). A more sophisticated approach would
delegate to different C++ implementations under the hood, each tailored
to specific needs. This, of course, would not allow the graph
representation to be changed dynamically (from undirected to directed,
for example).



