[Networkit] [ACTION REQ.] GitHub Migration
michael.hamann at kit.edu
Wed Jan 11 13:13:26 CET 2017
On 10.01.2017 15:37, Henning Meyerhenke wrote:
> On 04.01.17 at 15:51, Matteo Riondato wrote:
>>> On Jan 3, 2017, at 10:54 AM, Maximilian Vogel <maximilian.vogel at student.kit.edu> wrote:
>>> - the input/ folder has about 80MB with out.ca-cit-HepTh accounting for 59MB.
>> Maybe this directory could be a separate git repo added as a git submodule to the “main” repo.
>> This way, “standard” clones (`git clone $repo`) don’t have to download files they most likely won’t need (but they can get them easily with `git submodule update --init`).
> Good point, Matteo, thanks! We should still have a few test files in the
> main repo, but only very small ones. The documentation and installation
> process should then indicate where/how to find the other files.
The files currently in the repository are very unlikely to be a significant
problem. gzip can compress out.ca-cit-HepTh to 8.1M, which should give a
more realistic estimate of the impact on the repository size. The
problem is that branches from student theses (and other projects?) were
merged into the NetworKit repository and many of them contained input
files and evaluation data (and maybe even binaries?) that should never
have been committed into the repository. Also, commits of Jupyter
Notebooks with output seem to account for quite a lot of space,
although again many of them are not even in the current version of
NetworKit. For example, there are more than 13 versions of the Profiling
Notebook checked in with sizes between 5 and 22MB each - and the
included PNG output is hardly compressible. There are also at least 4
versions of a file named
CommunityDetection-DPar/NetworKit-CommunityDetection-DPar of 24MB each.
Have a look at  if you want to generate some more statistics yourself.
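A minimal sketch of how such statistics could be gathered, assuming
Python 3.7+ and a local clone. The function names are made up for this
example. It relies on `git rev-list --objects --all` and
`git cat-file --batch-check`, and it reports uncompressed object sizes,
so the numbers are an upper bound on what each blob costs in the packed
repository.

```python
import subprocess


def parse_batch_check(lines):
    """Parse '<type> <sha> <size> <path>' lines and keep only blobs."""
    blobs = []
    for line in lines:
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees, tags and malformed lines
        path = parts[3] if len(parts) == 4 else ""
        blobs.append((int(parts[2]), parts[1], path))
    return blobs


def largest_blobs(lines, n=20):
    """Return the n largest blobs as (size, sha, path), biggest first."""
    return sorted(parse_batch_check(lines), reverse=True)[:n]


def history_object_lines(repo="."):
    """List every object reachable from any ref, with type and size."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    fmt = "%(objecttype) %(objectname) %(objectsize) %(rest)"
    listing = subprocess.run(
        ["git", "-C", repo, "cat-file", "--batch-check=" + fmt],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return listing.splitlines()


# Usage inside a clone (not executed here):
#   for size, sha, path in largest_blobs(history_object_lines()):
#       print(f"{size / 1e6:8.1f} MB  {sha[:10]}  {path}")
```

Every version of a file shows up as a separate blob, so a Notebook that
was committed many times with output appears many times in the list.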
Concerning repository size, I would also strongly suggest not working
in branches but in forks, and having only 1-2 branches (master
and stable if tags are not enough for releases) in the main repository,
so that not everybody who wants to clone NetworKit needs to have a copy of
every project that was based on NetworKit but never got merged into the
main branch. Note that you can easily add additional remotes to your
local repository in Git and the shared history won't be downloaded again.
Some further ideas to keep the repository size down:
* Do not keep outputs of the Notebooks in the Git repository but
re-generate them for releases only (as a side effect that also ensures
that they still work with every release). This should also make diffs of
Notebooks much more readable.
* If any larger project (e.g. a student thesis) is to be merged,
either squash the commits during merge (easily possible with the GitHub
pull request interface) so large files committed in between don't end up
in the main repository, cherry-pick only specific (clean) commits or
just copy and re-commit the wanted files to a new branch.
* All significant changes (changes containing more than the correction
of a typo in the documentation) should be reviewed using pull requests
so that accidentally committed large files, Notebook outputs, etc. can
be spotted before they end up in the main repository.
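For the first point, a minimal sketch of stripping outputs before a
commit could look as follows. It assumes the Notebooks use the version 4
JSON layout (a top-level "cells" list); the function names are invented
for this example and are not part of NetworKit.

```python
import json


def strip_outputs(notebook):
    """Clear outputs and execution counts of all code cells (in place)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def strip_file(path):
    """Rewrite the Notebook at `path` without any stored outputs."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    strip_outputs(nb)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Markdown cells and the source of code cells are left untouched, so the
diff of a stripped Notebook only shows changes to the actual content.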
I'm not so much in favor of dropping the whole history as for some
source files it can be interesting who wrote or changed them. However,
we should definitely drop the history of these large files. Some ideas:
* With a list of large files that were once in the repository but are
definitely not needed anymore we could erase them from the history (see
also  for some code doing that).
* Using a similar technique one could also just erase certain
directories that contain/contained large files (like Doc/Notebooks,
input/, CommunityDetection-DPar/ etc.) and simply re-commit the current
version in a new commit.
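One possible way to do this is `git filter-branch` with an index filter.
The sketch below only builds the command line and does not run it; the
helper name is made up, and this is not necessarily the approach of the
code referenced above. Rewriting history changes all commit hashes, so
everybody would have to clone again afterwards.

```python
import shlex


def filter_branch_command(paths):
    """Build (but do not run) a filter-branch call dropping `paths` from all history."""
    if not paths:
        raise ValueError("need at least one path to remove")
    quoted = " ".join(shlex.quote(p) for p in paths)
    # --ignore-unmatch keeps the filter from failing on commits
    # in which the paths do not exist.
    index_filter = "git rm -r --cached --ignore-unmatch -- " + quoted
    return [
        "git", "filter-branch",
        "--index-filter", index_filter,
        "--prune-empty",              # drop commits that become empty
        "--tag-name-filter", "cat",   # rewrite tags to the new commits
        "--", "--all",                # apply to all refs
    ]


# Example with the directories mentioned above (run it on a fresh clone only):
#   import subprocess
#   subprocess.run(filter_branch_command(
#       ["Doc/Notebooks", "input", "CommunityDetection-DPar"]), check=True)
```

After the rewrite, the current versions of the needed files can be
re-committed in a single new commit.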