[Networkit] [ACTION REQ.] GitHub Migration

Michael Hamann michael.hamann at kit.edu
Wed Jan 11 13:13:26 CET 2017

Hi all,

On 10.01.2017 15:37, Henning Meyerhenke wrote:
> On 04.01.17 at 15:51, Matteo Riondato wrote:
>>> On Jan 3, 2017, at 10:54 AM, Maximilian Vogel <maximilian.vogel at student.kit.edu> wrote:
>>> - the input/ folder has about 80MB with out.ca-cit-HepTh accounting for 59MB.
>> Maybe this directory could be a separate git repo added as a git submodule to the “main” repo.
>> This way, “standard” clones (`git clone $repo`) don’t have to download files they most likely won’t need (But they can get them easily with `git submodule --init` (or whatever it is.)
> Good point, Matteo, thanks! We should still have a few test files in the
> main repo, but only very small ones. The documentation and installation
> process should then indicate where/how to find the other files.

The files currently in the repository are very unlikely to be a
significant problem. gzip can compress out.ca-cit-HepTh to 8.1MB, which
should give a more realistic estimate of the impact on the repository
size. The
problem is that branches from student theses (and other projects?) were
merged into the NetworKit repository and many of them contained input
files and evaluation data (and maybe even binaries?) that should never
have been committed into the repository. Commits of Jupyter Notebooks
with output also seem to account for quite a lot of space, although
again many of them are not even in the current version of NetworKit.
For example, there are more than 13 versions of the Profiling
Notebook checked in with sizes between 5 and 22MB each - and the 
included PNG output is hardly compressible. There are also at least 4 
versions of a file named 
CommunityDetection-DPar/NetworKit-CommunityDetection-DPar of 24MB each. 
Have a look at [0] if you want to generate some more statistics yourself.
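
As a rough illustration of how such statistics can be gathered (this is
a sketch of my own, not the code from [0]), the following Python snippet
lists the largest blobs that ever appeared in the history. It assumes
git is on the PATH; the function names are made up for this example.
Note that the reported sizes are uncompressed, so they overestimate the
actual impact of compressible files.

```python
import subprocess
from typing import List, Tuple


def parse_batch_check(output: str) -> List[Tuple[int, str]]:
    """Return (size_in_bytes, path) for every blob line, largest first."""
    blobs = []
    for line in output.splitlines():
        # Lines look like "<type> <sha> <size> <path>"; commits have an empty path.
        parts = line.split(" ", 3)
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[3]))
    return sorted(blobs, reverse=True)


def largest_blobs(repo: str, top: int = 20) -> List[Tuple[int, str]]:
    """List the `top` largest blobs reachable from any ref in `repo`."""
    # Every object in the history, as "<sha> <path>" (path missing for commits).
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Ask git for type and uncompressed size; %(rest) echoes the path back.
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(sizes)[:top]
```

Calling largest_blobs(".") in a clone prints nothing by itself; loop
over the result to show size and path. A file that was committed in
several versions shows up once per version.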

Concerning repository size, I would also strongly suggest not working
in branches but in forks, and having only 1-2 branches (master and
stable, if tags are not enough for releases) in the main repository.
That way, not everybody who wants to clone NetworKit needs a copy of
every project that was based on NetworKit but never got merged into the
main branch. Note that you can easily add additional remotes to your
local repository in Git, and the shared history won't be downloaded again.

Some further ideas to keep the repository size down:
  * Do not keep outputs of the Notebooks in the Git repository but 
re-generate them for releases only (as a side effect that also ensures 
that they still work with every release). This should also make diffs of 
Notebooks much more readable.
  * If any larger project (e.g. a student thesis) is to be merged,
either squash the commits during the merge (easily possible with the
GitHub pull request interface) so large files committed in between don't
end up in the main repository, cherry-pick only specific (clean)
commits, or just copy and re-commit the wanted files to a new branch.
  * All significant changes (changes containing more than the correction 
of a typo in the documentation) should be reviewed using pull requests 
so that accidentally committed large files, Notebook outputs, etc. can
be spotted before they end up in the main repository.
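
To make the first point above concrete, here is a minimal sketch of
stripping the outputs from a Notebook before committing it. It assumes
the Notebooks are stored in the nbformat 4 JSON layout (a top-level
"cells" list); older formats would need different handling. The function
names are only for illustration.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook without code cell outputs."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drops embedded PNGs, tables, logs
            cell["execution_count"] = None  # avoids noisy counter-only diffs
    return cleaned


def strip_file(path: str) -> None:
    """Rewrite the notebook at `path` in place with all outputs removed."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Something like this could be run by hand or from a pre-commit hook, so
that only the source cells are ever committed.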

I'm not much in favor of dropping the whole history, since for some
source files it can be interesting to see who wrote or changed them.
However, we should definitely drop the history of these large files.
Some ideas:

* With a list of large files that were once in the repository but are 
definitely not needed anymore we could erase them from the history (see 
also [0] for some code doing that).
* Using a similar technique one could also just erase certain 
directories that contain/contained large files (like Doc/Notebooks, 
input/, CommunityDetection-DPar/ etc.) and simply re-commit the current 
version in a new commit.
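
One possible way to do either of the above is git filter-branch with an
index filter. The sketch below (again my own, not the code from [0])
only builds the command line for a given list of paths; the helper name
is hypothetical. Keep in mind that this rewrites every affected commit,
so all commit hashes change and existing clones and forks have to be
recreated. It should be tried on a fresh clone first.

```python
import shlex
from typing import List


def filter_branch_command(paths: List[str]) -> List[str]:
    """Build a git filter-branch call that erases `paths` from all history."""
    # The index filter is run by a shell for every commit, so quote each path.
    quoted = " ".join(shlex.quote(p) for p in paths)
    # --cached: only touch the index; --ignore-unmatch: do not fail on
    # commits in which the path does not exist; -r: allow directories.
    index_filter = "git rm -r --cached --ignore-unmatch -- " + quoted
    return [
        "git", "filter-branch",
        "--prune-empty",            # drop commits that become empty
        "--index-filter", index_filter,
        "--", "--all",              # rewrite all branches and tags
    ]
```

For example, filter_branch_command(["input/", "Doc/Notebooks"]) yields
an argument list that can be passed to subprocess.run() inside the
clone. Afterwards the current versions of those directories can be
re-committed in a single new commit.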


