Functional overview
Add git lfs support to Tuleap to allow better management of big files in git.
From a user standpoint, this means being able to use git lfs transparently with a git remote hosted on a Tuleap server, over either ssh or https.
Once git lfs is available, the file size limit allowed in a git repository should be lowered and enforced at a more aggressive value (50MB proposed).
As git lfs makes it easy to store huge files, it's proposed to develop this feature with a per-project quota for file storage.
From a system administration standpoint, it should also be possible to use a cheap file storage system to avoid storing huge files on high-end storage (SAN & co). One option would be to allow usage of minio or openio storage (via the AWS S3 compatibility layer) when available. Storage on a regular filesystem (for instance NFS) would also be supported.
Technical overview
API
To comply with the LFS server API, a list of end-points must be implemented. Their implementation should leverage the work being done on the git front router (request #11450). The end-points to implement:
- https://tuleap.example.com/plugins/git/projectname/foo/bar.git/info/lfs/objects/batch [POST] + verify
- https://tuleap.example.com/plugins/git/projectname/foo/bar.git/info/lfs/locks [POST|GET] + verify
- https://tuleap.example.com/plugins/gitlfs [PUT|GET] (for storage & retrieval of files)
Only "basic" transport would be supported (it's the only transport available for 100% of clients). Tus.io for chunked upload/download might come later.
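To make the batch end-point more concrete, here is a minimal sketch of how a "basic" transfer batch response could be built from a client request. It follows the request/response shape of the git lfs batch API (operation, objects with oid and size, actions with href). The function name, the `known_oids` set and the `<storage base URL>/<oid>` layout are assumptions made for illustration, not the Tuleap implementation. Authentication headers and expiry are left out.

```python
import json


def build_batch_response(request_body, storage_base_url, known_oids):
    """Build a "basic" transfer response for an LFS batch request.

    known_oids is a hypothetical set of object ids already stored server side.
    """
    request = json.loads(request_body)
    operation = request["operation"]  # "upload" or "download"
    objects = []
    for obj in request["objects"]:
        oid, size = obj["oid"], obj["size"]
        entry = {"oid": oid, "size": size}
        href = f"{storage_base_url}/{oid}"  # assumed URL layout
        if operation == "upload":
            # Omitting "actions" tells the client the object is already stored.
            if oid not in known_oids:
                entry["actions"] = {"upload": {"href": href}}
        elif oid in known_oids:
            entry["actions"] = {"download": {"href": href}}
        else:
            entry["error"] = {"code": 404, "message": "Object does not exist"}
        objects.append(entry)
    return json.dumps({"transfer": "basic", "objects": objects})
```

With this shape, the client then issues one PUT or GET per returned href, which is where the storage end-point listed above comes in.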
SSH
ssh/gitolite must be updated to advertise lfs end-points and to manage the authentication hand-off (from ssh to https).
ssh/gitolite should be updated to refuse files larger than 50MB.
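For the ssh to https hand-off, git lfs clients run a `git-lfs-authenticate <repository path> <operation>` command over ssh and expect a JSON document containing `href`, `header` and an expiry. The sketch below shows what such a command could print. The function name, the token generation, the `RemoteAuth` scheme and the 600 second lifetime are assumptions; a real implementation would also have to persist the token so the https side can validate it.

```python
import json
import secrets


def lfs_authenticate_response(repository_path, operation, base_url, ttl_seconds=600):
    """Return (token, json_body) for a git-lfs-authenticate call made over ssh."""
    if operation not in ("download", "upload"):
        raise ValueError(f"unsupported operation: {operation}")
    # Hypothetical short-lived token; storing and validating it is out of scope here.
    token = secrets.token_urlsafe(32)
    payload = {
        "href": f"{base_url}/{repository_path}/info/lfs",
        "header": {"Authorization": f"RemoteAuth {token}"},
        "expires_in": ttl_seconds,
    }
    return token, json.dumps(payload)
```

The client then sends the returned header with every request to the advertised `href`, so no second login is needed on the https side.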
HTTPS
Nothing special needs to change to support lfs for people accessing repositories over HTTPS.
Impact on existing features
We have to take into account:
- forks of lfs-based repositories
- pull requests on lfs objects
- gitphp / git repository browsing of lfs objects (should be part of the work on Modern Git view epics #10400)
PHP & nginx
There are no technical constraints preventing php from efficiently managing very large files, as we handle them with PUT and GET. These two verbs, combined with nginx + fpm, make it possible to manage arbitrary file sizes (tested up to 1GB with 256MB of RAM allowed to php-fpm). Nginx must be configured to accept large client requests (for instance by raising client_max_body_size).
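The reason file size does not drive memory usage is that the body can be processed as a stream, chunk by chunk. The sketch below illustrates the principle in Python (the actual implementation would be in PHP): it copies a stream with a fixed-size buffer and computes the sha256 on the fly so the upload can be verified against the announced object id. The function name and the 1MB chunk size are assumptions.

```python
import hashlib
import io

CHUNK_SIZE = 1024 * 1024  # assumed buffer size: memory use stays around 1MB


def store_stream(source, destination, expected_oid, chunk_size=CHUNK_SIZE):
    """Copy source to destination in chunks and verify the sha256 of the content.

    Returns the number of bytes written, raises ValueError on a hash mismatch.
    """
    digest = hashlib.sha256()
    written = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        destination.write(chunk)
        written += len(chunk)
    if digest.hexdigest() != expected_oid:
        raise ValueError("content does not match the announced sha256")
    return written
```

Because the data is written before the hash is known, a caller would typically write to a temporary location and only move the file into the store once the check has passed.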
Storage
There are 2 approaches for storage management:
- one central "git lfs store" common to the platform
- one "git lfs store" per repository
The first one has the advantage of space efficiency, as the same file is stored only once (given that the file is stored based on the sha256 of its content). If the same 1GB video is shared across 100 repositories in various projects & forks, only 1GB will be used on the FS.
It also makes forks of repositories very simple to manage (nothing to do, as the reference to the file doesn't change).
However, it means that we need to keep a "reference counter" of the files used (which repository uses which file) so we know which files can be garbage collected when repositories and projects are deleted.
The "per repo" strategy gets rid of the "reference counter" but takes more space (no de-duplication), and there are tricks to do on repository forks.
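The bookkeeping needed by the central store can be sketched as an index mapping each object id to the set of repositories using it. This is only an in-memory illustration under assumed names (`CentralLfsIndex`, string repository ids); a real implementation would keep this in the database. In this sketch a fork copies no file data, it only adds index entries.

```python
class CentralLfsIndex:
    """Tracks which repositories reference which LFS objects (by sha256 oid)."""

    def __init__(self):
        self._repos_by_oid = {}  # oid -> set of repository ids

    def add_reference(self, oid, repository_id):
        self._repos_by_oid.setdefault(oid, set()).add(repository_id)

    def fork(self, source_repository_id, fork_repository_id):
        # The fork references the same objects; nothing is duplicated on disk.
        for repos in self._repos_by_oid.values():
            if source_repository_id in repos:
                repos.add(fork_repository_id)

    def delete_repository(self, repository_id):
        """Drop the repository's references and return the oids nobody uses anymore."""
        orphaned = []
        for oid, repos in list(self._repos_by_oid.items()):
            repos.discard(repository_id)
            if not repos:
                orphaned.append(oid)
                del self._repos_by_oid[oid]
        return sorted(orphaned)
```

The list returned by `delete_repository` is what a garbage collection job could safely remove from the store.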
Quota & limits management
We should implement and enforce a quota per project and per platform:
- To limit the amount of storage consumed by a project (with defaults & exceptions). The current "informative only" quota feature for site admins could be re-used.
- To limit the max size of objects stored in LFS (even if we can push a 10GB video, do we really want it on our server?).
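The two limits above could be checked before an upload is accepted, for example when answering the batch request, since the client announces the size of each object up front. A minimal sketch follows; the names and the returned messages are assumptions.

```python
from dataclasses import dataclass


@dataclass
class LfsLimits:
    project_quota_bytes: int  # total LFS storage allowed for the project
    max_object_bytes: int     # platform-wide cap on a single object


def check_upload(limits, project_usage_bytes, object_size_bytes):
    """Return None when the upload is acceptable, else a short refusal reason."""
    if object_size_bytes > limits.max_object_bytes:
        return "object exceeds the maximum allowed size"
    if project_usage_bytes + object_size_bytes > limits.project_quota_bytes:
        return "project quota exceeded"
    return None
```

Since the announced size comes from the client, the number of bytes actually received would still need to be checked when the file is stored.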
3rd party storage
It's a bit out of scope for this feature, but if an object storage solution (like AWS S3) is available, it should be possible to use it instead of storing on the filesystem.
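Swapping the backend is easier if the LFS code only talks to a small storage interface. The sketch below shows the idea with a filesystem backend and an in-memory stand-in for a remote object store. All class names and the sharded directory layout are assumptions, and an S3 backend is deliberately not shown. It handles whole byte strings for brevity, whereas a real backend would stream.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ObjectStore(ABC):
    """Minimal storage contract the LFS end-points would depend on."""

    @abstractmethod
    def put(self, oid: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, oid: str) -> bytes: ...

    @abstractmethod
    def exists(self, oid: str) -> bool: ...


class FilesystemStore(ObjectStore):
    def __init__(self, root):
        self._root = Path(root)

    def _path(self, oid):
        # Assumed layout: shard by oid prefix to avoid one huge directory.
        return self._root / oid[:2] / oid[2:4] / oid

    def put(self, oid, data):
        path = self._path(oid)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, oid):
        return self._path(oid).read_bytes()

    def exists(self, oid):
        return self._path(oid).is_file()


class InMemoryStore(ObjectStore):
    """Stand-in for a remote object store, useful in tests."""

    def __init__(self):
        self._objects = {}

    def put(self, oid, data):
        self._objects[oid] = data

    def get(self, oid):
        return self._objects[oid]

    def exists(self, oid):
        return oid in self._objects
```

Code written against `ObjectStore` does not need to know which backend is configured, which is the same role an abstraction library such as flysystem would play on the PHP side.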
Remaining spikes:
- ref management
- authentication
- verify upload sha256
- flysystem to abstract storage
Resources