      request #7724 Gerrit replication housekeeping
    Infos
    #7724
    Ahmed HOSNI (hosniah)
    2015-04-08 13:45
    2014-12-29 13:48
    7888
    Details
    Gerrit replication housekeeping
    Hi all,

    We have noticed several times that the Gerrit queue contains a large number of waiting replication jobs (more than 30 waiting pushes...). We need to keep the Tuleap git mirror as consistent as possible by implementing something similar to what was previously done for gitolite admin housekeeping.

    For that purpose, a small script run every 30 minutes would check whether replication jobs are waiting in the Gerrit queue. If the number of waiting jobs is greater than a quota parameter, we force replication using the "gerrit replicate" command.

    For each connected Gerrit server, we would add a checkbox to enable or disable Gerrit replication housekeeping and a text field for the maximum number of waiting replication jobs.
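
    The check described above could look roughly like the following. This is only a sketch under assumptions: the class and function names are hypothetical, and the way a waiting replication push shows up in the queue listing (one task per line, containing the word "push ") is assumed and must be adapted to what your Gerrit version actually prints.

```python
from dataclasses import dataclass


@dataclass
class HousekeepingSettings:
    """Per-server settings from the proposal (field names are hypothetical)."""
    enabled: bool          # the enable/disable checkbox
    max_waiting_jobs: int  # the quota text field


def count_waiting_pushes(queue_listing: str, marker: str = "push ") -> int:
    """Count queue lines that look like replication pushes.

    Assumption: each queued task is printed on one line and replication
    tasks contain `marker`. Adjust `marker` to the real queue output.
    """
    return sum(1 for line in queue_listing.splitlines() if marker in line)


def should_force_replication(settings: HousekeepingSettings,
                             queue_listing: str) -> bool:
    """True only if housekeeping is enabled and the quota is exceeded."""
    if not settings.enabled:
        return False
    return count_waiting_pushes(queue_listing) > settings.max_waiting_jobs
```

    The queue listing would be fetched over SSH by the periodic script; keeping the decision in a pure function makes the quota logic easy to test without a live Gerrit server.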


    SCM/Gerrit
    Empty
    Empty
    • [x] enhancement
    • [ ] internal improvement
    Nouha Terzi (terzino), Denis PILAT (denis_pilat), Patrick Renaud (patrick.renaud)
    Stage
    Ahmed HOSNI (hosniah)
    Under review
    Empty
    Attachments
    Empty
    References
    Referenced by request #7724

    Follow-ups

    User avatar
    last edited by: Yannis ROSSETTO (rossettoy) 2015-04-08 13:46
    I ran several tests and was able to get Gerrit replication working over HTTP.

    I ran my tests with a Tuleap user named testman (with a username and a password). In the Gerrit replication config file, I modified the url parameter to something like:

    url = http://sonde.cro.enalean.com/plugins/git/${name}.git

    Then, I edited the file secure.config like this:

    [remote "sonde.cro.enalean.com"]
    username = testman
    password = MYPASSWORD

    This file provides the credentials needed for the http request.

    The last thing I did was to give testman RW+ access to my git repository (these access rights were no longer granted because of the Gerrit migration):

    $> cd /var/lib/codendi/gitolite/admin/
    $> vi conf/projects/repo.conf
    $> git commit -am "Grant testman replication stuff" && git push origin master

    Until now, we have had to provide an SSH key to get replication working. This key is passed to a user named forge__gerrit_1. If we want replication to work over HTTP, we will have to provide an HTTP password, assign it to forge__gerrit_1, and grant this user RW+ on the Tuleap repository.

    Is that clear?
    User avatar
    Ahmed HOSNI (hosniah) 2015-03-03 15:23
    Hello Manuel and Patrick,

    Yet another SSH nightmare (I'm just turning the page on Apache Mina's SSHD).

    As of today, a bunch of Jenkins jobs poll our Tuleap server to check ref consistency between the git/Tuleap repositories and the remote git/Gerrit repositories.
    We definitely need to move Gerrit replication over to HTTPS and then get rid of those "aggressive" jobs.
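
    For readers who have not seen such a consistency job, here is a minimal sketch of the comparison it has to make. It assumes the input is the usual "git ls-remote" text (one "<sha><whitespace><refname>" pair per line); the function names are hypothetical and this is not the actual Jenkins job.

```python
from typing import Dict, List


def parse_ls_remote(output: str) -> Dict[str, str]:
    """Map ref name -> commit id from `git ls-remote` style text."""
    refs: Dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sha, ref = line.split(None, 1)
        refs[ref.strip()] = sha
    return refs


def stale_refs(master: Dict[str, str], mirror: Dict[str, str]) -> List[str]:
    """Refs missing on the mirror or pointing at a different commit."""
    return sorted(ref for ref, sha in master.items() if mirror.get(ref) != sha)
```

    On repositories with very large ref counts, producing the two listings is the expensive part, which is why polling this way is hard on the server.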

    I need a commitment from Denis as PO; I'll then open a dedicated SLA for the gitolite config so that it can receive pushes from Gerrit over HTTPS.

    Thanks a lot Patrick for the helpful answer.

    Best regards,
    Ahmed
    User avatar
    Patrick, thanks for your very detailed answer.
    Ahmed did not reply. What do you think about Patrick's warnings?
    User avatar
    Hi Manuel,

    Tx for involving me in this topic. It's actually more complicated than that, and because of what I will explain below, I'm afraid there is nothing Tuleap can do about the issue.

    First, the problem encountered with Gerrit replication over ssh:// is well known and documented. A Gerrit server that gets into that situation simply will not recover! We abandoned ssh:// for replication a while ago, but if my memory serves me well, the JSch bug behind this situation is so bad that restarting the replication plugin is not even sufficient to get rid of the stuck threads. A full restart of Gerrit is needed. And the same issue comes back soon after, as soon as the system load reaches the threshold where the JSch bug manifests itself. There's just no way out, it's ugly...

    Forcing a replication, as proposed, will solve nothing. Since the threads are already stuck, forcing the replication will simply worsen the situation. Been there before....

    This is not a Tuleap issue, it is a Gerrit issue and needs to be dealt with at the Gerrit level. Therefore, anything you can imagine to work around this at the Tuleap level will be to no avail.

    If the Gerrit in use next to Tuleap is under any kind of serious load, then you have no option but to get out of ssh:// for replication. Seriously. We at Ericsson are using git:// over an ssh tunnel and that works like a charm: fast and reliable, but unfriendly to deal with, I admit. I would recommend adopting the http(s) protocol for replication instead if git over ssh appears a bit complex to manage, and if Tuleap's gitolite can be configured to accept push operations from Gerrit using http(s).

    Voilà. I hope this helps a bit.

    BR,
    -Patrick
    User avatar
    Thanks for the very detailed answer. I am adding Patrick Renaud from Ericsson in CC to get his feedback.

    Patrick, what do you think about Ahmed's proposal? Would it make sense on your side?

    I do agree with you that the solution you propose sounds more lightweight and has less maintenance overhead than the SonyMobile one.

    • CC list Patrick Renaud (patrick.renaud) added
    User avatar
    Ahmed HOSNI (hosniah) 2015-02-18 12:50
    last edited by: Ahmed HOSNI (hosniah) 2015-02-18 12:52

    I'm not sure whether it's another topic: in this old thread (2010), Shawn was speaking about a race condition in JSch that could be the origin of the replication issue... JGit could also be guilty, given this discussion.

    Actually, I don't understand why forcing replication every 30mn should help, if the jobs are already in queue, why aren't they processed 

    According to the Gerrit documentation, there are 3 cases that Gerrit's automatic replication scheduling is not designed for, and where the asynchronous gerrit replicate command should help administrators:

    1. Destination disappears, then later comes back online.
    2. After repacking locally, and using rsync to distribute the new pack files to the destinations.
    3. After deleting a ref by hand. 

    and, more important, why would "magically" running "gerrit replicate" actually perform the replication?

    To be honest, I'm 'throwing a bottle into the sea'. I'm not even sure that the root cause (RC) is covered by case 1 (network issue)... I agree with the term "magically", since no RC has been rationally identified: this is simply what has always worked for me on the live server when facing this incident (it occurred 3 times during 2014).

    We are facing a master/slave data consistency issue that was raised as a QoS issue from a business point of view. As a workaround, I need my Tuleap to be able to recover any missing refs that should have been forwarded by the automatic replication schedule managed by the master. I identified two options:

    1- Put the solution implemented by the SonyMobile guys into a Tuleap contrib: Tuleap would continuously check 'stream-events' (or ls-remote, with the risk of bringing the server to its knees when running this extra...) and then decide whether some refs need to be fetched.

    The target master (gerrit.st.com) has more than 500 Gerrit code projects (~50 GB of git repositories) for ~1400 users (more than 250 daily users), with many hyperactive big repositories like AOSP and Linux kernel forks (long history + thousands of branches and tags + millions of refs).

    The approach itself sounds risky (the main Python function runs in a "while True" loop to make the snippet behave as a daemon... I guess I would be banned from contributing if I pushed this for review!)

    2- The solution already implemented: I'm trying to reduce the time during which the slave is missing some refs to at worst 30 minutes. The git System Check would run a gerrit replicate command on the Gerrit 'master' in order to consume any waiting replication job.

    There is no harm for Tuleap: I do not add anything to the Gerrit queues, it just fires an async event that would normally be triggered by a Gerrit repository change. If the queue is empty, the command seems to do nothing (there are no warnings in the Gerrit log).
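
    To make option 1 more concrete, here is a rough sketch of how 'stream-events' output could be turned into a list of refs to fetch. It assumes each event arrives as one JSON object per line and that ref changes are reported as "ref-updated" events carrying a "refUpdate" object with "project" and "refName" keys. The function name is hypothetical and this is not the SonyMobile code.

```python
import json
from typing import Iterable, Set, Tuple


def refs_to_fetch(event_lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect (project, ref) pairs touched by ref-updated events."""
    pending: Set[Tuple[str, str]] = set()
    for line in event_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # ignore anything that is not valid JSON
        if not isinstance(event, dict) or event.get("type") != "ref-updated":
            continue
        update = event.get("refUpdate") or {}
        project, ref = update.get("project"), update.get("refName")
        if project and ref:
            pending.add((project, ref))
    return pending
```

    The function consumes whatever iterable of lines it is given, so the caller decides whether to feed it a bounded batch or a long-lived stream; the "while True" concern above applies only to the second case.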

    User avatar
    I've read the thread but I fail to find the reference to the solution you implement here.

    Actually, I don't understand why forcing replication every 30mn should help: if the jobs are already in the queue, why aren't they processed and, more important, why would "magically" running "gerrit replicate" actually perform the replication?

    Replication through git:// is much more complex than you might expect, as there is no good solution for permissions management on the Tuleap side (there is no way to check/protect things). One year ago, it was also mentioned that replication over https might actually be more efficient (as replication over ssh is buggy because of a bug in JGit). But that's a different topic, isn't it?
    User avatar
    Ahmed HOSNI (hosniah) 2015-02-17 15:37

    Hi Manuel,

     there is maybe another issue that triggers this queuing and forcing it might alter gerrit behaviour ?

    The origin of the queuing is a perf issue within the FOSS Gerrit replication plugin.

    Potential RC for ST's Gerrit: as raised on the Gerrit community ML, the Gerrit replication plugin (which we configure using ssh://) has issues with repositories that have high ref counts. We are facing these "data challenges" since we deal with a large number of hyperactive repos.

    Magnitude: 1k+ branches, 1 million+ refs => we notice both replication performance issues and slow UI rendering.

    IMHO, WAN bandwidth should be the origin of this mess; take a look at MartinFick's hypothesis here.

    Replication over ssh makes things worse.

    I have no idea if we can afford Ericsson's workaround, using git:// via ssh tunnels: autossh + git daemon (they replicate to localhost + port forwarding + git daemon on the Tuleap side).

    User avatar
    Are you sure it's safe to force the replication every 30mn?
    I mean, if replication jobs are in the queue, why aren't they processed? Maybe there is another issue that triggers this queuing, and forcing it might alter Gerrit's behaviour?
    Worst case, your git server is already very busy and, every 30mn, you make sure Gerrit loads it a little bit more with a bunch of new stuff.

    • Status changed from New to Under review
    User avatar
    Ahmed HOSNI (hosniah) 2015-02-02 12:10
    Since we need the workaround for an issue we are facing in our prod environment, we'll first focus on forcing replication on Gerrit 2.5.2 servers (aka Gerrit legacy servers, which are managed by the ssh Gerrit driver).
    It could be extended later to other versions (AFAIK there is no REST endpoint for triggering replication remotely, and since the 2.8+ driver is a Guzzle HTTP client, it would probably need some refactoring to be able to run ssh commands from a Gerrit 2.8+ driver).
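
    As an illustration of what the ssh driver would have to run, here is a small sketch that only builds the command line. The function name is hypothetical, 29418 is Gerrit's usual SSH port, and the default command string ("gerrit replicate --all") is an assumption based on the command named in this request; the exact command name and flags depend on the Gerrit version and must be checked against its documentation.

```python
from typing import List


def build_replicate_command(host: str, user: str, port: int = 29418,
                            gerrit_command: str = "gerrit replicate --all") -> List[str]:
    """Build the ssh argv that asks a Gerrit server to replicate.

    `gerrit_command` is an assumed default; pass the command that your
    Gerrit version actually provides.
    """
    return ["ssh", "-p", str(port), f"{user}@{host}"] + gerrit_command.split()
```

    Returning an argument list rather than a single string lets the caller pass it to a process runner without going through a shell.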
    User avatar
    Ahmed HOSNI (hosniah) 2014-12-29 13:49
    • Original Submission
    • Assigned to changed from None to Ahmed HOSNI (hosniah)