      request #32305 MediaWiki XML import doesn't recreate users (Migration from 1.23 doesn't show the pages)
    Infos
    #32305
    Manuel Vacelet (vaceletm)
    2023-09-26 14:05
    2023-06-14 17:20
    33936
    Details
    MediaWiki XML import doesn't recreate users (Migration from 1.23 doesn't show the pages)

    We have an issue with a MediaWiki instance: it was migrated, but after the migration there is no content. We don't know how to troubleshoot this.

    Before the migration, "Main Page" had content (and there were a couple of other pages). After the migration, "Main Page" shows no content and is offered as "ready to be created".

    See the attached migration log; everything seems fine.

    Mediawiki Standalone
    14.9
    Empty
    • [ ] enhancement
    • [ ] internal improvement
    Robert Vogel (rvogel), Dejan Savuljesku (dsavuljesku)
    Stage
    Empty
    Closed
    2023-09-21
    Attachments
    References
    Referencing request #32305

    Follow-ups

    User avatar
    • Summary
      -Migration from 1.23 doesn't show the pages 
      +MediaWiki XML import doesn't recreate users (Migration from 1.23 doesn't show the pages) 
    User avatar
    Robert Vogel (rvogel) 2023-09-20 15:36

    revision.rev_actor being set to 0 for all revisions is okay for MediaWiki 1.35. Once the update to MediaWiki 1.39 is performed this column will be filled properly by the update script.

    User avatar

    OK, I managed to get a working solution, but I'm unsure about it.

    I added a step prior to the migration that:

    • Finds all rev_user_text values in the revision table whose rev_user is 0
    • For each of them, forces creation of the user with the maintenance script createAndPromote.php
    • Updates the revision table to set rev_user to the new matching user_id

    When I do that prior to the upgrade, the content is migrated \o/
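The three steps above can be sketched as follows. This is a minimal illustration against an in-memory SQLite database, not the actual fix-up that was run: the table and column names follow the thread (`revision.rev_user`, `revision.rev_user_text`, `user.user_id`), the real Tuleap tables carry a prefix (e.g. `mwuser`, `mwrevision`), and the direct `INSERT` into `user` is only a stand-in for calling createAndPromote.php. The helper names are hypothetical.

```python
import sqlite3

def names_without_user(conn):
    """Step 1: distinct rev_user_text values whose rev_user is 0."""
    rows = conn.execute(
        "SELECT DISTINCT rev_user_text FROM revision "
        "WHERE rev_user = 0 ORDER BY rev_user_text"
    ).fetchall()
    return [name for (name,) in rows]

def relink_revisions(conn):
    """Step 3: point rev_user at the user row whose name matches rev_user_text."""
    cur = conn.execute(
        "UPDATE revision SET rev_user = ("
        " SELECT user_id FROM `user` WHERE user_name = revision.rev_user_text)"
        " WHERE rev_user = 0 AND rev_user_text IN (SELECT user_name FROM `user`)"
    )
    return cur.rowcount

# Tiny made-up data set, only to exercise the logic.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE `user` (user_id INTEGER PRIMARY KEY, user_name TEXT UNIQUE);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_user INTEGER, rev_user_text TEXT);
INSERT INTO revision (rev_user, rev_user_text) VALUES (0, 'Alice'), (0, 'Bob'), (0, 'Alice');
""")

missing = names_without_user(conn)
for name in missing:
    # Step 2 stand-in: on a real wiki this would be createAndPromote.php.
    conn.execute("INSERT INTO `user` (user_name) VALUES (?)", (name,))
updated = relink_revisions(conn)
remaining = names_without_user(conn)
print(missing, updated, remaining)
```

Relinking by name only works if the names in `rev_user_text` match the created user names exactly, which is an assumption worth checking on real data.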

    However, I noticed that after the migration, rev_actor in the revision table is filled with 0 and, when looking at the page history, I cannot see the previous editors. I suspect that's another issue, because I see the same behaviour on instances that were migrated without the import/export process in between.

    User avatar

    Keeping notes on my progress:

    • In the revision table, it seems that rev_user_text is filled with the name of the user that runs the import, but with rev_user set to 0 and no corresponding entry in the user table.
    User avatar
    Robert Vogel (rvogel) 2023-08-22 13:56
    last edited by: Robert Vogel (rvogel) 2023-08-22 13:57

    Apparently --username-prefix was introduced around 1.32. So yes, I believe this is the issue. As importDump.php does not actually import the users, it will create rev_user=0 in MW 1.23.

    As a workaround you could just create all the users in advance. I have created a little script for that: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TuleapWikiFarm/+/951455

    It must be run before importDump.php with the same XML file, like this:

    php extensions/TuleapWikiFarm/maintenance/createUsersFromImportXML.php --src pages.xml
    
    User avatar
    Robert Vogel (rvogel) 2023-08-22 13:24

    Thanks for sharing this! I believe I can give you some explanations here.

    First of all: the MediaWiki dumpBackup.php maintenance script only exports contents. No users, user groups, preferences or anything else. In fact, it will not even include uploaded files unless specifically told to.

    The "contents" are mainly just <revision>s, grouped by the wiki <page> (uniquely identified by its <title>) they belong to. Those revisions also contain their <id> and the <contributor> user information, but these are not really imported again. The revision ID is completely ignored during the import, and the user receives special treatment: here too the <id> is completely ignored, and the <username> gets "prefixed" (by default with "imported>").

    This is by design, as the feature is mainly used by Wikipedians to transfer contents from one Wikipedia (e.g. "fr" ) to another (e.g. "en"). In this case, they want to preserve the username for proper credits, but then again make sure it is not accidentally assigned to an existing user. In theory (or the past; nowadays this is no longer the case) there could be two completely different people using John.smith as a username in two different wikis. That's why on import they make the username imported>John.smith by default.

    This username is intentionally invalid (due to the >) and therefore produces a rev_user=0 after the import.
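To make the prefixing concrete, here is a deliberately simplified illustration of the behavior described above. It is not MediaWiki's code: both function names are hypothetical, and the validity check only looks at the `>` character, whereas MediaWiki's real username rules are broader.

```python
def imported_username(username, prefix="imported"):
    """Return '<prefix>>' + username, or the bare username for an empty prefix."""
    return f"{prefix}>{username}" if prefix else username

def contains_invalid_marker(username):
    """Simplified check: only detects the '>' that the prefix introduces."""
    return ">" in username

default_name = imported_username("John.smith")
bare_name = imported_username("John.smith", prefix="")
print(default_name, contains_invalid_marker(default_name))
print(bare_name, contains_invalid_marker(bare_name))
```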

    Long story short:

    • You can prevent this behavior by adding --username-prefix= without any value to importDump.php
    • There is no way of exporting/importing actual user data (email addresses, group assignments, properties, ...)

    In general I am still a little bit confused, as properly invalid usernames should not cause an issue with the migration. We only faced this issue in the case of valid usernames that had no corresponding entry in the user table.

    Well, as I am writing this, the only issue that comes to my mind is that you actually imported without the prefix (maybe this was not default as of 1.23, I'll check).

    User avatar

    Further investigations. I tried to tweak the XML dump before import but I didn't manage to get a state where users are imported.

    On the bright side, if I tweak things directly at the database level, populating the user table and correcting the user ID field in revision, the migration runs smoothly.

    User avatar

    I managed to get more information there. It appears that the issue probably comes from dump import/export.

    Let me explain the full process:

    1. There is a project on the production server for which we want to dry-run the migration
    2. export the project (MediaWiki 1.23 dumpBackup.php)
    3. import the project (MediaWiki 1.23 importDump.php)
    4. run the migration on the imported project => failure

    When I look at the user table after step 3, it is empty. When I look at the revision table, the user ID is always 0. That likely explains by itself why the migration gets stuck.

    Then I looked at the XML produced by dumpBackup.php and realized that all contributor nodes have the correct username, but the user ID is always 0. I don't know if it's supposed to be this way.
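A quick way to inspect the contributor nodes of a dump is to list every distinct username/ID pair. The sketch below uses only the Python standard library and matches elements by local name, so it does not depend on the export schema version. The sample XML is made up for illustration (including its namespace URI), and contributors without a `<username>` child are skipped.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def contributors(xml_text):
    """Return the sorted, distinct (username, id) pairs found in <contributor> nodes."""
    found = set()
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "contributor":
            continue
        fields = {local_name(c.tag): (c.text or "").strip() for c in elem}
        if "username" in fields:
            found.add((fields["username"], fields.get("id", "")))
    return sorted(found)

# Made-up sample, shaped like the dump described above.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Main Page</title>
    <revision>
      <id>1</id>
      <contributor><username>Alice</username><id>0</id></contributor>
      <text>Hello</text>
    </revision>
    <revision>
      <id>2</id>
      <contributor><username>Bob</username><id>0</id></contributor>
      <text>Hello again</text>
    </revision>
  </page>
</mediawiki>"""

pairs = contributors(SAMPLE)
print(pairs)
```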

    Do you have any idea how to fix this?

    User avatar
    Robert Vogel (rvogel) 2023-06-19 10:34

    I confirm, this looks good. So it may be a different issue and we need to dig a little bit deeper.

    Could you please add

    $wgDebugLogFile = "$IP/cache/32305.log";
    

    to your LocalSettings.php. You can save the file to a different location of course, depending on the environment.

    Then try to access the broken page in the web browser.

    Check the file for DB queries that contain the page name (e.g. Some_broken_page; be sure to use underscores instead of spaces) and the page ID. You can obtain the page ID by calling mw.config.get( 'wgArticleId' ); on the JavaScript console of the web browser, when the respective URL has been loaded.

    Examples of such queries:

    [DBQuery] Title::newFromID [0s] database: SELECT  page_id,page_namespace,page_title,page_is_redirect,page_is_new,page_touched,page_links_updated,page_latest,page_len,page_content_model  FROM `page`    WHERE page_id = 1210  LIMIT 1
    
    [DBQuery] MediaWiki\Page\PageStore::getPageByNameViaLinkCache [0s] database: SELECT  page_id,page_namespace,page_title,page_is_redirect,page_is_new,page_touched,page_links_updated,page_latest,page_len,page_content_model  FROM `page`    WHERE page_namespace = 8 AND page_title = 'Hf-header-SectionAnchorTest'  LIMIT 1 
    
    [DBQuery] MediaWiki\Revision\RevisionStore::fetchRevisionRowFromConds [0.001s] database: SELECT  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,rev_actor,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,user_name  FROM `revision` JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user))   WHERE page_id = 1210 AND (rev_id=page_latest)  LIMIT 1 
    
    [DBQuery] MediaWiki\Revision\RevisionStore::fetchRevisionRowFromConds [0.001s] database: SELECT  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,rev_actor,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,user_name  FROM `revision` JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user))   WHERE page_namespace = 0 AND page_title = 'Some_broken_page' AND (rev_id=page_latest)  LIMIT 1
    

    Try to run those queries manually and check their results. My best guess is that there is an issue with one of the RevisionStore::fetchRevisionRowFromConds queries.
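Picking the relevant queries out of a large debug log can be automated. The sketch below is one possible filter, assuming each query sits on a single line starting with `[DBQuery]` as in the examples above (a real log may carry extra prefixes). The function name and the shortened sample lines are hypothetical.

```python
import re

def matching_queries(log_lines, page_title, page_id):
    """Return [DBQuery] lines that mention the page title or 'page_id = <id>'."""
    title = page_title.replace(" ", "_")  # titles are stored with underscores
    id_pattern = re.compile(rf"page_id = {int(page_id)}\b")
    hits = []
    for line in log_lines:
        if not line.startswith("[DBQuery]"):
            continue
        if title in line or id_pattern.search(line):
            hits.append(line.rstrip("\n"))
    return hits

# Shortened, made-up log lines modeled on the examples above.
LOG = [
    "[DBQuery] Title::newFromID [0s] database: SELECT page_id FROM `page` WHERE page_id = 1210 LIMIT 1",
    "[DBQuery] PageStore [0s] database: SELECT page_id FROM `page` WHERE page_title = 'Hf-header-SectionAnchorTest' LIMIT 1",
    "[DBQuery] RevisionStore [0.001s] database: SELECT rev_id FROM `revision` WHERE page_title = 'Some_broken_page' AND (rev_id=page_latest) LIMIT 1",
    "[session] unrelated line mentioning Some_broken_page",
]

hits = matching_queries(LOG, "Some broken page", 1210)
for line in hits:
    print(line)
```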

    Maybe you could also share the file here (in a redacted version of course)

    User avatar

    From my understanding, it seems OK. Can you confirm?

    mysql> SELECT user_id FROM plugin_mediawiki_152.mwuser ORDER BY user_id ASC;
    +---------+
    | user_id |
    +---------+
    |       1 |
    |       2 |
    |       3 |
    |       4 |
    |       5 |
    |       6 |
    |       7 |
    |       8 |
    |       9 |
    |      10 |
    |      11 |
    |      12 |
    |      13 |
    |      14 |
    |      15 |
    |      16 |
    |      17 |
    |      18 |
    |      19 |
    |      20 |
    +---------+
    20 rows in set (0,00 sec)
    
    mysql> SELECT DISTINCT( rev_user ) as user_id FROM plugin_mediawiki_152.mwrevision ORDER BY user_id ASC;
    +---------+
    | user_id |
    +---------+
    |       1 |
    |       2 |
    |       8 |
    |       9 |
    |      16 |
    +---------+
    5 rows in set (0,00 sec)
    
    User avatar

    I don't have the answer on the target instance yet but as previously said, we delete entries directly in the DB :'(

    There are deletes in the user and user_groups tables,

    as well as manual modifications of names in user, recentchanges and revision.

    User avatar
    Robert Vogel (rvogel) 2023-06-15 14:27

    If it turns out to be a mismatch between the revision.rev_user and user.user_id, you could try to fix that by replacing all unmatched revision.rev_user with some user.user_id that exists. You can create a dedicated user for this purpose by running

    php maintenance/createAndPromote.php DeletedUser somePassword
    

    Be aware that this should be done before running the update to 1.35.
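The replacement itself could look roughly like the sketch below, shown against an in-memory SQLite database with made-up rows. It is only an illustration of the idea: the function name is hypothetical, the real Tuleap tables carry a prefix (e.g. `mwuser`, `mwrevision`), and leaving `rev_user = 0` untouched is an assumption of this sketch, not something stated in the thread.

```python
import sqlite3

def reassign_orphans(conn, placeholder_name):
    """Point every unmatched, non-zero rev_user at the placeholder user's ID."""
    row = conn.execute(
        "SELECT user_id FROM `user` WHERE user_name = ?", (placeholder_name,)
    ).fetchone()
    if row is None:
        raise ValueError(f"placeholder user {placeholder_name!r} does not exist")
    cur = conn.execute(
        "UPDATE revision SET rev_user = ? "
        "WHERE rev_user != 0 AND rev_user NOT IN (SELECT user_id FROM `user`)",
        (row[0],),
    )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE `user` (user_id INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_user INTEGER);
INSERT INTO `user` VALUES (1, 'Alice'), (2, 'Bob'), (3, 'DeletedUser');
INSERT INTO revision (rev_user) VALUES (1), (7), (9), (0);
""")

moved = reassign_orphans(conn, "DeletedUser")
rev_users = [u for (u,) in conn.execute("SELECT rev_user FROM revision ORDER BY rev_id")]
print(moved, rev_users)
```

Taking a database backup before running any such update on a real wiki is advisable.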

    User avatar
    Robert Vogel (rvogel) 2023-06-15 14:23

    Another thing that can be checked is whether appending ?action=history to the URL of a broken page shows an actual version history or not.

    User avatar
    Robert Vogel (rvogel) 2023-06-15 08:29

    This could be caused by an inconsistency in the source database. Maybe rows have been manually deleted from the user table. Can you please run the following two SQL statements on the source DB and compare the output?

    SELECT user_id FROM user ORDER BY user_id ASC;
    
    SELECT DISTINCT( rev_user ) as user_id FROM revision ORDER BY user_id ASC;
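Instead of comparing the two result lists by eye, the comparison can be folded into a single anti-join that returns only the `rev_user` values with no matching `user_id`. The sketch below runs that query against an in-memory SQLite database with made-up rows. The query is one possible formulation, not part of the original suggestion, and excluding `rev_user = 0` is an assumption of this sketch.

```python
import sqlite3

# rev_user values that have no row in the user table.
ORPHAN_QUERY = """
SELECT DISTINCT r.rev_user AS user_id
FROM revision AS r
LEFT JOIN `user` AS u ON u.user_id = r.rev_user
WHERE u.user_id IS NULL AND r.rev_user != 0
ORDER BY r.rev_user ASC
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE `user` (user_id INTEGER PRIMARY KEY);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_user INTEGER);
INSERT INTO `user` VALUES (1), (2), (3);
INSERT INTO revision (rev_user) VALUES (1), (2), (8), (8), (0);
""")

orphans = [user_id for (user_id,) in conn.execute(ORPHAN_QUERY)]
print(orphans)
```

An empty result means every referenced user ID exists, which is what the two separate lists would show if the database is consistent.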