Duplicate Threads, Posts, Forums and Groups after import of 4.xx

  • Affected Version
    WoltLab Suite 5.3

    Hello

    I imported the database from an earlier version and rebuilt the data.

    The forum has over one million posts and this process took 2 days.

    Now I have duplicates of almost everything in the forums. I don't know how to remove the duplicate posts and threads.

    The old database itself does not seem to have duplicate content so I don't know how importing it has caused duplicates in the new forum.

    Any ideas on how to fix this? There are now almost 3 million posts because half are duplicates.

  • I did import it before but had that same problem, so I deleted the database and forum.

    I created a new database and fresh install of the forum and then ran the import from the shell, only one time.

    I'm not super technical, especially with databases.

  • As Alex said before, it looks like the import was started multiple times. Otherwise there would not be any duplicate content.

    Hopefully you are already using the import via the CLI. If not, make sure you have only one browser window open and do not refresh or close it until the import has finished.

    On the CLI, make sure no other user is able to start the import.

    Uzimaster
    --------------------------------------------
    Si vis pacem, para bellum

  • I did use the CLI, on a new database. I only had that one window open and let it run until it was done.

    I can try it again: delete the database and run the import once more. But I just did that. It took a long time to run, and I let it complete without ever closing the PuTTY window.

    Is there any other way to remove the duplicate content?

  • Is there any other way to remove the duplicate content?

    No, because there is no reliable way to detect duplicate content.

    I suggest that you take a look into the database right after the import is done, before running the data rebuild. For 1-1.5 million posts the import should take about 6-8 hours, so that's something you can have running overnight.
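
    A rough sketch of what such a look into the database could be. The column names (postID, userID, time, message) are assumptions based on the usual wbb1_post layout, so please check them against your own schema first. As said above, this cannot detect duplicates reliably; it only lists posts that share author, timestamp and text.

    ```sql
    -- Total number of posts; compare this with the count in the source database.
    SELECT COUNT(*) AS total_posts FROM wbb1_post;

    -- Posts with the same author, timestamp and text are likely copies.
    -- MD5() is used so the grouping covers the whole message text.
    SELECT userID,
           `time`,
           COUNT(*)    AS copies,
           MIN(postID) AS first_id,
           MAX(postID) AS last_id
    FROM wbb1_post
    GROUP BY userID, `time`, MD5(message)
    HAVING COUNT(*) > 1
    ORDER BY first_id
    LIMIT 20;
    ```

    If the second query returns no rows right after the import, the duplicates are not coming from the import itself.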

  • Also, you should double-check whether MySQL's (or MariaDB's) innodb_flush_log_at_trx_commit is set to 1. If it is, change it to 2 for the duration of the import and data rebuild. This massively decreases the burden on the disks and can speed things up by up to an order of magnitude under heavy write traffic.
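
    For illustration, this is roughly how the setting can be checked and changed at runtime. It assumes your database account is allowed to change global variables; otherwise the value has to be set in the server configuration file instead.

    ```sql
    -- Show the current value (1 is the default).
    SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

    -- Relax the flushing for the duration of the import and data rebuild.
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;

    -- Once everything is finished, switch back to the safe default.
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;
    ```

    With a value of 2, up to about a second of committed transactions can be lost if the machine crashes or loses power, which is why it should only be used temporarily.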

    Thank you!

  • It looks like the problem is my original source database. The duplicate content is in that database.

    Does anyone know of a way to remove duplicate entries from MySQL?

    Man, this is a nightmare.

  • That is a bit difficult, especially given that you did not know about it in the first place. First things first: make sure you have a full backup of the old database. Better safe than sorry.

    You should examine the first two posts and then try to find their duplicates based on content and time. This should give you the postID where the duplication begins. Next you will need to verify that the numbers add up, i.e. that the number of posts before that ID equals the number of posts from that ID onwards. If they match, you could issue a DELETE statement that removes all posts with a postID greater than or equal to the first duplicated postID.
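
    The steps above could look roughly like the following sketch. The column names (postID, userID, time, message) are assumptions about the wbb1_post layout, and @first_dup is just a helper variable made up for this example. Run it against a copy of the database first.

    ```sql
    -- Step 1: look at the two oldest posts.
    SELECT postID, userID, `time`, LEFT(message, 80) AS excerpt
    FROM wbb1_post
    ORDER BY postID
    LIMIT 2;

    -- Step 2: find the lowest postID that repeats the very first post
    -- (same author, same timestamp, same text).
    SET @first_dup = (
        SELECT MIN(d.postID)
        FROM wbb1_post AS o
        JOIN wbb1_post AS d
          ON d.userID <=> o.userID      -- NULL-safe, in case of guest posts
         AND d.`time` = o.`time`
         AND d.message = o.message
         AND d.postID > o.postID
        WHERE o.postID = (SELECT MIN(postID) FROM wbb1_post)
    );
    SELECT @first_dup;

    -- Step 3: check that the numbers add up.
    SELECT SUM(postID <  @first_dup) AS posts_before,
           SUM(postID >= @first_dup) AS posts_from_duplicate
    FROM wbb1_post;

    -- Step 4: only if both counts match. Left commented out on purpose.
    -- Uses >= so that the first duplicated post itself is removed as well.
    -- DELETE FROM wbb1_post WHERE postID >= @first_dup;
    ```

    If @first_dup comes back as NULL, no copy of the first post was found and the DELETE must not be run.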

    Afterwards you can clean up the threads. This query will remove all orphaned threads:

    SQL
    DELETE FROM wbb1_thread WHERE threadID NOT IN (SELECT DISTINCT threadID FROM wbb1_post);

    You should then check whether the users are duplicated too. The procedure should be the same as for the posts themselves.
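
    A possible starting point for the user check, assuming the users live in wcf1_user with userID and username columns (please verify this against your schema):

    ```sql
    -- User names that occur more than once, with the range of IDs involved.
    SELECT username,
           COUNT(*)    AS copies,
           MIN(userID) AS first_id,
           MAX(userID) AS last_id
    FROM wcf1_user
    GROUP BY username
    HAVING COUNT(*) > 1
    ORDER BY first_id
    LIMIT 20;
    ```

    If this returns rows, the smallest last_id hints at where the duplicated range of users begins.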


    I hope this helps!

    PS: We also have a managed hosting offer available and we could take care of the whole cleanup process for you if you decide to migrate to us. ;)
