Saturday, August 18, 2007

Fighting WackoWiki SPAM

For a long time we have had a WackoWiki installation for our open source Far Manager plugins project. Recently I've received an alert from a user that this installation is being seriously spammed. It seemed like our project, developed by and for our close community of Far Manager users gained some additional popularity while WackoWiki engine gained some automation in a form of killer bots. We definitely needed to stop it somehow until we'll be able to move our precious content into a safer place.

I started with wakka.config.php and secured writing rights to authenticated users only:

"default_write_acl" => "$",
"default_comment_acl" => "$",


Of course some kind of CAPTCHA would be a better solution to keep bots at bay, because it gives unauthenticated users ability to contribute, but considering that the last release of WackoWiki was about two years ago it probably doesn't worth patching. The better way would be to move to a new engine. The bad thing that so far there isn't any easy way to export the whole content of WackoWiki - seems like only a page or cluster is possible: http://trac.seagullproject.org/wiki/Internal/Wiki2Trac

Before going any further it is highly advisable to backup database.

mysqldump -h mysql4-f -u farpluginsadmin -p --skip-extended-insert farpluginsdb | bzip2 > farpluginsdb.tar.bz2


The write restriction placed earlier in config file will apply only to new pages. I have to adjust Access Control Lists to already existing pages also to make them editable only by authenticated users. From phpMyAdmin I executed:


UPDATE `wakka_acls` SET `list`='$' WHERE `privilege`='write' AND `list`='*';
UPDATE `wakka_acls` SET `list`='$' WHERE `privilege`='comment' AND `list`='*';


Now when the flow of SPAM seems to be stopped it was time to clean up the mess and restore original pages. The simplest part was to get rid of spam comments. Opening last comments page in one window and phpMyAdmin in another I started to delete comments based on host mask:


SELECT * FROM `wakka_pages` WHERE `supertag` LIKE 'comment%' AND `user` LIKE '%.bezeqint.net';
DELETE FROM `wakka_pages` WHERE `supertag` LIKE 'comment%' AND `user` LIKE '%.bezeqint.net';


Then I started filtering by keywords in comment's body:


SELECT * FROM `wakka_pages` WHERE `supertag` LIKE 'comment%' AND `body` LIKE '%phentermine%';
DELETE FROM `wakka_pages` WHERE `supertag` LIKE 'comment%' AND `body` LIKE '%phentermine%';
DELETE FROM `wakka_pages` WHERE `supertag` LIKE 'comment%' AND `body` LIKE '%cialis%';
DELETE FROM `wakka_pages` WHERE `supertag` LIKE 'comment%' AND `body` LIKE '%casino%';
DELETE FROM `wakka_pages` WHERE `supertag` LIKE 'comment%' AND `body` LIKE '%panties%';
DELETE FROM `wakka_pages` WHERE `supertag` LIKE 'comment%' AND `body` LIKE '%lesbian%';
...


Deleting all comments on particular page:


SELECT * FROM `wakka_pages` WHERE `supertag` LIKE 'comment%' AND `super_comment_on` LIKE '%NovyeKommentarii%';
DELETE FROM `wakka_pages` WHERE `supertag` LIKE 'comment%' AND `super_comment_on` LIKE '%NovyeKommentarii%';


Cleaning ACL table of deleted comment pages:


SELECT * FROM `wakka_acls` LEFT JOIN `wakka_pages` USING(`supertag`) WHERE `wakka_pages`.`supertag` IS NULL;
DELETE FROM `wakka_acls` USING `wakka_acls` LEFT JOIN `wakka_pages` USING(`supertag`) WHERE `wakka_pages`.`supertag` IS NULL;


Time for backup and for the most difficult task - reverting vandalized pages. WackoWiki stores content in two tables - one holds latest version of each pages in wiki format ('body' field) together with cached html version ('body_r' field), and other table stores pages history in wiki-format only. Thanks developers if we clean body_r field with cached content - it will be filled again automatically. So the task to restore one page is rather simple - move all fields for the page from wakka_revisions table to wakka_pages and make sure body_r is clear.

But before restoring any content from wakka_revisions I have to put more time in cleaning up revisions from all previously entered spam garbage. That was easy considering low activity of users during past months. Looking through tables with the phpMyAdmin I've noticed that almost all the pages were starting with some kind of link and edited by unknown users. All this stuff went straight into the dustbin.


SELECT * FROM `wakka_revisions` WHERE ORDER BY `wakka_revisions`.`time` DESC;
DELETE FROM `wakka_revisions` WHERE DATE(`time`) > '2007-04-20' ORDER BY `wakka_revisions`.`time` DESC;


After cleaning revisions table I have to check what new pages were added by spam bots - such pages were still present in wakka_pages after cleanup of wakka_revisions. But ordinary new pages with no revisions are also returned by this query, so you may consider to add time constraint. For me it was enough to add simple sorting by id and remove those few spam pages manually.


SELECT `wakka_pages`.* FROM `wakka_pages` LEFT JOIN `wakka_revisions` USING(`supertag`) WHERE `wakka_revisions`.`supertag` IS NULL ORDER BY `wakka_pages`.`id` DESC;


After this cleanup you may want to repeat procedure of purging ACL table above.

Now that all useless revisions and comments seem to be cleaned out it is time to revert spammed pages to their last good revision. In fact - to last revision before current. The problem now is to get know what pages are spammed and what are not. There are only 365 pages in our wiki, so it will be easy to use `time` field to revert only pages modified during periods of bot activity.

So, to revert page we need to restore fields `time`, `body` and `user` from last revision, clean up `body_r` field to get it repopulated on next access and delete the revision from history. There is actually an empty `body_r` field in revisions field and we will copy it instead if cleaning explicitly. Everything will be done in two steps. First step restores given page fields to its last revision before current. Second deletes restored revision from history table, i.e. revision that has the same value for time field as current page itself.

Query to get data row of latest revision before current from page history:

SELECT w1.supertag, w1.time, w1.body, w1.body_r, w1.user
FROM wakka_revisions w1
LEFT JOIN wakka_revisions w2 ON w1.supertag = w2.supertag AND w1.time < w2.time
WHERE w2.time IS NULL;


Query to get historical revisions that duplicate page contents:

SELECT * FROM `wakka_revisions` JOIN `wakka_pages` USING (`supertag`, `time`)


Select query that joins page contents and last revision together:

SELECT rev.supertag, rev.time, rev.body, rev.body_r, rev.user, pages.user
FROM wakka_pages as pages
JOIN (SELECT w1.* FROM wakka_revisions w1
LEFT JOIN wakka_revisions w2 ON w1.supertag = w2.supertag AND w1.time < w2.time WHERE w2.time IS NULL)
AS rev
USING (`supertag`);


Now actual update query for the page named 'plugring':

UPDATE wakka_pages as page
JOIN (SELECT w1.* FROM wakka_revisions w1
LEFT JOIN wakka_revisions w2 ON w1.supertag = w2.supertag AND w1.time < w2.time
WHERE w2.time IS NULL)
AS rev USING (`supertag`)
SET page.time=rev.time, page.body=rev.body, page.body_r=rev.body_r, page.user=rev.user
WHERE `rev`.`supertag` = 'plugring';


Using date constraints like "WHERE DATE(`page`.`time`)='2007-05-18'" and latest changes page or wakka_pages table sorted by time in DESC order it is rather easy to rollback spam pages. The only thing that should be done afterwards is cleanup:

Delete revisions that duplicate page contents:

DELETE `wakka_revisions` FROM `wakka_revisions` JOIN `wakka_pages` USING (`supertag`, `time`);


After WackoWiki is cleaned out of SPAM I would consider migrating to other engine, as Wacko is unsupported over several last years. I'd choose MediaWiki or Trac depending on project nature. Up to know I haven't seen any automated way to do this, but I have some ideas as always.