metachronistic

Sat, 04 Nov 2006

Mirror the Wikipedia

Filed under: blogging, by cswingle @ 9:34:49 -0900


Update Thu Jan 10 09:38:42 AKST 2008: Unless you really need a complete mirror like this, a much faster way to achieve something similar is to use Thanassis Tsiodras’s Wikipedia Offline method. Templates and other niceties don’t work quite as well with his method, but the setup is much, much faster and easier.


I’ve come to depend on the Wikipedia. Despite potential problems with vandalism, pages without citations, and uneven writing, it’s so much better than anything else I have available. And it’s a click away.

Except when flooding on the Richardson Highway and a mistake by an Alaska Railroad crew cut off Fairbanks from the world. So I’ve been exploring mirroring the Wikipedia on a laptop. Without images and fulltext searching of article text, it weighs in at 7.5 GiB (20061130 dump). If you add the fulltext article search, it’s 23 GiB on your hard drive. That’s a bit much for a laptop (at least mine), but a desktop could handle it easily. The image dumps aren’t being made anymore since many of the images aren’t free from copyright, but even the last dump in November 2005 was 79 GiB. It took about two weeks to download, and I haven’t been able to figure out how to integrate it into my existing mirror.

In any case, here’s the procedure I used:

Install Apache, PHP5, and MySQL. I’m not going to go into detail here, as there are plenty of good tutorials and documentation pages for installing these three things on virtually any platform. I’ve successfully installed Wikipedia mirrors on OS X and Linux, but there’s no reason why this wouldn’t work on Windows, since Apache, PHP and MySQL are all available for that platform. The only potential problem is that the text table is 6.5 GiB, and some Windows file systems may not be able to handle files larger than 4 GiB (NTFS should be able to handle it, but earlier filesystems like FAT32 can’t).

Download the latest version of the MediaWiki software from http://www.mediawiki.org/wiki/Download (the software links are on the right side of the page).

Create the MediaWiki database:

$ mysql -p
mysql> create database wikidb;
mysql> grant create,select,insert,update,delete,lock tables on wikidb.* to user@localhost identified by 'userpasswd';
mysql> grant all on wikidb.* to admin@localhost identified by 'adminpasswd';
mysql> flush privileges;

Untar the mediawiki software to your web server directory:

$ cd /var/www
$ tar xzf ~/mediawiki-1.9.2.tar.gz

Point a web browser to the configuration page, probably something like http://localhost/config/index.php, and fill in the database section with the database name (wikidb), users, and passwords from the SQL you typed in earlier. Click the ‘install’ button. Once that finishes:


$ cd /var/www/
$ mv config/LocalSettings.php .
$ rm -rf config/

More detailed instructions for getting MediaWiki running are at: http://meta.wikimedia.org/wiki/Help:Installation

Now, get the Wikipedia XML dump from http://download.wikimedia.org/enwiki/. Find the most recent directory that contains a valid pages-articles.xml.bz2 file.
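Since the import takes many hours, it may be worth confirming that the download is intact before starting. Here’s a minimal sketch, assuming bzip2 is installed. The check_dump function name is my own invention and the file name in the example is only illustrative:

```shell
# check_dump FILE: succeed if FILE is an intact bzip2 archive.
# bzip2 -t tests the compressed data without writing any output.
check_dump() {
    if bzip2 -t "$1" 2>/dev/null; then
        echo "$1: OK"
    else
        echo "$1: damaged or incomplete" >&2
        return 1
    fi
}

# Example (file name is illustrative):
# check_dump enwiki-20061130-pages-articles.xml.bz2
```

Testing a multi-gigabyte file takes a while, but far less time than discovering a truncated download halfway through the import.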

Also download the mwdumper.jar program from http://download.wikimedia.org/tools/. You’ll need Java installed to run this program.

Configure your MySQL server to handle the load by editing /etc/mysql/my.cnf, changing the following settings:

[mysqld]
max_allowed_packet = 128M
innodb_log_file_size = 100M

[mysql]
max_allowed_packet = 128M
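To double-check that the settings ended up under the right section, something like the following sketch can read them back from the file. The cnf_get function is a hypothetical helper, not a MySQL tool, and it only understands simple "key = value" lines (no include directives, no quoting):

```shell
# cnf_get FILE SECTION KEY: print the value of KEY inside [SECTION].
# Only plain "key = value" lines are handled; commented lines are ignored
# because "#key" does not equal "key".
cnf_get() {
    awk -v section="[$2]" -v key="$3" '
        /^[ \t]*\[/ { current = $1; next }   # entering a new [section]
        current == section {
            split($0, kv, "=")
            gsub(/[ \t]/, "", kv[1])
            if (kv[1] == key) {
                gsub(/^[ \t]+|[ \t]+$/, "", kv[2])
                value = kv[2]                # the last setting wins
            }
        }
        END { if (value != "") print value }
    ' "$1"
}

# Example, assuming the Debian-style path used above:
# cnf_get /etc/mysql/my.cnf mysqld max_allowed_packet
```

After the restart in the next step, typing show variables like 'max_allowed_packet'; at the mysql> prompt shows the value the server is actually using.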

Restart the server, empty some tables and disable binary logging:

$ sudo /etc/init.d/mysql restart
$ mysql -p wikidb
mysql> set sql_log_bin=0;
mysql> delete from page;
mysql> delete from revision;
mysql> delete from text;

Now you’re ready to load in the Wikipedia dump file. This will take several hours to more than a day, depending on how fast your computer is (a dual 1.8 GHz Opteron system with 4 GiB of RAM took a little under 17 hours with an average load around 3.0 on the 20061103 dump file). The command is (all on one line, substituting the name of the dump file you downloaded):

$ java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 enwiki-20060925-pages-articles.xml.bz2 | mysql -u admin -p wikidb

You’ll use the administrator password you chose earlier. You can also use your own MySQL account: since you created the database, you have all the needed rights.

After this finishes, it’s a good idea to make sure there are no errors in the MySQL tables. I normally get a few errors in the pagelinks, templatelinks and page tables. To check the tables for errors:

$ mysqlcheck -p wikidb
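With dozens of tables, the few damaged ones are easy to miss in the output. Here’s a small sketch for filtering it, assuming mysqlcheck marks each healthy table with a trailing OK and prints anything else (a bare table name followed by warning or error lines) for the damaged ones. The failed_tables name is made up for this example:

```shell
# failed_tables: read mysqlcheck output on stdin and print only the lines
# that do not end in "OK" (damaged tables and their messages).
failed_tables() {
    grep -v '[[:space:]]OK$' || true   # grep exits 1 when nothing is left
}

# Example, assuming the wikidb database from above:
# mysqlcheck -p wikidb | failed_tables
```

No output means every table checked out.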

If there are tables with errors, you can repair them in two different ways. The first is done inside MySQL and doesn’t require shutting down the MySQL server. It’s slower, though:

$ mysql -p wikidb
mysql> repair table pagelinks extended;

The faster way requires shutting down the MySQL server:

$ sudo /etc/init.d/mysql stop    # or however you stop it
$ sudo myisamchk -r -q /var/lib/mysql/wikidb/pagelinks.MYI
$ sudo /etc/init.d/mysql start

There are several important extensions to mediawiki that Wikipedia depends on. You can view all of them by going to http://en.wikipedia.org/wiki/Special:Version, which shows everything Wikipedia is currently using. You can get the latest versions of all the extensions with:

$ svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions extensions

svn is the client command for Subversion (http://subversion.tigris.org/). It’s a revision control system that eliminates most of the issues people had with CVS (and rcs before that). The command above will check out all the extensions code into a new directory on your system named extensions.

The important extensions are the parser functions, citation functions, CategoryTree and WikiHiero. Here’s how you install these from the extensions directory that svn created.

Parser functions:

$ cd extensions/ParserFunctions
$ mkdir /var/www/extensions/ParserFunctions
$ cp Expr.php ParserFunctions.php SprintfDateCompat.php /var/www/extensions/ParserFunctions
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/ParserFunctions/ParserFunctions.php");
$wgUseTidy = true;
^d

(The last four lines just append those PHP statements to the LocalSettings.php file; ^d means pressing Ctrl-D to end the input. It’s probably easier to just use a text editor.)
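If you’d rather script those additions than type them at a cat prompt, here’s a minimal sketch. The add_setting function is a hypothetical helper, not part of MediaWiki. It appends a line only if that exact line isn’t already in the file, so running it twice doesn’t duplicate a setting:

```shell
# add_setting FILE LINE: append LINE to FILE unless an identical line exists.
# grep -F matches the text literally; -x requires a whole-line match.
add_setting() {
    grep -qxF -e "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example (single quotes keep the shell from expanding $IP and $wgUseTidy):
# add_setting /var/www/LocalSettings.php 'require_once("$IP/extensions/ParserFunctions/ParserFunctions.php");'
# add_setting /var/www/LocalSettings.php '$wgUseTidy = true;'
```

Like the cat approach, this writes to the very end of the file. If your LocalSettings.php ends with a closing ?> tag, the new lines need to go before it, and a text editor is the safer route.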

Citation functions:

$ cd ../Cite
$ mkdir /var/www/extensions/Cite
$ cp Cite.php Cite.i18n.php /var/www/extensions/Cite/
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/Cite/Cite.php");
^d

CategoryTree:

$ cd ..
$ tar cf - CategoryTree/ | (cd /var/www/extensions/; tar xvf -)
$ cat >> /var/www/LocalSettings.php
$wgUseAjax = true;
require_once("$IP/extensions/CategoryTree/CategoryTree.php");
^d

WikiHiero:

$ tar cf - wikihiero | (cd /var/www/extensions/; tar xvf -)
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/wikihiero/wikihiero.php");
^d

If you want the math to show up properly, you’ll need to have LaTeX, dvips, convert (from the ImageMagick suite), GhostScript, and an OCaml setup to build the code. Here’s how to do it:

$ cd /var/www/math
$ make
$ mkdir ../images/tmp
$ mkdir ../images/math
$ sudo chown -R www-data ../images/

My web server runs as user www-data. If yours uses a different account, that’s what you’d change the images directories to be owned by. Alternatively, you could use chmod -R 777 ../images to make them writeable by anyone.

Change the $wgUseTeX variable in LocalSettings.php to true. If your Wikimirror is at the root of your web server (as it is in the examples above), you need to make sure that your Apache configuration doesn’t have an Alias section for images. If any of the programs mentioned aren’t in the system PATH (like if you installed them in /usr/local/bin or /sw/bin on a Mac), you’ll need to put them in /usr/bin or someplace the script can find them.

MediaWiki comes with a variety of maintenance scripts in the maintenance directory. To allow these to function, you need to put the admin user’s username and password into AdminSettings.php:

$ mv /var/www/AdminSettings.sample /var/www/AdminSettings.php

and change the values of $wgDBadminuser to admin (or what you really set it to when you created the database and initialized your mediawiki) and $wgDBadminpassword to adminpasswd.

Now, if you want the Search box to search anything besides the titles of articles, you’ll need to rebuild the search tables. As I mentioned earlier, these tables make the database grow from 7 GiB to 23 GiB (as of the September 25, 2006 dump), so make sure you’ve got plenty of space before starting this process. I’ve found a Wikimirror is pretty useful even without full searching so don’t abandon the effort if you don’t have 20+ GiB to devote to a mirror.
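A quick way to check the available space before committing to the rebuild is sketched below. The function names are my own, and both the data directory path and the 20 GiB threshold in the example are assumptions to adjust for your system and dump:

```shell
# free_gib DIR: print the free space, in whole GiB, on the filesystem holding DIR.
# df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks,
# and the fourth column is the space available.
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# enough_space DIR NEEDED_GIB: succeed if at least NEEDED_GIB GiB are free.
enough_space() {
    [ "$(free_gib "$1")" -ge "$2" ]
}

# Example (path and threshold are assumptions):
# enough_space /var/lib/mysql 20 || echo "not enough room for the search tables"
```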

To rebuild everything:

$ php /var/www/maintenance/rebuildall.php

This script builds the search tables first (which takes several hours), and then moves on to rebuilding the link tables. Rebuilding the link tables takes a very, very long time, but there’s no problem breaking out of this process once it starts. I’ve found that this has a tendency to damage some of the link tables, requiring a repair before you can continue. If that does happen, note the table that was damaged and the index number where the rebuildall.php script failed. Then:

$ mysql -p wikidb
mysql> repair table pagelinks extended;

(Replace pagelinks with whatever table was damaged.) I’ve had repairs take anywhere from a few minutes to 12 hours, so keep this in mind.

After the table is repaired, edit the /var/www/maintenance/rebuildall.php script and comment out these lines:

# dropTextIndex( $database );
# rebuildTextIndex( $database );
# createTextIndex( $database );
# rebuildRecentChangesTablePass1();
# rebuildRecentChangesTablePass2();

and insert the index number where the previous run crashed into this line:

refreshLinks( 1 );

Then run it again.
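The same edit can be scripted. This is only a sketch: it assumes each of the five calls sits at the start of its own line in your copy of rebuildall.php, as in the listing above, and that the refreshLinks call takes a plain number. The resume_rebuild name and the 250000 index are hypothetical:

```shell
# resume_rebuild FILE START: comment out the search-index and recent-changes
# calls in FILE and make refreshLinks start at START.
# The untouched original is kept as FILE.orig.
resume_rebuild() {
    cp "$1" "$1.orig" &&
    sed -e '/^[[:space:]]*dropTextIndex(/s/^/# /' \
        -e '/^[[:space:]]*rebuildTextIndex(/s/^/# /' \
        -e '/^[[:space:]]*createTextIndex(/s/^/# /' \
        -e '/^[[:space:]]*rebuildRecentChangesTablePass1(/s/^/# /' \
        -e '/^[[:space:]]*rebuildRecentChangesTablePass2(/s/^/# /' \
        -e "s/^\([[:space:]]*refreshLinks( *\)[0-9][0-9]*/\1$2/" \
        "$1.orig" > "$1"
}

# Example (250000 stands in for the index where your run stopped):
# resume_rebuild /var/www/maintenance/rebuildall.php 250000
```

Compare the result against the .orig copy before running it, since the layout of rebuildall.php may differ between MediaWiki versions.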

One final note: Doing all of these processes on a laptop can be very taxing on a computer that might not be well equipped to handle a full load for days at a time. If you have a desktop computer, you can do the dumping and rebuilding on that computer, and after everything is finished, simply copy the database files from the desktop to your laptop. I just tried this with the 20061130 dump, copying all the MySQL files from /var/lib/mysql/wikidb on a Linux machine to /sw/lib/mysql/wikidb on my MacBook Pro. After the copying was finished, I restarted the MySQL daemon, and the Wikipedia mirror is now live on my laptop. The desktop had MySQL version 5.0.24 and the laptop has 5.0.16. I’m not sure how different these can be for a direct copy to work, but it does work between different platforms (Linux and OS X) and architectures (AMD64 and Intel Core Duo).

34 Comments »

  1. Hi,

    I attempted the same thing, and came up with almost the same set of instructions as you did. I would suggest the following before embarking on this:

    1. dump mysql database to a sql file
    2. update my.ini file (preferably use single file per innodb table, in case things go wrong u can simply delete that database, otherwise all innoDB tables go into one file)

    As of now I have not found any way to integrate images. i even downloaded the 76G file. I had some success with wikipedias of languages with fewer articles. Extract the tarball into images and run rebuildImages.php

    Did you find any ways to include images from commons ?

    Cheers

    Aladin

    Comment by aladin — Wed, 15 Nov 2006 @ 4:50:55 -0900

  2. Aladin,

    > Did you find any ways to include images from commons ?

    Not so far. I untarred the huge image dump (20051126_upload.tar) and spent awhile linking it in various places trying to figure out what it was looking for. I don’t remember now if I was running rebuildImages.php each time, though.

    I don’t care so much about the article images, but I really would like to figure out how to get all the Wiki specific images in there (like the little flag icons, for example). Are those the images from ‘commons’? I’ve seen people make reference to this, but I don’t know exactly what it means.

    Anyway, if you do discover a procedure for adding some of the images, let me know!

    Cheers,

    cswingle

    Comment by cswingle — Wed, 15 Nov 2006 @ 6:21:12 -0900

  3. Hi,
    Thanks for the details. I am trying to load wikipedia onto my desktop which is a 2.6GHz Celeron system. Because of local problems I can not run the system for several hours at a stretch. However, as mentioned, it will take several hours to load the xml dump into MySQL. Can anyone tell me if the Wikipedia dump file can be loaded incrementally.
    Regards.
    Abhijit

    Comment by Abhijit — Thu, 16 Nov 2006 @ 2:11:22 -0900

  4. Abhijit,

    I don’t think you can convert the XML dump file to SQL incrementally, but it’s possible that the ‘mwdumper’ part of the process would go much faster if it wasn’t having to wait for the SQL server to catch up. What I’d try is to run mwdumper, but redirect the output to a file rather than piping it directly into MySQL. Once you’ve got the Wikipedia as SQL, you can cut up the SQL file with a text editor and load in one table at a time. Basically you’d want to run everything from the DROP TABLE IF EXISTS statement to the ALTER TABLE ENABLE KEYS statement for each table. The ‘text’ table is the largest, so you might want to do that one when you’ve got the longest stretch on the computer.

    The other thing is if you’ve got access to another computer, you can do the ‘mwdumper’ part on that system, and then use the SQL file on your Celeron box. Getting the file from one to the other might be challenging, though, as the November 5 dump was around 3.8 GiB when compressed with bzip2.

    Cheers,

    cswingle

    Comment by cswingle — Thu, 16 Nov 2006 @ 6:10:44 -0900

  5. cswingle,
    Thanks for the solution. I have access to a 3GHz xeon server where I can do the mwdumper part and shall use a laptop to transfer the SQL files after splitting.
    Regards
    Abhijit

    Comment by Abhijit — Thu, 16 Nov 2006 @ 21:40:05 -0900

  6. Hi,

    I have had partial success working with the import. I imported the data with mwdumper, i got 2.8 million revs with the Sep 2006 data. Not sure if that is everything. I tried doing a search and results show “no article title matches” even though several articles with that text in the title are present (yes i used more than 3 chars for the search).

    i tried everything to get the title search to work. the searchindex.MYD table stays at 0 bytes. Finally i gave up and started the rebuildtextindex.php script, now this will enable full text searching. Something I dont want. Any idea how to get only the title to be searched ? You mention it in your blog, anything special u did.

    Another interesting point to note is that your tables are being created as MyISAM whereas im getting InnoDB tables. Did u manually alter the tables ? using innoDB tables one does not need to repair etc. After the import with mwdumper I did not run rebuildall.php

    Looking forward to your reply

    regards,

    Aladin

    Comment by aladin — Thu, 23 Nov 2006 @ 17:08:37 -0900

  7. Aladin,

    I wish I had some advice for you. What I did to put the Wikipedia on my laptop was to run the mwdumper on a desktop until it was finished. Then I used mysqldump to dump that database, copy it to my laptop and then ‘cat’ the dump file into the database on my laptop. I didn’t do anything else, and all the tables in the database are MyISAM except the hitcounter table, which is a ‘memory’ table.

    I think I might have overstated the title search capability of a wiki mirror before running the rebuild scripts. If you type something into the search box for which an article exists, that article will show up. But if there isn’t an article that matches, it doesn’t offer any options as you might expect if it was really searching titles. I don’t find it to be too much of a problem because the links between articles are working. So when I do a “search” that doesn’t yield an article, I just think of something slightly more general, go to that page and look for a link to the article I was really interested in. It’s not an ideal situation, but it does work, and it’s better than adding 15 or 16 GiB to the size of the database for that one additional feature (at least on my laptop).

    I don’t know why you’ve got all InnoDB tables, but I wonder if it’s a MySQL version issue. I’m using version 5, and maybe 3 or 4 use InnoDB by default.

    Cheers,

    cswingle

    Comment by cswingle — Fri, 24 Nov 2006 @ 8:42:02 -0900

  8. Greetings from the Finger Lakes region East-central NY.

    You are v-e-r-y interesting, intelligent, and industrial.

    My subject is on the Rochester NY beer recipe interest you wrote on; my Great Grand father Robert Neuhierl was the head brew-master there, came from Germany. I have an article that shares his history with that brewery til the 40’s. Just recently I became owner of his entire recipe book hand written in German. I do not know German, but this must have some value? Or do you know brewers who have interests?

    Comment by Gregory Neuhierl — Mon, 04 Dec 2006 @ 10:19:01 -0900

  9. Hello,

    Thanks for sharing valuable information. I am trying to import Wikipedia dumps on my MS Windows desktop machine but have never had success. Importing dump files using a command (ex. java -Xmx200M -jar mwdumper.jar --format=sql:1.5 enwiki-20060925-pages-articles.xml.bz2 | mysql -u admin -p wikidb) always fails and the inserted records count is always less than 436,000. More exactly it always fails at page, revision (436,000) and text (436,613).

    I first thought that the dump file might be corrupted and so tried two different dump (11/04/2006 and 11/30/2006) versions but both dumps fail at the same record. This makes me very frustrated. Are there any limitations in MySQL version 4 or mwdumper.jar that I am not aware of in case of MS-Windows machines?

    Your comments will be much appreciated.

    Comment by Pillou — Tue, 05 Dec 2006 @ 21:09:27 -0900

  10. Pillou,

    I’m afraid I don’t have any experience with this process on Microsoft Windows. You don’t say what the exact error was, but it could be a file size limit problem if your partition is formatted with FAT32 instead of NTFS (4 GB limit with FAT32). If it’s a MySQL error, it’s possible that doing the table inserts one at a time, and then doing a ‘repair table extended’ on each table after insertion might make a difference. You could also try using MySQL 5, which is the version I’ve had success with under Linux and OS X. Inserting the tables one at a time is done by sending the output of mwdumper to a file, instead of piping it directly to mysql. Then you’d just split the large SQL file into separate files, one file per table.

    I can’t remember where I read it, but I seem to recall reading that MySQL table files (the actual *.frm *.MYI, *.MYD files in your MySQL data directory) are compatible from one system to another. So a final option would be to find a system with the same version of MySQL on it, mirror the Wikipedia on it, and then just copy the contents of the wikidb data directory from one system to your Windows system. I haven’t tried this, so I don’t know if it works or not, especially between different operating systems.

    Good luck!

    Chris

    Comment by cswingle — Wed, 06 Dec 2006 @ 6:40:04 -0900

  11. Pillou,

    I had the exact same problem on my dual core MacBook Pro with 3 GB RAM. Someone must have added a page with a huge amount of content in the last couple of months. I tried this exact procedure today and the import failed on page 436,000 for the latest wikipedia dump.

    I solved it by increasing the max_allowed_packet in the MySQL config file “my.cnf”

    for both the client and server change the following:

    [mysqld]
    #max_allowed_packet=32M
    max_allowed_packet=128M

    [mysql]
    #max_allowed_packet=32M
    max_allowed_packet=128M

    Make sure to restart mysql server after the change to pick up the new settings.

    I also modified the call to mwdumper as follows (increasing the memory allocated from 200 to 600M):

    java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 enwiki-20061130-pages-articles.xml.bz2 | mysql5 -u admin -p wikidb

    I’m on page 617,000 of the load right now and it seems ok so far.

    -Pete

    Comment by Pete — Wed, 06 Dec 2006 @ 23:30:54 -0900

  12. Pete,

    Thanks for your comment on Pillou’s problem. I’ve incorporated your changes to the blog post.

    Cheers,

    Chris

    Comment by cswingle — Thu, 07 Dec 2006 @ 6:09:38 -0900

  13. I’ve tried to use mwdumper with exactly the example/recommended parameters, and it starts up, starts outputting lines with page and rev counts, yet my SQL tables are not getting filled, most of the tables have just one record, the rest have none. What might I be doing wrong?

    Comment by jlarson — Thu, 11 Jan 2007 @ 5:49:21 -0900

  14. jlarson,

    Did you wait for mwdumper to finish? What operating system / filesystem are you using?

    Comment by cswingle — Thu, 11 Jan 2007 @ 6:25:22 -0900

  15. i did not wait– i was hoping to see some things right away. does it extract everything first and then set into the tables? i am using opensuse linux, mysql5. as far as file system, when i installed suse i had it repartition my windows drive. i am not sure what that makes my filesystem. but i am pretty new to the LAMP world so everything is a steep learning curve, probably missing something dumb. thanks for the quick reply!!!

    Comment by jlarson — Thu, 11 Jan 2007 @ 7:36:21 -0900

  16. another question is, roughly how big is this database (in MB/GB) once extracted? specifically the enwiki-…-pages-articles.xml.bz2

    Comment by jlarson — Thu, 11 Jan 2007 @ 7:40:10 -0900

  17. jlarson,

    I’d wait several hours at least. While it’s going you can see what’s happening on the MySQL server with:

    $ echo show full processlist | mysql -p

    (or connect to the server with ‘mysql -p’ and then type ‘show full processlist;’ at the mysql> prompt.)

    While mwdumper is running, you should see ‘wikidb’ as the database in use and something in the Info field. At least then you’ll know what the server is doing.

    You might also use ‘sudo du -hs /var/lib/mysql’ (or wherever the database stores its data on openSuSE), to see if the overall size of the database is expanding. While MySQL is building indexes and adding data to tables, it uses temporary files that may be hidden from a normal ‘ls -lh’ command.

    Finally, make sure you’re not running out of hard drive space. I’ve had big data insertions result in a completely full hard drive, often due to MySQL log files (/var/log/mysql on a Debian system).

    Comment by cswingle — Thu, 11 Jan 2007 @ 7:45:57 -0900

  18. jlarson,

    As far as the size of the database, the 20061130 dump resulted in 7.5 GiB of data in /var/lib/mysql/wikidb/ (where the Wiki database lives on a Debian system). If you rebuild the indexes so all the search functions work, the database grows to be more than 25 GiB. For me, fulltext searching isn’t worth the extra space on my hard drive since it’s just a mirror of something I can get easily when the Internet is working, and without fulltext searching you can still find articles by their title.

    Comment by cswingle — Thu, 11 Jan 2007 @ 7:50:42 -0900

    OK, thanks. I do not have the room and the project I am working on does not require that I have the entire thing right away (and it will be a bit before i can get the drive space :). So I am going to use --filter=titlematch:^F which I think should give me just articles starting with “F” (no reason for F, just random). Anyway so it’s plugging along now at “937,000 pages (551.868/sec), 937,000 revs (551.868/sec)”. But “sudo du -s /var/lib/mysql” is still reporting the same size as when I began. So two questions:

    1) Does the “937,000 pages” just represent pages it’s stepped through and at this point still rejecting because they do not start with “F”?

    2) If so, then how high does the pages number go? Wikipedia claims 1.5 million articles but I assume the count is much higher with UserTalk~ and so on involved. (actually I am using --filter:notalk as well).

    you have been really helpful

    Comment by jlarson — Thu, 11 Jan 2007 @ 13:03:35 -0900

  20. jlarson,

    The ‘page’ table contains 4,197,766 rows, so that’s probably the upper limit on that number that prints out during the mwdumper process. I’ve never tried using the --filter option, so I don’t know whether mwdumper is reporting all the pages, or just the F pages. Given that you’re almost up to a million and the total is four million, I suspect it’s displaying the total, not those that are filtered.

    It is strange that you’re not seeing any size difference in the mysql data directory, though. Given that you’ve parsed at least one quarter of the XML, it sure seems like it should have hit upon *some* of the ‘F’ pages. Maybe you can’t do regular expression lingo in the --filter option? Dunno.

    Anyway, I’m afraid you’re just going to have to wait and find out. Do report back, though, on what you find out. I’ve been trying to incorporate what people have told me back into the original post so it remains relevant.

    Comment by cswingle — Thu, 11 Jan 2007 @ 13:14:58 -0900

  21. i finally found time to figure this out and now my data is in, thanks for your help! i am having trouble finding out how to get some of the other tables like categorylinks populated. any clues?

    Comment by jlarson — Sun, 14 Jan 2007 @ 10:39:22 -0900

  22. jlarson,

    I’m glad you got it all in. I think you’ll get all the tables (except the images) populated by running ‘rebuildall.php’ from the maintenance directory (see the end of the original wiki post above). I’m not 100% certain if you get categories, though, and your database will be between three and four times larger. Before you run ‘rebuildall.php’, you might check to see what other scripts are in the maintenance directory and see if any of them look promising.

    Comment by cswingle — Sun, 14 Jan 2007 @ 14:05:20 -0900

  23. ah! I see I can download the categorylinks sql file. good to go then. i do not want to rebuildall as I can’t afford the HD space :(. there are a lot of other goodies in the maintenance directory though. also, btw, how well has your simple spam protection technique worked out? is it something you came up with or cribbed from elsewhere?

    Comment by jlarson — Sun, 14 Jan 2007 @ 15:38:21 -0900

  24. jlarson,

    Spam protection works pretty well. Akismet and moderation are really what stops most of it, though.

    Comment by cswingle — Sun, 14 Jan 2007 @ 16:26:15 -0900

  25. I’ve completed the process using the current mediawiki release. It works, but the front page looks funny; the search box sinks to the bottom of the left hand side and not all the nav is present.

    Are you using a release version of the PHP, or retrieving from svn what they are running on the live site?

    Comment by benson margulies — Thu, 25 Jan 2007 @ 13:28:39 -0900

  26. A caveat: be very careful when reconfiguring innodb. Make sure that innodb_fast_shutdown was 1, and not 2, when you shut down mysql before changing the innodb log size.

    Comment by benson margulies — Thu, 25 Jan 2007 @ 13:29:55 -0900

  27. benson margulies,

    Thanks for the innodb tip.

    As for the navigation, if you don’t rebuild everything the categories and portals won’t work. I can’t remember which pieces came to life when I did the rebuild, but I’m pretty sure the categories worked after that. Even after all that, though, there were still some strange display issues, but to me they all looked like they were related to not having images, so some elements would clash when there wasn’t anything in the space where an image should have been.

    I retrieved the extensions using svn, but that’s it. The mediawiki I was using was a stock tar.gz installation package.

    Comment by cswingle — Thu, 25 Jan 2007 @ 14:18:45 -0900

  28. Here’s a simplification of the recipe.

    When I use the stock mysql and php from Ubuntu, and mediawiki 1.9 or the current svn tip, I get no innodb tables at all. So, I’d recommend:

    1) skip-innodb in my.cnf, instead of growing the log
    2) Turn off the bin-log altogether in my.cnf, instead of using the set command.

    My rather hypersized AMD64 box is now cheerfully running a load at about 1,1200 pages / sec.

    Comment by benson margulies — Fri, 26 Jan 2007 @ 8:00:36 -0900

  29. Thank You

    Comment by Mark — Tue, 17 Apr 2007 @ 22:40:10 -0900

  30. I must say the information on this site has been very useful to me. A big thank you to all the contributors.

    I have managed to populate my MySQL database with pages/articles from the Wikipedia dump on a local mirror but some of the pages are not being displayed properly: the text is displayed at the wrong place on the side menu, and characters like the ones below are being displayed in their place instead

    |{{#if:|, | and }}[[{{{2}}}|{{{2}}}]]}}{{#if:
    |{{#if:|, |, and }}[[{{{3}}}|{{{3}}}]]}}{{#if:
    |{{#if:|, |, and }}[[{{{4}}}|{{{4}}}]]}}{{#if:
    |{{#if:|, |, and }}[[{{{5}}}|{{{5}}}]]}}{{#if:
    |{{#if:|, |, and }}[[{{{6}}}|{{{6}}}]]}}{{#if:
    |{{#if:|, |, and }}[[{{{7}}}|{{{7}}}]]}}{{#if:
    |{{#if:|, |, and }}[[{{{8}}}|{{{8}}}]]}}{{#if:
    |{{#if:|, |, and }}[[{{{9}}}|{{{9}}}]]}}{{#if:

    I will really appreciate any suggestions in regard to this

    Uno

    Comment by Uno — Wed, 03 Oct 2007 @ 2:25:14 -0900

  31. Hi Uno,
    Have you obtained and installed the extensions Cite.php and ParserFunctions.php as indicated?
    Abhijit

    Comment by Abhijit Sarma — Thu, 15 Nov 2007 @ 21:49:45 -0900

  32. Hi, anybody have any comments or ideas? I plan to bring up a public mirror of wikipedia CD/DVD again. should it be legal or should i restrict access to my private ip addresses only ? any suggestions ?
    Mohnkhan

    Comment by Mohiuddin Khan Inamdar — Thu, 20 Dec 2007 @ 3:43:12 -0900

  33. Hi, your article helps me a lot. One question, if I just want to query a lemma in mysql, how can I write the sql-sentence? Thanks.

    Comment by Zhou Xin — Wed, 09 Jan 2008 @ 20:34:32 -0900

  34. […] metachronistic » Mirror the Wikipedia […]

    Pingback by Some Datasets Available on the Web » Data Wrangling Blog — Thu, 17 Jan 2008 @ 17:08:10 -0900

