
Importing a Wikipedia dump into MySQL


So I was thinking about using Wikipedia data to build a knowledge base and practice some NLP techniques on it. The first step is to import the English portion of Wikipedia into a MySQL database so I can query it as needed.

My first thought was to go to the Wikipedia database download page.

I first tried the ready-made SQL dumps, but the SQL scripts available for download don't actually include the article text we see on Wikipedia. So I had to go to the XML dumps instead and follow the instructions provided by this source.

Basically, we need to use a tool called MWDumper, which converts the XML dump into SQL scripts. We can download the compiled Java here, with the instructions here.
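Before loading MWDumper's output into MySQL, the target database has to exist and the server needs to accept very large statements, since some revisions run to megabytes of wikitext. Here is a minimal sketch of that preparation; the database name wikidb, the binary character set, and the 1 GB packet limit are my own choices, not something the tool requires:

-- Create the database the dump will be loaded into ('wikidb' is just a
-- placeholder name); a binary default character set keeps MySQL from
-- reinterpreting the raw wikitext bytes.
CREATE DATABASE wikidb DEFAULT CHARACTER SET binary;

-- Some of the generated INSERT statements are huge, so raise the largest
-- packet the server will accept (value is in bytes, here 1 GB).
SET GLOBAL max_allowed_packet = 1073741824;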

The code provided by that blog is mostly correct, except that the page table needs one extra column. All we need to do is add it like this:

ALTER TABLE page
ADD COLUMN page_counter INT AFTER page_restrictions;

Another change is that one of the columns in revision is too small, so we need to enlarge the field:

ALTER TABLE `revision`
CHANGE `rev_comment` `rev_comment` blob NOT NULL AFTER `rev_text_id`;

There are also duplicate page_title values in page, so make sure the (page_namespace, page_title) index is not set to UNIQUE:

ALTER TABLE `page`
DROP INDEX `name_title`,
ADD INDEX `name_title` (`page_namespace`, `page_title`);
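To convince yourself this is needed, or to inspect the collisions once the import has run, a query along these lines lists the titles that appear more than once; this is my own illustration, not part of the original instructions:

-- Show (namespace, title) pairs that appear more than once in page.
SELECT page_namespace, page_title, COUNT(*) AS copies
FROM `page`
GROUP BY page_namespace, page_title
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 20;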

Once the schema is patched, it should just be a waiting game until the import finishes. My slow server took about two days, and the final database is about 126 GB.
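As a quick sanity check, and as a starting point for the NLP work, something like the following pulls the current wikitext of a single article. It is only a sketch that assumes the classic page / revision / text layout MWDumper populates, and the article title is just an example:

-- Grab the latest revision's raw wikitext for one article in the main namespace.
-- page_latest points at the current revision, and rev_text_id points at the
-- text row holding the wikitext blob. 'Albert_Einstein' is only an example title.
SELECT p.page_title, t.old_text
FROM `page` AS p
JOIN `revision` AS r ON r.rev_id = p.page_latest
JOIN `text` AS t ON t.old_id = r.rev_text_id
WHERE p.page_namespace = 0
  AND p.page_title = 'Albert_Einstein';

Happy NLPing!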