@molivier
Last active December 30, 2015 16:07
Wikipedia Extractor
# https://github.com/attardi/wikiextractor
# Download the latest French Wikipedia dump
wget https://dumps.wikimedia.org/frwiki/latest/frwiki-latest-pages-articles.xml.bz2
# Extract article text into files of at most 250K each (add -c to also compress them)
python WikiExtractor.py -b 250K -o extracted frwiki-latest-pages-articles.xml.bz2
# Concatenate the extracted files into one file (a sequence of <doc> blocks, not single-rooted XML)
find extracted -type f -name 'wiki*' -exec cat {} + > text.xml
rm -rf extracted
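
Once text.xml exists, a quick sanity check can confirm that the concatenation picked up the articles. The sketch below assumes WikiExtractor's default output, where each article starts with a line of the form `<doc id="..." url="..." title="...">`; `count_docs` is a hypothetical helper name, not part of WikiExtractor.

```shell
# count_docs FILE: print the number of article blocks in FILE.
# Assumption: every article begins with a line starting with "<doc ".
count_docs() {
    grep -c '^<doc ' "$1"
}

# Report the article count only if the merged file is present.
if [ -f text.xml ]; then
    echo "articles in text.xml: $(count_docs text.xml)"
fi
```

A count of 0 most likely means the `find` pattern matched nothing, so it is worth running this check before trusting text.xml.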