@khlmnn
Created July 3, 2014 12:42
Convert the Wall Street Journal section of the Penn Treebank to CoNLL format
#!/bin/sh
#
# This Gist converts the Wall Street Journal part of the Penn Treebank
# (more specifically, sections 2–23) to CoNLL 2007 format using
# PennConverter. As suggested by the authors of PennConverter, the script
# first applies the NP bracketing patch by David Vadas.
#
# In order to make this script work, you will need the following files:
#
# * treebank-3.tar.gz, containing the standard distribution of the PTB
#
# * PTB_NP_Bracketing_Data_1.0.tgz, containing the NP bracketing patch by
#   David Vadas (see http://sydney.edu.au/engineering/it/~dvadas1/)
#
# * pennconverter.jar, containing PennConverter (see
#   http://nlp.cs.lth.se/software/treebank_converter/)
#
# Place these files into the same directory as this script and run the
# script from that directory (paths are resolved against the current working
# directory). This will produce three files: train.conll (corresponding to
# WSJ sections 2–21), dev.conll (section 22), and test.conll (section 23).
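# Abort as soon as any command fails.
set -e
# Absolute path of the working directory, used to build all other paths.
root=$(pwd)
treebank="$root/treebank-3/parsed/mrg/wsj"
pennconverter="java -Xmx1G -jar $root/pennconverter.jar -conll2007"
# Unpack the treebank and the NP bracketing patch.
tar xzf treebank-3.tar.gz
tar xzf PTB_NP_Bracketing_Data_1.0.tgz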
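# Apply the NP bracketing patch inside the extracted treebank. The subshell
# makes a failing cd abort the script and leaves the working directory as is.
(cd treebank-3 && patch -p1 < "$root/PTB_NP_Bracketing_Data_1.0/ptb_wsj_np_bracketing_00_24.diff")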
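# Concatenate the trees of each split. The file lists are sorted so that the
# sentence order does not depend on the order in which find visits the files.
cat $(for section in $(seq 2 21); do
    find "$treebank/$(printf %02d "$section")" -name '*.mrg' | sort
done) > "$root/train.mrg"
cat $(find "$treebank/22" -name '*.mrg' | sort) > "$root/dev.mrg"
cat $(find "$treebank/23" -name '*.mrg' | sort) > "$root/test.mrg"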
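# Convert each split. $pennconverter is left unquoted on purpose so that the
# shell splits it into the java command and its arguments.
$pennconverter < "$root/train.mrg" > "$root/train.conll"
$pennconverter < "$root/dev.mrg" > "$root/dev.conll"
$pennconverter < "$root/test.mrg" > "$root/test.conll"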
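# Optional sanity check, not part of the original Gist: a minimal sketch that
# prints the number of sentences in each output file, assuming that the
# converter separates sentences by an empty line, as the CoNLL 2007 format
# prescribes. The helper name count_sentences is made up for this example.
count_sentences() {
    # With RS set to the empty string, awk reads blank-line-separated blocks
    # as records, so NR at the end is the number of sentences.
    awk 'BEGIN { RS = "" } END { print NR }' "$1"
}
for split in train dev test; do
    if [ -f "$root/$split.conll" ]; then
        echo "$split.conll: $(count_sentences "$root/$split.conll") sentences"
    fi
done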