@embr
Last active December 14, 2015 19:49
running Stanford Classifier on web request data
#
# Features
#
useClassFeature=true
#
# generate features on url
10.useSplitNGrams=true
10.usePrefixSuffixNGrams=true
#
# generate features on mimetype
12.useSplitNGrams=true
12.usePrefixSuffixNGrams=true
#
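# generate features on UA
# (columns are 0-indexed; rows with the label plus 14 log fields span
# columns 0-14, so confirm that column 15 exists in your data)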
15.useSplitNGrams=true
15.usePrefixSuffixNGrams=true
#
# Printing
#
printClassifier=HighWeight
printClassifierParam=200
#
# Mapping
#
goldAnswerColumn=0
#
# Optimization
#
intern=true
sigma=3
useQN=true
QNsize=15
tolerance=1e-4
#
# Training input
#
trainFile=logs.train
testFile=logs.test
# Build logs.train: label November rows "pre" and December rows "post" in column 0.
# The second command appends (>>) so it does not overwrite the "pre" rows.
zcat /a/squid/archive/sampled/sampled-1000.log-20121101.gz | head -n 10000 | sed 's/\s/\t/g' | awk 'NF==14{print}{}' | head -n 90000 | sed 's/\(.*\)/pre\t\1/' > logs.train
zcat /a/squid/archive/sampled/sampled-1000.log-20121201.gz | head -n 10000 | sed 's/\s/\t/g' | awk 'NF==14{print}{}' | head -n 90000 | sed 's/\(.*\)/post\t\1/' >> logs.train
# Build logs.test the same way from the following day's logs.
zcat /a/squid/archive/sampled/sampled-1000.log-20121102.gz | head -n 10000 | sed 's/\s/\t/g' | awk 'NF==14{print}{}' | head -n 90000 | sed 's/\(.*\)/pre\t\1/' > logs.test
zcat /a/squid/archive/sampled/sampled-1000.log-20121202.gz | head -n 10000 | sed 's/\s/\t/g' | awk 'NF==14{print}{}' | head -n 90000 | sed 's/\(.*\)/post\t\1/' >> logs.test
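Before training, it may be worth checking that each file contains both class labels and that every row has the same number of tab-separated columns. The following is a minimal sketch, not part of the original gist: the helper names `label_counts` and `check_columns` are hypothetical, and the expected width of 15 assumes one label column plus the 14 log fields kept by the `awk` filter.

```shell
#!/bin/sh
# Hypothetical helper: print "label<TAB>count" for each class label (first field).
label_counts() {
    cut -f1 "$1" | sort | uniq -c | awk '{print $2 "\t" $1}'
}

# Hypothetical helper: succeed only if every row has exactly $2 tab-separated
# fields. Splitting strictly on tabs means empty fields are counted too.
check_columns() {
    awk -F'\t' -v n="$2" 'NF != n { bad = 1 } END { exit bad + 0 }' "$1"
}

# Run the checks only for files that exist (assumed width: 1 label + 14 fields).
for f in logs.train logs.test; do
    if [ -f "$f" ]; then
        echo "== $f"
        label_counts "$f"
        check_columns "$f" 15 || echo "$f: unexpected column count" >&2
    fi
done
```

If only one label shows up for a file, the second redirect probably overwrote the first instead of appending.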
java -jar stanford-classifier.jar -prop logs.prop -printClassifier HighWeight