@louislung
Last active June 23, 2019 14:48
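from sklearn.feature_extraction.text import CountVectorizer

# article_contents, train_row and Y_train are assumed to be defined earlier (not shown here).
label_prefix = "__label__"  # prefix described in the output format below

# min_df=1 keeps every term; recent scikit-learn releases reject an integer min_df of 0.
count_vectorizer = CountVectorizer(min_df=1, max_df=0.99, max_features=10000)
X_train = count_vectorizer.fit_transform(article_contents.main_content.iloc[0:train_row])
# inverse_transform returns, per document, the vocabulary terms it contains
# (word order and repeat counts are not preserved).
X_train = count_vectorizer.inverse_transform(X_train)

# Write one example per line: space-separated terms followed by the prefixed label.
with open("uci_train_starspace_formatted.txt", "w") as file:
    for i in range(train_row):
        file.write(" ".join(X_train[i]) + " " + label_prefix + Y_train.iloc[i])
        file.write("\n")

# The resulting file will look like this (everything separated by spaces; labels carry the prefix __label__)
# how are you ... __label__b
# this is just an example ... __label__c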
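
The line format described above can be sanity-checked with a small standard-library sketch. The helper names `format_example` and `parse_example` are hypothetical and not part of the original gist; the sketch only assumes that labels are marked with the `__label__` prefix and that everything is separated by spaces.

```python
def format_example(tokens, label, label_prefix="__label__"):
    """Join tokens with spaces and append the prefixed label (hypothetical helper)."""
    return " ".join(tokens) + " " + label_prefix + str(label)


def parse_example(line, label_prefix="__label__"):
    """Split one formatted line back into (tokens, labels) (hypothetical helper)."""
    parts = line.split()
    # Anything starting with the prefix is treated as a label; the rest are terms.
    tokens = [p for p in parts if not p.startswith(label_prefix)]
    labels = [p[len(label_prefix):] for p in parts if p.startswith(label_prefix)]
    return tokens, labels


line = format_example(["how", "are", "you"], "b")
print(line)
print(parse_example(line))
```

Running `parse_example` over a few lines of the written file is a quick way to confirm that every example ends with exactly one label before training on it.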