Chu-Yu Hsu ChuyuHsu

15 followers · 9 following

National Taiwan University
Taiwan
https://www.linkedin.com/profile/view?id=181732295

ChuyuHsu / 词性标记.md

Created July 21, 2016 09:26 — forked from luw2007/词性标记.md

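Part-of-speech tags: covers the ICTPOS3.0 POS tag set, the ICTCLAS Chinese POS tagging set, the POS tags that appear in the jieba dictionary, and some POS tags that can be ignored in simhash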

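Word classes

Content words: nouns, verbs, adjectives, stative words, distinguishing words, numerals, measure words, pronouns
Function words: adverbs, prepositions, conjunctions, particles, onomatopoeia, interjections.

ICTPOS3.0 POS tag set

n noun

nr personal name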
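As a rough illustration of how a tag set like this might be used, the sketch below drops function words from a list of tokens that have already been tagged. It assumes coarse one-letter ICTCLAS-style tags for the function-word classes; the `FUNCTION_WORD_TAGS` set, the `keep_content_words` helper, and the sample tokens are hypothetical and are not the ignore list from the original gist.

```python
# Hypothetical helper: filter (word, tag) pairs by part-of-speech tag.
# Assumption: function words carry the coarse one-letter ICTCLAS-style tags
# d (adverb), p (preposition), c (conjunction), u (particle),
# o (onomatopoeia) and e (interjection).
FUNCTION_WORD_TAGS = {"d", "p", "c", "u", "o", "e"}


def keep_content_words(tagged_tokens):
    """Return the words whose tag is not a function-word tag."""
    return [word for word, tag in tagged_tokens if tag not in FUNCTION_WORD_TAGS]


# Illustrative, hand-tagged tokens (not the output of a real tagger).
tokens = [("小明", "nr"), ("在", "p"), ("北京", "ns"), ("的", "u"), ("学校", "n")]
print(keep_content_words(tokens))
```

Exact tag matching is used rather than prefix matching, since some dictionaries use longer tags that merely start with one of these letters.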

ChuyuHsu / marisa_count_vectorizer.py

Created June 14, 2016 06:06 — forked from kmike/marisa_count_vectorizer.py

	import numpy as np
	import marisa_trie
	from sklearn.feature_extraction.text import CountVectorizer
	from sklearn.externals import six

	class MarisaCountVectorizer(CountVectorizer):

	    # ``CountVectorizer.fit`` method calls ``fit_transform`` so
	    # ``fit`` is not provided
	    def fit_transform(self, raw_documents, y=None):

ChuyuHsu / init.coffee

Last active May 17, 2016 05:39

automatic update by http://atom.io/packages/sync-settings

	# Your init script
	#
	# Atom will evaluate this file each time a new window is opened. It is run
	# after packages are loaded/activated and after the previous editor state
	# has been restored.
	#
	# An example hack to log to the console when each text editor is saved.
	#
	# atom.workspace.observeTextEditors (editor) ->
	#   editor.onDidSave ->

ChuyuHsu / tensorflow rnn variable seq length

Created May 7, 2016 13:19 — forked from evanthebouncy/tensorflow rnn variable seq length

	import tensorflow as tf
	import numpy as np

	if __name__ == '__main__':
	    np.random.seed(1)
	    # size of the LSTM hidden state (the LSTM keeps 2x this amount, so its full state has size 2)
	    size = 1
	    # 2 different sequences in total
	    batch_size = 2
	    # the maximum number of steps for both sequences is 10

ChuyuHsu / one-hot.py

Created December 25, 2015 04:36 — forked from ramhiser/one-hot.py

Apply one-hot encoding to a pandas DataFrame

	import pandas as pd
	import numpy as np
	from sklearn.feature_extraction import DictVectorizer

	def encode_onehot(df, cols):
	    """
	    One-hot encoding is applied to columns specified in a pandas DataFrame.

	    Modified from: https://gist.github.com/kljensen/5452382
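The preview above is cut off before the function body, so the following is only a dependency-free sketch of the one-hot idea, not the gist's implementation. It works on a list of row dicts (the shape `DictVectorizer` consumes) and names the new columns `column=value`; the `one_hot_records` helper and the sample rows are assumptions for illustration.

```python
def one_hot_records(records, cols):
    """One-hot encode the columns in `cols` for a list of row dicts."""
    # Collect the distinct values of each selected column; sorting keeps
    # the generated column order stable between runs.
    categories = {c: sorted({row[c] for row in records}) for c in cols}

    encoded = []
    for row in records:
        # Carry over the columns that are not being encoded.
        new_row = {k: v for k, v in row.items() if k not in cols}
        for c in cols:
            for value in categories[c]:
                # One indicator column per distinct value of the column.
                new_row[f"{c}={value}"] = 1 if row[c] == value else 0
        encoded.append(new_row)
    return encoded


rows = [{"color": "red", "size": 1}, {"color": "blue", "size": 2}]
print(one_hot_records(rows, ["color"]))
```

Each encoded row keeps `size` unchanged and gains one 0/1 indicator per distinct `color` value.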

ChuyuHsu / spark_parallel_boost.py

Created December 24, 2015 02:54 — forked from wpm/spark_parallel_boost.py

A simple example of how to integrate the Spark parallel computing framework and the scikit-learn machine learning toolkit. This script randomly generates test and train data sets, trains an ensemble of decision trees using boosting, and applies the ensemble to the test set. The ensemble training is done in parallel.

	from pyspark import SparkContext

	import numpy as np

	from sklearn.cross_validation import train_test_split, Bootstrap
	from sklearn.datasets import make_classification
	from sklearn.metrics import accuracy_score
	from sklearn.tree import DecisionTreeClassifier

	def run(sc):
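The body of `run` is not shown in this preview. As a hedged, standard-library-only sketch of the general pattern described (fit many small models on resampled data, then combine their votes), the code below trains threshold "stumps" on bootstrap samples and predicts by majority vote. This is closer to bagging than to boosting proper, and every name in it (`bootstrap_sample`, `fit_stump`, `fit_ensemble`, `predict`) is hypothetical; it uses neither Spark nor scikit-learn.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Fit a one-feature threshold classifier on (x, label) rows, labels 0/1.

    Assumption: class 1 tends to have larger x than class 0.
    """
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    if not xs0 or not xs1:
        # Degenerate sample containing a single class: always predict it.
        only = 1 if xs1 else 0
        return lambda x: only
    # Threshold halfway between the two class means.
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0


def fit_ensemble(rows, n_estimators, seed=0):
    rng = random.Random(seed)
    # Each model depends only on its own sample, so this loop is the part
    # a framework such as Spark could distribute across workers.
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_estimators)]


def predict(ensemble, x):
    """Majority vote over the ensemble's individual predictions."""
    votes = Counter(model(x) for model in ensemble)
    return votes.most_common(1)[0][0]


train = [(x, 0) for x in range(5)] + [(x, 1) for x in range(10, 15)]
ensemble = fit_ensemble(train, n_estimators=25)
accuracy = sum(predict(ensemble, x) == y for x, y in train) / len(train)
print(accuracy)
```

An odd number of estimators avoids tied votes in the two-class case.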

ChuyuHsu / pptpd.sh

Created November 19, 2015 11:48 — forked from alvin2ye/pptpd.sh

Automatically install pptpd on Amazon EC2 Amazon Linux

	# Automatically install pptpd on Amazon EC2 Amazon Linux
	#
	# Ripped from http://blog.diahosting.com/linux-tutorial/pptpd/
	# pptpd source rpm packaged by its authors
	#
	# WARNING:
	# The first ms-dns setting is 172.16.0.23, which is what appeared in my
	# /etc/resolv.conf. I'm not sure this is the same in all Amazon AWS zones.
	#
	# You also need to adjust the "Security Groups" you are using.

ChuyuHsu / apply_df_by_multiprocessing.py

Last active September 15, 2015 14:41 — forked from yong27/apply_df_by_multiprocessing.py

pandas DataFrame apply multiprocessing

	import multiprocessing
	import pandas as pd
	import numpy as np

	def _apply_df(args):
	    df, func, kwargs = args
	    return df.apply(func, **kwargs)

	def apply_by_multiprocessing(df, func, **kwargs):
	    workers = kwargs.pop('workers')
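The rest of `apply_by_multiprocessing` is cut off here. The split, apply, and concatenate pattern it suggests can be sketched on plain lists as follows. This is an assumption-laden illustration rather than the gist's code: `split_evenly` and `apply_in_parallel` are hypothetical names, and `ThreadPool` is swapped in for a process pool so that the sketch also works with lambdas. `multiprocessing.Pool` offers the same `map` interface but needs picklable, module-level functions.

```python
from multiprocessing.pool import ThreadPool


def split_evenly(items, n_chunks):
    """Split a list into n_chunks contiguous parts whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def _apply_chunk(args):
    chunk, func = args
    return [func(x) for x in chunk]


def apply_in_parallel(items, func, workers=2):
    """Apply func to every item, one chunk per worker, keeping the input order."""
    chunks = split_evenly(items, workers)
    with ThreadPool(workers) as pool:
        # pool.map returns results in the order of its inputs.
        parts = pool.map(_apply_chunk, [(chunk, func) for chunk in chunks])
    # Concatenate the per-chunk results back into one list.
    return [y for part in parts for y in part]


print(apply_in_parallel(list(range(10)), lambda x: x * x, workers=3))
```

Because the chunks are contiguous and `map` preserves order, concatenating the partial results restores the original ordering.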

ChuyuHsu / useful_pandas_snippets.py

Last active September 18, 2015 07:12 — forked from bsweger/useful_pandas_snippets.md

Useful Pandas Snippets

	#List unique values in a DataFrame column
	pd.unique(df.column_name.ravel())

	#Convert Series datatype to numeric, turning any non-numeric values into NaN
	df['col'] = pd.to_numeric(df['col'], errors='coerce')

	#Grab DataFrame rows where column has certain values
	value_list = ['value1', 'value2', 'value3']
	df = df[df.column.isin(value_list)]