@sprintingdev
Created March 15, 2014 12:15
MapReduce: find hashtags and count their occurrences.
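#!/usr/bin/python
# Mapper: emit "<hashtag>\t1" for every hashtag found on stdin.
import re
import sys

for line in sys.stdin:
    # A hashtag here is '#' followed by ASCII letters or underscores.
    for hashtag in re.findall(r"#[a-zA-Z_]+", line):
        # Lowercase so that #Tag and #tag are counted together.
        print("{0}\t{1}".format(hashtag.lower(), 1))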
#!/usr/bin/python
# Reducer: count occurrences per hashtag. Input must be sorted by key.
import sys

hashtag_count = 0
old_hashtag = None

for line in sys.stdin:
    data_mapped = line.strip().split("\t")
    if len(data_mapped) != 2:
        # Something has gone wrong. Skip this line.
        continue
    hashtag = data_mapped[0]
    if old_hashtag is not None and old_hashtag != hashtag:
        # Key changed: emit the finished hashtag and start a new count.
        print("{0}\t{1}".format(old_hashtag, hashtag_count))
        hashtag_count = 0
    hashtag_count += 1
    old_hashtag = hashtag

# Emit the last hashtag, which the loop never flushes.
if old_hashtag is not None:
    print("{0}\t{1}".format(old_hashtag, hashtag_count))
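
To check the logic without a Hadoop cluster, the map, sort, and reduce steps can be simulated in a single process. The following is only a minimal sketch, assuming Python 3; the names `map_hashtags`, `reduce_counts`, and `run_local` are hypothetical helpers introduced for illustration and are not part of the scripts above. The in-memory `sorted()` call stands in for the shuffle phase, which is what guarantees that identical keys reach the reducer as one contiguous run.

```python
import re
from itertools import groupby

# Same pattern as the mapper above: '#' followed by ASCII letters or underscores.
HASHTAG_RE = re.compile(r"#[a-zA-Z_]+")


def map_hashtags(lines):
    """Mapper step: yield (hashtag, 1) for every hashtag occurrence."""
    for line in lines:
        for hashtag in HASHTAG_RE.findall(line):
            yield hashtag.lower(), 1


def reduce_counts(sorted_pairs):
    """Reducer step: sum the values of each run of identical keys."""
    for hashtag, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield hashtag, sum(count for _, count in group)


def run_local(lines):
    """Map, sort by key (standing in for the shuffle phase), then reduce."""
    return list(reduce_counts(sorted(map_hashtags(lines))))


if __name__ == "__main__":
    # Hypothetical sample input, made up for this example.
    sample = [
        "Learning #Hadoop and #MapReduce",
        "#hadoop streaming with #Python",
    ]
    for hashtag, count in run_local(sample):
        print("{0}\t{1}".format(hashtag, count))
```

Unlike the reducer above, which counts input lines, this sketch sums the emitted values, so it would also give correct totals if partial counts were pre-aggregated before the reduce step.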