@minhlab
Created July 15, 2017 08:18
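import re

# Each input line names two duplicate documents, separated by one or more spaces.
with open('/Users/cumeo/Downloads/dup-docs.txt') as f:
    s = f.read()
pairs = [re.sub(' +', ' ', line).strip().split(' ') for line in s.strip().split('\n')]

# Map every document to the shared set that holds its whole cluster.
clusters = {}
for doc1, doc2 in pairs:
    c1 = clusters.setdefault(doc1, {doc1})
    c2 = clusters.setdefault(doc2, {doc2})
    if c1 is not c2:
        # Merge the second cluster into the first and repoint its members,
        # so chains such as (A, B), (B, C) end up in a single cluster.
        c1.update(c2)
        for doc in c2:
            clusters[doc] = c1

# Members of a cluster share one set object, so print each distinct set once.
for v in {id(c): c for c in clusters.values()}.values():
    print(', '.join(sorted(v)))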