Created June 29, 2020 14:13
import time
import mwapi
from deltas.tokenizers import wikitext_split
'''text = """
This is a sentence [[derp|link]].
Here is another paragraph with the number 10.
"""'''
session = mwapi.Session("https://en.wikipedia.org")
doc = session.get(action="query", prop="revisions",
                  titles="Alan Turing", rvprop="content", rvslots="main",
                  formatversion=2)
text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']
start = time.time()
for i in range(100):
    list(wikitext_split.tokenize(text))
print("We can process", 1/((time.time() - start)/100), "Alan Turing's per second")
halfak (Author) commented on Jun 29, 2020
I think we can block CJK characters together by putting a "+" at the end of the CJK definition here: https://github.com/halfak/deltas/blob/master/deltas/tokenizers/wikitext_split.py#L45
This just demos tokenization:
import time
import mwapi
from deltas.tokenizers import wikitext_split
session = mwapi.Session("https://en.wikipedia.org")
doc = session.get(action="query", prop="revisions",
                  titles="Aaron Halfaker", rvprop="content", rvslots="main",
                  formatversion=2)
text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']
for token in wikitext_split.tokenize(text):
    print(repr(token))
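To illustrate the "+" suggestion above, here is a minimal sketch using only the standard `re` module. It does not use the actual CJK definition from `wikitext_split.py`: the `CJK_CHAR` pattern is a hypothetical, simplified stand-in that covers only the CJK Unified Ideographs block, and `find_tokens` is a helper invented for this example. It only shows how adding "+" turns one match per character into one match per run of characters.

```python
import re

# Hypothetical, simplified CJK character class (CJK Unified Ideographs only).
# The real definition in deltas is not reproduced here.
CJK_CHAR = r"[\u4e00-\u9fff]"


def find_tokens(pattern, text):
    """Return every non-overlapping match of `pattern` in `text`."""
    return [m.group(0) for m in re.finditer(pattern, text)]


sample = "维基百科 wiki"

# Without "+": each CJK character is matched on its own.
per_char = find_tokens(CJK_CHAR, sample)

# With "+": a run of adjacent CJK characters is matched as one block.
blocked = find_tokens(CJK_CHAR + "+", sample)

print(per_char)  # ['维', '基', '百', '科']
print(blocked)   # ['维基百科']
```

Fewer, longer matches should mean fewer tokens for CJK-heavy pages, though whether that speeds up the benchmark above would need to be measured.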