local function sentence_lines (el)
  local inlines = el.content
  for i = 2, #inlines do
    if inlines[i].t == 'Space' and
       inlines[i-1].t == 'Str' and
       inlines[i-1].text:match '%.$' then
      inlines[i] = pandoc.SoftBreak()
    end
  end
  return el
end

return {
  {SoftBreak = function () return pandoc.Space() end},
  {Para = sentence_lines},
  {Plain = sentence_lines},
}
Also, shouldn't the pattern for matching the end of a sentence rather be '[%.%!%?]$', or even '[%.%!%?]%)?$' to allow a closing parenthesis?
Yes to both :)
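A minimal sketch of that revision, keeping the rest of the filter as above. The extended Lua pattern accepts '.', '!', or '?' as sentence-ending punctuation, optionally followed by a closing parenthesis:

```lua
-- Sentence-ending token: '.', '!', or '?', optionally followed by ')'.
local eos = '[%.%!%?]%)?$'

-- Drop-in replacement for the check inside sentence_lines:
--   inlines[i-1].text:match(eos)
print(('word.'):match(eos) ~= nil)   --> true
print(('word!)'):match(eos) ~= nil)  --> true
print(('word'):match(eos) ~= nil)    --> false
```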
This seemed like a good start, but as soon as I tried to apply it to a book it left quite a bit on the table.
I've been re-working it to also handle quotations, content in divs and blockquotes, emphasized text at the ends of sentences, etc. If anybody is interested, they can see my latest iteration of it in CaSILE here
I am sure there will be more edge cases to come, such as handling ordinal numbers in Turkish (which end in a period), avoiding false positives for abbreviations, and so forth.
@alerque Of course your version will give false positives if the next word starts with a non-ASCII lowercase letter, but I guess you can live with that. :)
I suppose lpeg.utfR can help, though. It is trivial to build an lpeg pattern which matches either ASCII or Swedish/Turkish/whatever "special" lowercase letters at the start of a string. FWIW, I have written a Perl script which builds a Lua table of ranges of codepoints matching (or not matching) some Unicode properties, as expressed through Perl regexes, to help with building an lpeg pattern that matches all characters in those ranges. I have not yet dared to build a pattern matching all lower/upper case letters, since that would involve lots of single-char utfR patterns which I fear would eat a lot of memory, but smaller language-specific sets are no problem.
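A sketch of such a language-specific set, assuming LPeg >= 1.1 (where lpeg.utfR was introduced). The Turkish codepoints listed are illustrative, not a complete inventory:

```lua
local lpeg = require 'lpeg'
local R, utfR = lpeg.R, lpeg.utfR -- utfR requires LPeg >= 1.1

-- ASCII lowercase plus a few Turkish-specific lowercase letters.
local lower = R'az'
  + utfR(0x00E7, 0x00E7)  -- ç
  + utfR(0x00F6, 0x00F6)  -- ö
  + utfR(0x00FC, 0x00FC)  -- ü
  + utfR(0x011F, 0x011F)  -- ğ
  + utfR(0x0131, 0x0131)  -- ı
  + utfR(0x015F, 0x015F)  -- ş

-- Returns a truthy value when the string starts with one of
-- the lowercase letters above; could be used to suppress a
-- sentence split before such a word.
local function starts_lower (s)
  return lower:match(s) ~= nil
end
```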
@alerque Nice! Maybe this could be extracted into a separate project at some point?
@bpj Yes, of course with the much more aggressive approach to not leaving sentences on the table there will be false positives (I know I'll have to tackle some abbreviation issues at some point), but with the original I was getting hundreds of paragraphs in a book that had 2-10 sentences not split up. I'll definitely be looking into lpeg.utfR, because better locale-dependent case detection will be important.
@tarleb Yes, definitely, I had that in mind already. At the moment, though, I'm going to be rolling it out to a few dozen book projects in two languages over the next few weeks/months, and it will be easier to iterate on in conjunction with other normalization stuff I use; but when it gets a little more mature and can move at its own pace, it definitely should land in its own project.
To be run with --wrap=preserve, right?
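Right. A typical invocation would look like this (the filter filename sentence_lines.lua is assumed); --wrap=preserve keeps pandoc from re-wrapping the lines the filter just broke at sentence boundaries:

```
pandoc --lua-filter sentence_lines.lua --wrap=preserve input.md -o output.md
```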