@tarleb
Created October 29, 2022 15:05
One sentence per line
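-- Turn the space after a sentence-final period into a SoftBreak.
local function sentence_lines (el)
  local inlines = el.content
  for i = 2, #inlines do
    -- A Space preceded by a Str that ends in a period marks a sentence end.
    if inlines[i].t == 'Space' and
       inlines[i-1].t == 'Str' and
       inlines[i-1].text:match '%.$' then
      inlines[i] = pandoc.SoftBreak()
    end
  end
  return el
end

return {
  -- First discard the existing line wrapping.
  {SoftBreak = function () return pandoc.Space() end},
  -- Then break lines at sentence ends, in Para and then in Plain blocks.
  {Para = sentence_lines},
  {Plain = sentence_lines},
}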
@bpj

bpj commented May 22, 2024

To be run with --wrap=preserve, right?

@bpj

bpj commented May 22, 2024

Also shouldn't the pattern for matching the end-of-sentence rather be '[%.%!%?]$' or even '[%.%!%?]%)?$'?
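
The difference between the three patterns can be checked in plain Lua, without pandoc. This is only an illustrative sketch; the helper name `ends_sentence` and the sample words are made up for the example.

```lua
-- Candidate patterns for "this word ends a sentence".
local patterns = {
  original    = '%.$',           -- period only, as in the gist
  punctuation = '[%.%!%?]$',     -- period, exclamation mark or question mark
  with_paren  = '[%.%!%?]%)?$',  -- same, optionally followed by ")"
}

-- Hypothetical helper: true if `word` matches the named pattern.
local function ends_sentence (word, name)
  return word:match(patterns[name]) ~= nil
end

-- Print one row per sample word: original, punctuation, with_paren.
for _, word in ipairs{'done.', 'really?', 'aside.)', 'word'} do
  print(word,
        ends_sentence(word, 'original'),
        ends_sentence(word, 'punctuation'),
        ends_sentence(word, 'with_paren'))
end
```

Only the last pattern treats a word like `aside.)` as a sentence end, which matters for parenthetical sentences.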

@tarleb
Author

tarleb commented May 22, 2024

Yes to both :)

@alerque

alerque commented Aug 14, 2024

This seemed like a good start, but as soon as I tried to apply it to a book it left a bit on the table.

I've been re-hashing it to also handle quotations, content in divs and blockquotes, emphasized text at end of sentences, etc. If anybody is looking they can see my latest iteration of it in CaSILE here

I am sure there will be more edge cases to come, such as handling ordinal numbers in Turkish (which end in a period), avoiding false positives for abbreviations, and so forth.
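
One possible way to reduce false positives from abbreviations is a small lookup table that is consulted before a break is inserted. This is a sketch under assumptions, not the approach used in CaSILE: the `abbreviations` table and the `is_sentence_end` helper are hypothetical, and a real list would have to be maintained per language.

```lua
-- Hypothetical, deliberately tiny list of words that end in a period
-- without ending a sentence. A real list would be language-specific.
local abbreviations = {
  ['e.g.'] = true, ['i.e.'] = true, ['etc.'] = true,
  ['Dr.'] = true, ['Mr.'] = true, ['Mrs.'] = true,
}

-- Hypothetical helper: true if `word` looks like the end of a sentence.
local function is_sentence_end (word)
  if abbreviations[word] then
    return false
  end
  -- Sentence-final punctuation, optionally followed by ")".
  return word:match '[%.%!%?]%)?$' ~= nil
end

print(is_sentence_end('Dr.'), is_sentence_end('end.'))
```

The trade-off is that an abbreviation such as `etc.` can also end a sentence, so a lookup like this swaps some false positives for false negatives.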

@bpj

bpj commented Aug 16, 2024

@alerque Of course your version will give false positives if the next word starts with a non-ASCII lowercase letter, but I guess you can live with that. :)

I suppose lpeg.utfR can help, though. It is trivial to build an lpeg pattern which matches either ASCII or Swedish/Turkish/whatever "special" lowercase letters at the start of a string. FWIW, I have written a Perl script which builds a Lua table of code point ranges that match or don't match some Unicode properties, as expressed through Perl regexes, to help with building an lpeg pattern which matches all chars in those ranges. I have not yet dared to build a pattern which matches all lower/upper case letters, since that would involve lots of single-char utfR patterns, which I fear would eat lots of memory. Smaller language-specific sets are no problem.
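
For a small language-specific set, the range-table idea can be sketched without lpeg by using the standard `utf8` library from Lua 5.3 and later. This swaps in a plain loop for the lpeg pattern described above; the names `lower_ranges` and `starts_lowercase` are hypothetical, and the ranges cover only a handful of Swedish and Turkish letters as an illustration, not a complete list.

```lua
-- Code point ranges {first, last} treated as lowercase:
-- ASCII a-z plus a few Swedish and Turkish letters (illustrative only).
local lower_ranges = {
  {0x61, 0x7A},    -- a-z
  {0xE4, 0xE5},    -- ä, å
  {0xE7, 0xE7},    -- ç
  {0xF6, 0xF6},    -- ö
  {0xFC, 0xFC},    -- ü
  {0x11F, 0x11F},  -- ğ
  {0x131, 0x131},  -- ı (dotless i)
  {0x15F, 0x15F},  -- ş
}

-- Hypothetical helper: true if the first code point of `s` falls in one
-- of the ranges above. Assumes `s` is valid UTF-8.
local function starts_lowercase (s)
  if s == '' then return false end
  local cp = utf8.codepoint(s, 1)
  for _, range in ipairs(lower_ranges) do
    if cp >= range[1] and cp <= range[2] then
      return true
    end
  end
  return false
end

print(starts_lowercase('abc'), starts_lowercase('Abc'))
```

With lpeg, each `{first, last}` pair could presumably be turned into an `lpeg.utfR(first, last)` pattern and the patterns combined with `+`, which is the approach the comment describes.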

@tarleb
Author

tarleb commented Aug 16, 2024

@alerque Nice! Maybe this could be extracted into a separate project at some point?

@alerque

alerque commented Aug 16, 2024

@bpj Yes, of course, with the much more aggressive approach to not leaving sentences on the table there will be false positives (I know I'll have to tackle some abbreviation issues at some point), but with the original I was getting hundreds of paragraphs in a book that had 2-10 sentences not split up. I'll definitely be looking into lpeg.utfR because better locale-dependent case detection will be important.

@tarleb Yes, definitely, I had that in mind already. At the moment I'm going to be rolling it out to a few dozen book projects in two languages over the next few weeks/month, and it will be easier to iterate on in conjunction with the other normalization stuff I use. When it gets a little more mature and can move at its own pace, it definitely should land in its own project.
