@skywodd
Created May 28, 2018 16:32
SemanticText syntax proposal

SemanticText

SemanticText is a proposal for a new text-based writing syntax for the web (and much more).

Status: Draft - Request for comments

Revision: 2.0.0

Context of creation

I'm a writer and a developer: I write code during the day and online content at night. To do this, I use a lot of software and various syntaxes for writing documentation, articles, messages, etc.

My website Carnet du Maker[^1] currently uses a BBCode-like syntax for text input, like nearly all forums on the internet. It works well, but it's far from perfect. My users sometimes complain about it, and so do I, for various reasons.

Writing rich text documents is part of my day job, but the tools currently available are not good enough for my taste. I need a system for writing rich text documents quickly and without headaches. This is why this document came into existence.

^1: <https://www.carnetdumaker.net/> is a French website I created to publish tutorials, ideas and stuff. I mostly write about electronics, computer science and technical things.

Current state of the art

At the time of writing, the most used <https://en.wikipedia.org/wiki/Lightweight_markup_language>(lightweight markup language) for writing rich text documents online is <https://en.wikipedia.org/wiki/Markdown>(Markdown).

More and more websites use Markdown for things like blogs, forums, websites, programming tools, documentation, etc. It's a great tool, but not perfect for everything or everyone.

Other markup languages exist and are used by online services, like BBCode (mostly for forums), raw HTML (for blogs), reStructuredText (for webzines / technical documentation) or LaTeX (for white papers).

The problem

All markup languages currently available fall into one of the three following categories:

  • Too close to computer code: The raw text looks like an unreadable mess mixed with special markers (like BBCode, HTML, LaTeX)
  • Too close to a programming language: Writing requires a graphical editor with indentation support and/or other "programming" features (like reStructuredText)
  • Too complicated because of the syntax: You're lost in thousands of rules or syntax variants with small changes / customizations (like AsciiDoc, Markdown)

In some cases, this is perfectly fine. Having a syntax that looks like computer code or requires some programming software is not a problem for a developer or a web designer, for example. And having a complex syntax is fine if you're using it every day to create complex documents or white papers.

But in other cases, it's just not the right tool.

Let's talk about Markdown

N.B. In this paragraph, I will speak about Markdown. Not because I hate Markdown (I don't) but because it's the most used syntax.

The Markdown syntax is really messy by design. It's really a syntax made by one person for one task. There is no "basic grammar": each feature of Markdown was made without real cohesion.

Worse, the Markdown syntax is not well defined and is subject to various interpretations and customizations. The <http://spec.commonmark.org/>(CommonMark) initiative tries to normalize everything, but it's too late. Multiple (conflicting) Markdown syntax variants exist, and most websites use custom additions, which makes things worse.

Also, the Markdown syntax has a fatal design flaw: the syntax is built on top of HTML. <https://en.wikipedia.org/wiki/HTML>(Hypertext Markup Language) is the standard (and only) syntax for creating web pages and web applications. It's the perfect tool for creating websites, but not for writing rich text documents.

Using HTML for writing rich text documents is like using programming software for writing a book. It's definitely possible, but it's really not the smartest decision, unless you're using <https://en.wikipedia.org/wiki/LaTeX>(LaTeX). It's like using a Swiss Army knife to cut down a tree. Possible, but not really practical.

By being built on top of HTML, Markdown is tied to it for better and for worse. It's not really a generic text writing syntax, but more like an "HTML shortcut" syntax for the web.

This is a shame because writers don't write for a specific platform. They write to pass information to someone else. We (writers) need something to write texts with meaning. Independently from any platform or specific technology.

P.S. To be fair and crystal-clear, Markdown was created to be a "web syntax". By design, it was meant to be used with HTML and to be just a shortcut for common things. But as the saying goes, when you have a hammer, everything looks like a nail.

The constraints

Writers are not developers or computer scientists, just writers. The syntax they use must be simple, intuitive and efficient.

Creating yet another programming documentation syntax is doomed by design.

How would you react if you were suddenly required to write all your documents in <https://en.wikipedia.org/wiki/Rich_Text_Format>(RTF) by hand? You would be disappointed and furious because RTF is verbose, difficult and nearly impossible to remember. I'm caricaturing, but really, the same thing applies to Markdown for most users.

We (the developer community) need to think more like non-computer people.

First, we must assume that the user only knows how to use an input device (note, I didn't say "keyboard") and how to write a sentence in their native language.

Second, non-computer people are afraid of most things they use on a computer. They are afraid of making a mistake, clicking the wrong button, the wrong link, etc. We (developers) must keep this in mind.

TL;DR In the context of this document, this implies that the user must feel like they are writing text, not programming.

The idea

The idea of SemanticText is simple: make writing rich text documents simple.

!Remark: The syntax will obviously still be "too complicated" for some users, but it should at least be simple to use for most of them. The goal is not to make the "perfectly simple" syntax, but to make something which doesn't look like computer science.

The main design goals are:

  • Simple to use: Users must be able to type a document without any difficulty, like writing in plain old text.
  • Simple to remember: Users must not be required to remember any complex rules, just some basic rules and only for advanced features.
  • Simple to implement: Implementation must be simple and developer-friendly, with minimal deployment pain.

The technical design goals are:

  • Line based: Parsing must be doable line by line, as the input stream is read.
  • Error safe: For any given input text, an output must be generated, with attached warnings or errors if possible.
  • Safe by design: Dangerous features must be avoided at all costs and common mistakes should be prevented by the design of the syntax itself.

Also, the SemanticText syntax is, as its name suggests, semantic. All graphical decisions must be made by the rendering logic, not by the writer, to avoid falling into the "designer-writer" pitfall. Writers write documents with meaning, developers make websites which work, and graphic designers make these two work together in harmony. Allowing a writer or a developer to do graphical stuff always ends badly.

TL;DR The rendering logic must say "this is how the text will be displayed", not the writer itself.

As a final note, the following constraints must be remembered:

  • No special [WYSIWYG](What You See Is What You Get) or front-end requirement: These will not be available on some platforms
  • No indentation: Indentation is impossible to work with in a web browser or on a mobile device and should be avoided
  • Minimal typing: Less typing equals less pain (especially on mobile devices), and more work done

Syntax elements

The SemanticText syntax defines three types of elements:

  • In-line elements: A pattern in a paragraph of text, used to add special meaning to a specific part of the text.

  • Block elements: A prefix pattern in the first line defines the start of the block, and the block continues until another block is reached. A block can span across following blocks if allowed, like a multi-line title. In this case, the block stops when an incompatible block or a hard stop is reached.

  • Closure: A first line pattern starts the closure context, a second line pattern closes the context, and the text in between is treated as raw data. This is only for code snippets which require special treatment in order to keep the formatting as-is.

In SemanticText, text and in-line elements are grouped into paragraphs. Any number of consecutive lines are merged into a single paragraph. A blank line acts as a separator between two paragraphs, like so:

This is the first line of the first paragraph.
This is the second line of the first paragraph.

This is the first line of the second paragraph.

N.B. How the paragraphs are displayed (hard newline, soft newline, justified, etc) is defined by the rendering logic.
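The paragraph rule above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function name `split_paragraphs` is made up for the example, and it ignores blocks and closures entirely.

```python
def split_paragraphs(text: str) -> list:
    """Group consecutive non-blank lines into paragraphs (lists of lines)."""
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            # Consecutive lines belong to the same paragraph.
            current.append(stripped)
        elif current:
            # A blank line closes the paragraph in progress.
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs
```

The lines are kept separate inside each paragraph on purpose, so the rendering logic stays free to choose how to join them.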

With block elements, because of the "block span" feature, a double blank line acts as a "hard stop", like so:

- First list item of the first list

- Second list item of the first list
- Third list item of the first list


- First list item of the second list

With this syntax, it's possible to have multiple lists or compatible blocks at the same level without merging.

Two blank lines are a "one level" hard stop command, three blank lines are a "two level" hard stop command, and so on. For example, if you have a list in a list and want to stop both of them to start another top-level list, use three blank lines.
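As a rough sketch of the hard stop rule, the number of block levels to close can be derived from the number of blank lines seen before a line. The helper name `hard_stop_levels` is hypothetical and the code only counts blank lines; it does not build the blocks themselves.

```python
def hard_stop_levels(text: str) -> list:
    """Pair each non-blank line with the hard stop level that precedes it.

    0 means no hard stop (zero or one blank line), 1 means two blank
    lines, 2 means three blank lines, and so on.
    """
    result = []
    blanks = 0
    for line in text.splitlines():
        if not line.strip():
            blanks += 1
            continue
        result.append((max(0, blanks - 1), line.strip()))
        blanks = 0
    return result
```

With the list example above, the first three items get level 0 and the first item of the second list gets level 1.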

Escape sequences

In the current proposal, SemanticText does not include any escape sequences support. The syntax itself is made to not conflict with classical writing behaviours.

The only edge case where escape sequences are required is when you want to demonstrate the syntax itself, like in this document. And even in this edge case, only the single back quote would need escaping. It isn't worth the pain.

To keep the door open for future changes, the backslash is currently not used in the syntax.

Miscellaneous

Before processing, any NULL character (U+0000) must be replaced with the Unicode replacement character (U+FFFD) to avoid security issues.

If the first line of the document is a <https://fr.wikipedia.org/wiki/Shebang>(Unix shebang), the line must be discarded before parsing the document.

Any white space at the start or end of a line outside a closure must be ignored.
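A minimal sketch of these three pre-processing rules could look like the following. The function name `preprocess` is an assumption of mine, and the sketch strips every line, so a real implementation would have to skip the stripping step for lines inside a closure.

```python
def preprocess(source: str) -> list:
    """Apply the pre-processing rules and return the cleaned lines."""
    # Replace NULL characters with the Unicode replacement character.
    source = source.replace("\x00", "\ufffd")
    lines = source.splitlines()
    # Discard a Unix shebang if it is the very first line.
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]
    # Simplification: closures are not tracked, so every line is stripped.
    return [line.strip() for line in lines]
```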

The syntax elements

All syntax elements are based on the following four concepts:

  • Square brackets [] for all references / internal links
  • Angle brackets <> for all external links
  • Curly brackets {} for media and images
  • Marker with a colon for all declarations

Titles

  • Title declaration with an automatic ID (block):
# Title level 1
## Title level 2
### Title level 3
#### Title level 4

--> No more than four levels should be allowed; this allows remapping at rendering and avoids deep structures
--> The title text itself cannot include any in-line pattern (just raw text)
--> Can span multiple following text lines until a blank line or another block
--> At least one white space is required after the last hash
--> No white space is allowed between the hashes

  • Title declaration with a custom ID (block):
#id1: Title level 1
##id2: Title level 2
###id3: Title level 3
####id4: Title level 4

--> The ID can be any sequence of digits, upper/lower case letters, dashes and underscores (ASCII slug format)
--> At least one white space is required after the colon
--> One or more white spaces are allowed around the ID
--> No white space is allowed between the hashes
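The title rules can be approximated with a single regular expression. This is a sketch under my own reading of the rules, with hypothetical names (`TITLE_RE`, `parse_title`). Note one consequence of allowing white space around the ID: a title whose first word is immediately followed by a colon is read as a custom ID.

```python
import re
from typing import Optional

# One to four hashes, an optional "id:" part, then at least one white space.
TITLE_RE = re.compile(
    r"^(#{1,4})(?:[ \t]*([A-Za-z0-9_-]+)[ \t]*:)?[ \t]+(\S.*)$"
)

def parse_title(line: str) -> Optional[tuple]:
    """Return (level, custom_id or None, text), or None if not a title."""
    match = TITLE_RE.match(line)
    if match is None:
        return None
    hashes, custom_id, text = match.groups()
    return len(hashes), custom_id, text
```

Lines with five or more hashes, or without white space before the title text, simply do not match and would fall back to plain text.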

  • Title reference (in-line):
[#id]
[#id](Text of the reference)

--> The reference text can be any sequence of text without in-line patterns (just raw text)
--> If the reference text is not specified, the related title text is used instead
--> One or more white spaces are allowed around the brackets, but not between the hash and the ID

Text formatting

  • Emphasis (in-line):
_text_
  • Strong emphasis (in-line):
*text*
  • Inline code (in-line):
`code`
  • Citation (in-line):
"Text"
  • Keyboard shortcuts (in-line):
|F2|
|CTRL+F|
  • Redacted text (in-line):
~text~
  • Highlighted text (in-line):
^text^

Links

  • Link with optional text (in-line):
<URL>
<URL>(Text of the link)

--> If the text of the link is not given, the raw link should be used instead
--> The link text can be any sequence of text without in-line patterns (just raw text)
--> No white space is allowed between the closing bracket and the opening parenthesis
--> One or more white spaces are allowed around the URL and link text

  • Link with title and optional text (in-line):
<Title: URL>
<Title: URL>(Text of the link)

--> The link title can be any sequence of text without in-line patterns or colon (just raw text)
--> One or more white spaces are allowed around the URL, link title and link text

  • Email (in-line):
<foo.bar@example.com>
<foo.bar@example.com>(Text of the link)

--> If the text of the link is not given, the raw link should be used instead
--> The link text can be any sequence of text without in-line patterns (just raw text)
--> No white space is allowed between the closing bracket and the opening parenthesis
--> One or more white spaces are allowed around the email address and link text

  • Email with recipient name (in-line):
<Recipient name: foo.bar@example.com>
<Recipient name: foo.bar@example.com>(Text of the link)

--> The recipient name can be any sequence of text without in-line patterns or colon (just raw text)
--> One or more white spaces are allowed around the email address, recipient name and link text
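To illustrate how the four link forms could be told apart, here is a small sketch. The names `LINK_RE` and `parse_link` are hypothetical, and two simplifications are assumptions of mine: the title (or recipient name) is split off at the first colon followed by white space, and a target containing an at sign without "://" is treated as an email address.

```python
import re
from typing import Optional

# "<...>" optionally followed, without white space, by "(...)".
LINK_RE = re.compile(r"<([^<>]+)>(?:\(([^()]*)\))?")

def parse_link(source: str) -> Optional[dict]:
    """Parse the first link found in the given text."""
    match = LINK_RE.search(source)
    if match is None:
        return None
    inner = match.group(1).strip()
    # A colon followed by white space separates the title from the target.
    parts = re.split(r":\s+", inner, maxsplit=1)
    if len(parts) == 2:
        title, target = parts[0].strip(), parts[1].strip()
    else:
        title, target = None, inner
    # Without explicit text, the raw link is used instead.
    text = (match.group(2) or "").strip() or target
    kind = "email" if "@" in target and "://" not in target else "url"
    return {"kind": kind, "title": title, "target": target, "text": text}
```

A real parser would also need to avoid matching stray angle brackets in ordinary prose, which this sketch does not attempt.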

Media

  • Image with (optional) alternative text (block):
{http://example.com/foobar.jpg}
{http://example.com/foobar.jpg}(Alternative text)

--> The alternative text can be any sequence of text without in-line patterns (just raw text)
--> No white space is allowed between the closing bracket and the opening parenthesis
--> One or more white spaces are allowed around the URL and alternative text
--> Clickable images are not supported by design to allow custom rendering (like lightbox for web content)

  • Embed with type (block):
{video: https://example.com/rick_astley.mp4}
{audio: https://example.com/never_gaonna_give_you_up.mp3}

--> The embed type can be any sequence of digits, upper/lower case letters, dashes and underscores
--> Custom types can be used to include custom media or objects (more on that later)
--> One or more white spaces are allowed around the embed type and URL

  • Embed without type (block):
{https://www.youtube.com/watch?v=dQw4w9WgXcQ}

--> The embed type is detected by using URL patterns for each supported service
--> One or more white spaces are allowed around the URL
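Here is one possible way to resolve the embed type, as a sketch only. The host table and the extension list are hypothetical placeholders (the supported services are up to each implementation), and the alternative text form of images is not handled.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical detection tables, to be replaced by the supported services.
KNOWN_HOSTS = {"www.youtube.com": "video", "youtu.be": "video"}
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

EMBED_RE = re.compile(r"^\{\s*(?:([A-Za-z0-9_-]+)\s*:\s+)?(\S+)\s*\}$")

def detect_embed_type(url: str) -> str:
    """Guess the embed type from the URL when no type is given."""
    parsed = urlparse(url)
    if parsed.hostname in KNOWN_HOSTS:
        return KNOWN_HOSTS[parsed.hostname]
    if parsed.path.lower().endswith(IMAGE_EXTENSIONS):
        return "image"
    return "unknown"

def parse_media(line: str) -> Optional[tuple]:
    """Return (type, url) for a media line, or None if it is not one."""
    match = EMBED_RE.match(line.strip())
    if match is None:
        return None
    explicit_type, url = match.groups()
    return explicit_type or detect_embed_type(url), url
```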

Lettering

  • NotaBene (block):
N.B. Take note of this text

--> Can span multiple following text lines until a blank line or another block
--> Inner text can include in-line patterns

  • PostScriptum (block):
P.S. Simple PostScriptum after some other text

--> Can span multiple following text lines until a blank line or another block
--> Inner text can include in-line patterns

  • Too long; didn't read (block):
TL;DR Summarized explanation for lazy people

--> Can span multiple following text lines until a blank line or another block
--> Inner text can include in-line patterns

Lists

  • Unordered items (block):
- First list element
- Second list element

- Root list element
-- Nested list element
--- Sub nested list element

--> Any change in the nesting level starts a new list at the current level
--> The item block accepts paragraphs of text, with in-line patterns and any other blocks
--> The current item is closed by a new item block declaration at the same nesting level or a hard stop
--> Holes in the nesting levels are ignored

  • Ordered items (block):
1. First list element (numeric)
2. Second list element (numeric)

a. First list element (alphabetic - lower case)
b. Second list element (alphabetic - lower case)

A. First list element (alphabetic - upper case)
B. Second list element (alphabetic - upper case)

--> Any change in the list type starts a new list at the current level
--> The item block accepts paragraphs of text, with in-line patterns and any other blocks, including nested lists
--> The current item is closed by a new item block declaration of the same type or a hard stop
--> Holes in item numbers are ignored; duplicate numbers are also ignored

Quotes

N.B. The quote syntax is deliberately verbose (indentation-like) to discourage long quotes. A good quote only quotes the text strictly required to understand its context, not a whole document.

  • Citation (block):
> Hello world

--> Multiple consecutive quote lines are merged together
--> At least one space is required after the bracket

  • Nested quotes example:
> Top level quote
> > Nested quote

Tables

N.B. This part of the syntax is really cryptic because tables are the most difficult thing to implement. I came up with this "list based" syntax. Pretty understandable, I think.

  • Rows (block):
= One row vertical span
== Two rows vertical span
=== Three rows vertical span
  • Cells (block):
- One cell horizontal span
-- Two cells horizontal span
--- Three cells horizontal span
  • Header cells (block):
+ One header cell horizontal span
++ Two header cells horizontal span
+++ Three header cells horizontal span
  • Example of a complete table:
=
+ Country
+ Country code
+ UTC offset
=
- France
- FR
- +01:00
=
- Germany
- DE
- +02:00
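To show that the "list based" table syntax stays easy to parse line by line, here is a minimal sketch. The name `parse_table` and the dictionary layout are my own choices; how row and cell spans are finally laid out is left to the rendering logic.

```python
import re

CELL_RE = re.compile(r"^(-+|\++)[ \t]+(.*)$")

def parse_table(lines: list) -> list:
    """Turn table lines into a list of rows, each holding its cells."""
    rows = []
    for line in lines:
        line = line.strip()
        if re.fullmatch(r"=+", line):
            # The number of "=" gives the vertical span of the row.
            rows.append({"row_span": len(line), "cells": []})
            continue
        match = CELL_RE.match(line)
        if match and rows:
            marker, text = match.groups()
            rows[-1]["cells"].append({
                "header": marker[0] == "+",  # "+" marks a header cell
                "col_span": len(marker),     # marker length = horizontal span
                "text": text,
            })
    return rows
```

Applied to the complete table above, this gives three rows of three cells each, the first row made of header cells.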

Acronyms

  • Definition (in-line):
[Acronym](Acronym definition)

--> The acronym can be any sequence of digits and upper/lower case letters, separated by dots or dashes if required
--> The definition can be any sequence of text without in-line patterns (just raw text)
--> For multiple acronym definitions, a definition list should be preferred

Definition lists

  • Definition (block):
&Term: Definition

--> The term text can be any sequence of text without in-line patterns (just raw text)
--> The declaration block accepts one or more paragraphs of text, including in-line patterns and/or blocks
--> The declaration block is closed by another term declaration block or a hard stop

  • Reference (in-line):
[&Term]
[&Term](Text of the reference)

--> The term text itself is used as link text if no reference text is provided
--> The reference text can be any sequence of text without in-line patterns (just raw text)

Footnote

  • Definition (block):
^id: Text of the footnote

--> The ID can be any sequence of digits, upper/lower case letters, dashes and underscores (ASCII slug format)
--> The footnote text can be any sequence of text with in-line patterns until another block or a hard stop
--> For better user experience, the footnote ID should link back to the first reference of the footnote

  • Reference (in-line):
[^id]

--> Displays a link to the given footnote at the current text offset
--> For better user experience, the reference marker should display a unique incremental number instead of the raw ID
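The incremental numbering of footnote references can be sketched as follows. The names are hypothetical, and the "[1]" output format is only a placeholder for whatever the rendering logic actually produces.

```python
import re

FOOTNOTE_REF_RE = re.compile(r"\[\^([A-Za-z0-9_-]+)\]")

def number_footnotes(text: str) -> tuple:
    """Replace each [^id] with an incremental number, in order of first use."""
    numbers = {}

    def replace(match):
        # The same ID always maps to the number of its first reference.
        number = numbers.setdefault(match.group(1), len(numbers) + 1)
        return f"[{number}]"

    return FOOTNOTE_REF_RE.sub(replace, text), numbers
```

The returned mapping can then be used to order the footnote definitions and to link each one back to its first reference.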

Figures

  • Definition (block):
.id: Caption

Figure
  • Definition (alternative, block):
.id:
Caption

Figure

--> The ID can be any sequence of digits, upper/lower case letters, dashes and underscores (ASCII slug format)
--> The first following block is the caption, the remaining block(s) are the figure itself
--> The caption block accepts one paragraph of text and in-line patterns, but no blocks
--> The remaining figure block(s) are not restricted in any way; they can be images, tables, or anything else
--> A hard stop (two blank lines) is required to close the figure
--> For better user experience, the rendering code should display an incremental number instead of the raw ID

  • Reference with text (in-line):
[.id](Text of the reference)

--> The reference text can be any sequence of text without in-line patterns (just raw text)
--> The character # must be replaced in the reference text by the displayed figure ID (or number)

Code blocks

  • Verbatim text (closure):
```
Simple code block
```

--> Indentation and spaces at end of lines are preserved and must be displayed as-is

  • Code with highlighting (closure):
``` python
Code block with syntax highlighting
```
  • Code with highlighting and options (closure):
``` python key=value
Code block with syntax highlighting and options
```

--> At least three ticks are required, but if the source code includes the block marker, more ticks can be added to avoid a premature end of the closure
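A code closure can be read with a very small state machine. This sketch assumes (my interpretation, borrowed from how CommonMark treats fences) that the closure ends on a line made only of ticks and at least as long as the opening marker; the names `OPEN_RE` and `parse_code_block` are hypothetical.

```python
import re
from typing import Optional

# Opening line: three or more ticks, optional language, optional options.
OPEN_RE = re.compile(r"^(`{3,})[ \t]*(\S+)?[ \t]*(.*)$")

def parse_code_block(lines: list) -> Optional[dict]:
    """Parse a closure starting at lines[0]; the body is kept as-is."""
    match = OPEN_RE.match(lines[0])
    if match is None:
        return None
    fence, language, raw_options = match.groups()
    options = dict(
        pair.split("=", 1) for pair in raw_options.split() if "=" in pair
    )
    body = []
    for line in lines[1:]:
        stripped = line.strip()
        # Assumed closing rule: only ticks, at least as many as the opening.
        if len(stripped) >= len(fence) and set(stripped) == {"`"}:
            break
        body.append(line)  # indentation and trailing spaces are preserved
    return {"language": language, "options": options, "body": body}
```

If the closing marker is missing, the body simply runs to the end of the input, so an output is still produced.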

Interpreted blocks

Interpreted blocks are used to include special elements like mathematical formulas, interactive images, etc. Interpreted blocks work like code blocks, but instead of displaying the source code, the result of the interpreted source code is displayed. The supported interpreters are implementation-dependent.

!Danger: In any case, user input MUST NOT be trusted. Interpreting the given code MUST NOT result in any harmful side effects on the rendering machine or the user's machine. If you allow raw JavaScript or Python code execution, for example, you're gonna have a really bad day, really quickly.

  • Interpreted code (closure):
%%% latex
Formula and stuff
%%%
  • Interpreted code with highlighting and options (closure):
%%% latex key=value
Formula with options
%%%

--> Like code blocks, but displays the interpreted result instead of the raw source code
--> At least three percent signs are required, but if the source code includes the block marker, more percent signs can be added to avoid a premature end of the closure
--> Must be limited to formulas, UML and other graph syntaxes, with strict limitations to avoid abuses and security issues

Admonitions

Admonitions are used to make a paragraph "catch" the eye of the reader. Admonitions can be used to alert the reader of something really important, a tip, a warning, etc.

  • One line declaration (block):
!Type: Body of the admonition
  • Multi-line declaration (block):
!Type:
Body of the admonition
on multiple lines
  • Mixed declaration (block):
!Type: Body of the admonition
on multiple lines with the
first line also used

--> The declaration block accepts one or more paragraphs of text, including in-line patterns and/or blocks
--> The declaration block is closed by another admonition declaration block or a hard stop
--> The supported types are implementation-dependent, but at least the following types must be supported: Danger, Warning, Remark, Congratulation
--> For better user experience, the type should support i18n to be more writer-friendly across all supported languages

Todo lists

  • Items with state (block):
- ( ) Task pending
- (x) Task done
  • Sub-items with state (block):
- ( ) Task pending
-- ( ) sub task
- (x) Task done

--> Behaves exactly like an unordered list, but with check boxes instead of bullets
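Since todo items are just unordered items with a state marker, recognizing them only takes one pattern. This is a sketch with hypothetical names (`TODO_RE`, `parse_todo_item`), covering only the two states shown above.

```python
import re
from typing import Optional

# Dashes give the nesting level, "( )" or "(x)" gives the state.
TODO_RE = re.compile(r"^(-+)[ \t]+\(([ x])\)[ \t]+(.*)$")

def parse_todo_item(line: str) -> Optional[tuple]:
    """Return (nesting level, done, text), or None for a regular line."""
    match = TODO_RE.match(line.strip())
    if match is None:
        return None
    dashes, state, text = match.groups()
    return len(dashes), state == "x", text
```

A regular list item returns None here, so it can fall through to the normal unordered list handling.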

Facts

Facts are used to highlight something, or to give extra questions or information.

  • Definition (block):
--> A fact about the life or something else

--> The fact text can be any sequence of text with any in-line patterns, but no blocks
--> A fact block is closed by another fact or a hard stop

Excerpt

Excerpt is a writer's term for "preview". An excerpt is used to let the user see some of the content freely before having to do something (like clicking a link, registering an account, or paying for the document).

  • Excerpt cut (end of preview) (block):
Text before the marker is included into the preview

$$$

Text after the marker is displayed with some sort of logic

--> The marker should never be displayed, and if multiple markers are found in a document, only the first should be used
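The excerpt rule can be sketched like this. The function name `split_excerpt` is hypothetical, and two behaviours are assumptions of mine: without any marker the whole document is returned as the preview, and markers inside closures are not treated specially.

```python
def split_excerpt(source: str, marker: str = "$$$") -> tuple:
    """Split a document into (preview, remainder) at the first marker."""
    lines = source.splitlines()
    cut = next(
        (i for i, line in enumerate(lines) if line.strip() == marker), None
    )
    if cut is None:
        # Assumption: no marker means everything is part of the preview.
        return "\n".join(lines).strip(), ""
    before = lines[:cut]
    # Extra markers are dropped so that none is ever displayed.
    after = [line for line in lines[cut + 1:] if line.strip() != marker]
    return "\n".join(before).strip(), "\n".join(after).strip()
```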

Users

Link to a user profile in the context of a website.

  • Profile link (in-line):
@username

--> The username can be any sequence of Unicode code points, excluding white spaces and colons
--> No white space is allowed between the at sign and the username
--> At least one white space is required around the pattern

  • Reference (block):
@username:
Text of the reference

--> Can span one or more blocks, stopped by another user reference or a hard stop

Emojis

  • Named reference (in-line):
:id:
:happy:

--> The reference name can only include lower / upper case letters and digits; the name should be treated as case-insensitive

  • Textual replacement (in-line):
:)

--> When at the start of a line, at least one space is required after the pattern
--> When at the end of a line, at least one space is required before the pattern
--> When in the middle of a line, at least one space is required before and after the pattern
--> May also be used to replace some in-line cosmetics like double dash and ->, etc.
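Both emoji forms can be handled in one short pass, as sketched below. The two tables are hypothetical (the proposal does not list the supported emojis), and unknown names are simply left untouched.

```python
import re

# Hypothetical tables: the supported emojis are implementation-dependent.
NAMED = {"happy": "\U0001F600"}
SMILEYS = {":)": "\U0001F642"}

NAME_RE = re.compile(r":([A-Za-z0-9]+):")

def replace_emojis(line: str) -> str:
    """Replace textual smileys and named references in one line."""
    # Splitting on spaces enforces the "space or line boundary" rule.
    tokens = [SMILEYS.get(token, token) for token in line.split(" ")]
    line = " ".join(tokens)
    # Names are case-insensitive; unknown names are kept as written.
    return NAME_RE.sub(
        lambda m: NAMED.get(m.group(1).lower(), m.group(0)), line
    )
```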

Custom assets

For custom website extensions of the syntax.

  • Include (block):
{type: something}

--> Can be used to include custom objects in the document
--> Follows the embed syntax, but the resource URI can be anything (a single id, a key/value map, etc.)
--> No brackets are allowed in the "something" part

  • External reference (in-line):
[/path/to/ref]
[/path/to/ref#chapter]
[/path/to/ref](Text of the reference)
[/path/to/ref#chapter](Text of the reference)

--> If the text of the reference is not given, the raw path should be used instead
--> The reference text can be any sequence of text without in-line patterns (just raw text)
--> No white space is allowed between the closing bracket and the opening parenthesis
--> One or more white spaces are allowed around the path and reference text
--> Can be used to link articles of the same context (website, bundle of documents, etc.)
--> A #id can be added at the end of the path to link to a specific chapter of a remote document
--> The path must start with a single slash

Document comments

  • Declaration (block):
;;; Comment text

--> The comment may be displayed to the reader according to the rendering logic
--> The comment text can be any sequence of text with any in-line patterns, but no blocks
--> A comment block is closed by another comment or a hard stop

Annexes

Things ignored because they're not semantic:

  • Align left, right, center and justify (truly cosmetic, should be set once and for all at rendering level)
  • Make upper-case, lower-case, capitalize (same as above)
  • Direction rtl / ltr (same as above)
  • Colours (Google hates this and abuses can be very nasty)
  • Size (Very bad idea, abuses guaranteed)

Things removed from previous revisions:

  • Spoilers: graphical by design, will not work if printed for example, also not really useful
  • Sub and sup script: No idea how to expose them, maybe in the inline formula as an operator or something
  • Underlined and overlined text: difficult to read, no real meaning, prefer the simple/strong emphasis principle
  • Metadata support: may be useful in rare edge cases, but kills the "not a programming language" aspect of the syntax