Hey, would you mind giving me tips/information on improving the syntax highlighting for my programming language? I've been working on the language pretty seriously for months, but the VS Code highlighting still isn't that good.
Sure! It's kind of a lot for a Reddit comment, but here are some resources that should get the ball rolling:
-
This is a template repo (not an actual GitHub template, so you have to do a manual find-and-replace after cloning/forking) with tooling to generate the `*.tmLanguage.json` files from TypeScript objects. This is a huge help because you can use regex literals to define the patterns (better syntax highlighting, and no more double-escaping everything!), split the grammar into multiple files, write helper functions to remove some boilerplate, etc. There's also a `regex` function there that you can invoke like a tagged template to combine multiple regular expressions, like:

    export const functionCall = {
      match: regex`/${identifier}(?=\()/`,
      name: "entity.name.function.mylang",
    }
-
`@vscode-devkit/nx` & `@vscode-devkit/grammar`
These might be more useful if you're familiar with Nx. The first is an Nx plugin version of the infrastructure from the repo above, so you can `nx generate`/`nx build` your extension/grammar projects, and the second is the type definitions and `regex` function from that repo extracted to an npm library.
-
This is my latest VS Code project, and it's a pretty comprehensive case study — it uses recursive language embedding, multi-line matches, scoping based on indentation level, etc. If there's something tricky that can be done with a VS Code grammar, there's probably an example of it here.
-
TextMate 1.x Manual | Language Grammars
This is the go-to reference for how VS Code/TextMate grammar definitions actually work. All the examples here are written in a different syntax, but it should be pretty trivial to mentally convert them to JSON/TypeScript.
-
TextMate 1.x Manual | Regular Expressions
This is the reference for the exact flavor of regular expressions supported by VS Code/TextMate grammars.
-
This is a general-purpose regex debugging/authoring tool which I use all the time for more complicated expressions. The PCRE flavor isn't an exact match for the VS Code/TextMate engine, but it's close enough that it won't make a difference in the vast majority of cases.
-
For those corner cases where a PCRE expression isn't working the way you would expect, this is a regex tester app that does use the exact same regex flavor as VS Code/TextMate. The UX is not nearly as nice as Regexr, which is why I only use it as a backup.
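To give a feel for what a `regex` tagged-template helper does, here's a minimal sketch. This is my own guess at the idea, not the actual implementation from the template repo, and the `identifier` pattern is just a placeholder. It assumes every interpolated value is a `RegExp` or a string, and that the finished pattern is also valid JavaScript regex syntax (which isn't true for every TextMate pattern).

```typescript
// Hypothetical re-implementation for illustration only; the real helper in the
// template repo may behave differently.
type RegexPart = RegExp | string;

function regex(strings: TemplateStringsArray, ...parts: RegexPart[]): RegExp {
  // Use the raw strings so backslashes like `\(` and `\s` survive untouched
  let source = strings.raw[0];
  parts.forEach((part, i) => {
    // Splice in the source of each interpolated RegExp (or the string as-is)
    source += (part instanceof RegExp ? part.source : part) + strings.raw[i + 1];
  });
  // Allow the template to be wrapped in `/.../` so it reads like a regex literal
  const wrapped = /^\/([\s\S]*)\/$/.exec(source);
  // NOTE: `new RegExp` only accepts JavaScript regex syntax, so this sketch
  // would throw on patterns that rely on TextMate-only features
  return new RegExp(wrapped ? wrapped[1] : source);
}

// Placeholder identifier pattern (an assumption, adjust for your language)
const identifier = /[_a-zA-Z][_a-zA-Z0-9]*/;

// Matches an identifier only when it is immediately followed by `(`
const functionCall = regex`/${identifier}(?=\()/`;

console.log(functionCall.exec("foo(1)")?.[0]); // → "foo"
```

The point is that you define `identifier` once and reuse it everywhere, instead of copy-pasting the same character classes into a dozen JSON strings.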
And here are some general tips:
-
When deciding what scope name to use for a particular token, I always check to see what TypeScript uses for something similar. TypeScript is Microsoft's baby, and it's the language used to build VS Code, so it has by far the most comprehensive support from the built-in themes and most third-party themes. So if you yoink all of your scope names straight from the TypeScript grammar you're pretty much guaranteed to have good results with any theme instead of having to ask folks to install one that specifically supports your language.
You can see all of the TextMate scopes that are applied to a token by placing your cursor on that token and invoking the Developer: Inspect Editor Tokens and Scopes command from the Command Palette (Ctrl+Shift+P on Windows).
-
99% of the time I use one of three patterns for tokenizing:
-
Simple match, for when you can trivially tokenize a string of characters with a simple expression:

    {
      match: /\b(?:if|else|for|in|switch|case)\b/,
      name: "keyword.control.mylang",
    }
-
Match / captures, for when certain tokens have specific meanings when they're grouped together a certain way. The numbers correspond to the capture groups created by the expression:

    {
      match: /(\.)([_a-zA-Z][_a-zA-Z0-9]*)/,
      captures: {
        // Matches the dot character
        1: { name: "punctuation.accessor.mylang" },
        // Matches the identifier after the dot
        2: { name: "variable.property.mylang" },
      }
    }
-
Begin / End / Patterns. This is the most complicated one but also probably the one I use the most. The `begin` pattern marks the start of some chunk of code, the `end` pattern marks its end, and `patterns` lets you selectively include the patterns (from your repository or inline) that will be used to tokenize everything that comes in between.

This type of pattern does not give a damn about line breaks: it will keep tokenizing until it hits something that matches `end`.

`beginCaptures`/`endCaptures` let you tokenize the `begin`/`end` matches themselves, in the same format as the `captures` key in the previous example.

    export const funcSignature = {
      // Matches `fn foo(`
      begin: regex`/(fn)\s+(${identifier})\s*(\()/`,
      beginCaptures: {
        // Matches the `fn`
        1: { name: "storage.type.function.mylang" },
        // Matches the identifier
        2: { name: "entity.name.function.mylang" },
        // Matches the `(`
        3: { name: "punctuation.definition.parameters.begin.mylang" },
      },
      end: /\)/,
      endCaptures: {
        // `0` in a `captures` block matches the entire expression
        0: { name: "punctuation.definition.parameters.end.mylang" },
      },
      patterns: [
        // Include the `comments` pattern from your `repository`
        { include: "#comments" },
        // Inline pattern for a parameter declaration
        {
          begin: regex`/(${identifier})\s*(:)/`,
          beginCaptures: {
            // The parameter name
            1: { name: "variable.parameter.mylang" },
            // The `:`
            2: { name: "keyword.operator.type.annotation.mylang" },
          },
          // Using a lookahead to match either a comma or a closing paren. Because it's a lookahead,
          // the parent pattern will still be able to match the closing paren itself -- this pattern
          // gets popped as soon as it sees that the _next_ character matches
          end: /(?=[,\)])/,
          // Nested pattern inclusion for the type itself, since those can be pretty complicated
          // (type arguments, namespaces and scope operators, etc.)
          patterns: [
            // We'll need to repeat anything that was valid for the enclosing pattern
            // that's also valid here
            { include: "#comments" },
            // Types are used in lots of different places, so that pattern also has its own key in
            // the repository
            { include: "#types" },
          ],
        },
        // We could include a whole `punctuation` pattern here, but a comma is the only other token
        // that's actually valid in this pattern, so we'll just define it inline
        {
          match: /,/,
          name: "punctuation.separator.parameter.mylang",
        },
      ],
    }
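Since the `{ include: "#comments" }` entries above point at keys in a `repository`, here's a rough sketch of how the top-level grammar object could tie everything together and get written out as `*.tmLanguage.json`. The `Pattern`/`Grammar` types, the `toTmLanguageJson` helper, the `#`-style comment pattern, and the `source.mylang` scope name are all made up for illustration; the template repo's actual tooling may work differently.

```typescript
// Hypothetical, trimmed-down types for illustration only
type Captures = Record<number, { name: string }>;

type Pattern = {
  name?: string;
  match?: RegExp;
  begin?: RegExp;
  end?: RegExp;
  captures?: Captures;
  beginCaptures?: Captures;
  endCaptures?: Captures;
  include?: string;
  patterns?: Pattern[];
};

type Grammar = {
  name: string;
  scopeName: string;
  patterns: Pattern[];
  repository: Record<string, Pattern>;
};

const grammar: Grammar = {
  name: "MyLang",
  scopeName: "source.mylang",
  // Top-level patterns usually just include entries from the repository
  patterns: [{ include: "#comments" }, { include: "#keywords" }],
  repository: {
    // Assumes `#` line comments; swap in whatever your language uses
    comments: { match: /#.*$/, name: "comment.line.number-sign.mylang" },
    keywords: {
      match: /\b(?:if|else|for|in|switch|case)\b/,
      name: "keyword.control.mylang",
    },
  },
};

// The JSON file needs plain strings, so swap every RegExp for its source.
// JSON.stringify then takes care of the double-escaping for you.
function toTmLanguageJson(g: Grammar): string {
  return JSON.stringify(
    g,
    (_key, value) => (value instanceof RegExp ? value.source : value),
    2,
  );
}

console.log(toTmLanguageJson(grammar));
```

Each `include: "#name"` looks up `name` in the `repository` of the same grammar, so a pattern like `#comments` only has to be defined once no matter how many begin/end blocks pull it in.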
-
That's a lot, I know (and I don't even want to know how badly Reddit is going to butcher my Markdown formatting). Not everything, but between that and the links above hopefully it's enough to get you started. Feel free to DM me if you have any specific questions or issues!