Alan Lam lam0819

lam0819 / chunking-regex.ts
Created August 15, 2024 04:59 — forked from hanxiao/testRegex.js
Use regex to do chunking by using all semantic cues
// Used in https://jina.ai/tokenizer (Aug. 14th version)
// Named constants for the length and count limits used in the chunking regex (replaces magic numbers)
const MAX_HEADING_LENGTH = 6;
const MAX_HEADING_CONTENT_LENGTH = 200;
const MAX_HEADING_UNDERLINE_LENGTH = 200;
const MAX_HTML_HEADING_ATTRIBUTES_LENGTH = 100;
const MAX_LIST_ITEM_LENGTH = 200;
const MAX_NESTED_LIST_ITEMS = 5;
const MAX_LIST_INDENT_SPACES = 7;
const MAX_BLOCKQUOTE_LINE_LENGTH = 200;