@lancejpollard
Created August 3, 2024 23:48
Groq and easy LLaMa in Node.js
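import 'dotenv/config'
import { generateText } from 'ai'
import { createOpenAI as createGroq } from '@ai-sdk/openai'
import fs from 'fs/promises'
import _ from 'lodash'

// Resolve after `ms` milliseconds; used to space out API requests.
const wait = ms => new Promise(res => setTimeout(res, ms))

// Groq exposes an OpenAI-compatible endpoint, so the OpenAI provider works
// once it is pointed at Groq's base URL.
const groq = createGroq({
  baseURL: 'https://api.groq.com/openai/v1',
  apiKey: process.env.GROQ_API_KEY,
})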
// fine-tuning with groq: https://chatgpt.com/c/d4729271-b256-4654-8c8c-5b6a12983f9c
// https://sdk.vercel.ai/docs/guides/llama-3_1
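// make() is a hoisted function declaration, so it can be called before its
// definition. Report startup failures (missing files, invalid JSON) explicitly.
make().catch(e => {
  console.error(e)
  process.exitCode = 1
})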
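async function make() {
  // Source dictionary entries; each has a `word` and an array of `sense` objects.
  const definitions = JSON.parse(
    await fs.readFile(
      `import/language/sanskrit/sanskrit.inria.json`,
      `utf-8`,
    ),
  )
  // Glosses generated by earlier runs, so the script resumes where it stopped.
  const outs = JSON.parse(
    await fs.readFile(`import/language/sanskrit/gloss.json`, `utf-8`),
  )
  for (const entry of definitions) {
    // Skip entries that already have a gloss.
    if (outs.some(x => x.slug === entry.word)) {
      continue
    }
    // Split each sense on semicolons, flatten, and drop empty or missing values.
    const senses = entry.sense
      .map(x => x.text?.split(/\s*;\s*/))
      .flat()
      .filter(x => x)
    if (!senses.length) {
      continue
    }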
    try {
      // Throttle requests: wait 500 ms before each call.
      await wait(500)
      const { text } = await generateText({
        model: groq('llama-3.1-70b-versatile'),
        prompt: `Extract the definitions from this set of messy definitions. Ignore non-English text such as Devanagari or IAST Sanskrit text. Extract only the English clearly and simply. Don't base your answer on past responses you've given me. Strip out the Devanagari text and junk, and make the final set of definitions a small set. Don't include the definition if it says "See x", linking to other definitions. Simplify each definition if it's not already simplified, to ideally 1-3 word definitions, removing or merging duplicate or similar definitions where applicable. Do not include duplicate definitions. Omit otherwise meaningless text from the input, which doesn't appear to be English. It's okay if the definition is longer than 3 words if it can't easily be shortened. Send the definitions as a JSON array of strings under the "definitions" key. Finally, take your summarized definitions, and summarize those into one short definition (ideally also 1-3 words), and return that under the "gloss" JSON key. For the gloss, don't create the gloss to be too different in meaning from the original definitions; make the gloss more descriptive only if necessary to match the definition. Otherwise keep the gloss short. Format all definitions and the gloss in lowercase unless it is a proper name, and don't use abbreviations where they can be easily expanded to the normal word. And then also, add the part of speech of the word that's being defined in the "role" field. Also, be conservative. If you can't figure it out clearly and simply, then return an empty JSON object (i.e. {}). Better to be safe than sorry. That's it. So only return one JSON object with 3 keys: gloss, definitions, and role. Don't use slashes / or semicolons ; or colons or any periods or non-text in any of the responses. Also, don't add Wylie text. Don't abbreviate words either. Only return JSON and nothing else, don't write prose or send backticks, just JSON. Here is the input as JSON: ${JSON.stringify(
          senses,
          null,
          2,
        )}`,
      })
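      try {
        // The model may wrap its JSON in a Markdown code fence despite the
        // prompt, so unwrap the fence (and an optional "json" tag) if present.
        // This replaces the deprecated RegExp.$1 lookup with the match result.
        const fenced = text.match(/`{3}(?:json)?\s*([\s\S]+?)`{3}/)
        const out = JSON.parse(fenced ? fenced[1] : text)
        // A missing or blank gloss means the model declined to answer.
        if (!out.gloss?.trim()) {
          continue
        }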
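        outs.push({
          slug: entry.word,
          ...out,
          // Deduplicate definitions and normalize spacing around slashes.
          definitions:
            out.definitions &&
            _.uniq(out.definitions).map(x =>
              x.replace(/\s*\/\s*/g, ' / '),
            ),
        })
        console.log(outs[outs.length - 1])
        // Rewrite the whole output file after every entry so progress is kept
        // if the run is interrupted.
        await fs.writeFile(
          `import/language/sanskrit/gloss.json`,
          JSON.stringify(outs, null, 2),
        )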
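      } catch (e) {
        // Parsing or saving failed: log the raw reply and the error, then move on.
        console.log(text)
        console.log(e)
      }
    } catch (e) {
      // The API request itself failed: log it and continue with the next entry.
      console.log(e)
    }
  }
}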
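
The reply-parsing step in the script above can be isolated into a small helper, which makes it easy to check without calling the API. This is only a sketch: `extractJson` and the sample replies are hypothetical names and data introduced here for illustration, not part of the original script, and it assumes the model returns either bare JSON or JSON wrapped in a single Markdown code fence.

```javascript
// Hypothetical helper (not in the original script): pull a JSON value out of
// a model reply that may or may not be wrapped in a Markdown code fence.
function extractJson(reply) {
  // `{3} matches three backticks; the optional "json" tag is skipped and the
  // lazy capture group stops at the first closing fence.
  const fenced = reply.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/)
  const candidate = (fenced ? fenced[1] : reply).trim()
  try {
    return JSON.parse(candidate)
  } catch {
    // Unparseable reply: return null and let the caller decide what to do.
    return null
  }
}

// Made-up sample replies for illustration.
const fence = '`'.repeat(3)
const wrapped =
  fence + 'json\n{"gloss":"fire","definitions":["fire"],"role":"noun"}\n' + fence
const bare = '{"gloss":"water","definitions":["water"],"role":"noun"}'

console.log(extractJson(wrapped))
console.log(extractJson(bare))
console.log(extractJson('sorry, no definitions found'))
```

Returning `null` instead of throwing keeps the calling loop simple: a single `if (!out?.gloss?.trim()) continue` would then cover both unparseable replies and empty answers.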