@pszemraj
Created August 29, 2024 23:35
Filter a dataset column for LLM refusals in instruct data
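# !pip install -q sentence-splitter
import os

from sentence_splitter import split_text_into_sentences

# Lowercase phrases that mark a refusal when found in the first sentence.
REFUSAL_TERMS = [
    "sorry",
    "i can't",
    "unfortunately,",
    "as a language model",
    "as an ai language model",
    "i cannot",
]


def not_refusal(example) -> bool:
    """Return True if the first sentence of the response has no refusal term."""
    sentences = split_text_into_sentences(example["response"] or "", language="en")
    if not sentences:
        # Empty or missing response: nothing to match, so the row is kept.
        return True
    return not any(term in sentences[0].lower() for term in REFUSAL_TERMS)


# `ds` is assumed to be an already-loaded Hugging Face `datasets.Dataset`
# with a "response" column.
ds = ds.filter(not_refusal, num_proc=os.cpu_count())
ds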
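
To sanity-check the refusal terms without loading a dataset, the filter above can be tried on a few hand-written rows. This is a minimal sketch, not part of the original gist: the `first_sentence` helper is a hypothetical regex stand-in for `split_text_into_sentences` (so the snippet has no third-party dependency), and the sample rows are made up for illustration.

```python
import re

REFUSAL_TERMS = [
    "sorry",
    "i can't",
    "unfortunately,",
    "as a language model",
    "as an ai language model",
    "i cannot",
]


def first_sentence(text: str) -> str:
    """Hypothetical stand-in: cut at the first '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


def not_refusal(example: dict) -> bool:
    """Return True if the first sentence contains none of the refusal terms."""
    sentence = first_sentence(example["response"]).lower()
    return not any(term in sentence for term in REFUSAL_TERMS)


# Made-up sample rows, for illustration only.
rows = [
    {"response": "Sure! Here is a haiku about autumn."},
    {"response": "I'm sorry, but I cannot help with that request."},
    {"response": "The capital of France is Paris. Unfortunately, it rains a lot."},
    {"response": "As an AI language model, I do not have opinions."},
]

kept = [row for row in rows if not_refusal(row)]
for row in kept:
    print(row["response"])
```

Only the first sentence is checked, so the third row is kept even though a later sentence starts with "Unfortunately,". Note that the terms use a straight apostrophe, so a response written with a curly one (as in "I can’t") would not match "i can't".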