Koba Khitalishvili (KobaKhit)
KobaKhit / snowflake-classify-text.md
Last active September 16, 2024 16:06
Text Classification in Snowflake SQL

Classifying Text in Snowflake SQL

With LLMs becoming available in Snowflake as part of their Cortex suite of products, in this piece we will explore what the experience is like when classifying text. First, Snowflake has a native CLASSIFY_TEXT function that does exactly what it says when given a piece of text and an array of possible categories. Second, one could classify text using embeddings (EMBED_TEXT_768) and similarity to possible categories calculated by one of the distance functions, such as cosine similarity (VECTOR_COSINE_SIMILARITY). Finally, when going the embeddings + similarity route, we could either use a cross join with a categories table or create a column for each category's similarity score and then assign the category with the greatest one. So we have three approaches.
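Snowflake specifics aside, the embeddings + similarity route boils down to the following plain-Python sketch. The 3-dimensional vectors and category names here are made up purely for illustration; in Snowflake the vectors would come from EMBED_TEXT_768 and the similarity from VECTOR_COSINE_SIMILARITY.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def classify(text_embedding, category_embeddings):
    # Assign the category whose embedding is most similar to the text's.
    return max(category_embeddings,
               key=lambda c: cosine_similarity(text_embedding, category_embeddings[c]))

# Toy "embeddings", not real model output.
categories = {
    "sports":  [0.9, 0.1, 0.0],
    "finance": [0.1, 0.9, 0.1],
}
print(classify([0.8, 0.2, 0.1], categories))  # sports
```

The cross-join variant scores every (text, category) pair and keeps the max per text; the column-per-category variant computes the same scores side by side and picks the greatest one.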

KobaKhit / sql-spines.md
Created February 8, 2024 02:58
Examples of generating spines/dates in SQL. Assisted by Caleb Kassa.

Spines in SQL

Given a starting date 2024-02-01, I would like to generate 7 days into the future, up to February 8th (2024-02-08), e.g.

dt
2024-02-01
2024-02-02
2024-02-03
2024-02-04
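The same spine can be sketched outside of SQL; a minimal Python version of the 2024-02-01 through 2024-02-08 example:

```python
from datetime import date, timedelta

def date_spine(start, end):
    # One date per day from start to end, inclusive.
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

for d in date_spine(date(2024, 2, 1), date(2024, 2, 8)):
    print(d.isoformat())  # 2024-02-01, 2024-02-02, ..., 2024-02-08
```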
KobaKhit / repartition_pyspark_dataframe.py
Last active August 27, 2020 23:59
Repartition skewed pyspark dataframes.
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql import Window
from functools import reduce

def partitionIt(size, num):
    '''
    Create a list of partition indices, each partition of size num, where the number of partitions is ceiling(size/num)
    Args:
        size (int): number of rows/elements
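The preview above cuts off mid-docstring, but the core trick can be sketched without Spark. This plain-Python version is my reconstruction, assuming the helper simply buckets row numbers into fixed-size groups:

```python
import math

def partition_indices(size, num):
    # Bucket `size` row numbers into groups of at most `num` rows each;
    # the number of groups is ceil(size / num).
    return [i // num for i in range(size)]

print(partition_indices(10, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
print(math.ceil(10 / 3))         # 4 groups
```

In the gist these indices would presumably be attached to rows via row_number over a Window and then used to repartition the dataframe evenly, which is why the Spark imports appear above.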
KobaKhit / tableau_server_export.py
Last active October 18, 2023 18:36
A simple class that enables you to download workbooks or view CSVs from a Tableau Server.
import tableauserverclient as TSC
import pandas as pd
from io import StringIO
class Tableau_Server(object):
    """Download workbooks or view CSVs from a Tableau Server."""
    def __init__(self, username, password, site_id, url, https=False):
        super().__init__()  # http://stackoverflow.com/questions/576169/understanding-python-super-with-init-methods
KobaKhit / visualforce_embed_with_user.html
Last active June 13, 2019 21:13
Create a dynamic embed in Visualforce which displays information by user
<apex:page>
  <html>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>
    <!-- User Id in a span -->
    <span id="user" style="display: none;">
      <apex:outputText label="Account Owner" value="{!$User.Id}"></apex:outputText>
    </span>
    <!-- Embed placeholder -->
KobaKhit / reddit_posts_and_comments.py
Created October 24, 2018 19:42
A class that enables the user to download posts and comments from a subreddit
import praw

class Reddit():
    def __init__(self, client_id, client_secret, user_agent='My agent'):
        self.reddit = praw.Reddit(client_id=client_id,
                                  client_secret=client_secret,
                                  user_agent=user_agent)

    def get_comments(self, submission):
        # get comments information using the Post as a starting comment
        comments = [RedditComment(author=submission.author,
                                  commentid=submission.postid,
KobaKhit / unnest_byseat.R
Last active August 3, 2018 14:27
Example of how to unnest rows by seat or any other array in a cell.
library(tidyr)
setwd("~/Desktop/unnest")
fname = "file-name"
df = read.csv(paste0(fname, '.csv'), stringsAsFactors = F)
df$seats =
  sapply(1:nrow(df), function(x) {
    seats = c(df[x,]$first_seat, df[x,]$last_seat)
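The same unnest can be illustrated in Python. The rows, column names, and seat values below are invented for the example; the idea is one output row per seat in the first_seat..last_seat range:

```python
def unnest_by_seat(rows):
    # Expand each row with a first_seat..last_seat range into one row per seat.
    out = []
    for row in rows:
        for seat in range(row["first_seat"], row["last_seat"] + 1):
            expanded = dict(row)
            expanded.pop("first_seat")
            expanded.pop("last_seat")
            expanded["seat"] = seat
            out.append(expanded)
    return out

orders = [{"order_id": 1, "first_seat": 5, "last_seat": 7}]
print(unnest_by_seat(orders))  # one row each for seats 5, 6, 7
```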
KobaKhit / stubhub_inventory_v2.py
Last active November 14, 2017 14:57
Example of using the StubHub inventory v2 API to download all listings for a given event id.
import requests
import base64
import pprint
import pandas as pd
import json
from tqdm import tqdm
# https://stubhubapi.zendesk.com/hc/en-us/articles/220922687-Inventory-Search
KobaKhit / hmtl_table_parser.py
Last active July 18, 2022 07:25
Parse all html tables on a page and return them as a list of pandas dataframes. Modified from @srome
# http://srome.github.io/Parsing-HTML-Tables-in-Python-with-BeautifulSoup-and-pandas/
class HTMLTableParser:
    @staticmethod
    def get_element(node):
        # for XPATH we have to count only for nodes with same type!
        length = len(list(node.previous_siblings)) + 1
        if length > 1:
            return '%s:nth-child(%s)' % (node.name, length)
        else:
            return node.name
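A dependency-free sketch of the same table-scraping idea, using only the standard library's html.parser instead of BeautifulSoup (the class and variable names are mine, not the gist's):

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    # Collect the text of every <td>/<th> cell, one list per <tr>.
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

p = CellCollector()
p.feed("<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>")
print(p.rows)  # [['a', 'b'], ['1', '2']]
```

From here the row lists convert directly into a pandas DataFrame, which is what the gist's parser returns.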
KobaKhit / Large dataframe to csv in chunks in R
Last active September 7, 2017 19:56
Write a large dataframe to csv in chunks
df = read.csv("your-df.csv")
# Number of items in each chunk
elements_per_chunk = 100000
# List of vectors [1] 1:100000, [2] 100001:200000, ...
l = split(1:nrow(df), ceiling(seq_along(1:nrow(df))/elements_per_chunk))
# Write large data frame to csv in chunks
fname = "inventory-cleaned.csv"
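The R snippet above is truncated before the write loop; as an illustration, here is the same chunked-write pattern in Python (the file naming scheme and chunk size are arbitrary choices for the example, not taken from the gist):

```python
import csv
import math
import os
import tempfile

def write_csv_in_chunks(rows, header, path_template, chunk_size):
    # Split rows into ceil(len(rows) / chunk_size) chunks and write each
    # chunk to its own CSV file, repeating the header in every file.
    paths = []
    for i in range(math.ceil(len(rows) / chunk_size)):
        path = path_template.format(i + 1)
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows[i * chunk_size:(i + 1) * chunk_size])
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as d:
    rows = [[i, i * 2] for i in range(250)]
    paths = write_csv_in_chunks(rows, ["a", "b"], os.path.join(d, "part-{}.csv"), 100)
    print(len(paths))  # 3 files: 100 + 100 + 50 rows
```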