This solutions is based on the behaviour of EXPLODE
in SparkSQL and the premise that the UDFs can return complex types like lists, structs and maps (this types are represented on wire as JSON type)
DIFF_TREE(repository_id, from<revision>, [to<revision>]
-> something similar to []git.ChangeDIFF(repository_id, from<revision>, [to<revision>]
-> []struct{file_path, marker, line, content}COMMIT_FILES_STATS(repository_id, from<revision>, [to<revision>]
-> []struct{file_path, additions, deletions}BLAME(repository_id, from<revision>, [file_path]]
-> []struct{file_path, line, author_name, when, content}
The queries are originale extracted from this document
Get all files in HEAD with the timestamp of their last change (repository, file path, timestamp)
NOTE: In fact his queries returns the last files changed by HEAD, but this is how was done in the google docs example queries
SELECT
repository_id,
file_path_to,
committer_when
FROM (
SELECT *, EXPLODE(DIFF_TREE(repository_id, commit_hash))
FROM commits
NATURAL JOIN refs
WHERE ref_name = 'HEAD'
) as raw
A more acurate anweser is:
SELECT
repository_id,
file_path_to,
MAX(committer_when)
FROM (
SELECT *, EXPLODE(BLAME(repository_id, commit_hash))
FROM commits
NATURAL JOIN refs
WHERE ref_name = 'HEAD'
) as raw
GROUP BY repository_id, file_path_to;
Get the number of lines added and deleted per developer and language (repository, author email, language, lines added, lines deleted). It would be ok to do two different queries for added and deleted, but both in the same query would be ideal.
SELECT
commit_author_email,
x.marker,
LANG(blob_content, file_path) AS lang,
COUNT(1) as lines
FROM (
SELECT *, EXPLODE(DIFF(repository_id, commit_hash))
FROM commits
NATURAL JOIN commit_files
WHERE
marker = '>' OR
marker = '<' OR
marker = '|'
) as raw
GROUP BY
commit_author_email,
lang,
x.marker
Version with
COMMIT_FILE_STATS
SELECT
commit_author_email,
SUM(additions), SUM(deletions)
LANG(blob_content, file_path) AS lang,
COUNT(1) as lines
FROM (
SELECT *, EXPLODE(COMMIT_FILES_STATS(repository_id, commit_hash))
FROM commits
NATURAL JOIN commit_files
) as raw
GROUP BY commit_author_email, lang
Get all added lines per developer and language (repository, author email, language, line). Note that this is not the number of lines, but the lines themselves.
SELECT
repository_id,
author_email,
LANG(blob_content, file_path) as lang,
line
FROM (
SELECT *, EXPLODE(DIFF(repository_id, commit_hash))
FROM files
NATURAL JOIN commit_files
NATURAL JOIN commits
) as raw
WHERE d.marker = '>'