UUID
is probably a string
ShuffleStart(
keyFields Array<String>,
rowType (virtual) Struct,
TypedCodecSpec codecAndEType
): id UUID
ShuffleWrite(
id UUID,
partitionId PInt64,
attemptId PInt64,
rows Stream<Struct>
): Unit
ShuffleWritingFinished(
id UUID,
outPartitions PInt64
): Array<rowType.select(keyFields)>
ShuffleRead(
id UUID,
partitionId PInt64
): Stream<Struct>
ShuffleDelete(
id UUID
): Unit
How does IR manage resources? Should Shuffle
be a new type that performs a
network call when it is freed? My current thinking is that ShuffleDelete can be
inserted by the TableIR lowerer. I believe it will see all potential reads of
the shuffle (effectively, the transitive closure of the parents of the
TableKeyBy
or TableOrderBy
that is being lowered to a pair of stages using a
ShuffleWrite and a ShuffleRead).
ShuffleWritingFinished
returns an array of length outPartitioning + 1
. Partition i
's bounds are [result[i], result[i+1])
.
ShuffleRead
is permitted to pass any partitionId in [0, outPartitions). It may
read the same partition multiple times by multiple actors. The stream returned
by each partition contains only records whose keys fall within the partition
bounds (left-inclusive, right-exclusive) returned by
ShuffleWritingFinished
. Each partition will have roughly the same number of
records.