Hive runs a SQL query as a DAG of jobs (and, before Hive 2.0, optionally as MapReduce jobs). The final result of this job, or set of jobs, is always stored in HDFS. Why? The cluster might need to be rebooted between the end of the job and the delivery of its output to the client, or something else might go wrong. To protect against such failures, Hive writes the final output to HDFS; when Hive receives a request for that output (typically over Thrift), it sends the file across.
To write output to HDFS, Hive uses a FileSinkOperator instance. FileSinkOperator is a TerminalOperator, meaning it runs at the end of a job. If multiple tasks are running, each has its own instance of FileSinkOperator (FSO). Whenever you first look at an operator, start with three functions:
- initializeOp()
- process()
- closeOp()
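To make the lifecycle concrete, here is a minimal sketch of how those three hooks divide the work. This is a simplified stand-in, not Hive's actual FileSinkOperator: the class, fields, and the in-memory "commit" are hypothetical, but the call order (initialize once, process per row, close once to finalize output) mirrors the operator contract described above.

```java
import java.util.ArrayList;
import java.util.List;

public class FileSinkSketch {

    // Toy stand-in for Hive's FileSinkOperator (hypothetical, for illustration).
    static class FileSinkOperator {
        private final List<String> buffer = new ArrayList<>();
        private List<String> committed; // stands in for the final file on HDFS

        // Called once before any rows arrive: set up writers/serializers.
        void initializeOp() {
            buffer.clear();
            committed = null;
        }

        // Called once per row: serialize and write the row to the open output.
        void process(String row) {
            buffer.add(row);
        }

        // Called when the task finishes: flush, close, and commit the output.
        void closeOp() {
            committed = new ArrayList<>(buffer);
        }

        List<String> getCommitted() {
            return committed;
        }
    }

    public static void main(String[] args) {
        FileSinkOperator fso = new FileSinkOperator();
        fso.initializeOp();        // before the first row
        fso.process("row1");       // once per row
        fso.process("row2");
        fso.closeOp();             // end of task: output becomes visible
        System.out.println(fso.getCommitted());
    }
}
```

Note that nothing is visible to a reader until closeOp() runs; that ordering is why each task's FSO can write independently and the output only "appears" once the task completes.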