@oleg-agapov
Created November 18, 2020 18:39
data.csv:
date,url
2020-01-01,github.com
2020-01-02,google.com
(venv) Olegs-MacBook-Pro:pyspark-example oagapov$ pyspark
Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
20/11/18 19:31:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Python version 3.6.8 (default, Dec 29 2018 19:04:46)
SparkSession available as 'spark'.
>>>
>>>
>>> spark
<pyspark.sql.session.SparkSession object at 0x105421b00>
>>> spark.version
'3.0.1'
>>>
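The same read that script.py (shown further down) performs can also be tried interactively, since the shell already provides the spark session. A minimal sketch, assuming data.csv sits in the directory where pyspark was launched:

>>> df = spark.read.option("header", True).csv("data.csv")
>>> df.show()

This should print the same two-row table that appears in the spark-submit output below.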
(venv) Olegs-MacBook-Pro:pyspark-example oagapov$ spark-submit script.py
20/11/18 19:35:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/11/18 19:35:56 INFO SparkContext: Running Spark version 3.0.1
...
+----------+----------+
| date| url|
+----------+----------+
|2020-01-01|github.com|
|2020-01-02|google.com|
+----------+----------+
...
20/11/18 19:36:11 INFO SparkContext: Successfully stopped SparkContext
20/11/18 19:36:11 INFO ShutdownHookManager: Shutdown hook called
20/11/18 19:36:11 INFO ShutdownHookManager: Deleting directory /private/var/folders/qq/gd5283g11hs_r75wbvr616gm0000gn/T/spark-10e06298-841f-4587-b4be-ff5152eeb894
20/11/18 19:36:11 INFO ShutdownHookManager: Deleting directory /private/var/folders/qq/gd5283g11hs_r75wbvr616gm0000gn/T/spark-10e06298-841f-4587-b4be-ff5152eeb894/pyspark-fbb2fd46-21a0-4485-a630-2ca92301f05d
20/11/18 19:36:11 INFO ShutdownHookManager: Deleting directory /private/var/folders/qq/gd5283g11hs_r75wbvr616gm0000gn/T/spark-e22c785e-264a-498a-84f1-16a63f88fd47
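Most of the lines above are INFO noise from Spark's default log4j profile. As the startup message suggests, the level can be raised from inside the script once the session exists; a minimal sketch, assuming the spark variable from script.py below:

# keep only warnings and errors in the spark-submit output
spark.sparkContext.setLogLevel("WARN")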
script.py:

from pyspark.sql import SparkSession

def main():
    # run Spark locally, with a descriptive application name
    spark = SparkSession.builder \
        .master('local') \
        .appName('Local test') \
        .getOrCreate()
    # read the sample CSV, treating the first line as column names
    df = spark.read.option("header", True).csv("data.csv")
    df.show()

if __name__ == "__main__":
    main()
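With only the header option, every column in df comes back as a string. A sketch of passing an explicit schema instead, so that date is parsed as a real date; StructType and DateType are standard PySpark, but the chosen column types are an assumption about the sample data:

from pyspark.sql.types import StructType, StructField, DateType, StringType

# assumed types for the two columns in data.csv
schema = StructType([
    StructField("date", DateType(), True),
    StructField("url", StringType(), True),
])

df = spark.read.option("header", True).schema(schema).csv("data.csv")
df.printSchema()  # date: date, url: string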
python3 -m venv venv
source venv/bin/activate
pip install pyspark
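A quick sanity check that the install worked and which version pip resolved (it should match the 3.0.1 shown in the banners above, assuming pip picked that release):

python -c "import pyspark; print(pyspark.__version__)"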