caseyliqb/PySpark Dataframes from Scratch.md

Last active February 27, 2020 12:13

Why is this so hard to remember?

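from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# The pyspark shell predefines `sc`; a standalone script has to create a session itself
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# two rows, each a tuple of two strings
rdd = sc.parallelize([("moo this has stopwords b", "bat this one does not"),
                      ("apple orange banana", "cookie jar bla la")])

# both columns are nullable strings
schema = StructType([StructField('entity', StringType(), True),
                     StructField('cleaned_entity', StringType(), True)])

# create dataframe
df3 = spark.createDataFrame(rdd, schema)

# print full cell contents; show() alone truncates strings longer than 20 characters
df3.show(truncate=False)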

Author

caseyliqb commented Feb 27, 2020

Output (as printed by `df3.show(truncate=False)`):

+------------------------+---------------------+
|entity                  |cleaned_entity       |
+------------------------+---------------------+
|moo this has stopwords b|bat this one does not|
|apple orange banana     |cookie jar bla la    |
+------------------------+---------------------+