@timothyrenner
Created January 30, 2019 22:19
Pyspark Pandas UDF Creation
import pandas as pd

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType


@pandas_udf(returnType=DoubleType())
def predict_pandas_udf(*features):
    """ Executes the prediction using numpy arrays.

    Parameters
    ----------
    features : List[pd.Series]
        The features for the model, with each feature in its
        own pandas Series.

    Returns
    -------
    pd.Series
        The predictions.
    """
    # Need a multi-dimensional numpy array for sklearn models.
    X = pd.concat(features, axis=1).values
    # `model` is not defined in this snippet. As long as a fitted model is
    # defined somewhere in the driver we're good: Spark serializes it along
    # with this function and ships it to the executors.
    y = model.predict(X)  # <- This is vectorized. Kachow.
    return pd.Series(y)
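
The body of the UDF can be tried on a single batch without starting Spark. The sketch below is a minimal illustration, not part of the original gist: `StubModel` and `predict_batch` are hypothetical names, and the stub just sums each row so the result is easy to check by hand. A real run would use a fitted sklearn-style model in its place.

```python
import numpy as np
import pandas as pd


class StubModel:
    """Hypothetical stand-in for a fitted sklearn-style model."""

    def predict(self, X):
        # Sum the features of each row; a real model would run inference here.
        return np.asarray(X, dtype=float).sum(axis=1)


# Assumption: in the gist, `model` is a fitted model living on the driver.
model = StubModel()


def predict_batch(*features):
    # Same body as predict_pandas_udf, minus the Spark decorator.
    # Each feature arrives as its own pd.Series; concat builds the 2D array.
    X = pd.concat(features, axis=1).values
    return pd.Series(model.predict(X))


# Two feature columns with three rows, as one batch would be handed over.
f1 = pd.Series([1.0, 2.0, 3.0])
f2 = pd.Series([10.0, 20.0, 30.0])

predictions = predict_batch(f1, f2)
print(predictions.tolist())
```

In Spark, the decorated function would typically be applied to columns, for example `df.withColumn("prediction", predict_pandas_udf(*[col(c) for c in feature_cols]))`, where `df` and `feature_cols` are placeholders for your own DataFrame and list of feature column names.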