Created
January 30, 2019 22:19
PySpark Pandas UDF Creation
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType


@pandas_udf(returnType=DoubleType())
def predict_pandas_udf(*features):
    """Executes the prediction using numpy arrays.

    Parameters
    ----------
    features : List[pd.Series]
        The features for the model, with each feature in its
        own pandas Series.

    Returns
    -------
    pd.Series
        The predictions.
    """
    # Need a multi-dimensional numpy array for sklearn models.
    X = pd.concat(features, axis=1).values
    # `model` is assumed to be a fitted model already defined on the driver;
    # Spark serializes it along with the UDF and ships it to the executors.
    y = model.predict(X)  # <- This is vectorized. Kachow.
    return pd.Series(y)
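To see what the UDF body does to a single batch, the same logic can be run in plain pandas without a Spark session. This is only a sketch: `SumModel` and `predict_batch` are hypothetical names introduced here, and `SumModel` stands in for whatever fitted sklearn-style model `model` refers to in the gist.

```python
import numpy as np
import pandas as pd


class SumModel:
    """Hypothetical stand-in for a fitted sklearn-style model."""

    def predict(self, X):
        # One prediction per row: the sum of that row's features.
        return np.asarray(X).sum(axis=1)


# Assumption: in the gist, `model` is a real fitted model on the driver.
model = SumModel()


def predict_batch(*features):
    """Same body as the UDF above, minus the pandas_udf decorator."""
    # Stack the per-feature Series side by side: shape (n_rows, n_features).
    X = pd.concat(features, axis=1).values
    return pd.Series(model.predict(X))


# Two feature columns, roughly as Spark would hand them over for one batch.
f1 = pd.Series([1.0, 2.0, 3.0])
f2 = pd.Series([10.0, 20.0, 30.0])
predictions = predict_batch(f1, f2)
```

Each input Series becomes one column of `X`, so the model sees one row per record and returns one prediction per row. In Spark, the decorated function would typically be applied with something like `df.withColumn("prediction", predict_pandas_udf(*feature_columns))`, where `df` and `feature_columns` are placeholders for your own DataFrame and its feature columns.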