@dzakyputra
Created June 17, 2020 00:07
import pandas as pd

# Profile preprocessing
# Assumes `profile` is a pandas DataFrame that has already been loaded
# Convert became_member_on (stored as YYYYMMDD) to datetime
profile['became_member_on'] = pd.to_datetime(profile['became_member_on'], format='%Y%m%d')
# Number of days between each member's signup date and the most recent signup date
profile['difference_days'] = (profile['became_member_on'].max() - profile['became_member_on']).dt.days
## Find the median and mode per signup date
# Median age per signup date (ages 118 and 101 are still included at this point)
median_age_per_day = profile.groupby('became_member_on', as_index=False)['age'].median()
# Median income per signup date (missing incomes are skipped)
median_income_per_day = profile.groupby('became_member_on', as_index=False)['income'].median()
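# Mode of gender per signup date, falling back to 'M' when a date has no
# unique mode (all values missing, or a tie)
def unique_mode_or_default(values, default='M'):
    modes = values.mode()  # mode() ignores missing values
    return modes.iat[0] if len(modes) == 1 else default

mode_gender_per_day = profile.groupby('became_member_on')['gender'].agg(unique_mode_or_default)
mode_gender_per_day_value = mode_gender_per_day.tolist()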
## Fill the values based on the conditions
# Treat ages 118 and 101 as missing and fill them with the median age of the same signup date
age_reference = dict(zip(median_age_per_day['became_member_on'], median_age_per_day['age']))
profile['age'] = profile['age'].mask(profile['age'].isin([118, 101])).fillna(profile['became_member_on'].map(age_reference))
# Ages still above 100 (the per-date median can itself come from those values) get the overall median
profile.loc[profile['age'] > 100, 'age'] = profile['age'].median()
# Fill the null values in the gender column with the mode of the same signup date
gender_reference = dict(zip(mode_gender_per_day.index, mode_gender_per_day_value))
profile['gender'] = profile['gender'].fillna(profile['became_member_on'].map(gender_reference))
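# Fill the null values in the income column with the median of the same signup date
income_reference = dict(zip(median_income_per_day['became_member_on'], median_income_per_day['income']))
profile['income'] = profile['income'].fillna(profile['became_member_on'].map(income_reference))
# Dates with no known income at all fall back to the overall median
# (assign the result back; inplace fillna on a selected column is unreliable)
profile['income'] = profile['income'].fillna(profile['income'].median())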
# Cast age to int, which drops the fractional part of any imputed median
profile['age'] = profile['age'].astype(int)
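
The income fill above follows a general pattern: fill a missing value with the median of its group, then fall back to the overall median when the whole group is missing. A minimal sketch of that pattern on made-up data is shown below. The helper name `fill_with_group_median` and the `toy` frame are hypothetical and not part of the gist, and `transform('median')` is swapped in for the dict-and-map lookup used above.

```python
import pandas as pd


def fill_with_group_median(df, group_col, value_col):
    """Fill missing values with the group's median, then with the overall median."""
    # Median of each group, aligned to the original rows (NaN if the group has no values)
    group_median = df.groupby(group_col)[value_col].transform('median')
    filled = df[value_col].fillna(group_median)
    # Fallback for groups made up entirely of missing values
    return filled.fillna(filled.median())


# Hypothetical toy data: the second date has no known income at all
toy = pd.DataFrame({
    'became_member_on': pd.to_datetime(
        ['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-02', '2018-01-03']
    ),
    'income': [50000.0, None, 70000.0, None, 40000.0],
})
toy['income'] = fill_with_group_median(toy, 'became_member_on', 'income')
print(toy)
```

Using `transform` keeps the result aligned with the original index, so no intermediate dictionary is needed.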