@dyerrington
Created July 17, 2019 20:27

Great Data Science Project Criteria:

  • Problem statement that defines a measurable and/or falsifiable outcome. “Frequency of [specific event] is influential over [some outcome]”. “Users who use [some feature in app] are differentiable from users who less frequently use [some feature in app]”. etc. If you can’t frame a data problem properly, none of it has purpose. The biggest challenge in data science is making sense of and defining the gray area of business problems. This also comes with experience.
  • EDA EDA EDA. Define your scope. Report only what is necessary and relevant to your problem statement. If the model reports only 4-5 common variables as parameters (logistic regression for instance), focus on those when summarizing your work in terms of EDA.
  • How much data is necessary to make this analysis work? Are you sampling? Is a t-test or a rank-order test necessary to gain assurance?
  • Explain which model makes the most sense to use. Are you trying to gain inference about a data problem? Data problems that focus too heavily on prediction aren’t really data science unless you have some kind of business need that can be backed by scientific process. The power of the data science process is that you can make distinctions about what is valuable and what is not. Modeling is just that. You have to really understand the data well and rely on your ability to see patterns and assert your reasoning about a pattern before you model. If you can adopt that mindset, you are already better than 90% of candidates that I look at.
  • Modeling: know the model inside and out and be able to code it from the ground up. Modeling is handled by libraries and anyone can import them, fit, predict, run metrics, etc., but far fewer people actually know what the hell is going on under the hood. Which solver makes the most sense for logistic regression given the data and business need? Knowing the math and engineering is essential but so is understanding what is important to the business. This should guide your choices. Someone who just imports a library and starts fitting with gridsearch isn’t doing anything an engineer can’t do. Show me your ability to work in the weeds but also see the big picture.
  • Communicating results. You have to be able to tie each and every single piece of your analysis to your data and business problems at every step of the way. The summary of your work should represent how successful the analysis is and convey a coherent thread that explains the reasoning and substantiates your statement of work. Also explain how successful your analysis was in terms of key metrics (hopefully you came up with some in the beginning and explained your reasoning!).
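
The questions in the list above about how much data is needed and whether a rank-order test is appropriate can be made concrete with a small sketch. This is only an illustration under stated assumptions, not a recommended workflow: it uses the standard normal-approximation formula for the per-group sample size of a two-sided, two-sample comparison of means, and a plain Mann-Whitney U statistic with a normal-approximation p-value (no tie or continuity correction). The function names, the effect size of 0.5, and the toy samples are all hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2,
    where d is the standardized effect size (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def mann_whitney_u(a, b):
    """U statistic for sample a: count pairs (x, y) with x > y, ties count 1/2."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def mann_whitney_p(a, b):
    """Two-sided p-value from the normal approximation to U.

    Assumption: no tie correction and no continuity correction, so this is
    rough for small or heavily tied samples.
    """
    n1, n2 = len(a), len(b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (mann_whitney_u(a, b) - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical inputs, chosen only for illustration.
n_needed = sample_size_per_group(effect_size=0.5)
group_a = [1.2, 2.4, 2.9, 3.1, 3.8]
group_b = [3.5, 4.1, 4.4, 5.0, 6.2]
print(n_needed, mann_whitney_u(group_a, group_b), mann_whitney_p(group_a, group_b))
```

A rank-based test like this one only uses the ordering of the observations, which is why it is a common fallback when the normality assumption behind a t-test is doubtful. The sample-size formula, by contrast, assumes roughly normal data and a known standardized effect, so treat its output as a planning estimate.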

Any project that follows these guidelines demonstrates skills that are the bare minimum essential to almost every data science role.
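
As one possible reading of "code it from the ground up" in the modeling criterion, here is a minimal sketch of logistic regression fitted by batch gradient descent on the mean log loss, using only the Python standard library. It is an assumption-laden illustration, not the method any particular library uses: library solvers are typically quasi-Newton or coordinate-descent based and add regularization. The function names, learning rate, iteration count, and toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(X, w, b):
    """P(y = 1 | x) for each row of X under weights w and bias b."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit weights and bias by batch gradient descent on the mean log loss.

    Assumption: no regularization and a fixed step size, so this is a
    teaching sketch rather than a production solver.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        probs = predict_proba(X, w, b)
        # Gradient of the mean log loss: average of (p_i - y_i) * x_i.
        errors = [p - yi for p, yi in zip(probs, y)]
        grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / n for j in range(d)]
        grad_b = sum(errors) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical one-feature toy data: small values are class 0, large are class 1.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
probs = predict_proba(X, w, b)
print(w, b, [round(p, 3) for p in probs])
```

Writing the update by hand makes the solver question tangible: the step size, the number of iterations, and the absence of a penalty term are all explicit choices here, whereas a library call hides them behind defaults.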
