Netflix PRS conference
Started in 2016; past talks are online
"everything is a recommendation"
80% of what people watch on Netflix comes from recs

# Mounia Lalmas - Director of Research at Spotify (based in London)
Home: help users find content quickly
*nice slide w/ overall view of research -> measurement -> modeling -> optimization -> business*
1. success metrics
*BaRT - McInerney et al. 2018*
- bandits
- find the best card per shelf, and then rank shelves
success = streaming time binarized with a threshold of 30 seconds per playlist (this seems weird to me; 30 seconds per song would make more sense?)
exceptions, e.g. the sleep playlist success threshold is longer
jazz listeners listen longer than other listeners
reward functions:
- one global function
- one per user x playlist
- groups of users x playlists
Used *Dhillon et al. co-clustering, KDD 2003*
Histograms + thresholds
Found that the mean worked best for the threshold (vs. additive or cumulative, which seem like straw-man comparisons to me)
Affinity features (content x user) are better than generic ones (age or day)
*Dragone et al. WWW 2019*
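The binarized success signal above can be sketched in a few lines. This is my own illustration, not Spotify's code; the playlist names and the longer sleep-playlist threshold value are assumptions.

```python
# Sketch of the binarized success metric: a stream counts as a success if
# listening time passes a per-playlist threshold (default 30s, with exceptions
# like sleep playlists). All names and the 300s value are illustrative.
DEFAULT_THRESHOLD_S = 30.0
PLAYLIST_THRESHOLDS_S = {"sleep": 300.0}  # hypothetical exception value

def stream_success(playlist: str, seconds_streamed: float) -> int:
    """Binarize streaming time into a 0/1 reward for the bandit."""
    threshold = PLAYLIST_THRESHOLDS_S.get(playlist, DEFAULT_THRESHOLD_S)
    return int(seconds_streamed >= threshold)
```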
2. intent
What is the user looking for? i.e. passive listening vs. actively engaging
*Mehrotra et al. WWW 2019*
examples: search for a particular thing vs. discovery by mood/activity vs. music to have on in the background
(the chart for this looked kind of meaningless to me?)
a multi-level model + intent improved user-satisfaction-rating prediction over a global model
shared learning across intents
most useful metrics:
- time to success + dwell time
- save or download
3. diversity of content
*Mehrotra et al. CIKM 2018*
- relevance (user + tracks)
- satisfaction (stream > 30 seconds)
- diversity (range of popularity from Drake to... not Drake)
Of course, they found that high relevance meant less diversity (very few playlists have both)
Tradeoff at beta = 0.7 with a 10% drop in satisfaction,
vs. max relevance at beta = 1, or max diversity at beta = 0 with a 32% drop in satisfaction
"personalized diversity" - satisfaction up 12%
(some users are of course more diverse in their interests)
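Reading the beta values above as a convex combination of relevance and diversity, the tradeoff can be sketched like this. The scoring function and candidate format are my assumptions, not the actual system.

```python
# A minimal sketch of the relevance/diversity tradeoff: score each item as a
# convex combination controlled by beta (beta = 1 is pure relevance, beta = 0
# is pure diversity; the talk cited beta = 0.7 as the operating point).

def blended_score(relevance: float, diversity: float, beta: float) -> float:
    return beta * relevance + (1.0 - beta) * diversity

def rank(candidates, beta):
    """candidates: list of (name, relevance, diversity) tuples, best first."""
    return sorted(candidates, key=lambda c: blended_score(c[1], c[2], beta),
                  reverse=True)
```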
Someone asked about how to define intent; she said TBD, but mostly by clustering behavior
and looking at things like time of day

----
# David Hubbard and Benoit Rostykus - Netflix
Long-term outcomes
short: click (popularity bias), view, like
medium: dwell time, quality plays
long: satisfaction, subscription renewal, etc.
Ranking: popular, not relevant
Messaging: short-term metrics can lead to user fatigue
Want to model satisfaction over time, for example, renewal as a Bernoulli model over months
Bayesian approach, beta-logistic/geometric, *Heckman & Willis 1977*
Features used: country, tenure, devices, streaming, behavior, payments
predicting churn, basically. *Vaupel & Yashin 1985*
Effects of selection on population bias, why means are bad
*Fader et al. 2018* predicting retention
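The beta-geometric idea behind this can be sketched with Fader & Hardie's shifted-beta-geometric formulas: each subscriber churns each month with probability theta, theta ~ Beta(a, b) across the population. The a and b values below are illustrative, not fitted to anything.

```python
# Shifted-beta-geometric retention sketch. Retention *rises* over time even
# though no individual changes, because high-churn-propensity users select
# themselves out of the surviving population (why population means mislead).

def retention_rate(a: float, b: float, t: int) -> float:
    """P(survive month t | survived month t-1), for t = 1, 2, ..."""
    return (b + t - 1) / (a + b + t - 1)

def survival_curve(a: float, b: float, horizon: int) -> list:
    """S(t) for t = 1..horizon."""
    s, curve = 1.0, []
    for t in range(1, horizon + 1):
        s *= retention_rate(a, b, t)
        curve.append(s)
    return curve
```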
Take-home points: better to have short-term initial disappointment if it leads to better long-term outcomes,
vs. initial pleasure followed by lasting disappointment
Used a Criteo online-conversions dataset of 1M rows and 100k columns (?!)
beta-logistic was approximately as good as exponential, and <5% better than plain logistic
a common approach is logistic + Laplace approximation, but that's not very scalable
Concluded that the beta posterior was better than a Gaussian posterior, and more scalable
Netflix dataset they used was 10M rows x 500 columns; used 3M rows for the test set and LightGBM for training
*paper available on arXiv*
someone asked re: a counterfactual bandit approach (sort of getting at feedback loops, though they didn't say it that way)
----
# Jason Gauci - applied reinforcement learning, Facebook
- Evangelize decision-making
Has been training large NNs since back when you couldn't get an NN talk into NIPS
Tech Lead Manager on Horizon https://github.com/facebookresearch/Horizon
Programming Throwdown podcast
Eternal Terminal replacement for ssh at mistertea.github.com
1. retrieval - matrix factorization, DNN
2. event prediction - DNN, GBDT, etc.
3. ranking - bandits, RL
4. DS - a/b tests
1 & 3 are control
2 is signal processing
4 is causal analysis
Classification:
- what will happen; trained on ground truth, evaluated re: accuracy, assumes data are correct
Decision Making:
- how can we improve; trained from another policy (usually a worse one),
counterfactual evaluation, assumes data are flawed
- action features
- context - device type
- session features
- event predictions
Greedy State Recs:
- value function: utility to stakeholders
- control function (maximize predictions)
- transition function - penalty to create
"Data Science Descent"
- loop: design metrics, create predictions, analyze (better: automated a/b tests)
Historical: Google had giant tables of click-through rates by category. Humans were building decision trees by hand.
https://becominghuman.ai/the-very-basis-of-reinforcement-learning-(uuid)
Markov Decision Process:
- state: user/post/session
- action: which post to show (decide)
- reward: R(S,A)
- transition: T(S,A) -> S'
  maps state-action pairs to a future state
- value: discounted reward
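The "value = discounted reward" bullet, written out: the value of a trajectory is the sum of its rewards, each discounted by gamma per step.

```python
# Discounted return of a reward sequence: R_0 + gamma*R_1 + gamma^2*R_2 + ...

def discounted_return(rewards, gamma: float) -> float:
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```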
Can't just regress: credit assignment problem
State-action-reward-state-action: SARSA is recursive,
an idea borrowed from dynamic programming
Have the model pick the best action instead: policy gradient
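The recursive SARSA idea above, in its textbook update form: the value of (s, a) is bootstrapped from the value of the *next* state-action pair actually taken, which is the dynamic-programming flavor of the method. This is a generic sketch, not Horizon's implementation.

```python
# Tabular SARSA update: Q(s,a) <- Q(s,a) + alpha * (r + gamma*Q(s',a') - Q(s,a))
from collections import defaultdict

Q = defaultdict(float)  # maps (state, action) -> estimated value

def sarsa_update(s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    td_target = r + gamma * Q[(s2, a2)]   # bootstrap from the next pair taken
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```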
Synchronous SGD, Spark and distributed PyTorch
CPE: counterfactual policy evaluator
- more useful slides, but they went by too fast; see the online video for details
----
# Olya Gurevich - Marvelous AI
- Detecting political narratives with HITL NLP
23% of adults admitted to sharing fake news (on purpose or by accident)
- hard: no large labeled datasets, often couched in a kernel of truth, user engagement can't be used as a success metric
Cofounders: Danielle came from Kixeye
Target audience: researchers, journalists, political candidates. Focusing on the 2020 election.
- discovering themes & narratives about candidates, NLP work on tweets, measuring spread, clustering content
Train GloVe embeddings
joe biden 'creep' and amy klobuchar 'salad' and 'comb'; see *Demszky et al. NAACL 2019*
Hierarchical clustering
Media Bias Fact Check (MBFC) ratings of news sites
*Benkler et al., Network Propaganda*
Left-wing is self-policing, right-wing is not
Female candidates are getting more fake-news attacks; Elizabeth Warren gets the most
Suggestions:
1. Engagement metrics have to evolve.
2. Beware echo chambers and radicalization spirals.
3. Actively measure bias.
4. The ideological divide is not symmetric.
----
# Susan Athey - Stanford
- Counterfactual Inference for Consumer Choice with Many Products
*see her publications*
- Old way: 1 product at a time; not scalable, and misses things, e.g. store-vs-store competition, bundling related products
- unobserved latent product characteristics
- Build a structural model of the customer, with preferences that generalize, i.e. quality, stockpiling
- *her slides seem useful, with many references, but she went really fast; see the video for details*
- Used a loyalty-card dataset over 18 months; prices change every Tuesday night
- Product hierarchy: UPC, subclass, class, category, group, dept, section
- throw out seasonal things
- 28 features re: user demographics
- Consider categories where the user is probably only going to buy 1
- users with > 20 trips on Tuesday or Wednesday with > 10 items per trip, top 235 categories
- example UPC price series
Hierarchical Poisson Factorization model (HPF): log(user preferences • product attributes) = mean utility
Assumes items are independent, but pricing changes in one brand actually affect purchases in others.
Add a penalty for price increases.
Nested logit to deal with people who don't buy anything: 1) purchase or not, 2) value of purchase in the category
Variational Bayes
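The HPF mean-utility line above reduces to a dot product: user preferences and product attributes are non-negative latent vectors, their dot product is the Poisson rate for purchase counts, and its log plays the role of mean utility. The vectors below are illustrative, not fitted values.

```python
# Sketch of the HPF mean utility: rate = theta_u . beta_i, utility = log(rate).
import math

def poisson_rate(user_prefs, product_attrs) -> float:
    """Expected purchase count: dot product of non-negative latent vectors."""
    return sum(u * p for u, p in zip(user_prefs, product_attrs))

def mean_utility(user_prefs, product_attrs) -> float:
    return math.log(poisson_rate(user_prefs, product_attrs))
```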
popular products are purchased at least 2.5 times per day on average
- what happens when something is out of stock
- what happens when another product in the same category has a price change, or other subclasses: "cross-price elasticity"
- determine a user's price sensitivity for different products
(how to account for skew? and time effects?)
- gains from targeted discounts
- similarity/exchangeability/complementarity
coffee & diapers are often co-purchased, vs. if hot dog prices go up, hot dog bun purchases go down
- how are purchases re-distributed
- placebo controls and normalization to check for overall effects, like a product becoming generally more popular
----
# Mihajlo Grbovic - ML-powered search at Airbnb for Experiences
6M listings in 191 countries
Experiences: activities led by local hosts
team is 10 people, all men
click data
experience features: price, reviews, ratings, duration, max guests, category
GBDT
50k training samples
partial dependence plots
forest score delta
personalization: a mix of historical & in-session clicks
rank also by availability & dates, type of trip (business vs. family)
category intensity: # of clicks total
category recency: # days since last clicked
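The two personalization features above can be computed from a toy click log. The log format, a list of (category, days_ago) clicks, is my own assumption for illustration.

```python
# Category intensity and recency features from a per-user click log.

def category_intensity(clicks, category) -> int:
    """Total number of clicks the user made in this category."""
    return sum(1 for cat, _ in clicks if cat == category)

def category_recency(clicks, category):
    """Days since the category was last clicked (None if never clicked)."""
    days = [d for cat, d in clicks if cat == category]
    return min(days) if days else None
```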
Recently booked items rank lower by default, because they're assuming people won't pick the same thing 2x in a row (he didn't show any data for this)
create a hobby profile of the user
origin-destination pairs, e.g. Japan: classes, USA: food
add language match using browser language
quality - ratings, and phrases in reviews
started by training on clicks/bookings --> using bookings per impression, performance got 3% better
impression discounting - adjust when something is ranked high but never clicked
position bias TBD
instance-level features, e.g. weekend vs. weekday
----
# Minmin Chen - YouTube
reinforcement learning, joint work with Covington et al.
limitations of supervised learning:
1. myopic: pigeon-holes users, short-term is prioritized over long-term
2. system bias: missing feedback on items that were never recommended; "rich get richer"
Goals:
1. better understand latent user info
2. be able to quickly adapt
3. discover new user interests
Plan a sequence of actions to maximize long-term reward
Improving the candidate generator long tail; noisy and sparse feedback for users x items (see notes from the Strata YouTube talk)
Maximize cumulative reward
Markov decision process (MDP)
Maximize reward by gradient ascent, but user trajectories are generated by different policies
off-policy learning: use batched feedback from a different policy to help identify bias and remove it with inverse propensity weighting
see Achiam, Joshua et al. 2017 arXiv paper
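A minimal inverse-propensity-weighting sketch of the off-policy correction: reweight each logged reward by how much more (or less) likely the new policy is to take the logged action than the behavior policy that generated the data. The data format is my own, and real systems clip or normalize these weights for variance.

```python
# Vanilla IPS estimate of the new policy's average reward from logged data.

def ips_estimate(logged) -> float:
    """logged: list of (reward, p_new_policy, p_behavior_policy) per event."""
    return sum(r * (p_new / p_old) for r, p_new, p_old in logged) / len(logged)
```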
Have multiple agents serving videos, but no access to how those work, so they have to run models to learn about their own platform in lieu of using logs (??)
Top-K: sum of rewards for individual items; they add an off-policy correction
actually saw engagement go up 20%
Boltzmann exploration - sample according to the learned policy, with a temperature term to adjust the exploration rate
entropy regularization: penalize KL divergence between the uniform and learned policies
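Boltzmann exploration as described above: sample actions from a softmax over the policy's scores, with temperature as the exploration knob (high temperature approaches uniform, low temperature approaches greedy). A generic sketch, not YouTube's implementation.

```python
# Softmax action sampling with a temperature term.
import math
import random

def boltzmann_probs(scores, temperature: float):
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(scores, temperature, rng=random):
    probs = boltzmann_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```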
REINFORCE
hard to connect a single user choice to long-term rec behavior and long-term user behavior/value
understanding user intent TBD
----
# Jure Leskovec - chief scientist at Pinterest
pins: bookmark + photo + map location
large, human-curated graph
crowd-sourcing for curating clusters
can be radically personalized
every pin and board has a description
huge dataset of 3-4B pins and boards; a few hundred billion connections
how people describe things
need the graph to update in real time without retraining
featurizing the graph structure is hard
deep-learning tools know how to use fixed-size grids and sequences
graphs have no spatial locality or reference point (no top-left like a spatial image)
see Graph CNNs for web-scale recommender systems, KDD 2018
nodes aggregate info from their neighbors using NNs
PinSage - embeddings for nodes; borrows info from nearby nodes in the network
curriculum learning - use increasingly harder negative controls (closer, but still not related)
vs. very easy negatives (very obviously not related); someone asked where they get these; answer:
from other rec systems, and from lower down in the rankings
sub-sample neighborhoods for efficient GPU batching
producer-consumer CPU-GPU training pipeline
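A toy version of the neighbor-aggregation idea: each node's new representation combines its own features with an aggregate (here a plain mean) of sampled neighbors' features. The real system uses learned NN transforms, importance-weighted neighborhoods, and GPU batching; this only shows the shape of the computation.

```python
# Neighborhood sub-sampling + mean aggregation + concatenation with self features.
import random

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def aggregate(node, graph, features, num_samples=3, rng=random):
    """graph: node -> list of neighbors; features: node -> feature vector."""
    neighbors = graph[node]
    sampled = rng.sample(neighbors, min(num_samples, len(neighbors)))
    neigh_mean = mean_vec([features[n] for n in sampled])
    return features[node] + neigh_mean  # concatenate self + neighborhood summary
```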
trying to predict what pin they'll save next
much better than visual-only or annotation-only (partly because their visual object identification doesn't work that well)
someone also asked about cycles in the graph, don't they reinforce each other?
answer: BFS, and otherwise it doesn't matter if the same node reappears; it can actually be very informative
----
# Selen Uguroglu - Netflix show similarity
Siamese networks with contrastive loss
weights are shared between the two networks during training
can use a hinge loss for dissimilar items
Triplet loss - computationally expensive
(-) other class -- anchor -- (+) same class
minimize the anchor-to-(+) distance after learning
triplet choices are important; want to choose semi-hard negatives to optimize convergence
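The triplet loss being described, in plain Python: pull the anchor toward the positive (same class) and push it at least `margin` away from the negative (other class). A semi-hard negative is one farther than the positive but still inside the margin. Frameworks apply this over embedding batches; this is the scalar core.

```python
# Triplet loss with squared Euclidean distances, plus a semi-hard check.

def squared_dist(u, v) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0) -> float:
    return max(0.0, squared_dist(anchor, positive)
               - squared_dist(anchor, negative) + margin)

def is_semi_hard(anchor, positive, negative, margin=1.0) -> bool:
    """Negative is farther than the positive, but still within the margin."""
    d_ap = squared_dist(anchor, positive)
    d_an = squared_dist(anchor, negative)
    return d_ap < d_an < d_ap + margin
```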
Lifted Structured Loss - uses more relationships among all the training samples
metadata they use: genre, expert tags, cast, title, images, script, synopsis, knowledge graph (relationships between titles)