Netflix PRS conference
Started in 2016; past talks are online
"everything is a recommendation"
80% of what people watch on Netflix comes from recs

# Mounia Lalmas - Director of Research at Spotify (based in London)
Home: help users find content quickly
*nice slide w/ overall view of research -> measurement -> modeling -> optimization -> business*
1. success metrics
*BaRT - McInerney et al. 2018*
- bandits
- find the best card per shelf, and then rank shelves
success = streaming time binarized with a threshold of 30 seconds per playlist (this seems weird to me; 30 seconds per song would make more sense?)
exceptions, e.g. the sleep playlist success threshold is longer
jazz listeners listen longer than other listeners
reward functions:
- one global function
- one per user x playlist
- groups of users x playlists
Used *Dhillon et al. co-clustering, KDD 2003*
Histograms + thresholds
Found that the mean worked best for the threshold (vs. additive or cumulative, which seem like straw-man comparisons to me)
Affinity features (content x user) are better than generic ones (age or day)
*Dragone et al. WWW 2019*
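The binarized success signal above can be sketched in a few lines. This is my own illustration, not Spotify's code; the playlist names and the longer sleep-playlist threshold value are assumptions.

```python
# Sketch of the binarized success metric: a stream counts as a success if
# listening time passes a per-playlist threshold (default 30s, with exceptions
# like sleep playlists). All names and the 300s value are illustrative.
DEFAULT_THRESHOLD_S = 30.0
PLAYLIST_THRESHOLDS_S = {"sleep": 300.0}  # hypothetical exception value

def stream_success(playlist: str, seconds_streamed: float) -> int:
    """Binarize streaming time into a 0/1 reward for the bandit."""
    threshold = PLAYLIST_THRESHOLDS_S.get(playlist, DEFAULT_THRESHOLD_S)
    return int(seconds_streamed >= threshold)
```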
2. intent
What is the user looking for? i.e. passive listening vs. actively engaging
*Mehrotra et al. WWW 2019*
examples: search for a particular thing vs. discovery by mood/activity vs. music to have on in the background
(the chart for this looked kind of meaningless to me?)
a multi-level model + intent improved user-satisfaction-rating prediction over a global model
shared learning across intents
most useful metrics:
- time to success + dwell time
- save or download
3. diversity of content
*Mehrotra et al. CIKM 2018*
- relevance (user + tracks)
- satisfaction (stream > 30 seconds)
- diversity (range of popularity from Drake to... not Drake)
Of course, they found that high relevance meant less diversity (very few playlists have both)
Tradeoff at beta = 0.7 with a 10% drop in satisfaction,
vs. max relevance at beta = 1, or max diversity at beta = 0 with a 32% drop in satisfaction
"personalized diversity" - satisfaction up 12%
(some users are of course more diverse in their interests)
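Reading the beta values above as a convex combination of relevance and diversity, the tradeoff can be sketched like this. The scoring function and candidate format are my assumptions, not the actual system.

```python
# A minimal sketch of the relevance/diversity tradeoff: score each item as a
# convex combination controlled by beta (beta = 1 is pure relevance, beta = 0
# is pure diversity; the talk cited beta = 0.7 as the operating point).

def blended_score(relevance: float, diversity: float, beta: float) -> float:
    return beta * relevance + (1.0 - beta) * diversity

def rank(candidates, beta):
    """candidates: list of (name, relevance, diversity) tuples, best first."""
    return sorted(candidates, key=lambda c: blended_score(c[1], c[2], beta),
                  reverse=True)
```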
Someone asked about how to define intent; she said TBD, but mostly by clustering behavior
and looking at things like time of day

----
# David Hubbard and Benoit Rostykus - Netflix
Long-term outcomes
short: click (popularity bias), view, like
medium: dwell time, quality plays
long: satisfaction, subscription renewal, etc.
Ranking: popular, not relevant
Messaging: short-term metrics can lead to user fatigue
Want to model satisfaction over time, for example, renewal as a Bernoulli model over months
Bayesian approach, beta-logistic/geometric, *Heckman & Willis 1977*
Features used: country, tenure, devices, streaming, behavior, payments
predicting churn, basically. *Vaupel & Yashin 1985*
Effects of selection on population bias, why means are bad
*Fader et al. 2018* predicting retention
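The beta-geometric idea behind this can be sketched with Fader & Hardie's shifted-beta-geometric formulas: each subscriber churns each month with probability theta, theta ~ Beta(a, b) across the population. The a and b values below are illustrative, not fitted to anything.

```python
# Shifted-beta-geometric retention sketch. Retention *rises* over time even
# though no individual changes, because high-churn-propensity users select
# themselves out of the surviving population (why population means mislead).

def retention_rate(a: float, b: float, t: int) -> float:
    """P(survive month t | survived month t-1), for t = 1, 2, ..."""
    return (b + t - 1) / (a + b + t - 1)

def survival_curve(a: float, b: float, horizon: int) -> list:
    """S(t) for t = 1..horizon."""
    s, curve = 1.0, []
    for t in range(1, horizon + 1):
        s *= retention_rate(a, b, t)
        curve.append(s)
    return curve
```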
Take-home points: better to have short-term initial disappointment if it leads to better long-term outcomes,
vs. initial pleasure followed by lasting disappointment
Used a Criteo online-conversions dataset of 1M rows and 100k columns (?!)
beta-logistic was approximately as good as exponential, and <5% better than plain logistic
a common approach is logistic + Laplace approximation, but that's not very scalable
Concluded that the beta posterior was better than a Gaussian posterior, and more scalable
Netflix dataset they used was 10M rows x 500 columns; used 3M rows for the test set and LightGBM for training
*paper available on arXiv*
someone asked re: a counterfactual bandit approach (sort of getting at feedback loops, though they didn't say it that way)
----
# Jason Gauci - applied reinforcement learning, Facebook
- Evangelize decision-making
Has been training large NNs since back when you couldn't get an NN talk into NIPS
Tech Lead Manager on Horizon https://github.com/facebookresearch/Horizon
Programming Throwdown podcast
Eternal Terminal replacement for ssh at mistertea.github.com
1. retrieval - matrix factorization, DNN
2. event prediction - DNN, GBDT, etc.
3. ranking - bandits, RL
4. DS - a/b tests
1 & 3 are control
2 is signal processing
4 is causal analysis
Classification:
- what will happen; trained on ground truth, evaluated re: accuracy, assumes data are correct
Decision Making:
- how can we improve; trained from another policy (usually a worse one),
counterfactual evaluation, assumes data are flawed
- action features
- context - device type
- session features
- event predictions
Greedy State Recs:
- value function: utility to stakeholders
- control function (maximize predictions)
- transition function - penalty to create
"Data Science Descent"
- loop: design metrics, create predictions, analyze (better: automated a/b tests)
Historical: Google had giant tables of click-through rates by category. Humans were building decision trees by hand.
https://becominghuman.ai/the-very-basis-of-reinforcement-learning-(uuid)
Markov Decision Process:
- state: user/post/session
- action: which post to show (decide)
- reward: R(S,A)
- transition: T(S,A) -> S'
  maps state-action pairs to a future state
- value: discounted reward
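The "value = discounted reward" bullet, written out: the value of a trajectory is the sum of its rewards, each discounted by gamma per step.

```python
# Discounted return of a reward sequence: R_0 + gamma*R_1 + gamma^2*R_2 + ...

def discounted_return(rewards, gamma: float) -> float:
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```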
Can't just regress: credit assignment problem
State-action-reward-state-action: SARSA is recursive,
an idea borrowed from dynamic programming
Have the model pick the best action instead: policy gradient
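The recursive SARSA idea above, in its textbook update form: the value of (s, a) is bootstrapped from the value of the *next* state-action pair actually taken, which is the dynamic-programming flavor of the method. This is a generic sketch, not Horizon's implementation.

```python
# Tabular SARSA update: Q(s,a) <- Q(s,a) + alpha * (r + gamma*Q(s',a') - Q(s,a))
from collections import defaultdict

Q = defaultdict(float)  # maps (state, action) -> estimated value

def sarsa_update(s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    td_target = r + gamma * Q[(s2, a2)]   # bootstrap from the next pair taken
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```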
Synchronous SGD, Spark and distributed PyTorch
CPE: counterfactual policy evaluator
- more useful slides, but they went by too fast; see the online video for details
----
# Olya Gurevich - Marvelous AI
- Detecting political narratives with HITL NLP
23% of adults admitted to sharing fake news (on purpose or by accident)
- hard: no large labeled datasets, often couched in a kernel of truth, user engagement can't be used as a success metric
Cofounders: Danielle came from Kixeye
Target audience: researchers, journalists, political candidates. Focusing on the 2020 election.
- discovering themes & narratives about candidates, NLP work on tweets, measuring spread, clustering content
Train GloVe embeddings
joe biden 'creep' and amy klobuchar 'salad' and 'comb'; see *Demszky et al. NAACL 2019*
Hierarchical clustering
Media Bias Fact Check (MBFC) ratings of news sites
*Benkler et al., Network Propaganda*
Left-wing is self-policing, right-wing is not
Female candidates are getting more fake-news attacks; Elizabeth Warren gets the most
Suggestions:
1. Engagement metrics have to evolve.
2. Beware echo chambers and radicalization spirals.
3. Actively measure bias.
4. The ideological divide is not symmetric.
----
# Susan Athey - Stanford
- Counterfactual Inference for Consumer Choice with Many Products
*see her publications*
- Old way: 1 product at a time; not scalable, and misses things, e.g. store-vs-store competition, bundling related products
- unobserved latent product characteristics
- Build a structural model of the customer, with preferences that generalize, i.e. quality, stockpiling
- *her slides seem useful, with many references, but she went really fast; see the video for details*
- Used a loyalty-card dataset over 18 months; prices change every Tuesday night
- Product hierarchy: UPC, subclass, class, category, group, dept, section
- throw out seasonal things
- 28 features re: user demographics
- Consider categories where the user is probably only going to buy 1
- users with > 20 trips on Tuesday or Wednesday with > 10 items per trip, top 235 categories
- example UPC price series
Hierarchical Poisson Factorization model (HPF): log(user preferences • product attributes) = mean utility
Assumes items are independent, but pricing changes in one brand actually affect purchases in others.
Add a penalty for price increases.
Nested logit to deal with people who don't buy anything: 1) purchase or not, 2) value of purchase in the category
Variational Bayes
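The HPF mean-utility line above reduces to a dot product: user preferences and product attributes are non-negative latent vectors, their dot product is the Poisson rate for purchase counts, and its log plays the role of mean utility. The vectors below are illustrative, not fitted values.

```python
# Sketch of the HPF mean utility: rate = theta_u . beta_i, utility = log(rate).
import math

def poisson_rate(user_prefs, product_attrs) -> float:
    """Expected purchase count: dot product of non-negative latent vectors."""
    return sum(u * p for u, p in zip(user_prefs, product_attrs))

def mean_utility(user_prefs, product_attrs) -> float:
    return math.log(poisson_rate(user_prefs, product_attrs))
```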
popular products are purchased at least 2.5 times per day on average
- what happens when something is out of stock
- what happens when another product in the same category has a price change, or other subclasses: "cross-price elasticity"
- determine a user's price sensitivity for different products
(how to account for skew? and time effects?)
- gains from targeted discounts
- similarity/exchangeability/complementarity
coffee & diapers are often co-purchased, vs. if hot dog prices go up, hot dog bun purchases go down
- how are purchases re-distributed
- placebo controls and normalization to check for overall effects, like a product becoming generally more popular
----
# Mihajlo Grbovic - ML-powered search at Airbnb for Experiences
6M listings in 191 countries
Experiences: activities led by local hosts
team is 10 people, all men
click data
experience features: price, reviews, ratings, duration, max guests, category
GBDT
50k training samples
partial dependence plots
forest score delta
personalization: a mix of historical & in-session clicks
rank also by availability & dates, type of trip (business vs. family)
category intensity: # of clicks total
category recency: # days since last clicked
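The two personalization features above can be computed from a toy click log. The log format, a list of (category, days_ago) clicks, is my own assumption for illustration.

```python
# Category intensity and recency features from a per-user click log.

def category_intensity(clicks, category) -> int:
    """Total number of clicks the user made in this category."""
    return sum(1 for cat, _ in clicks if cat == category)

def category_recency(clicks, category):
    """Days since the category was last clicked (None if never clicked)."""
    days = [d for cat, d in clicks if cat == category]
    return min(days) if days else None
```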
Recently booked items rank lower by default, because they're assuming people won't pick the same thing 2x in a row (he didn't show any data for this)
create a hobby profile of the user
origin-destination pairs, e.g. Japan: classes, USA: food
add language match using browser language
quality - ratings, and phrases in reviews
started by training on clicks/bookings --> using bookings per impression, performance got 3% better
impression discounting - adjust when something is ranked high but never clicked
position bias TBD
instance-level features, e.g. weekend vs. weekday
----
# Minmin Chen - YouTube
reinforcement learning, joint work with Covington et al.
limitations of supervised learning:
1. myopic: pigeon-holes users, short-term is prioritized over long-term
2. system bias: missing feedback on items that were never recommended; "rich get richer"
Goals:
1. better understand latent user info
2. be able to quickly adapt
3. discover new user interests
Plan a sequence of actions to maximize long-term reward
Improving the candidate generator long tail; noisy and sparse feedback for users x items (see notes from the Strata YouTube talk)
Maximize cumulative reward
Markov decision process (MDP)
Maximize reward by gradient ascent, but user trajectories are generated by different policies
off-policy learning: use batched feedback from a different policy to help identify bias and remove it with inverse propensity weighting
see Achiam, Joshua et al. 2017 arXiv paper
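A minimal inverse-propensity-weighting sketch of the off-policy correction: reweight each logged reward by how much more (or less) likely the new policy is to take the logged action than the behavior policy that generated the data. The data format is my own, and real systems clip or normalize these weights for variance.

```python
# Vanilla IPS estimate of the new policy's average reward from logged data.

def ips_estimate(logged) -> float:
    """logged: list of (reward, p_new_policy, p_behavior_policy) per event."""
    return sum(r * (p_new / p_old) for r, p_new, p_old in logged) / len(logged)
```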
Have multiple agents serving videos, but no access to how those work, so they have to run models to learn about their own platform in lieu of using logs (??)
Top-K: sum of rewards for individual items; they add an off-policy correction
actually saw engagement go up 20%
Boltzmann exploration - sample according to the learned policy, with a temperature term to adjust the exploration rate
entropy regularization: penalize KL divergence between the uniform and learned policies
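Boltzmann exploration as described above: sample actions from a softmax over the policy's scores, with temperature as the exploration knob (high temperature approaches uniform, low temperature approaches greedy). A generic sketch, not YouTube's implementation.

```python
# Softmax action sampling with a temperature term.
import math
import random

def boltzmann_probs(scores, temperature: float):
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(scores, temperature, rng=random):
    probs = boltzmann_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```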
REINFORCE
hard to connect a single user choice to long-term rec behavior and long-term user behavior/value
understanding user intent TBD
----
# Jure Leskovec - chief scientist at Pinterest
pins: bookmark + photo + map location
large, human-curated graph
crowd-sourcing for curating clusters
can be radically personalized
every pin and board has a description
huge dataset of 3-4B pins and boards; a few hundred billion connections
how people describe things
need the graph to update in real time without retraining
featurizing the graph structure is hard
deep-learning tools know how to use fixed-size grids and sequences
graphs have no spatial locality or reference point (no top-left like a spatial image)
see Graph CNNs for web-scale recommender systems, KDD 2018
nodes aggregate info from their neighbors using NNs
PinSage - embeddings for nodes; borrows info from nearby nodes in the network
curriculum learning - use increasingly harder negative controls (closer, but still not related)
vs. very easy negatives (very obviously not related); someone asked where they get these; answer:
from other rec systems, and from lower down in the rankings
sub-sample neighborhoods for efficient GPU batching
producer-consumer CPU-GPU training pipeline
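A toy version of the neighbor-aggregation idea: each node's new representation combines its own features with an aggregate (here a plain mean) of sampled neighbors' features. The real system uses learned NN transforms, importance-weighted neighborhoods, and GPU batching; this only shows the shape of the computation.

```python
# Neighborhood sub-sampling + mean aggregation + concatenation with self features.
import random

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def aggregate(node, graph, features, num_samples=3, rng=random):
    """graph: node -> list of neighbors; features: node -> feature vector."""
    neighbors = graph[node]
    sampled = rng.sample(neighbors, min(num_samples, len(neighbors)))
    neigh_mean = mean_vec([features[n] for n in sampled])
    return features[node] + neigh_mean  # concatenate self + neighborhood summary
```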
trying to predict what pin they'll save next
much better than visual-only or annotation-only (partly because their visual object identification doesn't work that well)
someone also asked about cycles in the graph, don't they reinforce each other?
answer: BFS, and otherwise it doesn't matter if the same node reappears; it can actually be very informative
----
# Selen Uguroglu - Netflix show similarity
Siamese networks with contrastive loss
weights are shared between the two networks during training
can use a hinge loss for dissimilar items
Triplet loss - computationally expensive
(-) other class -- anchor -- (+) same class
minimize the anchor-to-(+) distance after learning
triplet choices are important; want to choose semi-hard negatives to optimize convergence
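The triplet loss being described, in plain Python: pull the anchor toward the positive (same class) and push it at least `margin` away from the negative (other class). A semi-hard negative is one farther than the positive but still inside the margin. Frameworks apply this over embedding batches; this is the scalar core.

```python
# Triplet loss with squared Euclidean distances, plus a semi-hard check.

def squared_dist(u, v) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0) -> float:
    return max(0.0, squared_dist(anchor, positive)
               - squared_dist(anchor, negative) + margin)

def is_semi_hard(anchor, positive, negative, margin=1.0) -> bool:
    """Negative is farther than the positive, but still within the margin."""
    d_ap = squared_dist(anchor, positive)
    d_an = squared_dist(anchor, negative)
    return d_ap < d_an < d_ap + margin
```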
Lifted Structured Loss - uses more relationships among all the training samples
metadata they use: genre, expert tags, cast, title, images, script, synopsis, knowledge graph (relationships between titles)