cscorley/proposal.md

Last active September 10, 2015 18:59


the general algo goes like so:

for chunk in corpus:
  e-step
  m-step

gensim hacks in multiple passes:

for pass_ in passes:
  for chunk in corpus:
    e-step
    m-step

what we've been doing (only works for batch):

for pass_ in passes:
  for bound_iter in iters:
    for chunk in corpus:
      e-step
      m-step
  
    break if the bound has converged
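
a minimal runnable sketch of the batch loop above, assuming a toy model instead of LDA. `ToyModel`, `train_batch`, and `tol` are hypothetical names for illustration, not gensim API, and the outer `passes` loop is left out to keep it short. the "done" check here is simply "the bound over the whole corpus changed by less than tol since the last sweep", which is an assumption about what the convergence test looks like.

```python
class ToyModel:
    """Hypothetical stand-in for a topic model: it tracks one number, mu."""

    def __init__(self):
        self.mu = 0.0

    def e_step(self, chunk):
        # "Sufficient statistic" for the toy model: the chunk mean.
        return sum(chunk) / len(chunk)

    def m_step(self, stat):
        # Move the parameter halfway toward the statistic.
        self.mu += 0.5 * (stat - self.mu)

    def bound(self, docs):
        # Toy objective (higher is better): negative squared error.
        return -sum((x - self.mu) ** 2 for x in docs)


def train_batch(corpus, chunksize, bound_iters, model, tol=1e-3):
    """Sweep over all chunks repeatedly; stop once the bound stops changing."""
    previous = None
    sweeps = 0
    for _ in range(bound_iters):
        for start in range(0, len(corpus), chunksize):
            chunk = corpus[start:start + chunksize]
            model.m_step(model.e_step(chunk))
        sweeps += 1
        # Convergence is judged on the whole corpus, once per sweep.
        current = model.bound(corpus)
        if previous is not None and abs(current - previous) < tol:
            break
        previous = current
    return sweeps


if __name__ == "__main__":
    model = ToyModel()
    sweeps = train_batch([1.0, 2.0, 3.0, 4.0], chunksize=2, bound_iters=50, model=model)
    print(sweeps, model.mu)
```

because every sweep touches every chunk, the convergence check needs the whole corpus on hand, which is why this shape only fits batch mode.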

for online updates, would it make more sense to:

for chunk in corpus:
  for bound_iter in iters:
    e-step
    m-step
  
    break if the bound has converged

this would give us one loop that behaves the same as before in batch mode (chunksize=len(corpus), bound_iters > 1) and also works in online mode (chunksize < len(corpus), bound_iters > 1).
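
a rough sketch of the proposed per-chunk loop, again on a toy model and not on LDA. `ToyModel`, `train_per_chunk`, and `tol` are made-up names for illustration, not gensim API, and the convergence test (bound on the current chunk changed by less than tol) is an assumption.

```python
class ToyModel:
    """Hypothetical stand-in for a topic model: it tracks one number, mu."""

    def __init__(self):
        self.mu = 0.0

    def e_step(self, chunk):
        # "Sufficient statistic" for the toy model: the chunk mean.
        return sum(chunk) / len(chunk)

    def m_step(self, stat):
        # Move the parameter halfway toward the statistic.
        self.mu += 0.5 * (stat - self.mu)

    def bound(self, docs):
        # Toy objective (higher is better): negative squared error.
        return -sum((x - self.mu) ** 2 for x in docs)


def train_per_chunk(corpus, chunksize, bound_iters, model, tol=1e-3):
    """Iterate on each chunk until its bound stops changing, then move on."""
    iters_used = []
    for start in range(0, len(corpus), chunksize):
        chunk = corpus[start:start + chunksize]
        previous = None
        used = 0
        for _ in range(bound_iters):
            model.m_step(model.e_step(chunk))
            used += 1
            # Only the current chunk is needed for the convergence check.
            current = model.bound(chunk)
            if previous is not None and abs(current - previous) < tol:
                break
            previous = current
        # Per-chunk bookkeeping (e.g. a decay/rho update) would go here,
        # once per chunk, not once per inner iteration.
        iters_used.append(used)
    return iters_used


if __name__ == "__main__":
    corpus = [1.0, 2.0, 3.0, 4.0]
    # batch: a single chunk covering the whole corpus
    print(train_per_chunk(corpus, chunksize=len(corpus), bound_iters=50, model=ToyModel()))
    # online: smaller chunks, each one iterated to convergence as it arrives
    print(train_per_chunk(corpus, chunksize=2, bound_iters=50, model=ToyModel()))
```

the same function covers both modes; only chunksize changes. in the toy online run the parameter ends up close to the last chunk's mean, since each chunk is fitted until its own bound settles before the next chunk is seen.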

hazelybell commented Sep 10, 2015

Well, the main difference between bound_iter and passes is that bound_iter has a convergence criterion; it's not just "do this 10 times." Secondly, bound_iter doesn't update the decay.


cscorley commented Sep 10, 2015

that's fine. i renamed it in the example so it's clearer what i mean (e.g., before the edit passes == iters, so users don't hit deprecation problems with the PR). not updating rho/gamma/etc. until after a chunk is finished is what this would do.

hazelybell commented Sep 10, 2015

I mean the only reason I can see to use example 4 is if the updates are "actually" online. If chunks are being used just due to memory constraints but we're really running in a batch mode, then having chunks on the inside makes more sense and follows the algorithm in the paper more closely.