(defconstant +foo-count+ (ash (sb-int:power-of-two-ceiling
                               (* 8 sb-vm::*backend-page-bytes*))
                              -1))

(defconstant +foo-mask+ (1- +foo-count+))

(defun foo-duplicates (list)
  (let ((bitmap (make-array +foo-count+ :element-type 'bit))
        (results nil))
    (declare (dynamic-extent bitmap))
    (dolist (elt list)
      (let ((hash (logand +foo-mask+ (sxhash elt))))
        (cond ((zerop (bit bitmap hash))
               ;; First element with this hash: remember and keep it.
               (setf (bit bitmap hash) 1)
               (push elt results))
              ;; Bit already set: either a true duplicate (skip it) ...
              ((member elt results))
              ;; ... or a hash collision with a different element.
              (t
               (push elt results)))))
    results))

(defparameter *list* (let (all)
                       (loop repeat 10000
                             do (push (random 100000) all))
                       all))

(progn
  (time (foo-duplicates *list*))
  nil)

(let ((list (copy-list *list*)))
  (time (delete-duplicates list))
  nil)

(progn
  (time (remove-duplicates *list*))
  nil)
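One thing worth noting before the next version: FOO-DUPLICATES pushes each newly seen element onto RESULTS, so the distinct elements come back in reverse order of first occurrence. A small illustration (the input list here is my own, not from the gist):

```lisp
;; Each new element is pushed onto the front of RESULTS, so the output
;; is the distinct elements in reverse order of first occurrence.
;; (A hash collision only makes the MEMBER check run; the output is
;; the same either way.)
(foo-duplicates '(1 2 1 3 2 4))
;; => (4 3 2 1)
```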
Another attempt, this time preserving the order of elements:
;; Note: +foo-count-2+ and +foo-mask-2+ are not defined in this snippet.
;; Presumably they are analogous to +foo-count+ and +foo-mask+: a
;; (smaller) power-of-two bin count and its corresponding mask.
(defun bar-duplicates (list)
  (let ((bitmap (make-array +foo-count+ :element-type 'bit))
        (bins (make-array +foo-count-2+
                          :element-type 'list
                          :initial-element nil))
        ;; Dummy head cell, so splicing via (SETF (CDR CURRENT) ...)
        ;; also works for the first element.
        (result (cons nil (reverse list))))
    (declare (dynamic-extent bitmap bins))
    (let ((current result))
      (loop while (cdr current)
            do (let* ((elt (cadr current))
                      (hash (sxhash elt))
                      (hash1 (logand +foo-mask+ hash))
                      (hash2 (logand +foo-mask-2+ hash))
                      (bin (aref bins hash2)))
                 (cond ((zerop (bit bitmap hash1))
                        ;; First element with this hash: record and keep it.
                        (setf (bit bitmap hash1) 1)
                        (push elt (aref bins hash2))
                        (setf current (cdr current)))
                       ((or (not bin) (not (member elt bin)))
                        ;; Hash collision, but the element itself is new.
                        (push elt (aref bins hash2))
                        (setf current (cdr current)))
                       (t
                        ;; True duplicate: splice it out of the list.
                        (setf (cdr current) (cddr current)))))))
    (nreverse (cdr result))))
for 10,000 elements:
CL-USER> (mismatch (time (bar-duplicates *list*))
(time (remove-duplicates *list*)))
Evaluation took:
0.001 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
0.00% CPU
1,335,843 processor cycles
294,912 bytes consed
Evaluation took:
0.002 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
0.00% CPU
4,757,373 processor cycles
626,192 bytes consed
NIL
for 100,000 elements:
CL-USER> (mismatch (time (bar-duplicates *list*))
(time (remove-duplicates *list*)))
Evaluation took:
0.038 seconds of real time
0.040000 seconds of total run time (0.040000 user, 0.000000 system)
105.26% CPU
89,943,768 processor cycles
3,145,728 bytes consed
Evaluation took:
0.056 seconds of real time
0.060000 seconds of total run time (0.060000 user, 0.000000 system)
[ Run times consist of 0.020 seconds GC time, and 0.040 seconds non-GC time. ]
107.14% CPU
135,868,617 processor cycles
5,821,552 bytes consed
NIL
for 1,000,000 elements:
CL-USER> (mismatch (time (bar-duplicates *list*))
(time (remove-duplicates *list*)))
Evaluation took:
15.488 seconds of real time
15.400000 seconds of total run time (15.240000 user, 0.160000 system)
99.43% CPU
37,080,623,358 processor cycles
31,260,976 bytes consed
Evaluation took:
0.493 seconds of real time
0.480000 seconds of total run time (0.440000 user, 0.040000 system)
[ Run times consist of 0.090 seconds GC time, and 0.390 seconds non-GC time. ]
97.36% CPU
1,179,232,776 processor cycles
56,379,504 bytes consed
NIL
Yes, this is fast for small lists.
Regarding implementation strategies that use constant space: if one uses a bit array, I think Nikodemus' way is best. I have read a bit more about Bloom filters, and I now think that using them would degrade even earlier (that is, for smaller lists). I think I found a way to use a constant-size hash table that gets the runtime down by a constant factor for huge inputs: with n the size of the input and m the size of the output, from n * m to n * (m / hash-table-size). For small inputs this would behave just like a dynamically adjusted hash table.

I don't want to code that right now, but it goes like this (keeping the first of multiply-occurring elements, as that is simpler): walk through the list, putting each not-yet-seen element (tested via the hash table) into the result and into the hash table. Once the hash table is full, remember the place in the list we are at, and walk through the rest of the list, deleting from it every element that is in the hash table. Then empty the hash table and repeat from the remembered place.
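A minimal sketch of that scheme, under stated assumptions: the function name BAZ-DUPLICATES and the table size are my own illustrative choices, not part of the gist, and the sweep phase destructively modifies the remainder of the input list.

```lisp
;; Constant-space scheme sketched above: collect unseen elements into a
;; fixed-size hash table; once it is full, sweep the rest of the list,
;; deleting everything already in the table; then clear the table and
;; continue from where we stopped.  Keeps the FIRST of duplicates.
(defun baz-duplicates (list &key (table-size 4096))
  (let ((table (make-hash-table :test #'eql :size table-size))
        (result nil)
        (rest list))
    (loop while rest
          do ;; Fill phase: collect new elements until the table is full.
             (loop while (and rest (< (hash-table-count table) table-size))
                   do (let ((elt (pop rest)))
                        (unless (gethash elt table)
                          (setf (gethash elt table) t)
                          (push elt result))))
             ;; Sweep phase: every later occurrence of a tabled element
             ;; is a duplicate, so delete it from the remainder.
             (setf rest (delete-if (lambda (elt) (gethash elt table)) rest))
             ;; Safe to forget the table now: no copies of its elements
             ;; remain in REST.
             (clrhash table))
    (nreverse result)))
```

The table can be forgotten after each sweep because all remaining occurrences of its elements have just been removed from the unprocessed tail, which is what bounds the memory to the fixed table plus the output.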
For me, the following is faster than remove-duplicates for 10,000 and 100,000 elements (if I measured correctly):