At the Albuquerque meeting P0814
"hash_combine()
Again" was discussed and accepted for further
work by LEWG. Unfortunately, this proposal throws out multiple
improvements on hashing supportapproach and disregards a substantial
amount of work leading to superior hashing support. This proposal
is an attempt to bring the various bits together to avoid accepting
an inferior approach into the standard C++ library.
This section merely rehashes other prior papers on hashing and their respective discussions. It is merely present to provide more background and it can be skipped.
There were plenty of hashing-related papers. Some of these try to build on prior work while others don't. Also, some of the relevant dimensions are not necssarily discussed in the various papers. This section tries to yield an overview of relevant papers and their respective discussions. The respective summary sections try to extract the key points of the papers. The summary will invariably lack some details and isn't meant as a replacements for reading the papers but hopefully does truthfully capture the gist of the papers. The links to the committee discussions require a password as the internal discussions cannot be available publicly. Again, the summary should capture the gist of the discussions. Note that the discussion summary mentions relevant points brought up and not all of these represent a consensus view.
Number | Date | Title | Discussion |
---|---|---|---|
N3333 | 2012-01-13 | Hashing User-Defined Type in C++1y | N3333 Discussion |
N3573 | 2013-03-10 | Heterogenous extensions to unordered containers | |
N3730 | 2013-08-28 | Specializations and namespaces | |
N3876 | 2014-01-19 | Convenience Functions to Combine Hash Values | N3876 and N3898 Discussion |
N3898 | 2014-01-20 | Hashing and Fingerprinting | N3898 Discussion |
N3983 | 2014-05-07 | Hashing tuple-like types | |
N3980 | 2014-05-24 | Types Don't Know # | |
N3339 | 2015-04-09 | Message Digest Library for C++ | |
P0029r0 | 2015-09-21 | A Unified Proposal for Composable Hashing | P0029 Discussion |
P0199r0 | 2016-02-11 | Default Hash | |
P0513r0 | 2016-11-10 | Poisoning the Hash | D0513 Discussion |
P0599r1 | 2017-03-02 | noexcept for Hash Functions |
D0599 Discussion |
P0809r0 | 2017-10-12 | Comparing Unordered Containers | |
P0814r0 | 2017-10-13 | hash_combine() Again |
P0814 Discussion |
P0549r3 | 2018-02-03 | Adjuncts to std::hash |
N3333: Hashing User-Defined Type in C++1y
The standard C++ libraries defines various containers based on hashing but no support to create good hashes. Users are bound to create bad hash functions.
Discussion of multiple hashing uses (in-process containers, distributed
containers, storing data on disk, string comparison via a hash,
security digests). std::hash<>
can't fill all these needs and
suggests to address in-process unordered contains with std::hash
only. To allow updates to the hashing function while avoiding
suprises it should yield different results in different processes
(non-determinism of hash values). The paper mentions a few
approaches how to initialize a seed to produce different results
from different processes.
Specializing functionality in namespace std
is confusing and leads
programmers to also specialize things they are not allowed to
specialize. Thus, std::hash
should delegate to hash_value()
which locates the specialized functionality using ADL. Library code
already using std::hash
needs to continue using it but user code
should ideally use hash_value()
directly and/or adl_hash_value()
(which readily implements the using dance).
To aid users with their implementation of hash_value()
a variadic
function template hash_combine()
is proposed. A number of properties
for hash_combine()
are listed. It is also mentioned that hashing
functions do require some initialization/finalization which have
some costs but that it may not pose a significatn problem in practice.
Also, a hash_combine_range()
function template is proposed as well
as hash_as_bytes()
.
The functions for computing hashes return a hash_code
to make it
explicit that the values are intentionally unstable [between different
processes]. This use can also help reducing the collision rates and
performance but these aspects were not consider sufficient to warrant
the extra complexity.
The paper does not contain concrete wording.
N3333 Discussion |in Kona
- Generally there seems to be an argreement of hashes being non-deterministic. However, there are also concerns raised about non-determinism.
- Request to create an updated version including hashing and fingerprinting.
N3876: Convenience Functions to Combine Hash Values
The current standard C++ library doesn't provide functionality to create hashes for composite values. Users need to create hashes for composite values, e.g., to make them viable keys in unordered containers. However, it is non-trivial to create a good hash combination and users are prone to use bad approaches for combining hashes.
There are papers on implementing decent hash combining functions and
Boost hash_combine()
provides corresponding functionality. The Boost functionality is used
since years while the approach from N3333 is not.
In quoted comments it is pointed out that hash_combine()
should
take advantage of the variadic nature of the function. Without doing
so the result is harder to optimize well.
The paper proposed the addition of two function templates to
<functional>
, both returning a std::size_t
:
hash_combine(seed, values...)
taking aseed
to be modified by combining the hashes of the values into it.hash_val(values...)
taking a sequence ofvalues
whose hash is to be combined.hash_val()
delegates tohash_combine()
for the actual work.
N3876 and N3898 Discussion in Issaquah
- The discussions for N3876 and N3898 overlapped and are combined into one discussion.
- There maybe a more powerful approach where the hashing/fingerprint algorithm is a parameter but something like N3878 is a first step.
- Both a variadic and a range-based interface should be provided.
std::size_t
as an intermediate state is bad. Using bigger state results in faster and better hashes.- The interface for
hash_val()
isn't right. - The conversion to a
std::size_t
should happen as the last thing. - It was suggested that hash-combing tuple values might work but it is also stated that this doesn't work in all cases.
- There will be one chance to tell people how to write hash functions.
- The current practice is broken and we need to provide and teach a better approach.
- Various comments point to N3333.
- LWG2291:
std::hash
is vulnerable to DoS attacks - No support fo N3876 but support for N3333 and hashing algorithms which are unstable across executions as well as cryptographic algorithms.
N3898: Hashing and Fingerprinting
Going further than N3333 the paper proposes to separate the parts traversing the elements of a struct from the computation of a hash: code computing a hash should nearly read like serializing an object. A concept is defined for a hasher object : the functions from N3333 become member functions of the hasher objects. Also, the standard would provide multiple implementations of hasher objects with different properties with respect to the quality of the hash, whether it is deterministic and the size of the result.
The paper doesn't propose wording.
- Pre-meeting some comments were added about the optimizer's ability to do a good job: Passing the full set of arguments to the function by value makes the optimizer's job easier.
- Discussion at the meeting was joined with the discussion of N3876 (see above).
P0029r0: A Unified Proposal for Composable Hashing
The paper starts by stating the goals (support efficient and
convenient composition; avoid users writing bad hashing functions;
dealing with wrapper types; avoid the need of specializing in
namespace std
). The objective is to unify the properties of
proposals N3333 and
N3980.
The primary goal of the proposal is to decouple the hashing algorithm
(provided by libraries) and the hashing function provided by users.
This is done by exposing a type's hashing represention using bytes
(as unsigned char
). The paper also defines a hash expansion
when composing values and states the restriction that a hash
representations of a value shouldn't be suffix of a hash representation
of any other value of the same type leading to a set of basic
operations to combine hashes. The actual hash value is computed
based on the hash expansion provided by a type.
The fundamental approach consists of uses of a few functions and traits:
hash_value(h, v)
obtaining ahash_code
forv
based on ahash_code
passed in.hash_combine(std::move(h), m0, ..., mi)
for a hash based on the hash expansion.hash_combin_range(std::move(h), begin, end)
for a hash based on a range.is_uniquely_represented
for types which can be readily treated as byte arrays.
The used hash_code
is move-only. hash_code
is used for the
default hashing algorithm which is expected to be of good quality
and randomized at program start-up. It can be used for user-provided
hashing functions and is the hashing function used by hash<T>
,
i.e., users can provide a hashing function using a concrete type.
For some use cases it is useful to provide other hashing function.
In that case the provided hashing functions should be templated on
the hashing algorithm. All overloads of hash_value()
in the standard
library should be templatized that way.
Based on experience with users hash_value()
will have to be
integrated with hash<T>
. It is proposed to extend the primary
template to implement operator()(T const& t)
in terms of
hash_value(h, t)
(with a suitable rvalue h
).
The paper does not propose support for heterogeneous lookup but the proposed infrastructure should be able to cope with that case.
The paper compares various aspect to other proposals and provides a rationale for the respective choices:
- Passing the algorithm by-value vs. by-reference: aside from performance considerations, by-value subsumes by-reference. Hence by-value is preferred.
- Multiple hashing algorithms: The proposal doesn't require use of a template to support multiple algorithm but the infrastructure supports this use and using a template is required for standard types
- Single vs. Multiple initialization: The proposal adds all bytes to the representation based on one initialization, thereby avoiding leakage of implementation details on how the bytes are produced. Doing so should help with supporting heteregeneous lookup.
- Hash algorithm API: Providing more arguments in individual calls
to
hash_combine()
provide a broader interface to provide but also more optimization options for algorithm implementers. Thus, the interface is richer than that proposed in N3980. - Contiguous layout: Using a trait to assert that all bytes can be used for hashing yields for an easier to use interface and better integrates with arrays.
The paper contains wording.
P0029 Discussion | at Kona
- Support of heterogenous look-up matters mostly for
string
andstring_view
. - No consensus for general heterogenous look-up.
- Discussion on
hash<T>
vs. replacing the hasher: replacing the hasher forhash<T>
does require some migration effort. - Discussion on ease of creating a hashing function. The general assessment is seems to be that it isn't unduly hard to create a hashing function based on this proposal.
- Discussion on the difference on how the hashing functions are
defined. There is concern about needing
using std::hash_combine
but also mention of the remedy (putting that into a standard function.
Polls:
- Should the name be distinguished from existing names: apparently in favor
- Should
std::hash
be changed [to use the hashing infrastructure]: mildly in favor - Should the hashing functions be
noexcept
: in favor
P0199r0: Default Hash
The paper proposes automatic implementation of hash_value(v)
and
is_uniquely_reprsented
(from P0029r0
by the language. Users would be allowed to override the default
implementation were necessary or desired. It explains how and when
hash_value(v)
is generated and the rationale behind the decisions.
It also provides some detail on why an implementation based on
[static] reflection isn't viable (based on the reflection papers
at the time, at least).
Apparently this paper was not discussed at a meeting.
P0513r0: Poisoning the Hash
The paper clarifies the wording on when hash<T>
is enabled or
disabled for a type T
. There isn't much rationale as this paper is
mostly addressing a reported defect in the Working Draft.
According the paper hash<T>
is disable if neither the user nor
the library provides an explicit or partial specialization of the
class template hash
for T
.
The changes from this paper were applied to the Working Draft at the Issaquah meeting.
####D0513 Discussion
Lengthy discussion in LWG on various details. P0513r0 reflects the changes asked for during the discussion.
P0599r1: noexcept
for Hash Functions
The paper addresses US140 and changes the wording of some hash<T>
specializations.
D0599 Discussion at Kona
- Some minor concerns with specific decisions.
- General consensus with the direction.
D0599 Discussion at Kona
- Various discussion about
noexcept
in different contexts. - Consensus that the
hash<T>
structure operations (ctors, dtor,swap
, etc.) should benoexecpt
. - Apparently no consensus to make
operator()()
also alwaysnoexcept
. - General agreement to take the paper as written.
P0814r0: hash_combine()
Again
hash_combine()
was proposed in N3876
which, however, was rejected because something better based on
N3333 was promised to emerge. Despite the
promise nothing better came and C++17 shipped without facilities
to combine hashes. Even though possibly inferior hash_combine()
is proposed to be added to the standard C++ library.
The core of the proposal is a variadic function template hash_combine()
taking a sequence of values and returns a hash value resulting from
a combination of the respective value's hash values. The hash values
of the respective values are obtained using hash<T>()(arg)
. The
hash_combine()
function template supports a first, non-deduced
argument for the aggregation type which is defaulted to size_t
.
P0814 Discussion in Albuquerque
The discussion brought up a few points:
- Channeling hashes through
std::size_t
for combining hashes (the result ofstd::hash<T>()(v)
) is inefficient and leads to bad hashes. - Baking a hash value into programs creates an ABI dependency.
- Nicolai isn't insisting on having
hash_combine()
in the standard C++ library as long as there is something available. - Combining hashes should not be required to be associative.
- Both variadic and range versions should be provided with the potential of leveraging a Ranges library.
- The most likely location for the functionality is
<functional>
. - There was consent that
hash_combine()
should be accepted.
P0549r3: Adjuncts to std::hash
The paper proposes the addition of a few tools around the ability to producing hashes for different types:
is_enabled_hash<T>
hash_for<T>
,is_hashable_<T>
andis_hashable_v<T>
hash_value(v)
is_nothrow_hashable<T>
andis_nothrow_hashable_v<T>
The function hash_value(v)
delegates computation of a hash to
hash_for<T>{}(std::forward<T>(v))
. hash_for<T>
in turn is a
using alias for hash<remove_cvref_t<T>>
The paper includes proposed wording.
P0549r3 or its earlier revisision were not discussed at any of the meetings.
Great summary!
Has anything happened about the topic of hashing in the past years?