struct
Pf::USet32
- Pf::USet32
- Struct
- Value
- Object
Overview
A thread-safe, persistent set of 32-bit unsigned integers.
TODO with this architecture, it is possible to union very large consecutive integer spans really quickly by setting presence = UInt[16/32]::MAX and setting fully covered bitmaps to UInt64::MAX. Bitmaps on the edges are union'd the way we do it right now. This would probably require introducing a third option for op rhs: right now we have Trie and TrieP (trie path), and to implement this we'd need to have TrieSpan as well.
How it works?
The most important takeaway is that with USet32, integer magnitude directly affects performance.
See the diagram below.
(0-31) (0-15) (0-15)
LSB bitmap index n0 index n2 index MSB
───────── ─────── ───────
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
─────────── ───────── ─────── ───────
bit index chunk index n1 index n3 index
(0-63) (0-31) (0-15) (0-15)
┌─────┐ ▲
│N3 │ │
└──┬──┘ │
│ 16-way branch │
▼ │
┌─────┐ │
│N2 │ │
└──┬──┘ │
│ 16-way branch │
▼ │
┌─────┐ │
│N1 │ │
└──┬──┘ │
│ 16-way branch │
▼ │
┌─────┐ │
│N0 │ │
└──┬──┘ │
│ 16-way branch │ instantiated on-demand
▼ │ as the numbers grow
┌─────────┐ │
│wide node│ │
└────┬────┘ │
│ 32-way branch │
▼ │
┌─────────┐ │
│ chunk │ │
└────┬────┘ │
│ 32-way branch │
▼ │
┌───────────────────────────┐ │
│ bitmap │ │
│ │ │
│ 0 0 0 0 0 0 0 ... 0 0 0 0 │ │
│ ───────────────────────── │ │
│ 64 bits │ │
└───────────────────────────┘ │
USet32 effectively "reifies" the number tree (each number is a path through that tree), providing an indexing data structure on top of the u32 number line.
Note how this implementation penalizes large numbers or sets containing large numbers (numbers far away from origin).
Bitmaps are very cheap, so if your numbers are frequently in the 0-63 range, you're well set. The larger your number, the more allocation overhead you will have and the "taller" the tree will grow.
Everything from chunk upwards adds an allocation / a pointer dereference. On one hand, that's nice because you will copy less and share more. On the other hand, modern CPUs penalize pointer chasing so you're going to suffer: dozens of nanoseconds up to 100ns for an insertion is not unheard of.
We will probably provide support for USet64-256 in the future, because as you can see, this construction scales rather trivially; and USet256 especially would be useful for storing hashes such as SHA256 and BLAKE3. In that case, easing indirection and control overhead would become a high-priority goal.
Performance
Let's add all i32s. We can't add all u32s because BitArray crashes on that. We don't want to crash our only competitor in Crystal-land.
require "benchmark"
require "bit_array"
Benchmark.ips do |x|
x.report("uset32") do
_ = Pf::USet32.transaction do |set|
(0..Int32::MAX).each do |n|
set << n.to_u32
end
end
end
x.report("bit array") do
bits = BitArray.new(Int32::MAX)
(0..Int32::MAX).each do |n|
bits.unsafe_put(n, true)
end
end
end
This reports the following on my machine (Ryzen 3 2200G):
uset32 9.28m (107.77s ) (± 0.00%) 4.79GB/op 46.95× slower
bit array 435.66m ( 2.30s ) (± 1.14%) 256MB/op fastest
Horrendous, isn't it? Except remember that bit array allocates all the memory upfront, and does a neat sequ pass where it doesn't follow even a single pointer. Paradise for a modern CPU!
Now, USet32 is a dynamic, immutable set that is explicitly designed in such a way that large numbers are penalized (versus, say, Roaring bitmaps, where magnitude does not matter and instead small bitmaps are penalized). USet32 lets you work with small numbers practically for free.
If you have a different origin far from zero in your domain, you can normalize the numbers
before inserting them into USet32. To support negative numbers, create two sets: one for
negative numbers and another for positive ones. This is also the way to store Int32s efficiently,
since their raw binary representation is very far from zero if cast to UInt32.
USet32 is particularly well-suited for recording small, sequential ids, with an occasional immutable add or set operation. USet32 is suited for situations where you have thousands of small immutable sets, especially when they have a common ancestor.
In fact, USet32 works beautifully well for set operations, since, point A, it's a trie, and point B, we have bitmaps at the leaves. This is almost perfect for any kind of set operation, be it union, intersection, difference, etc.
Included Modules
Defined in:
permafrost/uset32.crConstructors
-
.[] : USet32
Constructs an empty set.
-
.[](*values : UInt32) : USet32
Constructs a set with the given values.
- .additive_identity : self
-
.new(ee : Enumerable(UInt32)) : USet32
Construct a set from the given enumerable ee.
-
.new : USet32
Constructs an empty set.
- .transaction(& : Commit -> ) : USet32
Instance Method Summary
-
#&(other : USet32) : USet32
Intersection: returns a new set containing integers common to this and other sets.
-
#+(other : USet32) : USet32
Alias of
#|. -
#-(other : USet32) : USet32
Difference: returns a new set containing integers from this set that are not in the other set.
- #==(other : USet32) : Bool
-
#|(other : USet32) : USet32
Union: returns a new set containing integers from this and other sets.
-
#add(value : UInt32) : USet32
Returns a copy of this set with value present.
-
#add?(value : UInt32) : Tuple(USet32, Bool)
Returns a copy of this set with value present, followed by a boolean indicating whether value was added during the call.
-
#delete(value : UInt32) : USet32
Returns a copy of this set without value.
-
#each(& : UInt32 -> ) : Nil
Yields integers stored in this set.
-
#hash(hasher)
See
Object#hash(hasher) -
#includes?(value) : Bool
Returns
trueif this set contains value. - #inspect(io)
-
#intersects?(other : USet32) : Bool
Returns
trueif this set has common integers with other. -
#offset(value : UInt32) : UInt32
Returns the number of integers before value (even if value is absent).
-
#prefix : UInt32
Returns the number of consecutive integers following zero in this set (i.e., all numbers from zero before the first "gap", if any).
- #pretty_print(pp)
-
#proper_subset_of?(other : USet32) : Bool
Returns
trueif all integers in this set are contained in other, and other is larger than this set. -
#proper_superset_of?(other : USet32) : Bool
Returns
trueif this set contains all integers from other, and other is smaller than this set. -
#size : Int32
Returns the number of integers in this set (as an
Int32). -
#subset_of?(other : USet32) : Bool
Returns
trueif all integers in this set are contained on other. -
#superset_of?(other : USet32) : Bool
Returns
trueif this set contains all integers from other. - #to_s(io)
-
#usize : UInt32
Returns the number of integers in this set (as a
UInt32).
Instance methods inherited from module Enumerable(UInt32)
to_pf_bidi
to_pf_bidi,
to_pf_map(& : T -> Tuple(K, V)) : Pf::Map(K, V) forall K, Vto_pf_map to_pf_map, to_pf_set : Pf::Set(T) to_pf_set, to_pf_uset32 : Pf::USet32 to_pf_uset32
Constructor Detail
Construct a set from the given enumerable ee.
Lets you build a possibly large set using Commit, which offers mutable versions
of methods from USet32.
WARNING The yielded Commit must, under no circumstances, outlive the block or
escape it otherwise, through retained closures, instance variables, etc. It uses
pointers into stack-allocated memory, so using it after the block returns
will lead to UB.
set = Pf::USet32.transaction do |commit|
commit << 1
commit << 2
if commit.includes?(2)
commit << 3
end
end
set # Pf::USet32[1, 2, 3]
Instance Method Detail
Intersection: returns a new set containing integers common to this and other sets.
Pf::USet32[1, 2, 3] & Pf::USet32[2, 3, 4] # => Pf::USet32[2, 3]
Difference: returns a new set containing integers from this set that are not in the other set.
Pf::USet32[1, 2, 3] - Pf::USet32[2, 3, 4] # => Pf::USet32[1]
Union: returns a new set containing integers from this and other sets.
Pf::USet32[1, 2, 3] | Pf::USet32[4, 5, 6] # => Pf::USet32[1, 2, 3, 4, 5, 6]
Returns a copy of this set with value present.
set = Pf::USet32[1, 2, 3]
set.add(4) # => Pf::USet32[1, 2, 3, 4]
set # => Pf::USet32[1, 2, 3]
Returns a copy of this set with value present, followed by a boolean indicating whether value was added during the call.
set = Pf::USet32[1, 2, 3]
set.add?(4) # => {Pf::USet32[1, 2, 3, 4], true}
set.add?(1) # => {Pf::USet32[1, 2, 3], false}
set # => Pf::USet32[1, 2, 3]
Returns a copy of this set without value.
set = Pf::USet32[1, 2, 3]
set.delete(2) # => Pf::USet32[1, 3]
set # => Pf::USet32[1, 2, 3]
Returns true if this set contains value. Returns false otherwise.
set = Pf::USet32[1, 2, 3]
set.includes?(2) # => true
set.includes?(100) # => false
Returns true if this set has common integers with other. Returns
false otherwise.
xs = Pf::USet32[1, 2, 3]
ys = Pf::USet32[4, 5, 6]
zs = Pf::USet32[3, 5, 6]
xs.intersects?(ys) # => false, nothing in common
xs.intersects?(zs) # => true, they both have `3`
Returns the number of integers before value (even if value is absent).
Effectively, #offset counts the number of set bits before value-th
bit in the underlying bits of this set.
This is sometimes called the rank of value.
Pf::USet32[1, 2, 3, 100].offset(1) # => 0
Pf::USet32[1, 2, 3, 100].offset(2) # => 1
Pf::USet32[1, 2, 3, 100].offset(3) # => 2
Pf::USet32[1, 2, 3, 100].offset(90) # => 3
Pf::USet32[1, 2, 3, 100].offset(100) # => 3
Pf::USet32[1, 2, 3, 100].offset(1234) # => 4
As a side effect, you can use this method for sorting (similar to bisection in spirit).
people = [
{"Alice", 32},
{"Bob", 45},
{"Charlie", 28},
{"Diana", 36},
{"Eve", 29},
]
sorted = [] of {String, Int32}
Pf::USet32.transaction do |uset|
people.each do |name, age|
sorted.insert(uset.offset(age.to_u32), {name, age})
uset << age.to_u32
end
end
pp sorted # => [{"Charlie", 28}, {"Eve", 29}, {"Alice", 32}, {"Diana", 36}, {"Bob", 45}]
Returns the number of consecutive integers following zero in this set (i.e., all numbers from zero before the first "gap", if any).
Effectively, #prefix performs the "trailing ones count" operation on
the underlying bits of this set.
Pf::USet32[0, 1, 2, 3].prefix # => 4
Pf::USet32[1, 2, 3].prefix # => 0
Pf::USet32[0, 1, 3].prefix # => 2
Returns true if all integers in this set are contained in other, and
other is larger than this set. Returns false otherwise.
xs = Pf::USet32[1, 2, 3]
ys = Pf::USet32[1, 2, 3, 4, 5, 6]
xs.proper_subset_of?(ys) # => true
ys.subset_of?(ys) # => true
ys.proper_subset_of?(ys) # => false
Returns true if this set contains all integers from other, and other
is smaller than this set. Returns false otherwise.
xs = Pf::USet32[1, 2, 3]
ys = Pf::USet32[1, 2, 3, 4, 5, 6]
ys.proper_superset_of?(xs) # => true
ys.superset_of?(ys) # => true
ys.proper_superset_of?(ys) # => false
Returns the number of integers in this set (as an Int32).
set = Pf::USet32[1, 2, 3]
set.size # => 3
Returns true if all integers in this set are contained on other. Returns
false otherwise.
xs = Pf::USet32[1, 2, 3]
ys = Pf::USet32[1, 2, 3, 4, 5, 6]
xs.subset_of?(ys) # => true
ys.subset_of?(xs) # => false
Returns true if this set contains all integers from other. Returns
false otherwise.
xs = Pf::USet32[1, 2, 3]
ys = Pf::USet32[1, 2, 3, 4, 5, 6]
xs.superset_of?(ys) # => false
ys.superset_of?(xs) # => true