struct Pf::USet32

Pf::USet32
Struct
Value
Object

Overview

A thread-safe, persistent set of 32-bit unsigned integers.

TODO with this architecture, it is possible to union very large consecutive integer spans really quickly by setting presence = UInt[16/32]::MAX and setting fully covered bitmaps to UInt64::MAX. Bitmaps on the edges are union'd the way we do it right now. This would probably require introducing a third option for op rhs: right now we have Trie and TrieP (trie path), and to implement this we'd need to have TrieSpan as well.

How it works?

The most important takeaway is that with USet32, integer magnitude directly affects performance.

See the diagram below.

                (0-31)             (0-15)           (0-15)
LSB          bitmap index         n0 index         n2 index       MSB
               ─────────           ───────         ───────
   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   ───────────           ─────────         ───────         ───────
    bit index           chunk index        n1 index        n3 index
      (0-63)              (0-31)            (0-15)          (0-15)




    ┌─────┐                        ▲
    │N3   │                        │
    └──┬──┘                        │
       │  16-way branch            │
       ▼                           │
    ┌─────┐                        │
    │N2   │                        │
    └──┬──┘                        │
       │  16-way branch            │
       ▼                           │
    ┌─────┐                        │
    │N1   │                        │
    └──┬──┘                        │
       │  16-way branch            │
       ▼                           │
    ┌─────┐                        │
    │N0   │                        │
    └──┬──┘                        │
       │  16-way branch            │  instantiated on-demand
       ▼                           │  as the numbers grow
  ┌─────────┐                      │
  │wide node│                      │
  └────┬────┘                      │
       │  32-way branch            │
       ▼                           │
  ┌─────────┐                      │
  │  chunk  │                      │
  └────┬────┘                      │
       │  32-way branch            │
       ▼                           │
  ┌───────────────────────────┐    │
  │ bitmap                    │    │
  │                           │    │
  │ 0 0 0 0 0 0 0 ... 0 0 0 0 │    │
  │ ───────────────────────── │    │
  │          64 bits          │    │
  └───────────────────────────┘    │

USet32 effectively "reifies" the number tree (each number is a path through that tree), providing an indexing data structure on top of the u32 number line.

Note how this implementation penalizes large numbers or sets containing large numbers (numbers far away from origin).

Bitmaps are very cheap, so if your numbers are frequently in the 0-63 range, you're well set. The larger your number, the more allocation overhead you will have and the "taller" the tree will grow.

Everything from chunk upwards adds an allocation / a pointer dereference. On one hand, that's nice because you will copy less and share more. On the other hand, modern CPUs penalize pointer chasing so you're going to suffer: dozens of nanoseconds up to 100ns for an insertion is not unheard of.

We will probably provide support for USet64-256 in the future, because as you can see, this construction scales rather trivially; and USet256 especially would be useful for storing hashes such as SHA256 and BLAKE3. In that case, easing indirection and control overhead would become a high-priority goal.

Performance

Let's add all i32s. We can't add all u32s because BitArray crashes on that. We don't want to crash our only competitor in Crystal-land.

require "benchmark"
require "bit_array"

Benchmark.ips do |x|
  x.report("uset32") do
    _ = Pf::USet32.transaction do |set|
      (0..Int32::MAX).each do |n|
        set << n.to_u32
      end
    end
  end

  x.report("bit array") do
    bits = BitArray.new(Int32::MAX)
    (0..Int32::MAX).each do |n|
      bits.unsafe_put(n, true)
    end
  end
end

This reports the following on my machine (Ryzen 3 2200G):

   uset32   9.28m (107.77s ) (± 0.00%)  4.79GB/op  46.95× slower
bit array 435.66m (  2.30s ) (± 1.14%)   256MB/op        fastest

Horrendous, isn't it? Except remember that bit array allocates all the memory upfront, and does a neat sequ pass where it doesn't follow even a single pointer. Paradise for a modern CPU!

Now, USet32 is a dynamic, immutable set that is explicitly designed in such a way that large numbers are penalized (versus, say, Roaring bitmaps, where magnitude does not matter and instead small bitmaps are penalized). USet32 lets you work with small numbers practically for free.

If you have a different origin far from zero in your domain, you can normalize the numbers before inserting them into USet32. To support negative numbers, create two sets: one for negative numbers and another for positive ones. This is also the way to store Int32s efficiently, since their raw binary representation is very far from zero if cast to UInt32.

USet32 is particularly well-suited for recording small, sequential ids, with an occasional immutable add or set operation. USet32 is suited for situations where you have thousands of small immutable sets, especially when they have a common ancestor.

In fact, USet32 works beautifully well for set operations, since, point A, it's a trie, and point B, we have bitmaps at the leaves. This is almost perfect for any kind of set operation, be it union, intersection, difference, etc.

Included Modules

Enumerable(UInt32)

Defined in:

permafrost/uset32.cr

Constructors

.[] : USet32
Constructs an empty set.
.[](*values : UInt32) : USet32
Constructs a set with the given values.
.additive_identity : self
.new(ee : Enumerable(UInt32)) : USet32
Construct a set from the given enumerable ee.
.new : USet32
Constructs an empty set.
.transaction(& : Commit -> ) : USet32
Lets you build a possibly large set using Commit, which offers mutable versions of methods from USet32.

Instance Method Summary

#&(other : USet32) : USet32
Intersection: returns a new set containing integers common to this and other sets.
#+(other : USet32) : USet32
Alias of #|.
#-(other : USet32) : USet32
Difference: returns a new set containing integers from this set that are not in the other set.
#==(other : USet32) : Bool
#|(other : USet32) : USet32
Union: returns a new set containing integers from this and other sets.
#add(value : UInt32) : USet32
Returns a copy of this set with value present.
#add?(value : UInt32) : Tuple(USet32, Bool)
Returns a copy of this set with value present, followed by a boolean indicating whether value was added during the call.
#delete(value : UInt32) : USet32
Returns a copy of this set without value.
#each(& : UInt32 -> ) : Nil
Yields integers stored in this set.
#hash(hasher)
See Object#hash(hasher)
#includes?(value) : Bool
Returns true if this set contains value.
#inspect(io)
#intersects?(other : USet32) : Bool
Returns true if this set has common integers with other.
#offset(value : UInt32) : UInt32
Returns the number of integers before value (even if value is absent).
#prefix : UInt32
Returns the number of consecutive integers following zero in this set (i.e., all numbers from zero before the first "gap", if any).
#pretty_print(pp)
#proper_subset_of?(other : USet32) : Bool
Returns true if all integers in this set are contained in other, and other is larger than this set.
#proper_superset_of?(other : USet32) : Bool
Returns true if this set contains all integers from other, and other is smaller than this set.
#size : Int32
Returns the number of integers in this set (as an Int32).
#subset_of?(other : USet32) : Bool
Returns true if all integers in this set are contained on other.
#superset_of?(other : USet32) : Bool
Returns true if this set contains all integers from other.
#to_s(io)
#usize : UInt32
Returns the number of integers in this set (as a UInt32).

Instance methods inherited from module `Enumerable(UInt32)`

, , ,

Constructor Detail

def self.[] : USet32 #

Constructs an empty set.

[View source]

def self.[](*values : UInt32) : USet32 #

Constructs a set with the given values.

[View source]

def self.additive_identity : self #

[View source]

def self.new(ee : Enumerable(UInt32)) : USet32 #

Construct a set from the given enumerable ee.

[View source]

def self.new : USet32 #

Constructs an empty set.

[View source]

def self.transaction(& : Commit -> ) : USet32 #

Lets you build a possibly large set using Commit, which offers mutable versions of methods from USet32.

WARNING The yielded Commit must, under no circumstances, outlive the block or escape it otherwise, through retained closures, instance variables, etc. It uses pointers into stack-allocated memory, so using it after the block returns will lead to UB.

set = Pf::USet32.transaction do |commit|
  commit << 1
  commit << 2
  if commit.includes?(2)
    commit << 3
  end
end

set # Pf::USet32[1, 2, 3]

[View source]

Instance Method Detail

def &(other : USet32) : USet32 #

Intersection: returns a new set containing integers common to this and other sets.

Pf::USet32[1, 2, 3] & Pf::USet32[2, 3, 4] # => Pf::USet32[2, 3]

[View source]

def +(other : USet32) : USet32 #

Alias of #|.

[View source]

def -(other : USet32) : USet32 #

Difference: returns a new set containing integers from this set that are not in the other set.

Pf::USet32[1, 2, 3] - Pf::USet32[2, 3, 4] # => Pf::USet32[1]

[View source]

def ==(other : USet32) : Bool #

[View source]

def |(other : USet32) : USet32 #

Union: returns a new set containing integers from this and other sets.

Pf::USet32[1, 2, 3] | Pf::USet32[4, 5, 6] # => Pf::USet32[1, 2, 3, 4, 5, 6]

[View source]

def add(value : UInt32) : USet32 #

Returns a copy of this set with value present.

set = Pf::USet32[1, 2, 3]

set.add(4) # => Pf::USet32[1, 2, 3, 4]
set        # => Pf::USet32[1, 2, 3]

[View source]

def add?(value : UInt32) : Tuple(USet32, Bool) #

Returns a copy of this set with value present, followed by a boolean indicating whether value was added during the call.

set = Pf::USet32[1, 2, 3]

set.add?(4) # => {Pf::USet32[1, 2, 3, 4], true}
set.add?(1) # => {Pf::USet32[1, 2, 3], false}

set # => Pf::USet32[1, 2, 3]

[View source]

def delete(value : UInt32) : USet32 #

Returns a copy of this set without value.

set = Pf::USet32[1, 2, 3]

set.delete(2) # => Pf::USet32[1, 3]
set           # => Pf::USet32[1, 2, 3]

[View source]

def each(& : UInt32 -> ) : Nil #

Yields integers stored in this set.

[View source]

def hash(hasher) #

Description copied from struct Struct

See Object#hash(hasher)

[View source]

def includes?(value) : Bool #

Returns true if this set contains value. Returns false otherwise.

set = Pf::USet32[1, 2, 3]
set.includes?(2)   # => true
set.includes?(100) # => false

[View source]

def inspect(io) #

[View source]

def intersects?(other : USet32) : Bool #

Returns true if this set has common integers with other. Returns false otherwise.

xs = Pf::USet32[1, 2, 3]
ys = Pf::USet32[4, 5, 6]
zs = Pf::USet32[3, 5, 6]

xs.intersects?(ys) # => false, nothing in common
xs.intersects?(zs) # => true, they both have `3`

[View source]

def offset(value : UInt32) : UInt32 #

Returns the number of integers before value (even if value is absent).

Effectively, #offset counts the number of set bits before value-th bit in the underlying bits of this set.

This is sometimes called the rank of value.

Pf::USet32[1, 2, 3, 100].offset(1)    # => 0
Pf::USet32[1, 2, 3, 100].offset(2)    # => 1
Pf::USet32[1, 2, 3, 100].offset(3)    # => 2
Pf::USet32[1, 2, 3, 100].offset(90)   # => 3
Pf::USet32[1, 2, 3, 100].offset(100)  # => 3
Pf::USet32[1, 2, 3, 100].offset(1234) # => 4

As a side effect, you can use this method for sorting (similar to bisection in spirit).

people = [
  {"Alice", 32},
  {"Bob", 45},
  {"Charlie", 28},
  {"Diana", 36},
  {"Eve", 29},
]

sorted = [] of {String, Int32}

Pf::USet32.transaction do |uset|
  people.each do |name, age|
    sorted.insert(uset.offset(age.to_u32), {name, age})
    uset << age.to_u32
  end
end

pp sorted # => [{"Charlie", 28}, {"Eve", 29}, {"Alice", 32}, {"Diana", 36}, {"Bob", 45}]

[View source]

def prefix : UInt32 #

Returns the number of consecutive integers following zero in this set (i.e., all numbers from zero before the first "gap", if any).

Effectively, #prefix performs the "trailing ones count" operation on the underlying bits of this set.

Pf::USet32[0, 1, 2, 3].prefix # => 4
Pf::USet32[1, 2, 3].prefix    # => 0
Pf::USet32[0, 1, 3].prefix    # => 2

[View source]

def pretty_print(pp) #

[View source]

def proper_subset_of?(other : USet32) : Bool #

Returns true if all integers in this set are contained in other, and other is larger than this set. Returns false otherwise.

xs = Pf::USet32[1, 2, 3]
ys = Pf::USet32[1, 2, 3, 4, 5, 6]

xs.proper_subset_of?(ys) # => true

ys.subset_of?(ys)        # => true
ys.proper_subset_of?(ys) # => false

[View source]

def proper_superset_of?(other : USet32) : Bool #

Returns true if this set contains all integers from other, and other is smaller than this set. Returns false otherwise.

xs = Pf::USet32[1, 2, 3]
ys = Pf::USet32[1, 2, 3, 4, 5, 6]

ys.proper_superset_of?(xs) # => true

ys.superset_of?(ys)        # => true
ys.proper_superset_of?(ys) # => false

[View source]

def size : Int32 #

Returns the number of integers in this set (as an Int32).

set = Pf::USet32[1, 2, 3]
set.size # => 3

[View source]

def subset_of?(other : USet32) : Bool #

Returns true if all integers in this set are contained on other. Returns false otherwise.

xs = Pf::USet32[1, 2, 3]
ys = Pf::USet32[1, 2, 3, 4, 5, 6]

xs.subset_of?(ys) # => true
ys.subset_of?(xs) # => false

[View source]

def superset_of?(other : USet32) : Bool #

Returns true if this set contains all integers from other. Returns false otherwise.

xs = Pf::USet32[1, 2, 3]
ys = Pf::USet32[1, 2, 3, 4, 5, 6]

xs.superset_of?(ys) # => false
ys.superset_of?(xs) # => true

[View source]

def to_s(io) #

[View source]

def usize : UInt32 #

Returns the number of integers in this set (as a UInt32).

[View source]

permafrost

struct Pf::USet32

Overview

How it works?

Performance

Included Modules

Defined in:

Constructors

Instance Method Summary

Instance methods inherited from module Enumerable(UInt32)

Constructor Detail

Instance Method Detail

Instance methods inherited from module `Enumerable(UInt32)`