Hash tables in Dyalog APL

Hash tables in Dyalog APL#

If you’re coming to Dyalog APL from a different programming language, before long you’ll probably wonder “where’s the hash table/dict/associative array?”. If you ask around, some seasoned APLer will chirp up with “you don’t need one, we have arrays”.

As all Real Programmers know, the only useful data structure is the array. Strings, lists, structures, sets -- these are all special cases of arrays and and can be treated that way just as easily without messing up your programing language with all sorts of complications.

From Real programmers don't use Pascal

Ultimately, this is true – with a little bit of legwork, we can achieve much of the same things that Python (say) does with its built-in dict using APL arrays.

For example:

keys ← 'bob' 'eric' 'frankie' 'alison' 'jo'
vals ←   5      9       6         4      9
find ← {vals[keys⍳⍵]}

find ⊂'frankie'
find 'frankie' 'jo' 'bob'

6 9 5

Fair – but whilst this is an associative array, it’s not much of a hash table – there is no actual hashing going on, just a linear search. We can make it a bit hashier by employing a cute little trick:

keyfind ← keys∘⍳          ⍝ Binding array to ⍳ (or ∊) makes secret hash
find ← {vals[keyfind ⍵]}
find 'frankie' 'jo' 'bob'

6 9 5

By binding the keys array to dyadic iota, we let the interpreter know that we will perform multiple lookups, and the interpreter then creates a hashed lookup table ‘behind the scenes’. This means that our hash table is read-only, but sometimes this is what you want.

We can be a little bit more flexible by marking arrays for hashing, with 1500⌶:

keys ← 1500⌶'bob' 'eric' 'frankie' 'alison' 'jo'
vals ←        5      9       6         4      9
find ← {vals[keys⍳⍵]}
find 'frankie' 'jo' 'bob'

6 9 5

This tells the interpreter to maintain a hashed index for the keys array, meaning that certain operations will be able to do lookups faster (like ⍳ and ∊). However, it comes with a couple of caveats. If we add elements to the end of the array with ,←, the hash will be maintained. If we pass the hashed array as an argument to a function, the hash will be maintained. The bad news is that if we do anything else, the hash vanishes. However, that still gives us a pretty decent use case coverage – we can do lookups and we can add new items to the hash. We can’t delete items from, or change items in the key array (perhaps no big surprise there). The vals array needn’t be hashed, so in our simple hash table we can still do what we want to the values:

find ⊂'frankie'
vals[keys⍳⊂'frankie']←10
find ⊂'frankie'

We can use the 1500⌶ to check the state of the hash table:

1 (1500⌶) keys

As per the documentation, a 2 means that the array has a hash table.

To add a new item, we need to add the new key to the keys array, and the corresponding value to the vals array:

keys ,← ⊂'zach'
vals ,← 32
find ⊂'zach'
1 (1500⌶) keys ⍝ Verify that we still have a hash table: 2

keys
vals

┌───┬────┬───────┬──────┬──┬────┐
│bob│eric│frankie│alison│jo│zach│
└───┴────┴───────┴──────┴──┴────┘

5 9 10 4 9 32

This kind of hash table is insertion order preserving. As you’re only ever adding new items to the end of the keys and vals, the order of the keys never changes.

Hashing with the 1500⌶ certainly makes a bit of difference in terms of performance, but perhaps not as much as one might think.

Here’s an example. Let’s read in the system dictionary (around 235k English words) and use those as our keys, and then look up every single key with ⍳ and measure the difference comparing hashed and unhashed arrays:

'cmpx'⎕cy'dfns'
words1 ← 1500⌶ ⊃⎕NGET'/usr/share/dict/words'1 ⍝ Hashed
words2 ← ⊃⎕NGET'/usr/share/dict/words'1       ⍝ Not hashed
vals ← ⍳≢words1
keys ← ⊖words2 ⍝ Our lookup test. Reverse order 
cmpx '_←vals[words1⍳keys]' '_←vals[words2⍳keys]'

  _←vals[words1⍳keys] → 1.9E¯2 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                
  _←vals[words2⍳keys] → 3.1E¯2 | +64% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

≢keys

A saving of around 10ms. Hashing is worthwhile perhaps on larger arrays, but Dyalog’s pretty speedy without, too.

In terms of developer ergonomics, we lack a way to “upsert” values in this approach. A nice thing with Python-style dicts is that if you assign to a non-present key, it will be added to the dict, and if the key is present, the value will be updated:

Python 3.11.2 (main, May  2 2023, 15:12:41) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> d = {}
>>> d["bob"] = 1
>>> d["eric"] = 2
>>> d["bob"] = 3
>>> d
{'bob': 3, 'eric': 2}

We can “upsert” by first checking if a key is present, and if so, just update the corresponding item in the values array.

]dinput
upsert ← {
    ⍝ 'upsert' -- for keys already present, update values.
    ⍝             for keys not present, append to key and val array
    ⍝
    ⍝ ⍺ keys 
    ⍝ ⍵ values
    ⍝ 
    ⍝ Restriction: the keys in ⍺ must be unique
    
    idx ← keys⍳⊆⍺
    k ← idx≠⎕IO+≢keys
    
    values[k/idx] ← k/⍵ ⍝ Update
    
    ⍝ Any new kv-pairs: add at the end
    keys,←(~k)/⊆⍺
    values,←(~k)/⍵
    ⍬
}

Namespaces as hash tables#

What else can we do? Well, the intrepid Dyaloger will have noted the conceptual similarity between namespaces and lookup tables. Can we use namespaces as hash tables? Sort of.

a←⎕NS⍬
a.(bob eric frankie alison jo) ← 5 9 6 4 9
a.frankie

Now that is pretty nice! We can add a couple of helper functions to give the illusion that namespaces are hash-like:

Keys←{⍵.⎕NL ¯2}
Vals←{⍵.(⍎¨⎕NL ¯2)}

Keys a
Vals a

┌──────┬───┬────┬───────┬──┐
│alison│bob│eric│frankie│jo│
└──────┴───┴────┴───────┴──┘

4 5 9 6 9

Unfortunately, there is no particularly nice way yet to set variables in a namespace from character vectors. Here’s one way we can try:

'alison' 'eric' 'bob' a.{⍎(∊⍺,¨' '),'←⍵'} ¯1 ¯2 ¯3
a.alison

¯1

but evaluating strings comes with its own set of complications. For example,

'Jean-Pierre' 'Jean-Christophe' 'D''Artagnan' (a.{⍎⍺,'←⍵'})¨ 9 8 10 ⍝ VALUE ERROR

⍎VALUE ERROR: Undefined name: Jean
      Jean-Pierre←⍵
      ∧

In order to be safe, we’d need to encode strings in some way to ensure that they’re valid APL names before putting them into a namespace. Fortunately, there is a handy I-beam function we can use for that, 7162⌶ – it’s used by ⎕JSON which faces the same issue when converting JSON-data to Dyalog namespaces.

Let’s put that into action with a getter-setter pair of helper routines:

_Set ← {⍺⍺⍎'←⍵',⍨∊' ',¨⍨0(7162⌶)¨⊆⍺} 
Get ← {⍺⍎¨0(7162⌶)¨⍵}

Here’s how they work:

a←⎕NS⍬
'aaa' 'bbb' 'ccc' 'jean-pierre' 'b+c' 'a/d' (a _Set) 1 2 3 4 5 6
a Get 'aaa' 'jean-pierre' 'a/d'

1 4 6

Unfortunately, we still have a problem. If we try to set a large number of keys, we run into limitations of ⍎:

a←⎕NS⍬
words2 (a _Set) vals ⍝ LIMIT ERROR

LIMIT ERROR
_Set[0] _Set←{⍺⍺⍎'←⍵',⍨∊' ',¨⍨0(7162⌶)¨⍺}
                ∧

This is an internal limitation that a Dyalog function can comprise of no more than 4095 “tokens”. Our tokens, in this case, are our keys to the left, and then two more, for the ←⍵ bit of the evaluation. We need to ‘chunk’ our data into sizes less than 4093. Let’s use 4000, a nice, round number:

]dinput
_Set ← {⎕IO←0
    ch ← 0=4000|⍳≢⊆⍺                       
    _sset ← {⍺⍺⍎'←⍵',⍨∊' ',¨⍨0(7162⌶)¨⊆⍺}  
    (⍺⍺ _sset)⌿¨↓⍉↑(ch⊂⍺)(ch⊂⍵)            
}

Now we should be able to avoid the limit error, but with our ~235k token data set, it’s a bit slow.

a←⎕NS⍬
]runtime "words2 (a _Set) vals"

* Benchmarking "words2 (a _Set) vals"
┌──────────┬─────┐
│          │(ms) │
├──────────┼─────┤
│CPU (avg):│14617│
├──────────┼─────┤
│Elapsed:  │14617│
└──────────┴─────┘

.NET Collections#

What about pulling in a hash table from .NET? .NET has a superb set of collections. Let’s try the aptly named Hashtable. It’s nicely ergonomic, and can be indexed directly with strings, even in APL:

⎕USING ← 'System.Collections'
ht ← ⎕NEW Hashtable
{ht.Add ⍵}¨↓⍉↑words2 vals ⍝ Quick

Keyed properties#

Most things C# can do, we can also do in Dyalog. We can actually write a dictionary-like class that is indexable directly using strings, just like the .NET example above, using what’s known as a “keyed property”:

]dinput
:Class Dict
    ⍝ )ed ○ Dict
    :Field Public _keys←(1500⌶)⍬
    :Field Public _values←⍬

    :Property Keyed Default Item
        :Access Public Instance

        ∇ r←get arg
            r←_values[_keys⍳⊃arg.Indexers]
        ∇

        ∇ set arg;unique;kk;vv;seen
            unique ← ≠⊃arg.Indexers
            kk ← unique/⊃arg.Indexers
            vv ← unique/arg.NewValue
            seen ← kk∊_keys
            (seen/_values)←seen/vv
            _keys,←(~seen)/kk
            _values,←(~seen)/vv
        ∇

    :EndProperty

    ∇ Make(keys values)
        :Access Public
        :Implements Constructor

        (_keys _values)←keys values
    ∇
:EndClass

z←⎕NEW Dict (('APL' 'Rocks!')(42 'Bob'))
z['Rocks!' 'APL' 'APL'] 

┌───┬──┬──┐
│Bob│42│42│
└───┴──┴──┘

Let’s load up that big word list and compare what we have tried so far:

z←⎕NEW Dict (words2 vals)

cmpx '≢ht[keys]' '≢a Get keys' '≢vals[words1⍳keys]' '≢z[keys]'

  ≢ht[keys]          → 5.9E¯1 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  ≢a Get keys        → 2.6E¯1 | -57% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                       
  ≢vals[words1⍳keys] → 1.9E¯2 | -97% ⎕                                       
  ≢z[keys]           → 3.4E¯2 | -95% ⎕⎕

So our orignal attempt using arrays is still the fastest, and – perhaps surprisingly – the .NET approach the slowest. The guts of our keyed property class is the same as our original ‘hashed array’, so it should come as no surprise that they are pretty much identical in terms of performance.

dfns alist#

There is a set of association list functions in the dfns workspace. They’re simple, but nicely put together. An assoication list is a 2-element vector, where the first element is a vector of keys, and the second element is a vector of the corresponding keys:

'alpush' 'alpop' 'alset' 'alget'⎕cy'dfns'

alist ← ('bob' 'eric' 'frankie' 'alison' 'jo')(5 9 6 4 9)

To retrieve the value corresponding to a particular key, we use alget:

alist alget 'frankie'

Note that alget only operates on single keys – we can’t retrieve multiple values by feeding it a vector of keys:

alist alget 'frankie' 'jo' ⍝ INDEX ERROR

INDEX ERROR
alget[2] (keys⍳⊂⍵)⊃vals      ⍝ :: val ← list ∇ key
                  ∧

To put another key-value pair onto our association list, we use alpush:

⊢alist ← alist alpush 'zach' 32

┌─────────────────────────────────┬────────────┐
│┌────┬───┬────┬───────┬──────┬──┐│32 5 9 6 4 9│
││zach│bob│eric│frankie│alison│jo││            │
│└────┴───┴────┴───────┴──────┴──┘│            │
└─────────────────────────────────┴────────────┘

Note that alpush adds new items to the beginning of the vectors holding keys and values. alset modifies the value of an existing key:

⊢alist ← alist alset 'frankie' 2

┌─────────────────────────────────┬────────────┐
│┌────┬───┬────┬───────┬──────┬──┐│32 5 9 2 4 9│
││zach│bob│eric│frankie│alison│jo││            │
│└────┴───┴────┴───────┴──────┴──┘│            │
└─────────────────────────────────┴────────────┘

alist

┌─────────────────────────────────┬────────────┐
│┌────┬───┬────┬───────┬──────┬──┐│32 5 9 2 4 9│
││zach│bob│eric│frankie│alison│jo││            │
│└────┴───┴────┴───────┴──────┴──┘│            │
└─────────────────────────────────┴────────────┘

The final function is alpop – it removes the specified key-value pair, returning the value and the resulting list:

⊢(eric_val alist) ← alist alpop 'eric'

┌─┬─────────────────────────────────────────┐
│9│┌────────────────────────────┬──────────┐│
│ ││┌────┬───┬───────┬──────┬──┐│32 5 2 4 9││
│ │││zach│bob│frankie│alison│jo││          ││
│ ││└────┴───┴───────┴──────┴──┘│          ││
│ │└────────────────────────────┴──────────┘│
└─┴─────────────────────────────────────────┘

Sadly, there is no alupsert function that would modify an existing value if a key is present, and otherwise add a new key-value pair. We can, of course, write one ourselves:

alupsert←{(k v)←⍵⋄3::⍺ alpush k v⋄⍺ alset k v} ⍝ Modify if present

⊢alist ← alist alupsert 'æthelflæd' 8
⊢alist ← alist alupsert 'frankie' ¯1

┌──────────────────────────────────────┬────────────┐
│┌─────────┬────┬───┬───────┬──────┬──┐│8 32 5 2 4 9│
││æthelflæd│zach│bob│frankie│alison│jo││            │
│└─────────┴────┴───┴───────┴──────┴──┘│            │
└──────────────────────────────────────┴────────────┘

┌──────────────────────────────────────┬─────────────┐
│┌─────────┬────┬───┬───────┬──────┬──┐│8 32 5 ¯1 4 9│
││æthelflæd│zach│bob│frankie│alison│jo││             │
│└─────────┴────┴───┴───────┴──────┴──┘│             │
└──────────────────────────────────────┴─────────────┘

The way this structure is defined is very neat – completely immutable, much closer in spirit to an Erlang dict than the Python equivalent. Any modifying operation returns a copy of the original with the change applied. In comparison, all the other things we’ve done are mutable – we change things in place. Of course, always returning a copy has a cost associated with it. In functional languages like Erlang where immutability is the norm, returning copies of things doesn’t actually copy much data. In Dyalog we have to pay the price if we want to do frequent modifications of a large association list. Also, as we can only pick a single item at a time, our “big data” retrieval test is somewhat disappointing:

alist ← words2 vals
]runtime "alist∘alget¨keys" ⍝ Manually interrupted after several minutes out of boredom

INTERRUPT

As they stand, the al* routines in the dfns ws are nicely ergonomic for small data sets, perhaps as a way to store the contents of some kind of structured configuration file, but they’re probably not the best choice for large data sets and frequent access or modification.

Wrapping up#

We all occasionally need to map keys to values, and whilst Dyalog does not come with a native dictionary type, we’ve explored several ways by which we can achieve the same thing.