
Posted by vok 3 days ago

You can beat the binary search (lemire.me)
282 points | 129 comments
ssivark 11 hours ago|
Daniel Lemire's points about low-level hardware optimization notwithstanding, it's worth pointing out that binary search (or low-level implementation variants) is the best only if you know nothing about the data beyond the fact that it is sorted / monotonic.

If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better. E.g. a human searching a physical paper dictionary can zoom in on the right bunch of pages faster than pure idealized binary search; it's a separate matter that it's hard for humans to continue binary search to the very end, so we tend to default to scanning linearly for the last few iterations (cognitive convenience / affordances of human wetware / etc.).

In mathematical language, searching a sorted list is basically inverting a monotonic function, by using a closed-loop control algorithm. Often, we could very well construct a suitable cost function and use gradient descent or its accelerated cousins.

More generally, the best bet for solving a problem more efficiently is always to use more information about the specific problem you want to solve, instead of pulling up the solution for an overly abstract representation. That can offer scalable, orders-of-magnitude speedups compared to the constant-factor speedups from just using hardware better.

charleslmunger 7 hours ago||
I've spent some brainpower on binary search and have not been able to beat this:

https://github.com/protocolbuffers/protobuf/blob/44025909eb7...

1. Check for dense list O(1)

2. Check upper bound

3. Constant trip count binary search

The constant trip count is great for the branch predictor, and the core loop is pretty tightly optimized for the target hardware, avoiding multiplies. Every attempt to get more clever made the loop worse and did not pay for itself. It's hard because it's an array-of-structs format with a 12-byte element size, and mostly pretty small N.
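
For reference, a minimal sketch of the constant-trip-count idea (not the upb code itself; names and types here are illustrative). The loop length depends only on the size of the range being searched, never on the key, so the only branch the predictor has to learn is the loop exit:

    // Branch-free lower bound with a trip count fixed by n (assumes n >= 1).
    #include <stddef.h>
    #include <stdint.h>

    static size_t lower_bound_fixed(const uint32_t* a, size_t n, uint32_t key) {
        size_t base = 0;
        size_t len = n;
        while (len > 1) {
            size_t half = len / 2;
            // compiles to a conditional move; no data-dependent branch
            base += (a[base + half - 1] < key) ? half : 0;
            len -= half;
        }
        return base;  // candidate index; caller still checks a[base] == key
    }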

ben-schaaf 23 minutes ago|||
For a pretty small N I've found that less clever can be quite a bit faster. I'd try a linear search - possibly SIMD if you can change the data format to struct-of-arrays. An adaptive approach that uses linear search up to a certain N can also yield some benefit.
echelon 6 hours ago|||
I know protobuf code is extremely high quality, but I really can't stand the C-style naming conventions.

I know people train themselves into grokking this and reading and writing this way, but it reads like "bork bork bork bork" runes to me.

I'm glad Rust feels more like Ruby and Python and that method and field names are legible.

My eyes just glaze over:

    UPB_API_INLINE
    const struct upb_MiniTableField* upb_MiniTable_FindFieldByNumber(
        const struct upb_MiniTable* m, uint32_t number) {
      const uint32_t i = number - 1;  // 0 wraps to UINT32_MAX
    
      // Ideal case: index into dense fields
      if (i < m->UPB_PRIVATE(dense_below)) {
        UPB_ASSERT(m->UPB_ONLYBITS(fields)[i].UPB_ONLYBITS(number) == number);
        return &m->UPB_ONLYBITS(fields)[i];
      }
    
      // Early exit if the field number is out of range.
      uint32_t hi = m->UPB_ONLYBITS(field_count);
      uint32_t lo = m->UPB_PRIVATE(dense_below);
      UPB_ASSERT(hi >= lo);
      uint32_t search_len = hi - lo;
      if (search_len == 0 ||
          number > m->UPB_ONLYBITS(fields)[hi - 1].UPB_ONLYBITS(number)) {
        return NULL;
      }
    
      // Slow case: binary search
      const struct upb_MiniTableField* candidate;
    #ifndef NDEBUG
      candidate = UPB_PRIVATE(upb_MiniTable_ArmOptimizedLowerBound)(
          m, lo, search_len, number);
      UPB_ASSERT(candidate ==
                 UPB_PRIVATE(upb_MiniTable_LowerBound)(m, lo, search_len, number));
    #elif UPB_ARM64_ASM
      candidate = UPB_PRIVATE(upb_MiniTable_ArmOptimizedLowerBound)(
          m, lo, search_len, number);
    #else
      candidate = UPB_PRIVATE(upb_MiniTable_LowerBound)(m, lo, search_len, number);
    #endif
    
      return candidate->UPB_ONLYBITS(number) == number ? candidate : NULL;
    }
teo_zero 6 hours ago|||
> I really can't stand the c-style naming conventions.

Honestly I don't see much difference between

  upb_MiniTable_FindFieldByNumber
and

  upb::MiniTable::FindFieldByNumber
charleslmunger 6 hours ago||||
Yeah, namespaces and public/private would be quite nice, but C doesn't have them, so they get hacked on via macros and prefixing. The syntax was not the hard part of working on or analyzing this code, though.
ahartmetz 6 hours ago|||
I think this needs way more "upb" and "UPB" to make it clear that it is, in fact, dealing with UPBs. Whatever these are.
jibal 2 hours ago||
> μpb (often written 'upb') is a small protobuf implementation written in C.
crazygringo 5 hours ago|||
Sure, but the whole point is that you often don't know anything further about the data.

That's why b-trees are the standard in databases. The data could be anything, and its characteristics could massively change at any time, as you suddenly import a whole bunch of new rows at once.

And while you can certainly design algorithms around e.g. gradient descent to try to accelerate lookup, b-trees are already incredibly fast, and have lots of other benefits like predictable worst-case performance and I/O requirements, supporting range scans, ordered traversal, prefix conditions, etc.

So yes, you can certainly design lookup algorithms that are more efficient for particular data distributions, but they will also often lack other important properties. And b-trees are already so fast, improvements are often negligible -- like even if another algorithm produces a closer initial guess, it may be slower to locate the final item, or it may be faster on average but have horrible worst-case performance that makes it unusable.

Even with a paper dictionary, I've always used pretty much a binary search beyond the initial guess, which only saves you a couple of hops. And actually, once I get to the right handful of pages I'm probably more linear than I should be, and I'd probably be faster if I tried to do a rigorous binary search, but I have to balance that with how long it takes to flip pages.

skybrian 4 hours ago||
Databases often use table statistics to try to do better at generating query plans. I wonder if they use them to make indexes faster as well?
10000truths 4 hours ago||
The cost model is a crude approximation of the actual query cost, and sometimes the query planner makes a terrible guess. Your resident DBA won't appreciate being paged at 3 AM on a Sunday because of it. A good strategy is to freeze the query plan once you have a sufficient sample size of data in the involved tables.
hinkley 10 hours ago|||
I swear I read an article about treaps, but instead of the weights being used to balance the tree, they were used to Huffman-encode the search depth, reducing the average access time for heterogeneous fetch frequencies.

I did not bookmark it and about twice a year I go searching for it again. Some say he’s still searching to this day.

ssivark 6 hours ago|||
Huffman coding assumes your corpus is a string of discrete elements (symbols) without any continuous structure (e.g. topology/geometry). With that fairly mild assumption, it gives a recipe to reorganize (transform/encode) your data as a prefix tree, minimizing the bits of information needed to communicate the contents of your corpus, i.e. reducing (on average) the bits of information you need to identify a specific item. To go back to the analogy from my previous comment above: if the function you are inverting via search has long plateaus, then you could simply front-load those as guesses; that's roughly the spirit of Huffman coding, except it eschews monotonicity.
mvelbaum 10 hours ago|||
https://arxiv.org/abs/2206.12110 ?
_jackdk_ 4 hours ago|||
Fritz Henglein has done some interesting work on fast sorting/grouping. I think Generic Discrimination: Sorting and Partitioning Unshared Data in Linear Time[1] is the main paper. Ed Kmett took those ideas and refined them into the discrimination[2] library for Haskell, and gave a very interesting tech talk about it[3].

[1]: https://dl.acm.org/doi/epdf/10.1145/1411203.1411220

[2]: https://hackage.haskell.org/package/discrimination

[3]: https://www.youtube.com/watch?v=cB8DapKQz-I

dnnddidiej 1 hour ago|||
For humans, binary searching a dictionary is slower because it requires a different physical action vs. scanning, and we have advanced OCR and page-flipping capabilities. Especially recognizing when something hasn't changed (e.g. still on the Es, keep flipping) takes maybe 10ms.
theptip 1 hour ago||
No, the point is you know exactly where T is just by looking at the dictionary (or at least, you learn this if you use a dictionary a lot).

IOW your prior on the data distribution lets you skip the first 4-5 binary chops.

rixed 10 hours ago|||
> it's worth pointing out that binary search (or low-level implementation variants) is the best only if you know nothing about the data beyond the fact that it is sorted / monotonic

Also if you do not learn anything about the data while performing the binary search, no? Like, if you are constantly below the estimate, you could guess that the distribution is biased toward large values and adjust your guess based on this prediction.

ssivark 6 hours ago|||
> Also [IFF] you do not learn anything about the data while performing the binary search, no?

Yes, absolutely!

I forgot to share this general perspective above, and it's too late to edit, so I'll add it here...

Since binary search assumes only monotonicity, splitting your interval into two equal parts extracts one bit of information per step, and any other choice would extract less information on average. One bit of information per step is how you end up needing log(n) steps to find the answer.

To accelerate your search, you basically need to extract those log(n) bits as fast as you can. You can think of that as leveraging both the prior and everything you learn along the way, to adaptively design each step as the optimal experiment for extracting the maximum amount of information. And adaptive local models of your search space (gradient / Hessian / etc.) allow you to extract many more bits of information from each query / experiment, provided the function you are inverting has some local structure.

PS: That is why we leverage these ideas to "search" for the optimum, among a space of solutions.

molf 9 hours ago||||
It's not possible to learn anything about other elements when performing binary search, _except_ the only thing there is to learn: if the target is before or after the recently compared element.

If we would guess that there is a bias in the distribution based on recently seen elements, the guess is at least as likely to be wrong as it is to be right. And if we guess incorrectly, in the worst case, the algorithm degrades to a linear scan.

Unless we have prior knowledge. For example: if there is a particular distribution, or if we know we're dealing with integers without any repetition (i.e. each element is strictly greater than the previous one), etc.

travisjungroth 7 hours ago|||
> If we would guess that there is a bias in the distribution based on recently seen elements, the guess is at least as likely to be wrong as it is to be right.

This is true for abstract and random data. I don't think it's true for real world data.

For example, Python's sort function "knows nothing" about the data you're passing in. But it does look for some shortcuts, and these end up saving time on average.

kryptiskt 9 hours ago|||
> It's not possible to learn anything about other elements when performing binary search, _except_ the only thing there is to learn: if the target is before or after the recently compared element.

You have another piece of information: you don't only know whether the element was before or after the compared element. You also know the delta between what you looked at and what you're looking for. And you also have the delta from the previous item you looked at.

wtallis 9 hours ago|||
And you always start off knowing the total length of the array, and the width of the datatype.

Actually deciding what to do with that information without incurring a bunch more cache misses in the process may be tricky.

xenadu02 8 hours ago||
Is the disconnect here that in many datasets there is some implicit distribution? For example if we are searching for english words we can assume that the number of words or sentences starting with "Q" or "Z" is very small while the ones starting with "T" are many. Or if the first three lookups in a binary search all start with "T" we are probably being asked to search just the "T" section of a dictionary.

Depending on the problem space such assumptions can prove right enough to be worth using despite sometimes being wrong. Of course if you've got the compute to throw at it (and the problem is large) take the Contact approach: why do one when you can do two in parallel for twice the price (cycles)?

LorenPechtel 2 hours ago|||
Assuming your key space is anything like randomly distributed.

Thinking about it: yeah, if you can anticipate anything like a random distribution, it's only a few extra instructions to reduce the number of values looked up. In the old days that would have been very unlikely to be a good deal, but with so many algorithms dominated by the cache (I've seen more than one case where a clearly less efficient algorithm that reduced memory reads turned out better), I suspect there are a lot of such things that don't go the way we learned them in the stone age.

Nevermark 8 hours ago|||
For a list of sorted values with no other knowledge, binary search is optimal. Provably so; it's simple information theory on binary decisions.

You can do better if the list is stable by reusing information.

But information gathered during searches is going to require great complexity to leverage, as searches are an irregular information-gathering scheme.

So spend RAM on speedup structures up front.

1) Create a table that maps the first 8 bits to upper and lower indexes in the list, then binary search within that range over the last 8 bits (see the sketch below). That cuts the search time roughly in half.

2) Go all the way and create an array of 32,768 indexes, with all-1s entries for misses. Either way (hit or miss), the lookup is O(1).

Stable lists allow for sliding parametric trade-offs between RAM lookup and binary search, from full lookup to full binary.
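
A minimal sketch of option 1), assuming 16-bit keys (the struct shape and bucket-boundary handling are illustrative, not from the comment above):

    // Bucket by the high 8 bits, then binary search only inside that bucket.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct PrefixIndex {
        std::vector<uint16_t> keys;   // sorted input
        uint32_t start[257];          // start[p] = first index whose (key >> 8) is >= p

        explicit PrefixIndex(std::vector<uint16_t> sorted) : keys(std::move(sorted)) {
            uint32_t i = 0;
            for (uint32_t p = 0; p <= 256; ++p) {
                while (i < keys.size() && (keys[i] >> 8) < p) ++i;
                start[p] = i;
            }
        }

        bool contains(uint16_t x) const {
            uint32_t p = x >> 8;      // O(1) jump to the right 256-value bucket
            return std::binary_search(keys.begin() + start[p],
                                      keys.begin() + start[p + 1], x);
        }
    };

Option 2) is the same table taken to its limit: index the full key space directly and skip the search entirely.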

painted-now 10 hours ago|||
> In mathematical language, searching a sorted list is basically inverting a monotonic function, by using a closed-loop control algorithm.

Never thought about it this way. Brilliant!

mycall 11 hours ago|||
Furthermore, with the vast and immediate knowledge that LLMs have, we could see a proliferation of domain-specific sorting algorithms designed for all types of purposes.
tantalor 10 hours ago|||
> use that extra information to perform MUCH better

Do you mean using a better estimator for the median value? Or something else?

giovannibonetti 4 hours ago||
Say, if you know the function is a polynomial of degree N, then with N+1 data points you can recover it, e.g. via Lagrange interpolation, although the finite precision of floating point might make that more complex in practice.
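
A minimal sketch of that, as plain double-precision Lagrange interpolation (the precision caveat above applies):

    #include <cstddef>
    #include <vector>

    // Evaluate the unique degree-N polynomial through the N+1 points (xs[i], ys[i]).
    double lagrange_eval(const std::vector<double>& xs,
                         const std::vector<double>& ys, double x) {
        double sum = 0.0;
        for (std::size_t i = 0; i < xs.size(); ++i) {
            double term = ys[i];
            for (std::size_t j = 0; j < xs.size(); ++j)
                if (j != i) term *= (x - xs[j]) / (xs[i] - xs[j]);
            sum += term;
        }
        return sum;
    }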
locknitpicker 10 hours ago|||
> If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better.

You don't even need priors. See interpolation search, where knowing the position and value of two elements in a sorted list already allows the search to make an educated guess about where the element it's searching for is, by linearly interpolating between those elements.
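
A minimal sketch of interpolation search over a sorted int array (the divide-by-zero guard and bounds handling are one of several reasonable choices):

    #include <cstddef>

    // Probe where the target "should" be if values were uniformly distributed
    // between a[lo] and a[hi], instead of always probing the midpoint.
    long interp_search(const int* a, std::size_t n, int x) {
        if (n == 0) return -1;
        std::size_t lo = 0, hi = n - 1;
        while (lo <= hi && x >= a[lo] && x <= a[hi]) {
            if (a[hi] == a[lo]) break;  // plateau: avoid dividing by zero
            std::size_t pos = lo + (std::size_t)((double)(x - a[lo]) * (hi - lo)
                                                 / (a[hi] - a[lo]));
            if (a[pos] == x) return (long)pos;
            if (a[pos] < x) lo = pos + 1;
            else            hi = pos - 1;
        }
        return (a[lo] == x) ? (long)lo : -1;
    }

On uniformly distributed keys this averages O(log log n) probes, but as the sibling comment notes, it can do worse than binary search when the distribution is skewed.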

rv64imafdc 10 hours ago|||
> knowing the position and value of two elements in a sorted list

That's a prior about the distribution, if a relatively weak one (in some sense, at least).

esafak 9 hours ago||
https://en.wikipedia.org/wiki/Empirical_Bayes_method
darknoon 10 hours ago|||
This relies on knowledge of the distribution; interpolating on A = [1, 2, 4, 8, 16, ..., 2^(n-1)] is slower than binary search.
drob518 14 hours ago||
Isn't "quaternary" just sort of unrolling the binary search loop by one level? I mean, to find the partition in which the item is located, you still do roughly the same rough number of comparisons. You're just taking them 4 at a time, not 2 at a time. Seems like loop unrolling would give you the same.
nkurz 13 hours ago||
It's trickier than that. Modern processors are speculative, which means that they guess at the result for a comparison and keep going along one side of a branch as far as they can until they are told they guessed wrong or hit some internal limit. If they guessed wrong, they throw away the speculative work, take a penalty of a handful of cycles, and do the same thing again from a different starting point.

Essentially, this means that all loops are already unrolled from the processor's point of view, minus a tiny bit of overhead for the loop itself that can often be ignored. Since in binary search the main cost is grabbing data from memory (or from cache in the "warm cache" examples), this means that the real game is how to get the processor to issue the requests for the data you will eventually need as far in advance as possible, so you don't have to wait as long for it to arrive.

The difference in algorithm for quad search (or anything higher than binary) is that instead of taking one side of each branch (and thus prefetching deeply in one direction), you prefetch all the possible cases but with less depth. This way you are guaranteed to have successfully issued the prefetch you will eventually need, and are spending slightly less of your bandwidth budget on data that will never be used in the actual execution path.

As others are pointing out, "number of comparisons" is an almost useless metric when comparing search algorithms if your goal is predicting real world performance. The limiting factor is almost never the number of comparisons you can do. Instead, the potential for speedup depends on making maximal use of memory and cache bandwidth. So yes, you can view this as loop unrolling, but only if you consider how branching on modern processors works under the hood.
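
One concrete way to act on that, short of the article's quaternary layout, is to prefetch both candidate midpoints of the next iteration so the load is already in flight whichever way the comparison goes. A rough sketch (GCC/Clang __builtin_prefetch; not the article's code):

    #include <cstddef>

    std::size_t lower_bound_prefetch(const int* a, std::size_t n, int x) {
        std::size_t lo = 0, len = n;
        while (len > 1) {
            std::size_t half = len / 2;
            // issue loads for both possible midpoints of the *next* step
            __builtin_prefetch(&a[lo + half / 2]);
            __builtin_prefetch(&a[lo + half + half / 2]);
            lo += (a[lo + half - 1] < x) ? half : 0;
            len -= half;
        }
        return lo;  // candidate index; caller checks whether a[lo] >= x
    }

Quad search generalizes this: rather than prefetching two candidates one level deep, it commits bandwidth to all the candidates of a shallower, wider step.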

drob518 12 hours ago||
Yea, I get that the actual comparison instruction itself is insignificant. It's everything that goes along with it. Seems like quaternary is fetching more data, however.

For instance, if you have 8 elements, 01234567, and you're looking for 1, with binary, you'd fetch 4, 2, and then 1. With quaternary, you'd fetch 2, 4, 6, then 1. Obviously, if you only have 8 elements, you'd just delegate to the SIMD instruction, but if this was a much larger array, you'd be doing more work.

I guess on a modern processor, eliminating the data dependency is worth it because the processor's branch prediction and speculation only follows effectively a single path.

Would be interesting to see this at a machine cycle level on a real processor to understand exactly what is happening.

LoganDark 12 hours ago||
It's not about doing more or less work; it's about doing the work faster. For instance, it's relatively common to discover that some recomputation can be faster than caching or lookup tables. Similarly, fetching more from memory can also be faster if it means you make fewer round trips.
crdrost 11 hours ago||
Well, that's where I thought this link was going to go before it went down the SIMD path... We already have a way to beat binary search: it's called B-trees. It has the same basic insight: take 64 evenly spaced elements from your data set, compare against all of them rapidly, and instead of bifurcating your search space once, you split it 64 ways (the equivalent of six binary steps). Because those 64 elements are stored contiguously in an array, they take only one fetch and you get cache locality. As you have more elements you need to repeat this lookup table three or four or five times, which costs a bit of extra space; so what if we make it not cost space by just storing the data itself in these lookup tables...
wtallis 13 hours ago|||
Yes, this can be seen as unrolling the loop a bit. It improves performance not by significantly reducing the number of instructions or memory reads, but by relaxing the dependencies between operations so that it doesn't have to be executed purely serially. You could also look at it as akin to speculatively executing both sides of the branch.
mayoff 13 hours ago|||
Quaternary search effectively performs both of the next loop iteration’s possible comparisons simultaneously with the current iteration’s comparison. This is a little more complex than simple loop unrolling.

Regardless, both kinds of search are O(log N) with different constants. The constants don’t matter so much in algorithms class but in the real world they can matter a lot.

loeg 13 hours ago|||
Sort of, yes, but you're also removing a data dependency between the unrolled stages.
pfortuny 13 hours ago||
It is because processors do not do what one might naively think they do.
lalitmaganti 11 hours ago||
I also wrote recently [1] about Exponential Search [2], which is another algorithm for when you need to repeatedly binary search an array and the elements you're searching for are themselves sorted. It allowed for an 8x speedup in our workload!

[1] https://lalitm.com/post/exponential-search/ [2] https://en.wikipedia.org/wiki/Exponential_search

10000truths 9 hours ago|
Exponential search is useful when you're querying a REST API that addresses resources with sequential IDs, and need the last ID, but there's no dedicated endpoint for it:

  HEAD /users/1 -> 200 OK
  HEAD /users/2 -> 200 OK
  HEAD /users/4 -> 200 OK
  ...
  HEAD /users/2048 -> 200 OK
  HEAD /users/4096 -> 404 Not Found
And then a binary search between 2048 and 4096 to find the most recent user (and incidentally, the number of users). Great info to have if you're researching competing SaaS companies.
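
The same probe pattern as code, assuming a monotone exists(id) predicate (e.g. a wrapper around the HEAD request) that is true up to some last ID and false afterwards:

    #include <cstdint>
    #include <functional>

    // Exponential probing to bracket the last existing ID, then binary search.
    uint64_t find_last_id(const std::function<bool(uint64_t)>& exists) {
        if (!exists(1)) return 0;            // nothing there at all
        uint64_t hi = 1;
        while (exists(hi * 2)) hi *= 2;      // 2, 4, 8, ...; first miss brackets the answer
        uint64_t lo = hi, top = hi * 2;      // answer lies in [hi, 2*hi)
        while (lo + 1 < top) {               // ordinary binary search on the bracket
            uint64_t mid = lo + (top - lo) / 2;
            if (exists(mid)) lo = mid; else top = mid;
        }
        return lo;                           // largest ID for which exists() was true
    }

Total cost is O(log N) probes even though N is unknown up front.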
drkrab 9 hours ago||
Since the CPU always accesses a full cache line (64 bytes) at a time, you might as well search the entire cache line (it's practically free once the data is on-CPU). So I'd like to try a 'binary' search that tests all the values in the 'middle cache line' and then chooses to go left or right if none match. You can do the cache line search as a single 512-bit SIMD instruction. A cache line is 64 bytes (or 32 16-bit integers); such a search might well be almost 32 times faster than simple binary search; at least it'll do 32x fewer memory accesses, which will dominate in most realistic programs.
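
A rough sketch of the per-cache-line compare with AVX2, using int32 keys (16 per 64-byte line) rather than the 16-bit values above; an AVX-512 build could do all 16 lanes in a single compare:

    #include <immintrin.h>
    #include <cstdint>

    // How many of the 16 int32 keys in one cache line are less than x?
    // In a search, this locates x's position within the line in one step.
    int count_less_in_line(const int32_t* line, int32_t x) {
        __m256i key = _mm256_set1_epi32(x);
        __m256i a   = _mm256_loadu_si256((const __m256i*)line);
        __m256i b   = _mm256_loadu_si256((const __m256i*)(line + 8));
        // lanes become all-ones where key > element, i.e. element < x
        int m_a = _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpgt_epi32(key, a)));
        int m_b = _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpgt_epi32(key, b)));
        return __builtin_popcount((unsigned)(m_a | (m_b << 8)));   // 0..16
    }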
nly 9 hours ago|
Searching the upper cache lines in your binary search tree (sorted vector) for your target is unlikely to yield results. Instead you want to use the extra data in the line to shorten the search, which leads you to a B-Tree or B+tree.

For 4-byte keys and 4-byte child pointers (or indexes into an array), your inner nodes would have 7 keys, 8 child pointers and 1 next pointer, completely filling a 64-byte cache line, and your tree depth for 1 million entries would go down from ~20 to ~7, the top few levels of which are likely to remain cache resident.

With some thought, it's possible to use SIMD on B-tree nodes to speed up the search within the node, but it's all very data dependent.
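
That node layout as a sketch, with child indexes into a flat node array standing in for pointers:

    #include <cstdint>

    struct alignas(64) BTreeNode {
        uint32_t keys[7];    // 28 bytes: sorted separator keys
        uint32_t child[8];   // 32 bytes: indexes of the 8 children
        uint32_t next;       //  4 bytes: next leaf (or a sentinel)
    };
    static_assert(sizeof(BTreeNode) == 64, "exactly one cache line");

Each node visit is one cache-line fetch that answers a full 8-way decision, which is where the ~20 vs ~7 depth figure comes from.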

aidenn0 6 hours ago||
Binary searching a sorted array is isomorphic to a sorted binary tree with implicit child pointers.

It seems to me like there should be a sort order that stores the items as a fully-dense left-shifted binary tree from top to bottom (e.g. like the implicit heap in an in-place heap sort, but a binary search tree instead of a heap). Is there a name for this? Does it show any performance wins in practice?

Rusky 3 hours ago||
There's Eytzinger order: https://algorithmica.org/en/eytzinger
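
For reference, the layout and branchless lower bound from that page, lightly adapted (GCC/Clang builtins; the sample data is just an illustration):

    #include <cstdio>
    #include <vector>

    // Eytzinger (BFS) layout: t[1] is the root and node k's children are 2k and 2k+1.
    std::vector<int> a = {1, 3, 5, 8, 13, 21, 34};   // sorted input
    int n = (int)a.size();
    std::vector<int> t(n + 1);                       // t[0] unused

    // Fill t by an in-order walk of the implicit tree.
    int build(int i = 0, int k = 1) {
        if (k <= n) {
            i = build(i, 2 * k);
            t[k] = a[i++];
            i = build(i, 2 * k + 1);
        }
        return i;
    }

    // Branchless descent; __builtin_ffs(~k) cancels the trailing right turns,
    // leaving k as the index of the first element >= x (0 if none exists).
    int lower_bound(int x) {
        int k = 1;
        while (k <= n) k = 2 * k + (t[k] < x);
        k >>= __builtin_ffs(~k);
        return k;
    }

    int main() {
        build();
        int k = lower_bound(8);
        std::printf("%d\n", k ? t[k] : -1);   // prints 8
    }

The appeal over a plain sorted array is that the hot top levels of the tree sit packed together at the front, so the first few probes stay within a handful of cache lines.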
ted_dunning 2 hours ago||
Seems like you can use the core intuition here of SIMD comparison of multiple elements on more than just the terminal scale.

The outline would be:

a) use a gather to grab multiple elements from 16 evenly spaced locations

b) compare these in parallel using a SIMD instruction

c) focus in on the correct block

d) if the block is small revert to linear search, else repeat the gather/compare cycle

Even though the gather instruction is reading from non-contiguous memory, and reading more than you normally would need to with a binary search, enabling a multiway compare and collapsing 4 levels of binary search should be a win on large tables.

You also may not be able to do a full 16-way comparison for all data types. Searching for float64 will limit you to 8-way comparisons (with AVX-512), but int32 and float32 will allow 16-way comparisons.

dragontamer 2 hours ago|
Gather is extremely slow. Anyone aiming for efficiency will avoid gathers.

I bet you a binary search is in fact faster than any gather based methodology.

taeric 14 hours ago||
If you are talking smaller arrays, linear search with a sentinel value at the end is already tough to beat. The thing that sucks about that claim is that "smaller" is such a nebulous qualifier that it is really hard to internalize.
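
For reference, a minimal sketch of the sentinel trick (assumes the caller has reserved one writable slot past the last element):

    #include <cstddef>

    // Plant the target in the spare slot so the scan needs no bounds check per step.
    long sentinel_find(int* a, std::size_t n, int x) {
        int saved = a[n];          // the spare slot at index n
        a[n] = x;                  // sentinel guarantees the loop terminates
        std::size_t i = 0;
        while (a[i] != x) ++i;
        a[n] = saved;              // restore the slot
        return i < n ? (long)i : -1;
    }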
rao-v 14 hours ago||
This is simply not true - if you look at this article’s excellent benchmarking, linear search falls behind somewhere around 200-400 elements.

In general I love this article, it took what I’ve often wondered about and did a perfect job exploring with useful ablation studies.

KalMann 12 hours ago|||
I don't really see how this implies the above commenter's statement is "simply not true".
taeric 13 hours ago||||
I don't think std::find typically uses a sentinel, though?
BeetleB 13 hours ago||||
For that machine and compiler version, yes.
eggprices 14 hours ago|||
Except on Apple, where binary search always wins. Does anyone know why?
stephencanon 13 hours ago||
Prior to the current generation Intel designs, Apple’s branch predictor tables were a good deal larger than Intel’s IIRC, so depending on benchmarking details it’s plausible that Apple Silicon was predicting every branch perfectly in the benchmark, while Intel had a more real-world mispredict rate. Perf counters would confirm.
SuperV1234 14 hours ago||
That's not what the article is about.
srcreigh 12 hours ago||
The algorithm description was a bit confusing for me.

The SIMD part is just in the last step, where it uses SIMD to search the last 16 elements.

The Quad part is that it checks 3 points to create 4 paths, but also it's searching for the right block, not just the right key.

The details are a bit interesting. The author chooses to use the last element in each block for the quad search. I'm curious how the algorithm would change if you used the first element in each block instead, or even an arbitrary element.

vjay15 34 minutes ago||
Surprisingly simple concept!
XCSme 7 hours ago||
Is it possible to do some sort of Binary* Search (Binary Star, as in the A* search algorithm, where we use heuristics)?

    a: [1,3,5,7,8,9,10,15]  
    x: 8 (query value)
For this array, we would compare a[0], a[3], a[7] (left/mid/right) by subtracting x (8).

And we would get d=[-7, -1, 7]

Now, normally, with binary search, because 8 > mid, we would go to (mid+right)/2, BUT we already have some extra information: we see that x is closer to a[3] (diff of 1) than to a[7] (diff of 7), so instead of going to the middle between mid and right, we could choose a new "mid" point that's closer to the desired value (maybe as a ratio of d[right]-d[mid]).

so left=mid+1, right stays the same, but new mid is NOT half of left and right, it is (left+right)/2 + ratioOffset

Where ratioOffset makes the mid go closer to left or right, depending on d.

The idea is quite obvious, so I am pretty sure it already exists.

But what if we use SIMD with it? So we know not only which block the number is in, but also which part of the block the number is likely in. Or is this what the article actually says?

mbowring 3 hours ago|
Yeah this is basically interpolation search
XCSme 3 hours ago||
Oh, that's what the article was referring to with "interpolation".

Weird that I hadn't heard about it before; is it not used much in practice?

One reason I could see is that binary search is fast enough and easy to implement. Even on the largest datasets it's still just a few tens of loop iterations.

alexfoo 13 hours ago|
The classical canonical Comp Sci algorithms are effectively "designed" for CPUs with no parallelism (either across multiple cores, via Hyper-threading technology, or "just" SIMD style instructions), and also where all memory accesses take the same amount of time (so no concept of L1/L2/L3/etc caches of varying latencies). And all working on general/random data.

As soon as you move away from either (or both) of these assumptions then there are likely to be many tweaks you can make to get better performance.

What the classical algorithms do offer is a very good starting point for developing a more optimal/efficient solution once you know more about the specific shape of data or quirks/features of a specific CPU.

When you start to get at the pointy end of optimising things, you generally end up looking at how the data is stored and accessed in memory, and whether any changes you make to improve this hurt things further down the line. In a job many, many years ago, I remember someone who spent way too long optimising a specific part of some code, only to find that the overall application ran slower because the optimisations meant that a lot of information needed later on had been evicted from the cache.

(This is probably just another way of stating Rob Pike's 5th rule of programming which was itself a restatement of something by Fred Brooks in _The Mythical Man Month_. Ref: https://www.cs.unc.edu/~stotts/COMP590-059-f24/robsrules.htm...)
