High performance mass short string search in Python

High performance mass short string search in Python

The Problem: A large static list of strings is provided as A, A long string is provided as B, strings in A are all very short (a keywords list), I want to check if every string in A is a sub-string of B and get them.
Now I use a simple loop like:
result = []
for word in A:
    if word in B:

But it's crazy slow when A contains ~500,000 or more items.
Is there any library or algorithm that fits this problem? I've tried my best to search but no luck.
Thank you!


Answer 1:

Your problem is large enough that you probably need to hit it with the algorithm bat.

Take a look into the Aho-Corasick Algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.

Also, look into the work by Nicholas Lehuen with his PyTST package.

There are also references in a related Stack Overflow message that mention other algorithms such as Rabin-Karp: Algorithm for linear pattern matching?

Answer 2:

Depending on how long your long string is, it may be worth it to do something like this:

ls = 'my long string of stuff'
#Generate all possible substrings of ls, keeping only uniques
x = set([ls[p:y] for p in range(0, len(ls)+1) for y in range(p+1, len(ls)+1)])

result = []
for word in A:
    if word in x:

Obviously if your long string is very, very long then this also becomes too slow, but it should be faster for any string under a few hundred characters.

Answer 3:

I don’t know if this would be any quicker, but it’s a lot more pythonic:

result = [x for x in A if x in B]

Answer 4:

Pack up all the individual words of B into a new list, consisting of the original string split by ' '. Then, for each element in B, test for membership against each element of A. If you find one (or more), delete it/them from A, and quit as soon as A is empty.

It seems like your approach will blaze through 500,000 candidates without an opt-out set in place.

Answer 5:

Assume you has all keywords of the same length (later you could extend this algo for different lengths)

I could suggest next:

  1. precalculate some hash for each keyword (for example xor hash):

    hash256 = reduce(int.__xor__, map(ord, keyword))
  2. create a dictionary where key is a hash, and value list of corresponding keywords

  3. save keyword length

    curr_keyword = []
    for x in B:
      if len(curr_keyword) == keyword_length:
         hash256 = reduce(int.__xor__, map(ord, curr_keyword))
         if hash256 in dictionary_of_hashed:
            #search in list
      curr_keyword = curr_keyword[1:]

Something like this