# High performance mass short string search in Python

**The problem:** A large static list of strings is provided as `A`, and a long string is provided as `B`. The strings in `A` are all very short (a keyword list). I want to check which strings in `A` are substrings of `B` and collect them.

Right now I use a simple loop:

```python
result = []
for word in A:
    if word in B:
        result.append(word)
```

But it's crazy slow when `A` contains ~500,000 or more items.

Is there any library or algorithm that fits this problem? I've tried my best to search, but no luck.

Thank you!

Your problem is large enough that you probably need to hit it with the algorithm bat.

Take a look into the Aho-Corasick Algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.
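For concreteness, here is a minimal sketch using the third-party pyahocorasick package (my example, not something named in this thread; any Aho-Corasick implementation would do):

```python
import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for word in A:
    automaton.add_word(word, word)   # payload: the keyword itself
automaton.make_automaton()           # build the failure links for matching

# A single pass over B reports every keyword that occurs in it;
# the scan cost does not grow with the number of keywords.
result = list({found for _end, found in automaton.iter(B)})
```

The whole point of the algorithm is that one pass over `B` finds all matches at once, instead of scanning `B` once per keyword.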

Also, look into the work by Nicholas Lehuen with his PyTST package.

There are also references in a related Stack Overflow question that mention other algorithms, such as Rabin-Karp: Algorithm for linear pattern matching?
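For reference, a bare-bones Rabin-Karp sketch for many patterns of equal length might look like this (the function name and parameters are mine, not from the linked question):

```python
def rabin_karp_multi(patterns, text, base=256, mod=(1 << 61) - 1):
    """Find which equal-length patterns occur in text via a rolling hash."""
    k = len(next(iter(patterns)))       # assumes all patterns share length k
    if len(text) < k:
        return set()

    def poly_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    wanted = {}
    for p in patterns:
        wanted.setdefault(poly_hash(p), set()).add(p)

    found = set()
    h = poly_hash(text[:k])
    top = pow(base, k - 1, mod)         # weight of the character leaving the window
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        if h in wanted and window in wanted[h]:   # verify to rule out collisions
            found.add(window)
        if i + k < len(text):
            # slide the window: drop text[i], append text[i + k]
            h = ((h - ord(text[i]) * top) * base + ord(text[i + k])) % mod
    return found
```

With the question's data this would be called as `rabin_karp_multi(set(A), B)`, again assuming all keywords share one length.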

Depending on how long your long string is, it may be worth it to do something like this:

```python
ls = 'my long string of stuff'
# Generate all possible substrings of ls, keeping only uniques
x = {ls[p:y] for p in range(len(ls)) for y in range(p + 1, len(ls) + 1)}

result = []
for word in A:
    if word in x:        # set membership is O(1) per lookup
        result.append(word)
```

Obviously, if your long string is very long this also becomes too slow (a string of length n has O(n²) distinct substrings to generate and store), but it should be faster for any string under a few hundred characters.

I don’t know if this would be any quicker, but it’s a lot more Pythonic:

```python
result = [x for x in A if x in B]
```

Pack all the individual words of `B` into a new list by splitting the original string on `' '`. Then, for each element of that list, test for membership in `A`. If you find one (or more), delete it/them from `A`, and quit as soon as `A` is empty.

It seems like your current approach, with no opt-out set in place, will grind through all 500,000 candidates on every pass.
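A minimal sketch of that idea, assuming the keywords contain no spaces (splitting on `' '` only finds keywords that are whole tokens of `B`, not arbitrary substrings):

```python
remaining = set(A)              # the "opt-out" set: shrinks as keywords are found
result = []
for token in B.split(' '):
    if token in remaining:
        result.append(token)
        remaining.discard(token)   # report each keyword once
        if not remaining:          # quit as soon as A is exhausted
            break
```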

Assume all your keywords are the same length (you could later extend this algorithm to handle different lengths).

I would suggest the following (steps 2 and 3 are sketched in code after the list):

1. Precalculate a hash for each keyword, for example an XOR hash:

```python
from functools import reduce

hash256 = reduce(int.__xor__, map(ord, keyword))
```

2. Create a dictionary where the key is a hash and the value is the list of keywords with that hash.

3. Save the keyword length.
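Steps 2 and 3 might look like this (a sketch; `xor_hash` is just a helper name I chose):

```python
from collections import defaultdict
from functools import reduce

def xor_hash(word):
    # Step 1's hash: XOR of all character codes (cheap, but collision-prone)
    return reduce(int.__xor__, map(ord, word))

dictionary_of_hashed = defaultdict(list)
for keyword in A:
    dictionary_of_hashed[xor_hash(keyword)].append(keyword)   # step 2

keyword_length = len(A[0])   # step 3: every keyword has this length
```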

Then scan `B` with a sliding window of that length:

```python
from functools import reduce

curr_keyword = []
result = []
for x in B:
    curr_keyword.append(x)
    if len(curr_keyword) > keyword_length:
        curr_keyword.pop(0)                      # slide the window forward
    if len(curr_keyword) == keyword_length:
        hash256 = reduce(int.__xor__, map(ord, curr_keyword))
        if hash256 in dictionary_of_hashed:
            # hash matched: verify against the actual keywords in the bucket
            candidate = ''.join(curr_keyword)
            if candidate in dictionary_of_hashed[hash256]:
                result.append(candidate)
```

Something like this should work.