High performance mass short string search in Python
The Problem: A large static list of strings is given as A, and a long string is given as B. The strings in A are all very short (a keyword list). I want to check which strings in A are substrings of B and collect them. Right now I use a simple loop:

```python
result = []
for word in A:
    if word in B:
        result.append(word)
```

But it's crazy slow when A contains ~500,000 or more items. Is there any library or algorithm that fits this problem? I've tried my best to search but no luck. Thank you!
Your problem is large enough that you probably need to hit it with the algorithm bat.
Take a look into the Aho-Corasick Algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.
Also, look into the work by Nicholas Lehuen with his PyTST package.
There are also references in a related Stack Overflow message that mention other algorithms such as Rabin-Karp: Algorithm for linear pattern matching?
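For production use, a C-backed library such as pyahocorasick will be far faster, but to make the idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton (an illustrative implementation, not any particular library's API): build a trie of the keywords, add failure links with a BFS, then scan B once.

```python
from collections import deque

def aho_corasick_find(keywords, text):
    """Return the set of keywords that occur anywhere in text."""
    # Build the trie; each node is a dict with children, a failure link
    # and the list of keywords that end at (or via failure links, within) it.
    trie = [{"next": {}, "fail": 0, "out": []}]
    for word in keywords:
        node = 0
        for ch in word:
            nxt = trie[node]["next"].get(ch)
            if nxt is None:
                trie.append({"next": {}, "fail": 0, "out": []})
                nxt = len(trie) - 1
                trie[node]["next"][ch] = nxt
            node = nxt
        trie[node]["out"].append(word)

    # BFS to compute failure links; parents are processed before children,
    # so the fail target's output list is already complete when we copy it.
    queue = deque(trie[0]["next"].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node]["next"].items():
            queue.append(child)
            fail = trie[node]["fail"]
            while fail and ch not in trie[fail]["next"]:
                fail = trie[fail]["fail"]
            trie[child]["fail"] = trie[fail]["next"].get(ch, 0)
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]

    # Scan the text once, following failure links on mismatch.
    found = set()
    node = 0
    for ch in text:
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        found.update(trie[node]["out"])
    return found
```

The scan is O(len(B) + number of matches) regardless of how many keywords there are, which is exactly what you want with 500,000 of them.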
Depending on how long your long string is, it may be worth it to do something like this:
```python
ls = 'my long string of stuff'

# Generate all possible substrings of ls, keeping only uniques
x = set(ls[p:y] for p in range(len(ls)) for y in range(p + 1, len(ls) + 1))

result = []
for word in A:
    if word in x:
        result.append(word)
```
Obviously if your long string is very, very long then this also becomes too slow (the number of substrings grows quadratically with its length), but it should be faster for any string under a few hundred characters.
I don’t know if this would be any quicker, but it’s a lot more pythonic:
```python
result = [x for x in A if x in B]
```
Pack up all the individual words of B into a new list by splitting the original string on ' '. Then, for each of those words, test for membership in A. If you find one (or more), delete it/them from A, and quit as soon as A is empty.
Your current approach will grind through all 500,000 candidates every time, with no way to stop early.
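The idea above, with a set for O(1) membership tests and an early exit, might look like the following sketch. Note a caveat: it only finds keywords that appear as whole space-separated words in B, not as arbitrary substrings.

```python
def find_keywords(A, B):
    """Return the keywords from A that occur as whole words in B."""
    remaining = set(A)            # set membership is O(1), list membership is O(n)
    found = []
    for token in B.split(' '):
        if token in remaining:
            found.append(token)
            remaining.discard(token)
            if not remaining:     # quit as soon as A is exhausted
                break
    return found
```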
Assume all your keywords are the same length (later you could extend this algorithm to different lengths). I would suggest the following:
precalculate some hash for each keyword (for example an XOR hash):

```python
from functools import reduce  # reduce is a builtin in Python 2, in functools on Python 3

hash256 = reduce(int.__xor__, map(ord, keyword))
```
create a dictionary whose keys are the hashes and whose values are lists of the keywords with that hash
save keyword length
Then slide a window over B, hashing each full window and checking the dictionary:

```python
curr_keyword = []
for x in B:
    curr_keyword.append(x)
    if len(curr_keyword) == keyword_length:
        hash256 = reduce(int.__xor__, map(ord, curr_keyword))
        if hash256 in dictionary_of_hashed:
            pass  # search the bucket's keyword list for an exact match
        curr_keyword = curr_keyword[1:]
```
Something like this