Faster way of searching a string in text files [closed]

Closed. This question needs to be more focused. It is not currently accepting answers. Closed 3 years ago.

I need to search for a string, roughly 13 characters long, in a group of text files using C#. The number of text files varies and can range between 100 and 1000. The size of each file can range between 1 KB and 10 MB.
I tried the naive approach of opening each file, reading it line by line, and checking whether the string exists (using IndexOf), but this is too slow. I also tried the Boyer-Moore algorithm, which improved the timing by about 5 seconds, but it still feels slow.
Any ideas on how to speed up the search?

Solutions/Answers:

Answer 1:

Depending on how many times you want to run the search, you may or may not want a search engine. If you will search many times, use a search engine; otherwise, don't. I'm going to describe how to implement both scenarios here.

When using a search engine: It sounds like you're looking for substrings, which means you should index your files as such using your favorite search engine, preferably one you can customize (Lucene, Terrier, etc.). The technique you need here is trigram indexing, that is: all 3-character combinations are indexed. For example, 'foobar' generates 'foo', 'oob', 'oba' and 'bar'. When searching, do the same with your query and issue a search-engine query that ANDs all these trigrams. (That will run a merge join on the posting lists from the documents, which will return their IDs or whatever you put in the posting lists.)
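As a small illustration of the trigram step (the class and method names below are mine, not from any particular search engine), splitting a term into its overlapping 3-character windows can look like this in C#:

```csharp
using System;
using System.Collections.Generic;

public static class TrigramIndex
{
    // Yield every overlapping 3-character window of a term.
    // Terms shorter than 3 characters yield nothing.
    public static IEnumerable<string> Trigrams(string term)
    {
        for (int i = 0; i + 3 <= term.Length; i++)
            yield return term.Substring(i, 3);
    }

    public static void Main()
    {
        // 'foobar' splits into foo, oob, oba, bar
        Console.WriteLine(string.Join(",", Trigrams("foobar")));
    }
}
```

Both the indexer and the query side would run terms through the same function, so a query for 'foobar' becomes the conjunction foo AND oob AND oba AND bar over the posting lists.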

Alternatively, you can implement suffix arrays and index your files once. This gives a little more flexibility if you want to search for short (1-2 character) substrings, but the index is harder to maintain. (There is some research at CWI in Amsterdam on fast suffix-array indexing.)

When you only want to search a few times, the algorithm to use is either Boyer-Moore (I usually use Boyer-Moore-Sunday, as described in [Graham A. Stephen, String Searching Algorithms]) or a compiled DFA (you can construct one from an NFA, which is easier to build). However, that will only give you a small speed increase, for the simple reason that disk I/O is probably your bottleneck, and comparing a bunch of bytes that you need to decode anyway is quite fast.
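For reference, here is a sketch of Boyer-Moore-Horspool, a simplified Boyer-Moore variant that keeps only the bad-character shift table; it is easier to get right than full Boyer-Moore or the Sunday variant mentioned above (the Horspool class name is mine):

```csharp
using System;

public static class Horspool
{
    // Boyer-Moore-Horspool: returns the index of the first occurrence
    // of needle in haystack, or -1 if it does not occur.
    public static int IndexOf(string haystack, string needle)
    {
        if (needle.Length == 0) return 0;
        if (haystack.Length < needle.Length) return -1;

        // Bad-character table: how far we may shift when the character
        // aligned with the last needle position is c.
        var shift = new int[char.MaxValue + 1];
        for (int i = 0; i < shift.Length; i++) shift[i] = needle.Length;
        for (int i = 0; i < needle.Length - 1; i++)
            shift[needle[i]] = needle.Length - 1 - i;

        int pos = 0;
        while (pos <= haystack.Length - needle.Length)
        {
            // Compare right to left at the current alignment.
            int j = needle.Length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) j--;
            if (j < 0) return pos;
            pos += shift[haystack[pos + needle.Length - 1]];
        }
        return -1;
    }
}
```

For a 13-character needle this skips up to 13 characters per mismatch, which is why it beats a plain character-by-character scan; but as noted above, once the file is in memory the comparison is rarely the bottleneck.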

The biggest improvement you can make is not reading your file line by line, but in blocks. Configure NTFS to use a block size of 64 KB if you can, and read the files in multiples of 64 KB - think 4 MB or more in a single read. I'd even suggest using asynchronous I/O so that you can read and process previously read data at the same time. If you do it correctly, that should already give you a split-second implementation for 10 MB on most modern hardware.
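A rough sketch of block-wise asynchronous reading, assuming single-byte (e.g. ASCII) encoded files; the tail carried over between blocks keeps matches that straddle a block boundary from being missed (names such as BlockSearch.ContainsAsync are mine, not from any answer):

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

public static class BlockSearch
{
    // Read the file in large blocks instead of line by line and search
    // each block with a single IndexOf call.
    public static async Task<bool> ContainsAsync(string path, string pattern)
    {
        const int BlockSize = 4 * 1024 * 1024;   // 4 MB per read
        int overlap = pattern.Length - 1;        // tail kept between blocks
        var carry = string.Empty;
        var buffer = new byte[BlockSize];

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 64 * 1024, useAsync: true))
        {
            int read;
            while ((read = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                string block = carry + Encoding.ASCII.GetString(buffer, 0, read);
                if (block.IndexOf(pattern, StringComparison.Ordinal) >= 0)
                    return true;
                // Keep the last pattern.Length - 1 characters so a match
                // spanning two blocks is still found in the next round.
                carry = block.Length > overlap
                      ? block.Substring(block.Length - overlap)
                      : block;
            }
        }
        return false;
    }
}
```

For multi-byte encodings such as UTF-8 the boundary handling needs more care, since a read can end in the middle of a character.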

Last but not least, a neat trick used throughout information retrieval is to compress your data using a fast compression algorithm. Since disk I/O is slower than memory/CPU operations, this will probably help as well. Google's Snappy compressor is a good example of a fast compression algorithm.
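Snappy itself requires a third-party .NET package, so as a stand-in this sketch uses the framework's built-in GZipStream to show the store-compressed, decompress-in-memory idea (GZip is slower than Snappy, but needs no extra dependency; the CompressedStore name is mine):

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class CompressedStore
{
    // Write a compressed copy of a file. Reading the smaller compressed
    // file and decompressing in memory can beat reading the raw file,
    // because disk I/O is the bottleneck.
    public static void Compress(string src, string dst)
    {
        using (var input = File.OpenRead(src))
        using (var output = new GZipStream(File.Create(dst),
                                           CompressionLevel.Fastest))
        {
            input.CopyTo(output);
        }
    }

    // Decompress a stored file back into a string for searching.
    public static string ReadCompressed(string path)
    {
        using (var gz = new GZipStream(File.OpenRead(path),
                                       CompressionMode.Decompress))
        using (var reader = new StreamReader(gz, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```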

Answer 2:

You should consider using the operating system's file search with content indexing. Take a look at the Microsoft Windows Search 3.x SDK.

Or you can utilize PLINQ to search an array of files in parallel. See this link:

File Content and Directory Search using Directory.GetFiles and PLINQ
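A minimal PLINQ sketch along the lines of the linked article (the names below are mine; it reads each whole file into memory, so it suits the smaller files in your range):

```csharp
using System;
using System.IO;
using System.Linq;

public static class ParallelSearch
{
    // Search every *.txt file under root in parallel with PLINQ and
    // return the paths of the files that contain pattern.
    public static string[] FindFiles(string root, string pattern)
    {
        return Directory
            .GetFiles(root, "*.txt", SearchOption.AllDirectories)
            .AsParallel()                         // fan the files out over cores
            .Where(f => File.ReadAllText(f)
                            .IndexOf(pattern, StringComparison.Ordinal) >= 0)
            .ToArray();
    }
}
```

Since the workload is I/O-bound, the parallel speedup mostly comes from overlapping reads of many small files rather than from extra CPU.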

Answer 3:

Two options come to mind:

Read your text file into memory and search the whole string at once.

If that proves to be too slow or too memory-hungry, use an indexer like Apache Lucene. There is a nice and easy SDK for it available for .NET, called Lucene.Net.

Here is a small introduction for it:
http://www.codeproject.com/Articles/29755/Introducing-Lucene-Net
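The first option above, reading a whole file into memory and searching it in one call, can be sketched like this (the WholeFileSearch name is mine):

```csharp
using System;
using System.IO;

public static class WholeFileSearch
{
    // One IndexOf over the whole file beats a line-by-line scan: a
    // single read, no per-line allocation, no repeated search startup.
    public static bool Contains(string path, string pattern)
    {
        string text = File.ReadAllText(path);
        return text.IndexOf(pattern, StringComparison.Ordinal) >= 0;
    }
}
```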

Answer 4:

If your computer can handle it, try loading all the text files into memory (using the technique shown here) and then evaluate the text in memory.

If you can't handle all of the files at one time, then do this for the smallest files. File I/O is going to be your greatest expense here, so you want to minimize it as much as possible.

Answer 5:

You can use Microsoft's Indexing Service to search for documents in the folders you add to the catalogue. Here is a very nice article which you can use to search your text files.
