- mysql fulltext search
- lucene search engine
- sphinx www.sphinxsearch.com
I came across Sphinx while searching for full-text search engines. It is a very good engine. A few points about Sphinx:
-> Very simple to configure, build an index and search.
-> Very easy to integrate with PHP and MySQL; APIs for both are available. I was able to build an index and search it with Sphinx within a few hours.
-> The index is built from all the indexed MySQL fields combined, with no distinction between fields at search time. So you can search the index as a whole, but not individual fields within it.
-> Since the source code is available, the search process can be customized to your needs. Moreover, version 0.9.6, which is under development, will provide field-wise search.
-> Since it is written in C++, it is supposed to be faster than Lucene.
I ran the benchmark on my own laptop, a Dell Inspiron 700m running Linux (Fedora Core 4), kernel 2.6.11. Machine configuration:
Processor : Intel(R) Pentium(R) M processor 1.60GHz
Cache size : 2048 KB
Memory : 512 MB
I took a table containing 1 lakh (100,000) records, with a data size of 456 MB, and created an index on some of its fields.
Mysql Version - 5.1.9 (Beta)
Stop words : Built in
Indexing words of length >= 2 and <= 84 (MySQL lets you specify the minimum and maximum length of the words you want to index, through the ft_min_word_len and ft_max_word_len settings. The default minimum is 4. I changed it to 2 so that I could index words like net, php, hr etc. If you want to index all words, change it to 1.)
Indexing time : 1440 seconds
Index size : 263 MB (456 MB data - remains same).
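To make the effect of the word-length limits concrete, here is a small illustration. It is only a sketch: it splits on whitespace and applies the length bounds, and it ignores stop words and whatever else MySQL's real parser does. The function name and the sample sentence are made up for this example.

```python
def indexable_words(text, min_len=4, max_len=84):
    """Keep only whitespace-separated words whose length is within bounds.

    Illustration only: this is not MySQL's tokenizer and it ignores stop words.
    """
    return [w for w in text.lower().split() if min_len <= len(w) <= max_len]


sample = "PHP and NET jobs for HR managers"

# With the default minimum of 4, the short terms are dropped.
print(indexable_words(sample))             # ['jobs', 'managers']

# With the minimum lowered to 2, php, net and hr become searchable.
print(indexable_words(sample, min_len=2))
```

With the default minimum, a search for "php" or "hr" could never match, which is why I lowered the limit before indexing.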
Lucene Version - 1.9.1
Stop words : 660 general words (like is, if, this etc...)
Indexing time : 2761 seconds (the default configuration was used during indexing. Parameters such as mergeFactor and maxMergeDocs can be tuned to make indexing much faster, though this may result in a "Too many open files" error on Linux.)
Index Size : 109 MB (No data is stored. I stored only the unique id of each document, which I can use to retrieve the document later.)
Sphinx Version - 0.9.5
Stop words : NONE
Indexing time : 246 seconds (using the default configuration. I don't know whether indexing can be tuned.)
Index Size : 211.1 MB (3.1 MB Index - .spi file + 208 MB Data - .spd file).
A 3.1 MB index file looks extremely good. Also, with Sphinx there is no need to maintain a separate id for retrieving the data, since the unique id of your data is kept in the index. With Lucene, you have to maintain a separate id and enforce its uniqueness in your own program. Sphinx's indexing time and data compression are both good.
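To put the indexing numbers side by side, here is a quick calculation using only the figures reported above (100,000 records, 456 MB of data). The function and variable names are mine. Keep in mind that the index sizes are not strictly comparable: the Lucene index stores no document data, while the Sphinx figure includes the 208 MB .spd file.

```python
# Figures reported in the benchmark above.
RECORDS = 100_000
DATA_MB = 456.0

# engine -> (indexing time in seconds, index size in MB)
RESULTS = {
    "MySQL 5.1.9": (1440, 263.0),
    "Lucene 1.9.1": (2761, 109.0),
    "Sphinx 0.9.5": (246, 211.1),
}


def summarize(results, records=RECORDS, data_mb=DATA_MB):
    """Return records indexed per second and index/data size ratio per engine."""
    summary = {}
    for engine, (seconds, index_mb) in results.items():
        summary[engine] = {
            "records_per_sec": records / seconds,
            "index_to_data": index_mb / data_mb,
        }
    return summary


if __name__ == "__main__":
    for engine, row in summarize(RESULTS).items():
        print(f"{engine:<13} {row['records_per_sec']:7.1f} rec/s  "
              f"index = {row['index_to_data']:.0%} of data")
```

By these figures, Sphinx indexed roughly 11 times faster than Lucene and nearly 6 times faster than MySQL, all on default settings.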
The searches were run from scripts. I ran a number of searches on randomly selected words and averaged the times. For Lucene and MySQL, the search was done on 2 fields with an OR between them. For Sphinx, the search was done on the complete index.
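For the MySQL case, a two-field OR search could be built along these lines. This is a sketch, not the script I used: the table name (documents) and column names (id, title, body) are hypothetical, and each column is assumed to have its own FULLTEXT index. The MATCH ... AGAINST syntax and the IN BOOLEAN MODE modifier are standard MySQL, and in boolean mode a double-quoted string is matched as an exact phrase. The %s placeholders follow the parameter style of common Python MySQL drivers.

```python
def build_fulltext_query(term, fields=("title", "body"), table="documents",
                         id_column="id", exact_phrase=False):
    """Build a MySQL full-text query that ORs one MATCH clause per field.

    Table and column names are hypothetical placeholders.
    Returns (sql, params) suitable for cursor.execute(sql, params).
    """
    if exact_phrase:
        # Boolean mode treats a double-quoted string as an exact phrase.
        against = "AGAINST (%s IN BOOLEAN MODE)"
        value = '"' + term.replace('"', " ") + '"'
    else:
        against = "AGAINST (%s)"
        value = term
    clauses = [f"MATCH ({field}) {against}" for field in fields]
    sql = f"SELECT {id_column} FROM {table} WHERE " + " OR ".join(clauses)
    params = tuple(value for _ in fields)
    return sql, params


sql, params = build_fulltext_query("php")
print(sql)
print(params)
```

The OR means the engine has to produce a result set per field and combine them, which is extra work that the Sphinx searches on the combined index did not have to do.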
Columns recorded for each engine: Searches/Thread, Concurrency (number of simultaneous threads), Total searches, Total time (milliseconds), Average time (milliseconds).

MySQL: I found that searching for an exact phrase, which can be done using "in boolean mode" queries, is more resource hungry. Query times in MySQL are extremely high. MySQL runs purely on RAM, so with more RAM and MySQL configured accordingly, the queries would be much faster. Concurrency does not affect query execution speed to a major extent.

Lucene: Initially searches are slow, but as we keep searching, the index gets cached in RAM and the searches speed up. The searches are very fast compared to MySQL. From these results, concurrency does not seem to be an issue. But I have been using Lucene for some time now and have found that it does have some concurrency issues. If you are searching a huge index of, say, 1 crore (10,000,000) records with an index size of 8-10 GB, then at a concurrency of 100 simultaneous searches issues pop up, as searches seem to get locked. Still, performance-wise it is much better than MySQL.

Sphinx: Single searches are faster than in Lucene. But we have to take into account that there is no OR clause in the search, so the engine does not have to fetch 2 different result sets and take their union. However, as the concurrency of searches is increased, the average time per search does not drop as much as it does in Lucene, which suggests there may be concurrency issues here. Since I have not explored this in depth, I cannot comment on the problems related to concurrency.
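The measurement procedure itself can be sketched as a small harness: each thread runs a fixed number of searches on randomly chosen words, and the harness reports the same columns as the results above. This is an assumed reconstruction, not my original script. The search function is a stand-in for a real query against MySQL, Lucene or Sphinx, and the average is assumed to be total wall-clock time divided by total searches.

```python
import random
import threading
import time


def run_benchmark(search, words, searches_per_thread, concurrency, seed=0):
    """Run `searches_per_thread` searches in each of `concurrency` threads."""
    rng = random.Random(seed)
    # Pick every thread's query words up front so timing covers only searches.
    plans = [
        [rng.choice(words) for _ in range(searches_per_thread)]
        for _ in range(concurrency)
    ]

    def worker(plan):
        for word in plan:
            search(word)

    threads = [threading.Thread(target=worker, args=(plan,)) for plan in plans]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total_ms = (time.perf_counter() - start) * 1000.0

    total_searches = searches_per_thread * concurrency
    return {
        "searches_per_thread": searches_per_thread,
        "concurrency": concurrency,
        "total_searches": total_searches,
        "total_time_ms": total_ms,
        # Assumption: average = total wall-clock time / total searches.
        "average_time_ms": total_ms / total_searches,
    }


def fake_search(word):
    # Stand-in for a real query; replace with a call to the engine under test.
    return [doc for doc in ("php net", "hr php") if word in doc.split()]


if __name__ == "__main__":
    for concurrency in (1, 5, 10):
        print(run_benchmark(fake_search, ["php", "net", "hr"], 20, concurrency))
```

Because the average is computed from wall-clock time, it drops as concurrency rises if the engine handles parallel searches well, and stays flat if searches end up serialized. That is the pattern to look for when comparing engines.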
To sum up, Lucene seems to be the right choice for now if you need to search large amounts of data and performance is your primary goal. Its list of features is also impressive. Sphinx comes next: its indexing time is very small, and indexing and searching are hassle-free. As it evolves, Sphinx may overtake Lucene some time down the line, offering both a good feature list and performance. MySQL fulltext search is the last option. It seems it should be used only if the data set is small and quick development time is required.
Date : 12 Aug 2009
Some latest benchmarks on Lucene Versus Sphinx