Hyper Estraier is a small search engine for the lazy

Small, because its speed is really not impressive compared to Sphinx; for the lazy, because it is very simple.
So what caught my attention, despite the modest performance?
1. Real-time indexing.
2. Document attributes that can be used for searching and for sorting the results.
3. Ease of use and compact, clear documentation (a couple of days was enough to study it; in fact, a cursory skim of the docs was what prompted a closer look at the product).

My impressions of Hyper Estraier:

This search engine comes from FAL Labs, another of whose products I tested recently. The short User's Guide (34 screens on a laptop, two thirds of which cover parameters and configs) with some interesting features tempted me to experiment.
Personally, I spent about a day studying the description, a few minutes on installation, plus half an hour to work through the simplest usage scenario: indexing three documents and playing with the results.
Another day went into writing a quick-and-dirty indexing program for a realistic amount of data to evaluate performance, and one more day into studying and testing the client/server mode.

Installation.
The standard installation with default settings:
$ ./configure
$ make
$ make install
Requires:
— libiconv (part of glibc);
— zlib, for data compression;
— QDBM, an embeddable database from the same FAL Labs; it installs the same way as above.

The index.
For indexing, a document must be presented in the "document draft" format, which is ideologically close to the HTTP protocol: header lines, an empty line, then the text.
The header lists attributes in the form "@attribute=value", one attribute per line.
The text is plain text. The file must be in UTF-8.
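As a minimal sketch of building such a draft (the URI and title values here are made up for illustration; @uri and @title are standard attribute names from the documentation):

```python
def make_draft(attrs, text):
    """Build a Hyper Estraier "document draft": attribute lines
    in "@name=value" form, an empty line, then the plain-text body."""
    header = "\n".join("@%s=%s" % (k, v) for k, v in attrs.items())
    return header + "\n\n" + text + "\n"

draft = make_draft(
    {"uri": "http://example.com/post/12345", "title": "Sample forum post"},
    "This is the plain-text body of the post.",
)
print(draft)
```

The resulting string, saved as a UTF-8 file, is what the indexer consumes.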

Operation.
1. The simplest option: the command line.
The estcmd utility creates the search index, manages it, and runs searches. With the -vh options, search results are printed with snippets in a multipart format: the first line is the block separator, and the first block is the result header (how many documents matched, the search time, the number of matches for each word, and so on). Parsing such results is nice and easy.
The package also includes a simple CGI script, so you can search in a more familiar way, through a browser.
If you want prettier output, parse the results with whatever tools you find convenient.
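A parser for that layout can be sketched as follows. Note that the sample text below is reconstructed from the description above (first line is the separator, first block is the result header with tab-separated fields), not verbatim estcmd output, so treat the field names as assumptions to check against your version:

```python
# Illustrative sample, not real estcmd output: separator on the first line,
# then a header block, then one block per matched document.
SAMPLE = """--------[BOUNDARY]--------
HIT\t2
HINT#1\tsearch\t2
TIME\t0.012
--------[BOUNDARY]--------
@uri=http://example.com/post/1
snippet text ...
--------[BOUNDARY]--------
@uri=http://example.com/post/2
snippet text ...
"""

def parse_result(text):
    sep = text.splitlines()[0]  # the first line is the block separator
    blocks = [b for b in text.split(sep + "\n") if b.strip()]
    header, docs = blocks[0], blocks[1:]
    meta = {}
    for line in header.splitlines():
        parts = line.split("\t")
        # multi-valued fields (e.g. per-word hint lines) keep a list
        meta[parts[0]] = parts[1:] if len(parts) > 2 else parts[1]
    return meta, docs

meta, docs = parse_result(SAMPLE)
print(meta["HIT"], len(docs))
```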
2. The more complex option: client-server.
This is preferable for multi-user work with the search database. Also, whereas in the first option some time is spent opening the database on every run, here that cost is paid once, and caching of recent queries noticeably speeds up repeated requests.
Available search interfaces:
— the C API (documented with simple examples);
— a web interface for simple searching and database management;
— the command-line utility estcall, which in fact sends the same HTTP requests to the server; its search output is similar to what was described in the previous section.
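Since estcall just talks HTTP to the server, the upload of one draft can be sketched directly. The host, node name and credentials below are assumptions; the put_doc endpoint, the draft content type and the default port 1978 are taken from the node API documentation, so verify them against your version:

```python
import base64
import urllib.request

def build_put_doc(host, port, node, user, password, draft):
    """Prepare the HTTP request that uploads one document draft
    to an estmaster node (roughly what `estcall put` does)."""
    url = "http://%s:%d/node/%s/put_doc" % (host, port, node)
    auth = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    return urllib.request.Request(
        url,
        data=draft.encode("utf-8"),
        headers={
            "Content-Type": "text/x-estraier-draft",
            "Authorization": "Basic " + auth,
        },
        method="POST",
    )

req = build_put_doc("localhost", 1978, "forum", "admin", "admin",
                    "@uri=http://example.com/post/1\n\nbody text\n")
print(req.full_url)
# To actually send it: urllib.request.urlopen(req)
```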

Speed.

Testing was performed on the same server as last time: an Opteron 2218, 2.6 GHz, 8 GB RAM, 73 GB HDD + 143 GB SAS.
This time all the work was done on the 143 GB SAS drive.
Source data: 3,224,992 posts from the forums of one project, about 700 MB in total.
Indexing. The data was extracted in batches of 5000 files, converted to UTF-8, and indexed.
— the first option (command line) took 6 hours, almost to the minute;
— the second option (files fed to the server one at a time) took about 10.5 hours.
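The batch layout for the first option can be sketched like this. The paths, file naming and the post source are hypothetical; only the 5000-files-per-batch grouping comes from the text, and each resulting directory would then be handed to estcmd (something like `estcmd gather casket batch0000`):

```python
import os

BATCH_SIZE = 5000  # batch size used in the test described above

def write_batches(posts, out_dir, batch_size=BATCH_SIZE):
    """posts: iterable of (uri, text) pairs, already converted to UTF-8.
    Writes one document-draft file per post, grouped into numbered
    batch directories of batch_size files each."""
    for i, (uri, text) in enumerate(posts):
        batch_dir = os.path.join(out_dir, "batch%04d" % (i // batch_size))
        os.makedirs(batch_dir, exist_ok=True)
        path = os.path.join(batch_dir, "%07d.est" % i)
        with open(path, "w", encoding="utf-8") as f:
            f.write("@uri=%s\n\n%s\n" % (uri, text))
```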

Disk usage:
— the index produced by the first option takes about 5.3 GB on disk;
— the index produced by the second option, about 6.3 GB.
Why that is so, I do not understand. Perhaps it has to do with the server working concurrently with multiple indexes, internally called "nodes".

Search speed.
Unfortunately, I have not yet managed to gather reasonably comprehensive statistics on this; I did not get around to bombarding the engine with a massive stream of queries. I can share subjective impressions:
1. The first queries against a freshly built index take quite long, about one and a half seconds.
2. Repeating the same query, for example when paging through its results, takes no more than a hundredth of a second.
3. Verbose queries take longer: for example, a 5-word query (I was hunting down the last remnants of spam) took on the order of 0.17 seconds even when paging.
All queries were made through the search engine's web interface.

Conclusions.

  • The capacity of this engine is enough for most sites, except perhaps major media outlets and LJ-scale blog platforms.
  • Installation, configuration, and day-to-day use are quite simple and do not require high qualifications.
  • With Hyper Estraier you can in fact index any documents from which text can be extracted. Links to some parsers for other formats are given in the documentation. There is also a crawler of its own for indexing web pages.


I am going to run it on a group of forums with a traffic of 10-20 thousand posts/comments per day.

P.S. I have not covered all of Hyper Estraier's capabilities. If I understand correctly, it can search across several index nodes, and there is a multi-machine mode of operation. So it is quite possible that the real "power" of the engine is much higher than what I managed to reach. Those who like to test things to destruction have work left to do :)
Article based on information from habrahabr.ru
