More elasticsearch

  • Affected App
    WoltLab Suite Forum

    Not sure what is going on here, upgraded to all the newest patches today, restarted elasticsearch just to be sure..

    Can I get some guidance on how elasticsearch is supposed to work with wildcards and such? Seeing more oddness.

    I have a post with the word 'viris' in it. If I search for 'is' or 'iris' with * before, after, or on both sides, I get no hits. However, when searching for 'is' I am getting hits for 'issues', 'distributed', 'disaster', but not 'viris'.

    Something is broken somewhere, just trying to find out where.

  • Using the wildcard operator is pointless because it gets removed from the query string anyway. Unless terms are enclosed in quotes, a wildcard is automatically added behind each word. Leading wildcards are disabled because they cannot be handled by an index and will cause elasticsearch to evaluate each term in the entire index; the performance impact is enormous.
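To make the described behaviour concrete, here is a minimal sketch of that kind of query normalization. It is not the plugin's actual code; the function name `normalize_query` and the exact splitting rules are assumptions for illustration only.

```python
import re


def normalize_query(raw: str) -> str:
    """Hypothetical normalizer: drop user wildcards, append one per bare word."""
    parts = []
    # Split the input into quoted phrases and bare words.
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', raw):
        if phrase:
            # Quoted terms stay exact phrases and get no wildcard.
            parts.append(f'"{phrase}"')
            continue
        term = word.replace("*", "")  # wildcards typed by the user are removed
        if term:
            parts.append(term + "*")  # trailing wildcard only, never a leading one
    return " ".join(parts)


print(normalize_query("*is"))
print(normalize_query('*vir* "exact phrase" foo'))
```

Under these assumptions a search for `*is` ends up as `is*`, which explains why it can match issues but never viris.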

  • There is something you might want to consider: The highlighting in search results is performed by our parser, which tries to find the proper matches in the text. It simply searches for the term, highlights the occurrences and adds a bit of text before and after, similar to Google's search results.

    When using short or common terms, chances are the parser will not highlight the correct words. I will try to explain this:

    In this example, the search matched on ghost, but the highlighter incorrectly assumed the first match to be correct. This creates the false impression that it actually matched on mighty even though this was not the case.

    The only way to improve this would be adding an option for elasticsearch to allow leading wildcards, thus allowing searches such as *is, which in turn would match viris. It is currently disabled in the code because the elasticsearch integration aims at very large forums and in these environments a leading wildcard search could cause serious performance issues - we have a general rule to not implement features that don't scale.
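The mismatch between what the index matched and what gets highlighted can be reproduced with a small sketch. This is not the actual parser; `prefix_matches` and `naive_highlight` are hypothetical helpers, and the search term `gh` is an assumed example that behaves like the ghost/mighty case above.

```python
def prefix_matches(text: str, term: str) -> list[str]:
    """Words the index would match: those that start with the term."""
    return [w for w in text.lower().split() if w.startswith(term.lower())]


def naive_highlight(text: str, term: str, context: int = 10) -> str:
    """Mark the first plain substring occurrence, with some surrounding text."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return text[: 2 * context]
    start = max(0, pos - context)
    stop = pos + len(term)
    end = min(len(text), stop + context)
    return text[start:pos] + "[" + text[pos:stop] + "]" + text[stop:end]


text = "The mighty ghost appeared"
print(prefix_matches(text, "gh"))   # the word that actually matched
print(naive_highlight(text, "gh"))  # the word that gets marked
```

The index hit is ghost, but the substring search stops at the first occurrence inside mighty, so the excerpt suggests a match that never happened.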

  • I still don't follow, sorry.

    Forget wildcards, why is 'is' found in some posts/words (disappeared, disaster, provisioned, issues, is, and so on), but not found in viris? Even 'ris' wasn't found in viris.

    PS just turned off ES and used built-in SQL and re-indexed, same stuff is happening. I'll test on my IPS4 and XF deployments, but I don't think they have these limitations.

  • elasticsearch takes each word in the search term and finds matches in the index, starting with exact matches and then looking for matches that start with the same sequence. This means that searching for def will match both def and defghi, but will not match abcdef. This can of course be changed by allowing leading wildcards (this is actually an ES setting when querying), which will then also find abcdef.

    As long as ES can perform a simple index lookup together with finding terms that start the same, things are lightning fast. Leading wildcards (and that is what you want!) are a performance disaster because all of a sudden ES has to crawl its entire index to find possible matches.

    I will try to explain it a bit less abstractly: Think of managing data for tens of thousands of customers, all stored in files ordered by the first 2 characters of their last name (e.g. Aa-Ae, Af-Ak, ..., Zv-Zz). If you are now asked to search for all customers whose name begins with Doe (including the exact match), it is as easy as grabbing the correct file and going through it. Now if you're asked to look for customers whose name contains doe, you'll have to look through all files to find possible matches. You never really want to do this and neither does ES, because it will be a nightmare.
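The filing-cabinet analogy can be sketched with a sorted list of terms. This is only an illustration of the idea, not how elasticsearch stores its index internally; the term list and both function names are made up for the example.

```python
from bisect import bisect_left


def prefix_lookup(sorted_terms: list[str], prefix: str) -> list[str]:
    """Jump to the first candidate, then read on while the prefix still matches."""
    i = bisect_left(sorted_terms, prefix)
    hits = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        hits.append(sorted_terms[i])
        i += 1
    return hits


def contains_lookup(sorted_terms: list[str], fragment: str) -> list[str]:
    """A leading wildcard gives no starting point, so every term is inspected."""
    return [t for t in sorted_terms if fragment in t]


terms = sorted(["abcdef", "def", "defghi", "disaster", "is", "issues", "viris"])

print(prefix_lookup(terms, "def"))   # def, defghi
print(prefix_lookup(terms, "is"))    # is, issues
print(contains_lookup(terms, "is"))  # also finds disaster and viris
```

The prefix lookup touches only a handful of neighbouring entries, while the contains lookup has to visit every term, which is the cost that grows with the size of the index.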

  • I've been working on the elasticsearch plugin lately and it will receive a bunch of changes with the next update. The version is still under development and not yet active on woltlab.com, just in case you were curious.

    Leading wildcards
    This is now an optional setting in the ACP, granting or denying the usage of leading wildcards such as *bar, which would match both bar and foobar. Since this requires more work for elasticsearch, it could potentially cause a higher load and therefore can be turned off. It will be enabled by default though.

    Decay for relevance-based search
    Until now, ordering by relevance has used only the relevance score, which is ultimately focused on the document's content itself. In most cases this is fine, but there are scenarios in which you want older documents to be considered less relevant because they could be outdated. To solve this, the next update will introduce an optional decay function applied to the score computed by elasticsearch.

    See this image for explanation: https://www.elastic.co/guide/en/elast…decay-functions

    You can configure a period which will be used as the value for "offset", meaning that all documents not older than the set period will be treated as equally current, receiving no penalty. The value for "scale" is half the value for "offset". In numbers: Let's say you set the period to 1 year, which means that all documents not older than 1 year will be considered to be equally current. Documents that are 1 1/2 years old will receive half the penalty; from this point on, it is up to the chosen function how older documents are handled.

    The image displays all 3 available decay functions. The gauss function is set by default, but you can freely pick any of them. Gauss is the most useful because it doesn't have a big impact on documents that are 13 or 14 months old, but will from then on cause a rapid decay.

    These settings only apply at query time and don't actually alter any data stored in elasticsearch, so you can simply change the settings in the ACP without the need to rebuild any data. This way you can experiment a bit with these settings to see what fits your purpose best.
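For anyone who wants to play with the numbers, here is a rough sketch of the three decay curves, following the formulas given in the elasticsearch documentation for decay functions. Ages are in months, the function name `decay_factor` is made up, and a "decay" value of 0.5 at offset + scale is assumed; the returned value is the factor the relevance score gets multiplied with.

```python
import math


def decay_factor(age: float, offset: float, scale: float,
                 kind: str = "gauss", decay: float = 0.5) -> float:
    """Score multiplier for a document of the given age (assumed unit: months)."""
    # Documents younger than the offset are treated as equally current.
    distance = max(0.0, age - offset)
    if kind == "gauss":
        sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
        return math.exp(-distance ** 2 / (2.0 * sigma_sq))
    if kind == "exp":
        return math.exp(math.log(decay) / scale * distance)
    if kind == "linear":
        s = scale / (1.0 - decay)
        return max((s - distance) / s, 0.0)
    raise ValueError(f"unknown decay function: {kind}")


# Period of 1 year: offset = 12 months, scale = half of that.
for age in (6, 13, 14, 18, 24):
    print(age, round(decay_factor(age, offset=12, scale=6), 3))
```

With these assumed values the gauss curve barely moves at 13 and 14 months, reaches 0.5 at 18 months and drops quickly afterwards, which matches the description above.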

  • Will the relevance-based search also work with posts?
    Until now posts are always sorted by time instead of by relevance, which makes the better search results useless...

  • This is a global setting in the ACP. It is set to time by default, because MySQL relevance search is horrible and favors ancient content over new content. The next version of the elasticsearch plugin will automatically change this to relevance if the ACP setting still equals the default, mostly because this will produce significantly better search results with elasticsearch. Given the additional configuration options described above, there is a lot of room for optimizations that can be done for each community.

  • The current version of wbb4.1 will never use the relevance search for posts at all; it is disabled in wbb\system\event\listener\SearchListener at line 33, and the current elasticsearch plugin doesn't enable it again.
    So even if you change this, it will always use time sorting for post search.

    This is one of the reasons why I wrote my own implementation to override this and enable relevance search again...

  • I'm afraid this is not correct. If you look closely at line 27 you'll notice that this only applies when displaying post search results as threads and when searching for threads started by a user.

    In every other case sorting is based upon the preset defined in the ACP option or the individual choice by the user when using the extended search form.

  • OK, you are right, but as I always search by thread this counts a lot for me.
    I'm not sure about the performance, but using a GROUP BY instead of a DISTINCT would allow you to get the relevance of the best fitting post and use this to sort the threads. I'm using this and it seems to deliver the best fitting results for thread searching...

    The point is: Currently it is not possible to find threads very well; you have to either use Google or change the search yourself...
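The GROUP BY idea can be tried out in isolation. The sketch below uses an in-memory SQLite database instead of MySQL, and the table `post_hits` with its columns is purely hypothetical; it only shows how taking the best post relevance per thread yields a thread ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of matching posts with a relevance score per post.
conn.execute(
    "CREATE TABLE post_hits (post_id INTEGER, thread_id INTEGER, relevance REAL)"
)
conn.executemany(
    "INSERT INTO post_hits VALUES (?, ?, ?)",
    [(1, 10, 0.2), (2, 10, 0.9), (3, 20, 0.5), (4, 30, 0.7), (5, 30, 0.1)],
)

# One row per thread, ranked by its best matching post.
rows = conn.execute(
    """
    SELECT thread_id, MAX(relevance) AS best
    FROM post_hits
    GROUP BY thread_id
    ORDER BY best DESC
    """
).fetchall()

print(rows)
```

A plain DISTINCT on the thread ID would return the same threads but lose the per-post score, so there would be nothing left to sort by except time.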
