Elasticsearch - fuzzy query

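Overview

Natural-language search is inherently imprecise. Because computers cannot truly understand natural language, there are many approaches to search, each with its own strengths and weaknesses. The fuzzy query is one of them, and it helps with problems such as user-name lookups and misspellings.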

Fuzzy Query

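A fuzzy query uses Levenshtein edit distance to return documents that contain terms similar to the search term.

(As used here, the Levenshtein distance is the number of insertions, deletions, substitutions, and transpositions needed to make one string match another.)

It can be used on text and keyword fields.

Edit distance is the number of single-character changes needed to turn one term into another. (The larger the allowed edit distance, the more expensive it is to compute matches efficiently.) These changes can be: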

Changing a character (box -> fox)
Removing a character (black -> lack)
Inserting a character (sic -> sick)
Transposing two adjacent characters (act -> cat)
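
The four kinds of edits above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code Elasticsearch runs: the function name edit_distance is made up for this post, and it implements the "optimal string alignment" variant, which is assumed here because it counts an adjacent transposition as a single edit.

```python
def edit_distance(a: str, b: str) -> int:
    """Count substitutions, deletions, insertions and adjacent
    transpositions (one edit each) needed to turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # remove a character
                d[i][j - 1] + 1,         # insert a character
                d[i - 1][j - 1] + cost,  # change (or keep) a character
            )
            # two adjacent characters swapped: count as a single edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Each example pair from the list above is exactly one edit apart.
for source, target in [("box", "fox"), ("black", "lack"), ("sic", "sick"), ("act", "cat")]:
    print(source, "->", target, edit_distance(source, target))
```

With plain Levenshtein distance (no transposition step), act -> cat would cost two edits instead of one.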

Parameters for fuzzy

value : The term you wish to find in the provided field.
fuzziness : Maximum edit distance allowed for matching. See the link below for valid values.
- 0, 1, 2 : The maximum allowed Levenshtein edit distance (or number of edits).
- AUTO : Generates an edit distance based on the length of the term. Low and high distance arguments may optionally be provided as AUTO:[low],[high]. If not specified, the defaults are 3 and 6, equivalent to AUTO:3,6, which gives: terms of length 0-2 must match exactly, terms of length 3-5 allow one edit, and terms longer than 5 allow two edits.
- https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness
max_expansions : Maximum number of variations created. Defaults to 50. (Use with care: large values can cause performance problems.)
prefix_length : Number of beginning characters left unchanged when creating expansions. Defaults to 0.
transpositions : Indicates whether edits include transpositions of two adjacent characters (ab -> ba).
rewrite : Method used to rewrite the query. For valid values and more information, see the Elasticsearch documentation for the rewrite parameter.
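
To make the parameters above more concrete, here is a small Python sketch. The helper names auto_fuzziness and fuzzy_query are hypothetical (they are not part of any Elasticsearch client), and the "title" field is borrowed from the example later in this post. The first function mimics how AUTO:[low],[high] maps a term length to an allowed number of edits. The second only builds a request body for the term-level fuzzy query as a plain dict and does not send anything to a cluster.

```python
def auto_fuzziness(term: str, low: int = 3, high: int = 6) -> int:
    """Edits allowed for `term` under AUTO:[low],[high] (defaults: AUTO:3,6)."""
    length = len(term)
    if length < low:
        return 0  # short terms must match exactly
    if length < high:
        return 1  # medium-length terms allow one edit
    return 2      # longer terms allow two edits


def fuzzy_query(field: str, value: str, fuzziness="AUTO",
                max_expansions: int = 50, prefix_length: int = 0,
                transpositions: bool = True) -> dict:
    """Build a request body for a term-level fuzzy query on one field."""
    return {
        "query": {
            "fuzzy": {
                field: {
                    "value": value,
                    "fuzziness": fuzziness,
                    "max_expansions": max_expansions,
                    "prefix_length": prefix_length,
                    "transpositions": transpositions,
                }
            }
        }
    }


for term in ["ab", "box", "black", "elastic"]:
    print(term, len(term), auto_fuzziness(term))

print(fuzzy_query("title", "elastik", prefix_length=2))
```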

Searching with a fuzzy query

{
	"query": {
		"bool": {
			"must": [
				{
					"multi_match": {
						"query": "{search_query}",
						"fields": [ "title" ],
						"type": "best_fields",
						"operator": "AND",
						"slop": 0,
						"fuzziness": "1",
						"prefix_length": 0,
						"max_expansions": 50,
						"zero_terms_query": "NONE",
						"auto_generate_synonyms_phrase_query": true,
						"fuzzy_transpositions": true,
						"boost": 1.0
					}
				}
			]
		}
	}
}
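
When the {search_query} placeholder is filled in from application code, building the body as a data structure and serializing it is usually safer than pasting the user's text into a JSON string, because quotes in the input are then escaped automatically. A minimal Python sketch follows, assuming the same title field and settings as the request above. The function name build_search_body is made up for illustration, and only a subset of the options is included.

```python
import json


def build_search_body(search_query: str, fuzziness: str = "1",
                      prefix_length: int = 0, max_expansions: int = 50) -> str:
    """Return the multi_match request body as a JSON string."""
    body = {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": search_query,
                            "fields": ["title"],
                            "type": "best_fields",
                            "operator": "AND",
                            "fuzziness": fuzziness,
                            "prefix_length": prefix_length,
                            "max_expansions": max_expansions,
                            "fuzzy_transpositions": True,
                        }
                    }
                ]
            }
        }
    }
    # json.dumps escapes quotes and backslashes in the user's input
    return json.dumps(body, ensure_ascii=False, indent=2)


print(build_search_body('elastik "search"'))
```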

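Caveats

Combining fuzzy queries with an ngram analyzer or synonyms can produce odd search results.
- For text that is meant to be searched with fuzzy queries, it makes sense to use only a simple analyzer and to disable synonyms.
The Levenshtein distance implementation is fast, but it is still much slower than an ordinary match query. (Query execution time grows with the number of unique terms in the index.)
- A regular search is a binary search, while a fuzzy search uses a DFA (deterministic finite automaton).
Setting a prefix (prefix_length) improves performance considerably. (It is similar to a LIKE 'xx%' query in an RDB.)
Other tools, such as a snowball analyzer or ngram, may be a better fit for handling misspellings.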
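
The effect of prefix_length can be illustrated with a toy model in Python. This is not how Lucene works internally (it is a simplification that scans a small sorted list of terms), and the names levenshtein and fuzzy_candidates are made up for this sketch. The point is only that a fixed prefix shrinks the set of terms that have to be compared at all, much like LIKE 'xx%' can use an index range.

```python
from bisect import bisect_left


def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance (no transpositions), two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # remove
                           cur[j - 1] + 1,             # insert
                           prev[j - 1] + (ca != cb)))  # change or keep
        prev = cur
    return prev[-1]


def fuzzy_candidates(sorted_terms, query, max_edits=1, prefix_length=0):
    """Return (matching terms, number of terms actually compared)."""
    prefix = query[:prefix_length]
    # Terms sharing a prefix sit next to each other in a sorted list,
    # so the scan can start at the first one and stop at the first miss.
    start = bisect_left(sorted_terms, prefix)
    matches, compared = [], 0
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        compared += 1
        if levenshtein(term, query) <= max_edits:
            matches.append(term)
    return matches, compared


terms = sorted(["black", "block", "box", "fox", "lack", "sick", "sock"])
print(fuzzy_candidates(terms, "bock", max_edits=1, prefix_length=0))
print(fuzzy_candidates(terms, "bock", max_edits=1, prefix_length=1))
```

In this toy data set, prefix_length=1 compares 3 terms instead of 7, but it also drops "sock", because a term whose first character differs can no longer match. That is the trade-off to keep in mind when raising prefix_length.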

References

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html

igooo
