Elasticsearch hybrid search with RRF (Reciprocal rank fusion)

Overview

The previous post (https://blog.igooo.org/157) covered semantic search in Elasticsearch using the semantic_text field type. This post shows how to use hybrid search, which combines semantic search with full-text search.

In hybrid search, semantic search retrieves results based on the meaning of the text, while full-text search focuses on exact word matches. By combining the two methods, hybrid search can return more relevant results when either one alone does not produce good enough results.

Getting Started

Requirements

Elasticsearch (a version that supports the inference API)
Inference APIs (Text embedding)

Refer to the Elasticsearch setup from the previous post. (See https://blog.igooo.org/157)

Creating the inference endpoint

Create an inference endpoint with the Create inference API.

PUT /_inference/text_embedding/openai-embeddings
{
	"service": "openai",
	"service_settings": {
		"api_key": "{API_KEY}",
		"model_id": "text-embedding-ada-002",
		"url": "https://api.openai.com/v1/embeddings"
	}
}

Creating the index mapping

Define the mapping for hybrid search.

Define the content field for full-text search.
- Set copy_to so that text written to content at index time is also copied to the semantic-text field.
Define the semantic-text field for semantic search.
Define the category field with the integer type.

PUT hybrid-search
{
	"mappings": {
		"properties": {
			"semantic-text": {
				"type": "semantic_text",
				"inference_id": "openai-embeddings"
			},
			"content": {
				"type": "text",
				"copy_to": "semantic-text"
			},
			"category": {
				"type": "integer"
			}
		}
	}
}
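
The mapping only behaves as intended when the copy_to target and the semantic_text field use exactly the same name. The following is a minimal Python sketch of a pre-flight check for that; the helper name find_dangling_copy_to is hypothetical, and it only inspects top-level properties.

```python
from typing import Any


def find_dangling_copy_to(mapping: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (source_field, target) pairs whose copy_to target is not declared.

    Hypothetical helper for illustration; only top-level properties are checked.
    """
    properties = mapping["mappings"]["properties"]
    dangling = []
    for name, definition in properties.items():
        targets = definition.get("copy_to", [])
        if isinstance(targets, str):  # copy_to accepts one name or a list of names
            targets = [targets]
        for target in targets:
            if target not in properties:
                dangling.append((name, target))
    return dangling


mapping = {
    "mappings": {
        "properties": {
            "semantic-text": {
                "type": "semantic_text",
                "inference_id": "openai-embeddings",
            },
            "content": {"type": "text", "copy_to": "semantic-text"},
            "category": {"type": "integer"},
        }
    }
}

# An empty list means every copy_to target is a declared field.
print(find_dangling_copy_to(mapping))
```

Running the check before sending the PUT request makes a hyphen/underscore mismatch between field names visible early.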

Loading data

Index documents that match the mapping created above.

POST /hybrid-search/_doc
{
	"content": "PC 게임 이용 중 개인정보에 등록된 휴대폰 번호로 SMS 인증 및 본인 명의 휴대폰.....",
	"category": 2
}

POST /hybrid-search/_doc
{
	"content": "본인이 직접 생성하지 않았으나 본인 명의로 계정이 생성되어 있다면 명의가 도용된 경우일 .....",
	"category": 1  
}
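
When there are many documents, sending them one request at a time becomes tedious. As a rough sketch, the Python snippet below serializes documents into the newline-delimited JSON body that the _bulk API expects. The function name build_bulk_body and the sample documents are placeholders, not part of the original setup, and the resulting string would still need to be sent to POST /_bulk with an HTTP client of your choice.

```python
import json


def build_bulk_body(index: str, docs: list[dict]) -> str:
    """Serialize docs as newline-delimited JSON for the _bulk API."""
    lines = []
    for doc in docs:
        # Action line: index this document into the given index.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself (keep non-ASCII text readable).
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"


# Placeholder documents shaped like the mapping (content + category).
docs = [
    {"content": "first sample document", "category": 2},
    {"content": "second sample document", "category": 1},
]

body = build_bulk_body("hybrid-search", docs)
print(body)
```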

Hybrid-search

Once all the data has been stored, it can be queried with hybrid search.

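The first standard retriever represents traditional lexical search.
- It runs a full-text search on the content field, which is mapped as text.
The second standard retriever represents semantic search.
- It runs a semantic search on the semantic-text field, which is mapped as semantic_text.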

POST hybrid-search/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": {
                "content": "상품 선물하기 방법"
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic-text",
                "query": "상품 선물하기 방법"
              }
            }
          }
        }
      ]
    }
  },
  "_source": ["content", "category"]
}


# Response
{
	"took": 452,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 2,
			"relation": "eq"
		},
		"max_score": null,
		"hits": [
			{
				"_index": "semantic-embeddings",
				"_id": "yCvpbJMBh4hPZA9jwQ5D",
				"_score": 0.8905029,
				"_source": {
					"category": 1,
					"content": "캐릭터 선물하기 신청이 ....."
				}
			},
			......
		]
	}
}
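
If the hybrid search request is issued from application code, it can help to build the body programmatically so that both retrievers always receive the same query text. The following Python sketch is one way to do that; the function name build_hybrid_search and its default field names are assumptions chosen to match the mapping in this post.

```python
import json


def build_hybrid_search(
    query_text: str,
    lexical_field: str = "content",
    semantic_field: str = "semantic-text",
    source_fields: tuple = ("content", "category"),
) -> dict:
    """Build an rrf retriever body with one lexical and one semantic retriever."""
    lexical = {"standard": {"query": {"match": {lexical_field: query_text}}}}
    semantic = {
        "standard": {
            "query": {"semantic": {"field": semantic_field, "query": query_text}}
        }
    }
    return {
        "retriever": {"rrf": {"retrievers": [lexical, semantic]}},
        # _source sits at the top level, next to retriever.
        "_source": list(source_fields),
    }


search_body = build_hybrid_search("상품 선물하기 방법")
print(json.dumps(search_body, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted into the console as the body of a POST hybrid-search/_search request.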

Regular search

Searches that use neither semantic nor full-text search can also be run against the same index.

GET hybrid-search/_search
{
    "query": {
        "term": {
            "category": {
                "value": 1
            }
        }
    },
    "_source": ["content", "category"]
}

RRF (Reciprocal rank fusion)

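Reciprocal rank fusion (RRF) is a method for combining multiple result sets with different relevance indicators into a single result set. RRF requires no tuning, and the different relevance indicators do not have to be related to each other to achieve high-quality results.

RRF uses the following formula to determine the score for ranking each document.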

score = 0.0
for q in queries:
    if d in result(q):
        score += 1.0 / ( k + rank( result(q), d ) )
return score

# where
# k is a ranking constant
# q is a query in the set of queries
# d is a document in the result set of q
# result(q) is the result set of q
# rank( result(q), d ) is d's rank within the result(q) starting from 1

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html
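
To make the formula concrete, here is a small Python sketch that applies it to two ranked lists. It is an illustration of the scoring rule only, not Elasticsearch's implementation; the function name rrf_scores, the document IDs, and the two rankings are made up, and k = 60 is assumed as the ranking constant.

```python
from collections import defaultdict


def rrf_scores(result_lists, k=60):
    """Fuse ranked lists of document IDs using reciprocal rank fusion."""
    scores = defaultdict(float)
    for results in result_lists:
        # Ranks start from 1, as in the formula.
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


lexical_ranking = ["A", "B", "C"]   # hypothetical full-text result order
semantic_ranking = ["B", "D", "A"]  # hypothetical semantic result order

fused = rrf_scores([lexical_ranking, semantic_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

In this example B comes out first because it sits near the top of both lists (1/62 + 1/61), followed by A (1/61 + 1/63), then D and C, which each appear in only one list.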

License

Running a hybrid search with rrf can fail with an error like the following.

{
    "error": {
        "root_cause": [
            {
                "type": "security_exception",
                "reason": "current license is non-compliant for [Reciprocal Rank Fusion (RRF)]",
                "license.expired.feature": "Reciprocal Rank Fusion (RRF)"
            }
        ],
        "type": "security_exception",
        "reason": "current license is non-compliant for [Reciprocal Rank Fusion (RRF)]",
        "license.expired.feature": "Reciprocal Rank Fusion (RRF)"
    },
    "status": 403
}

rrf is available only with an Enterprise license, so check your license.

Reference: https://www.elastic.co/subscriptions
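
Application code may want to tell this license failure apart from other errors, for example to fall back to a plain full-text query. The Python sketch below checks a parsed error response for the fields shown in the error above; the function name is_license_error is hypothetical, and the check assumes the response has the same shape as that error.

```python
def is_license_error(response: dict) -> bool:
    """Return True if a parsed response looks like the RRF license error."""
    error = response.get("error", {})
    return (
        response.get("status") == 403
        and error.get("type") == "security_exception"
        and "non-compliant" in error.get("reason", "")
    )


# Parsed form of the error response, reduced to the fields the check reads.
sample = {
    "error": {
        "type": "security_exception",
        "reason": "current license is non-compliant for [Reciprocal Rank Fusion (RRF)]",
    },
    "status": 403,
}

print(is_license_error(sample))
```

The license currently installed on the cluster can be inspected with GET /_license.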

References