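Things to Watch for When Writing a User Dictionary

Why write a user dictionary at all?

To improve search accuracy.

For example, suppose we index the word 쿠버네티스 (Kubernetes) in Elasticsearch (ES), with Nori as the morphological analyzer.

Without a user dictionary, the Nori tokenizer splits the word as follows. (decompound_mode: "mixed", which keeps both the compound and its decomposed parts)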

{
  "token": "쿠",
  "start_offset": 0,
  "end_offset": 1,
  "type": "word",
  "position": 0,
  "bytes": "[ec bf a0]",
  "leftPOS": "NNP(Proper Noun)",
  "morphemes": null,
  "posType": "MORPHEME",
  "positionLength": 1,
  "reading": null,
  "rightPOS": "NNP(Proper Noun)",
  "termFrequency": 1
},
{
  "token": "버네",
  "start_offset": 1,
  "end_offset": 3,
  "type": "word",
  "position": 1,
  "bytes": "[eb b2 84 eb 84 a4]",
  "leftPOS": "NNP(Proper Noun)",
  "morphemes": null,
  "posType": "MORPHEME",
  "positionLength": 1,
  "reading": null,
  "rightPOS": "NNP(Proper Noun)",
  "termFrequency": 1
},
{
  "token": "티스",
  "start_offset": 3,
  "end_offset": 5,
  "type": "word",
  "position": 2,
  "bytes": "[ed 8b b0 ec 8a a4]",
  "leftPOS": "NNP(Proper Noun)",
  "morphemes": null,
  "posType": "MORPHEME",
  "positionLength": 1,
  "reading": null,
  "rightPOS": "NNP(Proper Noun)",
  "termFrequency": 1
}
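
Token details like the ones above are what the _analyze API returns when explain is enabled. The exact request is not shown, so the following is only a sketch of one that should produce comparable output; the inline tokenizer definition is an assumption, and it requires the analysis-nori plugin to be installed.

```console
# Assumed request: inline nori_tokenizer with the decompound_mode mentioned above
GET /_analyze
{
  "tokenizer": {
    "type": "nori_tokenizer",
    "decompound_mode": "mixed"
  },
  "text": "쿠버네티스",
  "explain": true
}
```

Swapping the text value for another word is a quick way to check whether that word is split before deciding to add it to a user dictionary.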

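쿠버네티스 is not in the built-in dictionary that Nori uses, but since it is split into 버네 and 티스, those two fragments are very likely in the built-in dictionary.

The problem is the split-off 쿠: searching for 쿠버네티스 can match 쿠앤크 (cookies and cream), which is an absurd result.

So words like this have to be registered in the user dictionary. But does that mean every loanword like 쿠버네티스 has to be registered?

Let's analyze the word 필라델피아 (Philadelphia).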

{
  "token": "필라델피아",
  "start_offset": 0,
  "end_offset": 5,
  "type": "word",
  "position": 0,
  "bytes": "[ed 95 84 eb 9d bc eb 8d b8 ed 94 bc ec 95 84]",
  "leftPOS": "NNP(Proper Noun)",
  "morphemes": null,
  "posType": "MORPHEME",
  "positionLength": 1,
  "reading": null,
  "rightPOS": "NNP(Proper Noun)",
  "termFrequency": 1
}

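This time the word is not split and comes out as a single token, because it is in Nori's built-in dictionary.

So a word needs to go into the user dictionary only when it is missing from Nori's built-in dictionary and therefore gets split.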

Suppose the user dictionary contains the lowercase entry c++.
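
The dictionary and index settings themselves are not shown, so here is a minimal sketch of what they might look like. The names custom_nori_tokenizer and nori_analyzer are taken from the analyzer output later in this post; defining the dictionary inline with user_dictionary_rules (rather than in a file referenced by user_dictionary), the decompound_mode value, and the field mappings are all assumptions. The stop_filter, nori_filter, and synonym_filter that appear later are left out because their definitions are unknown.

```console
# Hypothetical settings: only the user dictionary entry and the lowercase filter
PUT /post_2024_10
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "custom_nori_tokenizer": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed",
          "user_dictionary_rules": ["c++"]
        }
      },
      "analyzer": {
        "nori_analyzer": {
          "type": "custom",
          "tokenizer": "custom_nori_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "nori_analyzer" },
      "content": { "type": "text", "analyzer": "nori_analyzer" }
    }
  }
}
```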

Then index the following document.

{
  "_index": "post_2024_10",
  "_id": "0cd515d0-cf26-4eab-9a02-4787e0f1ad5b",
  "_score": 2.9540634,
  "_source": {
    "title": "C++은 왜",
    "content": "어려울까"
  }
}

Because the lowercase filter is applied, I expected the title to be indexed as c++, matching the user dictionary entry, but that is not what happens.

GET /post_2024_10/_termvectors/0cd515d0-cf26-4eab-9a02-4787e0f1ad5b?fields=title

{
  "_index": "post_2024_10",
  "_id": "0cd515d0-cf26-4eab-9a02-4787e0f1ad5b",
  "_version": 1,
  "found": true,
  "took": 1,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 134,
        "doc_count": 32,
        "sum_ttf": 139
      },
      "terms": {
        "c": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 1
            }
          ]
        },
        "왜": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 5,
              "end_offset": 6
            }
          ]
        },
        "은": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 3,
              "end_offset": 4
            }
          ]
        }
      }
    }
  }
}

Why is it indexed as c?

Let's send the following two requests.

GET /post_2024_10/_analyze
{
  "analyzer": "nori_analyzer",
  "text": "c++",
  "explain": true
}

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "custom_nori_tokenizer",
      "tokens": [
        {
          "token": "c++",
          "start_offset": 0,
          "end_offset": 3,
          "type": "word",
          "position": 0,
          "bytes": "[63 2b 2b]",
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "NNG(General Noun)",
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "lowercase",
        "tokens": [
          {
            "token": "c++",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "bytes": "[63 2b 2b]",
            "leftPOS": "NNG(General Noun)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "NNG(General Noun)",
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "stop_filter",
        "tokens": [
          {
            "token": "c++",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "bytes": "[63 2b 2b]",
            "leftPOS": "NNG(General Noun)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "NNG(General Noun)",
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "nori_filter",
        "tokens": [
          {
            "token": "c++",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "bytes": "[63 2b 2b]",
            "leftPOS": "NNG(General Noun)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "NNG(General Noun)",
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "synonym_filter",
        "tokens": [
          {
            "token": "c++",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "bytes": "[63 2b 2b]",
            "leftPOS": "NNG(General Noun)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "NNG(General Noun)",
            "termFrequency": 1
          }
        ]
      }
    ]
  }
}

GET /post_2024_10/_analyze
{
  "analyzer": "nori_analyzer",
  "text": "C++",
  "explain": true
}

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "custom_nori_tokenizer",
      "tokens": [
        {
          "token": "C",
          "start_offset": 0,
          "end_offset": 1,
          "type": "word",
          "position": 0,
          "bytes": "[43]",
          "leftPOS": "SL(Foreign language)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "SL(Foreign language)",
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "lowercase",
        "tokens": [
          {
            "token": "c",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0,
            "bytes": "[63]",
            "leftPOS": "SL(Foreign language)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "SL(Foreign language)",
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "stop_filter",
        "tokens": [
          {
            "token": "c",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0,
            "bytes": "[63]",
            "leftPOS": "SL(Foreign language)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "SL(Foreign language)",
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "nori_filter",
        "tokens": [
          {
            "token": "c",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0,
            "bytes": "[63]",
            "leftPOS": "SL(Foreign language)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "SL(Foreign language)",
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "synonym_filter",
        "tokens": [
          {
            "token": "c",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0,
            "bytes": "[63]",
            "leftPOS": "SL(Foreign language)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "SL(Foreign language)",
            "termFrequency": 1
          }
        ]
      }
    ]
  }
}

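At the tokenizer stage, c++ came out as the single token c++, whereas C++ was reduced to C and ended up indexed as c.

The reason is that analysis runs the tokenizer first and the token filters afterwards. The tokenizer compares the input against the user dictionary before any lowercasing has happened, so C++ does not match the entry c++ and the special characters are dropped. Only after that does the lowercase filter turn C into c. Does that mean the user dictionary has to list every upper- and lowercase variant?

It does not. Character filters run before the tokenizer, so adding a mapping or pattern_replace character filter that rewrites the uppercase C in the input stream to a lowercase c lets the tokenizer see c++ and match the dictionary entry.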
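
As a rough illustration of that setting, the sketch below adds a mapping character filter in front of the tokenizer. The index name post_charfilter_demo and the filter name cpp_lowercase are hypothetical, the inline user_dictionary_rules entry is an assumption about how the dictionary is defined, and the other token filters from the original analyzer are omitted. The rule here maps the literal C++ to c++; a rule for a bare C, as described above, would be written the same way.

```console
# Hypothetical index: the char_filter runs before custom_nori_tokenizer
PUT /post_charfilter_demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "cpp_lowercase": {
          "type": "mapping",
          "mappings": ["C++ => c++"]
        }
      },
      "tokenizer": {
        "custom_nori_tokenizer": {
          "type": "nori_tokenizer",
          "user_dictionary_rules": ["c++"]
        }
      },
      "analyzer": {
        "nori_analyzer": {
          "type": "custom",
          "char_filter": ["cpp_lowercase"],
          "tokenizer": "custom_nori_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

# Re-run the earlier check against the new analyzer
GET /post_charfilter_demo/_analyze
{
  "analyzer": "nori_analyzer",
  "text": "C++은 왜",
  "explain": true
}
```

If the mapping behaves as intended, the charfilters section of the explain output should no longer be empty, and the tokenizer should receive c++ instead of C++. It is worth confirming the resulting tokens on your own cluster before reindexing.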