Monday, February 23, 2015

Lucene.NET (part 2): Tag indexing and searching

In this post I will describe how I implemented tag indexing and searching. This is Part 2 of the Lucene.NET series, following Part 1.

In the case of INT64, a tag is a short textual descriptor that applies to a snippet of code. You can have multiple tags per snippet. For example, you can put the tags jQuery and JavaScript on a snippet that shows how to find an element by name using jQuery.

Later, you should be able to type [jquery] or [javascript], or [jquery] [javascript] in the search box and find that snippet and all other accessible snippets with those tags.

Indexing


To index tags for a single snippet, I have this piece of code:

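if (snippet.Tags != null) {
    // lower-case every tag name so that lookups are case-insensitive;
    // a tag without a name is indexed as an empty string
    foreach (var tag in snippet.Tags.Select(t => t.Name == null ?
                                                    string.Empty :
                                                    t.Name.ToLower())) {
        // one field instance per tag, all under the same field name
        doc.Add(new Field(FIELD_TAGS,
            tag,
            Field.Store.NO,
            Field.Index.NOT_ANALYZED,
            Field.TermVector.NO));
    }
}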

This code runs when a document is being indexed; it is an addition to the MakeDocumentFromSnippet(...) method from Part 1 of the series. Lucene allows multiple entries for the same field in a single document, so I loop through the list of tags and add them one by one to the FIELD_TAGS field. The tags only need to be searchable, never read back from the index, so they are not stored (Field.Store.NO) and no term vector is kept. Each tag should be indexed as is, as a single term, so it is not analyzed either (Field.Index.NOT_ANALYZED). Because a field that is not analyzed is matched exactly, the tag names are lower-cased here, just as the query is lower-cased at search time. That's it for indexing; now on to searching.
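To make the "multiple entries for the same field" point concrete, here is a minimal sketch, assuming a project that references Lucene.NET 3.0.3. The class name, the "tags" field name (a stand-in for FIELD_TAGS) and the sample tag names are made up for illustration and are not part of the INT64 code.

```csharp
using System;
using Lucene.Net.Documents;

public static class MultiValuedFieldDemo
{
    // Hypothetical stand-in for the FIELD_TAGS constant used in the post.
    private const string FieldTags = "tags";

    public static Document BuildDocument(params string[] tags)
    {
        var doc = new Document();
        foreach (var tag in tags)
        {
            // Every Add call appends another value under the same field name.
            doc.Add(new Field(FieldTags,
                tag.ToLower(),
                Field.Store.NO,
                Field.Index.NOT_ANALYZED));
        }
        return doc;
    }

    public static void Main()
    {
        var doc = BuildDocument("jQuery", "JavaScript");

        // GetValues returns all values held under the field name
        // in the in-memory document, in insertion order.
        Console.WriteLine(string.Join(", ", doc.GetValues(FieldTags)));
        // prints: jquery, javascript
    }
}
```

Note that this reads the values from the document object before it is written. Since the field uses Field.Store.NO, a document loaded back from the index will not carry the tag values; they can only be searched.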

Extracting tags


In the Lucene query syntax, square brackets denote range queries. I don't need that feature for INT64, so I use the brackets to delimit tags instead. Since the Lucene parser doesn't know about this convention, I pre-process the query to extract the tags before parsing it. There is one caveat: brackets that appear inside quotes are not tag delimiters but part of a literal phrase, to be used as is. Here is the code to extract the tags:

private List<string> ExtractTags(string originalQuery, out string modifiedQuery) {
    modifiedQuery = originalQuery;
    var tagList = new List<string>();

    if (originalQuery.IsNullOrEmpty()) {
        return tagList;
    }

    bool insideQuotes = false;
    bool insideTag = false;
    var currentTagSb = new StringBuilder();
    var modifiedQuerySb = new StringBuilder();
    foreach (var c in originalQuery) {
        if (c == '"') {
            // we are either going into or coming out of quotes
            insideQuotes = !insideQuotes;
            modifiedQuerySb.Append(c);
        }
        else if (c == '[' && !insideTag && !insideQuotes) {
            // tag name start
            insideTag = true;
        }
        else if (c == ']' && insideTag && !insideQuotes) {
            // tag name just finished. save to list and clear the current tag
            tagList.Add(currentTagSb.ToString());
            currentTagSb.Clear();
            insideTag = false;
        }
        else if (insideTag) {
            // inside the brackets. append the character to current tag name
            currentTagSb.Append(c);
        }
        else {
            modifiedQuerySb.Append(c);
        }
    }

    modifiedQuery = modifiedQuerySb.ToString();
    return tagList;
} 

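Here we loop through the original query string once, keeping track of whether we are inside a quoted phrase or inside a tag. Characters inside a tag go to the current tag name, and everything else is copied to the modified query. The method returns the list of tags, and the out parameter receives the original query with the tags removed, since we don't want Lucene's parser to see them.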
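To see how the scan behaves on concrete input, here is a small self-contained sketch. It repeats the same loop in a static class (the class name and the sample queries are made up for illustration) so that it compiles on its own without the rest of the search service.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TagExtractionDemo
{
    // Same single-pass scan as ExtractTags above, repeated here
    // so that the sample is self-contained.
    public static List<string> ExtractTags(string originalQuery, out string modifiedQuery)
    {
        modifiedQuery = originalQuery;
        var tagList = new List<string>();
        if (string.IsNullOrEmpty(originalQuery))
        {
            return tagList;
        }

        bool insideQuotes = false;
        bool insideTag = false;
        var currentTag = new StringBuilder();
        var rest = new StringBuilder();
        foreach (var c in originalQuery)
        {
            if (c == '"')
            {
                // entering or leaving a quoted phrase
                insideQuotes = !insideQuotes;
                rest.Append(c);
            }
            else if (c == '[' && !insideTag && !insideQuotes)
            {
                insideTag = true;
            }
            else if (c == ']' && insideTag && !insideQuotes)
            {
                // tag finished: save it and reset the buffer
                tagList.Add(currentTag.ToString());
                currentTag.Clear();
                insideTag = false;
            }
            else if (insideTag)
            {
                currentTag.Append(c);
            }
            else
            {
                rest.Append(c);
            }
        }

        modifiedQuery = rest.ToString();
        return tagList;
    }

    public static void Main()
    {
        string rest;

        // Two tags followed by free text.
        var tags = ExtractTags("[jquery] [javascript] find element", out rest);
        Console.WriteLine(string.Join(",", tags)); // prints: jquery,javascript
        Console.WriteLine("'" + rest + "'");       // prints: '  find element'

        // Brackets inside quotes stay part of the phrase.
        tags = ExtractTags("\"array[0]\" [csharp]", out rest);
        Console.WriteLine(string.Join(",", tags)); // prints: csharp
        Console.WriteLine("'" + rest + "'");       // prints: '"array[0]" '
    }
}
```

The whitespace left behind where the tags used to be is harmless, because the remaining text is handed to the query parser afterwards. Two edge cases are worth knowing about: tag names are taken verbatim, so [ jquery ] yields a tag with surrounding spaces, and an opening bracket that is never closed swallows the rest of the query.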

Searching


Now that we can extract tags, we need to add them into the query before we run it. Here is part of the Search(...) method from Part 1 modified to include the tags in the query:

// lower-case the query
if (query != null) {
    query = query.ToLower();
}

// prepare the searcher and parser
var searcher = new IndexSearcher(m_azureDirectory);
var parser = new MultiFieldQueryParser(Version.LUCENE_30,
    textSearchFields,
    new StandardAnalyzer(Version.LUCENE_30));

// *** BUILDING THE TAGS QUERY HERE

// prepare the query
string modifiedQuery;
var tags = ExtractTags(query, out modifiedQuery);

// build the tags query
var tagsQuery = new BooleanQuery();
foreach (var tag in tags) {
    tagsQuery.Add(new TermQuery(new Term(FIELD_TAGS, tag)), Occur.MUST);
}

// *** END BUILDING THE TAGS QUERY HERE

// parse the user query
var userQuery = (modifiedQuery.IsNullOrWhiteSpace() ?
                 new MatchAllDocsQuery() :
                 parser.Parse(modifiedQuery));

// keep only results that are public or that belong to the
// current user
var userId = BizSession.CurrentState.UserId;
var onlyThisUser = new TermQuery(new Term(FIELD_USER_ID, userId));
var onlyPublic = new TermQuery(new Term(FIELD_IS_PUBLIC, true.ToString()));
var onlyThisUserOrPublic = new BooleanQuery
                               {
                                   { onlyPublic, Occur.SHOULD }
                               };
if (userId != null) {
    onlyThisUserOrPublic.Add(onlyThisUser, Occur.SHOULD);
}

var finalQuery = new BooleanQuery
                           {
                               { onlyThisUserOrPublic, Occur.MUST },
                               { userQuery, Occur.MUST }
                           };
// *** ADD THE TAGS QUERY TO FINAL QUERY
if (tagsQuery.Clauses.Count > 0) {
    finalQuery.Add(tagsQuery, Occur.MUST);
}

// do the search
var totalToRequest = (criteria.PageNumber + 1) * criteria.PageSize;
var results = searcher.Search(finalQuery, totalToRequest);
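The effect of combining the tag clauses with Occur.MUST can be tried out in isolation. The following is a minimal sketch, assuming Lucene.NET 3.0.3 and an in-memory RAMDirectory instead of the Azure directory used above. The class name, the field names "title" and "tags", and the two sample snippets are hypothetical.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class TagSearchDemo
{
    // Hypothetical field names; "tags" stands in for FIELD_TAGS.
    private const string FieldTitle = "title";
    private const string FieldTags = "tags";

    private static void AddSnippet(IndexWriter writer, string title, params string[] tags)
    {
        var doc = new Document();
        doc.Add(new Field(FieldTitle, title, Field.Store.YES, Field.Index.NOT_ANALYZED));
        foreach (var tag in tags)
        {
            doc.Add(new Field(FieldTags, tag, Field.Store.NO, Field.Index.NOT_ANALYZED));
        }
        writer.AddDocument(doc);
    }

    // Builds a small in-memory index with two tagged snippets.
    public static RAMDirectory BuildIndex()
    {
        var dir = new RAMDirectory();
        using (var writer = new IndexWriter(dir, new KeywordAnalyzer(), true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            AddSnippet(writer, "find element by name", "jquery", "javascript");
            AddSnippet(writer, "map over an array", "javascript");
        }
        return dir;
    }

    // Counts the documents that carry ALL of the given tags.
    public static int CountMatches(RAMDirectory dir, params string[] tags)
    {
        var tagsQuery = new BooleanQuery();
        foreach (var tag in tags)
        {
            tagsQuery.Add(new TermQuery(new Term(FieldTags, tag)), Occur.MUST);
        }
        using (var searcher = new IndexSearcher(dir, true))
        {
            return searcher.Search(tagsQuery, 10).TotalHits;
        }
    }

    public static void Main()
    {
        var dir = BuildIndex();
        Console.WriteLine(CountMatches(dir, "javascript"));           // prints: 2
        Console.WriteLine(CountMatches(dir, "jquery", "javascript")); // prints: 1
        Console.WriteLine(CountMatches(dir));                         // prints: 0
    }
}
```

The last call shows why the tags query should only be added to the final query when it has at least one clause: a BooleanQuery without clauses matches nothing, so adding it as a MUST clause would empty the result set for every search that has no tags.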

That's it! Please leave your thoughts and comments below.
