Monday, February 23, 2015

Lucene.NET (part 2): Tag indexing and searching

In this post I will describe how I implemented tag indexing and searching. This is Part 2 of the Lucene.NET series. Part 1 can be found here: Part 1

In the case of INT64, a tag is a short textual descriptor that applies to a snippet of code. You can have multiple tags per snippet. For example, you can put the tags jQuery and JavaScript on a snippet that shows how to find an element by name using jQuery.

Later, you should be able to type [jquery] or [javascript], or [jquery] [javascript] in the search box and find that snippet and all other accessible snippets with those tags.

Indexing


To index tags for a single snippet, I have this piece of code:

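if (snippet.Tags != null) {
    // one FIELD_TAGS entry per tag, lower-cased so that matching is case-insensitive
    foreach (var tag in snippet.Tags.Select(t => t.Name == null ?
                                                    string.Empty :
                                                    t.Name.ToLower())) {
        // not stored, not analyzed: the whole tag is indexed as a single term
        doc.Add(new Field(FIELD_TAGS,
            tag,
            Field.Store.NO,
            Field.Index.NOT_ANALYZED,
            Field.TermVector.NO));
    }
}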

This code runs when a document is being indexed; it is an addition to the MakeDocumentFromSnippet(...) method from Part 1 of the series. Lucene allows multiple values for the same field in a single document, so I loop through the list of tags and add them one by one to the FIELD_TAGS document field. We don't need to store the tags, and we don't need the term vector. Each tag is lower-cased and then indexed as a single term, so we don't need to analyze it either. That's it for indexing, now on to searching.

Extracting tags


In the Lucene query syntax, square brackets are used for range descriptors. I don't need this kind of advanced feature for INT64, so I'll use the brackets to delimit tags. But since the Lucene parser doesn't know how I want to use brackets, I will pre-process the query to extract the tags before parsing it. There is one caveat: if the brackets appear inside quotes, they shouldn't be treated as tag delimiters but as part of a literal phrase to be used as is. Here is the code to extract the tags:

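// needs System.Collections.Generic (List<T>) and System.Text (StringBuilder)
private List<string> ExtractTags(string originalQuery, out string modifiedQuery) {
    modifiedQuery = originalQuery;
    var tagList = new List<string>();

    if (string.IsNullOrEmpty(originalQuery)) {
        return tagList;
    }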

    bool insideQuotes = false;
    bool insideTag = false;
    var currentTagSb = new StringBuilder();
    var modifiedQuerySb = new StringBuilder();
    foreach (var c in originalQuery) {
        if (c == '"') {
            // we are either going into or coming out of quotes
            insideQuotes = !insideQuotes;
            modifiedQuerySb.Append(c);
        }
        else if (c == '[' && !insideTag && !insideQuotes) {
            // tag name start
            insideTag = true;
        }
        else if (c == ']' && insideTag && !insideQuotes) {
            // tag name just finished. save to list and clear the current tag
            tagList.Add(currentTagSb.ToString());
            currentTagSb.Clear();
            insideTag = false;
        }
        else if (insideTag) {
            // inside the brackets. append the character to current tag name
            currentTagSb.Append(c);
        }
        else {
            modifiedQuerySb.Append(c);
        }
    }

    modifiedQuery = modifiedQuerySb.ToString();
    return tagList;
} 

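Here we loop through the original query string once and keep track of whether we are inside a quoted phrase or inside a tag. The method returns the list of tags and, through the out parameter, the original query with the tags removed, since we don't want them to be parsed by Lucene's parser. Note that if an opening bracket is never closed, the text after it is dropped from the query and no tag is produced for it.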
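To get a feel for how the extractor behaves on a few inputs, here is a small JavaScript port of the same state machine. It is only a sketch for experimenting in a browser console or Node; the function name extractTags and the returned object shape are my own choices for the port, and the real code is the C# method above.

```javascript
// Hypothetical JavaScript port of ExtractTags, for experimenting only.
// Returns the tags plus the query with the tags removed.
function extractTags(originalQuery) {
    if (!originalQuery) {
        return { tags: [], modifiedQuery: originalQuery };
    }

    var tags = [];
    var modifiedQuery = "";
    var currentTag = "";
    var insideQuotes = false;
    var insideTag = false;

    for (var i = 0; i < originalQuery.length; i++) {
        var c = originalQuery[i];
        if (c === '"') {
            // going into or coming out of a quoted phrase
            insideQuotes = !insideQuotes;
            modifiedQuery += c;
        } else if (c === "[" && !insideTag && !insideQuotes) {
            insideTag = true;
        } else if (c === "]" && insideTag && !insideQuotes) {
            // tag finished: save it and reset
            tags.push(currentTag);
            currentTag = "";
            insideTag = false;
        } else if (insideTag) {
            currentTag += c;
        } else {
            modifiedQuery += c;
        }
    }
    return { tags: tags, modifiedQuery: modifiedQuery };
}

// Two tags followed by free text: both tags are pulled out.
var a = extractTags("[jquery] [javascript] find element");
console.log(a.tags);          // [ 'jquery', 'javascript' ]

// Brackets inside quotes stay in the query; the tag after them is extracted.
var b = extractTags('"array[0]" [c#]');
console.log(b.tags);          // [ 'c#' ]
console.log(b.modifiedQuery); // "array[0]" followed by one space

// An unclosed bracket produces no tag and swallows the text after it.
var c = extractTags("sort [linq");
console.log(c.tags.length);   // 0
console.log(c.modifiedQuery); // sort followed by one space
```

The leftover spaces in the modified query are harmless because the query parser ignores extra whitespace between terms.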

Searching


Now that we can extract tags, we need to add them into the query before we run it. Here is part of the Search(...) method from Part 1 modified to include the tags in the query:

// lower-case the query
if (query != null) {
    query = query.ToLower();
}

// prepare the searcher and parser
var searcher = new IndexSearcher(m_azureDirectory);
var parser = new MultiFieldQueryParser(Version.LUCENE_30,
    textSearchFields,
    new StandardAnalyzer(Version.LUCENE_30));

// *** BUILDING THE TAGS QUERY HERE

// prepare the query
string modifiedQuery;
var tags = ExtractTags(query, out modifiedQuery);

// build the tags query
var tagsQuery = new BooleanQuery();
foreach (var tag in tags) {
    tagsQuery.Add(new TermQuery(new Term(FIELD_TAGS, tag)), Occur.MUST);
}

// *** END BUILDING THE TAGS QUERY HERE

// parse the user query
// if only tags were typed, there is nothing left to parse, so match everything
var userQuery = (string.IsNullOrWhiteSpace(modifiedQuery) ?
                 new MatchAllDocsQuery() :
                 parser.Parse(modifiedQuery));

// filter out results that don't belong to the current user and
// that are not public
var userId = BizSession.CurrentState.UserId;
var onlyThisUser = new TermQuery(new Term(FIELD_USER_ID, userId));
var onlyPublic = new TermQuery(new Term(FIELD_IS_PUBLIC, true.ToString()));
var onlyThisUserOrPublic = new BooleanQuery
                               {
                                   { onlyPublic, Occur.SHOULD }
                               };
if (userId != null) {
    onlyThisUserOrPublic.Add(onlyThisUser, Occur.SHOULD);
}

var finalQuery = new BooleanQuery
                           {
                               { onlyThisUserOrPublic, Occur.MUST },
                               { userQuery, Occur.MUST }
                           };
// *** ADD THE TAGS QUERY TO FINAL QUERY
if (tagsQuery.Clauses.Count > 0) {
    finalQuery.Add(tagsQuery, Occur.MUST);
}

// do the search
var totalToRequest = (criteria.PageNumber + 1) * criteria.PageSize;
var results = searcher.Search(finalQuery, totalToRequest);

That's it! Please leave your thoughts and comments below.

Tuesday, February 10, 2015

Lucene.NET (part 1): Permission filtering

I'm working on adding search to www.int64.io. Here is what I'm looking for in the search capability:
  • Full-text search that looks at the Title, Code, and Notes fields of a snippet
  • Has to work on Azure
  • Has to be fairly easy to set up and tweak
And in the future:
  • Will need to be able to search tags
  • Has to support more granular parameters for handling Advanced Search
I was debating between Azure Search and Lucene.NET, and picked Lucene because I didn't want to be locked into Azure - who knows, maybe I will move INT64 to AWS in the future? Plus, Lucene is a much more mature platform with a lot more information available.

So I followed this tutorial to set up Lucene: http://chriskirby.net/getting-full-text-search-up-and-running-in-azure/ and everything went pretty smoothly.

Permission filtering preparation


So I set up field indexing like this:

private static Document MakeDocumentFromSnippet(
                            SearchIndexSnippetModel snippet) {
    var doc = new Document();
    doc.Add(new Field(FIELD_ID,
        snippet.Id.ToString(),
        Field.Store.YES,
        Field.Index.NOT_ANALYZED,
        Field.TermVector.NO));
    doc.Add(new Field(FIELD_USER_ID,
        snippet.UserId,
        Field.Store.YES,
        Field.Index.NOT_ANALYZED,
        Field.TermVector.NO));
    doc.Add(new Field(FIELD_IS_PUBLIC,
        snippet.IsPublic.ToString(),
        Field.Store.YES,
        Field.Index.NOT_ANALYZED,
        Field.TermVector.NO));
    // indexing the Title, Text, and Notes below is omitted
    ....

    return doc;
}


Note the Field.Index.NOT_ANALYZED specified for the FIELD_USER_ID and FIELD_IS_PUBLIC. It indexes each value as a single exact term, which is what a TermQuery matches against. We will use these fields when we filter out the data the current user is not allowed to see.

Indexing and Updating


So I added some code that loops through all the existing snippets, and indexes each one like this:

public void AddSnippetToIndex(SearchIndexSnippetModel snippet) {
    using (var writer = MakeIndexWriter()) {
        var doc = MakeDocumentFromSnippet(snippet);
        writer.AddDocument(doc);
    }
}

And every time a snippet is updated, I update the index like this:

public void UpdateSnippetInIndex(SearchIndexSnippetModel snippet) {
    using (var writer = MakeIndexWriter()) {
        var doc = MakeDocumentFromSnippet(snippet);
        writer.UpdateDocument(new Term(FIELD_ID, snippet.Id.ToString()), doc);
    }
}

To make this work correctly, you have to tell the writer which document to update. Since Lucene doesn't support actual updating, it will delete all documents that match the term you provided and add the new document doc. It wasn't finding the document to update at first because I was using Field.Index.NO for the FIELD_ID field during indexing, which means the field is not indexed at all and no term can match it. Switching to Field.Index.NOT_ANALYZED fixed the problem.

Searching and permission filtering


I have a single input for the user to type their search query. The results returned from the query need to be filtered to only show snippets the current user has access to. To achieve this, we need to add additional conditions to the query the user types. At INT64 the user has access to their own snippets and to any public snippets.

Here is an excerpt from the Search method where I prepare the queries:

// prepare the searcher and parser
var searcher = new IndexSearcher(m_azureDirectory);
var parser = new MultiFieldQueryParser(Version.LUCENE_30,
                                       textSearchFields,
                                       new StandardAnalyzer(Version.LUCENE_30));

// parse the user query
var userQuery = parser.Parse(query);

// filter out results that don't belong to the current user and
// that are not public
var onlyThisUser = new TermQuery(new Term(FIELD_USER_ID,
                                          BizSession.CurrentState.UserId));
var onlyPublic = new TermQuery(new Term(FIELD_IS_PUBLIC, true.ToString()));
var onlyThisUserOrPublic = new BooleanQuery
                               {
                                   { onlyThisUser, Occur.SHOULD },
                                   { onlyPublic, Occur.SHOULD }
                               };

var finalQuery = new BooleanQuery
                     {
                         { onlyThisUserOrPublic, Occur.MUST },
                         { userQuery, Occur.MUST }
                     };

// do the search
var totalToRequest = (criteria.PageNumber + 1) * criteria.PageSize;
var results = searcher.Search(finalQuery, totalToRequest);

So I create additional term queries for matching the current user's id, and for matching public snippets. Then I add both of those term queries to the BooleanQuery onlyThisUserOrPublic with Occur.SHOULD. Because that query has no MUST clauses, at least one of the SHOULD conditions has to match. This is like saying, "either the User Id matches the current user's id, or the snippet is public".

Then I add both my new permission query and the user query into another BooleanQuery, telling it this time that both conditions MUST occur. This gives us a final query of (matches user input AND (snippet's user id == current user id OR snippet is public)).
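If the nesting of SHOULD and MUST is hard to picture, the same boolean logic can be written as a plain predicate over in-memory objects. This is only an illustration in JavaScript, not how Lucene evaluates the query; the function isVisibleMatch, the object fields, and the sample data are all made up for the example. The null check on the user id mirrors what the Part 2 version of the Search method does for anonymous visitors.

```javascript
// Illustration only: (matches user input AND (owned by user OR public)).
// isVisibleMatch, userId, isPublic and the sample data are hypothetical.
function isVisibleMatch(snippet, currentUserId, matchesUserQuery) {
    var ownedByUser = currentUserId != null && snippet.userId === currentUserId;
    var permitted = ownedByUser || snippet.isPublic;  // the two SHOULD clauses
    return permitted && matchesUserQuery(snippet);     // the two MUST clauses
}

var snippets = [
    { id: 1, userId: "alice", isPublic: false, title: "jquery find by name" },
    { id: 2, userId: "bob",   isPublic: true,  title: "jquery ajax post" },
    { id: 3, userId: "bob",   isPublic: false, title: "jquery private notes" }
];

function titleContainsJquery(s) {
    return s.title.indexOf("jquery") !== -1;
}

function visibleIds(currentUserId) {
    return snippets
        .filter(function (s) { return isVisibleMatch(s, currentUserId, titleContainsJquery); })
        .map(function (s) { return s.id; });
}

console.log(visibleIds("alice")); // [ 1, 2 ]  own snippet plus bob's public one
console.log(visibleIds(null));    // [ 2 ]     anonymous: public only
```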

And then we loop through the results (there is some paging code there) and make the resulting snippet list, which I omitted for brevity.
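The omitted paging step is not shown in this post, so the following is a guess at its shape rather than the actual code. Assuming a zero-based page number (which is what the (PageNumber + 1) * PageSize request size suggests), the idea is to ask for enough top hits to cover the requested page and then skip the hits that belong to earlier pages. The function pageOfHits is a hypothetical name, sketched here in JavaScript over a plain array.

```javascript
// Sketch under assumptions: zero-based pageNumber, hits already sorted by relevance.
function pageOfHits(hits, pageNumber, pageSize) {
    // same arithmetic as totalToRequest in the Search method
    var totalToRequest = (pageNumber + 1) * pageSize;
    var top = hits.slice(0, totalToRequest);
    // drop the hits that were already shown on earlier pages
    return top.slice(pageNumber * pageSize);
}

var hits = [1, 2, 3, 4, 5, 6, 7];
console.log(pageOfHits(hits, 0, 3)); // [ 1, 2, 3 ]
console.log(pageOfHits(hits, 1, 3)); // [ 4, 5, 6 ]
console.log(pageOfHits(hits, 2, 3)); // [ 7 ]
```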

To Be Continued...


The tags are not included in the search at this time. Once tag search is done, I will write part 2 of the post concentrating just on that.

Please leave thoughts and comments below.

Sunday, February 1, 2015

.NET MVC: Refactoring-friendly JavaScript

The Problem

It often happens that you need to make an Ajax call from your JavaScript to an MVC action. So you code it like this (I'm assuming that your JavaScript is in a separate file):

$.ajax({
    url: "MyWebApp/Data/GetAllProducts",
    type: "POST",
    dataType: "json",
    headers: { ... },
    data: { ... },
    success: function (data) { ... },
    error: function (result, error, moreInfo) { ... }
});

Later you decide the GetAllProducts action should really be called GetProducts, because you want to use the same action to get a list of all products, and a list of filtered products, depending on parameters.

So you refactor-rename the action using vanilla Visual Studio, or ReSharper, or another favorite plugin. But guess what? Now you have to go and manually change all your hard-coded JavaScript strings to point to the right action. It's possible that the tool will offer to change these strings too, but you never know whether it'll find all of them or not.

The Solution

In the Razor view of the relevant page, you can initialize your JavaScript like this:

$(document).ready(function() {
    MyScript.init({
        getProductsUrl: "@(Url.Action("GetAllProducts", "Data"))"
    });
});

Define your init function in the JavaScript:

var MyScript = new function() {
    var m_options;

    this.init = function(options) {
        m_options = options;
    };
};

Later, from code inside MyScript (where m_options is visible), you can make your Ajax call like this:

$.ajax({
    url: m_options.getProductsUrl,
    type: "POST",
    dataType: "json",
    headers: { ... },
    data: { ... },
    success: function (data) { ... },
    error: function (result, error, moreInfo) { ... }
});

Voila! Now when you refactor-rename, the tool won't be confused, and you just saved yourself half an hour of work. You can also use the same technique to pass in other things from Razor: just add properties to your options object.
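As a sketch of passing more than one value, here is the same module with a second option and the request settings built inside it. The pageSize option, the buildGetProductsRequest function, and the literal URL are made up for the example; in a real page the values in the init call would be rendered by Razor (for instance with Url.Action) rather than typed by hand.

```javascript
// Sketch only: pageSize and buildGetProductsRequest are hypothetical additions.
var MyScript = new function () {
    var m_options;

    this.init = function (options) {
        m_options = options;
    };

    // Builds the settings object that would be handed to $.ajax.
    this.buildGetProductsRequest = function (filter) {
        return {
            url: m_options.getProductsUrl,
            type: "POST",
            dataType: "json",
            data: { filter: filter, pageSize: m_options.pageSize }
        };
    };
};

MyScript.init({
    getProductsUrl: "/Data/GetProducts", // placeholder for the Razor-rendered URL
    pageSize: 25                         // placeholder for a value from the view
});

var request = MyScript.buildGetProductsRequest("books");
console.log(request.url); // /Data/GetProducts
// In the page you would then call $.ajax(request) with your success and error handlers added.
```

Keeping the request-building code inside MyScript means nothing outside the module needs to know the URLs at all.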
And here is a link to the snippet in the solution at INT64:

http://www.int64.io/Home/Snippet/SWh2V1pxU0JEeG9sNE8wanJjVDljUT090