Download Android App


Alternate Blog View: Timeslide Sidebar Magazine

Monday, August 4, 2014

Pray!

Right off YouTube -

"Please pray for my friend's 4 year old son. I just found out that 4 minutes of his life was not documented on Facebook."

Saturday, June 7, 2014

Collocation extraction using NLTK

A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. Collocations include noun phrases like strong tea and weapons of mass destruction, phrasal verbs like to make up, and other stock phrases like the rich and powerful.

Collocations are important for a number of applications: natural language generation (to make sure that the output sounds natural and mistakes like powerful tea or to take a decision are avoided), computational lexicography (to automatically identify the important collocations to be listed in a dictionary entry), parsing (so that preference can be given to parses with natural collocations), and corpus linguistic research.

2 Collocation algorithms:

a. The simplest method for finding collocations in a text corpus is counting. Pass the candidate phrases through a part-of-speech filter which only lets through those patterns that are likely to be “phrases”.

b. Mean and variance based methods work by looking at the pattern of varying distance between two words. 

We also want to know is whether two words occur together more often than chance. Assessing whether or not something is a chance event is one of the classical problems of statistics. It is usually couched in terms of hypothesis testing. The options are t-test, chi-square, PMI, likelihood ratios, Poisson-Stirling measure and Jaccard index.

Likelihood ratios is an effective approach to hypothesis testing for extracting collocations. You can read more about it here.

Now, to some code to see how collocations can be extracted. Collocation processing modules are available in Apache Mahout, NLTK and Stanford NLP. In this article, we will use NLTK library written in Python to extract bigram collocations from a set of documents.

'''
@author: Nishant
'''

import nltk
from nltk.collocations import *

from com.nishant import DataLoader

if __name__ == '__main__':

    tokens = []
    stopwords = nltk.corpus.stopwords.words('english')

    '''Load multiple documents and return it as a list. 
       Provide your implementation here'''
    data = DataLoader.load("data.xml")

    print 'Total documents loaded: %d' % (data.len)

    ''' Apply stop words '''
    for d in data:
        data_tokens = nltk.wordpunct_tokenize(d)
        filtered_tokens = [w for w in data_tokens if w.lower() not in stopwords]
        tokens = tokens + filtered_tokens

    print 'Total tokens loaded: %d' % (tokens.len)
    print 'Calculating Collocations'

    ''' Extract only bigrams within a window of 5. 
        Implementation for trigram also available. NLTK
        has utility functions for frequency, ngram and word filter''' 
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens, 5)
    finder.apply_freq_filter(5)

    ''' Return the 1000 n-grams with the highest likelihood ratio.
        You can also use PMI or other methods and compare results '''
    print 'Printing Top 1000 Collocations''
    print finder.nbest(bigram_measures.likelihood_ratio, 1000)


Output >>
('side', 'neck'), ('stomach', 'pain'), ('doctor', 'suggested'), ('loss', 'appetite') .....

NLTK documentation for collocations are available here.
Must read on this topic - http://nlp.stanford.edu/fsnlp/promo/colloc.pdf


Monday, April 21, 2014

The sweetness of developing REST services using Dropwizard

Jersey is my goto software for developing REST services and then I came across Dropwizard. It is a simple and neat framework written on top of Jersey and glues together all essential libraries for creating production ready services.

Before externalizing a web service, it must be operationally ready to take real world traffic and provide HA. So many engineers end up writing health checks, enabling the required logs and metrics metrics etc. All of these features are available out of the box in Dropwizard. Furthermore, it has nice features such as HTTP client for invoking other services, authentication,  integration with Hibernate and DI.

So, while the documentation is straight forward and 'Getting Started' guide does get you started, integration with Spring and JPA via Hibernate requires some work. This article will help you with that assuming you have a working service written using Dropwizard.

1. Wiring up Spring and Dropwizard. HelloWorldApplication is the entry point to the Dropwizard application. The run method is where we initialize Spring.


   @Override
   public void run(HelloWorldConfiguration configuration,
   Environment environment) {

  //init Spring context
        //before we init the app context, we have to create a parent context with all the config objects others rely on to get initialized
        AnnotationConfigWebApplicationContext parent = new AnnotationConfigWebApplicationContext();
        AnnotationConfigWebApplicationContext ctx = new AnnotationConfigWebApplicationContext();

        parent.refresh();
        parent.getBeanFactory().registerSingleton("configuration", configuration);
        parent.registerShutdownHook();
        parent.start();

        //the real main app context has a link to the parent context
        ctx.setParent(parent);
        ctx.register(MyAppSpringConfiguration.class);
        ctx.refresh();
        ctx.registerShutdownHook();
        ctx.start();

        //now that Spring is started, let's get all the beans that matter into DropWizard

        //health checks
        Map healthChecks = ctx.getBeansOfType(HealthCheck.class);
        for(Map.Entry entry : healthChecks.entrySet()) {
            environment.healthChecks().register("template", entry.getValue());
        }

        //resources
        Map resources = ctx.getBeansWithAnnotation(Path.class);
        for(Map.Entry entry : resources.entrySet()) {
            environment.jersey().register(entry.getValue());
        }

        //last, but not least,let's link Spring to the embedded Jetty in Dropwizard
        environment.servlets().addServletListeners(new SpringContextLoaderListener(ctx));
}


It is a standard way to initialize Spring. Two important points to note here. One, no need to register resources explicitly. Spring will look for classes annotated with @PATH and register it with Dropwizard. Second is to include MyAppSpringConfiguration which provides additional resources to be included.

2. Definition for MyAppSpringConfiguration is below:


/**
Main Spring Configuration
 */
@Configuration
@ImportResource({ "classpath:spring/applicationContext.xml", "classpath:spring/dao.xml" })
@ComponentScan(basePackages = {"com.nishant.example"})
public class MyAppSpringConfiguration {

}

3. Next, annotate the resource class with @Service annotation and we are ready!


@Service
@Path("/hello-world")
@Produces(MediaType.APPLICATION_JSON)
public class HelloWorldResource {

 @Autowired
 private HelloWorldConfiguration configuration;

    private final AtomicLong counter = new AtomicLong();

    @GET
    @Timed
    public Saying sayHello(@QueryParam("name") Optional name) {

        final String value = String.format(configuration.getTemplate(), name.or(configuration.getDefaultName()));
        return new Saying(counter.incrementAndGet(), value);
    }
}


For complete working example, see https://github.com/nchandra/SampleDropwizardService



Tuesday, September 24, 2013

Hibernate Search based Autocomplete Suggester

In this article, I will show how to implement auto-completion using Hibernate Search.

The same can be achieved using Solr or ElasticSearch. But I decided to use Hibernate Search as its the simplest to get started with, easily integrates with an existing application and leverages the same core - Lucene. And we get all of this without the overhead of managing Solr/ElasticSearch cluster. In all, I found Hibernate Search to be the go-to search engine for simple use cases.

For our use case, we build a product title based auto-completion where often, the user queries are searches for product title. While typing, users should immediately see titles matching their requests, and Hibernate Search should do the hard work to filter the relevant documents in near real-time.

Lets have the following JPA annotated Product entity class.

public class Product {

  @Id
 @Column(name = "sku")
 private String sku;

  @Column(name = "upc")
 private String upc;

  @Column(name = "title")
 private String title;

....
}

We are interested in returning suggestions based on the 'title' field. Title will be indexed based on 2 strategies - N-Gram and Edge N-Gram.

Edge N-Gram - This will match only from the left edge of the suggestion text. For this we use KeywordTokenizerFactory (emits the entire input as a single token)  and EdgeNGramFilterFactory along with some regex cleansing.

N-Gram matches from the start of every word, so that you can get right-truncated suggestions for any word in the text, not only from the first word. The main difference from N-gram is the tokenizer which is StandardTokenizerFactory along with NGramFilterFactory.

Using these strategies, if the document field is "A brown fox" and the query is
a) "A bro"- Will match
b) "bro" - Will match

Implementation: In the entity defined above, we can map 'title' property twice with the above strategies. Below are the annotations to instruct Hibernate to index 'title' twice.


@Entity
@Table(name = "item_master")
@Indexed(index = "Products")
@AnalyzerDefs({

@AnalyzerDef(name = "autocompleteEdgeAnalyzer",

// Split input into tokens according to tokenizer
tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),

filters = {
 // Normalize token text to lowercase, as the user is unlikely to
 // care about casing when searching for matches
 @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
   @Parameter(name = "pattern",value = "([^a-zA-Z0-9\\.])"),
   @Parameter(name = "replacement", value = " "),
   @Parameter(name = "replace", value = "all") }),
 @TokenFilterDef(factory = LowerCaseFilterFactory.class),
 @TokenFilterDef(factory = StopFilterFactory.class),
 // Index partial words starting at the front, so we can provide
 // Autocomplete functionality
 @TokenFilterDef(factory = EdgeNGramFilterFactory.class, params = {
   @Parameter(name = "minGramSize", value = "3"),
   @Parameter(name = "maxGramSize", value = "50") }) }),

@AnalyzerDef(name = "autocompleteNGramAnalyzer",

// Split input into tokens according to tokenizer
tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),

filters = {
 // Normalize token text to lowercase, as the user is unlikely to
 // care about casing when searching for matches
 @TokenFilterDef(factory = WordDelimiterFilterFactory.class),
 @TokenFilterDef(factory = LowerCaseFilterFactory.class),
 @TokenFilterDef(factory = NGramFilterFactory.class, params = {
   @Parameter(name = "minGramSize", value = "3"),
   @Parameter(name = "maxGramSize", value = "5") }),
 @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
   @Parameter(name = "pattern",value = "([^a-zA-Z0-9\\.])"),
   @Parameter(name = "replacement", value = " "),
   @Parameter(name = "replace", value = "all") })
}),

@AnalyzerDef(name = "standardAnalyzer",

// Split input into tokens according to tokenizer
tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),

filters = {
 // Normalize token text to lowercase, as the user is unlikely to
 // care about casing when searching for matches
 @TokenFilterDef(factory = WordDelimiterFilterFactory.class),
 @TokenFilterDef(factory = LowerCaseFilterFactory.class),
 @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
   @Parameter(name = "pattern", value = "([^a-zA-Z0-9\\.])"),
   @Parameter(name = "replacement", value = " "),
   @Parameter(name = "replace", value = "all") })
}) // Def
})
public class Product {

....
}


Explanation: 2 custom analyzers - autocompleteEdgeAnalyzer and autocompleteNGramAnalyzer have been defined as per theory in the previous section. Next, we apply these analyzers on the 'title' field to create 2 different indexes. Here is how we do it:


@Column(name = "title")
@Fields({
  @Field(name = "title", index = Index.YES, store = Store.YES,
analyze = Analyze.YES, analyzer = @Analyzer(definition = "standardAnalyzer")),
  @Field(name = "edgeNGramTitle", index = Index.YES, store = Store.NO,
analyze = Analyze.YES, analyzer = @Analyzer(definition = "autocompleteEdgeAnalyzer")),
  @Field(name = "nGramTitle", index = Index.YES, store = Store.NO,
analyze = Analyze.YES, analyzer = @Analyzer(definition = "autocompleteNGramAnalyzer"))
})
private String title;

Start indexing:

 public void index() throws InterruptedException {
  getFullTextSession().createIndexer().startAndWait();
 }

Once indexed, inspect the index using Luke and you should be able to see title analyzed and stored as N-Grams and Edge N-Grams.

Search Query:

 private static final String TITLE_EDGE_NGRAM_INDEX = "edgeNGramTitle";
 private static final String TITLE_NGRAM_INDEX = "nGramTitle";

 @Transactional(readOnly = true)
 public synchronized List getSuggestions(final String searchTerm) {

 QueryBuilder titleQB = getFullTextSession().getSearchFactory()
   .buildQueryBuilder().forEntity(Product.class).get();

 Query query = titleQB.phrase().withSlop(2).onField(TITLE_NGRAM_INDEX)
   .andField(TITLE_EDGE_NGRAM_INDEX).boostedTo(5)
   .sentence(searchTerm.toLowerCase()).createQuery();

 FullTextQuery fullTextQuery = getFullTextSession().createFullTextQuery(
    query, Product.class);
 fullTextQuery.setMaxResults(20);

 @SuppressWarnings("unchecked")
 List<product> results = fullTextQuery.list();
 return results;
}

And we have a working suggester.

What next? Expose the functionality via a REST API and integrate it with jQuery, examples of which can be easily found.

You can also use the same strategies with Solr and ElasticSearch.

Friday, June 28, 2013

The story of 'Big Data'

Bob, a dairy farm owner, was unhappy. He owned many cows but could not milk them efficiently. Business was good though.

One morning, his neighbor, an owner of fodder factory, came and told him that he had found a way to efficiently milk the cows by letting them sleep for 24 hours and then milking them. Bob ran towards his farm, shouting "big cow, big cow"...

Bob and his neighbor were very happy. Bob got more cows to earn more profit. Few months passed.

One fine morning, his neighbor came and told him that he had found a way to milk the cows round the clock, without going through 24 hours of sleep period - by giving a rapid burst of low voltage electric shock to cows. Bob ran towards his farm, shouting "real-time big cow, real-time big cow"...

Now we have big cows, real-time big cows and fodder factories everywhere.

Wednesday, June 12, 2013

Graph Analytics: Discovering the undiscovered!

I was researching graph analytics and it's probable applications. Graph analysis and big data are overlapping areas and then I came across this piece of text which beautifully summarizes the difficulty of discovering the unknown.

"Many Big Data problems are about searching for things you know that you want to find. It's challenging because the volumes of data make it like searching for a needle in a haystack. But that’s easy because a needle and a piece of hay, though similar, do not look exactly alike.

However, discovering problems are about finding what you don't know. Imagine trying to find a needle in a stack of needles. That's even harder. How can you find the right needle if you don't know what it looks like? How can you discover something new if you don't know what you're looking for?

In order to find the unknown, you often have to know the right question to ask. It takes time and effort to ask every question and you keep learning as you continue to ask questions ....."

This happens to be true for any evolving technology that solves some existing problem in a novel way, and due to its sheer sophistication, opens door for us to ask for more, to be inquisitive and finally discover. 

Original source: YarcData

Monday, June 10, 2013

Architecture: Cloud based Log Management Service



This article discusses the architecture of a cloud-based log management service.  The LaaS (Logging-as-a-Service), code named ‘Trolly’, will provide log management and data analysis. We will cover technical requirements, high level architecture, data flow and technology selection based on open source software.

1.0 Modelling
The proposed architecture is a layered architecture, and each layer has components. Within layers, pipeline and event-driven architecture are employed.

1.1 Layered architecture

Multi-layered software architecture is a software architecture that uses many layers for allocating the responsibilities.

1.2 Pipeline architecture

A pipeline consists of a chain of processing elements (processes, threads etc.), arranged so that the output of each element is the input of the next.

A pipe is a message queue. A message can be anything. A filter is a process, thread, or other component that perpetually reads messages from an input pipe, one at a time, processes each message, then writes the result to an output pipe. Thus, it is possible to form pipelines of filters connected by pipes.

1.3    Event driven architecture

Event-driven architecture (EDA) is a software architecture pattern promoting the production, detection, consumption of, and reaction to events.

An event can be defined as "a significant change in state". For this system, a line in a log file represents an event.

2.     Motivation

Layered architecture helps in:

·     Clear separation of responsibilities – search, alerting and visualization.
·     Ability to replace one or several layers implementation with minimum effort and side effects.

An event driven architecture is extremely loosely coupled and well distributed. Event driven architectures have loose coupling within space, time and synchronization, providing a scalable infrastructure for information exchange and distributed workflows.

3.     Requirements

3.1   In scope

·         Data Visualization
·         Search
·         Custom Alerts

3.2    Technical requirements

·         Near-real time processing
·         Multi-tenant – Support multiple clients
·         High availability
·         Scalable – The system must scale for following reasons:
         o   New clients sign up
         o   Existing clients upgrade


3.3    Not covered in this article

·         Client Registration
·         Billing, Reporting and Archival
·         User Management
·         Third party integration
-      Self-service – Automatic provisioning of hardware and software 
·         Multi-data center deployment and redundancy
·         Client and Service geographic proximity

4.0 Trolly Architecture
Trolly High Level Architecture

Trolly has the following components:

i.                     Data collection
            a.       Adaptors for accepting input log streams
ii.                   Data access
            a.       Portal
            b.      API gateway
iii.                  Data pre-processing
            a.       Core processor
iv.                 Alerting
            a.       Complex event processing engine
v.                   Search service
            a.       Distributed indexer and search
vi.                 Log storage
            a.       Distributed data store

Each component is explained in the following sections.


 
Data Flow and Components Diagram

5.     Concepts

5.1.1          Definitions

Client: Users of Trolly and uniquely identified by a Client Id.
Source: Log source e.g. Apache HTTP logs. Client can configure multiple sources.
Stream: A log stream sent by a client and originating from a pre-defined source. Stream is keyed by Stream Id.  Stream uniquely maps to a source.

5.1.2          Data Collection

The API gateway will accept logs from multiple end points like rsyslog, REST over HTTP/HTTPS etc. It will authenticate client and queue the log event for further preprocessing.

RabbitMQ will be used for messaging. A cluster setup with mirrored queues provides high availability.

5.1.3          Data pre-processing

The core processor cleans data, applies regex based on the input format and converts it to JSON format. Each processed log event is forked to three different locations - CEP engine, indexing engine and data store.

The JSON file is:
i.                     Sent to CEP engine for alerting.
ii.                   Sent to indexing service.
iii.                  Stored in a distributed datastore.

5.1.4          Custom Alerts

Here we choose software that provides low-latency and high-throughput complex event processing. The software must support HA configuration. Esper is a complex event processing library. It is stream oriented, supports HA and can process 100,000 - 500,000 events / sec at an average latency of 3 micro seconds and well suited for our requirements.

The system will create a new queue for every Stream defined by a client. This will act as a source for Esper. Custom user supplied alerts are converted to Espers DSL (EPL). The subscribers receive output events that is -

i.                     Persisted it in MySQL database and
ii.                   Published to 3rd party integration end-points e.g. SMS gateway.

5.1.5          Indexing and Search

The design must support near-real time indexing, high availability (HA) and multi-tenancy. To address scalability and HA, we use a distributed cluster based design. ElasticSearch is a good fit for above requirements.

ElasticSearch supports multi-tenancy. Multiple indexes can be stored in a single ElasticSearch server, and data of multiple indexes can be searched with a single query.

5.1.5.1                Index Setup

The system will use indices based on time. For example, the index name can be the date in a format like <Client Id>-DD-MM-YYYY. Client Id is a unique identifier assigned to every client. The timeframe depends on how long the client intends to keep logs.

For storing a week worth of logs, system will maintain one index per day and for a year worth of logs, system will maintain one index per month. This will help in narrowing searches to the relevant indices. Often the searches will be for recent logs, and a “quick search” option in GUI can only looks in the last index.

Further, the BookKeeping module will rotate and optimize logs as well as delete old indexes.

5.1.5.2           High Availability

Each index will be divided into 5 or more shrads. Each shrad will have one replica. Replicas will help the cluster continue running when one or more nodes go down. It will also be used for searching.

5.1.5.3                Performance considerations

·         Empirically determine number of shrads and replicas
·         Refresh indexes asynchronously and automatically every few seconds
·         Enable bulk indexing


Proposed Elastic Search Topology

5.1.6          Log Analysis and Storage

If the log volume is low, the original log line can be compressed and stored in the search index. But this seems unlikely and the raw logs will be written to a distributed highly-available data store.

5.1.6.1                Log Analysis Software Stack

 Hadoop is an open source framework that allows users to process very large data in parallel. The Hadoop core is mainly divided into two modules:
    o   HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using multiple commodity servers connected in a cluster.
    o   Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is bonded with HDFS.
    o   The database will be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, as we can keep historical processed data for reporting purposes. HBase is an open source columnar DB or NoSQL DB, which uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets - HBase can save very large tables having million of rows. It's a distributed database and can also keep multiple versions of a single row.
    o   The Pig framework is an open source platform for analyzing large data sets and is implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers who write code in the Map-Reduce format, since code in Map-Reduce format needs to be written in Java. In contrast, Pig enables users to write code in a scripting language.

5.1.6.2           Data Flow

As discussed above, raw log lines are processed and stored as JSON in HBase. JSON will serve as a common input format for all visualization tools.

5.1.7          Data Access

5.1.7.1                Reporting and Visualization

For basic reporting, Google chart or High Charts are a good fit. For data visualization, D3 or InfoViz toolkit can be used.

     o   For analyzing recent log events, HBase will be directly queried. For advanced analysis, HBase will have the data processed by Pig scripts ready for reporting and further slicing and dicing can be performed.
      o   The data-access Web service is a REST-based service that eases the access and integrations with data clients.

6. Summary

The proposed architecture attempts to show how large amounts of unstructured log events can be easily analyzed using the Apache Hadoop, ElasticSearch and Esper framework.

Feel free to comment and share your thoughts about the proposed architecture.