Sunday, 12 February 2012

Under the bonnet of Endeca

Endeca Information Discovery (EID) grants BI users agility of querying, navigating and searching across structured, semi-structured and unstructured data. The backbone of all EID applications is MDEX Engine. It stores data and receive requests via Endeca Web Services. After the execution of query request, MDEX Engine will return result to Endeca Web Services in XML format. Then front-end application in Endeca Studio performs formatting of the query result and return them to the client browser.

Today let's open the bonnet of EID and have a look at the nuts and bolts inside MDEX Engine.

Firstly of all, everything is running in "Dgraph" which is the term for the process of MDEX Engine. Data in MDEX Engine won't be accessible without a related running Dgraph. Relevant Data from variant source will be extracted, transformed and loaded into MDEX Engine. Compare to traditional RDBMS or OLAP Cubes, MDEX Engine structure its data in a different way.

The data model in the MDEX Engine consists of records and attributes.
  • Records are the fundamental units of data.
  • Attributes are the fundamental units of a record schema which describes the data model of Records.

For a data record, an assignment on an attribute (also known as key value pairs) provides information about that record. For example, for a list of bike records, an assignment on the "Category" attribute contains the category description (e.g. mountain) of the bike record. Each attribute is identified by a unique name.

Each attribute on a data record is itself represented by a record that describes this attribute. Following the bike records example, there is a record that describes the "Category" attribute. A collection of these records that describe attributes forms a schema for your records. The aspects of the attribute on a data record are configured in the schema. For example, an attribute on any data record can be searchable or not.

Let's have a look at an example which may help you to digest these concepts.

In an MDEX Engine which stores bike information, a typical Data Record will be like below:
  • TxnID = 12324
  • ProductID = 506
  • Category = Mountain Bike
  • Amount = $499.99
  • Suspension = Fox 32 F-Series
  • FrameType = Aluminium
  • Saddle = Bontrager SSR
  • Mountain Accessories = Fork and shock sag meter
  • Mountain Accessories = Water Bottle
  • Review = A great bike for off road. Smooth ride over the bumps
  • ReviewSentiment = Positive
  • ReviewTerm = Great
  • ReviewTerm = Off Road
  • ReviewTerm = Smooth
  • ReviewTerm = Bumps
In each line of this data record, Attribute is the part that is on the left-hand side of equation symbol. Attribute may be single-assign or multi-assign. In this example, attributes such as "TxnID" and "ProductID" are single-assign while attribute "Mountain Accessories" is multi-assign. In the MDEX Engine data model, Primary Keys (also known as Record Specs) are used to uniquely identify records.

The System Record that describes the Attribute “Category” may look like:
  • Name = Category
  • Type = String
  • Display Name = Category
  • Searchable = Yes
  • Sort = Ascendant
The collection of system records is called Schema.

In MDEX Engine, data records are not necessary to be stored in a conformed container. Null value key pair such as "AttributeName = Null" are not allowed. For example, as source data, if a relational database record has NULL value for column "Suspension", when it's loaded into MDEX Engine, a new MDEX data record will be inserted but no Attribute "Suspension" will be created for that record. So, it's not unusual have "Jagged records" like below exist in MDEX Engine, though they are describing the same business entity.

Data Records in MDEX Engine may be loaded from structured, semi-structured or unstructured data sources.
For structured data, each Tuple becomes a Data Record and each column (except for the columns with NULL value) becomes an Attribute.

Semi-Structured data is normally from enterprise applications, HTTP feeds, XML sources, etc. It will also be loaded as attribute/value pairs.This is a common cause of "jagged" record structure.

As the key differentiator, EID extends BI analysis to unstructured data such as text documents or social data. In MDEX Engine, unstructured data can be stored as their own records for "side-by-side" analysis. Or, they can be linked to existing data records by any available key.

Any unstructured attribute can be enriched using text analytics to expand the structure of its containing record. Common techniques include but are not limited to Automatic tagging, Named entity extraction ,Sentiment analysis ,Term extraction.

Beyond all these data records which consolidate information from database, XML document, Facebook, etc, MDEX Engine also creates hierarchy/relationship graphs, indexes for the attributes and attribute values. Those graphs and indexes are so important that information discovery can not be performed effectively and efficiently on MDEX data records without them.

In summary, MDEX Engine of Endeca stores information in data records as series of Attribute/Value pairs. Data Records can be structured differently with each other. With patented mechanisms of managing navigation graph for attribute relationships and hierarchies, users can quickly navigate through different attributes, search for keywords, or create queries as a more conventional approach. With MDEX Engine, no data is left behind.

Until next time, stay intelligent, stay agile.

No comments:

Post a Comment