Information, Intelligence, Intellectual: February 2012

Monday, 27 February 2012

Discovery channel - provided by Endeca Studio

Besides a unique way of organizing data, Endeca also provides an innovative interface for exploring and discovering information. Today let's have a closer look at the end user layer of EID - Endeca Studio.

Endeca Studio is an interactive, component-based environment for building analytic applications powered by the MDEX Engine. It is built on the web-based Liferay infrastructure, which enables analytic applications to be delivered through web browsers.

In Endeca Studio, a logged-in user can have access to a number of Endeca analytic applications. Each analytic application may have multiple tabs and read data from more than one Dgraph. Each application tab contains various components such as a search box, chart, table, guided navigation, tag cloud, etc. The example here is an Endeca Studio application built for a consulting company. The initial view of this application shows overview information about staff from different perspectives - a numeric breakdown by category, average billing rate, detailed information for each individual and so on.

Suppose I am a manager who needs to find consultants for an urgent Siebel CRM project. I can start my journey of information discovery with "navigation". After clicking on "Now" and "Siebel -CRM" on the navigation panel, the system immediately responds by returning the smaller list of consultants who match those criteria, and by dynamically summarizing this smaller list of records and updating all metrics and analytics as well.
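The refinement step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Endeca implements it: the record layout, the attribute names (Availability, Skill, Rate), the sample consultants and the helper names refine and summarize are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical consultant records; names, attributes and values are invented.
consultants = [
    {"Name": "Ann", "Availability": "Now", "Skill": ["Siebel CRM", "OBIEE"], "Rate": 120},
    {"Name": "Bob", "Availability": "Now", "Skill": ["Java"], "Rate": 100},
    {"Name": "Cho", "Availability": "Next Month", "Skill": ["Siebel CRM"], "Rate": 110},
    {"Name": "Dee", "Availability": "Now", "Skill": ["Siebel CRM"], "Rate": 140},
]

def values_of(record, attribute):
    """Return the values assigned to an attribute as a list (empty if absent)."""
    value = record.get(attribute, [])
    return value if isinstance(value, list) else [value]

def refine(records, selections):
    """Keep only the records that match every selected attribute value."""
    return [r for r in records
            if all(v in values_of(r, a) for a, v in selections.items())]

def summarize(records):
    """Recompute the metrics that are displayed next to the result list."""
    rates = [r["Rate"] for r in records]
    return {
        "count": len(records),
        "average_rate": sum(rates) / len(rates) if rates else None,
        "skills": Counter(s for r in records for s in values_of(r, "Skill")),
    }

# Two navigation clicks: "Now" and "Siebel CRM".
shortlist = refine(consultants, {"Availability": "Now", "Skill": "Siebel CRM"})
summary = summarize(shortlist)
print([r["Name"] for r in shortlist], summary["count"], summary["average_rate"])
```

Every additional click simply adds one more entry to the selections, and the summary is recomputed over whatever records remain.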

I can further refine my navigation result by choosing more navigation categories.

French is also one of the requirements for this project. From the 21 short-listed staff, I need to find people who can speak French. However, language competency is not available as a navigation category or table column that I can filter on. It could be because the company has never captured that information in the HR system or other structured database systems. Individuals normally keep language competency in their resume files. Remember that the Endeca MDEX Engine is capable of consolidating both structured and unstructured data. I can perform a search in this Endeca application against my MDEX Engine. After automatically correcting my spelling mistake, Endeca finds two consultants who can speak French and also meet all the other criteria.
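To make the search step more concrete, here is a minimal Python sketch of a keyword search over resume text with a simple spelling correction. It is a toy stand-in under stated assumptions: the resume snippets and names are invented, and the correction uses Python's standard difflib module, which has nothing to do with Endeca's own spelling correction.

```python
import difflib
import re

# Hypothetical short-listed records with resume text attached (invented data).
shortlist = [
    {"Name": "Ann", "Resume": "Siebel CRM consultant. Fluent in French and English."},
    {"Name": "Dee", "Resume": "Ten years of Siebel CRM delivery. Speaks German."},
    {"Name": "Eli", "Resume": "CRM architect, conversational French."},
]

def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Every word that appears in any resume; used as the correction dictionary.
vocabulary = sorted({t for r in shortlist for t in tokenize(r["Resume"])})

def correct(term):
    """Return the closest known word, or the term itself if nothing is close."""
    term = term.lower()
    if term in vocabulary:
        return term
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.75)
    return matches[0] if matches else term

def search(records, term):
    """Search resume text for the (possibly corrected) term."""
    corrected = correct(term)
    hits = [r["Name"] for r in records if corrected in tokenize(r["Resume"])]
    return corrected, hits

# A misspelled query still finds the French speakers.
print(search(shortlist, "Frnech"))
```

The point of the sketch is that the search runs over text that was never modeled as a column, while the short list itself came from structured navigation.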

To verify the validity of the search result, I can choose to view the detailed information for one of my candidates.

If I click on the PDF icon, I can open the PDF file for that person's resume. As we can see, "Fluent in French" is one of the personal competencies that he included in his resume.

So, unlike traditional BI operations such as choosing columns, defining filters and combining subject areas, it only took me a few navigation clicks and a search to quickly find the information I needed in Endeca Studio. More importantly, data stored in unstructured format was not left out of the decision-making process.

Sunday, 12 February 2012

Under the bonnet of Endeca

Endeca Information Discovery (EID) gives BI users the agility to query, navigate and search across structured, semi-structured and unstructured data. The backbone of all EID applications is the MDEX Engine. It stores data and receives requests via Endeca Web Services. After executing a query request, the MDEX Engine returns the result to Endeca Web Services in XML format. The front-end application in Endeca Studio then formats the query result and returns it to the client browser.

Today let's open the bonnet of EID and have a look at the nuts and bolts inside MDEX Engine.

First of all, everything runs in a "Dgraph", which is the term for the MDEX Engine process. Data in the MDEX Engine is not accessible without a related running Dgraph. Relevant data from various sources is extracted, transformed and loaded into the MDEX Engine. Compared to a traditional RDBMS or OLAP cube, the MDEX Engine structures its data in a different way.

The data model in the MDEX Engine consists of records and attributes.

Records are the fundamental units of data.
Attributes are the fundamental units of a record schema which describes the data model of Records.

For a data record, an assignment on an attribute (also known as a key-value pair) provides information about that record. For example, for a list of bike records, an assignment on the "Category" attribute contains the category description (e.g. mountain) of the bike record. Each attribute is identified by a unique name.

Each attribute on a data record is itself represented by a record that describes this attribute. Following the bike records example, there is a record that describes the "Category" attribute. A collection of these records that describe attributes forms a schema for your records. The aspects of the attribute on a data record are configured in the schema. For example, an attribute on any data record can be searchable or not.

Let's have a look at an example which may help you to digest these concepts.

In an MDEX Engine that stores bike information, a typical Data Record looks like the one below:

TxnID = 12324
ProductID = 506
Category = Mountain Bike
Amount = $499.99
Suspension = Fox 32 F-Series
FrameType = Aluminium
Saddle = Bontrager SSR
Mountain Accessories = Fork and shock sag meter
Mountain Accessories = Water Bottle
Review = A great bike for off road. Smooth ride over the bumps
ReviewSentiment = Positive
ReviewTerm = Great
ReviewTerm = Off Road
ReviewTerm = Smooth
ReviewTerm = Bumps

In each line of this data record, the Attribute is the part on the left-hand side of the equals sign. An Attribute may be single-assign or multi-assign. In this example, attributes such as "TxnID" and "ProductID" are single-assign while the attribute "Mountain Accessories" is multi-assign. In the MDEX Engine data model, Primary Keys (also known as Record Specs) are used to uniquely identify records.
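The single-assign versus multi-assign distinction can be illustrated with a short Python sketch. It takes a subset of the assignments from the bike record above as a flat list of (attribute, value) pairs and groups them by attribute. The function name group_assignments is made up for this example, and note the simplification: here multi-assign is inferred from a single record, whereas in a real data model it would be a property of the attribute itself.

```python
from collections import defaultdict

# A subset of the assignments from the bike record, as (attribute, value) pairs.
assignments = [
    ("TxnID", "12324"),
    ("ProductID", "506"),
    ("Category", "Mountain Bike"),
    ("Mountain Accessories", "Fork and shock sag meter"),
    ("Mountain Accessories", "Water Bottle"),
    ("ReviewSentiment", "Positive"),
    ("ReviewTerm", "Great"),
    ("ReviewTerm", "Off Road"),
    ("ReviewTerm", "Smooth"),
    ("ReviewTerm", "Bumps"),
]

def group_assignments(pairs):
    """Group assignments by attribute name, keeping every assigned value."""
    record = defaultdict(list)
    for attribute, value in pairs:
        record[attribute].append(value)
    return dict(record)

record = group_assignments(assignments)

# Attributes holding more than one value on this record behave as multi-assign.
multi_assign = sorted(a for a, values in record.items() if len(values) > 1)
print(multi_assign)
```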

The System Record that describes the Attribute “Category” may look like:

Name = Category
Type = String
Display Name = Category
Searchable = Yes
Sort = Ascendant

The collection of system records is called the Schema.
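As a rough sketch of how such a schema could drive behaviour, the Python below keeps one system record per attribute and uses the "Searchable" flag to decide which values a keyword search may look at. The "Category" entry mirrors the system record shown above; the other two entries, the dictionary layout and the function names are assumptions for illustration only.

```python
# Hypothetical system records keyed by attribute name. "Category" follows the
# example system record; "Review" and "TxnID" are invented for this sketch.
schema = {
    "Category": {"Type": "String", "Display Name": "Category", "Searchable": True},
    "Review": {"Type": "String", "Display Name": "Review", "Searchable": True},
    "TxnID": {"Type": "String", "Display Name": "Transaction ID", "Searchable": False},
}

record = {
    "TxnID": ["12324"],
    "Category": ["Mountain Bike"],
    "Review": ["A great bike for off road. Smooth ride over the bumps"],
}

def searchable_text(record, schema):
    """Collect the values of attributes that the schema marks as searchable."""
    return [value
            for attribute, values in record.items()
            if schema.get(attribute, {}).get("Searchable", False)
            for value in values]

def matches(record, schema, keyword):
    """Return True if the keyword occurs in any searchable value."""
    keyword = keyword.lower()
    return any(keyword in value.lower() for value in searchable_text(record, schema))

print(matches(record, schema, "mountain"))  # looks inside Category
print(matches(record, schema, "12324"))     # TxnID is excluded from search
```

Changing how an attribute behaves then means editing its system record, not the data records that carry the attribute.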

In the MDEX Engine, data records do not have to be stored in a uniform container. Null-valued key pairs such as "AttributeName = Null" are not allowed. For example, if a source relational database record has a NULL value for the column "Suspension", then when it is loaded into the MDEX Engine a new MDEX data record will be inserted, but no "Suspension" attribute will be created for that record. So it is not unusual for "jagged records" like those below to exist in the MDEX Engine, even though they describe the same business entity.
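A minimal Python sketch of this loading rule: each source row becomes a data record, and any column whose value is NULL (None in Python) is dropped, so no attribute is created for it. The sample rows and the function name to_data_record are assumptions for illustration; this is not the actual Endeca loading API.

```python
# Hypothetical rows as they might come out of a relational source table.
rows = [
    {"ProductID": 506, "Category": "Mountain Bike", "Suspension": "Fox 32 F-Series"},
    {"ProductID": 507, "Category": "Road Bike", "Suspension": None},
]

def to_data_record(row):
    """Turn a row into attribute/value pairs, skipping NULL columns entirely."""
    return {column: value for column, value in row.items() if value is not None}

records = [to_data_record(row) for row in rows]

# The two records describe the same kind of entity but have different shapes.
for record in records:
    print(sorted(record))
```

The second record ends up with no "Suspension" attribute at all, which is exactly what makes the records jagged.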

Data Records in MDEX Engine may be loaded from structured, semi-structured or unstructured data sources.

For structured data, each Tuple becomes a Data Record and each column (except for the columns with NULL value) becomes an Attribute.

Semi-structured data normally comes from enterprise applications, HTTP feeds, XML sources, etc. It is also loaded as attribute/value pairs. This is a common cause of the "jagged" record structure.

As the key differentiator, EID extends BI analysis to unstructured data such as text documents or social data. In the MDEX Engine, unstructured data can be stored as its own records for "side-by-side" analysis. Or it can be linked to existing data records by any available key.

Any unstructured attribute can be enriched using text analytics to expand the structure of its containing record. Common techniques include, but are not limited to, automatic tagging, named entity extraction, sentiment analysis and term extraction.
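The idea of enrichment can be shown with a deliberately naive Python sketch. It takes the "Review" text from the bike record earlier in this post and derives "ReviewTerm" and "ReviewSentiment" attributes from it. The word lists and the enrich function are invented for this example; real text analytics is far more sophisticated than matching against hand-picked lists.

```python
import re

# Hand-picked lists, assumed purely for illustration.
POSITIVE = {"great", "smooth"}
NEGATIVE = {"poor", "noisy"}
KNOWN_TERMS = ["great", "off road", "smooth", "bumps"]

def enrich(record):
    """Return a copy of the record with terms and sentiment derived from Review."""
    text = record["Review"].lower()
    enriched = dict(record)
    # Term extraction: keep the known terms that occur in the review text.
    enriched["ReviewTerm"] = [term.title() for term in KNOWN_TERMS if term in text]
    # Sentiment: compare how many positive and negative words appear.
    words = set(re.findall(r"[a-z]+", text))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        enriched["ReviewSentiment"] = "Positive"
    elif score < 0:
        enriched["ReviewSentiment"] = "Negative"
    else:
        enriched["ReviewSentiment"] = "Neutral"
    return enriched

review_record = {"Review": "A great bike for off road. Smooth ride over the bumps"}
enriched = enrich(review_record)
print(enriched["ReviewTerm"], enriched["ReviewSentiment"])
```

The derived values become ordinary attribute assignments, so they can be navigated and filtered like any structured column.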

Beyond all these data records, which consolidate information from databases, XML documents, Facebook, etc., the MDEX Engine also creates hierarchy/relationship graphs and indexes for the attributes and attribute values. Those graphs and indexes are so important that information discovery cannot be performed effectively and efficiently on MDEX data records without them.
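To give a feel for why such indexes matter, here is a small Python sketch of an inverted index that maps each attribute/value pair to the set of records carrying it. Navigation then becomes a set intersection and no longer needs a scan over every record. This is a generic textbook technique used as an assumption here; the internal structures of the MDEX Engine are not described in this post, and the sample records and function names are invented.

```python
from collections import defaultdict

# Hypothetical data records keyed by record id; values are lists because
# attributes may be multi-assign.
records = {
    1: {"Category": ["Mountain Bike"], "FrameType": ["Aluminium"]},
    2: {"Category": ["Road Bike"], "FrameType": ["Carbon"]},
    3: {"Category": ["Mountain Bike"], "FrameType": ["Carbon"]},
}

def build_index(records):
    """Map every (attribute, value) pair to the ids of records that carry it."""
    index = defaultdict(set)
    for record_id, record in records.items():
        for attribute, values in record.items():
            for value in values:
                index[(attribute, value)].add(record_id)
    return index

def navigate(index, selections):
    """Return ids of records matching all selected (attribute, value) pairs."""
    id_sets = [index.get(selection, set()) for selection in selections]
    return set.intersection(*id_sets) if id_sets else set()

index = build_index(records)
print(navigate(index, [("Category", "Mountain Bike")]))
print(navigate(index, [("Category", "Mountain Bike"), ("FrameType", "Carbon")]))
```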

In summary, the MDEX Engine of Endeca stores information in data records as series of Attribute/Value pairs. Data Records can be structured differently from each other. With patented mechanisms for managing the navigation graph of attribute relationships and hierarchies, users can quickly navigate through different attributes, search for keywords, or create queries as a more conventional approach. With the MDEX Engine, no data is left behind.

Until next time, stay intelligent, stay agile.

Sunday, 5 February 2012

Agility Acquired

Together with a well-designed ODS (operational data store) or Data Warehouse, OBIEE is a comprehensive, reliable and scalable BI solution. Users get information in various ways: operational reporting, dashboards, scorecards, ad-hoc analysis, what-if analysis, proactive alerting, mobile, etc. It can grow quickly and smoothly. From gigabytes to terabytes, from a single server to a cluster, from disk to in-memory, hundreds or thousands of users can access business information concurrently.

But what about agility? Can I quickly perform data navigation without going through the modeling practice in the DW or BI Server? What if my data is volatile and includes unstructured or semi-structured information? Is it possible to do analysis via search? All these requests from the business for quick and "good enough" analysis have been giving OBIEE a hard time. Now these challenges can be gracefully addressed by a new member of Oracle BI: Endeca Information Discovery. It will complement Oracle's BI solution by providing agile data discovery on structured and unstructured information.

Endeca Information Discovery (EID) helps organizations quickly explore all relevant data. You may have sales transactions from an OLTP database, departmental forecast data in Excel files, customer survey results in Word documents and product review articles on public websites/forums. Traditionally you have to model that data into a relational star schema or multi-dimensional cubes before you can start creating reports and dashboards to answer the questions provided by the business. Some valuable information may not be included in the analysis because the underlying data is unstructured and too hard to model.

With EID, users can quickly consolidate data and perform data discovery in the style of navigating and searching. With no need to carefully create logical and physical models, the MDEX Engine (the data store of EID) enables users to bring different information together, structured and unstructured, while keeping the associations between them. Then from Endeca Studio (the browser-based end-user layer), users can simply explore the data by searching for keywords or clicking through different attributes (think of them as columns in dimensions), or create reports and charts in the old ways. EID helps the business reveal answers to questions like "What is the sales revenue of my Top 5 products that my customers describe online with certain keywords such as green, economic, etc.?" and "What are the other most contributing attributes, such as product color or customer demography, for those Top 5 products?"

Below is the architecture of Endeca Information Discovery:

At a glance, it looks similar to the structure of traditional OBIEE. However, there are major differences from the end user's perspective. Rather than creating reports by choosing columns from tables in subject areas, EID users are able to quickly explore the data with a combination of traditional and agile approaches:

Endeca Information Discovery is an exciting complement to Oracle's current BI solution. The agility acquired enables the business to analyze information across a much wider spectrum and at a faster speed. Finally, the "invisible world" in business can be seen and can contribute to daily business decision making.

In the coming blog posts, I will gradually scratch the surface of EID and show you the details behind the scenes of Oracle's new Agile BI.