Friday, May 08, 2009

More the data, less the business intelligence

SUMMARY
Often, there is a quest to do more with the data we have. The notion is - "We have this piece of data; it probably means something important. Let's find out!" This has led to the creation of mammoth corporate data warehouses which are poor at imparting knowledge and big on maintenance. Much of the data that they contain is fractured, incorrect, and redundant. Good decision making doesn't need ALL the data that is available. In fact, good decision making demands limited data, i.e. KPIs. This means that data warehouses, which are our decision support engines, need to be built as per specs (not as per what source systems contain). Moreover, we need to educate ourselves to look beyond data - recent global events have acutely highlighted the inherent limitations of data and of the people who use it.


DATA WAREHOUSES AS DUMPING GROUNDS
As organizations become more tech-savvy, they want to do more with the data they generate. Letting data simply be created and consumed for its fundamental purpose - facilitating a core business transaction - is unacceptable. Organizations want to store, examine, tweak and mix data, and hope that it will be useful in some way.

More and more buzzwords – business performance management, digital dashboards, scorecards, web analytics, on-demand BI – have added to the confusion in the information management space, all the while making the CIO worry that he is not doing enough with the data he owns. Organizations are trying to extract the last ounce of meaning from the smallest data element. The notion is - "We have this piece of data; it probably means something important. Let's find out!" I wonder if that is the right approach.

Data stores are typically built for analysis and to provide "business intelligence". Large ones - data warehouses - contain an aggregate of all the organizational data. Data marts focus on specific subject areas or departments. Operational data stores cater to immediate needs - minute-by-minute or daily updates - and are smaller in size. The thing to note is that these stores are not core transactional systems themselves. They are decision support instruments that help management make decisions and come up with business plans based on those decisions.
Capturing all the available data in these stores is fraught with numerous problems.

1. The more data we bring into them, the more noise we add. Some of this noise can be attributed to incorrect system entry or untidy manual processing. Sometimes, however, there is a genuine anomaly, e.g. sudden freak rains on a given day cut footfall on the high street by 40%. But does this information help us in business planning? Not really.

2. It is difficult to sift through loads and loads of data. It obscures rather than informs. You might organize all the data in neat, fancy multi-dimensional reports. However, someone has to trawl through them to understand what the hell all those arrays, tags, and numbers are. By the time you do that, the time to make a critical decision may have already passed.

3. Capturing minutiae in a data warehouse requires huge spending. The infrastructure to clean the data, pump it into the warehouse, and store it adds massively to the costs. For companies already struggling to build a plain vanilla data warehouse, it is plain vanilla foolishness to waste money on a real-time, all-singing, all-dancing BI behemoth.

4. Be prepared for poor performance from a large data warehouse. Normal databases cannot match the performance of data warehouse appliances, which are specifically designed for large data volumes and sophisticated data analysis. However, such appliances are extremely expensive. Regular RDBMS databases struggle to process vast quantities of data; load times and SQL queries are much slower, whatever optimization strategy is used. A lot of user queries result in full table scans and fetch millions of records at a time. This is thanks to the approach "Let's see what I can get out of the system" rather than "What do I really want?" It puts undue pressure on the system, leading to poor performance. No wonder it leaves users dissatisfied, and in the end they turn away from the system.

5. With oceans of data available, decision makers are free to glean information that fits their views or needs. This phenomenon is called confirmation bias. Managers can use it to justify last year's spending or next year's budget, demonstrate exceptional performance, or pin the blame for poor performance on external factors. It doesn't matter whether people do this selective data filtering consciously or subconsciously. The fact is we do it.

WHY DOES ALL THE DATA GET ASKED FOR?
Most of these projects are still run as IT change initiatives. This means that the business is considered ancillary to the whole activity. The IT project managers don't press hard enough for business resources. Unfortunately, the team managers do not want to take key people away from their day-to-day activities to engage in something that provides no immediate value-add and may harm their team's performance. In the absence of clear business requirements, the IT team is forced to design a system that caters to any eventuality. Alas, that is not possible.

Some blame rests on the IT team as well. The business representatives don't know what features or functionality will be available in the new world. The business could do with some handholding on the capabilities of the new architecture – screen layouts, filters, collaboration, scheduling, security, flexibility etc. – so that it knows what to expect and can provide its requirements accordingly. A few mock-ups go a long way in flicking the light bulb on, giving the business users ideas about ways to use the new architecture effectively and also demonstrating its limitations. In the absence of any business-IT workshops early on, the business will keep drifting and won't know what to expect. It will keep asking for everything.

DATA AND DECISION MAKING
Decision support stores provide a chance to see through the convoluted, shifting data spaghetti and glean the most pertinent information from it. The starting point always has to be 'What do we need to find out?', backed by a strong 'Why?', before moving to 'How can we achieve this?'

We fail to understand the intrinsic difference between data, information, and knowledge. Data is unprocessed content. Information is a higher abstraction which provides additional meaning to the data. Knowledge is the application of the underlying information or data; it's about a true understanding of 'what lies beneath'. The important thing to note is that more data doesn't mean more knowledge. And more information doesn't equal more knowledge. In fact, just the opposite may be true.

Good decision making doesn't need ALL the data that is available. In fact, good decision making demands limited data, known as KPIs - key performance indicators. For example, a KPI for a recruitment team is the recruitment increase percentage, which can be derived from the number of new recruits this year compared with the previous year. You might capture a whole lot of information about the new recruits, but most of it is peripheral to the success of the recruitment team. A doctor doesn't need to record all your physiological variables to correctly diagnose a problem. Just 3-4 key indicators are enough to tell him if you are having a heart attack. However, we humans tend to believe that the more information we can acquire before making a decision, the better, even when the extra information cannot affect the decision. This is known as information bias. What is not worth knowing is not worth knowing. Based on this, it is critical to limit the content of any decision support system - be it a warehouse or an operational data store.
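As a rough illustration of how little data such a KPI needs, here is a minimal Python sketch of the recruitment example. It assumes that "recruitment increase percentage" means the year-over-year percentage change in new recruits; the function name and the sample figures are hypothetical, not taken from any real system.

```python
def recruitment_increase_pct(recruits_this_year: int, recruits_last_year: int) -> float:
    """Year-over-year percentage change in the number of new recruits.

    Assumption: the KPI is (this year - last year) / last year, as a percentage.
    """
    if recruits_last_year <= 0:
        raise ValueError("previous year's recruit count must be positive")
    return (recruits_this_year - recruits_last_year) * 100.0 / recruits_last_year


# Hypothetical figures: 120 recruits this year against 100 last year.
kpi = recruitment_increase_pct(120, 100)
print(f"Recruitment increase: {kpi:.1f}%")
```

Only two numbers feed the indicator; everything else captured about the recruits is peripheral to it.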

Some really good decision making is intuitive. With just a handful of information and unconscious rapid cognition, we can arrive at an accurate judgement very quickly. It is what Malcolm Gladwell calls thin-slicing in his fascinating book, Blink. It is a powerful, sophisticated tool for taking quick decisions with minimum information. In the face of mountains of data, however, this ability no longer functions. Intuition is, by nature, fragile and short-lived, and too much information can often paralyze it.

WHAT TRUE ANALYSIS DEMANDS
Real, meaningful analysis needs time and distance. There is no benefit in seeing all the variables right this second in order to make a decision about the 'next XYZ strategy'. Over a period of time, the data becomes stable enough to give a reliable picture of current business trends and to help decide how to tackle future ones. Distance refers to the fact that one doesn't need granular, detailed data for analysis. One needs to zoom out slightly to see it in a broader context. Summarized data, rolled up at reasonable levels and sliced across relevant dimensions, will provide a more substantive, vivid depiction of overall performance and support sharper, quicker decision making.
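To make "rolling up" concrete, here is a small Python sketch that summarizes granular transaction records by month and region. The records, the choice of dimensions, and the `roll_up` helper are all hypothetical and serve only to illustrate the idea; a real warehouse would typically do this in its load process or in SQL.

```python
from collections import defaultdict
from datetime import date

# Hypothetical granular records: (transaction date, region, sale amount).
transactions = [
    (date(2009, 4, 3), "North", 120.0),
    (date(2009, 4, 17), "North", 80.0),
    (date(2009, 4, 9), "South", 200.0),
    (date(2009, 5, 2), "North", 150.0),
]


def roll_up(records):
    """Sum sale amounts by (year, month, region), discarding daily detail."""
    totals = defaultdict(float)
    for day, region, amount in records:
        totals[(day.year, day.month, region)] += amount
    return dict(totals)


summary = roll_up(transactions)
for (year, month, region), total in sorted(summary.items()):
    print(f"{year}-{month:02d} {region}: {total:.2f}")
```

Four detailed rows collapse into three summary rows here; at warehouse scale, the same step is what turns millions of records into a view a decision maker can actually read.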

It's a fallacy that ALL the answers are there in the data we capture. A school might be deemed to have an exceptional teaching model based on the performance of its students in Grade 10 tests. However, this conclusion doesn't take into consideration the fact that the school has a rigorous test-based admission process, ensuring that only the best students are admitted. No school keeps un-admitted students on its rolls. This is called survivorship bias, or the problem of silent evidence.

The other trouble is the inability to predict cataclysmic events like earthquakes, terrorist attacks, fires or, for that matter, a recession. Such is the disruptive force of these events that forecasting and predictive models just don't stand up to scrutiny, and this happens at exactly the time when you expect such tools to guide you the most. The need is to look beyond the obvious.

Massive data stores give us a sense of being all-knowing. We think we can just immerse ourselves in them and come out enlightened. The truth, as explained, is not that simple. In reality, the tools are only as good as the data they contain and the people who use them. Unfortunately, both are inherently flawed. We need to remember two things - data doesn't tell us everything, and you need to know what you are looking for.
