Monday, November 1, 2010

Data Warehousing (ETL) QA Automation

About DWH

Data Warehousing is the process of transforming data into information for decision making. It includes the techniques, methodologies, and tools for storing data in an electronic repository. Subject-oriented, time-variant, non-volatile data are collected from multiple sources and converted into homogeneous data, which can be retrieved for analysis and reporting. The major demand in the data warehousing market has been for high-speed data mining at lower hardware and implementation costs. The data warehousing solutions offered in the market are prepackaged solutions, either reference configurations or pre-assembled appliances.

The concept of data warehousing (DW) dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". The intent of a data warehouse is to build a decision support system. The DW concept attempted to address the various problems associated with the flow of data from operational systems to decision-support environments, mainly the high costs involved.

Data warehousing arises from an organization's need for reliable, consolidated, unique and integrated reporting and analysis of its data, at different levels of aggregation. Over the last few decades the data warehouse market has been growing fast, and in particular there has been increased interest in data warehouse appliances: pre-packaged server and storage hardware bundled with analytic databases. The high level of interest in appliances is causing many data warehouse vendors to change their strategies and jostle for position in what is becoming a red-hot part of the warehousing market.

The success of a DW depends largely on data quality, and this is where the quality assurance team comes in. It has been observed that there is no set methodology for testing a DW. There are tools, but they are quite expensive, and I seriously doubt whether any single tool can fulfill every testing requirement. Fundamentally, testing a DW is quite different from testing any other application.

DW testing is different

Data warehouse population happens mostly through batch runs, so the testing is different from what is done in transaction systems.

User triggered vs System triggered

Most production/source system testing covers the processing of individual transactions, which are driven by some input from the users (application forms, servicing requests). There are very few test cycles which cover system-triggered scenarios (such as billing or valuation).

In a data warehouse, most of the testing is system triggered, driven by the ETL (Extraction, Transformation and Loading) scripts, the view-refresh scripts, and so on.

Therefore data warehouse testing is typically divided into two parts: 'back-end' testing, where the source system data is compared to the end-result data in the loaded area, and 'front-end' testing, where the user checks the reports.

Batch vs online gratification

This is something which makes it a challenge to retain user interest. A transaction system provides instant, or at least overnight, gratification: users enter a transaction, which is processed either online or at most via an overnight batch. In a data warehouse, most of the action happens in the back-end, and users have to trace individual transactions through it.

Data

The test data in a transaction system is a very small sample of the overall production data. Typically, to keep matters simple, we include only as many test cases as are needed to comprehensively cover all test scenarios within a limited set of test data. A data warehouse typically has large test data, as one tries to fill up as many combinations and permutations of dimensions and facts as possible. Fact data in particular is always huge, and testing it is quite a challenge; QA queries need to be well tuned to return test results fast.

Finding bugs

Analyzing millions of rows is not easy, and when reporting a bug in a DW, QA needs to give a root-level analysis of the stage at which the issue occurred. In my experience, the issue may be in the source file, in data loss while loading to the stage tables, in a logical error during transformation, in data loss from the dimension-to-fact load, or even in the business logic itself. In many cases QA needs to look at the dev code to find the logical anomaly.

Problem Statement

A lot of brainstorming is required to develop an effective testing framework. As there is no set methodology for testing a DW, QA should keep looking to automate as many activities as possible. Tavant has developed a one-click automation framework in PL/SQL which has been running successfully for over two years without many code changes. Below are the factors that inspired us to develop the framework.

No fancy tools

DW tools cost a great deal, and for most projects it becomes impossible to get a license. Apart from the cost constraint, finding out which tool delivers the most for your testing requirements is difficult; every tool requires a learning curve or some expertise to get the most out of it, and we can't ignore the fact that each tool comes with its own limitations. In our case we chose PL/SQL.

Easy to use

The framework should be very simple to use and shouldn't require a huge learning curve. Write stored procedures with future requirement changes in mind, so that you don't need to change the code every time a requirement changes.
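
One way to achieve this, as a minimal sketch (the table, column and procedure names here are illustrative, not the actual framework's), is to drive the checks from a metadata table, so that a new requirement becomes a new row rather than new code:

    -- Hypothetical metadata table: each row is one data-quality rule,
    -- expressed as a query that returns the number of offending rows.
    CREATE TABLE qa_check_config (
        check_id   NUMBER PRIMARY KEY,
        check_name VARCHAR2(100),
        check_sql  VARCHAR2(4000),
        is_active  CHAR(1) DEFAULT 'Y'
    );

    CREATE OR REPLACE PROCEDURE run_configured_checks IS
        l_bad_rows NUMBER;
    BEGIN
        FOR c IN (SELECT check_name, check_sql
                    FROM qa_check_config
                   WHERE is_active = 'Y') LOOP
            -- Each configured query returns the count of bad rows.
            EXECUTE IMMEDIATE c.check_sql INTO l_bad_rows;
            DBMS_OUTPUT.PUT_LINE(c.check_name || ': ' ||
                CASE WHEN l_bad_rows = 0 THEN 'PASS' ELSE 'FAIL' END);
        END LOOP;
    END run_configured_checks;
    /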

Schedule test run

For each kind of job there should ultimately be a single procedure, which may call various other procedures. The procedure should take input parameters so that you have control over the amount of data you are going to scan. You can use cron to schedule it on Linux, or the Task Scheduler on Windows.
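
A minimal sketch of such an entry point and its schedule (the procedure name, connect string and timing are assumptions for illustration, not the real setup):

    -- Hypothetical master procedure: one entry point for the whole suite;
    -- the parameter limits how much data the checks scan.
    CREATE OR REPLACE PROCEDURE run_qa_suite (p_load_date IN DATE) IS
    BEGIN
        DBMS_OUTPUT.PUT_LINE('QA suite run for ' ||
                             TO_CHAR(p_load_date, 'DD-MON-YYYY'));
        run_configured_checks;  -- fan out to the checks sketched earlier
    END run_qa_suite;
    /

    # Illustrative cron entry (Linux): run the suite at 6 a.m. every day,
    # after the nightly loads, via a small SQL*Plus wrapper script.
    0 6 * * * sqlplus -s qa_user/qa_pwd@dwh @run_qa_suite.sql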

What to test

Testing at each stage is very important, so broadly it is divided into three stages (a sketch of a stage-one check follows the list):

  1. Staging – This includes source file verification, covering data availability, file format and business logic verification as per the requirement.

  2. Warehouse – This covers the data flow from dimensions to facts, foreign key references, business logic and 2-sigma record count verification.

  3. Reports – This covers report UI, security, functionality and performance verification.
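
As a hedged sketch of a staging check (stg_orders and load_control are assumed table names, not the real schema), one might verify that the day's file actually arrived and that the staged row count matches the count declared in the control file:

    -- Hypothetical staging verification: data availability plus row count.
    CREATE OR REPLACE PROCEDURE verify_staging (p_load_date IN DATE) IS
        l_stage_cnt NUMBER;
        l_expected  NUMBER;
    BEGIN
        SELECT COUNT(*) INTO l_stage_cnt
          FROM stg_orders
         WHERE load_date = p_load_date;

        SELECT expected_rows INTO l_expected
          FROM load_control
         WHERE load_date = p_load_date;

        IF l_stage_cnt = 0 OR l_stage_cnt <> l_expected THEN
            RAISE_APPLICATION_ERROR(-20001,
                'Staging verification failed for ' ||
                TO_CHAR(p_load_date, 'DD-MON-YYYY'));
        END IF;
    END verify_staging;
    /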

Figure 1

Code standardization

As multiple team members were working on the same framework, it was very important to have a coding standard so that each member could easily understand the others' code.

Tavant One Click DW Testing Solution

The Tavant solution has demonstrated excellent results, reducing testing time from 8 man-days to 1 day and delivering releases of the highest quality, with zero bugs in UAT or production over the last 18 months.

Gradually and steadily we developed a framework in PL/SQL which ensures end-to-end data verification, up to and including sending the release mail. Many brainstorming sessions went into developing this framework. Let me walk you through the journey.

Capture all queries into procedure

Initially we started by manually running queries against the database to validate the data. We used to publish all test results in Excel, which was a humongous task, and from this point came the idea of putting all queries into procedures. The key is that each query should return Pass/Fail as its test result, so that when publishing the test report you don't have to update each query's result in Excel by hand. The figure below shows how we store QA test results in the database.
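
A minimal sketch of this idea (the table and procedure are illustrative, not the actual framework's schema): a wrapper turns any "count of bad rows" query into a PASS/FAIL row in a results table:

    -- Hypothetical results table.
    CREATE TABLE qa_test_result (
        run_date  DATE,
        test_name VARCHAR2(100),
        status    VARCHAR2(4)   -- 'PASS' or 'FAIL'
    );

    -- Run a check query and record its outcome.
    CREATE OR REPLACE PROCEDURE record_check (
        p_test_name IN VARCHAR2,
        p_check_sql IN VARCHAR2) IS
        l_bad_rows NUMBER;
    BEGIN
        EXECUTE IMMEDIATE p_check_sql INTO l_bad_rows;
        INSERT INTO qa_test_result (run_date, test_name, status)
        VALUES (SYSDATE, p_test_name,
                CASE WHEN l_bad_rows = 0 THEN 'PASS' ELSE 'FAIL' END);
        COMMIT;
    END record_check;
    /

Each check then becomes a single call such as record_check('orphan_fact_rows', '<count query>'), and the report is simply a select over qa_test_result.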

Figure 2

Data Loss

Data loss can damage the entire business, so verifying the data flow from staging to dimensions, and from dimensions to facts, is of the utmost priority. Counts can legitimately drop at various stages where business logic eliminates bad data, so the verification has to account for such expected rejections.
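
One hedged sketch of such a reconciliation (the three table names are assumptions): rows leaving staging should equal rows landing in the dimension plus rows rejected by the business rules:

    -- Hypothetical reconciliation between staging, dimension and rejects.
    DECLARE
        l_stage_cnt  NUMBER;
        l_dim_cnt    NUMBER;
        l_reject_cnt NUMBER;
    BEGIN
        SELECT COUNT(*) INTO l_stage_cnt  FROM stg_customer;
        SELECT COUNT(*) INTO l_dim_cnt    FROM dim_customer;
        SELECT COUNT(*) INTO l_reject_cnt FROM rej_customer;

        IF l_stage_cnt <> l_dim_cnt + l_reject_cnt THEN
            RAISE_APPLICATION_ERROR(-20002,
                'Data loss: ' || (l_stage_cnt - l_dim_cnt - l_reject_cnt) ||
                ' rows unaccounted for between staging and dimension');
        END IF;
    END;
    /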

Figure 3

Trending

Trending on data volumes is always an exercise in approximation, but it catches abnormal loads early. Whether the volume of a load is plausible can be identified by record count sigma verification: in a DWH, data loads happen at regular intervals, so historical counts form a trend against which each new load can be compared. In our case we implemented a 2-sigma record count verification, and the screen below shows it.
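
A minimal sketch of a 2-sigma check (load_history is an assumed table holding one row count per load): flag today's load if its count falls outside the mean plus or minus two standard deviations of the last 30 loads:

    -- Hypothetical 2-sigma record count verification.
    DECLARE
        l_today NUMBER;
        l_avg   NUMBER;
        l_std   NUMBER;
    BEGIN
        SELECT row_count INTO l_today
          FROM load_history
         WHERE load_date = TRUNC(SYSDATE);

        SELECT AVG(row_count), STDDEV(row_count)
          INTO l_avg, l_std
          FROM load_history
         WHERE load_date >= TRUNC(SYSDATE) - 30
           AND load_date <  TRUNC(SYSDATE);

        IF l_today NOT BETWEEN l_avg - 2 * l_std AND l_avg + 2 * l_std THEN
            DBMS_OUTPUT.PUT_LINE('FAIL: load of ' || l_today ||
                ' rows is outside the 2-sigma band (' ||
                ROUND(l_avg - 2 * l_std) || ' .. ' ||
                ROUND(l_avg + 2 * l_std) || ')');
        END IF;
    END;
    /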

Figure 4

No manual intervention till production release

How would it sound if I told you that we can do a production release without any manual participation? When I say release, I mean right up to sending the mail to the client announcing that the production release completed successfully. Disbelief? The Tavant one-click automation framework provides this feature. For each DEV job we have developed a corresponding QA job. Once the dev load completes, it sets a flag in a table saying 'completed'; the QA jobs start from there, and if all test cases pass they set the QA flag to 'passed'. Only when the QA status is 'passed' do the next dependent loads start.
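
As a hedged sketch of this hand-off (the table, columns and procedure names are mine, not the actual framework's): a status table carries the flags, and the QA job runs only after the dev flag is set, then records its own verdict:

    -- Hypothetical job-status table driving the DEV -> QA -> next-load chain.
    CREATE TABLE job_status (
        job_name   VARCHAR2(50),
        run_date   DATE,
        dev_status VARCHAR2(10),   -- 'COMPLETED' when the dev load finishes
        qa_status  VARCHAR2(10)    -- 'PASSED' / 'FAILED', set by the QA job
    );

    -- QA gate: wait for the dev flag, run the checks, set the QA flag.
    CREATE OR REPLACE PROCEDURE qa_gate (p_job IN VARCHAR2) IS
        l_dev   VARCHAR2(10);
        l_fails NUMBER;
    BEGIN
        SELECT dev_status INTO l_dev
          FROM job_status
         WHERE job_name = p_job AND run_date = TRUNC(SYSDATE);

        IF l_dev = 'COMPLETED' THEN
            run_qa_suite(TRUNC(SYSDATE));   -- from the earlier sketch
            SELECT COUNT(*) INTO l_fails
              FROM qa_test_result
             WHERE status = 'FAIL' AND run_date >= TRUNC(SYSDATE);
            UPDATE job_status
               SET qa_status = CASE WHEN l_fails = 0 THEN 'PASSED'
                                    ELSE 'FAILED' END
             WHERE job_name = p_job AND run_date = TRUNC(SYSDATE);
            COMMIT;
        END IF;
    END qa_gate;
    /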

The screen below shows this flow.

Figure 5

Automated e-mail notification

The above process continues until all loads are done. Once the QA results are passed, an automated email is triggered to the required mail group, notifying it that the data has successfully moved to production. If at any stage the QA status is set to fail, the next load stops and a mail notification goes to the required group saying that the specific job failed. The screen below shows the DEV/QA job status updates.
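
A minimal sketch of such a notification using Oracle's UTL_MAIL package (UTL_MAIL must be installed and the SMTP_OUT_SERVER parameter configured; the addresses are placeholders):

    -- Hypothetical notification: mail the outcome of a job's QA run.
    CREATE OR REPLACE PROCEDURE notify_release (p_job    IN VARCHAR2,
                                                p_passed IN BOOLEAN) IS
    BEGIN
        UTL_MAIL.SEND(
            sender     => 'dwh-qa@example.com',
            recipients => 'release-group@example.com',
            subject    => CASE WHEN p_passed
                               THEN 'Production release successful: ' || p_job
                               ELSE 'Job failed: ' || p_job END,
            message    => 'QA status for ' || p_job || ' on ' ||
                          TO_CHAR(SYSDATE, 'DD-MON-YYYY') || ': ' ||
                          CASE WHEN p_passed THEN 'PASSED' ELSE 'FAILED' END);
    END notify_release;
    /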

Figure 6

QA test report in OBIEE

As I mentioned above, earlier we used to publish all QA test results in Excel. Now the automation framework captures all test results and stores the final pass/fail counts. We have developed a report in OBIEE which fetches the data directly from the database; the screens below show the same.
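
The report is essentially a query over the results table; as a sketch, the summary it might sit on (names reuse the illustrative schema from the earlier sketches, not the real one):

    -- Hypothetical daily pass/fail summary backing the OBIEE report.
    CREATE OR REPLACE VIEW qa_result_summary AS
    SELECT TRUNC(run_date)                                  AS run_day,
           SUM(CASE WHEN status = 'PASS' THEN 1 ELSE 0 END) AS pass_count,
           SUM(CASE WHEN status = 'FAIL' THEN 1 ELSE 0 END) AS fail_count
      FROM qa_test_result
     GROUP BY TRUNC(run_date);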

Graphical representation & Detail View

Figure 7

Capture QA Jobs performance

Because of the size of the data, queries tend to run into performance issues, and it is very important to keep monitoring the slow-running jobs which cause delays in the process. We decided to capture the execution time of every QA procedure, as shown in the screenshot below.
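
As a hedged sketch (the table and wrapper are illustrative), each procedure can be run through a timing wrapper that logs its start and end timestamps:

    -- Hypothetical timing log and wrapper.
    CREATE TABLE qa_job_timing (
        proc_name  VARCHAR2(200),
        start_time TIMESTAMP,
        end_time   TIMESTAMP
    );

    CREATE OR REPLACE PROCEDURE timed_run (p_call IN VARCHAR2) IS
        l_start TIMESTAMP := SYSTIMESTAMP;
    BEGIN
        -- Execute the named procedure call dynamically, then log it.
        EXECUTE IMMEDIATE 'BEGIN ' || p_call || '; END;';
        INSERT INTO qa_job_timing (proc_name, start_time, end_time)
        VALUES (p_call, l_start, SYSTIMESTAMP);
        COMMIT;
    END timed_run;
    /

For example, timed_run('run_qa_suite(TRUNC(SYSDATE))') logs one row per run, from which the slowest procedures can be ranked.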

Figure 8

Benefits

Quick and hassle-free testing

Although end-to-end verification once required training, users now find it quite easy, because QA needs to do only two things: first, check whether there are any test results with FAIL status; second, if there are, start analyzing them. Had all of this not been automated, the user would need to run the queries manually and verify the results.

Easy maintenance

Maintenance is the key factor in the success of any automation framework. In our case, once the code is operational you don't have to worry about the existing code or about generating the test report. Also, as I mentioned earlier, we have integrated the QA jobs with the dev jobs, so the user need not worry about when to start the QA jobs; they simply run in sequence, which makes life easier.

Speedy testing cycle

The motive behind developing the automation framework is to reduce testing time without impacting the quality of the data. Imagine having to run tons and tons of queries manually: it would take a humongous amount of time, and any manual intervention tends to be error-prone.

Team engagement
Automating and scheduling the current tasks gives the team the luxury to think out of the box, which keeps zeal high and brings in new ideas. Designing an automation framework also makes the team more technically sound and gives everyone the opportunity to write higher-quality code.

Customer gratification
Undoubtedly we all live to satisfy customer expectations, and it gives us immense pleasure to see zero bugs in UAT or PROD. From my personal experience, such high-standard releases can be achieved with only one thing in mind: "Whatever I am testing now, I will not test again. There has to be a way for the test suite to run automatically, so that I just have to look at the final test result."

Buzz in DWH

Data warehousing, like any technology, has a history of innovations that did not receive market acceptance.

A 2009 Gartner Group paper predicted the following developments in the business intelligence/data warehousing market:

Because of lack of information, processes, and tools, through 2012, more than 35 percent of the top 5,000 global companies will regularly fail to make insightful decisions about significant changes in their business and markets.

By 2012, business units will control at least 40 percent of the total budget for business intelligence.

By 2010, 20 percent of organizations will have an industry-specific analytic application delivered via software as a service as a standard component of their business intelligence portfolio.

In 2009, collaborative decision making will emerge as a new product category that combines social software with business intelligence platform capabilities.

By 2012, one-third of analytic applications applied to business processes will be delivered through coarse-grained application mashups.

As per reports on the ARRA stimulus, business models, and trends in K-12 education, the stimulus funds will probably spur unusually large purchases of SIS (Student Information Systems) and DWH (Data Warehouses) next summer. Why so?

ARRA is federal funding aimed at stimulating the U.S. economy; President Barack Obama signed the American Recovery and Reinvestment Act (ARRA) into law on Feb. 17, 2009.

1. There are unprecedented demands for accountability in ARRA, and the only way to meet them is through solid data and a good reporting system.

2. According to the regulations, stimulus money should not be spent on things that face a "funding cliff." In other words, you can't go out and hire a bunch of people or sign up for a 10-year subscription. In the report we showed that most districts still prefer to buy their data systems with an upfront license; the subset that prefer an annual subscription model may be swayed by ARRA to buy a license instead.

3. Most of the stimulus funds have not been spent yet; the big bolus of spending will come in the summer of 2010 as part of the annual purchase cycle.

Given these interlocking issues, we believe a much larger than normal cohort of districts will evaluate and purchase an SIS in the next 12 months. Districts that were already likely to purchase will accelerate the buy to fall within the window of ARRA spending. Districts with older systems will need to upgrade in order to fully report on what they are doing. Districts with newer systems may add modules and functionality they don't have yet (notably data warehouses).

Boom times lie ahead for SIS and DWH providers. Hold on to your hats.

Reference

Most of the content is from my personal experience; however, the website below was referred to for definitions and terminology.

www.wikipedia.org

About the author

Amit Kurchania is a QA Lead with Tavant Technologies, Bangalore, India. He post-graduated in computer science from Rajiv Gandhi Proudyogiki Vishwavidyalaya (RGPV), Bhopal, MP, in 2003. With 7+ years of experience, he has played multiple roles in the software field. He started his career as an automation test engineer using Selenium, later worked in the logistics domain, and currently handles a data warehouse project, focusing mainly on automation using PL/SQL; along the way he has also automated tests using VBScript and Excel.

3 comments:

Bablu said...

Hi amit thanks a lot for this valuable post, can u also update the images which u have used to implement so that i can have a perfect idea of how this stuff works. thanks a millon.

Amit Kurchania said...

Hey Bablu .. I am very sorry as I am not very regular with my blog. Will try to upload images.

Anonymous said...

Amit, This is a great article. Can you please upload the images..

Thanks,