The name says it all: “Big Data.” The automation of a myriad of information-handling services, especially in the last decade, has facilitated the collection of truly massive amounts of data. This data holds enormous value but, by its very volume, it requires new concepts and applications to extract that value. Principal among these is the ability to quickly and accurately create, store, retrieve, and analyze these immense amounts of widely varying information.
Big data tools are used as services to applications that analyze and format the data they manipulate. By their very nature, they require indexing with many layers of abstraction from the source data to process results in useful timescales. They support a growing array of business applications that include:
- Data Mining – distilling big data down into arrays of connections to extract trends and relationships
- Subtle Exception Surveillance – examining data trends for emerging changes in direction that, while currently minor, keep moving the same way and may grow into indications of major issues
- AI Training – using big data collections to train AI implementations to develop useful awareness and assessment algorithms
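As a small illustration of the Subtle Exception Surveillance idea, the sketch below flags a series whose most recent changes all move in the same direction, however small each step is. The function name `is_unidirectional` and the `min_steps` threshold are hypothetical choices made for this example; a production system would more likely use a statistical trend test.

```python
from typing import Sequence


def is_unidirectional(values: Sequence[float], min_steps: int = 5) -> bool:
    """Return True if the last `min_steps` changes all share one direction.

    Hypothetical helper for illustration only: it ignores noise and
    step size, and simply checks the sign of consecutive differences.
    """
    if len(values) < min_steps + 1:
        return False  # not enough history to judge a trend
    recent = values[-(min_steps + 1):]
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(d > 0 for d in deltas) or all(d < 0 for d in deltas)
```

A metric creeping from 1.00 to 1.05 in five small steps would be flagged, while one that oscillates around a fixed value would not.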
Obviously, testing must look at all these same indexes and abstractions as well as verifying all the reductive analysis functions that turn the raw data into useful information. The scalability of the system components must be verified across their operational range without processing the full content load that accompanies that scale of operation. At the same time, operation of the system must support its intended access model, be it real time, interactive or batch mode.
Approaches to Big Data Testing
The upfront issue is test target size. Big data test facilities require giant quantities of test data and long facility setup times, both of which are anathema to the fast-change, fast-release Agile development method. Ideally, the plan would be to create test beds that replicate the behaviors of each layer in the operational system but do so with vastly reduced setup times and storage overhead.
Specialized test tools to create test databases with automated population modules are a start, but more is required. Data error trapping and business rule implementation must be continuously verified along with repeatability and performance. This mandates the use of creative exploratory test approaches supported by reusable automated test tools that magnify the test engineer’s insights and use them to exercise the entire system. As if all this were not enough, the system’s implementation of legally required data security regulations must be verified as well.
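One way to make business rule verification reusable is to express each rule as a named check and run every record through the full set. The sketch below is a minimal illustration, not a specific tool's API: the `Rule` class, the `validate` function, and the three sample rules (field names `order_id`, `amount`, `currency`) are all assumptions invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Rule:
    """A named business rule; `check` returns True when a record passes."""
    name: str
    check: Callable[[dict], bool]


# Hypothetical rules for an illustrative "orders" data set.
RULES = [
    Rule("order_id present", lambda r: bool(r.get("order_id"))),
    Rule(
        "amount is non-negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
    Rule("currency is known", lambda r: r.get("currency") in {"USD", "EUR", "GBP"}),
]


def validate(records: Iterable[dict], rules: List[Rule] = RULES) -> Dict[str, List[int]]:
    """Map each violated rule name to the indexes of the records that fail it."""
    failures: Dict[str, List[int]] = {}
    for index, record in enumerate(records):
        for rule in rules:
            if not rule.check(record):
                failures.setdefault(rule.name, []).append(index)
    return failures
```

Because the rules are plain data, the same list can be run against a small static sample during development and against a larger import during regression testing.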
Some approaches to big data testing:
- Create a test installation with a continuous data restoration facility that imports new data and archives out old data at a continuous rate. This supports a realistic data lifecycle without having to create a new archive for each iteration.
- Use any existing quick import or replication methods to populate an installation with bulk data. Lacking such a system feature, consider making a test tool specifically for this purpose.
- Use read-only, static data samples for testing query operation and for performance benchmarking.
- Use the backup/restore capability of the system to reset an existing installation to a known state to support operational and regression tests.
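The reset-to-a-known-state approach in the last item can be sketched with Python's built-in `sqlite3` module, whose `Connection.backup` method copies one database into another. SQLite is only a stand-in here for whatever backup/restore facility the real system provides, and the `events` table and helper names are hypothetical.

```python
import sqlite3


def make_baseline() -> sqlite3.Connection:
    """Build a small 'known state' database (stand-in for a real snapshot)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
    conn.executemany(
        "INSERT INTO events (kind) VALUES (?)",
        [("login",), ("purchase",), ("logout",)],
    )
    conn.commit()
    return conn


def restore(baseline: sqlite3.Connection) -> sqlite3.Connection:
    """Return a fresh working copy; the baseline itself is never modified."""
    working = sqlite3.connect(":memory:")
    baseline.backup(working)  # copies the baseline's contents into `working`
    return working


def count_events(conn: sqlite3.Connection) -> int:
    """Count rows in the illustrative events table."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Each test takes its own copy from `restore`, mutates it freely, and discards it, so the next test starts from the same state without rebuilding the data.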
Big Data Testing Challenges
As with testing any other technology, big data has its verification pitfalls. Some of these are described above but particular attention should be paid to test image virtualization, test automation, and the size of the test data sets.
1. Test Image Virtualization
Creation of virtual machines offers a path to rapid facility setup but beware of latency issues. Big data services are useful only when they are prompt and responsive and VMs can mask performance problems.
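One way to keep virtualization from hiding latency problems is to record response times in each environment and compare a high percentile against an agreed budget, rather than relying on averages. The sketch below assumes a hypothetical `operation` callable that wraps the query under test; the budget value is likewise an assumption to be set per system.

```python
import statistics
import time
from typing import Callable, List


def measure_latencies(operation: Callable[[], object], runs: int = 50) -> List[float]:
    """Time `operation` repeatedly and return the elapsed seconds per run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return samples


def p95(samples: List[float]) -> float:
    """95th percentile: the last of the 19 cut points that split data into 20 groups."""
    return statistics.quantiles(samples, n=20)[-1]


def within_budget(samples: List[float], budget_seconds: float) -> bool:
    """True when the 95th percentile latency does not exceed the budget."""
    return p95(samples) <= budget_seconds
```

Running the same measurement on a VM and on hardware that resembles production gives two sets of samples to compare, which makes any gap introduced by virtualization visible.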
2. Test Automation
Like all other forms of test automation, big data test scripts require programming expertise. The skills needed to create and maintain automated tests are the same as those used to create revenue-producing system code, and keeping those skills in QA can be a genuine challenge.
3. Test Data Set Size
Larger test data sets are better test data sets. The problem is creating and resetting them during the verification process. Automation can help, but strong consideration should be given to installing test data generation capabilities into the product itself. The requirement to test with operational data quantities will not go away, and making data generation part of the system ensures the generator receives the same updates that are released with the system.
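A built-in generator is most useful when it is deterministic, so that a data set of any size can be recreated exactly instead of being stored. The sketch below is a minimal illustration using a seeded `random.Random`; the event fields (`id`, `user`, `kind`, `amount`) and value ranges are hypothetical and would be replaced by the product's real schema.

```python
import random
from typing import Iterator


def generate_events(count: int, seed: int = 42) -> Iterator[dict]:
    """Yield `count` synthetic events; the same seed always yields the same data."""
    rng = random.Random(seed)  # private generator, unaffected by global random state
    kinds = ["view", "click", "purchase"]
    for event_id in range(count):
        yield {
            "id": event_id,
            "user": rng.randrange(1_000),
            "kind": rng.choice(kinds),
            "amount": round(rng.uniform(0, 500), 2),
        }
```

Because the function is a generator, records can be streamed into a bulk import without holding the whole data set in memory, and a failing test can be reproduced by reusing its seed and count.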
Have a Big Data Testing Plan
Big data is only as good as the quality of the data it contains. A big data testing plan is critical for identifying defects and inconsistencies in the data early on. Doing so will help reduce costs and realize business goals.