Flume Sqoop Pig HBase Unit Testing


Testing Flume

Scope

  • Testing will cover the functional testing of the data transfer from source machines (External Systems) to HDFS/HBase.

  • Testing of individual Flume components, such as different source types, channel types and sink types, will be included.

  • Testing of custom Flume agents and embedded Flume agents used in other automated jobs/tools.

Limitations & Exclusions

  • Installation of Flume (Infrastructure) may not need to be tested.

  • Record-level validation of the transfer from the source input file to the output file (for data corruption or data loss) may not be performed.

  • No automation tools are available at this point of time to compare input records with output records and check whether there is any data loss.

Testing Approach

  • As there are no automation tools available for Flume testing, we need to perform manual testing of Flume data ingestion.

    • Manually verify the number of records/events in the sample input files.

    • Verify the statistics on the number of events that were attempted, successful and failed from the external data generator to the source component in the Flume agent. These statistics can be found in the Flume log files.

    • Manually verify the same statistics at the Source ==> Channel phase as well as at the Channel ==> Sink (target/destination) phase.

    • Make sure that the total number of input records from the data generator = the total number of records that reached the target + any records filtered out based on some criteria. This ensures that all of the input records are accounted for even if some of them are dropped by filter criteria; a count-reconciliation sketch follows this list.
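
The count reconciliation above can be scripted once the target files are plain text. Below is a minimal Java sketch, assuming newline-delimited events, a hypothetical local input file and a hypothetical HDFS target directory; it uses the Hadoop FileSystem client to total the records that reached the sink, and the filtered count (if any) would be taken from the agent/interceptor logs.

    // Minimal count-reconciliation sketch (newline-delimited events; paths are hypothetical).
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FlumeCountCheck {
        public static void main(String[] args) throws Exception {
            // Count events in the local sample input file fed to the Flume source.
            long inputCount = Files.lines(Paths.get("/data/input/sample_events.log")).count();

            // Count events landed by the HDFS sink under the target directory.
            FileSystem fs = FileSystem.get(new Configuration());
            long targetCount = 0;
            for (FileStatus status : fs.listStatus(new Path("/flume/target/"))) {
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(status.getPath())))) {
                    targetCount += reader.lines().count();
                }
            }

            long filteredCount = 0; // taken from agent/interceptor logs, if any filtering is configured
            System.out.println("input=" + inputCount + " target=" + targetCount + " filtered=" + filteredCount);
            if (inputCount != targetCount + filteredCount) {
                throw new AssertionError("Event counts do not reconcile");
            }
        }
    }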

Test Cases

  • Test the agent with a sample input file/data and verify the expected events/data on the target location.

  • Feed the Flume agent with empty files or already processed files and test the agent's behaviour.

  • Test the agent with source files for which the copy is still in progress. Check whether the agent handles only completely copied files or whether it picks up files that are currently being copied. (In general, agents should not be allowed to process partially copied files.)

  • In case of an HDFS target, check whether the HDFS destination directory structure matches the format specifier properties defined in the sink configuration.

  • Validate the output file format, the compression technique applied and the serialization of the events against the properties specified for the Flume agent's sink.

  • Stop the agent forcibly in the middle of data ingestion and restart it to find out the impact of failover scenarios. Either the partially copied target files need to be deleted before the agent is restarted, or the data needs to be deduplicated on the target.

  • Feed the Flume agent events of a type different from what the source expects, and check whether these records throw EventDeliveryExceptions or are filtered out gracefully, rather than blocking the remaining events from being copied. A sketch for injecting such test events follows this list.
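
For the last test case, a simple way to inject crafted events into a running agent is the Flume client SDK pointed at an Avro source. The sketch below is only illustrative; the agent hostname, port and sample payloads are assumptions, and it presumes the agent has an Avro source listening on that port.

    // Minimal sketch for injecting test events into a running Flume agent with an Avro source.
    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeTestEventSender {
        public static void main(String[] args) {
            RpcClient client = RpcClientFactory.getDefaultInstance("agent-host", 41414); // hypothetical host/port
            try {
                // A well-formed event followed by a deliberately malformed payload,
                // to observe whether the agent filters it or throws an EventDeliveryException.
                String[] payloads = {"2016-01-01,order,100", "not-a-valid-record"};
                for (String payload : payloads) {
                    Event event = EventBuilder.withBody(payload, StandardCharsets.UTF_8);
                    client.append(event);   // throws EventDeliveryException on failure
                }
            } catch (EventDeliveryException e) {
                System.err.println("Event delivery failed: " + e.getMessage());
            } finally {
                client.close();
            }
        }
    }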

Testing SQOOP

Scope

  • Testing will cover the functional testing of the data transfer from RDBMS (Oracle, MySQL, PostgreSQL, Teradata, DB2, MS SQL Server, etc.) to HDFS/HBase/Hive/Accumulo/HCatalog.

  • Exporting of data from the Hadoop ecosystem to RDBMS is also considered.

  • Only structured data will be transferred between RDBMS and Hadoop.

Limitations & Exclusions

  • Infrastructure Setup (Sqoop Installation) Testing might not be needed.

  • Copying of unstructured or semi-structured data from RDBMS to Hadoop, or from Hadoop to RDBMS, can’t be performed.

Testing Approach

  • As there are no automation tools available for Sqoop testing, we need to perform manual testing of Sqoop data transfers.

  • For the import process:

    • Submit SQL queries against the RDBMS tables and find the input record count.

    • Trigger the Sqoop import command and observe the record count in the Sqoop log messages; a count-comparison sketch in Java follows this list.

    • Impose the same schema structure from the RDBMS table on the HDFS data and try to view the results. There should not be any schema mismatches due to incorrect data types in HDFS or Hive.

    • Retrieve some records randomly from the RDBMS table and validate each column against the data copied to HDFS/Hive/HBase.

  • For the export process:

    • Manually get the record count from the HDFS files/Hive/HBase tables.

    • Trigger the Sqoop export command and validate the number of records exported using the Sqoop log messages.

    • Retrieve some records randomly from the HDFS file/Hive/HBase table and validate each column against the data copied to the RDBMS table.
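
The import-side count check can be automated with a small Java program that compares the RDBMS row count with the number of records Sqoop wrote to HDFS. This is a minimal sketch under assumed connection details, table name and target directory; for Hive or HBase targets the target-side count would come from HiveQL or the HBase COUNT command instead.

    // Minimal import-validation sketch: RDBMS row count vs. records Sqoop wrote to HDFS.
    // The JDBC URL, credentials, table name and target directory are hypothetical.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SqoopImportCountCheck {
        public static void main(String[] args) throws Exception {
            // Source-side count from the RDBMS table.
            long sourceCount;
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://dbhost:3306/sales", "user", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
                rs.next();
                sourceCount = rs.getLong(1);
            }

            // Target-side count from the text files Sqoop produced on HDFS.
            FileSystem fs = FileSystem.get(new Configuration());
            long targetCount = 0;
            for (FileStatus status : fs.listStatus(new Path("/user/etl/orders"))) {
                if (status.getPath().getName().startsWith("part-")) {
                    try (BufferedReader reader = new BufferedReader(
                            new InputStreamReader(fs.open(status.getPath())))) {
                        targetCount += reader.lines().count();
                    }
                }
            }

            System.out.println("source=" + sourceCount + " target=" + targetCount);
            if (sourceCount != targetCount) {
                throw new AssertionError("Import record counts do not match");
            }
        }
    }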

Test Cases

  • Trigger the given Sqoop import scripts and validate that they run without any exceptions when there is no primary key in the input table.

  • Validate the parallel copying process by multiple mappers when there is a primary key in the input RDBMS table.

  • Validate the parallel copying process by multiple mappers when there is no primary key in the input RDBMS table.

  • Validate the output file format created on HDFS and the compression techniques applied, if any were used.

  • Validate the delimiters for fields in HDFS files and Hive tables.

  • Try running the same Sqoop scripts with various numbers of parallel map tasks (1, 10, 20, etc.). The output should contain the same number of records; see the sketch after this list.

  • In case of incremental imports, validate the records from the RDBMS table with a WHERE clause and verify that the same records appear in the new files in HDFS or Hive tables.

  • Validate the null value columns from RDBMS tables while importing into HDFS or Hive tables.

  • Verify that BINARY, VARBINARY, BLOB and CLOB data types from RDBMS tables are handled properly in HDFS files.
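
The varying-mapper test case can be driven programmatically through Sqoop's runTool entry point, so that the same import runs with different -m values and the resulting counts can be compared. The connection details, table, split column and target directories below are hypothetical, and the per-directory counting is assumed to reuse logic like the earlier HDFS count sketch.

    // Sketch for the parallel-mapper test case: run the same import with different
    // mapper counts and compare the record counts produced under each target directory.
    import org.apache.sqoop.Sqoop;

    public class SqoopParallelismCheck {
        public static void main(String[] args) {
            int[] mapperCounts = {1, 10, 20};
            for (int mappers : mapperCounts) {
                String targetDir = "/user/etl/orders_m" + mappers;   // hypothetical output location
                String[] sqoopArgs = {
                    "import",
                    "--connect", "jdbc:mysql://dbhost:3306/sales",
                    "--username", "user", "--password", "password",
                    "--table", "orders",
                    "--split-by", "order_id",
                    "--num-mappers", String.valueOf(mappers),
                    "--target-dir", targetDir
                };
                int exitCode = Sqoop.runTool(sqoopArgs);
                System.out.println("mappers=" + mappers + " exitCode=" + exitCode);
                // Each run's record count under targetDir should be identical;
                // compare them with the HDFS counting logic shown earlier.
            }
        }
    }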

Testing PIG

Scope

  • Testing will cover the functional testing of Pig Latin scripts.

  • Processing of both structured and unstructured data sets is included in the scope.

  • Testing of Pig operators, user-defined functions and custom load/store functions is also included.

  • Execution of Pig scripts in local mode and MapReduce mode will be included in the scope.

Limitations & Exclusions

  • Pig Installation & configuration testing might not be needed.

Testing Approach

  • Pig unit testing can be done in two ways:

  1. Running Pig Latin statements on sample input data in the Grunt shell

  2. Using the PigUnit framework to test Pig scripts

  • We can perform Pig unit testing manually by limiting the number of input records and running Pig Latin statements in the Grunt shell, using debugging operators like DUMP, EXPLAIN and ILLUSTRATE.

  • With PigUnit, we can perform unit testing, regression testing and rapid prototyping:

    • Prepare the sample input and expected output for the Pig scripts.

    • Pass these as arguments to a PigTest instance in PigUnit and run the Pig scripts inside the PigUnit framework.

    • After the PigUnit run completes, the framework reports whether the expected results match the actual output.

    • This way we can test any Pig script from Eclipse with PigUnit in local mode; a minimal PigUnit test sketch follows this list.
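
A minimal PigUnit test, as mentioned above, looks like the sketch below. The script name, parameter, alias names and sample data are assumptions; PigUnit substitutes the given input for the LOAD alias and asserts on the tuples produced by the output alias, running the script in local mode by default.

    // Minimal PigUnit sketch for a hypothetical word-count style script (wordcount.pig)
    // with a $limit parameter, an input alias 'lines' and an output alias 'top_words'.
    import org.apache.pig.pigunit.PigTest;
    import org.junit.Test;

    public class WordCountPigTest {
        @Test
        public void testTopWords() throws Exception {
            // Parameter substituted for $limit inside the hypothetical script.
            String[] params = { "limit=2" };
            PigTest test = new PigTest("wordcount.pig", params);

            // Sample input substituted for the alias that LOADs the raw data.
            String[] input = { "hadoop pig", "pig hive", "pig hadoop" };

            // Expected tuples for the alias holding the final result.
            String[] expected = { "(pig,3)", "(hadoop,2)" };

            // Override the 'lines' alias with the input and check the 'top_words' output.
            test.assertOutput("lines", input, "top_words", expected);
        }
    }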

Test Cases

  • Test the script by providing the correct schema for the input file tuples.

  • Validate the scenarios where the actual data does not conform to the given schema for some tuples.

  • Examine the Pig script's data flow with debugging operators like EXPLAIN and ILLUSTRATE.

  • Validate the null value fields in the input file and output file; a sketch for inspecting such tuples follows this list.
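
For the schema-mismatch and null-field cases, a PigServer-based program such as the sketch below can load a small sample in local mode and flag tuples whose fields came through as null. The input file name and schema are assumptions.

    // Sketch for inspecting tuples with missing or null fields via PigServer in local mode.
    import java.util.Iterator;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class NullFieldInspection {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Load a comma-delimited file; rows with fewer fields than the schema
            // or with values that do not match the declared types come through as nulls.
            pig.registerQuery("users = LOAD 'sample_users.csv' USING PigStorage(',') "
                    + "AS (id:int, name:chararray, age:int);");

            Iterator<Tuple> it = pig.openIterator("users");
            while (it.hasNext()) {
                Tuple t = it.next();
                // Flag tuples where the age field is null or failed the int conversion.
                if (t.get(2) == null) {
                    System.out.println("Null/invalid age in tuple: " + t);
                }
            }
            pig.shutdown();
        }
    }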

Testing HBase

Scope

  • Testing will cover the functional testing of HBase tables and the HBase Java native API.

Limitations & Exclusions

  • HBase cluster setup and configuration testing is not needed.

Testing Approach

  • As there are currently no automation tools available for HBase testing, HBase unit testing needs to be done manually.

  • We can use the DESCRIBE command to validate the schema of the HBase tables.

  • Validate the column families, versions and timestamps of the cells.

  • Validation of bulk data loading into HBase can be done by using the COUNT command to retrieve the record counts.

  • We can validate the records by performing ad-hoc queries and scans on the HBase tables and validating the columns/fields against the input data.

  • Run Java native API programs/MapReduce jobs to access/update/delete the sample HBase tables; a minimal Java API sketch follows this list.
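
A minimal sketch of such a Java native API check is shown below: it opens a connection, scans an assumed table and spot-checks one column against the expected source data. The table name, column family and qualifier are assumptions.

    // Minimal HBase Java API sketch: connect, scan a table and spot-check a column.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseTableCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("customer"))) {

                long rowCount = 0;
                try (ResultScanner scanner = table.getScanner(new Scan())) {
                    for (Result result : scanner) {
                        rowCount++;
                        byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                        // Spot-check a column value against the source data.
                        if (name == null) {
                            System.out.println("Missing info:name for row "
                                    + Bytes.toString(result.getRow()));
                        }
                    }
                }
                System.out.println("Rows scanned: " + rowCount);
            }
        }
    }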

Test Cases

  • Test the ZooKeeper quorum connections by attempting to open a session.

  • Apply appropriate scan filters and validate the records in the results; see the filter sketch after this list.

  • Use commands like COUNT and DESCRIBE to perform the validation of HBase tables.

  • Run MapReduce programs for bulk loading and accessing HBase table records and validate the columns.
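
The quorum and scan-filter test cases can be combined in one small client program. The sketch below assumes the HBase 2.x client API (older clients use CompareFilter.CompareOp instead of CompareOperator); the ZooKeeper hosts, table, column family/qualifier and filter value are assumptions.

    // Sketch: connect through an explicit ZooKeeper quorum and apply a SingleColumnValueFilter.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseFilteredScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Opening the connection exercises the ZooKeeper quorum settings below.
            conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");
            conf.set("hbase.zookeeper.property.clientPort", "2181");

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("customer"))) {

                // Keep only rows whose info:city column equals 'Hyderabad'.
                Scan scan = new Scan();
                scan.setFilter(new SingleColumnValueFilter(
                        Bytes.toBytes("info"), Bytes.toBytes("city"),
                        CompareOperator.EQUAL, Bytes.toBytes("Hyderabad")));

                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result result : scanner) {
                        System.out.println(Bytes.toString(result.getRow()) + " => "
                                + Bytes.toString(result.getValue(
                                        Bytes.toBytes("info"), Bytes.toBytes("city"))));
                    }
                }
            }
        }
    }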



