Apache Pig Tutorial: The COGROUP Operator

The COGROUP operator works in the same way as the GROUP operator. The two are functionally identical; by convention, GROUP is used in statements involving one relation, while COGROUP is used in statements involving two or more relations.

Grouping Two Relations Using COGROUP

Assume that we have two files, student_details.txt and employee_details.txt, in the HDFS directory /pig_data/, as shown below.

student_details.txt

001,rajiv,reddy,21,9848022337,hyderabad
002,siddarth,battacharya,22,9848022338,kolkata
003,rajesh,khanna,22,9848022339,delhi
004,preethi,agarwal,21,9848022330,pune
005,trupthi,mohanthy,23,9848022336,bhuwaneshwar
006,archana,mishra,23,9848022335,chennai
007,komal,nayak,24,9848022334,trivendram
008,bharathi,nambiayar,24,9848022333,chennai

employee_details.txt

001,robin,22,newyork
002,bob,23,kolkata
003,maya,23,tokyo
004,sara,25,london
005,david,23,bhuwaneshwar
006,maggy,22,chennai

We have loaded these files into Pig under the relation names student_details and employee_details respectively, as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

grunt> employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, city:chararray);

Now, let us group the records (tuples) of the relations student_details and employee_details by the key age, as shown below.

grunt> cogroup_data = COGROUP student_details BY age, employee_details BY age;
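
Before looking at the data, it can help to inspect the schema of the cogrouped relation. The following is a minimal sketch, assuming the cogroup_data relation defined above.

```pig
-- Show the schema of the cogrouped relation.
grunt> DESCRIBE cogroup_data;
```

The schema should list three fields: group, which holds the grouping key (the age, an int here), followed by one bag per input relation, named student_details and employee_details after their aliases.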

Verification

Verify the relation cogroup_data using the DUMP operator, as shown below.

grunt> DUMP cogroup_data;

Output

It will produce the following output, displaying the contents of the relation cogroup_data.

(21,{(4,preethi,agarwal,21,9848022330,pune),(1,rajiv,reddy,21,9848022337,hyderabad)},
   {})
(22,{(3,rajesh,khanna,22,9848022339,delhi),(2,siddarth,battacharya,22,9848022338,kolkata)},
   {(6,maggy,22,chennai),(1,robin,22,newyork)})
(23,{(6,archana,mishra,23,9848022335,chennai),(5,trupthi,mohanthy,23,9848022336,bhuwaneshwar)},
   {(5,david,23,bhuwaneshwar),(3,maya,23,tokyo),(2,bob,23,kolkata)})
(24,{(8,bharathi,nambiayar,24,9848022333,chennai),(7,komal,nayak,24,9848022334,trivendram)},
   {})
(25,{},
   {(4,sara,25,london)})

The COGROUP operator groups the tuples from each relation by age, so each group in the result corresponds to one age value.

For example, the first tuple of the result is grouped by age 21, and it contains two bags:

  • the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and

  • the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.

If a relation has no tuples with the age value 21, its bag is empty, as is the case for employee_details here.
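
Because each input relation arrives as its own bag, per-group statistics can be computed directly on the cogrouped relation. The following is a minimal sketch, assuming the cogroup_data relation defined above; the aliases age_counts and shared_ages and the field names num_students and num_employees are illustrative and do not appear in the original example.

```pig
-- Count the students and employees in each age group.
grunt> age_counts = FOREACH cogroup_data GENERATE group AS age,
   COUNT(student_details) AS num_students,
   COUNT(employee_details) AS num_employees;

-- Keep only the ages that occur in both relations (neither bag is empty).
grunt> shared_ages = FILTER cogroup_data BY (NOT IsEmpty(student_details)) AND (NOT IsEmpty(employee_details));

grunt> DUMP age_counts;
grunt> DUMP shared_ages;
```

With the sample data, shared_ages should retain only the groups for ages 22 and 23, since those are the only ages that have tuples in both bags.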