In the previous chapter, we learnt how to load data into Apache Pig. This chapter explains how to write a loaded relation back to the file system using the STORE operator.
Syntax
Given below is the syntax of the STORE statement.
STORE relation_name INTO 'required_directory_path' [USING function];
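As a brief sketch of the two forms this syntax allows: when the USING clause is omitted, Pig falls back to PigStorage with a tab as the field delimiter. The relation name student and both output paths below are placeholders chosen for illustration only.

```pig
-- Assumes a relation named student has already been loaded.
-- No USING clause: Pig applies PigStorage with a tab delimiter by default.
STORE student INTO '/tmp/pig_output_tab';

-- Explicit storage function: fields are separated by commas instead.
STORE student INTO '/tmp/pig_output_csv' USING PigStorage(',');
```

Note that the function name PigStorage is case-sensitive, whereas keywords such as STORE, INTO and USING are not.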
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,rajiv,reddy,9848022337,hyderabad
002,siddarth,battacharya,9848022338,kolkata
003,rajesh,khanna,9848022339,delhi
004,preethi,agarwal,9848022330,pune
005,trupthi,mohanthy,9848022336,bhuwaneshwar
006,archana,mishra,9848022335,chennai
We have read it into a relation named student using the LOAD operator as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Now, let us store the relation in the HDFS directory “/pig_output/” as shown below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_output/' USING PigStorage(',');
Output
After executing the STORE statement, you will get the following output. A directory is created with the specified name and the data is stored in it.
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion  PigVersion  UserId  StartedAt           FinishedAt           Features
2.6.0          0.15.0      hadoop  2015-10-0 13:03:03  2015-10-05 13:05:05  UNKNOWN

Success!

Job Stats (time in seconds):
JobId         Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime
job_14459_06  1     0        n/a         n/a         n/a         n/a

MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReduceTime  Alias    Feature
0              0              0              0                 student  MAP_ONLY

Output folder
hdfs://localhost:9000/pig_output/

Input(s):
Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"

Output(s):
Successfully stored 0 records in: "hdfs://localhost:9000/pig_output"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1443519499159_0006

2015-10-05 13:06:06,192 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
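Keep in mind that STORE does not overwrite existing data: if the target directory is already present, Pig rejects the job while validating the output location. A minimal sketch of a rerunnable version, assuming it is acceptable to discard the earlier output, uses the Grunt rmf command, which removes a path and does not complain when the path is absent.

```pig
-- Remove any previous output first; rmf does not fail if the path is missing.
rmf hdfs://localhost:9000/pig_output

-- Write the relation again to the directory that no longer exists.
STORE student INTO 'hdfs://localhost:9000/pig_output/' USING PigStorage(',');
```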
Verification
You can verify the stored data as shown below.
Step 1
First, list the files in the directory named pig_output using the ls command as shown below.
$ hdfs dfs -ls 'hdfs://localhost:9000/pig_output/'
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2015-10-05 13:03 hdfs://localhost:9000/pig_output/_SUCCESS
-rw-r--r--   1 hadoop supergroup        224 2015-10-05 13:03 hdfs://localhost:9000/pig_output/part-m-00000
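You can observe that two files were created after executing the STORE statement. The empty _SUCCESS file only marks that the job completed, while part-m-00000 holds the stored records.
Step 2
Using the cat command, list the contents of the file named part-m-00000 as shown below.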
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_output/part-m-00000'
1,rajiv,reddy,9848022337,hyderabad
2,siddarth,battacharya,9848022338,kolkata
3,rajesh,khanna,9848022339,delhi
4,preethi,agarwal,9848022330,pune
5,trupthi,mohanthy,9848022336,bhuwaneshwar
6,archana,mishra,9848022335,chennai
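Notice that the id values appear as 1 to 6 rather than 001 to 006. The id field was declared as int in the schema, so the leading zeros were dropped when the values were parsed. As a further check, you could read the stored directory back into Pig. The sketch below assumes the same schema as before, and the alias student_copy is a name introduced here for illustration.

```pig
-- Load the part files from the output directory; marker files such as _SUCCESS are skipped.
student_copy = LOAD 'hdfs://localhost:9000/pig_output/' USING PigStorage(',')
   AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

-- Print the tuples to the console to confirm the round trip.
DUMP student_copy;
```

If the leading zeros matter, declaring id as chararray in the original LOAD statement would keep them as they are.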