Joining Two HDFS Files with Spark SQL
Published: 2019-06-12


order_created.txt (tab-separated columns: order number, order creation time)

10703007267488  2014-05-01 06:01:12.334+01
10101043505096  2014-05-01 07:28:12.342+01
10103043509747  2014-05-01 07:50:12.33+01
10103043501575  2014-05-01 09:27:12.33+01
10104043514061  2014-05-01 09:03:12.324+01

 

order_picked.txt (tab-separated columns: order number, order pickup time)

10703007267488  2014-05-01 07:02:12.334+01
10101043505096  2014-05-01 08:29:12.342+01
10103043509747  2014-05-01 10:55:12.33+01

 

Upload both files to HDFS:

hadoop fs -put order_created.txt /data/order_created.txt
hadoop fs -put order_picked.txt /data/order_picked.txt

 

Join the two files with Spark SQL (run in spark-shell, where sc is already defined):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

case class OrderCreated(order_no: String, create_date: String)
case class OrderPicked(order_no: String, picked_date: String)

// Split each line on the tab character and map it to a case class
val order_created = sc.textFile("/data/order_created.txt").map(_.split("\t")).map(d => OrderCreated(d(0), d(1)))
val order_picked = sc.textFile("/data/order_picked.txt").map(_.split("\t")).map(d => OrderPicked(d(0), d(1)))

order_created.registerTempTable("t_order_created")
order_picked.registerTempTable("t_order_picked")

// Manually set the number of Spark SQL shuffle tasks
hiveContext.setConf("spark.sql.shuffle.partitions", "10")

hiveContext.sql("select a.order_no, a.create_date, b.picked_date from t_order_created a join t_order_picked b on a.order_no = b.order_no").collect.foreach(println)
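The snippet above uses HiveContext and registerTempTable, both of which are deprecated since Spark 2.0. As a rough sketch only, assuming Spark 2.0 or later and the same HDFS paths, the same join could be written with SparkSession and createOrReplaceTempView. The object name OrderJoinSketch and the helpers readOrders and joinOrders are hypothetical names chosen for illustration; they are not part of the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object and helper names, assuming Spark 2.0+ is on the classpath.
object OrderJoinSketch {

  // Reads a two-column, tab-separated file; with no schema given,
  // the CSV reader keeps every column as a string.
  def readOrders(spark: SparkSession, path: String, dateCol: String): DataFrame =
    spark.read.option("sep", "\t").csv(path).toDF("order_no", dateCol)

  // Registers both files as temporary views and runs the same inner join.
  def joinOrders(spark: SparkSession, createdPath: String, pickedPath: String): DataFrame = {
    readOrders(spark, createdPath, "create_date").createOrReplaceTempView("t_order_created")
    readOrders(spark, pickedPath, "picked_date").createOrReplaceTempView("t_order_picked")
    spark.sql(
      """select a.order_no, a.create_date, b.picked_date
        |from t_order_created a
        |join t_order_picked b on a.order_no = b.order_no""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val spark = SparkSession.builder().appName("OrderJoinSketch").getOrCreate()
    // Same manual setting of the number of shuffle partitions as in the original.
    spark.conf.set("spark.sql.shuffle.partitions", "10")
    joinOrders(spark, "/data/order_created.txt", "/data/order_picked.txt").show(false)
    spark.stop()
  }
}
```

Reading the files through the CSV reader with a tab separator replaces the manual split("\t") and case-class mapping, so no implicit conversion from RDD is needed.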

 

The query returns the following result:

[10101043505096,2014-05-01 07:28:12.342+01,2014-05-01 08:29:12.342+01]
[10703007267488,2014-05-01 06:01:12.334+01,2014-05-01 07:02:12.334+01]
[10103043509747,2014-05-01 07:50:12.33+01,2014-05-01 10:55:12.33+01]
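The result has three rows even though order_created.txt has five, because the query is an inner join: an order appears only if its order number exists in both files. The following plain-Scala sketch (no Spark involved) illustrates that semantics on the same sample data. The object name InnerJoinSketch and its helpers are hypothetical, and the sketch assumes each order number appears at most once in order_picked.txt.

```scala
// Hypothetical illustration of inner-join semantics with plain Scala collections.
object InnerJoinSketch {

  // Splits each "order_no<TAB>timestamp" line into a (key, value) pair;
  // lines that do not have exactly two fields are skipped.
  def parse(lines: Seq[String]): Seq[(String, String)] =
    lines.map(_.split("\t")).collect { case Array(no, ts) => (no, ts) }

  // Inner join: keep only orders whose number appears on both sides.
  // Assumes order numbers are unique on the picked side.
  def innerJoin(created: Seq[(String, String)],
                picked: Seq[(String, String)]): Seq[(String, String, String)] = {
    val pickedByNo = picked.toMap
    created.flatMap { case (no, createDate) =>
      pickedByNo.get(no).map(pickedDate => (no, createDate, pickedDate))
    }
  }

  val created: Seq[(String, String)] = parse(Seq(
    "10703007267488\t2014-05-01 06:01:12.334+01",
    "10101043505096\t2014-05-01 07:28:12.342+01",
    "10103043509747\t2014-05-01 07:50:12.33+01",
    "10103043501575\t2014-05-01 09:27:12.33+01",
    "10104043514061\t2014-05-01 09:03:12.324+01"))

  val picked: Seq[(String, String)] = parse(Seq(
    "10703007267488\t2014-05-01 07:02:12.334+01",
    "10101043505096\t2014-05-01 08:29:12.342+01",
    "10103043509747\t2014-05-01 10:55:12.33+01"))

  def main(args: Array[String]): Unit =
    innerJoin(created, picked).foreach(println)
}
```

Orders 10103043501575 and 10104043514061 have no pickup record, so they are dropped. The sketch emits rows in the order of the created list, whereas Spark gives no ordering guarantee without an explicit order by, which is why the row order in the Spark output above differs from the input order.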

 

Reposted from: https://www.cnblogs.com/luogankun/p/4268431.html
