Background: I need to count, from nginx logs, how many impressions each product gets on a list page. The data is stored in a Hive table.
There are 3 core fields: request (the user's request parameters), response (the API response body), and url (the API endpoint); all 3 are of type string.
The list API's response looks roughly like this:
{
    "errno": 0,
    "errmsg": "SUCCESS",
    "data": {
        "list": [
            {
                "id": 123,
                "name": "xxx",
                "other": {
                    "xxx": 1
                }
            },
            {
                "id": 456,
                "name": "xxx",
                "other": {
                    "xxx": 1
                }
            },
            {
                "id": 789,
                "name": "xxx",
                "other": {
                    "xxx": 1
                }
            },
            {
                "id": 222,
                "name": "xxx",
                "other": {
                    "xxx": 1
                }
            }
        ],
        "page": {
            "current_page": 1,
            "total_page": 2151,
            "count": 10,
            "total_count": 21508
        }
    }
}
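For reference, the extraction in step 1 of my plan below relies on get_json_object's JSON-path syntax; in isolation it looks like this (a sketch using the same table and column names as my query):

```sql
-- Step 1 in isolation: pull the serialized list array out of response.
-- '$.data.list' is a JSON path; the result is still a plain string.
select get_json_object(response, '$.data.list') as list_str
from nginx_log
where url = 'product_list';
-- list_str is the serialized array: [{"id":123,...},{"id":456,...},...]
```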
My current approach:
step1: use get_json_object to extract the list field from response
step2: use a regex to strip the array's leading '[' and trailing ']'
step3: split the string into an array of JSON object strings
step4: turn that array into rows (explode)
step5: use get_json_object again to pull the id out of each row
step6: group by id and count
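To sanity-check steps 2-4 without touching the full table, the same regex chain can be run on a hard-coded two-element array (on Hive versions that allow SELECT without FROM; note the doubled backslashes, which Hive string literals need in order to pass \[ through to the regex engine):

```sql
-- Steps 2-4 on a literal: strip the outer [ ], mark object boundaries
-- with ||, split on the marker, then explode into rows.
select explode(
    split(
        regexp_replace(
            regexp_extract(
                '[{"id":123,"other":{"x":1}},{"id":456,"other":{"x":1}}]',
                '^\\[(.+)\\]$',
                1
            ),
            '\\}\\},\\{',   -- the "}},{" boundary between adjacent objects
            '}}||{'
        ),
        '\\|\\|'
    )
);
-- should yield two rows: {"id":123,"other":{"x":1}} and {"id":456,"other":{"x":1}}
```

This boundary trick only works because every list element ends with the same fixed nesting ("other": {...} followed by the object's own closing brace, i.e. "}}").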
The SQL I have so far:
select
j.json_detail
from
(
select
split(
regexp_replace(
regexp_extract(
get_json_object(
response,
'$.data.list'
),
'^\[(.+)\]$',
1
),
'\}\}\,\{',
'\}\}\|\|\{'
),
'\\|\\|'
) as info
from
nginx_log
where
concat_ws('-', year, month, day) = '${query_time}'
and url = 'product_list'
) m lateral view explode(info) j as json_detail;
This SQL stops running once I add step 4; it fails with the error below.
Execution time: 2 min 22 s
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=
INFO : number of splits:2
INFO : Submitting tokens for job: job_xxx_xxx
INFO : The url to track the job: http://xxxxx/xxxx
INFO : Starting Job = job_xxx_xxx, Tracking URL = xxx
INFO : Kill Command = /usr/local/hadoop-current/bin/hadoop job -kill job_xxx_xxx
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2021-10-20 19:13:29,451 Stage-1 map = 0%, reduce = 0%
INFO : 2021-10-20 19:14:14,588 Stage-1 map = 100%, reduce = 100%
ERROR : Ended Job = job_xxx_xxx with errors
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
This Stack Overflow post says this error message doesn't actually indicate where the problem is, but I don't know where to look for the detailed logs either.
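For completeness, here is how I imagine steps 5 and 6 stacking on top of the explode, with the regex backslashes consistently doubled (my current query mixes single backslashes in the regexp_extract/regexp_replace patterns with doubled ones in the split pattern, which may be part of the problem; this is a sketch, not a verified fix):

```sql
select
    get_json_object(j.json_detail, '$.id') as product_id,
    count(1) as exposure_cnt
from
(
    select
        split(
            regexp_replace(
                regexp_extract(
                    get_json_object(response, '$.data.list'),
                    '^\\[(.+)\\]$',   -- step 2: strip the outer [ ]
                    1
                ),
                '\\}\\},\\{',          -- step 3: mark the "}},{" boundary...
                '}}||{'
            ),
            '\\|\\|'                   -- ...and split on the marker
        ) as info
    from nginx_log
    where concat_ws('-', year, month, day) = '${query_time}'
      and url = 'product_list'
) m lateral view explode(info) j as json_detail   -- step 4
group by get_json_object(j.json_detail, '$.id');  -- steps 5 and 6
```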