Hive: multi-insert and parallel execution problem

I’ve been having trouble with Hive: after adding a SELECT clause to a multi-insert statement I started seeing java.lang.InterruptedExceptions. What follows is the smallest example I’ve been able to put together that demonstrates the problem. It fails most of the time, but just occasionally runs without a problem.

The query is:

FROM (
  SELECT a, b
    FROM input_a
    JOIN input_b ON input_a.key = input_b.key
) INPUT
INSERT OVERWRITE TABLE output_a
SELECT DISTINCT a
INSERT OVERWRITE TABLE output_b
SELECT DISTINCT b;

and the error from the hive CLI client is:

java.io.IOException: java.lang.InterruptedException
	at org.apache.hadoop.ipc.Client.call(Client.java:1204)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
	at $Proxy15.mkdirs(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor76.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
	at $Proxy15.mkdirs(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:425)
	at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2108)
	at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2079)
	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:543)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1867)
	at org.apache.hadoop.hive.ql.exec.ExecDriver.createTmpDirs(ExecDriver.java:221)
	at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:444)
	at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:47)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1279)
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
	at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:913)
	at org.apache.hadoop.ipc.Client.call(Client.java:1198)
	... 19 more
Job Submission failed with exception 'java.io.IOException(java.lang.InterruptedException)'

after which I’m dumped back at the console.

The input and output tables are as follows:

CREATE EXTERNAL TABLE input_a(a INT, key INT)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION '/user/vagrant/multi-insert-input-a';
 
CREATE EXTERNAL TABLE input_b(b INT, key INT)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION '/user/vagrant/multi-insert-input-b';
 
hive> SELECT * FROM input_a;
OK
10  1
20  2
30  3
40  4
50  5
Time taken: 4.043 seconds
hive> SELECT * FROM input_b;
OK
100 1
200 2
300 3
400 4
500 5
Time taken: 0.142 seconds
 
CREATE TABLE output_a(a INT);
CREATE TABLE output_b(b INT);

I discovered this problem when I took an existing multi-table insert statement, which had several INSERT clauses but only one using the DISTINCT option, and added a new clause that also used DISTINCT. Notice in the full output below that the failure occurs just after Hive attempts to start jobs 2 and 3 in parallel. If I stop the jobs from running in parallel, everything works fine; one obvious way to do that is to set hive.exec.parallel to false. Alternatively, if the query is simplified, e.g. by removing the DISTINCT option from either INSERT, then only two jobs are required, and since the second depends on the first they don’t run in parallel, so there’s no problem. I think this explains why adding the extra clause triggered my problem.
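For reference, the workaround looks like this from the Hive CLI: setting hive.exec.parallel to false for the session before running the affected query forces the two job-2/job-3 stages to be submitted sequentially.

```sql
-- Workaround: disable parallel stage execution for this session
-- before running the affected multi-insert query.
SET hive.exec.parallel=false;

FROM (
  SELECT a, b
    FROM input_a
    JOIN input_b ON input_a.key = input_b.key
) input
INSERT OVERWRITE TABLE output_a
SELECT DISTINCT a
INSERT OVERWRITE TABLE output_b
SELECT DISTINCT b;
```

The setting could also be made permanent in hive-site.xml, but a per-session SET keeps parallel execution available for queries that aren’t affected.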

Searching the web, I found similar, though not identical, errors in an unanswered hive-user mailing list post and a blog post (which is far beyond my Japanese reading skills). With help from Google Translate, I believe the author of that blog suggests the problem only presents itself when a small number of nodes is used, and I’m running in pseudo-distributed mode. The problem also looks like it might be related to issue HADOOP-6762, which was recently fixed in development versions. Can anyone confirm my problem is a symptom of that issue? I’m using Cloudera CDH4, and I noticed in the HADOOP-6762 discussion that someone suggests a fix for the same problem has been in the CDH distribution for some time.

I’m off to try and get help on the CDH Users group. I’ll update if there’s any progress.

For reference, I’m running hive_0.10.0+67-1.cdh4.2.0.p0.10~squeeze-cdh4.2.0_all.deb in pseudo-distributed mode with MR1, all inside a VirtualBox VM running Debian squeeze amd64. Here’s the full output from the hive CLI tool:

hive> FROM (
    >   SELECT a, b
    >     FROM input_a
    >     JOIN input_b ON input_a.key = input_b.key
    > ) input
    > INSERT OVERWRITE TABLE output_a
    > SELECT DISTINCT a
    > INSERT OVERWRITE TABLE output_b
    > SELECT DISTINCT b;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201302281640_0094, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201302281640_0094
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201302281640_0094
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2013-02-28 22:00:11,242 Stage-2 map = 0%,  reduce = 0%
2013-02-28 22:00:17,286 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.21 sec
2013-02-28 22:00:18,294 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.21 sec
2013-02-28 22:00:19,305 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.21 sec
2013-02-28 22:00:20,314 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.21 sec
2013-02-28 22:00:21,327 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.45 sec
2013-02-28 22:00:22,336 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.45 sec
2013-02-28 22:00:23,356 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.45 sec
2013-02-28 22:00:24,372 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.45 sec
MapReduce Total cumulative CPU time: 4 seconds 450 msec
Ended Job = job_201302281640_0094
Launching Job 2 out of 3
Launching Job 3 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
java.io.IOException: java.lang.InterruptedException
	at org.apache.hadoop.ipc.Client.call(Client.java:1204)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
	at $Proxy15.mkdirs(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor76.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
	at $Proxy15.mkdirs(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:425)
	at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2108)
	at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2079)
	at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:543)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1867)
	at org.apache.hadoop.hive.ql.exec.ExecDriver.createTmpDirs(ExecDriver.java:221)
	at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:444)
	at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:47)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1279)
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
	at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:913)
	at org.apache.hadoop.ipc.Client.call(Client.java:1198)
	... 19 more
Job Submission failed with exception 'java.io.IOException(java.lang.InterruptedException)'
Starting Job = job_201302281640_0095, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201302281640_0095
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201302281640_0095
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask

5 thoughts on “Hive: multi-insert and parallel execution problem”

  1. I haven’t yet. I’ve had no response from the CDH Users group. My next idea was to see if I could reproduce on a vanilla Hive install, and if so approach a Hive user group for help. In the meantime we’re working round the problem by turning off parallel execution when running the affected query.

  2. Today I planned to try to reproduce this on vanilla Hive, but discovered someone else has since hit the same problem: issue HIVE-4436. I’ve confirmed that the included patch, applied against CDH 4.1.2’s version of Hive, fixes the symptom in our environment. (Yes, we were using 4.2.0, but have since moved back to 4.1.2.)

    Hopefully that bug will soon be fixed and an upgrade will solve our problems.

  3. Hi mick

    Have you fixed this issue? If so, please reply.
    I’m using Apache Hadoop 1.1.2 and Apache Hive 0.11.
    In Hadoop 1.1.2 the patch for HADOOP-6762 is not available; it’s only available from the 2.0 versions.

  4. Hi Raju,

    The patch on issue HIVE-4436 (not 6762) fixed it for us, but we were using a different version of Hadoop from you. I hadn’t noticed before that the title for HIVE-4436 says the problem is for Hadoop-2. I saw the same error on MR1 in CDH4.1.2, which I think means version 0.20.x of Hadoop, so I guess the problem affects more versions. I think your options are to try an upgrade to Hive 0.12.0, which includes the relevant patch, or to disable hive.stats.autogather, which is the suggested workaround in HIVE-4436. (Or to apply the patch to some other version of Hive, but that seems to be more effort and more risk than just upgrading to 0.12.0.)

    I hope that helps.
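    For anyone stuck on a version they can’t upgrade, the HIVE-4436 workaround mentioned above looks like this at the Hive CLI (it can also be made permanent in hive-site.xml):

    ```sql
    -- Workaround suggested in HIVE-4436: turn off automatic
    -- statistics gathering for this session, which avoids the
    -- failure during parallel job submission.
    SET hive.stats.autogather=false;
    ```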
