Tag Archives: Database

Hive: multi-insert and parallel execution problem

February 28, 2013 Mick 5 Comments

I’ve been having trouble with Hive after I added a SELECT clause to a multi-insert and started seeing java.lang.InterruptedExceptions. What follows is the smallest example I’ve been able to put together to demonstrate the problem. It fails most of the time, but will just occasionally run without problem.

The query is:

FROM (
  SELECT a, b
    FROM input_a
    JOIN input_b ON input_a.key = input_b.key
) INPUT
INSERT OVERWRITE TABLE output_a
SELECT DISTINCT a
INSERT OVERWRITE TABLE output_b
SELECT DISTINCT b;

and the error from the hive CLI client is:

Continue reading Hive: multi-insert and parallel execution problem →
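While the excerpt stops before the error text, here is a rough sketch of two things one might try when narrowing this down. Both are assumptions rather than anything the excerpt confirms: that the failures are connected to Hive's parallel stage execution (controlled by the hive.exec.parallel setting), and that writing the multi-insert as two ordinary inserts is an acceptable fallback.

```sql
-- Assumption: parallel stage execution is switched on somewhere in your config.
-- Turning it off for this session is one way to test whether it is involved.
SET hive.exec.parallel=false;

-- Fallback: the same two outputs written as separate statements.
-- The results should match the multi-insert, but the join is evaluated twice.
INSERT OVERWRITE TABLE output_a
SELECT DISTINCT a
  FROM input_a
  JOIN input_b ON input_a.key = input_b.key;

INSERT OVERWRITE TABLE output_b
SELECT DISTINCT b
  FROM input_a
  JOIN input_b ON input_a.key = input_b.key;
```

The split form gives up the main attraction of a multi-insert, which is reading the joined input only once, so it is more of a diagnostic than a fix.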

Programming

Hive UDFs in views

February 15, 2013 Mick Leave a comment

You can create user-defined functions in Hive. Simple ones are simple. The syntax for declaring a function is also simple:

CREATE TEMPORARY FUNCTION my_func AS 'in.sinking.udf.MyFunction';

What’s that TEMPORARY doing there? Well, it means that my_func is only available during the current hive session.

I found myself creating a VIEW that uses my_func – how does that work? Pretty well, as long as you only query it from the same hive session in which you declare the function. When you next fire up hive you’ll find your VIEW mysteriously fails with:

SemanticException Line 16:4 Invalid function '`my_func`' in definition of VIEW ...

Gah – that took me a while to figure out. The workaround seems to be to bung the CREATE TEMPORARY FUNCTION ... statement into your .hiverc, so that it runs at the start of every session, thereby making it a bit more permanent. There is an old issue on a related subject on Hive’s Jira.
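A minimal sketch of what that .hiverc might contain. The jar path is a placeholder, and the ADD JAR line is an assumption: you only need it if the class isn't already on Hive's classpath.

```sql
-- ~/.hiverc: the Hive CLI runs these statements at the start of every session.
-- Placeholder path; point it at wherever the jar containing your UDF lives.
ADD JAR /path/to/my-udfs.jar;

-- Redeclare the function so that views referring to my_func can resolve it.
CREATE TEMPORARY FUNCTION my_func AS 'in.sinking.udf.MyFunction';
```

Note that this only helps sessions started through the CLI by a user with this .hiverc; anyone else querying the VIEW will still need to declare the function themselves.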

Programming

Seven Databases: Neo4j and misunderstanding indexes

September 18, 2012 Mick Leave a comment

The Neo4j chapter of Seven Databases in Seven Weeks has a short discussion of indexing (starting on p241 of the P1.0 version of the PDF). I found it misled me into thinking there were two types of index, when really there are just two ways to query an index.

Continue reading Seven Databases: Neo4j and misunderstanding indexes →

Programming

Seven Databases: MongoDB and Cities

September 4, 2012 Mick 2 Comments

A few weeks ago the “nerd club” reading group at we7 moved on to Seven Databases in Seven Weeks. It’s a fun book, and I’ve been enjoying working through the exercises. The chapter on MongoDB has an exercise to use the geospatial indexing feature to search for “cities” near London. After a bit of digging and some pretty pictures I discovered that things are not quite right with the supplied data.

Continue reading Seven Databases: MongoDB and Cities →

Programming

SQL Set Intersection – harder than I thought?

February 26, 2012 Mick Leave a comment

I’ve heard it said that SQL is a set-based language, and that thinking in sets is the way to make proper use of it. So I thought I was on solid ground when I decided to represent a bag of sets as a relation:

CREATE TABLE bag_of_sets (
  set_id INTEGER NOT NULL,
  member CHAR(1) NOT NULL,
  CONSTRAINT enforce_set PRIMARY KEY (set_id, member)
);

As far as I know that’s a good, fully-normalized representation. As intersection is such a fundamental operation on sets, I expected to find a natural way to express it in SQL. But this is the best I can manage:

-- Intersection
SELECT
    member
  FROM bag_of_sets
  GROUP BY member
  HAVING COUNT(*) = (SELECT COUNT(DISTINCT set_id) FROM bag_of_sets);

Ugly, isn’t it? How long does it take you to realise what it’s trying to do? And then, how long does it take you to be sure it’s correct? The intention is hidden amongst lots of implementation detail (a thought I often have about the SQL I write). And having to mention bag_of_sets twice seems so wrong. Have I missed something better?

Continue reading SQL Set Intersection – harder than I thought? →
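To make the intersection query easier to check, here is a small worked example against the bag_of_sets table defined above. The sample rows are made up for illustration, and the multi-row VALUES syntax assumes a reasonably recent database.

```sql
-- Three sets: 1 = {a, b, c}, 2 = {b, c, d}, 3 = {b, c, e}
INSERT INTO bag_of_sets (set_id, member) VALUES
  (1, 'a'), (1, 'b'), (1, 'c'),
  (2, 'b'), (2, 'c'), (2, 'd'),
  (3, 'b'), (3, 'c'), (3, 'e');

-- A member is in the intersection when it appears once for every set,
-- so its row count must equal the number of distinct sets (here, 3).
SELECT member
  FROM bag_of_sets
  GROUP BY member
  HAVING COUNT(*) = (SELECT COUNT(DISTINCT set_id) FROM bag_of_sets)
  ORDER BY member;
-- returns 'b' and 'c', the only members present in all three sets
```

The COUNT(*) comparison relies on the primary key: because (set_id, member) is unique, a member can never be counted twice for the same set.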

Programming

Pure SQL put-if-absent (or insert-if-not-exist)

August 17, 2008 Mick Leave a comment

I learned a little trick from a colleague this week: a pure SQL put-if-absent operation. I needed a database patch to insert a couple of rows into a database table. Sounds dull, but gets a little more interesting because I was trying to pay off a small technical debt: these rows had already been added to the production database. I needed an idempotent patch, one that would add the rows to any development database which doesn’t have them, but that wouldn’t give an error on the production database, which does.

It would have been reasonably simple to write in a procedural language (we use PostgreSQL, so it would have been PL/pgSQL), but for that I’d need to write a function, then call that function from a SELECT statement, then delete the function. But that all seems a bit messy.

To illustrate the SQL alternative we found, we’ll need a table to update. Here’s a playpen we can experiment in:
Continue reading Pure SQL put-if-absent (or insert-if-not-exist) →
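For reference, one common pure-SQL shape for put-if-absent is INSERT ... SELECT ... WHERE NOT EXISTS. This is a sketch of that general idiom, not necessarily the approach the full post arrives at, and the settings table and its columns are hypothetical names invented for the example.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE settings (
  name  VARCHAR(50) PRIMARY KEY,
  value VARCHAR(50) NOT NULL
);

-- Insert the row only when no row with that key exists yet.
-- Running this statement a second time inserts nothing and raises no error.
INSERT INTO settings (name, value)
SELECT 'colour', 'blue'
 WHERE NOT EXISTS (SELECT 1 FROM settings WHERE name = 'colour');
```

This is fine for a one-off patch, but it is not safe against concurrent writers: two sessions can both pass the NOT EXISTS check before either one commits, and the second insert would then fail on the primary key.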

Sinking In

Probably overthinking it…