Quantcast
Channel: Database – IT Hare on Soft.ware
Viewing all articles
Browse latest Browse all 11

DB Execution Plans for C++/Java Developers, and for anybody who has read Knuth

$
0
0
Execution Plans with no index, single-column index, and multi-column index
#DDMoG, Vol. VII

[[This is Chapter 27(b) from “beta” Volume VII of the upcoming book "Development&Deployment of Multiplayer Online Games", which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the "release" book to those who help with improving; for further details see "Book Beta Testing". All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the "1st beta" of the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

As it was already noted – execution plans are the cornerstone of database query optimization; however – surprisingly few people out there realize how they work. On the other hand – IMO the basic understanding of “how this execution plan will perform”, and “what is the ideal execution plan we’d like to get” is very far from any black magic and is pretty easy to achieve, as long as we’re familiar with basic data structures (discussed at length in [Knuth-Vol1] and [Knuth-Vol3]), more specifically – trees, lists, and hash tables. For the purposes of this Chapter, we’ll use C++ and Java to illustrate these structures; however – this understanding is not language-specific and is actually all about big-O complexities of basic operations over trees, lists and hash tables.

That being said, it is important to realize inherent limitations of such analysis:

  • This kind of analysis is a big-O analysis, which means that it is inherently approximate. It may easily happen that whatever-execution-plan-we-devised-along-the-lines-discussed-here, is inferior to whatever-your-SQL-compiler-generates.
    • Still, IF we see that your SQL-compiler’s-version is orders of magnitude worse (~=”has worse big-O asymptotic”) than our-own-execution-plan – it is a very strong reason to start questioning the optimality of the plan currently generated by our SQL-compiler’s.
  • When we have that ideal-execution-plan – we still need to convince our SQL compiler to use it
    • Most of the time1 – it is not a problem; as soon as we create an index which is necessary to enable that ideal-execution-plan – most of decent SQL compilers tend to use it and get pretty close to the ideal.
    • However, pretty often it happens that after creating the index, SQL compiler still doesn’t want to go for our ideal execution plan. It means one of two things:
      • Our plan is not really ideal, or
      • We need to use some kind of RDBMS-dependent black magic to convince SQL compiler to kill his darling execution plan, and to make it use an apparently-better-our-one. This process is highly dependent on a specific DB you have, and can use anything from officially-supported optimizer hints to outright black magic such as “adding OR 1=0 to WHERE clause”2 (and if you’re on PostgreSQL, you’re out of luck at least for the foreseeable future3).

1 Unfortunately, it doesn’t happen as often as we’d like it to be; from my experience, it is like “60 to 80% of all the cases I’ve seen”.
2 Hi, DB/2 DBAs
3 As of 2017, PostgreSQL developers are still living in an imaginary world of “hey, our almighty SQL compiler can optimize every single case ideally” (and if it cannot – it is a bug to be fixed). In this regard – while I certainly admire their courage, it is worth noting that the developers of DB/2 (which is BTW one-of-the-longest-existing RBDMS now), had been living in exactly the same imaginary world for about 20 years – before reverting their stance around DB/2 v8. It is even more interesting that even while de-jure hints were still denied by DB/2 team – they were forced to commit to supporting-forever-and-ever de-facto hints such as adding “OR 1=0” to WHERE clause (“OR 1=0” practice was just that ubiquitous among their serious customers, that dropping it would break way too many banks/insurance-companies/stock-exchanges/… around the globe

 

Approximating Execution Plans using Familiar Terms

The whole point of this exercise is to take execution-plan primitives (those which we discussed above) and to express them in the terms-which-are-familiar-to-us (as in “us=non-DB developers”). Let’s note that all the examples below are NOT “a thing which we’ll write to manipulate DB”, and even NOT “how DB works internally”; instead – they are “how we can approximate behavior of execution plans performance-wise, using terms-which-are-familiar to-non-DB-developers to understand execution plans better”. Sure, as any approximation – this one will be, well, rather approximate <wink />; on the other hand – it has been seen to provide extremely good insights for optimizing SQL queries (we’ll see a very practical example of it a bit later).

Let’s name all our approximations made in this section, “big-O approximations of execution plans”.

Approximating Tables and Indexes

For our purposes – we can approximate each table as a single-linked (and actually unordered, with list order potentially changing on each run(!)) list consisting of rows.

And each B-tree index can be approximated as a tree-based map, mapping index key into pointers to these rows-sitting-within-the-list. As for hash indexes – they can be approximated as similar hash-based maps.

It can be summarized as the following table:

RDBMS C++ big-O approximation Java big-O approximation Big-O approximation using terms from Knuth
Table std::list<Row>4 List<Row> Single-linked list5
B-tree Index std::map<Key,Row*> TreeMap<Key,Row> Tree (with added pointers to the next node, to facilitate O(1) iterations)
Hash Index std::unordered_map<Key,Row*> HashMap<Key,Row> Hash table

4 Note that unlike with std::list<>, RDBMS doesn’t really guarantee that the order of the iteration over the list is the same for each iteration
5 As for RDBMS, order of rows within the list is not guaranteed and we are (mostly) iterating over the whole list – single-linked list will do

 

Approximating Scans

With this in mind – we can proceed with approximating Table Scans and Index Scans:

RDBMS C++ big-O approximation Java big-O approximation Big-O approximation using terms from Knuth Vol. I Complexity
Table Scan Iteration over std::list<> Iteration over List<> Iteration over list O(N)
Index Search / Unique Index Scan (B-tree Index) std:: map<>:: find() TreeMap.get() Search in tree O(log(N))
Index Search / Unique Index Scan (Hash Index) std:: unordered_map<>:: find() HashMap.get() Search in hash table O(1)
Range Index Scan (B-tree Index)6 Iteration over std::map<>7 Iteration over TreeMap Traversing tree O(N)8
Range Index Scan with LIMIT/FETCH FIRST/SELECT TOP/… (B-tree Index) Iteration over std::map<> with early exit by the number of items specified in LIMIT etc. Iteration over TreeMap with early exit Iteration over tree with early exit O(log(N))+
min(O(M),L)
=O(log(N)+O(1)
=O(log(N))9

To make it more obvious – let’s write down our approximation of the Range-Index-Scan-with-LIMIT in C++:10

using Index std::map<Key,Row*>;

ResultSet<Row*> RangeIndexScanWithLimit( 
                Index& index, Key rangeBegin, Key rangeEnd, 
                size_t limit ) {
  ResultSet<Row*> ret;
  size_t n = 0;
  for( Index::iterator it=index.find(rangeBegin),
       n <= limit &&
         it != index.end() &&
         (*it).first <= rangeEnd,
       ++it, ++n ) {
    ret.append( (*it).second );
  }
  return ret;
}

There isn’t much to explain here: we’re just finding the boundary of the range, then iterate through our std::map<> and then simply break our search if we run out of the range, OR if we reach pre-defined limit.

The same thing expressed in Java terms, would look very similar:

public class Index extends TreeMap<Key,Row> {}

ResultSet<Row> RangeIndexScanWithLimit( 
               Index index, Key rangeBegin, Key rangeEnd, 
               int limit ) {
  ResultSet<Row> ret = new ResultSet<Row>();
  int n = 0;
  for( Map.Entry<Key,Row> entry : 
       index.subMap(rangeBegin,rangeEnd) ) {
    if( n++ >= limit )
      break;
    ret.append( entry.getValue() );
  }
  return ret;
}

And if speaking in terms of Knuth, it is still about the same:

  • We have to find one boundary of the range (which is an O(log(N) operation)
  • Then we have to traverse our tree in the order of our key, starting from found range boundary
    • Traversing has to last until either we run into the end of the range (which is O(N)) – or until we got all our LIMIT rows (L); it means that such traversing is a min(O(M),L) operation

6 not available for Hash Index
7 NB: just as std::map<> iterations and TreeMap iterations, RDBMS iterations return sorted sets
8 N being “number of elements within the range”
9 Here: L is a LIMIT parameter, N is a number of items in the whole index, and M < N – is a number of items in the requested range; as L is independent of N – it formally qualifies as O(1) (in practice – it can be assumed so as long as L << N, which surely stands for any L within a few hundreds or so).
10 to simplify things, in all our approximation examples we’ll be omitting all the template-related stuff (implying it whenever necessary)

 

Approximating Joins

After we described our approximations for tables, indexes, and scans – the only thing we need to understand to proceed further, is how to approximate execution-plan-level joins (Nested Loop Join, Sorted Merge Join, and Hash Join). Apparently, approximating them is not a rocket science too.

Nested Loop Join

In our C++ big-O approximation, Nested Loop Join (a.k.a. NL Join) would look along the following lines:

ResultSet<Row*> NestedLoopJoin( 
                ResultSet<Row2*> &outer,
                SearchableDataSet<Row*>& inner) {
    //SerchableDataSet allows to perform searches within;
    //  in practice – usually it is either Index Scan, or
    //  Table Scan, or another ResultSet<>
  ResultSet<Row*> ret;
  for( auto outeritem: outer ) {
    Key key = outeritem->key;
    Row* found = inner.find(key);
    if( found )
      ret.append(found);
  }
  return ret;
}

In Java, the same thing would look similar to the following:

ResultSet<Row> NestedLoopJoin( 
               ResultSet<Row2> outer,
               SearchableDataSet<Row> inner ) {
  ResultSet<Row> ret = new ResultSet<Row>();
  for( OuterEntry outeritem : outer.entrySet() ) {
    Key key = outeritem.getKey();
    Row found = inner.get(key);
    if( found != null )
      ret.append(found);
  }
  return ret;
}

And in terms of our Knuth-based big-O approximation, we can describe the same thing as follows:

  • We’re going along the outer data set (which actually can be any kind of collection), iterating through its items
  • For each such item, we’re extracting an item – and get an item key out of it (in terms of RDBMS – the key is usually11 one of the sides of equation in JOIN … ON statement)
  • For each such item key – we’re performing a search within the second data set to find if there is a match.

Whatever of the C++/Java/[Knuth] approximation we’re using – the performance-wise result is still the same: we have to make N (where N is number of items in the outer data set) searches over the inner data set. As a result – if the inner data set is actually a Table Scan – we’re speaking about the cost of NL Join being O(N*M)12 <crazy! />, and if the inner data set is an Index Scan – it is more like O(N*log(M)).


11 though not exactly necessary, as other statements – in particular, nested SELECT statements – can often be converted to different execution-plan-level Joins
12 with M being a number of rows in the inner data set

 

Sort Merge Join

When speaking of Sort Merge Joins – they’re also simple; the most important thing is that we’re assuming that both input data sets for Sort Merge Join are already sorted. Then, in our C++ approximation, Sort Merge Join will look along the following lines:

SortedResultSet<Row*> SortMergeJoin( 
                      SortedResultSet<Row2*> &a, 
                      SortedResultSet<Row*>& b) {
  SortedResultSet<Row*> ret;
  auto ait = a.begin();
  auto bit = b.begin();
  while( ait != a.end() && bit != b.end() ) {
    if( (*ait)->key < (*bit)->key )
      ++ait;
    else if( (*ait)->key > (*bit)->key )
      ++bit;
    else {
      ret.append( *bit.second );
      ++ait;
      ++bit;
    }
  }
  return ret;
}

In Java, it would look more or less as follows:

SortedResultSet<Row> SortMergeJoin( 
                     SortedResultSet<Row2> a,
                     SortedResultSet<Row> b ) {
  SortedResultSet<Row> ret = new SortedResultSet<Row>();
  SortedResultSetIterator<Key,Row2> ait = a.iterator();
  SortedResultSetIterator<Key,Row> bit = b.iterator();

  if( !ait.hasNext() || !bit.hasNext() )
    return ret;

  Map.Entry<Key,Row> lasta = ait.next();
  Map.Entry<Key,Row> lastb = bit.next();
  while( ait.hasNext() && bit.hasNext() ) {
    int cmp = lasta.getKey().compare(lastb.getKey());
    if( cmp < 0 )
      lasta = ait.next();
    else if( cmp > 0 )
      lastb = bit.next();
    else {
      ret.append( lastb.getValue() );
      lasta = ait.next();
      lastb = bit.next();
    }
  }
  return ret;
}

And in terms of our Knuth-based big-O approximation (actually, it is “Merge Sort” from Vol. 3), it can be expressed along the following lines:

  • We’re going over two sorted lists, and compare the values at current positions within the lists
  • If current value within the first list is smaller – we’re advancing in the first list, if current value in the second list is smaller – we’re advancing in the second list.
  • If values in both lists are equal – we found the match, so it belongs to our output result set

Hash Join

With Hash Join – the situation is a bit more interesting. Here, we’ll be using a temporary in-memory structure to handle the join. C++ approximation of a Hash Join may look along the following lines:

ResultSet<Row*> HashJoin(
                ResultSet<Row2*> &a,
                ResultSet<Row*>& b) {
  //preparing temporary hash table
  std::unordered_map<Key,Row*> tmp;
  for( auto itema: a )
    tmp.insert(itema->key,itema);

  ResultSet<Row*> ret;
  for( auto itemb: b ) {
    auto found = tmp.find( itemb->key );
    if( found != tmp.end() )
      ret.append( found.second );
  }
  return ret;
}

A similar thing in Java would look somewhat similar to the following:

ResultSet<Row> HashJoin( 
               ResultSet<Row2> a,
               ResultSet<Row> b ) {
  //preparing temporary hash table
  HashMap<Key,Row> tmp = new HashMap<Key,Row>();
  for( Map.Entry<Key,Row2> itema : a.entrySet() )
    tmp.put(itema.getKey(),itema.getValue());

  ResultSet<Row> ret = new ResultSet<Row>();
  for(Map.Entry<Key,Row> itemb : b.entrySet()) {
    Row found = tmp.get( itemb.getKey());
    if( found != null )
      ret.append( found );
  }
  return ret;
}

And for our Knuth-inspired big-O approximation, we can say that:

  • We convert one of the data sets into a temporary in-memory mapping (usually hash-based one)
  • Then, iterating over another data set – we’re making a very quick search within this temporary mapping

Most importantly, regardless of the way we represent it – we can see that for a Hash Join the following observations stand:

  • It makes sense ONLY if one of the source data sets is small enough to fit in RAM
  • As long as the above stands, Hash Join is not that bad: as search in a temporary hash map is O(1), we can get O(N) complexity for the whole Hash Join (where N is a number of items in the larger set)

A Very Practical Real-World Example

Now, with all the items in our vocabulary defined, let’s see how this approach can be applied to a very practical example from the real world.13

In pretty much all kinds of operational databases, there is an extremely common request: “Gimme last 100 games by this particular player”. For games – the need of such a request usually follows from the need to show most recent games to the player. However – it is certainly NOT restricted to games: in particular, for a shopping site it can be “last 20 orders to show on the first ‘last orders’ page”, and for a bank it can be “last 50 account entries to show”. NB: let’s note that the final result of our exercise – while being unknown at least to 90% of DBAs out there – is certainly NOT novel. What is interesting, however – is how we can use analogies from C++/Java/Knuth to arrive to this not-so-trivial end result.

Once upon a time, there was a system which needed to perform this “last 100 games for a particular player” request some thousands of time per day. Pretty soon – it became a significant bottleneck. And while in the appropriate GAMES table there were indexes both by TIME and by PLAYERID – it didn’t really work fast. Analysis of execution plans has revealed that optimizer could use either index by TIME, or the one by PLAYERID, but none of them was good:14

  • If using index by TIME with LIMIT – the system would need to scan all the index entries by all the players until 100 rows for this particular player was reached. So, if our player played little enough to have his 100 last games to go into last year – we were in a Deep Trouble™. With this index – the complexity of the execution plan effectively was O(Nplayers).
  • If using index by PLAYERID – LIMIT wouldn’t work (we won’t have any sorting after the index by mere PLAYERID), so if the player in question was an active one (with tens of thousands games overall) – then it also caused significant slowdowns. With such an index by PLAYERID, the complexity of the execution plan effectively was O(Ngames-for-specific-player).

At this point – we started to think “what can be improved here?”. And the most important thing-we-have-done – was starting to think:

If I was a compiler-with-ability-to-create-indexes-as-I-need-it – what would I do to make The Ideal Execution Plan™ for this request?

Keeping the primitives above in mind, we came up with the following Knuth-inspired approximation:

  • We’d use some kind of index to jump to the very last-of-the-games-for-this-specific-player
  • Then, we’d scan this index (ONLY by this-specific-player) backwards for just 100 records.
  • Bingo! The cost would be only reading 100 rows, which is going to be pretty fast (in fact – for a sizeable table with lots of different players it is ideally fast, you cannot possibly beat this approach). From a Big-O point of view – the complexity of this approach had an upper bound of O(100)=O(1) <sic! />.

If we’d use C++ big-O approximations at that time, our Ideal Execution Plan™ would look as follows:

ResultSet<Row*> last100Games( int playerID, int limit,
                std::map<Key,Row*>& index ) {
  ResultSet<Row*> ret;
  size_t n = 0;
  for( Index::iterator it=index.find_last_by_time(playerID),
       n <= limit &&
         it != index.end(),
       --it, ++n ) {
    ret.append( (*it).second );
  }
  return ret;
}

And in Java, it would look similar to the following:

ResultSet<Row> last100Games( int playerID, int limit,
               TreeMap<Key,Row> index ) {
  ResultSet<Row> ret = new ResultSet<Row>();
  int n = 0;
  for( Map.Entry<Key,Row> entry :
       index.tailMap_by_time(playerID) ) {
    //to work, index should have DESCENDING sort order
    if( n++ >= limit )
      break;
    ret.append( entry.getValue() );
  }
  return ret;
}

Ok, with this nice Ideal Execution Plan in mind, now there was a second question: whether such-an-index-we-need-to-use-here, exists? Apparently – it does; in fact – if we make our Key as a tuple of (PlayerID,TIME) – the whole thing will work perfectly as designed.15 The trick here is that as soon as we have a set which is ordered by a tuple (with a quick search by that tuple available) – we can easily and extremely efficiently work with any of the subsets-of-this-set where subset is defined by first X fields of the tuple (in our case – it is a subset defined by a single field PlayerID). Most importantly – this is NOT a property which is unique to B-trees and RDBMS indexes; rather – it is a common property of ALL the ordered sets, including iterations over tree-based structures (including RDBMS B-trees, Knuth trees, C++ std::map<>, and Java’s TreeMap).

With this in mind – we thought that

Hey, if we add a composite index by (PLAYERID,TIME) DESCENDING,16 it should allow SQL compiler to use our Ideal-Execution-Plan, improving things for this particular request greatly.

In that real-world case I’m speaking about, the trickiest part was to convince our database developers that such an index can help. However – after we managed to push it through (and after waiting for half a day for the index to be created over that-table-with-a-hundred-of-millions-rows) – SQL compiler DID pick the index up, and performance improved between 10x and 100x(!).

And if we think about it – it is not surprising; given-the-data-which-was-stored-in-the-table-at-that-time – it is exactly the kind of gains one could expect from using such an index, for this type of request.

BTW, if we take a closer look at it – we’ll see that it is almost-equivalent to our approximated Ranged-Scan-With-LIMIT (and it was actually the execution plan used by complier after it has been provided with this appropriate composite index).


13 NB: when I was first faced with such a task – I didn’t think in the terms which were that clear (and I wouldn’t be able to explain my reasoning to the outside world at that time); however – now as I can externalize my-thinking-back-then, I can say that even back then my reasoning followed exactly the same logic.
14 This can be easily illustrated by using the technique discussed in this section – but to save a bit of space, I’ll leave it as a home exercise <wink />
15 actually, in the code above we’ll need to replace index.find_last_by_time(playerID) with index.find(Key(playerID,plus_infinity_time)), and index.tailMap_by_time(playerID) with index.tailMap(Key(playerID,plus_infinity_time)) – but it won’t really change much
16 or bidirectional

 

Different incarnations of the very same problem

If you think that all the stuff above is very obvious – good for you (and even better for your employer <wink />) but I can assure you that there are LOTS of people who don’t realize this kind of things.

Actually, this particular problem with such a composite index by (CUSTOMERID, TIME) missing (or unused), is very common in the industry; I’ve seen its manifestations much more than I’d like to. The most obvious one (in addition to the example above) was a case when a payment processor, to recover from “transaction is in unknown state”, provided an API which allowed merchant to request merchant’s transactions by MERCHANTID;17 however – this request was Really Slow™, so they had to restrict such requests to at-most-2-hours-of-transactions-at-a-time (and such a request took like 10 minutes to get reply from them back); as you can imagine – implementing a recovery which worked correctly if their site was down for more-than-2-hours – was quite involving (and for no real reason too18). TBH, I didn’t have a chance to see their DB, but with everything-I-know, I can be 99+% sure that their problem was that they didn’t have that composite index by (CUSTOMERID, TIME) (which, as we have seen, can be more-or-less easily deduced from the Ideal Execution Plan™ within the approximation paradigm above).


17 as discussed in Vol. VI’s chapter on Payment Processing, this is not the best way to handle such a recovery, but it would do – if not for those performance issues
18 BTW, if you happen to work for this company – please please add this index, it will cost pretty much nothing – and will simplify things both for you and your merchants greatly

 

TL;DR for Approximations of Execution Plans

Phew; I hope that I managed to convince you that big-O approximations such as the one above, MAY be a Really Good Thing™. On the other hand, I am the first one to admit that they’re not perfect:

  • Essentially, such approximations provide only big-O-like analysis; it means that they’re very VERY approximate to start with.
    • On the other hand – if your big-O analysis (for example, using approximations as discussed above) shows that your current execution plan can be improved – we can be easily speaking about performance gains of the orders of magnitude.
      • This in turn means, that for every query which is problematic, such a big-O analysis should be the very first step when trying to optimize.
      • Sure, there are many more other things which can be taken into account when optimizing (such as including non-key fields into the index) – but they tend to give much smaller gains than optimizations-which-improve-big-O-behavior.
    • Even after we have this Ideal-Execution-Plan and added all the necessary indexes – chances are that our specific almighty SQL compiler won’t be smart enough to pick it up. In this case – it becomes a job (IMO more like Black Magic™) for your DBAs to convince the compiler to do whatever-it-should-do.
      • Still – more often than not, the compiler is smart enough to use indexes at least in OLTP settings (as noted above – IMHO it is like 80%, though YMMV greatly).
      • Moreover – for all-DBs-with-hints-I-heard-about – more often than not, “convincing” SQL compiler19 is possible (especially if your DBAs are good, and do have experience with torturing “convincing” this particular RDBMS to do the right things).

19 though I have to admit that in practice it is more like “torturing compiler until it changes its mind”

 

[[To Be Continued…

This concludes beta Chapter 27(b) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”.

Stay tuned for beta Chapter 27(c), where we’ll continue our discussion on well-known ways for database optimization]]


References

[Knuth-Vol1] Donald E Knuth, “The Art of Computer Programming, vol. 1: Fundamental Algorithms”Amazon

[Knuth-Vol3] Donald E Knuth, “The Art of Computer Programming, vol. 3: Sorting and Searching”Amazon

Acknowledgement

Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.

P.S.

Don't like this post? Criticize↯

P.P.S.

We've tried to optimize our feed for viewing in your RSS viewer. However, our pages are quite complicated, so if you see any glitches when viewing this page in your RSS viewer, please refer to our original page.


Viewing all articles
Browse latest Browse all 11

Trending Articles