[Header image: a truck overfilled with cut logs, driving down the highway]

When does a table become too big?


I happened across an interesting thread on Twitter today, which posed the question of when you should intervene with alternative solutions as a table gets larger over time:

[tweet screenshot]

It is always good practice to think proactively about potential future problems, because it is obviously better to solve them before they become a crisis for your customers. Conversely, I can already hear you reaching for your keyboards, getting ready to flame me with Knuth quotes 🙂

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”

Anyway, after some others had inquired a little more about typical usage:

[tweet screenshot]

I replied with a claim about relational databases and how they would handle such a table:

[tweet screenshot]

But as anyone who knows me from AskTom will attest, I always recommend that no-one should ever believe anything they read on the internet 🙂 unless it comes with some evidence to back it up.

So here’s a demo I’ve put together on the machine at home, and I’m going to scale things up a tad to really test out just what the limits of using a relational database might be. Based on what I’ve read from the Twitter thread, I’m assuming we have a table that logs every time a user “likes” a post, and we’ll store when that like was logged. That gives me this simple structure:



SQL> create table user_likes (
  2    user_id int not null,
  3    post_id int not null,
  4    dt      date default sysdate not null
  5  ) ;

Table created.

I can’t be sure how close this is to the true requirement/implementation, but let’s go with it and see how it handles some load.

I’m going to populate this table under the following assumptions:

  • We have a user population of 200,000 distinct users (approximately 2x how many followers @techgirl1908 has).
  • We have 10,000 posts that might be liked (I don’t know how many posts have really been made).
  • To give our table some good volume, we’ll assume every post has been liked by every user 🙂

Based on these assumptions, we can populate the table in random order with the following INSERT:


SQL> insert /*+ APPEND */ into user_likes
  2  with
  3  u as (
  4    select /*+ materialize */ rownum user_id from dual
  5    connect by level <= 200000
  6    order by dbms_random.value ),
  7  p as (
  8    select /*+ materialize */ rownum post_id from dual
  9    connect by level <= 10000
 10    order by dbms_random.value )
 11  select user_id, post_id, sysdate
 12  from u,p;

2000000000 rows created.

which yields 200,000 x 10,000 = 2 billion rows. Yes, 2 billion. I don’t want to see if 9 million rows is an issue – I want to know if 200 times that amount is an issue 🙂

Now I’ll index that table based on the following (assumed) requirements:

  • For a known user and post, do a single row lookup to see if a “like” is present.
  • For a known user, show me all of the posts they have liked.
  • For a known post, show me all of the users that have liked it.

This is why I let every user and every post be present in the table: when we benchmark the three scenarios above, we won’t get any artificially good results due to the non-existence of data. In fact, I’ve done exactly the opposite. Not only will every query we perform in any of the above tests always find some data, the queries are deliberately more expensive than what we’d expect in a real-life scenario. Querying posts for a user will always find 10,000 posts! Querying users for a post will always find 200,000 users! I’m going for a worst-case scenario for the data we’re using.
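For reference, here is roughly what those three access patterns look like as plain SQL. (The bind variable names are placeholders of my own choosing, not anything from the Twitter thread.)


-- 1) single row lookup: has this user liked this post?
select user_id, post_id, dt
from   user_likes
where  user_id = :nominated_user
and    post_id = :nominated_post;

-- 2) all posts liked by a nominated user
select post_id
from   user_likes
where  user_id = :nominated_user;

-- 3) all users that have liked a nominated post
select user_id
from   user_likes
where  post_id = :nominated_post;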

I’ll add the two indexes we’ll need.


SQL> alter table user_likes
  2   add constraint user_likes_pk primary key
  3     ( user_id, post_id );

Table altered.

SQL> create index user_likes_ix on user_likes ( post_id, user_id , dt);

Index created.
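
If you’re following along at home, you’ll probably also want to gather optimizer statistics before running any benchmarks. Here’s a minimal sketch, assuming the table sits in your own schema:


begin
  dbms_stats.gather_table_stats(
    ownname => user,          -- current schema
    tabname => 'USER_LIKES',
    cascade => true );        -- gather statistics for the indexes as well
end;
/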

And now here’s a PL/SQL routine that will run through the test. It will do 50,000 iterations picking a random user and random post, and exercise the above queries.

Then we’ll output the total and average query elapsed times, plus the min/max extremes, to get an idea of the spread of response times.

(If you’re not familiar with PL/SQL, just skip ahead to the results at the bottom of the code block)


SQL> set serverout on
SQL> declare
  2    type numlist is table of number index by pls_integer;
  3    iter int := 50000;
  4    u numlist;
  5    p numlist;
  6    res numlist; resrow user_likes%rowtype;
  7    s1 timestamp;
  8    s2 timestamp;
  9    s3 timestamp;
 10    delt number;
 11    lo number := 999999999;
 12    hi number := 0;
 13    cnt int;
 14  begin
 15    select trunc(dbms_random.value(1,200000) )
 16    bulk collect into u
 17    from dual
 18    connect by level <= iter;
 19
 20    select trunc(dbms_random.value(1,10000) )
 21    bulk collect into p
 22    from dual
 23    connect by level <= iter;
 24
 25    s1 := localtimestamp;
 26    for i in 1 .. iter
 27    loop
 28      s2 := systimestamp;
 29      select post_id
 30      bulk collect into res
 31      from   user_likes
 32      where  user_id = u(i);
 33      delt := extract(second from localtimestamp - s2);
 34      if delt < lo then lo := delt; end if;
 35      if delt > hi then hi := delt; cnt := res.count; end if;
 36    end loop;
 37    delt := extract(second from localtimestamp - s1) + 60*extract(minute from localtimestamp - s1);
 38
 39    dbms_output.put_line(iter||' executions: All posts for nominated user');
 40    dbms_output.put_line('=====================================');
 41    dbms_output.put_line('Total: '||delt);
 42    dbms_output.put_line('Avg: '||(delt/iter));
 43    dbms_output.put_line('Low: '||lo);
 44    dbms_output.put_line('Hi: '||hi);
 45
 46    lo  := 999999999;
 47    hi  := 0;
 48    s1 := localtimestamp;
 49    for i in 1 .. iter
 50    loop
 51      s2 := systimestamp;
 52      select user_id
 53      bulk collect into res
 54      from   user_likes
 55      where  post_id = p(i);
 56      delt := extract(second from localtimestamp - s2);
 57      if delt < lo then lo := delt; end if;
 58      if delt > hi then hi := delt; cnt := res.count; end if;
 59    end loop;
 60    delt := extract(second from localtimestamp - s1) + 60*extract(minute from localtimestamp - s1);
 61
 62    dbms_output.put_line(iter||' executions: All users for nominated post');
 63    dbms_output.put_line('=====================================');
 64    dbms_output.put_line('Total: '||delt);
 65    dbms_output.put_line('Avg: '||(delt/iter));
 66    dbms_output.put_line('Low: '||lo);
 67    dbms_output.put_line('Hi: '||hi);
 68
 69    lo  := 999999999;
 70    hi  := 0;
 71    s1 := localtimestamp;
 72    for i in 1 .. iter
 73    loop
 74      s2 := systimestamp;
 75      select user_id
 76      into resrow.user_id
 77      from   user_likes
 78      where  post_id = p(i)
 79      and    user_id = u(i);
 80      delt := extract(second from localtimestamp - s2);
 81      if delt < lo then lo := delt; end if;
 82      if delt > hi then hi := delt; end if;
 83    end loop;
 84    delt := extract(second from localtimestamp - s1) + 60*extract(minute from localtimestamp - s1);
 85
 86    dbms_output.put_line(iter||' executions: Lookup single post for a nominated user');
 87    dbms_output.put_line('=======================================');
 88    dbms_output.put_line('Total: '||delt);
 89    dbms_output.put_line('Avg: '||(delt/iter));
 90    dbms_output.put_line('Low: '||lo);
 91    dbms_output.put_line('Hi: '||hi);
 92
 93  end;
 94  /
50000 executions: All posts for nominated user
=====================================
Total: 214.157
Avg: .00428314
Low: 0
Hi: .246
50000 executions: All users for nominated post
=====================================
Total: 2602.467
Avg: .05204934
Low: .019
Hi: .371
50000 executions: Lookup single post for a nominated user
=======================================
Total: 9.169
Avg: .00018338
Low: 0
Hi: .003

Let’s take a moment to digest some of that information.

  • You can do ~5,400 single-row lookups (known user/post combo) per second, ie, 50,000 lookups in just over 9 seconds, or less than a millisecond per query!
  • Grabbing 10,000 posts for a known user took about 4ms on average, and at worst 246ms.
  • Grabbing 200,000 users for a known post took about 50ms on average, and at worst 371ms.

And I’m doing this on a four-year-old desktop PC!

So there we go. You can see that even as we scale to billions of rows, the compact branch structure of B-tree indexes means very efficient access to the data. When the volume of rows you need to retrieve via an index is small (and in this case, I’m even treating 200,000 rows as small!), then in most circumstances the overall size of the table is not an issue.
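If you want to convince yourself that these queries really are just short walks through an index, you can check the execution plan. Here’s a sketch for the single-row lookup (the bind names are placeholders, and this is not output from the benchmark above); you’d expect an INDEX UNIQUE SCAN of USER_LIKES_PK with no table access at all:


explain plan for
  select user_id
  from   user_likes
  where  user_id = :nominated_user
  and    post_id = :nominated_post;

select * from table(dbms_xplan.display());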

13 comments

  1. Cool post, it would also be interesting to see the implications of DML here (concurrency considerations), not just query performance 🙂

  2. Out of curiosity, how would query performance fare if your requirement was to return the number of likes for either a specific date or a range of dates?
    My assumption is that things will slow down here unless partitions are introduced on dt?

    1. Everything was done in serial, ie, no parallel load, because I’d rather not be smashing my CPU and SSDs repeatedly for tests. (I want some more years out of this machine :-)). I can’t be exactly sure, but I think it was about 30 mins to load, and 20 mins per index. (My db is not in archivelog mode.)

  3. I am also a senior person in IT. In my point of view, table size does matter, but another thing to watch is the number of queries hitting the table and whether they are handled properly. If the queries are not handled properly, then even after managing the size of the database you can still improve response times by indexing for those queries.

  4. Hi,
    1) Why did you add the dt field to the index?
    2) Indexes must be updated each time we add or delete a record in the table, so if we add or delete records more frequently than we do selects, are indexes still the right choice?

    1. 1) This is a common technique: add extra columns to the index so that a query does not need to visit the table at all during a read. It’s not something you would *always* do, but for certain application requirements it can be useful (see the sketch below).

      2) See the follow-up blog post, where we look at what impact indexes have on DML.
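
      To illustrate the first point with the table above: a query like the following can be answered entirely from USER_LIKES_IX, because post_id, user_id and dt are all present in that index (the bind name is just a placeholder):

      -- who liked this post, and when? No table access is required,
      -- because every referenced column is present in USER_LIKES_IX
      select user_id, dt
      from   user_likes
      where  post_id = :nominated_post;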
