I read an interesting article the other day about data modelling and data denormalization.
I’ll paraphrase the topic and requirement here (apologies to the original author for any omissions or excessive brevity).
We have a fictional application with chat rooms, people subscribing to those chat rooms, and messages being posted in those rooms. To satisfy this, we have the following data model.
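In outline: a USERS table (user_id, uname), a ROOMS table (room_id, rname), a ROOM_USERS table (room_id, user_id, last_viewed) recording each user’s subscriptions, and a MSGS table (msg_id, sender_user_id, room_id, tstamp, text) holding the messages, all built below.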
So for general queries, logging new messages and so on, that’s all fine. But there is an additional requirement. When user “X” logs on to the application, they want to see how many rooms that they subscribe to have unread messages, where “unread” is defined as: there is a message in the MSGS table for that room with a timestamp more recent than the LAST_VIEWED timestamp for that user/room combination in ROOM_USERS.
Now if the MSGS table has millions of records, and users check them infrequently, then a query along the lines of:
select count(distinct m.room_id)
from MSGS m,
ROOM_USERS ru
where m.tstamp > ru.last_viewed
and ru.user_id = ...
is perhaps a scalability threat. For the sake of this discussion, let’s assume it is.
The article then gets onto the topic of denormalization, and validly makes the point that this is a non-trivial task. I’m not disputing any of that content. The desired functionality is to be able to cheaply identify the number of unread rooms without scanning the MSGS table.
But there is also this statement
Oracle’s materialized views seem to offer some of the most advanced functionality out there, but are still very limited.
There’s no database in the market today that can implement our desired definition of User.numUnreadFields via a materialized view, as far as I know.
So I figured I would try to help out.
First, let’s build some objects to fit our model:
SQL> create table users ( user_id int primary key, uname varchar2(20));
Table created.
SQL> create table rooms ( room_id int primary key, rname varchar2(20));
Table created.
SQL> create table room_users (
2 room_id int ,
3 user_id int,
4 last_viewed date not null,
5 constraint room_users_pk primary key ( user_id, room_id ),
6 constraint room_users_fk_users foreign key ( user_id ) references users ( user_id ),
7 constraint room_users_fk_rooms foreign key ( room_id ) references rooms ( room_id )
8 );
Table created.
SQL> create table msgs (
2 msg_id int primary key,
3 sender_user_id int not null,
4 room_id int not null,
5 tstamp date not null,
6 text varchar2(100) not null,
7 constraint msgs_fk_membership foreign key ( sender_user_id ,room_id )
8 references room_users ( user_id , room_id)
9 );
Table created.
SQL> create index msgs_ix1 on msgs ( room_id );
Index created.
SQL> create index msgs_ix2 on msgs ( sender_user_id );
Index created.
SQL> create index msgs_ix3 on msgs ( tstamp );
Index created.
SQL> insert into users
2 select rownum, 'User '||rownum
3 from dual
4 connect by level <= 100;
100 rows created.
SQL> insert into rooms
2 select 10000+rownum, 'Room '||(10000+rownum)
3 from dual
4 connect by level <= 50;
50 rows created.
SQL> insert into room_users
2 select 10000+rownum, rownum, trunc(sysdate) from dual connect by level <= 10;
10 rows created.
SQL> insert into room_users
2 select 10000+rownum, rownum+2, trunc(sysdate-2) from dual connect by level <= 10;
10 rows created.
SQL> insert into room_users
2 select 10000+rownum, rownum+7, trunc(sysdate-7) from dual connect by level <= 10;
10 rows created.
So I’ve created some objects, with the constraints and typical indexes that I might expect to see in such an application. Here are our users, rooms, and the subscriptions:
SQL> select * from users;
USER_ID UNAME
---------- --------------------
1 User 1
2 User 2
3 User 3
4 User 4
5 User 5
6 User 6
7 User 7
8 User 8
9 User 9
10 User 10
11 User 11
12 User 12
13 User 13
14 User 14
15 User 15
16 User 16
17 User 17
18 User 18
19 User 19
20 User 20
21 User 21
22 User 22
23 User 23
24 User 24
25 User 25
26 User 26
27 User 27
28 User 28
29 User 29
30 User 30
31 User 31
32 User 32
33 User 33
34 User 34
35 User 35
36 User 36
37 User 37
38 User 38
39 User 39
40 User 40
41 User 41
42 User 42
43 User 43
44 User 44
45 User 45
46 User 46
47 User 47
48 User 48
49 User 49
50 User 50
51 User 51
52 User 52
53 User 53
54 User 54
55 User 55
56 User 56
57 User 57
58 User 58
59 User 59
60 User 60
61 User 61
62 User 62
63 User 63
64 User 64
65 User 65
66 User 66
67 User 67
68 User 68
69 User 69
70 User 70
71 User 71
72 User 72
73 User 73
74 User 74
75 User 75
76 User 76
77 User 77
78 User 78
79 User 79
80 User 80
81 User 81
82 User 82
83 User 83
84 User 84
85 User 85
86 User 86
87 User 87
88 User 88
89 User 89
90 User 90
91 User 91
92 User 92
93 User 93
94 User 94
95 User 95
96 User 96
97 User 97
98 User 98
99 User 99
100 User 100
100 rows selected.
SQL> select * from rooms;
ROOM_ID RNAME
---------- --------------------
10001 Room 10001
10002 Room 10002
10003 Room 10003
10004 Room 10004
10005 Room 10005
10006 Room 10006
10007 Room 10007
10008 Room 10008
10009 Room 10009
10010 Room 10010
10011 Room 10011
10012 Room 10012
10013 Room 10013
10014 Room 10014
10015 Room 10015
10016 Room 10016
10017 Room 10017
10018 Room 10018
10019 Room 10019
10020 Room 10020
10021 Room 10021
10022 Room 10022
10023 Room 10023
10024 Room 10024
10025 Room 10025
10026 Room 10026
10027 Room 10027
10028 Room 10028
10029 Room 10029
10030 Room 10030
10031 Room 10031
10032 Room 10032
10033 Room 10033
10034 Room 10034
10035 Room 10035
10036 Room 10036
10037 Room 10037
10038 Room 10038
10039 Room 10039
10040 Room 10040
10041 Room 10041
10042 Room 10042
10043 Room 10043
10044 Room 10044
10045 Room 10045
10046 Room 10046
10047 Room 10047
10048 Room 10048
10049 Room 10049
10050 Room 10050
50 rows selected.
SQL> select * from room_users;
ROOM_ID USER_ID LAST_VIEW
---------- ---------- ---------
10001 1 06-OCT-16
10002 2 06-OCT-16
10003 3 06-OCT-16
10004 4 06-OCT-16
10005 5 06-OCT-16
10006 6 06-OCT-16
10007 7 06-OCT-16
10008 8 06-OCT-16
10009 9 06-OCT-16
10010 10 06-OCT-16
10001 3 04-OCT-16
10002 4 04-OCT-16
10003 5 04-OCT-16
10004 6 04-OCT-16
10005 7 04-OCT-16
10006 8 04-OCT-16
10007 9 04-OCT-16
10008 10 04-OCT-16
10009 11 04-OCT-16
10010 12 04-OCT-16
10001 8 29-SEP-16
10002 9 29-SEP-16
10003 10 29-SEP-16
10004 11 29-SEP-16
10005 12 29-SEP-16
10006 13 29-SEP-16
10007 14 29-SEP-16
10008 15 29-SEP-16
10009 16 29-SEP-16
10010 17 29-SEP-16
30 rows selected.
SQL>
Now let’s add some messages, with timestamps spanning the last month or so:
SQL> insert into msgs
2 select rownum,
3 mod(rownum,10)+1,
4 10000+mod(rownum,10)+1,
5 sysdate-32+rownum/10,
6 'text '||rownum
7 from dual
8 connect by level <= 300;
300 rows created.
SQL>
SQL>
SQL> insert into msgs
2 select 1000+rownum,
3 2+mod(rownum,10)+1,
4 10000+mod(rownum,10)+1,
5 sysdate-32+rownum/10,
6 'text '||(rownum+1000)
7 from dual
8 connect by level <= 300;
300 rows created.
SQL>
SQL>
SQL> insert into msgs
2 select 2000+rownum,
3 7+mod(rownum,10)+1,
4 10000+mod(rownum,10)+1,
5 sysdate-32+rownum/10,
6 'text '||(rownum+2000)
7 from dual
8 connect by level <= 300;
300 rows created.
Let’s now see if we can get a materialized view to help us out. Obviously we want this view to be fast refreshable, so first we’ll put materialized view logs on the appropriate tables:
SQL> create materialized view log on msgs
2 with primary key, rowid,
3 sequence (room_id, sender_user_id,tstamp) including new values;
Materialized view log created.
SQL> create materialized view log on room_users
2 with primary key, rowid,
3 sequence (last_viewed) including new values;
Materialized view log created.
Now let’s add a materialized view to capture the information we need:
SQL> create materialized view MSGS_MV refresh fast on commit as
2 select
3 ru.room_id,
4 ru.user_id,
5 ru.last_viewed,
6 count(*) msg_count,
7 count(case when m.tstamp > ru.last_viewed then 1 end ) has_unread
8 from msgs m,
9 room_users ru
10 where m.room_id = ru.room_id
11 group by
12 ru.room_id,
13 ru.user_id,
14 ru.last_viewed;
Materialized view created.
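As a quick aside, if you ever want to verify which refresh mechanisms a materialized view actually supports, DBMS_MVIEW.EXPLAIN_MVIEW will tell you. A minimal sketch (it assumes you have created MV_CAPABILITIES_TABLE via the utlxmv.sql script shipped with the database):
exec dbms_mview.explain_mview('MSGS_MV')
select capability_name, possible
from   mv_capabilities_table
where  capability_name like 'REFRESH_FAST%';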
So we’ve got a row for each room/user pairing, plus we are bringing the LAST_VIEWED attribute along for the ride. The COUNT(*) is not needed by our application, but is required to make the materialized view fast refreshable. The second count is a count of the unread messages in this room for this user. Note that this does NOT satisfy the requirement yet, because we are looking for a ROOM count, not a MESSAGE count. But let’s look at this view in action now.
SQL> select
2 mv.room_id,
3 mv.user_id,
4 mv.last_viewed,
5 case when mv.has_unread > 0 then 'Y' end unread,
6 r.rname,
7 u.uname
8 from MSGS_MV mv,
9 users u,
10 rooms r
11 where mv.user_id = u.user_id
12 and mv.room_id = r.room_id
13 order by 2,1;
ROOM_ID USER_ID LAST_VIEW U RNAME UNAME
---------- ---------- --------- - -------------------- --------------------
10001 1 06-OCT-16 Room 10001 User 1
10002 2 06-OCT-16 Room 10002 User 2
10001 3 04-OCT-16 Y Room 10001 User 3
10003 3 06-OCT-16 Room 10003 User 3
10002 4 04-OCT-16 Room 10002 User 4
10004 4 06-OCT-16 Room 10004 User 4
10003 5 04-OCT-16 Room 10003 User 5
10005 5 06-OCT-16 Room 10005 User 5
10004 6 04-OCT-16 Room 10004 User 6
10006 6 06-OCT-16 Room 10006 User 6
10005 7 04-OCT-16 Y Room 10005 User 7
10007 7 06-OCT-16 Room 10007 User 7
10001 8 29-SEP-16 Y Room 10001 User 8
10006 8 04-OCT-16 Y Room 10006 User 8
10008 8 06-OCT-16 Room 10008 User 8
10002 9 29-SEP-16 Y Room 10002 User 9
10007 9 04-OCT-16 Y Room 10007 User 9
10009 9 06-OCT-16 Room 10009 User 9
10003 10 29-SEP-16 Y Room 10003 User 10
10008 10 04-OCT-16 Y Room 10008 User 10
10010 10 06-OCT-16 Room 10010 User 10
10004 11 29-SEP-16 Y Room 10004 User 11
10009 11 04-OCT-16 Y Room 10009 User 11
10005 12 29-SEP-16 Y Room 10005 User 12
10010 12 04-OCT-16 Y Room 10010 User 12
10006 13 29-SEP-16 Y Room 10006 User 13
10007 14 29-SEP-16 Y Room 10007 User 14
10008 15 29-SEP-16 Y Room 10008 User 15
10009 16 29-SEP-16 Y Room 10009 User 16
10010 17 29-SEP-16 Y Room 10010 User 17
30 rows selected.
With a simple CASE expression, we can now see, for each user/room combination, which rooms are “unread”. Normally we would run such a query for a given user_id, but I’m showing all of the data here just for explanation purposes. From here, satisfying our room count requirement is trivial:
SQL> select user_id, count(unread)
2 from
3 (
4 select
5 mv.room_id,
6 mv.user_id,
7 mv.last_viewed,
8 case when mv.has_unread > 0 then 'Y' end unread,
9 r.rname,
10 u.uname
11 from MSGS_MV mv,
12 users u,
13 rooms r
14 where mv.user_id = u.user_id
15 and mv.room_id = r.room_id
16 )
17 group by user_id
18 order by 1;
USER_ID COUNT(UNREAD)
---------- -------------
1 0
2 0
3 1
4 0
5 0
6 0
7 1
8 2
9 2
10 2
11 2
12 2
13 1
14 1
15 1
16 1
17 1
17 rows selected.
And there we have it. Once again I’m showing ALL users here, but the same applies for any single user. We have the unread room count using the materialized view and not from the MSGS table.
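For a single user, a sketch of the same query (using a hypothetical :current_user bind variable):
select count(unread)
from (
  select case when mv.has_unread > 0 then 'Y' end unread
  from msgs_mv mv
  where mv.user_id = :current_user
);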
So how do we keep this materialized view in sync with the detail tables? Well… that’s already taken care of, because it is fast refreshable on commit. We can see from the two outputs above that user 8 has two unread rooms, namely rooms 10001 and 10006. So let’s have her log in and jump into chat room 10001.
SQL> update room_users
2 set last_viewed = sysdate
3 where room_id = 10001
4 and user_id = 8;
1 row updated.
SQL>
SQL> commit;
Commit complete.
And now let’s take a look at our unread room query:
SQL> select user_id, count(unread)
2 from
3 (
4 select
5 mv.room_id,
6 mv.user_id,
7 mv.last_viewed,
8 case when mv.has_unread > 0 then 'Y' end unread,
9 r.rname,
10 u.uname
11 from MSGS_MV mv,
12 users u,
13 rooms r
14 where mv.user_id = u.user_id
15 and mv.room_id = r.room_id
16 )
17 group by user_id
18 order by 1;
USER_ID COUNT(UNREAD)
---------- -------------
1 0
2 0
3 1
4 0
5 0
6 0
7 1
8 1 <===
9 2
10 2
11 2
12 2
13 1
14 1
15 1
16 1
17 1
17 rows selected.
As we expect, the unread count for user 8 has dropped by 1. Now she jumps rooms…
SQL> update room_users
2 set last_viewed = sysdate
3 where room_id = 10006
4 and user_id = 8;
1 row updated.
SQL>
SQL> commit;
Commit complete.
SQL>
SQL> select user_id, count(unread)
2 from
3 (
4 select
5 mv.room_id,
6 mv.user_id,
7 mv.last_viewed,
8 case when mv.has_unread > 0 then 'Y' end unread,
9 r.rname,
10 u.uname
11 from MSGS_MV mv,
12 users u,
13 rooms r
14 where mv.user_id = u.user_id
15 and mv.room_id = r.room_id
16 )
17 group by user_id
18 order by 1;
USER_ID COUNT(UNREAD)
---------- -------------
1 0
2 0
3 1
4 0
5 0
6 0
7 1
8 0 <===
9 2
10 2
11 2
12 2
13 1
14 1
15 1
16 1
17 1
17 rows selected.
And as we can see, we have cleared her room status. It’s all been handled by the database, and the application does not need to do any of the “heavy lifting”.
Just to reiterate, I’m not trying to denounce the author’s statements – denormalization IS hard to get right, especially if you try to hand craft it in a multi-user environment. It’s very easy to end up with something that appears to work but silently corrupts the derived data over time.
But with a little care and thought, you’ll be amazed at what materialized views in Oracle are capable of.
Great write-up Connor!
Just wanted to share the first thought I got while reading the requirement “see how many rooms that they subscribe to have unread messages” and the query used to tackle that.
The way that query was ‘worded’ could indeed end up being a scalability issue.
But a more natural way of expressing in SQL exactly what was worded in English would be:
select count(*)
from ROOM_USERS ru
where ru.user_id = :current_user
and exists
      (select 'a more recent msg'
       from MSGS m
       where m.room_id = ru.room_id
       and m.tstamp > ru.last_viewed)
Now, assuming an index on MSGS(ROOM_ID, TSTAMP) is available for this business requirement, a plan involving a SEMI-join is likely to be generated by the CBO, and that should perform very well here.
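(For reference, a sketch of that supporting index; the name is hypothetical:)
create index msgs_ix4 on msgs ( room_id, tstamp );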
🙂
Agreed, and in defence of the original author, I *think* your query is the one they suggested using, albeit with the annotation:
“HOLY MOTHER, that query has quite a few moving parts, including scans and joins on multiple data types”
REFRESH FAST ON COMMIT is a great solution unless you use database links, in which case you get this:
SELECT * FROM dual@db_link;
UPDATE msgs
SET tstamp = tstamp + 1
WHERE ROWNUM = 1;
COMMIT ;
Error at line 17
ORA-02050: transaction 45.9.165086 rolled back, some remote DBs may be in-doubt
ORA-02051: another session or branch in same transaction failed or finalized
Not just DB links, any distributed transaction…
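(If you are caught by this, one workaround sketch is to trade the on-commit immediacy for an on-demand fast refresh driven by a job:)
alter materialized view MSGS_MV refresh fast on demand;
begin
  -- pick up the outstanding MV log entries with a fast ('f') refresh
  dbms_mview.refresh('MSGS_MV', method => 'f');
end;
/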
Also fast refresh MVs can sometimes do horrid things to your commit times.
https://orastory.wordpress.com/2014/11/27/the-mess-that-is-fast-refresh-join-only-materialized-views/
Connor, thanks for this awesome writeup! I didn’t realize Oracle was this capable.
I would love to explore if Oracle’s materialized views can model a full “denormalization graph”. Can we start from a single table modeling the global stream of events (as in Event Sourcing), then declare all the tables we’ve been talking about (User, Room, Message, RoomUser) to be materialized views?
Then the MV for User.numUnreadViews is a second step of denormalization.
My SQL skills are weak, so I’d really appreciate if you can comment on whether that’s feasible and show what that would look like.
The WAL (Write Ahead Log) is already performing event sourcing behind the scenes. The DB engine writes changes to the log immediately, so that the log can be replayed or verified to reproduce the tables. In essence, the tables are materialized views of the WAL.
Sure but I want to *start* from a model of the global stream of events. If you just look at the WAL, it doesn’t look like the stream of event objects that I want to start with.
Hi Liron,
Thanks for stopping by. Theoretically I imagine you could create a system where you have a single stream of data coming in from a source, and farm materialized views off that.
But as Sam points out, that’s somewhat akin to reinventing the core functionality of most databases, namely, a stream of “change vectors”, with the tables being a “point in time” representation of that stream. There’s a big difference between “doable” and “best way to do it” 🙂
Probably the most common implementation of such a strategy is as per Dan’s comment – events come in as queue messages (whether it’s AQ or some other tech doesn’t really matter; the point is having a guaranteed-delivery, ordered, consistent messaging system) with appropriate daemons or similar processing or farming out the work.
A materialized view is almost an internalisation of that very concept – the materialized view logs are the stream, and the refresh (either on commit or on demand) is the “downstream” processing.
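You can even peek at that stream directly; the materialized view log we created on MSGS is just a table called MLOG$_MSGS. A sketch (note that with on-commit refresh the log is purged as part of each refresh, so typically you will only catch rows from an in-flight transaction):
select room_id, sender_user_id, tstamp, dmltype$$
from   mlog$_msgs;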
Cheers,
Connor
I agree with everything you’re saying, but I’m still looking to figure out a Functional Reactive Programming paradigm for the whole database. Materialized Views are good FRP, but manually piping streams around or dealing with logs is not.
(I have done a lot of programming on the client UI side. React (and MobX) have been a revolutionary change in the last few years, and totally sold me on Functional Reactive Programming. I believe that FRP style is destined to extend from the client back to the database layer too.)
So to humor me… do you think I can realistically architect an Oracle SQL database which is one normal table (the Events table) and then 2 or 3 layers of materialized views?
architect ? yes
realistically ? no 🙂
The chain of work to do during commit processing would be too large, which means you would now be relaxing your data consistency requirements. And once it becomes asynchronous, I’m not sure we’re seeing a lot of benefit from materialized views over a standard message-processing implementation. Perhaps a little less coding?
Haha gotcha thanks. Btw I’m looking at Event Store too: https://groups.google.com/forum/#!topic/event-store/—1gZyGJy0
When I hear “Event Sourcing,” I think of Advanced Queues, another tool in the Oracle toolbox. In 12c they switched to JMS sharded queues for much better throughput and scalability:
See Oracle’s white paper: jms-wp-2533807.pdf
These are database backed queues that can be used to provide a degree of eventual consistency when necessary.
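(To make that concrete, a minimal classic AQ setup sketch; the object names here are hypothetical, and sharded queues are likewise configured through DBMS_AQADM:)
begin
  dbms_aqadm.create_queue_table(
    queue_table        => 'chat_events_qt',
    queue_payload_type => 'RAW');
  dbms_aqadm.create_queue(
    queue_name  => 'chat_events_q',
    queue_table => 'chat_events_qt');
  dbms_aqadm.start_queue('chat_events_q');
end;
/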
A practical problem with this MV solution is scalability. When one session inserts a row into MSGS and commits, it will refresh the MV, and other sessions doing the same simultaneously will wait on “enq: JI – contention”. So hopefully the chat rooms are all quiet ones.
http://rwijk.blogspot.nl/2010/01/enq-ji-contention.html
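(A sketch of how you might check whether this wait is showing up on your system:)
select event, total_waits, time_waited
from   v$system_event
where  event = 'enq: JI - contention';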
Thanks Connor,
I added ENABLE QUERY REWRITE to the materialized view (MV) definition so that I did not have to reference the MV directly.
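(For anyone following along, a sketch of enabling that on the existing MV:)
alter materialized view MSGS_MV enable query rewrite;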
I know you are well aware of this capability, but I thought it interesting to note that not only does your design remove the scan of MSGS by use of the MV, its implementation may also have little to no impact on existing code.
Can you tell me if this method (query rewrite) of implementation is robust for this scenario, i.e. are there circumstances where refresh fast on commit requires direct reference to the MV?