Quick and easy masking

Posted by

I had a request from a client a while back regarding masking of data. They had an application with sensitive data in the Production environment (where access and audit were very tightly controlled) but the issue was how to respect that sensitivity in non-Production environments whilst still preserving full size data sizes for application testing.

After some conversations about requirements, it turned out that since (even in non-Production environments) all access to application components was logged and audited, the issue was simply protecting against “inadvertent” access to sensitive data. For example, in the application, if I searched for “males with black hair with an interest in technology” I should never see the name “Connor McDonald” on screen in the results, because simply viewing that data could be treated as a breach of privacy.

Addenda: Anyone adding a comment to this post with “Shouldn’t that be males with black hair with a hint of grey” will be banned forever Smile

I suggested Data Redaction as a potential solution, but they wanted the data actually obfuscated within the database. Then I suggested Data Masking which definitely does the job, but they felt this offered far more than their simple requirement of just obfuscate the data and retain the distribution of data to preserve performance characteristics as much as possible.

So ultimately here is the basis of a little routine I created for them to obfuscate their data to satisfy their requirement of not viewing meaningful data in their non-Production environments. Please read the footnote at the end of this post before you consider using this code. The logic here is straight forward

  • Take the alphanumeric characters in a logical order as a single string.
  • Use DBMS_RANDOM to come up with 255 variations of that string and store them in an array.
  • Use the TRANSLATE function to perform a simple obfuscation of the input string.

That translation could then be used during the process of copying data from Production, or as a post-copy task.

First I’ll demo the concept of coming up with randomised strings. Here’s a simple anonymous block to create 10 random strings based on a base string of: ABC….Zabc…z0123456789


SQL> set serverout on
SQL> declare
  2    type rand_list is table of varchar2(62) index by pls_integer;
  3    source varchar2(62) := 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
  4    perms  rand_list;
  5    source_variant varchar2(62);
  6  begin
  7    for i in 1 .. 10 loop
  8      source_variant := null;
  9      for j in ( select rownum r from dual connect by level <= 52 order by dbms_random.value )
 10      loop
 11        source_variant := source_variant || substr(source,j.r,1);
 12      end loop;
 13      for j in ( select rownum r from dual connect by level <= 10 order by dbms_random.value )
 14      loop
 15        source_variant := source_variant || substr(source,52+j.r,1);
 16      end loop;
 17      perms(i) := source_variant;
 18      dbms_output.put_line(source_variant);
 19    end loop;
 20  end;
 21  /
xrLCuowbNfvXlMJSBKctPyheZWTnQUjYAdizkFVOIHGsmDRpaEqg5637918402
OkVxApdsLqTlyvgYXQGaSeNPMCjZEJhHRtbcIBrowKUunifzDFmW8013742695
gcVxtNeMdBuRJhvYrnkCIWzoyFwfLpqSATabQGUZKmDOEilPHsXj5601892743
vOUuisITyJjbaQLqfnpFVBoCrYtzHeDAmWExPkNwgScXKRlZGMhd3104879256
VLZkXndmytCWRwKeNgYifvTESzGFhHcqaMJPoljxUsAQrubOpBDI7302189564
VfPLKmObZzBMCpgrjUaQYieRNJykDwtnuWXFxdTEIlScqvohAGsH0138924756
ABMkfozebljnPGqHJhpNwxYEUCuQLyDrOasIgtFZKVmciTXSdvRW5702139864
iNjrmVTlyzFSLswfBvaWKUAHIZOcgMPYXCGqeoDJbhpktERdQunx1976045328
hiYGzxMEZFPcmkutbDlrCeynSITKHNJaXAwfpRsdqLjvQUoWVOgB0479632158
uZpPXOHLAvzricdjtqSQBKVhYoFfmUxEDNCwWlIJsbRgkMeGynaT6451328790

So a simple obfuscation could be:


SQL> select
  2    translate(
  3      'Connor McDonald',
  4      'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
  5      'ehZAFjmtYwWilTRDnCcLrvQqEpakKGPMxXyudVUNsoJOBfzSIbHg2163095784')
  6  from dual;

TRANSLATE('CONN
---------------
ZsNNsO lKAsNaVG

Armed with this, I can create a package which extends this to choose from 255 random strings and use these as a simple obfuscator for source data.


SQL> create or replace
  2  package masker is
  3    function rand_masks(p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return  sys.odcivarchar2list pipelined;
  4    function standard_source return varchar2;
  5    function idx_entry(p_idx int, p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return varchar2;
  6    function mask(p_input varchar2) return varchar2;
  7  end;
  8  /

Package created.

SQL> create or replace
  2  package body masker is
  3    perms sys.odcivarchar2list := sys.odcivarchar2list();
  4
  5  procedure init(p_size int,p_seed int) is
  6    source varchar2(62) := 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
  7    source_variant varchar2(62);
  8  begin
  9    dbms_random.seed(p_seed);
 10    perms.delete;
 11    perms.extend(p_size);
 12    for i in 1 .. p_size loop
 13      source_variant := null;
 14      for j in ( select rownum r from dual connect by level <= 52 order by dbms_random.value )
 15      loop
 16        source_variant := source_variant || substr(source,j.r,1);
 17      end loop;
 18      for j in ( select rownum r from dual connect by level <= 10 order by dbms_random.value )
 19      loop
 20        source_variant := source_variant || substr(source,52+j.r,1);
 21      end loop;
 22      perms(i) := source_variant;
 23  --      dbms_output.put_line(source_variant);
 24    end loop;
 25  end;
 26
 27  function rand_masks(p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return sys.odcivarchar2list pipelined is
 28  begin
 29    if perms.count < p_size or p_refresh in ('Y','y') then
 30      init(p_size,p_seed);
 31    end if;
 32
 33    for i in 1 .. p_size
 34    loop
 35      pipe row ( perms(i));
 36    end loop;
 37
 38    return;
 39  end;
 40
 41  function standard_source return varchar2 is
 42  begin
 43     return 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
 44  end;
 45
 46  function idx_entry(p_idx int, p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return varchar2 is
 47  begin
 48    if perms.count < p_size or p_refresh in ('Y','y') then
 49      init(p_size,p_seed);
 50    end if;
 51    return perms(p_idx);
 52  end;
 53
 54  function mask(p_input varchar2) return varchar2 is
 55  begin
 56    return translate(p_input,masker.standard_source,masker.idx_entry(ascii(p_input)));
 57  end;
 58
 59  end;
 60  /

Package body created.

Now I’ll test this with some sample data


SQL> create table t ( pk int, n varchar2(50));

Table created.

SQL>
SQL> insert into t values (1,'Connor McDonald');

1 row created.

SQL> insert into t values (2,'Chris Saxon');

1 row created.

SQL> insert into t values (3,'Maria Colgan');

1 row created.

SQL> insert into t values (4,'Bobby Curtis');

1 row created.

SQL> insert into t values (5,'Joel Kallman');

1 row created.

SQL> insert into t values (6,'Steven Feuerstein');

1 row created.

SQL> insert into t values (7,'Connor McDonald');

1 row created.

SQL> select pk, n, masker.mask(n) diddled
  2  from t;

        PK N                                                  DIDDLED
---------- -------------------------------------------------- ------------------------------
         1 Connor McDonald                                    sAJJAe CvnAJjWt
         2 Chris Saxon                                        sweOy RjrAJ
         3 Maria Colgan                                       czEJz BhMbzm
         4 Bobby Curtis                                       nkjjI EpzLBS
         5 Joel Kallman                                       oYfi luiiIuj
         6 Steven Feuerstein                                  CyUrUE SUtUWQyUXE
         7 Connor McDonald                                    sAJJAe CvnAJjWt

7 rows selected.

There we go! A simple obfuscator that runs quite efficiently on source data whilst preserving the distribution of the data.

Footnote: Remember that any masking scheme that does not completely randomize the obfuscation of  source data is not a complete encryption or security solution. Most mechanisms to deduce source data from a manipulated variant of that data involves distribution analysis of letters, n-grams, words and phrases. The moment you preserve distribution of source data for (say) performance and/or optimizer plan preservation, you are allowing for source data to be derived if there is enough of it available to perform that analysis. Remember that this is not a substitute for appropriate security and encryption protections.

Got some thoughts? Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.