I had a request from a client a while back regarding masking of data. They had an application with sensitive data in the Production environment (where access and audit were very tightly controlled) but the issue was how to respect that sensitivity in non-Production environments whilst still preserving full size data sizes for application testing.
After some conversations about requirements, it turned out that since (even in non-Production environments) all access to application components was logged and audited, the issue was simply protecting against “inadvertent” access to sensitive data. For example, in the application, if I searched for “males with black hair with an interest in technology” I should never see the name “Connor McDonald” on screen in the results, because simply viewing that data could be treated as a breach of privacy.
Addenda: Anyone adding a comment to this post with “Shouldn’t that be males with black hair with a hint of grey” will be banned forever
I suggested Data Redaction as a potential solution, but they wanted the data actually obfuscated within the database. Then I suggested Data Masking which definitely does the job, but they felt this offered far more than their simple requirement of just obfuscate the data and retain the distribution of data to preserve performance characteristics as much as possible.
So ultimately here is the basis of a little routine I created for them to obfuscate their data to satisfy their requirement of not viewing meaningful data in their non-Production environments. Please read the footnote at the end of this post before you consider using this code. The logic here is straight forward
- Take the alphanumeric characters in a logical order as a single string.
- Use DBMS_RANDOM to come up with 255 variations of that string and store them in an array.
- Use the TRANSLATE function to perform a simple obfuscation of the input string.
That translation could then be used during the process of copying data from Production, or as a post-copy task.
First I’ll demo the concept of coming up with randomised strings. Here’s a simple anonymous block to create 10 random strings based on a base string of: ABC….Zabc…z0123456789
SQL> set serverout on
SQL> declare
2 type rand_list is table of varchar2(62) index by pls_integer;
3 source varchar2(62) := 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
4 perms rand_list;
5 source_variant varchar2(62);
6 begin
7 for i in 1 .. 10 loop
8 source_variant := null;
9 for j in ( select rownum r from dual connect by level <= 52 order by dbms_random.value )
10 loop
11 source_variant := source_variant || substr(source,j.r,1);
12 end loop;
13 for j in ( select rownum r from dual connect by level <= 10 order by dbms_random.value )
14 loop
15 source_variant := source_variant || substr(source,52+j.r,1);
16 end loop;
17 perms(i) := source_variant;
18 dbms_output.put_line(source_variant);
19 end loop;
20 end;
21 /
xrLCuowbNfvXlMJSBKctPyheZWTnQUjYAdizkFVOIHGsmDRpaEqg5637918402
OkVxApdsLqTlyvgYXQGaSeNPMCjZEJhHRtbcIBrowKUunifzDFmW8013742695
gcVxtNeMdBuRJhvYrnkCIWzoyFwfLpqSATabQGUZKmDOEilPHsXj5601892743
vOUuisITyJjbaQLqfnpFVBoCrYtzHeDAmWExPkNwgScXKRlZGMhd3104879256
VLZkXndmytCWRwKeNgYifvTESzGFhHcqaMJPoljxUsAQrubOpBDI7302189564
VfPLKmObZzBMCpgrjUaQYieRNJykDwtnuWXFxdTEIlScqvohAGsH0138924756
ABMkfozebljnPGqHJhpNwxYEUCuQLyDrOasIgtFZKVmciTXSdvRW5702139864
iNjrmVTlyzFSLswfBvaWKUAHIZOcgMPYXCGqeoDJbhpktERdQunx1976045328
hiYGzxMEZFPcmkutbDlrCeynSITKHNJaXAwfpRsdqLjvQUoWVOgB0479632158
uZpPXOHLAvzricdjtqSQBKVhYoFfmUxEDNCwWlIJsbRgkMeGynaT6451328790
So a simple obfuscation could be:
SQL> select
2 translate(
3 'Connor McDonald',
4 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
5 'ehZAFjmtYwWilTRDnCcLrvQqEpakKGPMxXyudVUNsoJOBfzSIbHg2163095784')
6 from dual;
TRANSLATE('CONN
---------------
ZsNNsO lKAsNaVG
Armed with this, I can create a package which extends this to choose from 255 random strings and use these as a simple obfuscator for source data.
SQL> create or replace
2 package masker is
3 function rand_masks(p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return sys.odcivarchar2list pipelined;
4 function standard_source return varchar2;
5 function idx_entry(p_idx int, p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return varchar2;
6 function mask(p_input varchar2) return varchar2;
7 end;
8 /
Package created.
SQL> create or replace
2 package body masker is
3 perms sys.odcivarchar2list := sys.odcivarchar2list();
4
5 procedure init(p_size int,p_seed int) is
6 source varchar2(62) := 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
7 source_variant varchar2(62);
8 begin
9 dbms_random.seed(p_seed);
10 perms.delete;
11 perms.extend(p_size);
12 for i in 1 .. p_size loop
13 source_variant := null;
14 for j in ( select rownum r from dual connect by level <= 52 order by dbms_random.value )
15 loop
16 source_variant := source_variant || substr(source,j.r,1);
17 end loop;
18 for j in ( select rownum r from dual connect by level <= 10 order by dbms_random.value )
19 loop
20 source_variant := source_variant || substr(source,52+j.r,1);
21 end loop;
22 perms(i) := source_variant;
23 -- dbms_output.put_line(source_variant);
24 end loop;
25 end;
26
27 function rand_masks(p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return sys.odcivarchar2list pipelined is
28 begin
29 if perms.count < p_size or p_refresh in ('Y','y') then
30 init(p_size,p_seed);
31 end if;
32
33 for i in 1 .. p_size
34 loop
35 pipe row ( perms(i));
36 end loop;
37
38 return;
39 end;
40
41 function standard_source return varchar2 is
42 begin
43 return 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
44 end;
45
46 function idx_entry(p_idx int, p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return varchar2 is
47 begin
48 if perms.count < p_size or p_refresh in ('Y','y') then
49 init(p_size,p_seed);
50 end if;
51 return perms(p_idx);
52 end;
53
54 function mask(p_input varchar2) return varchar2 is
55 begin
56 return translate(p_input,masker.standard_source,masker.idx_entry(ascii(p_input)));
57 end;
58
59 end;
60 /
Package body created.
Now I’ll test this with some sample data
SQL> create table t ( pk int, n varchar2(50));
Table created.
SQL>
SQL> insert into t values (1,'Connor McDonald');
1 row created.
SQL> insert into t values (2,'Chris Saxon');
1 row created.
SQL> insert into t values (3,'Maria Colgan');
1 row created.
SQL> insert into t values (4,'Bobby Curtis');
1 row created.
SQL> insert into t values (5,'Joel Kallman');
1 row created.
SQL> insert into t values (6,'Steven Feuerstein');
1 row created.
SQL> insert into t values (7,'Connor McDonald');
1 row created.
SQL> select pk, n, masker.mask(n) diddled
2 from t;
PK N DIDDLED
---------- -------------------------------------------------- ------------------------------
1 Connor McDonald sAJJAe CvnAJjWt
2 Chris Saxon sweOy RjrAJ
3 Maria Colgan czEJz BhMbzm
4 Bobby Curtis nkjjI EpzLBS
5 Joel Kallman oYfi luiiIuj
6 Steven Feuerstein CyUrUE SUtUWQyUXE
7 Connor McDonald sAJJAe CvnAJjWt
7 rows selected.
There we go! A simple obfuscator that runs quite efficiently on source data whilst preserving the distribution of the data.
Footnote: Remember that any masking scheme that does not completely randomize the obfuscation of source data is not a complete encryption or security solution. Most mechanisms to deduce source data from a manipulated variant of that data involves distribution analysis of letters, n-grams, words and phrases. The moment you preserve distribution of source data for (say) performance and/or optimizer plan preservation, you are allowing for source data to be derived if there is enough of it available to perform that analysis. Remember that this is not a substitute for appropriate security and encryption protections.