Transformation Inheritance¶
Description¶
If you have partitioned tables or want to apply a transformation to a primary key and propagate it to all tables referencing that column, you can do so with Greenmask.
Apply for inherited¶
Using apply_for_inherited
, you can apply transformations to all partitions of a partitioned table, including any
subpartitions.
Configuration conflicts¶
When a partition has a transformation defined manually via config, and apply_for_inherited
is set on the parent table,
Greenmask will merge both the inherited and manually defined configurations. The manually defined transformation will
execute last, giving it higher priority.
If this situation occurs, you will see the following information in the log:
{
"level": "info",
"ParentTableSchema": "public",
"ParentTableName": "sales",
"ChildTableSchema": "public",
"ChildTableName": "sales_2022_feb",
"ChildTableConfig": [
{
"name": "RandomDate",
"params": {
"column": "sale_date",
"engine": "random",
"max": "2005-01-01",
"min": "2001-01-01"
}
}
],
"time": "2024-11-03T22:14:01+02:00",
"message": "config will be merged: found manually defined transformers on the partitioned table"
}
Apply for references¶
Using apply_for_references
, you can apply transformations to columns involved in a primary key or in tables with a
foreign key that references that column. This simplifies the transformation process by requiring you to define the
transformation only on the primary key column, which will then be applied to all tables referencing that column.
The transformer must be deterministic or support hash
engine and the hash
engin must be set in the
configuration file.
List of transformers that supports apply_for_references
:
- Hash
- NoiseDate
- NoiseFloat
- NoiseInt
- NoiseNumeric
- RandomBool
- RandomDate
- RandomEmail
- RandomFloat
- RandomInt
- RandomIp
- RandomMac
- RandomNumeric
- RandomString
- RandomUuid
- RandomUnixTimestamp
End-to-End Identifiers¶
End-to-end identifiers in databases are unique identifiers that are consistently used across multiple tables in a relational database schema, allowing for a seamless chain of references from one table to another. These identifiers typically serve as primary keys in one table and are propagated as foreign keys in other tables, creating a direct, traceable link from one end of a data relationship to the other.
Greenmask can detect end-to-end identifiers and apply transformations across the entire sequence of tables. These identifiers are detected when the following condition is met: the foreign key serves as both a primary key and a foreign key in the referenced table.
Configuration conflicts¶
When on the referenced column a transformation is manually defined via config, and the apply_for_references
is set on
parent table, the transformation defined will be chosen and the inherited transformation will be ignored. You will
receive a INFO
message in the logs.
{
"level": "info",
"TransformerName": "RandomInt",
"ParentTableSchema": "public",
"ParentTableName": "tablea",
"ChildTableSchema": "public",
"ChildTableName": "tablec",
"ChildColumnName": "id2",
"TransformerConfig": {
"name": "RandomInt",
"apply_for_references": true
},
"time": "2024-11-03T21:28:10+02:00",
"message": "skipping apply transformer for reference: found manually configured transformer"
}
Limitations¶
- The transformation must be deterministic.
- The transformation condition will not be applied to the referenced column.
- Not all transformers support
apply_for_references
Warning
We do not recommend using apply_for_references
with transformation conditions, as these conditions are not
inherited by transformers on the referenced columns. This may lead to inconsistencies in the data.
Example 1. Partitioned tables¶
In this example, we have a partitioned table sales
that is partitioned by year and then by month. Each partition
contains a subset of data based on the year and month of the sale. The sales
table has a primary key sale_id
and is
partitioned by sale_date
. The sale_date
column is transformed using the RandomDate
transformer.
CREATE TABLE sales
(
sale_id SERIAL NOT NULL,
sale_date DATE NOT NULL,
amount NUMERIC(10, 2) NOT NULL
) PARTITION BY RANGE (EXTRACT(YEAR FROM sale_date));
-- Step 2: Create first-level partitions by year
CREATE TABLE sales_2022 PARTITION OF sales
FOR VALUES FROM (2022) TO (2023)
PARTITION BY LIST (EXTRACT(MONTH FROM sale_date));
CREATE TABLE sales_2023 PARTITION OF sales
FOR VALUES FROM (2023) TO (2024)
PARTITION BY LIST (EXTRACT(MONTH FROM sale_date));
-- Step 3: Create second-level partitions by month for each year, adding PRIMARY KEY on each partition
-- Monthly partitions for 2022
CREATE TABLE sales_2022_jan PARTITION OF sales_2022 FOR VALUES IN (1)
WITH (fillfactor = 70);
CREATE TABLE sales_2022_feb PARTITION OF sales_2022 FOR VALUES IN (2);
CREATE TABLE sales_2022_mar PARTITION OF sales_2022 FOR VALUES IN (3);
-- Continue adding monthly partitions for 2022...
-- Monthly partitions for 2023
CREATE TABLE sales_2023_jan PARTITION OF sales_2023 FOR VALUES IN (1);
CREATE TABLE sales_2023_feb PARTITION OF sales_2023 FOR VALUES IN (2);
CREATE TABLE sales_2023_mar PARTITION OF sales_2023 FOR VALUES IN (3);
-- Continue adding monthly partitions for 2023...
-- Step 4: Insert sample data
INSERT INTO sales (sale_date, amount)
VALUES ('2022-01-15', 100.00);
INSERT INTO sales (sale_date, amount)
VALUES ('2022-02-20', 150.00);
INSERT INTO sales (sale_date, amount)
VALUES ('2023-03-10', 200.00);
To transform the sale_date
column in the sales
table and all its partitions, you can use the following
configuration:
- schema: public
name: sales
apply_for_inherited: true
transformers:
- name: RandomDate
params:
min: "2022-01-01"
max: "2022-03-01"
column: "sale_date"
engine: "random"
Example 2. Simple table references¶
This is ordinary table references where the primary key of the users
table is referenced in the orders
table.
-- Enable the extension for UUID generation (if not enabled)
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE users
(
user_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
username VARCHAR(50) NOT NULL
);
CREATE TABLE orders
(
order_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
user_id UUID REFERENCES users (user_id),
order_date DATE NOT NULL
);
INSERT INTO users (username)
VALUES ('john_doe');
INSERT INTO users (username)
VALUES ('jane_smith');
INSERT INTO orders (user_id, order_date)
VALUES ((SELECT user_id FROM users WHERE username = 'john_doe'), '2024-10-31'),
((SELECT user_id FROM users WHERE username = 'jane_smith'), '2024-10-30');
To transform the username
column in the users
table, you can use the following configuration:
- schema: public
name: users
apply_for_inherited: true
transformers:
- name: RandomUuid
apply_for_references: true
params:
column: "user_id"
engine: "hash"
This will apply the RandomUuid
transformation to the user_id
column in the orders
table automatically.
Example 3. References on tables with end-to-end identifiers¶
In this example, we have three tables: tablea
, tableb
, and tablec
. All tables have a composite primary key.
In the tables tableb
and tablec
, the primary key is also a foreign key that references the primary key of tablea
.
This means that all PKs are end-to-end identifiers.
CREATE TABLE tablea
(
id1 INT,
id2 INT,
data VARCHAR(50),
PRIMARY KEY (id1, id2)
);
CREATE TABLE tableb
(
id1 INT,
id2 INT,
detail VARCHAR(50),
PRIMARY KEY (id1, id2),
FOREIGN KEY (id1, id2) REFERENCES tablea (id1, id2) ON DELETE CASCADE
);
CREATE TABLE tablec
(
id1 INT,
id2 INT,
description VARCHAR(50),
PRIMARY KEY (id1, id2),
FOREIGN KEY (id1, id2) REFERENCES tableb (id1, id2) ON DELETE CASCADE
);
INSERT INTO tablea (id1, id2, data)
VALUES (1, 1, 'Data A1'),
(2, 1, 'Data A2'),
(3, 1, 'Data A3');
INSERT INTO tableb (id1, id2, detail)
VALUES (1, 1, 'Detail B1'),
(2, 1, 'Detail B2'),
(3, 1, 'Detail B3');
INSERT INTO tablec (id1, id2, description)
VALUES (1, 1, 'Description C1'),
(2, 1, 'Description C2'),
(3, 1, 'Description C3');
To transform the data
column in tablea
, you can use the following configuration:
- schema: public
name: "tablea"
apply_for_inherited: true
transformers:
- name: RandomInt
apply_for_references: true
params:
min: 0
max: 100
column: "id1"
engine: "hash"
- name: RandomInt
apply_for_references: true
params:
min: 0
max: 100
column: "id2"
engine: "hash"
This will apply the RandomInt
transformation to the id1
and id2
columns in tableb
and tablec
automatically.