The following is a summary of recommendations from the book Architecting Data Lakes (warning: this book has its fair share of references to Zaloni’s Data Platform, which is a bit annoying, but it is still good reading material).

A data lake should have at least five zones, with the following functions:

  • Transient Landing Zone: with managed ingestion (i.e. when metadata is known ahead of time), this is where the following functions take place: versioning, basic data-quality checks (e.g. range checks, integrity checks), data masking/tokenisation, and metadata capture.
  • Raw Zone: this is where data is added to the catalogue, and each cell is tagged with a visibility level that determines which users have access to it.
  • Trusted Zone: at this stage further data cleansing and validation are performed. Data is also usually grouped into two categories: master data and reference data (the new single version of truth).
  • Refined Zone: in this zone data is transformed and standardised into a common format to derive insights per each line of business’s (LOB’s) requirements. Data is watermarked by assigning a unique ID (at either the record or file level) and is exposed to consumers with appropriate role-based access.
  • Sandbox Zone: in this area data is exposed for ad-hoc, exploratory use cases, and the results can flow back to the Raw Zone.
  • Special functions:
    • Functions outside this process:
      • Life-cycle: to move from hot to cold storage.
    • Functions across all steps:
      • Data lineage.
      • SLA monitoring: detecting whether certain streams are not being ingested, based on the service-level agreements (SLAs) you set.
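
To make the zone functions concrete, here is a minimal sketch of three of them in Python: a transient-zone range check, transient-zone tokenisation of PII, and refined-zone watermarking with a unique ID. The record shape, `RANGE_RULES`, and `PII_FIELDS` are my own illustrative assumptions; the book does not prescribe a schema or a specific tokenisation scheme.

```python
import hashlib
import uuid

# Hypothetical rules -- not from the book, purely for illustration.
RANGE_RULES = {"age": (0, 120)}   # basic range checks in the Transient Landing Zone
PII_FIELDS = {"email"}            # fields to tokenise before data reaches the Raw Zone

def range_check(record: dict) -> bool:
    """Transient zone: reject records whose numeric fields fall outside range."""
    return all(lo <= record[f] <= hi
               for f, (lo, hi) in RANGE_RULES.items() if f in record)

def tokenise(record: dict) -> dict:
    """Transient zone: replace PII values with deterministic, non-reversible tokens."""
    out = dict(record)
    for f in PII_FIELDS & record.keys():
        out[f] = hashlib.sha256(str(record[f]).encode()).hexdigest()[:16]
    return out

def watermark(record: dict) -> dict:
    """Refined zone: watermark by assigning a unique ID to each record."""
    return {**record, "_record_id": str(uuid.uuid4())}
```

A record would then flow through the zones roughly as `watermark(tokenise(record))`, after `range_check(record)` has passed.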

Bonus point: another interesting book on this topic is “Data Lake for Enterprises: Lambda Architecture for Building Enterprise Data Systems”.

Thanks for reading,

Javier Caceres