In our usual process of naming columns, we intentionally or unintentionally use a mix of both sources, combined manually, to come up with column names. This makes the process of deciding on column names extremely fluid and inconsistent.
The following are a few of the consequences.
WDTH or WIDTH, WT or WEIGHT, PRC or PRICE, PCT or PERCENT, YRS or YEARS, etc.
Since there is no tool or platform that gives feedback on column names while they are being chosen, all of the above further propagates inconsistencies in column naming across the data landscape.
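The kind of inconsistency described above can be detected mechanically. The following is a minimal sketch, not part of the package: it assumes column names are underscore-separated, uses only the abbreviation pairs listed above, and the sample column names are hypothetical.

```python
from collections import defaultdict

# Abbreviation -> full word pairs taken from the examples above.
KNOWN_PAIRS = {
    "WDTH": "WIDTH",
    "WT": "WEIGHT",
    "PRC": "PRICE",
    "PCT": "PERCENT",
    "YRS": "YEARS",
}


def find_mixed_usage(column_names):
    """Return {full_word: sorted spellings seen} for every word that
    appears in more than one spelling across the given column names."""
    seen = defaultdict(set)
    for name in column_names:
        # Assumption: words in a column name are separated by underscores.
        for token in name.upper().split("_"):
            if token in KNOWN_PAIRS:
                seen[KNOWN_PAIRS[token]].add(token)
            elif token in KNOWN_PAIRS.values():
                seen[token].add(token)
    return {word: sorted(forms) for word, forms in seen.items() if len(forms) > 1}


# Hypothetical column names, for illustration only.
columns = ["ITEM_WDTH", "BOX_WIDTH", "RETAIL_PRC", "DISCOUNT_PCT"]
print(find_mixed_usage(columns))  # {'WIDTH': ['WDTH', 'WIDTH']}
```

Only WIDTH is reported here, because it is the only word that shows up in both its abbreviated and full form.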
Apply to Data at Rest:
Column Names in Physical Layer:
All physical names will simply be the full logical name (see exceptions in catalog for limited abbreviations) with a specified physical platform separation scheme to replace the spaces as follows
Abbreviation usage Exceptions to Full Names:
** Class Word Abbreviations apply to NDF Curated, Integrated, and Aggregated layers. These do not apply to integration based canonical models as they specify the full word.
Primary Keys:
Foreign Keys:
Understanding and consistency in names are much more important than shortness going forward
Physical Column Ordering
Potential Exception Needs to be Considered for Data-In-Motion Structures for Data Payloads
Employee_First_Name, where Name is the class word.
MSRP would need to be MSRP_Price.
Material_Lead_Tm_Days, Plant_Build_Early_Planned_Nbr_of_Days, etc.
Travel_Expense_Amt_USD, Material_Cost_USD, Retail_Prc_USD, etc.
This package is used as a CLI to get feedback on the column names of a given entity, validated against the naming standards.
Run the anv <COLUMN_NAMES_TEXT_FILE> command, providing the column names text file.
INSTALLATION
$ pip3 install git+https://github.com/siddartham/attribute-name-validator.git
REPORT GENERATION
$ attribute-name-validator COLUMN_NAMES_TEXT_FILE
OUTPUT
$ ls reports/
COLUMN_NAMING_GUIDELINES_AND_ANALYSIS_REPORT_USAGE.html CATALOG.xlsx COLUMN_NAMES_TEXT_FILE_REPORT.xlsx
HOW TO USE THE REPORT
Running the command on a column names text file, or on a folder of column name files, produces two generic files: an XLSX file with the class word and acronym catalog used for the analysis, and this HTML page with a brief description of the naming rules and package usage.
reports/CATALOG.xlsx - Has the entire catalog of enterprise-approved class words and acronyms considered in the naming analysis.
reports/COLUMN_NAMING_GUIDELINES_AND_ANALYSIS_REPORT_USAGE.html - This HTML page, expanding on the usage of the package and the reports generated. It also has references to the Enterprise Naming Guidelines.
The tool generates a report per object, named reports/ENTITY_NAME_REPORT.xlsx, highlighting potential violations.
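As a rough illustration of the kind of word-by-word check that could sit behind such a report, the sketch below labels each word of a column name against a small catalog. It is not the package's actual logic: the three sets are hypothetical stand-ins for the contents of reports/CATALOG.xlsx and an English word list, and the column name Sales_Amt_MTD is made up.

```python
# Hypothetical catalog entries for illustration only; the real lists
# live in reports/CATALOG.xlsx.
CLASS_WORD_ABBREVIATIONS = {"AMT", "NBR", "PRC", "TM"}
ACRONYMS = {"STD", "WTD", "MTD"}
FULL_WORDS = {"SALES", "NAME", "RETAIL", "SEASON", "CHANNEL"}


def classify_words(column_name):
    """Label each underscore-separated word of a column name."""
    labels = {}
    for word in column_name.upper().split("_"):
        if word in CLASS_WORD_ABBREVIATIONS:
            labels[word] = "approved class word abbreviation"
        elif word in ACRONYMS:
            labels[word] = "approved acronym"
        elif word in FULL_WORDS:
            labels[word] = "full word"
        else:
            # Neither a known full word nor an approved short form.
            labels[word] = "potential violation"
    return labels


print(classify_words("Sales_Amt_MTD"))
print(classify_words("SESN_CHNL_NAME"))
```

With these sets, SESN and CHNL are labelled as potential violations, while SALES, AMT, MTD and NAME are accepted.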
**In an ideal situation, the report generated for a given entity** would have the following.
The CLASS_WORD_RULES_FOLLOWED? column in the CLASS_WORDS_ANALYSIS sheet should be YES or MAY BE for all entries. However, review the ones with MAY BE or NO.
FULL_WORDS_USED sheet: if words such as SESN, CHNL, GEO, CRBN, PRODCAL, ITEMTYPE, or PRICEMODIFIER appear in the list, which are neither English or other-language words nor approved abbreviated class words or acronyms, they violate enterprise guidelines and have to be changed to their full word forms.
APPROVED_ABBREVIATIONS_USED sheet: the meaning intended for each short form used in column names should match one of the listed meanings. For example, if the developers of an object use STD to mean STANDARD rather than SEASON TO DATE, an approved acronym, the USED_CATALOG report will show only SEASON TO DATE, not STANDARD, among the approved usages of STD. The column therefore has to be renamed: STD is an approved abbreviation, but for a different purpose.
FOR FURTHER USAGE OF THE TOOL
$ anv -h
usage: attribute-name-validator [-h] [-l] [--write-to-text-files] target_path
This command generates report on naming analysis from either a list of column name files in a folder,
or a given column names file, by looking up the words used to the column names in a local Abbreviation catalog
that comes with the installation of package. On the execution of this command, you will also have a reports
folder created with CATALOG.xlsx and COLUMN_NAMING_GUIDELINES_AND_ANALYSIS_REPORT_USAGE.html files,
which have more information related to the working of the tool
To add a set of Class Words Abbreviations and Acronyms, as exceptions beyond current enterprise guidelines
create anv.ini file and add exceptions under respective sections, as shown below
[additional-catalog]
acronyms = LOB
class-word-abbreviations = IN, IN3, CM3, LB
positional arguments:
target_path Path to file with column names or folder with column name files to analyze
optional arguments:
-h, --help show this help message and exit
-l, --log This flag, if present, shows logs of execution in sys.stdout.
--write-to-text-files, -wttf
This flag, if present, also creates text files of reports under their respective entity specific folder.
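The anv.ini format shown in the help text above is plain INI, so it can be read with Python's standard configparser module. This is a minimal sketch of one way to parse it, not the package's own loader; the function name read_additional_catalog is hypothetical.

```python
import configparser

# Same content as the anv.ini example in the help text.
ANV_INI = """
[additional-catalog]
acronyms = LOB
class-word-abbreviations = IN, IN3, CM3, LB
"""


def read_additional_catalog(ini_text):
    """Parse the [additional-catalog] section into two lists of names."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["additional-catalog"]

    def split_list(key):
        # Values are comma-separated; a missing key yields an empty list.
        raw = section.get(key, "")
        return [item.strip() for item in raw.split(",") if item.strip()]

    return {
        "acronyms": split_list("acronyms"),
        "class-word-abbreviations": split_list("class-word-abbreviations"),
    }


print(read_additional_catalog(ANV_INI))
# {'acronyms': ['LOB'], 'class-word-abbreviations': ['IN', 'IN3', 'CM3', 'LB']}
```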
EXCEPTIONS TO CLASS WORD RULES, EXCEPTIONS TO ABBREVIATED WORDS.
_CM3, _IN, _IN3, they need to be added to catalog
C02E or C02EKG etc
_WTD, _MTD, _YTD, _4WK, _12WK etc., be added to class words, or as exceptions to class words?
LOB, RTL etc.
_PRICE, sometimes _PRC, sometimes _WEIGHT, sometimes _WT, _WIDTH, sometimes _WDTH, but not _WH - makes it more confusing.
_PRC for _PRICE, _WGHT for _WEIGHT, _WDTH for _WIDTH, _AMT for _AMOUNT, _TXT for _TEXT, _TM for _TIME, _VOL for _VOLUME, _NM for _NAME, _CNT for _COUNT etc.
_KG for _KILOGRAMS, _M for _METERS, _M2 for _SQUAREMETERS, _M3 for _CUBICMETERS, _CM for _CENTIMETERS, _CM3 for _CUBICCENTIMETERS, _CO2E for _CO2EQUIVALENT, _UOM, _USD, _UTC, _UUID, etc. and more Class Words like that can be allowed as they,
CATALOG.xlsx needs to be updated as needed by the DGP, with requests coming for new custom class words which don't fall into any of the suggested categories of class words.
ACRONYMS sheet Ex: LOB - a well-used acronym, but not part of catalog.
_WTD, _MTD, while being approved acronyms, also used at the end?
_YEAR, _MONTH are not approved class words, do they have to be
_DAY, _YEAR, _MONTH are not approved class words - suggest that they need a class word qualifier like a _MONTH_NBR, _MONTH_TXT, _DAY_NBR?
FULL_WORD_REPORT sheet in the report to have a new column, SYNONYM/MEANING, whose value is generated for every FULL_WORD_REPORT by looking up an English language words catalog of an NLTK library. The value in the column for a given word would give further feedback on whether that is what they mean. - TODO
Above reported inconsistencies are primarily because of two things
There are two ways to go about it.
You can install this package from GitHub/PyPI. Anyone can install the package locally and run the command on a column names file as they build new objects.
Many systems (PC or Mac) come with Python preinstalled, but Python 3.8 or later is needed to install this package. Beyond that, users just have to install the package by running the command above.
However, communicating the installation steps to everyone, and expecting everyone who decides column names to install the package on their own system, with some of them running an unsupported Python version (earlier than 3.8), would be too much of a hassle.
Even more challenging is expecting everyone to generate reports based on the latest catalog and rules approved by the Data Governance Platform (DGP), which means upgrading the package each time: CATALOG.xlsx is a living document and naming rules can change. The DGP is expected to update it as and when the need arises.
Alternatively, use a lightweight client, with the catalog and rules hosted in the cloud.
A much easier way to propagate consistent column naming rules across the enterprise would be to host the tool on a well-communicated data governance platform. The column naming process would then gain one step: upload the column names document to the site, get a report generated by the tool, and repeat with each change while building or modifying an object.
This way, every person working on naming columns would always get feedback based on the latest catalog and rules approved by the Data Governance Platform.
On the Data Governance Platform side, the team needs to maintain the latest catalog and naming logic with the package as before, and deploy it at a well-communicated endpoint to be consumed by users across the organisation in the process of deciding column names.
Thus, everyone in the org would get feedback based on the latest catalog of abbreviations and rules approved by the Data Governance Platform simply by going back to the same hosted tool. No individual action is needed, such as upgrading to the latest package every time column names have to be decided, since one can't be sure that a package installed some time ago has the latest catalog and naming rules from the DGP.