This is part three of an occasional series on data governance (DG) and its enabling technologies. In parts one and two I gave an overview of what data governance is and how it might be undertaken; in this blog, I talk about business glossaries–one of the key DG-enabling technologies–and how to create one.
Let me begin by acknowledging that a great deal has already been written about business glossaries, and I borrow heavily and liberally from that substantial body of work (which I’ve done my best to document in the References section below). What I hope to contribute here is a summary and consolidation of this work in a single, stand-alone blog post.
According to the Data Management Body of Knowledge (DMBoK), a business glossary (BG) “houses agreed-upon definitions of business terms and relates these to data.” There are other definitions of course, such as those from Lowell Fryman and Michelle Knight, but what they have in common is the centrality of terms and their definitions and interrelationships.
It turns out that there’s surprising depth to business glossaries, with the ideas that undergird them being drawn from such diverse disciplines as philosophy, linguistics, semiotics, cognitive science, and library and information science. Let me reassure you, though, that while I like to geek out as much as the next data nerd, I’ll keep the heady stuff to a minimum in favor of applicability. To that end let’s look at just a few of these ideas, beginning with the concept.
Computer scientist and wit Andrew S. Tanenbaum has said about standards that what’s nice about them “is that there are so many…to choose from.” To define concepts, I draw from three standards. The first is ISO 704, the international standard for compiling terminologies. ISO 704 states simply that a concept corresponds to a set of objects. ISO/IEC 11179, the international standard for metadata registries, extends this definition, stating that membership in these sets is based on characteristics common to all the set’s objects, which corresponds to the familiar Linnaean taxonomic classification system of kingdom, phylum, class, etc. The last standard, ANSI/NISO Z39.19, the American standard for monolingual controlled vocabularies, adds the important distinction between a concept and the label we assign to it, which is a term.
A term is the name we assign to a concept and is the meat and potatoes of the business glossary–its focus. A term gives identity to a concept, and for any one concept, there can be many terms attached to it. Figure 1 illustrates this. An English speaker looking at the object within the thought bubble would immediately identify it as a butterfly, because she saw characteristics common to all butterflies and assigned that label to the picture. But of course there’s more than one label for what an English speaker calls a butterfly. We’ve also got the Urdu, French, Arabic, Spanish, and Hindi labels, respectively. Notice, though, that while the label–the term–changes, the concept doesn’t. To paraphrase Juliet Capulet, a butterfly by any other name would be as lovely.
Figure 1 – What’s in a name?
It’s possible for a given BG term to map to more than one concept, for example mouse. Conversely, it’s possible for a concept to be represented by more than one term, e.g., water and H2O (see figure 2). The former is called polysemy and the latter is synonymy. Polysemy is undesirable in a business glossary: you want univocality, meaning that each term labels exactly one concept. To create unique terms in the face of polysemy, it may be necessary to qualify the term. For example, the crane on the right could be labeled “construction crane,” and the mouse on the left could be “mouse (animal).”
Figure 2 – Cranes and mice
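If your candidate terms already sit in a spreadsheet or table, a few lines of Python can surface both situations for review: one term labeling several concepts (polysemy) and several terms labeling one concept (synonymy). This is only a minimal sketch; the term/concept pairs and the concept identifiers below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (term, concept_id) pairs; the identifiers are invented.
LABELS = [
    ("mouse", "animal:mouse"),
    ("mouse", "device:mouse"),
    ("water", "substance:water"),
    ("H2O", "substance:water"),
    ("butterfly", "animal:butterfly"),
]

def find_polysemes(labels):
    """Return terms that label more than one concept."""
    concepts_by_term = defaultdict(set)
    for term, concept in labels:
        concepts_by_term[term].add(concept)
    return {t: sorted(c) for t, c in concepts_by_term.items() if len(c) > 1}

def find_synonyms(labels):
    """Return concepts that are labeled by more than one term."""
    terms_by_concept = defaultdict(set)
    for term, concept in labels:
        terms_by_concept[concept].add(term)
    return {c: sorted(t) for c, t in terms_by_concept.items() if len(t) > 1}

print(find_polysemes(LABELS))  # terms that need a qualifier
print(find_synonyms(LABELS))   # concepts with more than one label
```

Terms reported by `find_polysemes` are the ones that need a qualifier such as “mouse (animal).”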
Key to the assignment of a term to a concept is the definition of the concept, since the definition drives the association of a term to its concept. The definition is a statement of the meaning of a concept, and in the same way that there should only be one term per concept, there should only be one definition per term. Further, the definition must apply to all instances of the concept.
The final business glossary preliminary is that of the category. Within a business glossary, terms are grouped into categories or clusters according to a set of criteria. Categories are useful because they provide the logical structure for the business glossary so that you can manage, make sense of and, most importantly, find terms within it.
To summarize, the relationship between concepts, terms, definitions, and categories is that a term labels a concept, a definition states the meaning of a concept, and terms can be grouped into categories based on a consistent and inclusive membership criterion.
Figure 3 – The four BG fundamentals
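For readers who think in data structures, the four fundamentals can be sketched as a small Python model. Everything here (the class names, the field names, the sample term) is my own assumption for illustration and not the schema of any particular glossary tool.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Concept:
    concept_id: str
    definition: str  # one definition per concept

@dataclass
class Glossary:
    concepts: dict = field(default_factory=dict)          # concept_id -> Concept
    term_to_concept: dict = field(default_factory=dict)   # term -> concept_id
    term_to_category: dict = field(default_factory=dict)  # term -> category name

    def add_term(self, term, concept, category):
        # A term may label only one concept; reject a second, different one.
        existing = self.term_to_concept.get(term)
        if existing is not None and existing != concept.concept_id:
            raise ValueError(f"{term!r} already labels concept {existing!r}")
        self.concepts[concept.concept_id] = concept
        self.term_to_concept[term] = concept.concept_id
        self.term_to_category[term] = category

    def define(self, term):
        """Look up the definition of the concept that a term labels."""
        return self.concepts[self.term_to_concept[term]].definition

# Hypothetical sample entry.
glossary = Glossary()
churn = Concept("cust-churn", "Customer who has ended the business relationship.")
glossary.add_term("Churned customer", churn, "Customer")
print(glossary.define("Churned customer"))
```

Note that the definition hangs off the concept rather than the term, which mirrors the idea that the term is only a label.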
Creating a Business Glossary
There are five steps to creating a business glossary: find, name, define, organize, and link.
The first step is locating the concepts that will make up the business glossary. In step two you name the concepts, following naming best practices. For step three, you define the term, ensuring the definition is complete and inclusive of all instances of the term. Step four is where you group terms into categories, based on common membership criteria, in order to enable understanding and findability. Finally, in step five you establish term interrelationships to create concept systems and further facilitate findability and navigation.
Find. Subject matter content is the primary source of concepts, so start with a content inventory and then assess the inventoried content itself. Content includes documents like business plans and policies and procedures; modeling artifacts such as data models, ontologies (both custom and pre-existing, such as those on schema.org), and class diagrams; data dictionaries; interviews with stakeholders; web sites; and, crucially, intranet search logs, which can show you what information knowledge workers are looking for and the terms they use when they search for it.
Name. Once you’ve collected the content, it’s then time to extract candidate terms from it. Recall that a term labels a concept, thus terms are nouns or noun forms, such as noun phrases and gerunds. A noun phrase can be either adjectival (noun plus adjective) or prepositional (noun plus preposition). An example of the former is “churned customer,” and an example of the latter is “customer of record.” So yes, terms can be compound. Gerunds are verbs made into nouns with the addition of “-ing,” e.g., “borrowing.” Keep in mind that nouns can also end in “-tion,” such as “amortization.”
The process of term extraction and naming is iterative and informed by stakeholders and subject-matter experts (SMEs). Please keep in mind that a business glossary doesn’t have to contain every concept that appears in the corpus of your organization’s content, just those core to the business. Now, what’s core is in the eye of the beholder, but my main point here is that not every term needs to be governed; those corresponding to critical data elements can be a good place to start. Also, although automated methods such as optical character recognition (OCR) and computational linguistics techniques like part-of-speech tagging have shown promise in term extraction, this process is still mostly manual or at least includes a significant manual review.
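As one small example of automation that still leaves the judgment to people, here is a sketch that ranks intranet search queries by frequency so that stakeholders and SMEs can review the most common ones as candidate terms. The log entries and the `min_count` threshold are assumptions for illustration; a real log would need more cleanup than this.

```python
from collections import Counter
import re

def candidate_terms(queries, min_count=2):
    """Rank normalized search queries by frequency as candidate terms."""
    # Normalize case and whitespace so trivial variants count together.
    normalized = (re.sub(r"\s+", " ", q.strip().lower()) for q in queries)
    counts = Counter(q for q in normalized if q)
    return [(term, n) for term, n in counts.most_common() if n >= min_count]

# Hypothetical intranet search-log entries.
log = [
    "Churned customer", "churned  customer", "amortization",
    "customer of record", "Amortization ", "churned customer", "lunch menu",
]
print(candidate_terms(log))
```

The output is a review list, not a glossary: someone still has to decide which candidates are core to the business.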
Define. As I mentioned before, the definition drives the term and its categorization. Guidelines for term definitions are voluminous, and I consulted seven sources, but the bulk of the guidelines I list here–and indeed those cited by most writers on the subject–come from ISO/IEC 11179. I refer you specifically to part four of that standard. Additional guidelines come from a DoD standard, DoD 8320.1-M, and the authors Malcolm Chisholm and Ronald Ross.
A good definition:
- Is stated in the singular
- States what the concept is, not (only) what it isn’t. If a negative is included, it should come after the positive in order to reinforce it.
- Is stated in a descriptive phrase or sentence that can stand on its own
- Contains only commonly-understood abbreviations
- Does not contain other definitions
- Is unambiguous
- Is not circular (“A: see B”; “B: see A”)
- Is not tautological (“An A is an A”)
- Can be substituted for its term without a loss or change of meaning
- Is concise and precise (minimize op-ed, repetition, redundancy, rationale, and elaboration)
- Is not a list of values
- Is coextensive with its concept
- Covers the concept completely; if an exception can be found then the definition is incomplete
- Does not cover more than–or exceed–the concept
- Does not begin with an infinitive
- Minimizes conjunctions
- Does not begin with “any,” or “some”
- Starts with a noun (after the article)
- Doesn’t begin with the term being defined (“An A is…”)
- Begins with a word in the same class as the term
Above all else remember that the constituents of a definition must be essential to it. A constituent is considered essential if its absence from the definition fundamentally changes the concept represented by the term.
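A handful of these guidelines are mechanical enough to check with simple text rules before a definition goes to human review. The sketch below covers only four of them, and the rules are rough heuristics of my own (for instance, any definition starting with “to” plus a word is treated as an infinitive), so expect false positives and treat the function name and messages as hypothetical.

```python
import re

def lint_definition(term, definition):
    """Return a list of guideline violations found by simple text checks."""
    problems = []
    lowered = definition.strip().lower()
    # Drop a leading article so the first check sees the first content word.
    body = re.sub(r"^(a|an|the)\s+", "", lowered)
    if body.startswith(term.lower()):
        problems.append("begins with the term being defined")
    if re.match(r"(any|some)\b", lowered):
        problems.append('begins with "any" or "some"')
    if re.match(r"to\s+\w+", lowered):
        problems.append("begins with an infinitive")
    if re.match(r"see\b", lowered):
        problems.append("points to another entry instead of defining")
    return problems

print(lint_definition("Churned customer",
                      "A churned customer is a customer who left."))
print(lint_definition("Home equity loan",
                      "Loan secured by the borrower's equity in a residence."))
```

Guidelines such as “is unambiguous” or “is coextensive with its concept” cannot be checked this way; they remain the job of the SMEs.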
Organize. Earlier I mentioned how terms can be usefully grouped into categories in order to facilitate understanding and findability; this grouping is typically done hierarchically, resulting in a taxonomy of terms. I’m jumping ahead a bit, but Talend Data Catalog (TDC) takes taxonomies further by adding additional term relationship types beyond just hierarchical ones, promoting the taxonomy to a thesaurus (more on this shortly). Getting back to categories, it’s most common for the top-level categories of the business glossary to correspond to some aspect of how the business is structured, such as organizing by
- Business unit
- Business function
- Subject area
Multinationals like GE might start with industry, whereas other, smaller, organizations may start with business unit. Categorizing by subject area is also quite common, with top level categories such as Customer, Employee, Inventory, and Accounts. Ideally, you’ll end up with 4-8 top-level categories, with about as many sub-categories. If category-term assignment is skewed, i.e., one category has vastly more terms than others, then the classification scheme likely needs tweaking.
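A quick way to spot a skewed scheme is to count terms per category and flag any category that is far larger than the typical one. In this sketch the comparison against three times the (upper) median is an arbitrary threshold I picked for illustration, and the sample assignments are invented.

```python
from collections import Counter

def category_report(term_to_category, skew_ratio=3.0):
    """Count terms per category and flag categories far above the median."""
    counts = Counter(term_to_category.values())
    sizes = sorted(counts.values())
    median = sizes[len(sizes) // 2]  # upper median for an even count
    skewed = sorted(c for c, n in counts.items() if n > skew_ratio * median)
    return {
        "counts": dict(counts),
        "skewed": skewed,
        "top_level_ok": 4 <= len(counts) <= 8,  # the 4-8 rule of thumb
    }

# Hypothetical term-to-category assignments.
terms = {f"Customer term {i}": "Customer" for i in range(7)}
terms.update({
    "Employee": "Employee", "Contractor": "Employee",
    "SKU": "Inventory", "Reorder point": "Inventory",
    "Ledger account": "Accounts",
})
report = category_report(terms)
print(report["skewed"], report["top_level_ok"])
```

A flagged category is a prompt to consider splitting it into sub-categories, not proof that the scheme is wrong.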
An alternative to the top-down approach of progressively dividing categories is called card sorting, which comes from the library and information science field and is frequently used by information architects to design the logical organization of website content. Card sorting is wonderfully simple: You take a sampling of terms, write them on Post-it® notes, one term per note, and ask participants (i.e., stakeholders and SMEs) to organize the notes into groups, capturing the criteria and rationale used. Do this several times with different groups of participants and then synthesize the results into a classification scheme.
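One common way to synthesize several card-sorting sessions is to count how often each pair of terms lands in the same group; pairs that almost everyone groups together are strong candidates for the same category. The sketch below assumes each session is recorded as a list of groups, and the sample sessions are invented.

```python
from collections import Counter
from itertools import combinations

def co_occurrence(sorts):
    """Count how many sessions placed each pair of terms in the same group."""
    pairs = Counter()
    for groups in sorts:          # one session's sort
        for group in groups:      # one pile of notes
            # Sort the terms so each pair has a single, stable key.
            for a, b in combinations(sorted(group), 2):
                pairs[(a, b)] += 1
    return pairs

# Hypothetical results from three card-sorting sessions.
sessions = [
    [{"Borrower", "Mortgagor"}, {"Loan", "Home equity loan"}],
    [{"Borrower", "Mortgagor", "Loan"}, {"Home equity loan"}],
    [{"Borrower", "Mortgagor"}, {"Loan", "Home equity loan"}],
]
pairs = co_occurrence(sessions)
for pair, n in pairs.most_common():
    print(pair, n)
```

The counts do not replace the rationale participants gave for their groups; they just show where the groups agree.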
Link. At this point you’ve collected your terms, named them, defined them rigorously, and grouped them. The last step is to establish relationships between terms (beyond those implied by their grouping into the same category), and by doing this you create a business glossary with the structure of a thesaurus, a concept system capable of expressing a broad range of semantics. Thesaurus standards such as ANSI/NISO Z39.19 and ISO 25964 recognize three term-relationship types: hierarchical relationships, equivalence relationships, and associative relationships.
The hierarchical relationship type is between a pair of terms where one term’s scope completely includes the other–one’s the superordinate term and the other’s the subordinate term, respectively. The thesaurus standards recognize four kinds of hierarchical relationships:
- The narrower term (NT) is a subset or subcategory of the broader term (BT)
  - Example: Loan NT Home equity loan
- The NT is a component or part of the BT
  - Example: Salary BT Compensation package
- NTs are instances of BTs
  - Example: Western region BT Sales territory
- One NT can have multiple BTs
  - Example: Mortgage Insurance BT Mortgage product BT Insurance product
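The examples above can be captured in a very small structure: store only the BT links and derive everything else from them. This is a sketch under my own assumptions (a plain dictionary, hypothetical function names), not how any particular catalog stores its hierarchy.

```python
# BT links taken from the examples above: term -> set of broader terms.
BROADER = {
    "Home equity loan": {"Loan"},
    "Salary": {"Compensation package"},
    "Western region": {"Sales territory"},
    "Mortgage Insurance": {"Mortgage product", "Insurance product"},
}

def narrower(term, broader=BROADER):
    """Derive NT links by inverting the BT links."""
    return sorted(nt for nt, bts in broader.items() if term in bts)

def all_broader(term, broader=BROADER):
    """Collect every ancestor of a term, following BT links transitively."""
    seen, stack = set(), [term]
    while stack:
        for bt in broader.get(stack.pop(), ()):
            if bt not in seen:
                seen.add(bt)
                stack.append(bt)
    return seen

print(narrower("Loan"))
print(sorted(all_broader("Mortgage Insurance")))
```

Because a term maps to a set of broader terms, the case of one NT with multiple BTs needs no special handling.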
Within a hierarchy, all members must be of the same fundamental category, e.g., things, people, activities, properties, etc. The standard check for a generic (subset) relationship is the “all-and-some” test. As an example, consider the two terms Mortgage loan and Home equity loan. It must be the case that all Home equity loans are Mortgage loans and that only some Mortgage loans are Home equity loans. If instead it’s only “some” on both sides (i.e., only some Home equity loans are Mortgage loans), then it’s not a true generic relationship.
Figure 4 – The “all / some test”
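In set terms, the test asks whether the instances of the narrower term form a proper subset of the instances of the broader term. Here is a minimal sketch using Python sets; the loan identifiers are invented, and passing the test on sample data is evidence for the relationship, not proof of it.

```python
def all_and_some(narrower_instances, broader_instances):
    """True when all narrower instances are broader instances, but only
    some broader instances are narrower instances (a proper subset)."""
    # An empty sample proves nothing, so it does not pass.
    return bool(narrower_instances) and narrower_instances < broader_instances

# Hypothetical loan identifiers, invented for illustration.
mortgage_loans = {"L-001", "L-002", "L-003"}
home_equity_loans = {"L-002", "L-003"}
refinanced_loans = {"L-003", "L-900"}  # only some of these are mortgage loans

print(all_and_some(home_equity_loans, mortgage_loans))
print(all_and_some(refinanced_loans, mortgage_loans))
```

The second call fails the test because the overlap is only “some” on both sides.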
The equivalence relationship denotes term correspondence–the two terms are regarded as the same or nearly so in a wide variety of contexts and should be virtually interchangeable. Equivalences can be of the following kinds:
- Synonyms, e.g., Consumer reporting agency/Credit bureau
- Lexical variants, e.g., Organization/Organisation, Fiber/Fibre
- Quasi-synonyms, e.g., Borrower/Mortgagor
- Abbreviations, initialisms, and acronyms, e.g., ARM / Adjustable-rate mortgage
- Nonproprietary name to trade name, e.g., Adhesive bandage/ Band-Aid®, Acetaminophen/Tylenol®
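In practice, equivalence relationships are what let a search for a non-preferred entry land on the governed term; thesauri conventionally mark this with USE and UF (“used for”). The sketch below builds a lookup from the examples above. Which member of each pair is the preferred term is my assumption, since that choice belongs to your stakeholders.

```python
# Hypothetical USE table: non-preferred entry (lowercased) -> preferred term.
USE = {
    "credit bureau": "Consumer reporting agency",
    "organisation": "Organization",
    "mortgagor": "Borrower",
    "arm": "Adjustable-rate mortgage",
    "band-aid": "Adhesive bandage",
}

def preferred_term(entry, use=USE):
    """Map a search entry to its preferred term; unknown entries pass through."""
    cleaned = entry.strip()
    return use.get(cleaned.lower(), cleaned)

print(preferred_term("ARM"))
print(preferred_term(" Credit bureau "))
print(preferred_term("Loan"))
```

Keeping the lookup keys lowercase also absorbs simple capitalization variants for free.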
The associative relationship is used to suggest additional or alternative terms to search for. It’s a bit of a catch-all and is therefore easy to abuse, so use it judiciously–there should be a strong mental association between the terms. A good test when considering linking two terms with this relationship type is to ask how likely it is that someone seeking data or information under one of the terms might also be interested in the other. If it’s highly likely then by all means connect them. Table 1 lists different specific types of relationships that can be represented by the associative relationship.
Table 1 – Types of associative relationships
To the three term relationship types recognized by the thesaurus standards, Talend Data Catalog adds two more: contains / contained by and represents / represented by. The first is used to associate entities with the attributes that describe them, and the second is used for connecting attributes to the domains from which they draw their values. These additional relationships typically occur in cases where you’ve seeded your business glossary in TDC from a data model or relational database schema, with the entities/tables and their attributes becoming glossary terms.
Business glossaries offer many benefits to their diverse user community. Principally, they promote a shared understanding. This is because, as noted data governance author Lowell Fryman puts it,
An individual can have…a different context for the meaning and definition of the terminology used in the business…. Many of us will have a different context, different understanding, different organizational filter, different cultural or regional semantics, different country regulations that define terminology and processes differently, and other differences.
Additionally, business glossaries enable…
- Improved decisions – apply business and technical knowledge
- Reduced risk – mitigate misuse of data due to inconsistent understanding of business concepts
- Alignment – improve coordination between technology assets and the business
- Knowledge access – retrieve the breadth of documented institutional wisdom
- Accountability – institute processes for managing terms from candidacy to retirement
- Productivity – locate and re-use data assets in furtherance of self-service analytics
In this blog post I’ve defined the business glossary, laid its foundation with fundamentals, detailed a development process, and established its benefits. I hope it was useful to you.
References
Chisholm, Malcolm D. Definitions in Information Management, 2010.
Data Management Association. Data Management Body of Knowledge. Basking Ridge: Technics Publications, 2017.
Department of Defense, DoD 8320.1-M: “Data Element Standardization Procedures.”
Fryman, Lowell. “What is a Business Glossary?” BeyeNETWORK, 2012.
—. “Audiences of a Business Glossary.” BeyeNETWORK, 2012.
—. “Business Glossaries and Metadata: Metadata Strategy and the Business Glossary.” The Data Administration Newsletter, 2018.
Fryman, Lowell, et al. The Data and Analytics Playbook. Cambridge, MA: Morgan Kaufmann, 2017.
International Organization for Standardization. ISO 704: “Terminology work – Principles and Methods.”
International Organization for Standardization. ISO 25964: “Thesauri and interoperability with other vocabularies.”
International Organization for Standardization. ISO/IEC 11179: “Information Technology — Metadata registries.”
Knight, Michelle. “Business Glossary Basics.” Dataversity, 2017.
Lancaster, F.W. Vocabulary Control for Information Retrieval. Washington: Information Resources Press, 1972.
McDavid, Doug. “Business Language Analysis for Object-Oriented Information Systems.” IBM Systems Journal, v35, 1996.
Mill, John Stuart. A System of Logic: Ratiocinative and Inductive. Oxford: Benediction Classics, 2011.
National Information Standards Organization. ANSI/NISO Z39.19-2005 (R2010): “Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies.”
Ross, Ronald G. How to Define Business Terms in Plain English. Business Rule Solutions, LLC, 2017.
Stewart, Darin L. Building Enterprise Taxonomies. Mokita Press, 2008.