We accept the data we think we deserve
Given that data comes in many forms, shapes and sizes, it's important to understand the things you can do with data. Amongst the numerous level of jargon we have to remember in our day to day lives, trying to figure out the difference between a FAT table and a FACT table is not what you want on a weary Thursday afternoon.
Show me the data
Pivotal to getting to grips with the fields of data, is finding it. This could be in your everyday life, through your wearable tracker, your personal finance data to the stats of your favourite sports team. In your job, this could be more tricky to pin down. The Excel spreadsheet is still the most popular method of distributing and sharing data within organisations.
Finding out where the data is collected or extracted from at your organisation will help you understand the data better.
> Is there a database?
> Who is the database architect?
> What data exists for you to access?
> How is this data used currently?
Companies are more and more moving toward robust systems with governance in place - but whilst most companies are now collecting and accessing data, the trick of course is using this data.
And the key to using it? Understanding it.
The Lexicon of Data
When dealing with data, there are some basic concepts and terms which are used in this field. Let's explore a few of the common ones together.
Join
Joining is the method of matching rows against each other. Common terms you'll hear are 'LEFT joins, RIGHT joins and INNER joins Here are a few examples below, to aid basic understanding.
INNER JOIN
LEFT JOIN
UNION
A union effectively a FULL OUTER JOIN. If this is weird jargon which doesn't make sense, think of it as basically stacking data tables on top of one another. Where a join matches fields and adds rows, a union adds columns (while also matching columns which are either defined as similar, or have similar names)
Typically, a join will make a data table wider (more columns) and a union will make a table longer (more rows)
As above the jargon, lexicon or words we use when talking about data are worth considering and being aware of. Here are a few common ones you may hear:
- MARK
A mark is a data point, an individual cell of data which is shown on a visualisation. - RECORD
A record is a row of data. Nothing more, nothing less. - FIELD
A field is a column of data - these fields may be dimensions or measure, and may be discrete or continuous. Check out this blog about Blue Things and Green Things which is a helpful guide in understanding what is what. - DIMENSION
A dimension is a field of data which can slice and dice your data - this can be a reference ID, a text field or a key header. - MEASURE
A measure is something which creates an axis, and is typically something which you can count (as a number) - HEADER
The term header can depend on the context it is used in. It could be a FIELD HEADER, which is the name of the column; or it could be a TABLE HEADER, which refers to the dimension shown by the visualisation. - DATABASE
A database is a set of data which is structured in a distinct pre-organised, pre-designed way which is typically held on a server. This server may be on-premise (within the company's network on a physical machine) or in the cloud (either a private cloud specific to the company, or the public cloud, which is accessible to anyone - TABLE
In the 'database hierarchy' a table exists one level down from a database, which stores a set or sub-set of data points. - SCHEMA
When a database is mapped out, a schema is the diagram which shows the links between the tables. This includes key fields (such as ID fields) as well as fields which join tables together. These links are typically referred to as a relationships. - VIEW
A view is a pre-calculated set of joins, unions or blends, which are committed to the database in order to speed up the building of a specific set of data. A view can be rendered on the client side (on the machine of the computer of the analyst), or on the databased side, as a materialised view. - HIERARCHY
A hierarchy refers to the logical order of different dimensional columns. An example of a hierarchy would be Continent > Country > State > City > Postal Code - DATA SOURCE
A datasource is a combination of tables from a database in order to drive an app, report or system.
You have access to data - now what?
Typically, when approaching your data, you typically go in with purpose.
The purpose could be to replicate existing reporting, which has much more of a 'explanatory' approach - what questions are already asked, and how can this dataset help me answer that? At the other end of the spectrum is the 'exploratory' side of data analysis. This can be interpreted in two ways; the exploration you do as an analyst in order to create a report, chart or dashboard, or the type of visual you create, which is allows the end-user to explore the data. For this latter example, some core questions the end-user will ask need to be addressed, but also baked in would be the capability to ask more, deeper questions.
So what about the data exploration before the build begins? In modern self-service tools such as Tableau, this 'shuffle' can help to find (to steal a phrase from Ben Jones) the contours of your data. These are listed below.
Other best practices for investigating your data include asking questions such as:
- What is a row/record in the data?
- What is are key dimensions to filter or slice this data by?
- Which are measures which correlate, and what is the most used level of detail used?
- What are outliers? What are expected values for each measure?
- How are measured commonly aggregated as?
These questions come in particular importance when building out knowledge of a brand new datasource or table, in conjunction with the 10 questions Ben Jones mentioned in his contours of data checklist.
Impacts, gains & conclusions
Data is a substance which can drive real change, insight and ensure that decisions are made with full, contextualised information. However, this is dependant that both the creators of visualisations and the users have a tacit level of data literacy.
Through understanding the base layer of language used, and subsequently being able to apply this, the journey to becoming more data literate begin - and the steps to a truly data driven, self-service environment continues.