Video script for a presentation on data quality. (CONTENT)

A quick note: this piece isn’t sales copy, so it’s not trying to sell anything; it’s here to show the value of clear writing.

The Client

I wrote this script for IT Services at Oxford University as part of a training package to promote the concept of Data Quality. The client wanted staff to know what could go wrong when handling data.

The Business Objective

IT Services wanted an easy-to-understand script which illustrated common problems that occur when data flows around a large organisation. They also wanted viewers to know how those issues could be detected and resolved.

The University processes a lot of data automatically, so an error made in one area can corrupt data sets elsewhere. Most data errors stem from human mistakes, and training in this area can nip problems in the bud.

The intended audience was managers at the University: staff working in research, academic departments, or finance. Typically, they wouldn’t be IT experts.

The Challenge

The presentation had to explain technical ideas in a simple way. Data quality is a complex field, and many managers at the University will be new to data analysis. 

The presentation had to outline the problems seen at the University, and explain the categories they fell into. It also had to show how checking could spot those errors, and help people improve their own data. 

The script was the audio component for an animated video, and the client wanted an entertaining and humorous script.

The Solution

I made sure I understood the concepts underpinning data quality, and arranged meetings with relevant people to check my learning.

I used simple language and short, clear sentence structures to make the script easy to follow. I created several examples to explain the dimensions of data quality, and checked with the client to ensure these were correct.

The piece also highlighted the benefits of having clean data.

Reading the script aloud and listening to a recording let me check it sounded natural.

I added a few simple jokes, and the client was happy with the piece’s tone.

The Deliverable

A three-minute video, with animations and my voice providing the narration.

Data Quality audio script

We must manage and check our data regularly, and there are six dimensions to check.

1. Completeness
2. Timeliness
3. Consistency
4. Validity
5. Uniqueness
6. Accuracy

Let’s examine each dimension.

Completeness measures how much of a dataset is present. Consider an e-course training record. We can say each record must hold the trainee’s first and last name, their SSO, the e-course’s name and ID, and the date the student completed the course.

You can check for blank fields by filtering.

If 98% of records have an entry in every required field, the data is 98% complete.
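As a rough sketch of that filtering-and-counting idea (the field names and sample records below are invented for illustration, not the University’s actual schema):

```python
# Completeness sketch: count records where every required field is filled.
# Field names and sample data are hypothetical.
REQUIRED = ["first_name", "last_name", "sso", "course_name", "course_id", "completed_on"]

records = [
    {"first_name": "Ada", "last_name": "Lovelace", "sso": "abcd1234",
     "course_name": "Data Quality 101", "course_id": "DQ101", "completed_on": "01-JAN-2025"},
    {"first_name": "Alan", "last_name": "Turing", "sso": "",  # blank SSO
     "course_name": "Data Quality 101", "course_id": "DQ101", "completed_on": "02-JAN-2025"},
]

def completeness(rows):
    """Proportion of rows with no blank required fields."""
    complete = sum(1 for r in rows if all(r.get(f, "").strip() for f in REQUIRED))
    return complete / len(rows)

print(f"{completeness(records):.0%}")  # one of two records is complete -> 50%
```

In practice you’d run the same check over an exported spreadsheet or database table rather than a hand-built list.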

Validity measures how well the data meets business rules. For example, we can say ‘Start date’ values must be in DD-MMM-YYYY format and can’t be later than today (unless we allow time travel).

Check values are in agreed formats and meet business rules. The proportion of compliant fields gives you a percentage figure for compliance.
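A minimal sketch of that kind of rule check, assuming the DD-MMM-YYYY rule from the example above (the sample dates are invented):

```python
# Validity sketch: check 'start date' strings are DD-MMM-YYYY and not in the future.
from datetime import datetime

def is_valid_start_date(value, today=None):
    """True if value parses as DD-MMM-YYYY (e.g. 05-MAR-2024) and isn't in the future."""
    today = today or datetime.now()
    try:
        parsed = datetime.strptime(value, "%d-%b-%Y")
    except ValueError:
        return False  # wrong format
    return parsed <= today  # no time travel allowed

dates = ["05-MAR-2024", "2024-03-05", "01-JAN-2999"]
valid = [d for d in dates if is_valid_start_date(d)]
print(f"{len(valid) / len(dates):.0%} valid")  # only the first date passes -> 33%
```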

Uniqueness tells us how well the data avoids duplicates.

For example, we might say each student must only have one SSO (even if they’ve cloned themselves). We can measure uniqueness by looking for duplicate SSO values. The proportion of unique values gives us a uniqueness figure.
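One way to sketch that duplicate hunt (the SSO values here are made up):

```python
# Uniqueness sketch: proportion of SSO values that appear exactly once.
from collections import Counter

ssos = ["abcd1234", "wxyz5678", "abcd1234", "qrst9012"]  # hypothetical SSOs

counts = Counter(ssos)
unique = sum(1 for s in ssos if counts[s] == 1)
print(f"{unique / len(ssos):.0%} unique")  # 2 of 4 values appear only once -> 50%
```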

Timeliness lets us know if our data is on time.

A business rule might say new staff must have a record created in our systems within one day of their appointed start date. (This helps them get paid, which is always nice).

We can check timeliness by evaluating records against our agreed business rules. If it took two days to establish a staff member’s record, we’d have a problem, and this would affect our Timeliness score.
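A small sketch of that one-day rule, using invented staff records:

```python
# Timeliness sketch: flag records created more than one day after the start date.
from datetime import date

staff = [
    {"name": "Ada",  "start": date(2025, 1, 6), "record_created": date(2025, 1, 6)},
    {"name": "Alan", "start": date(2025, 1, 6), "record_created": date(2025, 1, 8)},
]

# The business rule: a record must exist within one day of the start date.
on_time = [s for s in staff if (s["record_created"] - s["start"]).days <= 1]
print(f"{len(on_time) / len(staff):.0%} on time")  # Alan's record took two days -> 50%
```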

Consistency shows us if one area’s data agrees with data held somewhere else.

It would be inconsistent to have a staff member with a terminated ‘Person’ record if they still have open appointments and are still being paid by the University.

We check consistency by referencing one record against what appears for that record in other databases. You could scan someone’s records to make sure everything matches.

A proportion of records will match, and this gives a Consistency figure.
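A toy sketch of that cross-check, using the terminated-but-still-paid example (both “databases” here are invented dictionaries):

```python
# Consistency sketch: cross-check a person's status against the payroll system.
person_records = {"abcd1234": "terminated", "wxyz5678": "active"}
payroll_records = {"abcd1234": "being paid", "wxyz5678": "being paid"}

def consistent(sso):
    """A terminated person should not still be paid."""
    return not (person_records[sso] == "terminated"
                and payroll_records.get(sso) == "being paid")

matches = [s for s in person_records if consistent(s)]
print(f"{len(matches) / len(person_records):.0%} consistent")  # 1 of 2 -> 50%
```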

Accuracy reflects how well data represents the real world. For instance, we might hold a staff member’s address. If our records don’t match the actual address, there’s a problem.

We can check data manually, or through code.

The proportion of matching records gives you an Accuracy figure.
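When a trusted reference source exists, the coded version of that check can be as simple as this sketch (addresses invented):

```python
# Accuracy sketch: compare held addresses against a trusted reference source.
held = {"abcd1234": "1 Broad Street, Oxford", "wxyz5678": "2 High Street, Oxford"}
reference = {"abcd1234": "1 Broad Street, Oxford", "wxyz5678": "9 New Road, Oxford"}

accurate = sum(1 for sso, addr in held.items() if reference.get(sso) == addr)
print(f"{accurate / len(held):.0%} accurate")  # one of two addresses matches -> 50%
```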

A problem with one dimension often means other dimensions won’t be right. If we capture an appointment start date in a US format (MM-DD-YYYY), we’ll affect Validity, and leave our appointment records incomplete (affecting Completeness). Team members might also create duplicate records (affecting Uniqueness).

Maintaining good data quality means looking after all six dimensions; keeping them all in order is our road to data quality.