Tutorial

This chapter gives a gentle introduction and overview to cutplace. It introduces a simple set of example data and shows how to write a simple cutplace interface definition (CID) for it which can be used to validate the data. Starting from there, the CID gets refined and improved to allow more rigorous validations which detect increasingly subtle errors in the data.

You can find the files for the examples in the examples folder of the cutplace source distribution. For your convenience they are also provided as links at the bottom of each data or CID file.

A simple set of customer data

In order to make use of cutplace, you need some data. For this tutorial, we will use a simple set of customer data:

customer_id,surname,first_name,born,gender
1,Beck,Tyler,1995-11-15,male
2,Gibson,Martin,1969-08-18,male
3,Hopkins,Chester,1982-12-19,male
4,Lopez,Tyler,1930-10-13,male
5,James,Ana,1943-08-10,female
6,Martin,Jon,1932-09-27,male
7,Knight,Carolyn,1977-05-25,female
8,Rose,Tammy,2004-01-12,female
9,Gutierrez,Reginald,2010-05-18,male
10,Phillips,Pauline,1960-11-09,female

See: customers.csv

Because this is hard to read here is how the file looks if you open it in a spreadsheet application such as Excel or Calc:

Describing data format and and field names

Now let’s try to describe the format of this file. From our initial pondering we already know the following facts:

  1. The data are provided as comma separated values (CSV).
  2. The fields in the data have a meaning we can assign a heading to.

Without much further ado, here’s how you can tell these facts to cutplace:

  Interface: customer  
     
  Data format  
D Format CSV
D Header 1
     
  Fields  
F customer_id  
F first_name  
F surname  
F date_of_birth  
F gender  

See: cid_customers_field_names_only.ods

Let’s take a closer look at this.

Row 1 is a heading that simply describes what the document is about: the interface for customer data.

Row 3 is a section heading to point out that the description of the data format is about to follow.

Row 4 describes the data format by stating that the Format is CSV. Did you notice the D in the first column? This is a hint for cutplace that the row contains information about the data format it should be able to process.

Row 5 adds another detail to the data format: data files have a header row that does not contain data yet. In this case there is only one such row. If there would no header at all, you could have omitted this row.

Row 7 again is a section heading, this time pointing out that descriptions about the field formats are about to follow.

The remaining rows then describe the fields that data files must have. So far we only describe the name of the field, for example customer_id.

Note

A field name must contain only ASCII letters, numbers and underscore (_). So blanks, foreign characters such as umlauts and punctuation marks are not allowed and will result in an error message when cutplace attempts to read the CID.

Some examples for valid field names:

  • customer_id
  • iso_5218_gender
  • CustomerId

For comparison, here are a few broken field names

  • customer-id - error: contains the punctuation mark “-”; use “_” instead.
  • customer id - error: contains a blank; use “_” instead.
  • 123easy - error: starts with a numeric digit; use a letter as first character.

Using comments

As already pointed out, if a row in the CID starts with an empty column, cutplace skips it without processing in any way. Actually, this would mean the same for cutplace:

D Format CSV
D Header 1
F customer_id  
F first_name  
F surname  
F date_of_birth  
F gender  

But it’s a lot harder to read, isn’t it?

How and where to store the CID

So how do you store this information? Cutplace is quite flexible here. The easiest way would be to use a spreadsheet application such as Excel or OpenOffice.org’s Calc and store it as *.xlsx or *.ods file. Alternatively you can use the text editor or your choice and store it as UTF-8 encoded *.csv, where columns are separated with a comma (,) and data items with blanks or commas are embedded between double quotes (”).

Concerning the location on your disk, cutplace does not impose any requirements on you.

For this tutorial, we assume both the data and CID files are stored in the same folder which and that your current console terminal session already changed to this folder (using for example the command cd).

Running cutplace for the first time

Now that we have both a data file and a CID file, we can finally take a look at how cutplace actually works.

Open a terminal and change into the folder where where the example data and CID files are located:

cd .../where/ever/examples

Next let’s try if cutplace has been installed properly:

cutplace --version

This should result in an output similar to:

cutplace 0.8.x

The actual version numbers may vary. If your version of cutplace is older then , consider upgrading to avoid compatibility issues with this tutorial.

If instead this results in an error message, refer to the chapter on Installation about how to setup cutplace.

In case everything worked out so far, let’s finally do what we came here for: validate that our data conform to the interface we just described:

cutplace cid_customers_field_names_only.ods customers.csv

This assumes you used Calc to create the CID. Users of Excel should replace the “.ods” with “.xlsx”, users of text editors with “.csv” respectively.

Summary so far

Let’s recap what we learned so far:

  • You can use cutplace to validate that data conform to a CID.
  • The CID is a file you can create with Calc, Excel or your favorite text editor.
  • As a minimum, the CID has to specify the data format and the name of the fields.
  • Rows describing the data format have to start with “D”.
  • Rows describing a field have to start with “F”.
  • Rows starting with an empty column will not be parsed by cutplace and can contain any information that is helpful for human readers to better understand the interface.

Adding examples

Most people find it easiest to get a general grasp of something by looking at an example. Cutplace supports this line of thinking by letting you add an examples for a field right after the name:

  Fields  
  Name Example
F customer_id 1
F surname Doe
F first_name Jane
F date_of_birth 1995-11-15
F gender female

See: cid_customers_with_examples.ods

Finding an example usually does not require much imagination. In this case we just took the values from the first customer and changed the name to the generic “Jane Doe”.

These examples are entirely optional. If you cannot find a good example for a field (like one containing a database BLOB) or you do not think a real world example does not add any value (like a field containing an encrypted password), feel free to leave it empty.

Allowing fields to be empty

So far, every data item had an actual value and none of it was empty. But what if for instance we do not know the date of birth for one of our customers? Consider the following example data file:

See: customers_without_date_of_birth.csv

Two customers do not have a date of birth: Kenneth Tucker in row 4 and Ester Newman in row 10.

Now try to validate these data with the same CID we used before:

cutplace cid_customers_field_names_only.ods customers_without_date_of_birth.csv

This time the output contains the following lines:

INFO:cutplace:validate "examples/customers_without_date_of_birth.csv"
ERROR:cutplace:  customers_without_date_of_birth.csv (R4C4): cannot accept field 'date_of_birth': value must not be empty

The essential part here is:

field 'date_of_birth' must match format: value must not be empty

When you describe a field to cutplace in the CID, it assumes that the data always provide a value for this field. Apparently this is not the case with the data provided, so cutplace complains about it.

But what if there are actually customers we do not yet know the date of birth yet? Not every business transactions requires the date of birth, so this is perfectly valid.

So we have to tell cutplace that this field actually can be empty. This can easily be done by adding another column to the field description, where fields that can be empty are marked with an X:

  Fields    
  Name Example Empty?
F customer_id 16  
F first_name Jane  
F surname Doe  
F date_of_birth 1995-11-15 X
F gender female X

See: cid_customers_with_empty_fields.ods

Now lets try again with the new CID:

cutplace cid_customers_with_empty_fields.ods customers_without_date_of_birth.csv

This time, no error messages show up and all the data are accepted:

INFO:cutplace:validate "customers_without_date_of_birth.csv"
INFO:cutplace:  accepted 10 rows

Limiting the length of field values

Right now we allow the fields to have any length. But what if the data should be processes by another program, which wants to insert the data in a database? Every field in the database has a limit on how many characters it can hold. If there are too many characters, the import will fail. And not always with an error message that makes it easy to backtrack where the broken data came from.

Fortunately, cutplace allows to describe length limits for fields:

  Name Example Empty? Length
F customer_id 16   2...
F first_name Jane   ...60
F surname Doe   ...60
F date_of_birth 1995-11-15 X 10
F gender female X 4...6

See: cid_customers_with_lengths.ods

Let’s take a closer look at these examples, especially at the meaning of the ellipsis (...) in some of the description of the lengths.

  • Let’s assume same the customer_id has to have at least 2 characters because one of them is a checksum in order to catch (most) mistyped numbers. In this case, the length is 2....

  • The first_name and surname can take at most 60 characters, maybe because someone said so way back in the 70s when COBOL ruled the world. To express this as length, use ...60.

  • The date_of_birth always takes exactly 10 characters because apparently we require it to use leading zeros.

    Wait a second, didn’t we state before that the date_of_birth can be empty? Shouldn’t we use 0...10 then? Actually no, because this would also accept dates with a length of 1, 2, 3 and so on until 9. The possibility that date_of_birth can have a length of 0 is already taken care of by the X in the Empty? column.

  • And finally, the gender can be male or female, so its length is between 4 and 6, which reads as 4...6.

To summarize: lengths are either exact values (like 10) or ranges with a lower and upper limit separated by a colon (like 4...6). Either the lower or upper limit can be omitted (like 2... or ...60).

In case you cannot decide yet on a reasonable limit on a certain field, just leave its entry in the Length column empty.

Now lets try again with the new CID:

cutplace cid_customers_with_lengths.ods customers_without_date_of_birth.csv

As expected, all the data are accepted again.

Field types and rules

So far, all the validations have been rather simple and generic. It’s time to reveal the big guns: types and rules.

Let’s take a closer look at the customer_id. Apparently, it’s a number. To be more specific, an integer number with no fractional part. Let’s say the same person who told us that a customer_id has at least two digits now informed us that due a design stemming from the 16 bit era, the highest customer_id is 65535 (the largest number one can represent with 16 bit). Here’s how to express this knowledge with cutplace:

  Name Example Empty? Length Type Rule
F customer_id 16   2: Integer 10...65535
F first_name Jane   :60    
F surname Doe   :60    
F date_of_birth 1995-11-15 X 10 DateTime YYYY-MM-DD
F gender female X 2:6 Choice male, female

See: cid_customers_with_types_and_rules.ods

The column Type can contain one of several available types. The column Rule can hold a text that gives further details about the Type.

Warning

Type names are case sensitive. So when you specify a type, make sure the letters match exactly concerning upper and lower case.

In case of customer_id, the type is Integer. In this case, the rule can specify a valid range. The syntax for the range is the same we’ve been using already for the Length column. So 10...65535 means “between 10 and 65535”.

For gender, the type is Choice which means that every value in this field must be in a list of possible choices specified with the rule. Here, possible choices are male and female.

Finally date_of_birth is of type DateTime. The rule describes the date format using place holders: DD (day), MM (month), YYYY (year including century) and YY (year without century). Any other character must show up literally, for example the . in the rule must show up as . in the value.

This tutorial showcases just a few of the types available.

See also

Decimal
A field format to describe decimal numbers with a fractional part.
Pattern
A field format to match patterns using asterisk (*) and question mark (?) as placeholders.
RegEx
A field format to match patterns using regular expressions.

Checking general conditions

So far we learned how to validate the general data format and values of separate fields. But what about conditions that are more sophisticated and require several fields or rows to validate them?

For example, we might want to validate that every customer has a unique customer_id.

For these kind of conditions cutplace supports checks. Here’s how it looks in practice:

  Checks    
  Description Type Rule
C customer must be unique IsUnique customer_id

See: cid_customers.ods

As you can see, checks require a “C” in the first column.

Next there is a description of the check. This should be a meaningful sentence because it shows up in error messages. It’s a good idea to word them as “Something must be something else” sentences.

The remaining two columns contain the type of the check and the rule. How the rule can look solely depends on the type of the check.

See also

DistinctCount
A check to validate that the number of distinct values in a field meets a specified condition.
IsUnique
A check to validate that a field value or a combination of field values is unique in each row compared with all other rows.

It is also possible to check for composite keys spawning several fields. As an example consider multiple branches of a company where each branch can have a customer with a customer_id of for example 16:

  Name Example Empty? Length Type Rule
F branch_id 123   1... Integer 1...99999
F customer_id 16   2... Integer 10...65535
F first_name Jane   ...60    
F surname Doe   ...60    
F date_of_birth 1995-11-15 X 10 DateTime YYYY-MM-DD
F gender female X 2...6 Choice male, female

To check that the customer_id is unique within each branch, use:

  Description Type Rule
C customer must be unique within branch IsUnique branch_id, customer_id

Possibly branches are 5 digit codes however in practice it might be known that there are at most 100 branches at the same time. To express this, use:

  Description Type Rule
C distinct branches must be within limit DistinctCount branch_id <= 100

Conclusion

You are now familiar with the basic concepts behind cutplace and should be able to use this for writing reasonably complete and sophisticated CIDs.