Tutorial¶

This chapter gives a gentle introduction and overview to cutplace. It introduces a simple set of example data and shows how to write a simple cutplace interface definition (CID) for it which can be used to validate the data. Starting from there, the CID gets refined and improved to allow more rigorous validations which detect increasingly subtle errors in the data.

You can find the files for the examples in the examples folder of the cutplace source distribution. For your convenience they are also provided as links at the bottom of each data or CID file.

A simple set of customer data¶

In order to make use of cutplace, you need some data. For this tutorial, we will use a simple set of customer data:

customer_id,surname,first_name,born,gender
1,Beck,Tyler,1995-11-15,male
2,Gibson,Martin,1969-08-18,male
3,Hopkins,Chester,1982-12-19,male
4,Lopez,Tyler,1930-10-13,male
5,James,Ana,1943-08-10,female
6,Martin,Jon,1932-09-27,male
7,Knight,Carolyn,1977-05-25,female
8,Rose,Tammy,2004-01-12,female
9,Gutierrez,Reginald,2010-05-18,male
10,Phillips,Pauline,1960-11-09,female

See: customers.csv

Because this is hard to read here is how the file looks if you open it in a spreadsheet application such as Excel or Calc:

Describing data format and and field names¶

Now let’s try to describe the format of this file. From our initial pondering we already know the following facts:

The data are provided as comma separated values (CSV).
The fields in the data have a meaning we can assign a heading to.

Without much further ado, here’s how you can tell these facts to cutplace:

	Interface: customer



	Data format

D	Format	CSV

D	Header	1



	Fields

F	customer_id

F	first_name

F	surname

F	date_of_birth

F	gender

See: cid_customers_field_names_only.ods

Let’s take a closer look at this.

Row 1 is a heading that simply describes what the document is about: the interface for customer data.

Row 3 is a section heading to point out that the description of the data format is about to follow.

Row 4 describes the data format by stating that the Format is CSV. Did you notice the D in the first column? This is a hint for cutplace that the row contains information about the data format it should be able to process.

Row 5 adds another detail to the data format: data files have a header row that does not contain data yet. In this case there is only one such row. If there would no header at all, you could have omitted this row.

Row 7 again is a section heading, this time pointing out that descriptions about the field formats are about to follow.

The remaining rows then describe the fields that data files must have. So far we only describe the name of the field, for example customer_id.

Note

A field name must contain only ASCII letters, numbers and underscore (_). So blanks, foreign characters such as umlauts and punctuation marks are not allowed and will result in an error message when cutplace attempts to read the CID.

Some examples for valid field names:

customer_id
iso_5218_gender
CustomerId

For comparison, here are a few broken field names

customer-id - error: contains the punctuation mark “-“; use “_” instead.
customer id - error: contains a blank; use “_” instead.
123easy - error: starts with a numeric digit; use a letter as first character.

Using comments¶

As already pointed out, if a row in the CID starts with an empty column, cutplace skips it without processing in any way. Actually, this would mean the same for cutplace:

D	Format	CSV

D	Header	1

F	customer_id

F	first_name

F	surname

F	date_of_birth

F	gender

But it’s a lot harder to read, isn’t it?

How and where to store the CID¶

So how do you store this information? Cutplace is quite flexible here. The easiest way would be to use a spreadsheet application such as Excel or OpenOffice.org’s Calc and store it as *.xlsx or *.ods file. Alternatively you can use the text editor or your choice and store it as UTF-8 encoded *.csv, where columns are separated with a comma (,) and data items with blanks or commas are embedded between double quotes (“).

Concerning the location on your disk, cutplace does not impose any requirements on you.

For this tutorial, we assume both the data and CID files are stored in the same folder which and that your current console terminal session already changed to this folder (using for example the command cd).

Running cutplace for the first time¶

Now that we have both a data file and a CID file, we can finally take a look at how cutplace actually works.

Open a terminal and change into the folder where where the example data and CID files are located:

cd .../where/ever/examples

Next let’s try if cutplace has been installed properly:

cutplace --version

This should result in an output similar to:

cutplace 0.8.x

The actual version numbers may vary. If your version of cutplace is older then 0.9.0, consider upgrading to avoid compatibility issues with this tutorial.

If instead this results in an error message, refer to the chapter on Installation about how to setup cutplace.

In case everything worked out so far, let’s finally do what we came here for: validate that our data conform to the interface we just described:

cutplace cid_customers_field_names_only.ods customers.csv

This assumes you used Calc to create the CID. Users of Excel should replace the “.ods” with “.xlsx”, users of text editors with “.csv” respectively.

Summary so far¶

Let’s recap what we learned so far:

You can use cutplace to validate that data conform to a CID.
The CID is a file you can create with Calc, Excel or your favorite text editor.
As a minimum, the CID has to specify the data format and the name of the fields.
Rows describing the data format have to start with “D”.
Rows describing a field have to start with “F”.
Rows starting with an empty column will not be parsed by cutplace and can contain any information that is helpful for human readers to better understand the interface.

Adding examples¶

Most people find it easiest to get a general grasp of something by looking at an example. Cutplace supports this line of thinking by letting you add an examples for a field right after the name:

	Fields

	Name	Example

F	customer_id	1

F	surname	Doe

F	first_name	Jane

F	date_of_birth	1995-11-15

F	gender	female

See: cid_customers_with_examples.ods

Finding an example usually does not require much imagination. In this case we just took the values from the first customer and changed the name to the generic “Jane Doe”.

These examples are entirely optional. If you cannot find a good example for a field (like one containing a database BLOB) or you do not think a real world example does not add any value (like a field containing an encrypted password), feel free to leave it empty.

Allowing fields to be empty¶

So far, every data item had an actual value and none of it was empty. But what if for instance we do not know the date of birth for one of our customers? Consider the following example data file:

See: customers_without_date_of_birth.csv

Two customers do not have a date of birth: Kenneth Tucker in row 4 and Ester Newman in row 10.

Now try to validate these data with the same CID we used before:

cutplace cid_customers_field_names_only.ods customers_without_date_of_birth.csv

This time the output contains the following lines:

INFO:cutplace:validate "examples/customers_without_date_of_birth.csv"
ERROR:cutplace:  customers_without_date_of_birth.csv (R4C4): cannot accept field 'date_of_birth': value must not be empty

The essential part here is:

field 'date_of_birth' must match format: value must not be empty

When you describe a field to cutplace in the CID, it assumes that the data always provide a value for this field. Apparently this is not the case with the data provided, so cutplace complains about it.

But what if there are actually customers we do not yet know the date of birth yet? Not every business transactions requires the date of birth, so this is perfectly valid.

So we have to tell cutplace that this field actually can be empty. This can easily be done by adding another column to the field description, where fields that can be empty are marked with an X:

	Fields

	Name	Example	Empty?

F	customer_id	16

F	first_name	Jane

F	surname	Doe

F	date_of_birth	1995-11-15	X

F	gender	female	X

See: cid_customers_with_empty_fields.ods

Now lets try again with the new CID:

cutplace cid_customers_with_empty_fields.ods customers_without_date_of_birth.csv

This time, no error messages show up and all the data are accepted:

INFO:cutplace:validate "customers_without_date_of_birth.csv"
INFO:cutplace:  accepted 10 rows

Limiting the length of field values¶

Right now we allow the fields to have any length. But what if the data should be processes by another program, which wants to insert the data in a database? Every field in the database has a limit on how many characters it can hold. If there are too many characters, the import will fail. And not always with an error message that makes it easy to backtrack where the broken data came from.

Fortunately, cutplace allows to describe length limits for fields:

	Name	Example	Empty?	Length

F	customer_id	16		2…

F	first_name	Jane		…60

F	surname	Doe		…60

F	date_of_birth	1995-11-15	X	10

F	gender	female	X	4…6

See: cid_customers_with_lengths.ods

Let’s take a closer look at these examples, especially at the meaning of the ellipsis (…) in some of the description of the lengths.

Let’s assume same the customer_id has to have at least 2 characters because one of them is a checksum in order to catch (most) mistyped numbers. In this case, the length is 2....
The first_name and surname can take at most 60 characters, maybe because someone said so way back in the 70s when COBOL ruled the world. To express this as length, use ...60.
The date_of_birth always takes exactly 10 characters because apparently we require it to use leading zeros.

Wait a second, didn’t we state before that the date_of_birth can be empty? Shouldn’t we use 0...10 then? Actually no, because this would also accept dates with a length of 1, 2, 3 and so on until 9. The possibility that date_of_birth can have a length of 0 is already taken care of by the X in the Empty? column.
And finally, the gender can be male or female, so its length is between 4 and 6, which reads as 4...6.

To summarize: lengths are either exact values (like 10) or ranges with a lower and upper limit separated by a colon (like 4...6). Either the lower or upper limit can be omitted (like 2... or ...60).

In case you cannot decide yet on a reasonable limit on a certain field, just leave its entry in the Length column empty.

Now lets try again with the new CID:

cutplace cid_customers_with_lengths.ods customers_without_date_of_birth.csv

As expected, all the data are accepted again.

Field types and rules¶

So far, all the validations have been rather simple and generic. It’s time to reveal the big guns: types and rules.

Let’s take a closer look at the customer_id. Apparently, it’s a number. To be more specific, an integer number with no fractional part. Let’s say the same person who told us that a customer_id has at least two digits now informed us that due a design stemming from the 16 bit era, the highest customer_id is 65535 (the largest number one can represent with 16 bit). Here’s how to express this knowledge with cutplace:

	Name	Example	Empty?	Length	Type	Rule

F	customer_id	16		2:	Integer	10…65535

F	first_name	Jane		:60

F	surname	Doe		:60

F	date_of_birth	1995-11-15	X	10	DateTime	YYYY-MM-DD

F	gender	female	X	2:6	Choice	male, female

See: cid_customers_with_types_and_rules.ods

The column Type can contain one of several available types. The column Rule can hold a text that gives further details about the Type.

Warning

Type names are case sensitive. So when you specify a type, make sure the letters match exactly concerning upper and lower case.

In case of customer_id, the type is Integer. In this case, the rule can specify a valid range. The syntax for the range is the same we’ve been using already for the Length column. So 10...65535 means “between 10 and 65535”.

For gender, the type is Choice which means that every value in this field must be in a list of possible choices specified with the rule. Here, possible choices are male and female.

Finally date_of_birth is of type DateTime. The rule describes the date format using place holders: DD (day), MM (month), YYYY (year including century) and YY (year without century). Any other character must show up literally, for example the . in the rule must show up as . in the value.

This tutorial showcases just a few of the types available.

Checking general conditions¶

So far we learned how to validate the general data format and values of separate fields. But what about conditions that are more sophisticated and require several fields or rows to validate them?

For example, we might want to validate that every customer has a unique customer_id.

For these kind of conditions cutplace supports checks. Here’s how it looks in practice:

	Checks

	Description	Type	Rule

C	customer must be unique	IsUnique	customer_id

See: cid_customers.ods

As you can see, checks require a “C” in the first column.

Next there is a description of the check. This should be a meaningful sentence because it shows up in error messages. It’s a good idea to word them as “Something must be something else” sentences.

The remaining two columns contain the type of the check and the rule. How the rule can look solely depends on the type of the check.

Conclusion¶

You are now familiar with the basic concepts behind cutplace and should be able to use this for writing reasonably complete and sophisticated CIDs.

Tutorial¶

A simple set of customer data¶

Describing data format and and field names¶

Using comments¶

How and where to store the CID¶

Running cutplace for the first time¶

Summary so far¶

Adding examples¶

Allowing fields to be empty¶

Limiting the length of field values¶

Field types and rules¶

Checking general conditions¶

Conclusion¶

cutplace

Navigation

Related Topics

	Description	Type	Rule

C	customer must be unique within branch	IsUnique	branch_id, customer_id