.. _tutorial:

========
Tutorial
========

This chapter gives a gentle introduction and overview to cutplace. It
introduces a simple set of example data and shows how to write a simple
cutplace interface definition (CID) for it which can be used to validate
the data. Starting from there, the CID gets refined and improved to
allow more rigorous validations which detect increasingly subtle errors
in the data.

You can find the files for the examples in the :file:`examples` folder of
the cutplace source distribution. For your convenience they are also
provided as links at the bottom of each data or CID file.

A simple set of customer data
=============================

In order to make use of cutplace, you need some data. For this tutorial,
we will use a simple set of customer data:

.. literalinclude:: ../examples/customers.csv

See: :download:`customers.csv <../examples/customers.csv>`

Because this is hard to read here is how the file looks if you
open it in a spreadsheet application such as Excel or Calc:

.. include:: ../docs/include/customers.rst

Describing data format and and field names
==========================================

Now let's try to describe the format of this file. From our initial
pondering we already know the following facts:

#. The data are provided as comma separated values (CSV).
#. The fields in the data have a meaning we can assign a heading to.

Without much further ado, here's how you can tell these facts to cutplace:

+-+-------------------+---+
+ +Interface: customer+   +
+-+-------------------+---+
+ +                   +   +
+-+-------------------+---+
+ +**Data format**    +   +
+-+-------------------+---+
+D+Format             +CSV+
+-+-------------------+---+
+D+Header             +1  +
+-+-------------------+---+
+ +                   +   +
+-+-------------------+---+
+ +**Fields**         +   +
+-+-------------------+---+
+F+customer_id        +   +
+-+-------------------+---+
+F+first_name         +   +
+-+-------------------+---+
+F+surname            +   +
+-+-------------------+---+
+F+date_of_birth      +   +
+-+-------------------+---+
+F+gender             +   +
+-+-------------------+---+

See: :download:`cid_customers_field_names_only.ods <../examples/cid_customers_field_names_only.ods>`

Let's take a closer look at this.

Row 1 is a heading that simply describes what the document is about: the
interface for customer data.

Row 3 is a section heading to point out that the description of the data
format is about to follow.

Row 4 describes the data format by stating that the ``Format`` is ``CSV``. Did
you notice the ``D`` in the first column? This is a hint for cutplace that the
row contains information about the data format it should be able to process.

Row 5 adds another detail to the data format: data files have a header row
that does not contain data yet. In this case there is only one such row. If
there would no header at all, you could have omitted this row.

Row 7 again is a section heading, this time pointing out that
descriptions about the field formats are about to follow.

The remaining rows then describe the fields that data files must have.
So far we only describe the name of the field, for example
``customer_id``.

.. note::
  A field name must contain only ASCII letters, numbers and underscore
  (_). So blanks, foreign characters such as umlauts and
  punctuation marks are not allowed and will result in an error message
  when cutplace attempts to read the CID.

  Some examples for valid field names:

  * ``customer_id``

  * ``iso_5218_gender``

  * ``CustomerId``

  For comparison, here are a few broken field names

  * ``customer-id`` - error: contains the punctuation mark "``-``";
    use "``_``" instead.

  * ``customer id``  - error: contains a blank; use "``_``" instead.

  * ``123easy`` - error: starts with a numeric digit; use a letter as
    first character.

Using comments
==============

As already pointed out, if a row in the CID starts with an empty column,
cutplace skips it without processing in any way. Actually, this would
mean the same for cutplace:

+-+-------------------+---+
+D+Format             +CSV+
+-+-------------------+---+
+D+Header             +1  +
+-+-------------------+---+
+F+customer_id        +   +
+-+-------------------+---+
+F+first_name         +   +
+-+-------------------+---+
+F+surname            +   +
+-+-------------------+---+
+F+date_of_birth      +   +
+-+-------------------+---+
+F+gender             +   +
+-+-------------------+---+

But it's a lot harder to read, isn't it?

How and where to store the CID
==============================

So how do you store this information? Cutplace is quite flexible here.
The easiest way would be to use a spreadsheet application such as Excel
or OpenOffice.org's Calc and store it as :file:`*.xlsx` or :file:`*.ods`
file. Alternatively you can use the text editor or your choice and store it
as UTF-8 encoded :file:`*.csv`, where columns are separated with a comma (,)
and data items with blanks or commas are embedded between double quotes (").

Concerning the location on your disk, cutplace does not impose any
requirements on you.

.. TODO: Recommendations how to name the CID files.

For this tutorial, we assume both the data and CID files are stored in
the same folder which and that your current console terminal session already
changed to this folder (using for example the command :command:`cd`).

Running cutplace for the first time
===================================

Now that we have both a data file and a CID file, we can finally take a look
at how cutplace actually works.

Open a terminal and change into the folder where where the example data
and CID files are located::

  cd .../where/ever/examples

Next let's try if cutplace has been installed properly::

  cutplace --version

This should result in an output similar to::

  cutplace 0.8.x

The actual version numbers may vary. If your version of cutplace is older then
|release|, consider upgrading to avoid compatibility issues with this
tutorial.

If instead this results in an error message, refer to the chapter on
:doc:`installation` about how to setup cutplace.

In case everything worked out so far, let's finally do what we came here
for: validate that our data conform to the interface we just described::

  cutplace cid_customers_field_names_only.ods customers.csv

This assumes you used Calc to create the CID. Users of Excel should
replace the ":file:`.ods`" with ":file:`.xlsx`", users of text editors with
":file:`.csv`" respectively.

Summary so far
==============

Let's recap what we learned so far:

* You can use cutplace to validate that data conform to a CID.

* The CID is a file you can create with Calc, Excel or your favorite
  text editor.

* As a minimum, the CID has to specify the data format and the name of
  the fields.

* Rows describing the data format have to start with "``D``".

* Rows describing a field have to start with "``F``".

* Rows starting with an empty column will not be parsed by cutplace and
  can contain any information that is helpful for human readers to
  better understand the interface.

Adding examples
===============

Most people find it easiest to get a general grasp of something by
looking at an example. Cutplace supports this line of thinking by
letting you add an examples for a field right after the name:

+-+--------------------+--------------+
+ +Fields              +              +
+-+--------------------+--------------+
+ +*Name*              +*Example*     +
+-+--------------------+--------------+
+F+customer_id         +**1**         +
+-+--------------------+--------------+
+F+surname             +**Doe**       +
+-+--------------------+--------------+
+F+first_name          +**Jane**      +
+-+--------------------+--------------+
+F+date_of_birth       +**1995-11-15**+
+-+--------------------+--------------+
+F+gender              +**female**    +
+-+--------------------+--------------+

See: :download:`cid_customers_with_examples.ods <../examples/cid_customers_with_examples.ods>`

Finding an example usually does not require much imagination. In this
case we just took the values from the first customer and changed the
name to the generic "Jane Doe".

These examples are entirely optional. If you cannot find a good example
for a field (like one containing a database BLOB) or you do not think
a real world example does not add any value (like a field containing an
encrypted password), feel free to leave it empty.

Allowing fields to be empty
===========================

So far, every data item had an actual value and none of it was empty.
But what if for instance we do not know the date of birth for one of
our customers? Consider the following example data file:

.. include:: ../docs/include/customers_without_date_of_birth.rst

See: :download:`customers_without_date_of_birth.csv <../examples/customers_without_date_of_birth.csv>`

Two customers do not have a date of birth: Kenneth Tucker in row 4 and
Ester Newman in row 10.

Now try to validate these data with the same CID we used before::

  cutplace cid_customers_field_names_only.ods customers_without_date_of_birth.csv

This time the output contains the following lines::

  INFO:cutplace:validate "examples/customers_without_date_of_birth.csv"
  ERROR:cutplace:  customers_without_date_of_birth.csv (R4C4): cannot accept field 'date_of_birth': value must not be empty

The essential part here is::

  field 'date_of_birth' must match format: value must not be empty

When you describe a field to cutplace in the CID, it assumes that the
data always provide a value for this field. Apparently this is not the
case with the data provided, so cutplace complains about it.

But what if there are actually customers we do not yet know the date of
birth yet? Not every business transactions requires the date of birth,
so this is perfectly valid.

So we have to tell cutplace that this field actually can be empty. This
can easily be done by adding another column to the field description,
where fields that can be empty are marked with an ``X``:

+-+--------------------+----------+----------+
+ +Fields              +          +          +
+-+--------------------+----------+----------+
+ +Name                +Example   +**Empty?**+
+-+--------------------+----------+----------+
+F+customer_id         +16        +          +
+-+--------------------+----------+----------+
+F+first_name          +Jane      +          +
+-+--------------------+----------+----------+
+F+surname             +Doe       +          +
+-+--------------------+----------+----------+
+F+date_of_birth       +1995-11-15+**X**     +
+-+--------------------+----------+----------+
+F+gender              +female    +**X**     +
+-+--------------------+----------+----------+

See: :download:`cid_customers_with_empty_fields.ods <../examples/cid_customers_with_empty_fields.ods>`

Now lets try again with the new CID::

  cutplace cid_customers_with_empty_fields.ods customers_without_date_of_birth.csv

This time, no error messages show up and all the data are accepted::

  INFO:cutplace:validate "customers_without_date_of_birth.csv"
  INFO:cutplace:  accepted 10 rows


Limiting the length of field values
===================================

Right now we allow the fields to have any length. But what if the data should be
processes by another program, which wants to insert the data in a database?
Every field in the database has a limit on how many characters it can hold. If
there are too many characters, the import will fail. And not always with an
error message that makes it easy to backtrack where the broken data came from.

Fortunately, cutplace allows to describe length limits for fields:

+-+--------------------+----------+------+------------+
+ +Name                +Example   +Empty?+**Length**  +
+-+--------------------+----------+------+------------+
+F+customer_id         +16        +      +**2...**    +
+-+--------------------+----------+------+------------+
+F+first_name          +Jane      +      +**...60**   +
+-+--------------------+----------+------+------------+
+F+surname             +Doe       +      +**...60**   +
+-+--------------------+----------+------+------------+
+F+date_of_birth       +1995-11-15+X     +**10**      +
+-+--------------------+----------+------+------------+
+F+gender              +female    +X     +**4...6**   +
+-+--------------------+----------+------+------------+

See: :download:`cid_customers_with_lengths.ods <../examples/cid_customers_with_lengths.ods>`

Let's take a closer look at these examples, especially at the meaning of
the ellipsis (...) in some of the description of the lengths.

* Let's assume same the ``customer_id`` has to have at least 2 characters
  because one of them is a checksum in order to catch (most) mistyped
  numbers. In this case, the length is ``2...``.

* The ``first_name`` and ``surname`` can take at most 60 characters, maybe
  because someone said so way back in the 70s when COBOL ruled the world.
  To express this as length, use ``...60``.

* The ``date_of_birth`` always takes exactly 10 characters because apparently
  we require it to use leading zeros.

  Wait a second, didn't we state before that the ``date_of_birth`` can be
  empty? Shouldn't we use ``0...10`` then? Actually no, because this would
  also accept dates with a length of 1, 2, 3 and so on until 9. The
  possibility that ``date_of_birth`` can have a length of 0 is already
  taken care of by the ``X`` in the *Empty?* column.

* And finally, the ``gender`` can be ``male`` or ``female``, so its length is between
  4 and 6, which reads as ``4...6``.

To summarize: lengths are either exact values (like ``10``) or ranges with a
lower and upper limit separated by a colon (like ``4...6``). Either the lower
or upper limit can be omitted (like ``2...`` or ``...60``).

In case you cannot decide yet on a reasonable limit on a certain field,
just leave its entry in the *Length* column empty.

Now lets try again with the new CID::

  cutplace cid_customers_with_lengths.ods customers_without_date_of_birth.csv

As expected, all the data are accepted again.

.. TODO: Provide an example data file that causes errors.

Field types and rules
=====================

So far, all the validations have been rather simple and generic. It's
time to reveal the big guns: types and rules.

Let's take a closer look at the ``customer_id``. Apparently, it's a
number. To be more specific, an integer number with no fractional part.
Let's say the same person who told us that a ``customer_id`` has at
least two digits now informed us that due a design stemming from the 16
bit era, the highest ``customer_id`` is 65535 (the largest number one
can represent with 16 bit). Here's how to express this knowledge with
cutplace:

+-+--------------------+----------+------+------+------------+----------------+
+ +Name                +Example   +Empty?+Length+**Type**    +**Rule**        +
+-+--------------------+----------+------+------+------------+----------------+
+F+customer_id         +16        +      +2:    +**Integer** +**10...65535**  +
+-+--------------------+----------+------+------+------------+----------------+
+F+first_name          +Jane      +      +:60   +            +                +
+-+--------------------+----------+------+------+------------+----------------+
+F+surname             +Doe       +      +:60   +            +                +
+-+--------------------+----------+------+------+------------+----------------+
+F+date_of_birth       +1995-11-15+X     +10    +**DateTime**+**YYYY-MM-DD**  +
+-+--------------------+----------+------+------+------------+----------------+
+F+gender              +female    +X     +2:6   +**Choice**  +**male, female**+
+-+--------------------+----------+------+------+------------+----------------+

See: :download:`cid_customers_with_types_and_rules.ods <../examples/cid_customers_with_types_and_rules.ods>`

The column *Type* can contain one of several available types. The column
*Rule* can hold a text that gives further details about the *Type*.

.. warning::

  Type names are case sensitive. So when you specify a type, make sure
  the letters match exactly concerning upper and lower case.

In case of ``customer_id``, the type is ``Integer``. In this case, the
rule can specify a valid range. The syntax for the range is the same
we've been using already for the *Length* column. So ``10...65535``
means "between 10 and 65535".

For ``gender``, the type is ``Choice`` which means that every value in
this field must be in a list of possible choices specified with the
rule. Here, possible choices are ``male`` and ``female``.

Finally ``date_of_birth`` is of type ``DateTime``. The rule describes
the date format using place holders: ``DD`` (day), ``MM`` (month),
``YYYY`` (year including century) and ``YY`` (year without century). Any
other character must show up literally, for example the ``.`` in the
rule must show up as ``.`` in the value.

This tutorial showcases just a few of the types available.

.. seealso::

  :ref:`field-format-decimal`
    A field format to describe decimal numbers with a fractional part.

  :ref:`field-format-pattern`
    A field format to match patterns using asterisk (*) and question
    mark (?) as placeholders.

  :ref:`field-format-regex`
    A field format to match patterns using regular expressions.


Checking general conditions
===========================

So far we learned how to validate the general data format and values
of separate fields. But what about conditions that are more sophisticated
and require several fields or rows to validate them?

For example, we might want to validate that every customer has a unique
``customer_id``.

For these kind of conditions cutplace supports *checks*. Here's how it
looks in practice:

+-+--------------------------------------+-------------+----------------------+
+ +Checks                                +             +                      +
+-+--------------------------------------+-------------+----------------------+
+ +*Description*                         +*Type*       +*Rule*                +
+-+--------------------------------------+-------------+----------------------+
+C+customer must be unique               +IsUnique     +customer_id           +
+-+--------------------------------------+-------------+----------------------+

See: :download:`cid_customers.ods <../examples/cid_customers.ods>`

As you can see, checks require a "C" in the first column.

Next there is a description of the check. This should be a meaningful
sentence because it shows up in error messages. It's a good idea to word
them as "Something must be something else" sentences.

The remaining two columns contain the type of the check and the rule. How
the rule can look solely depends on the type of the check.

.. seealso::

  :ref:`check-distinct-count`
    A check to validate that the number of distinct values in a field
    meets a specified condition.

  :ref:`check-is-unique`
    A check to validate that a field value or a combination of field
    values is unique in each row compared with all other rows.

It is also possible to check for composite keys spawning several fields. As an
example consider multiple branches of a company where each branch can have a
customer with a ``customer_id`` of for example ``16``:

+-+--------------------+----------+------+------+------------+----------------+
+ +Name                +Example   +Empty?+Length+Type        +Rule            +
+-+--------------------+----------+------+------+------------+----------------+
+F+**branch_id**       +123       +      +1...  +Integer     +1...99999       +
+-+--------------------+----------+------+------+------------+----------------+
+F+customer_id         +16        +      +2...  +Integer     +10...65535      +
+-+--------------------+----------+------+------+------------+----------------+
+F+first_name          +Jane      +      +...60 +            +                +
+-+--------------------+----------+------+------+------------+----------------+
+F+surname             +Doe       +      +...60 +            +                +
+-+--------------------+----------+------+------+------------+----------------+
+F+date_of_birth       +1995-11-15+X     +10    +DateTime    +YYYY-MM-DD      +
+-+--------------------+----------+------+------+------------+----------------+
+F+gender              +female    +X     +2...6 +Choice      +male, female    +
+-+--------------------+----------+------+------+------------+----------------+

To check that the customer_id is unique within each branch, use:

+-+--------------------------------------+-------------+-----------------------+
+ +Description                           +Type         +Rule                   +
+-+--------------------------------------+-------------+-----------------------+
+C+customer must be unique within branch +IsUnique     +branch_id, customer_id +
+-+--------------------------------------+-------------+-----------------------+

Possibly branches are 5 digit codes however in practice it might be known that
there are at most 100 branches at the same time. To express this, use:

+-+---------------------------------------+--------------+-----------------+
+ +Description                            +Type          +Rule             +
+-+---------------------------------------+--------------+-----------------+
+C+distinct branches must be within limit +DistinctCount +branch_id <= 100 +
+-+---------------------------------------+--------------+-----------------+


Conclusion
==========

You are now familiar with the basic concepts behind cutplace and should
be able to use this for writing reasonably complete and sophisticated
CIDs.