=======================================
Writing a Cutplace Interface Definition
=======================================

This chapter describes all aspects of writing a CID and can be used as
reference. In case you are looking for a gentle introduction, see the
:ref:`tutorial`.


Parts of a CID
==============

CID's for cutplace focus on the data specific parts. They describe tabular data
split in rows. Each row consists of fields. The number of fields per row must
be the same for all rows (except for optional header rows). The meaning of a
field at a certain column must be the same for each row. For example, once you
declare the field in column 3 to be a time, row 17 can not store the street
address in column 3 without violating the requirements for time fields.

1. The data format: The general format for data files, such as line separator,
   character encoding, quote character and so on. These properties concern the
   whole file and each data item in it.

2. Fields: These are the smallest unit of information in a data file. A file
   consists 0, 1 or many lines. Each line is divided into 1 or more fields,
   which carry the actual information. For files with structured data, a
   certain field retains its meaning even when inspected in a different line.
   For example, if the third field in line 1 is a phone number, so has to be
   the third field in line 2, 3, 17, 23459 and so on. Nevertheless, the
   actual phone number can differ from line to line.

3. Optional checks, which are rules that have to be met across the whole
   document or several fields in a row. For example, a customer ID might be
   supposed to be unique and therefor must occur only once within a file.

4. Comments are intended for human readers. Cutplace does not process them.
   Typically they describing the meaning of things or the motivation for
   certain decisions. Another use is the description the source of certain data
   items to simplify error analysis.

Below you will find a detailed description of each of these concepts
accompanied by many examples.

.. index:: data format

.. _data-formats:

Data formats
============

The data format describes general properties of the data. Here is an example:

Example data format

==  ==============  ===========
..  Property        Value
==  ==============  ===========
D   Format          Delimited
D   Encoding        UTF-8
D   Line delimiter  LF
D   Item delimiter  ,
==  ==============  ===========

This basically says that the data are provided as a text file using a
delimiter to separate field items from each other. Characters are encoded
using `UTF-8 <https://en.wikipedia.org/wiki/UTF-8>`_, a character encoding
capable of representing all possible Unicode characters. Rows are separated
using linefeed characters (ASCII code 10) and field items are separated using
a comma (,).

The remainder of this section describes the supported formats and available
properties for them.

.. index:: single: data format; delimited

Delimited data
--------------

For data, both lines and columns are delimited by certain characters.

Example for delimited data using visible ASCII characters and Cyrillic (Unicode
0x0400-0x04ff)

==  ===================  =====================
..  Property             Value
==  ===================  =====================
D   Format               Delimited
D   Encoding             UTF-8
D   Line delimiter       CRLF
D   Item delimiter       ,
D   Quote character      "
D   Escape character     "
D   Decimal separator    .
D   Thousands separator  ,
D   Allowed characters   32...128, 1024...1280
==  ===================  =====================

In case Format is Delimited, the following properties have to be specified:

.. index:: pair: data format property; encoding

Encoding
    The character encoding. The most common values will be ASCII, CP1252
    (for many western countries), UTF-8 (for Unicode), CP850 (used by MS DOS
    in many western countries).

.. index:: pair: data format property; line delimiter

Line delimiter
    This describes which character or character sequence is used to mark the
    end of a line. Possible values are:

    * LF - "line feed", ASCII code 10, used by Unix based platforms and others,
      for example Mac OS X, Linux, Solaris BSD-variants and Amiga OS.

    * CRLF - "carriage return and linefeed", two characters with ASCII code 13
      and 10, used for example by Windows and MS DOS.

    * CR - "carriage return", ASCII code 13, used by Mac OS Classic.

    * Any - accepts any of the above as line delimiter

.. index:: pair: data format property; item delimiter

Item delimiter
    The character used to separated data items from each other, for example:

	==  ==================  ========  ===============================
	..  Property            Value     Note
	==  ==================  ========  ===============================
	D   Item delimiter      ``","``
	D   Item delimiter      ``44``    ASCII code of ``","``
	D   Item delimiter      ``0x2c``  ASCII code of ``","`` in hex format
	D   Item delimiter      ``,``     Deprecated syntax, enclose in double quotes instead
	D   Item delimiter      ``"\t"``  Escaped text indicating a tabulator
	D   Item delimiter      ``Tab``   Same as ``"\t"`` but using a more legible symbolic name
	==  ==================  ========  ===============================

.. index:: pair: data format property; quote character

Quote character
    The character used to surround items with that contain delimiters or while
    space, for example double quote (") or single quote (').

    TODO: How to specify "no quoting"?

.. index:: pair: data format property; escape character

Escape character
    The escape character necessary to use the quote character in item values.
    Possible values are: double quote (").

.. index:: pair: data format property; decimal separator

.. _decimal-separator:

Decimal separator
	The character to separate the fractional part of a number, for example
	in ``17.23``. Typical values are: dot (.) and comma (,). The default is
	dot (.).

.. index:: pair: data format property; thousands separator

.. _thousands-separator:

Thousands separator
    The character to optionally group digits in large numbers, for example in
    ``12,345,678``. Typical values are: comma (,), dot (.) and the space
    character. By default, no character can be used to group digits.

.. index:: pair: data format property; allowed characters

Allowed characters
    This range describing the characters allowed for data items. Each number
    represents the decimal Unicode value of a character that can be used. With
    the help of colons (:) you can easily specify several characters. For
    example, ``32...128`` means "between 32 and 128".

    You can find more information on how to specify ranges in :ref:`ranges`.

.. index:: single: data format; CSV

CSV (comma separated values) as a special case for delimited data. Despite
the name, CSV data regularly use other separators than comma (,), so cutplace
treats them the same as delimited data. In fact, you can specify it in CID:

==  ==============  =====
..  Property        Value
==  ==============  =====
D   Format          CSV
==  ==============  =====

For cutplace, there is no difference between "delimited" and "CSV".


.. index:: pair: data format; Excel

.. _format-excel:

Excel data
----------

Excel is a spreadsheet application and part of Microsoft Office.

Minimal example for Excel data

==  ========  =====
..  Property  Value
==  ========  =====
D   Format    Excel
==  ========  =====

Additionally there are a couple of optional properties.

A more advanced example for Excel data

==  ========  =====
..  Property  Value
==  ========  =====
D   Format    Excel
D   Header    2
D   Sheet     5
==  ========  =====

.. index:: pair: data format property; header

.. _header:

The property *header* describes how many rows should be skipped before the data
to validate start. It is optional and defaults to 0, meaning there is no header
and the first row already contains data.

.. index:: pair: data format property; sheet

.. _sheet:

The property *sheet* specifies from which sheet the data should be read. It is
only required in case a workbook contains more than one sheet and the data to
validate are located in the second or any later sheet. This property defaults
to 1 meaning the first sheet.

Excel uses special ways to internally store dates and times, so what you
see on the screen in generally is not what cutplace gets when it reads
Excel data. To avoid confusion, here's short list of how certain data
from Excel will look to cutplace:

.. index:: pair: Excel type; currency
.. index:: pair: Excel type; date
.. index:: pair: Excel type; time
.. index:: pair: Excel type; percent

Mapping between Excel types and cutplace

=============  =============  ===================
Excel type     cutplace type  rule
=============  =============  ===================
Currency       Decimal
Date           DateTime       YYYY-MM-DD 00:00:00
Date and time  DateTime       YYYY-MM-DD hh:mm:ss
Percent        Decimal
Time           DateTime       hh:mm:ss
=============  =============  ===================

.. warning::
  Excel cannot represent integer numbers exactly, so you better use
  *Decimal* instead of *Integer* in the CID. In case you do use *Integer*
  be prepared for weird validation errors and rounding issues.

.. index:: pair: data format; fixed

Fixed data
----------

Fixed data reserve a certain number of characters per field. No delimiters are
necessary.

Example for fixed data format

==  ==================  ======
..  Property            Value
==  ==================  ======
F   Format              Fixed
F   Encoding            CP1252
F   Line delimiter      LF
F   Allowed characters  0...
==  ==================  ======

.. index:: pair: data format; ODS

ODS data (open document spreadsheet)
------------------------------------

The Open Document Spreadsheet (ODS) file format is supported by several
application, for instance OpenOffice.org's Calc.

Minimal example for ODS data

==  ========  =====
..  Property  Value
==  ========  =====
F   Format    ODS
==  ========  =====

The properties header and sheet have the same meaning as described in
:ref:`format-excel`.

A more advanced example for ODS data

==  ========  =====
..  Property  Value
==  ========  =====
F   Format    ODS
F   Header    2
F   Sheet     5
==  ========  =====

.. index:: field format

.. _field-formats:

Field formats
=============

This section describes the different field formats.

Overview
--------

The field format section of the CID contains rows with the following columns:

#. The letter "F" to indicate that the remaining columns describe a field
   format.

#. The name of the field. It must start with an ASCII letter and continue with
   letters, numbers and underscores (_), for example
   ``customer_id``.

#. An optional example value for the field. This is for documentation purpose
   only and can be omitted for fields where there is no meaningful example (such
   as a field containing a BLOB). In case a value is specified though, it must
   be a valid example conforming to all the rules for this field.

#. A flag that indicates if the field is allowed to be empty. ``X`` means that
   the field can be empty, no text means that the field always must contain at
   least some data.

#. The optional length of the field in characters.  For separated formats, this is
   optional and takes the form ``lower_limit:upper_limit``.  For example,
   ``10...20`` means that values in this field must contains at least 10
   characters and at most 20. It is also possible to specify only a lower or
   upper limit, for example ``10...`` means at least 10 characters ans ``...20``
   means at least 20 characters.  Furthermore the length can be a single number
   with any colon (:), meaning that the length must match this number exactly.
   For fixed formats, this column takes a number that specifies the exact length
   of the field, for example ``50``.

#. The optional type of the field, for example ``Text``, ``Integer``, ``DateTime`` and
   others. Refer to the sections below for detailed descriptions of these types. If you
   do not specify a type, ``Text`` is used.

#. A rule depending on the type further describing the field.  For example, a
   field of type DateTime requires an exact date or time format such as
   ``DD.MM.YYYY``.

The remaining columns are not parsed by cutplace and can contain any text you
like, for example a description of the meaning of the field or details about
from where the data originate.

Simple examples for various field formats

==  =============  ==========  ======  ==========  ========  ==========
..  Name           Example     Empty   Length      Type      Rule
==  =============  ==========  ======  ==========  ========  ==========
F   customer_id    123456                          Integer   1...999999
F   surname        Miller              1...60      Text
F   date_of_birth  1969-11-03  X                   DateTime  YYYY-MM-DD
==  =============  ==========  ======  ==========  ========  ==========

.. index:: double: field format; Text

Text
----

The Text type describes a field that can contain any letters, digits and other
characters.

Examples for Text fields

==  =======  =======  =====  ======  ====  ====
..  Name     Example  Empty  Length  Type  Rule
==  =======  =======  =====  ======  ====  ====
F   surname  Miller          1...60  Text
==  =======  =======  =====  ======  ====  ====

.. index:: double: field format; Integer

Integer
-------

The Integer type describes a field that can contain decimal numbers without any
fractional part.

Examples for Integer fields

==  ======  =======  =====  ======  =======  =========
..  Name    Example  Empty  Length  Type     Rule
==  ======  =======  =====  ======  =======  =========
F   height  3798                    Integer  0...8848
F   weight  72              0...    Integer  0...
F   id      1337            5       Integer  1...99999
==  ======  =======  =====  ======  =======  =========

.. index:: double: field format; Decimal

.. _field-format-decimal:

Decimal
-------

The Decimal type describes a field that can contain decimal numbers
including a fractional part. Similar to Integer, the rule allows to
specify a range and implicitely a precision.

Examples for Decimal fields

==  ==========  =======  =====  ======  =======  ====================  ==================================
..  Name        Example  Empty  Length  Type     Rule                  Note
==  ==========  =======  =====  ======  =======  ====================  ==================================
F   balance     -123.45                 Decimal  -99999.99...99999.99
F   percentage  17.23                   Decimal  0...100.00
F   size        28.34                   Decimal  1...7.33, 8.4...183   same as 1.00...7.33, 8.40...183.00
F   something                           Decimal                        use default (see below)
==  ==========  =======  =====  ======  =======  ====================  ==================================

In case the various parts of the range differ in their scale and precision,
the respective maximum is used for the collected scale and precision. The
example field can have 3 digits before the dot (due the value ``183`) and
and 2 digits after the dot (due the value ``7.33``), resulting in a scale
of 5 digits with a precision of 2 digits.

In case no rule is specified (as with the example field "something"), a
default range between 9999999999999999999.999999999999 and
-9999999999999999999.999999999999 is assumed, meaning a scale of 31 digits
with a precision of 12 digits.

Technically the number of digits is limited only by the available memory.

.. index:: decimal separator

In case the numbers use a comma (",") or any other character to separate the
fractional part, set the data format property
:ref:`decimal separator <decimal-separator>` accordingly.

.. index:: thousands separator

In case the numbers use an additional separator to group digits, set the data
format property :ref:`thousands separator <thousands-separator>` accordingly.

.. index:: double: field format; Choice
.. _choice-field:

Choice
------

The Choice type describes a field that can contain on value out of a set of
possibly values.

Examples for Choice fields

==  ==========  =======  =====  ======  ======  ================================================
..  Name        Example  Empty  Length  Type    Rule
==  ==========  =======  =====  ======  ======  ================================================
F   color       red                     Choice  "red", "green", "blue"
F   iso_gender  male                    Choice  "male", "female", "unknown", "other"
F   department  sales                   Choice  "accounting", "development", "sales", "shipping"
==  ==========  =======  =====  ======  ======  ================================================

.. index:: double: field format; Constant
.. _constant-field:

Constant
--------

The Constant type describes a field that can contain a specific value - and
nothing else. The rule is a single Python token describing the expected value
as string or number.

Examples for Constant fields

==  ============  =======  =====  ======  ========  =======
..  Name          Example  Empty  Length  Type      Rule
==  ============  =======  =====  ======  ========  =======
F   department    sales                   Constant  "sales"
F   always_empty           X              Constant
F   kind_id                               Constant  3
F   size                                  Constant  1.23
==  ============  =======  =====  ======  ========  =======

The mark in the *Empty* flag must be set only if the constant value is empty.
To describe a field that can contain a constant value but might also be empty
use a :ref:`choice-field` field:

Example optional constant using Choice

==  ==========  =======  =====  ======  ==========  =======
..  Name        Example  Empty  Length  Type        Rule
==  ==========  =======  =====  ======  ==========  =======
F   department  sales    **X**          **Choice**  "sales"
==  ==========  =======  =====  ======  ==========  =======

.. index:: double: field format; DateTime

.. _field-format-datetime:

DateTime
--------

The DateTime type describes a field that can contain a date and/or time in a
specified format.

To describe a date, use the following place holders:

* DD: the day (a number between 1 and 31)

* MM: the numeric month (a number between 1 and 12)

* YYYY: the year including the century (a number between 1 and 9999)

* YY: the year without century

To describe a time, use the following place holders:

* hh: hours (a number between 0 and 23)

* mm: minutes (a number between 0 and 59)

* ss: seconds, a number between 0 and 61; note that 60 and 61 are valid values
  because of possible leap seconds.

Leading zeros are ignored. Any other characters will be interpreted as
separators and have to appear in the data as specified.

Examples for DateTime fields

==  ===============  ==========  =====  ======  ========  ==========
..  Name             Example     Empty  Length  Type      Rule
==  ===============  ==========  =====  ======  ========  ==========
F   date_of_birth    1969-11-03                 DateTime  YYYY-MM-DD
F   time_of_arrival  17:23                      DateTime  hh:mm
==  ===============  ==========  =====  ======  ========  ==========

.. index:: double: field format; Pattern

.. _field-format-pattern:

Pattern
-------

The Pattern type is similar to the Text type but additionally allows to use
special characters as place holders:

* "?" means "exactly 1 character".

* "*" means "none or any characters".

* "[sequence]" means any character in sequence.

* "[!sequence]" means any character not in sequence.

A ``sequence`` can be a list of characters such as ``[abc123]`` or a range
like ``[a-z]``.

Examples for pattern fields

==  ============  ===========  ======  ======  =======  ============
..  Name          Example      Empty   Length  Type     Rule
==  ============  ===========  ======  ======  =======  ============
F   dos_filename  cmd.exe              1...12  Pattern  ?*.*
F   branch_id     B123-abc-x3                  Pattern  B???-????-?*
==  ============  ===========  ======  ======  =======  ============

.. index:: double: field format; RegEx

.. _field-format-regex:

RegEx
-----

The RegEx type is similar to the Pattern type but allows more sophisticated
place holders by describing a regular expression. The syntax available is
described in the chapter on "Regular expression operations" of the Python
documentation, available from http://docs.python.org/library/re.html.

Examples for RegEx fields

==  =====  ================  =====  ======  =====  ================================================
..  Name   Example           Empty  Length  Type   Rule
==  =====  ================  =====  ======  =====  ================================================
F   email  some@example.com                 RegEx  ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$ [#fn1]_
==  =====  ================  =====  ======  =====  ================================================

.. index:: checks

Checks
======

Checks are rules that cannot be expressed easily with the rules available for
data formats and field formats. In general checks validate conditions that can
be only be met by looking at several fields in a row or the whole document. In
the CID, a row describing the check requires the following columns:

#. A human readable description of the check that will be used in the error
   message in case the check fails. Most of the time this will be a short
   sentence of the template "something must/have something". For instance,
   "``customer must be unique``".

#. The type of the check as described in one of the sections below, for
   example ``DistinctCount`` or ``IsUnique``.

#. A rule describing the actual check to perform. The contents of this field
   highly depend on the check type specified in the previous column. For
   example, the IsUnique check requires the field(s) to be checked for
   uniqueness like "``branch_id, customer_id``"

The remainder of this section describes the available checks in detail and
gives specific examples.

.. index:: pair: checks; DistinctCount

.. _check-distinct-count:

DistinctCount
-------------

Purpose: Validate that the number of different values for a certain field is
within expected limits.

The rule column describes the field to check and the limit is must meet.
Example check for a limited number of different values within a field shows how
to make sure that the data contain at most 5 different branch_ids.

Example check for a limited number of different values within a field.

==  ======================================  =============  =============
..  Description                             Type           Rule
==  ======================================  =============  =============
C   distinct branches must be within limit  DistinctCount  branch_id < 5
==  ======================================  =============  =============

To describe the rule you can use any comparison operator or mathematical
expression available to the Python language.

.. index:: pair: checks; IsUnique

.. _check-is-unique:

IsUnique
--------

Purpose: Validate that values for a field or a combination of fields occurs
only once. This enables to detect duplicate or contradicting data.

The "Rule" column describes the field that must contain only unique values.
Example check for unique values within a single field shows how to specify that
two customers must not have the same ID numbers.

Example check for unique values within a single field.

==  =======================  ========  ===========
..  Description              Type      Rule
==  =======================  ========  ===========
C   customer must be unique  IsUnique  customer_id
==  =======================  ========  ===========

It could also be possible that customers actually may have the same ID number
as long as they are assigned to different branches. In this case, only the
combination of branch_id and customer_id must be unique.  Example check for
unique values within a combination of fields shows how to describe a check for
this: simply list all the necessary fields, separated by a comma (,) sign.

Example check for unique values within a combination of fields.

==  =======================  ========  ======================
..  Description              Type      Rule
==  =======================  ========  ======================
C   customer must be unique  IsUnique  branch_id, customer_id
==  =======================  ========  ======================

Comments
========

Comments can show up in the CID at any line or column cutplace does not parse.
In particular this constitutes:

* Lines that have an empty first column. Remember that a D means details about
  the data format, F about the field format and C describes checks.

* Columns that are past the columns needed by cutplace. For example, in a line
  describing a data format property, cutplace parses only the first three (D,
  Property name, value). Because of that you can write any text starting with
  column number 4.

.. index:: ranges

.. _ranges:

Ranges
======

At several locations in the CID you can specify ranges. For example as value
for the "Allowed characters" property of a data format or as length of a field
format. Example ranges shows a couple of examples for ranges and explains their
meaning.

Example ranges.

====================  =======================================================================================================================================
Example               Description
====================  =======================================================================================================================================
``5...20``            Between 5 and 20
``6...``              At least 6
``...7``              At most 7. Sample accepted values are -5, 0, 4 or 7.  Sample rejected values would be 8, 17, or 723.
``8``                 Exactly 8, which is the only accepted value. Anything else is rejected.
``2, 4, 6, 8``        One of the values specified, meaning 2, 4, 6 or 8.  Anything else is rejected, including 3, 5 and 7.
``20...30, 40...50``  Everything between 20 and 30 or between 40 and 50. Sample accepted values are 20, 27, 43 and 50. Sample rejected values are 19, 31, 55.
====================  =======================================================================================================================================

Essentially ranges are one or more values (separated by a comma (``,``)) that
are either numeric constant or a lower and upper limit separated by an
ellipsis (``...``). You can omit the lower or upper limit, in which case
cutplace will use a sensible default depending on the context. For instance, a
length of ``...20`` will use 0 as lower limit, whereas a field format of type
``Integer`` with a rule of ``...20`` will use the smallest number possible 32
integer number which is -2147483648.

It is possible to use hexadecimal notation by starting the number with ``0x``,
for instance:

==================  ================
Example             Same as
==================  ================
``0x0f``            ``15``
``0x10``            ``16``
``0xabcd``          ``43981``
``10...0x10``       ``10...16``
``...-0xDeadBeef``  ``...-3735928559``
==================  ================

You can also use single letters to specify range values, which are treated
the same as the numeric ASCII or Unicode value:

========================  =====================
Example                   Same as
========================  =====================
``"A"..."Z"``             ``65...90``
``"A"..."Z", "a"..."z"``  ``65...90, 97...122``
========================  =====================

For unprintable letters and Unicode characters you can use Python escape
sequences:

====================  =============================================================
Example escaped text  Same as
====================  =============================================================
``"\t"``              ``9`` (tabulator)
``"\\"``              ``92`` (backslash)
``"\'"``              ``39`` (single quote)
``"\""``              ``34`` (double quote)
``u"\u00dc"``         ``220`` (the Unicode character 220, also known as "Umlaut U")
====================  =============================================================

Additionally there are a few symbolic names that are easier to read than the
rather cryptic escape sequences using a backslash:

=============  ============  ======
Symbolic name  Escaped text  Number
=============  ============  ======
``Cr``         ``"\r"``      13
``Ff``         ``"\f"``      12
``Lf``         ``"\n"``      10
``Tab``        ``"\t"``      9
``Vt``         ``"\v"``      11
=============  ============  ======

For reasons of backward compatibility with version 0.7 and earlier, you can also use
a colon (``:``) in place of the ellipsis (``...``). Additionally you can use the single
character representation of ellipsis (``"\u2026"``) in place of three dots.

.. rubric:: Footnotes

.. [#fn1] Validate that field value is an email address as described in `how to find or validate an email address <http://www.regular-expressions.info/email.html>`_