Application programmer interface

Overview

In addition to the cutplace command line tool, all functions are available through a Python API. For a complete reference of all public classes and functions, refer to the Module Index.

This chapter describes how to perform a basic validation of a simple CSV file containing data about some customers. It also explains how to extend cutplace’s field formats and checks to implement your own.

Logging

Cutplace uses Python’s standard logging module. This provides a familiar and powerful way to watch what cutplace is doing. However, you have to set up logging properly to get the most out of it.

For a quick start, set up your application’s log messages to go to the console and show only information, warning and errors, but no debug messages:

>>> import logging
>>> logging.basicConfig(level=logging.INFO)

Next, trim cutplace’s logging to show only warnings and errors, as you might not be particularly interested in what cutplace does internally during a validation:

>>> logging.getLogger('cutplace').setLevel(logging.WARNING)

This should be enough to get you going. To learn more about logging, take a look at the logging chapter of the Python library documentation.
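To check that the two calls above have the intended effect, here is a small sketch that uses only the standard library. It assumes nothing about cutplace beyond the logger name 'cutplace' used above; the child logger name 'cutplace.example' is made up for illustration.

```python
import logging

# Application-wide default: show INFO and above.
logging.basicConfig(level=logging.INFO)
# Quiet down everything logged under the "cutplace" logger.
logging.getLogger('cutplace').setLevel(logging.WARNING)

cutplace_log = logging.getLogger('cutplace')
# Hypothetical child logger; it has no level of its own and therefore
# inherits WARNING from its parent "cutplace".
child_log = logging.getLogger('cutplace.example')

info_enabled_for_cutplace = cutplace_log.isEnabledFor(logging.INFO)
info_enabled_for_child = child_log.isEnabledFor(logging.INFO)
warning_enabled_for_child = child_log.isEnabledFor(logging.WARNING)
```

Because logger levels are inherited along the dotted name hierarchy, a single setLevel() call on 'cutplace' is enough to silence informational messages from all loggers below it.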

Basic usage

Reading a CID

The class cutplace.Cid represents a CID. In case you have a CID stored in a file and want to read it, use:

>>> import os.path
>>> import cutplace
>>>
>>> # Compute the path of a test file in a system independent manner,
>>> # assuming that the current folder is "docs".
>>> cid_path = os.path.join(os.pardir, 'examples', 'cid_customers.ods')
>>> cid = cutplace.Cid(cid_path)
>>> cid.field_names
['customer_id', 'surname', 'first_name', 'date_of_birth', 'gender']

This is the easiest way to describe an interface. The input document is human readable even for non-coders and quite simple to edit and maintain. It also keeps declaration and validation in separate files.

Validating data

Now that we know how our data are supposed to look, we want to validate and optionally process them. The easiest way to do so is to use the two functions cutplace.validate() and cutplace.rows(). Both of them take two parameters: the path to a CID and the path to the data to validate or read. For example:

>>> valid_data_path = os.path.join(os.pardir, 'examples', 'customers.csv')
>>> cutplace.validate(cid_path, valid_data_path)

If the data are valid, cutplace.validate() seemingly does nothing. For broken data, it raises cutplace.errors.DataError.

To also process the data after each row has been validated, use:

>>> for row in cutplace.rows(cid_path, valid_data_path):
...     pass  # We could also do something useful with the data in ``row`` here.

We could easily extend the loop body to process the data in some meaningful way such as inserting them in a database.

Instead of paths to files, both functions also accept a cutplace.Cid and/or a file-like object ready to read:

>>> import io
>>> cid = cutplace.Cid(cid_path)
>>> with io.open(valid_data_path, 'r', encoding=cid.data_format.encoding, newline='') as data_stream:
...     cutplace.validate(cid, data_stream)

If you need more control over the validation or reading process, take a look at the cutplace.Reader class. It provides a simple generator function cutplace.Reader.rows() that returns all data rows. If you are familiar with Python’s csv.reader(), you already know how to use it.

Dealing with errors

So far we only had to deal with valid data. But what happens if the data do not conform to the CID? Let’s take a look at it:

>>> import cutplace.errors
>>> broken_data_path = os.path.join(os.pardir, 'tests', 'data', 'broken_customers.csv')
>>> cutplace.validate(cid, broken_data_path)
Traceback (most recent call last):
    ...
cutplace.errors.FieldValueError: broken_customers.csv (R3C1): cannot accept field 'customer_id': value must be an integer number: 'abcd'

Apparently the first broken data item causes the validation to stop with a cutplace.errors.FieldValueError, which is a descendant of cutplace.errors.CutplaceError. In many cases this is what you want.

Sometimes, however, an application is required to process all valid data and put invalid data aside for further examination, for example by writing them to a log file. This is easy to implement using cutplace.rows() with the optional parameter on_error='yield'. With this enabled, the generator always yields a value, even for broken rows. The difference is that a broken row does not result in a list of values but in a result of type cutplace.errors.DataError. It is up to you to detect this and process the different kinds of results properly.

Here is an example that prints any data related errors detected during validation:

>>> broken_data_path = os.path.join(os.pardir, 'tests', 'data', 'broken_customers.csv')
>>> for row_or_error in cutplace.rows(cid, broken_data_path, on_error='yield'):
...     if isinstance(row_or_error, Exception):
...         if isinstance(row_or_error, cutplace.errors.CutplaceError):
...             # Print data related error details and move on.
...             print('%s' % row_or_error)
...         else:
...             # Let other, more severe errors terminate the validation.
...             raise row_or_error
...     else:
...         pass  # We could also do something useful with the data in ``row_or_error`` here.
broken_customers.csv (R3C1): cannot accept field 'customer_id': value must be an integer number: 'abcd'
broken_customers.csv (R4C5): cannot accept field 'gender': value is 'unknown' but must be one of: 'female' or 'male'
broken_customers.csv (R5C4): cannot accept field 'date_of_birth': date must match format YYYY-MM-DD (%Y-%m-%d) but is: '17.04.1954' (time data '17.04.1954' does not match format '%Y-%m-%d')

Note that the reader can also raise other exceptions, for example IOError in case the file cannot be read at all or UnicodeError in case the encoding does not match. You should not continue after such errors as they indicate a problem that is not related to the data but to the specification or the environment.

The on_error parameter can also take the values 'raise' (the default, which raises a cutplace.errors.CutplaceError on encountering the first error as described above) and 'continue', which silently ignores any error and moves on to the next row. The latter can be useful while prototyping a new application when CIDs and data are in a constant state of flux. In production code, on_error='continue' mainly represents a very efficient way to shoot yourself in the foot.
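To make the difference between the three policies concrete, here is a minimal sketch that runs without cutplace. The DataError class, parse_row() and rows() below are stand-ins written for this illustration only; they mimic the behavior described above and are not cutplace's implementation.

```python
class DataError(Exception):
    """Stand-in for a data related error (not the cutplace class)."""


def parse_row(row):
    # Hypothetical validation: the first item must be an integer number.
    try:
        return [int(row[0])] + list(row[1:])
    except ValueError:
        raise DataError('value must be an integer number: %r' % row[0])


def rows(raw_rows, on_error='raise'):
    """Yield validated rows, treating broken rows according to ``on_error``."""
    assert on_error in ('continue', 'raise', 'yield')
    for raw_row in raw_rows:
        try:
            result = parse_row(raw_row)
        except DataError as error:
            if on_error == 'raise':
                raise  # Stop at the first broken row.
            if on_error == 'continue':
                continue  # Silently skip the broken row.
            result = error  # 'yield': hand the error to the caller.
        yield result


raw_rows = [['1', 'Beck'], ['abcd', 'Lopez'], ['3', 'Rose']]
accepted = list(rows(raw_rows, on_error='continue'))
mixed = list(rows(raw_rows, on_error='yield'))
```

With 'continue' the caller never learns that the second row was dropped, while 'yield' delivers three results, the second of which is an exception instance that the caller has to detect.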

Processing data

As a first step, we should figure out where in each row we can find the first name and the surname. We need to do this only once so this happens outside of the processing loop. The names used to find the indices must match the names used in the CID:

>>> first_name_index = cid.field_index('first_name')
>>> surname_index = cid.field_index('surname')

Now we can read the data just like before. Instead of a simple pass loop we obtain the first name from row and check if it starts with 'T'. If so, we compute the full name and print it:

>>> for row in cutplace.rows(cid, valid_data_path):
...     first_name = row[first_name_index]
...     if first_name.startswith('T'):
...         surname = row[surname_index]
...         full_name = surname + ', ' + first_name
...         print(full_name)
Beck, Tyler
Lopez, Tyler
Rose, Tammy

Of course nothing prevents you from doing more glamorous things here like inserting the data into a database or rendering them to a dynamic web page.

Partial validation

If performance is an issue, validation of field formats and row checks can be limited to a specified number of rows using the parameter validate_until. Any integer value greater than 0 specifies the number of rows after which validation should stop. None means that the whole input should be validated (the default) while the number 0 specifies that no row should be validated.

Functions that support validate_until are:

  • cutplace.validate()
  • cutplace.rows()
  • cutplace.Reader.__init__()

Pure validation functions such as cutplace.validate() completely stop processing the input after reaching the limit while reading functions such as cutplace.rows() keep producing rows - just without validating them.

A typical use case would be enabling full validation during testing and reducing validation to the first 100 rows in the production environment. Ideally this would detect all errors during testing (when performance is less of an issue) and quickly process the data in production while still detecting errors early in the data.
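The three kinds of values for validate_until can be summarized in a few lines of plain Python. The function is_validated() is hypothetical and exists only for this sketch; it also assumes that rows are counted starting with 1, which the text above does not state explicitly.

```python
def is_validated(row_number, validate_until=None):
    """
    Tell whether the row with the 1-based ``row_number`` would still be
    validated under the rules described above.
    """
    if validate_until is None:
        return True  # Validate the whole input (the default).
    assert validate_until >= 0
    # 0 validates nothing, any other number validates the first rows only.
    return row_number <= validate_until


row_numbers = range(1, 6)
validated_all = [n for n in row_numbers if is_validated(n)]
validated_first_3 = [n for n in row_numbers if is_validated(n, validate_until=3)]
validated_none = [n for n in row_numbers if is_validated(n, validate_until=0)]
```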

Putting it all together

To recap the previous sections, here is a short code fragment containing a complete example you can use as a base for your own validation code:

>>> # Validate a test CSV file.
>>> import os.path
>>> import cutplace
>>> from cutplace import Cid
>>> # Change this to use your own files.
>>> cid_path = os.path.join(os.pardir, 'examples', 'cid_customers.ods')
>>> data_path = os.path.join(os.pardir, 'examples', 'customers.csv')
>>> # Define the interface.
>>> cid = Cid(cid_path)
>>> # Validate the data.
>>> for row in cutplace.rows(cid, data_path):
...     pass  # We could also do something useful with the data in ``row`` here.

In case you want to process the data, simply replace the pass inside the loop by whatever needs to be done.

In case you want to continue even if a row was rejected, use the optional parameter on_error='yield' as described earlier.

Writing data

To validate written data, use cutplace.Writer. A Writer needs a CID to validate against and an output to write to. The output can be any file-like object such as a file or an io.StringIO. For example:

>>> import io
>>> out = io.StringIO()

Now you can create a writer and write a valid row to it:

>>> writer = cutplace.Writer(cid, out)
>>> writer.write_row(['38000', 'Doe', 'John', '1957-03-08', 'male'])

Attempting to write broken data results in an Exception derived from cutplace.errors.CutplaceError:

>>> writer.write_row(['not a number', 'Miller', 'John', '1978-11-27', 'male'])
Traceback (most recent call last):
FieldValueError: <io> (R1C2): field 'customer_id' must match format: value must be an integer number: 'not a number'

Note that after a CutplaceError you can continue writing. For any other Exception such as IOError it is recommended to stop writing and consider it an unrecoverable situation.

Once done, close both the writer and the output:

>>> writer.close()
>>> out.close()

As cutplace.Writer implements the context manager protocol, you can also use the with statement to automatically close() it when done.

Note that cutplace.Writer.close() performs cutplace checks and consequently can raise a cutplace.errors.CheckError.
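One consequence is that a failing final check surfaces when the with block ends, not while rows are written. The following sketch shows this with a made-up stand-in class. CheckingWriter and CheckError are hypothetical, they are not cutplace.Writer or cutplace.errors.CheckError, and the decision to skip close() when the block already failed is an assumption of this sketch.

```python
import io


class CheckError(Exception):
    """Stand-in for a failed check (not the cutplace class)."""


class CheckingWriter(object):
    """Hypothetical writer that runs a final check when closed."""

    def __init__(self, target_stream):
        self._target_stream = target_stream
        self._row_count = 0

    def write_row(self, row):
        self._target_stream.write(','.join(row) + '\n')
        self._row_count += 1

    def close(self):
        # Final check: at least one row must have been written.
        if self._row_count == 0:
            raise CheckError('at least one row must be written')

    def __enter__(self):
        return self

    def __exit__(self, error_type, error, error_traceback):
        if error_type is None:
            self.close()  # May raise CheckError.
        return False  # Never swallow errors raised inside the block.


out = io.StringIO()
with CheckingWriter(out) as writer:
    writer.write_row(['1', 'Beck'])
written_text = out.getvalue()

try:
    with CheckingWriter(io.StringIO()) as writer:
        pass  # No rows written, so the final check fails on exit.
    close_error = None
except CheckError as error:
    close_error = str(error)
```

In other words, code that relies on the with statement should still be prepared to handle check errors around the whole block.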

Advanced usage

In the previous section, you learned how to read a CID and use it to validate data using a few simple API calls. You also learned how to handle errors detected in the data.

With this knowledge, you should be able to write your own small validation scripts that process the results. For instance, you could add your own code to log errors, send validation reports via email or automatically insert accepted rows into a database. The Python standard library offers powerful modules for all these tasks.

In case you are already happy and found everything you need, you can stop reading this chapter and move on with implementing your tasks.

If, however, you need more flexibility, suffer from API OCPD or just want to know what else cutplace offers in case you might need it one day, the following sections describe the lower level hooks of the cutplace API. They are more powerful and flexible, but also more difficult to use.

Building a CID in the code

In some cases it might be preferable to include the CID in the code, for instance for trivial interfaces that are only used internally. Here is an example of a simple CID for CSV data with 3 fields:

First, import the necessary modules:

>>> from cutplace import data
>>> from cutplace import errors
>>> from cutplace import fields
>>> from cutplace import interface

Next create an empty CID:

>>> cid = cutplace.Cid()

As the CID will not be read from an input file, error messages would not be able to refer to any file in case of errors. To have at least some reference, we need to tell the CID that it is declared from source code:

>>> cid.set_location_to_caller()

That way, error messages will refer you to the Python module where this call happened.
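How a library can find out where it was called from can be sketched with the standard inspect module. This only illustrates the general idea; caller_location() is a hypothetical helper and how cutplace itself determines the caller is not shown here.

```python
import inspect


def caller_location():
    """Return ``(file name, line number)`` of the code calling this function."""
    caller_frame = inspect.currentframe().f_back
    try:
        return caller_frame.f_code.co_filename, caller_frame.f_lineno
    finally:
        del caller_frame  # Avoid keeping a reference cycle alive.


# The result refers to this very line, not to the inside of the helper.
file_name, line_number = caller_location()
```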

Next we can add rows as read from a CID file using cutplace.Cid.add_data_format_row(), cutplace.Cid.add_field_format_row() and cutplace.Cid.add_check_row():

>>> # Use CSV as data format. This is the same as having a spreadsheet
>>> # with the cells:
>>> #
>>> # | D | Format         | Delimited |
>>> # | D | Item separator | ;         |
>>> cid.add_data_format_row([data.KEY_FORMAT, data.FORMAT_DELIMITED])
>>> cid.add_data_format_row([data.KEY_ITEM_DELIMITER, ';'])
>>>
>>> # Add a couple of fields.
>>> cid.add_field_format_row(['id', '', '', '1...5', 'Integer'])
>>> cid.add_field_format_row(['name'])
>>> cid.add_field_format_row(['date_of_birth', '', 'X', '', 'DateTime', 'YYYY-MM-DD'])
>>>
>>> # Make sure that the ``id`` field contains only unique values.
>>> cid.add_check_row(['id_must_be_unique', 'IsUnique', 'id'])
>>> cid.field_names
['id', 'name', 'date_of_birth']

If any of these methods cannot handle the parameters you passed, it raises a cutplace.errors.CutplaceError with a message describing what went wrong. For example:

>>> cid.add_check_row([])
Traceback (most recent call last):
InterfaceError: <source> (R1C2): check row (marked with 'c') must contain at least 2 columns

Adding your own field formats

Cutplace already ships with several field formats, found in the cutplace.fields module, that should cover most needs. If, however, you have some very special requirements, you can write your own formats.

Simply inherit from cutplace.fields.AbstractFieldFormat and optionally provide a constructor to parse the rule parameter. Next, implement validated_value() which validates that the text in value conforms to rule. If not, raise a FieldValueError with a descriptive error message.

Here is a very simple example of a field format that accepts values of “red”, “green” and “blue”:

>>> class ColorFieldFormat(fields.AbstractFieldFormat):
...     def __init__(self, field_name, is_allowed_to_be_empty, length, rule, data_format):
...         super(ColorFieldFormat, self).__init__(field_name, is_allowed_to_be_empty, length, rule, data_format, empty_value='')
...
...     def validated_value(self, value):
...         # Validate that ``value`` is a color and return it.
...         assert value
...         if value not in ['red', 'green', 'blue']:
...             raise errors.FieldValueError('color value is %r but must be one of: red, green, blue' % value)
...         return value
>>> color_field = ColorFieldFormat('roof_color', False, '', '', cid.data_format)
>>> color_field.validated('red')
'red'

The value parameter is a string. Cutplace ensures that validated_value() will never be called with an empty value parameter, hence the assert value - it will cause an AssertionError if value is '' or None because that would mean that the caller is broken.

Of course you could have achieved similar results using ChoiceFieldFormat. However, a custom field format can do more. In particular, validated_value() does not have to return a string. It can return any Python type and even None.

Here’s a more advanced ColorFieldFormat that returns the color as a tuple of RGB values between 0 and 1:

>>> class ColorFieldFormat(fields.AbstractFieldFormat):
...     def __init__(self, field_name, is_allowed_to_be_empty, length, rule, data_format):
...         super(ColorFieldFormat, self).__init__(field_name, is_allowed_to_be_empty, length, rule, data_format, empty_value='')
...
...     def validated_value(self, color_name):
...         # Validate that ``color_name`` is a color and return its RGB representation.
...         assert color_name
...         if color_name == 'red':
...             result = (1.0, 0.0, 0.0)
...         elif color_name == 'green':
...             result = (0.0, 1.0, 0.0)
...         elif color_name == 'blue':
...             result = (0.0, 0.0, 1.0)
...         else:
...             raise errors.FieldValueError('color name is %r but must be one of: red, green, blue' % color_name)
...         return result

For a simple test, let’s see this field format in action:

>>> color_field = ColorFieldFormat('roof_color', False, '', '', cid.data_format)
>>> color_field.validated('red')
(1.0, 0.0, 0.0)
>>> color_field.validated('yellow')
Traceback (most recent call last):
...
cutplace.errors.FieldValueError: color name is 'yellow' but must be one of: red, green, blue

Earlier you learned that validated_value() never gets called with an empty value. So what happens if you declare a color field that allows empty values? For example:

>>> # Sets ``is_allowed_to_be_empty`` to ``True`` to accept empty values.
>>> color_field = ColorFieldFormat('roof_color', True, '', '', cid.data_format)
>>> color_field.validated('')
''
>>> # Not quite a color tuple...

Well, that’s not quite what we want. Instead of an empty string, a reasonable default RGB tuple would be a lot more useful. Say, (0.0, 0.0, 0.0) to represent black.

Fortunately field formats can just specify that by using the empty_value parameter in the constructor. When passed to the super constructor in AbstractFieldFormat, everything will be taken care of. So here’s a slightly modified version:

>>> class ColorFieldFormat(fields.AbstractFieldFormat):
...     def __init__(self, field_name, is_allowed_to_be_empty, length, rule, data_format):
...         super(ColorFieldFormat, self).__init__(field_name, is_allowed_to_be_empty, length, rule, data_format,
...                 empty_value=(0.0, 0.0, 0.0)) # Use black as "empty" color.
...
...     def validated_value(self, color_name):
...         # (Exactly the same as before)
...         assert color_name
...         if color_name == 'red':
...             result = (1.0, 0.0, 0.0)
...         elif color_name == 'green':
...             result = (0.0, 1.0, 0.0)
...         elif color_name == 'blue':
...             result = (0.0, 0.0, 1.0)
...         else:
...             raise cutplace.errors.FieldValueError('color name is %r but must be one of: red, green, blue' % color_name)
...         return result

Let’s give it a try:

>>> color_field = ColorFieldFormat('roof_color', True, '', '', cid.data_format)
>>> color_field.validated('red')
(1.0, 0.0, 0.0)
>>> color_field.validated('')
(0.0, 0.0, 0.0)

Adding your own checks

Writing checks is quite similar to writing field formats. However, the interaction with the validation is more complex.

Checks have to implement certain methods described in cutplace.checks.AbstractCheck. For each check, cutplace performs the following actions:

  1. When reading the CID, call the check’s __init__().
  2. When starting to read a set of data, call the check’s reset().
  3. For each row of data, call the check’s check_row().
  4. When done with a set of data, call the check’s check_at_end().

The remainder of this section describes how to implement each of these methods.

As an example, we implement a check to ensure that each customer’s full name has at most 100 characters. The field formats already ensure that first_name and surname are at most 60 characters each. However, assuming the full name is derived using the expression:

surname + ', ' + first_name

this could lead to full names with up to 122 characters.
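The arithmetic behind this number can be checked with a few lines of plain Python. The limit of 60 characters and the ', ' separator come from the text above; the variable names are made up for this sketch.

```python
MAX_NAME_LENGTH = 60  # Upper limit enforced by the field formats.
SEPARATOR = ', '

# Build the longest names the field formats would still accept.
longest_surname = 'x' * MAX_NAME_LENGTH
longest_first_name = 'y' * MAX_NAME_LENGTH
longest_full_name = longest_surname + SEPARATOR + longest_first_name
longest_full_name_length = len(longest_full_name)  # 60 + 2 + 60
```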

To implement this check, start by inheriting from cutplace.checks.AbstractCheck:

>>> from cutplace import checks
>>> class FullNameLengthIsInRangeCheck(checks.AbstractCheck):
...     """Check that total length of customer name is within the specified range."""

Next, implement a constructor to which cutplace can pass the values found in the CID. For example, for our check the CID would contain:

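+---+--------------------------------------------+-------------------------+--------+
|   | Description                                | Type                    | Rule   |
+===+============================================+=========================+========+
| C | full name must have at most 100 characters | FullNameLengthIsInRange | ...100 |
+---+--------------------------------------------+-------------------------+--------+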

When cutplace encounters this line, it will create a new check by calling FullNameLengthIsInRangeCheck.__init__(), passing the following parameters:

  • description='full name must have at most 100 characters', which is just a human readable description of the check to refer to it in error messages
  • rule='...100', which describes what exactly the check should do. Each check can define its own syntax for the rule. In case of FullNameLengthIsInRange the rule describes a cutplace.ranges.Range.
  • available_field_names=['customer_id', 'surname', 'first_name', 'date_of_birth', 'gender'] (as defined in the CID and using the same order)
  • location being the cutplace.errors.Location in the CID where the check was defined.

The constructor basically has to do 3 things:

  1. Call the super constructor
  2. Perform optional initialization needed by the check that needs to be done only once and not on each new data set. In most cases, this involves parsing the rule parameter and obtaining whatever information the check needs from it.
  3. Call self.reset(). This is not really necessary for this check, but in most cases it will make your life easier because you can avoid redundant initializations in the constructor.

To sum it up as code:

>>> from cutplace import ranges
>>> class FullNameLengthIsInRangeCheck(checks.AbstractCheck):
...     """Check that total length of customer name is within the specified range."""
...     def __init__(self, description, rule, available_field_names, location=None):
...         super(FullNameLengthIsInRangeCheck, self).__init__(description, rule, available_field_names, location)
...         self._full_name_range = ranges.Range(rule)
...         self.reset()

Once cutplace is done reading the CID, it moves on to data. For each set of data it calls the checks’ reset() method. For our simple check, no actions are needed so we are good because reset() already does nothing.

When cutplace validates data, it reads them row by row. For each row, it calls validated() on each cell in the row. In case all cells are valid, it collects them in a dictionary which maps the field name to its native value. Recall the interface from the Tutorial, which defined the following fields:

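+---+---------------+------------+--------+--------+----------+--------------+
|   | Name          | Example    | Empty? | Length | Type     | Rule         |
+===+===============+============+========+========+==========+==============+
| F | customer_id   | 16         |        | 2:     | Integer  | 10...65535   |
+---+---------------+------------+--------+--------+----------+--------------+
| F | first_name    | Jane       |        | :60    |          |              |
+---+---------------+------------+--------+--------+----------+--------------+
| F | surname       | Doe        |        | :60    |          |              |
+---+---------------+------------+--------+--------+----------+--------------+
| F | date_of_birth | 1995-11-15 | X      | 10     | DateTime | YYYY-MM-DD   |
+---+---------------+------------+--------+--------+----------+--------------+
| F | gender        | female     | X      | 2:6    | Choice   | male, female |
+---+---------------+------------+--------+--------+----------+--------------+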

Now consider a data row with the following values:

+-------------+------------+---------+---------------+--------+
| Customer id | First name | Surname | Date of birth | Gender |
+=============+============+=========+===============+========+
| 96          | Andrew     | Dixon   | 1913-10-02    | male   |
+-------------+------------+---------+---------------+--------+

The row map for this row would be:

row_map = {
    'customer_id': 96,
    'first_name': 'Andrew',
    'surname': 'Dixon',
    'date_of_birth': time.struct_time(tm_year=1913, tm_mon=10, tm_mday=2, ...),
    'gender': 'male',
}

With this knowledge, we can easily implement a check_row() that computes the full name and checks that it is within the required range. If not, it raises a CheckError:

>>> def check_row(self, row_map, location):
...     first_name = row_map['first_name']
...     surname = row_map['surname']
...     full_name = surname + ', ' + first_name
...     full_name_length = len(full_name)
...     try:
...         self._full_name_range.validate('full name', full_name_length)
...     except cutplace.errors.RangeValueError:
...         raise cutplace.errors.CheckError(
...             'full name length is %d but must be in range %s: %r'
...             % (full_name_length, self._full_name_range, full_name))

And finally, there is cutplace.checks.AbstractCheck.check_at_end() which is called when all data rows have been processed. Note that check_at_end() does not have any parameters that contain actual data. Instead you typically would collect all information needed by check_at_end() in check_row() and store them in instance variables. For an example, take a look at the source code of cutplace.checks.IsUniqueCheck.
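The pattern of collecting information in check_row() and evaluating it in check_at_end() can be sketched without cutplace. MinimumRowCountCheck below is a hypothetical stand-alone class that only mimics the method names described above. It does not inherit from cutplace.checks.AbstractCheck, the location parameter of check_at_end() is an assumption, and it raises ValueError instead of a cutplace error.

```python
class MinimumRowCountCheck(object):
    """Hypothetical check that a set of data has a minimum number of rows."""

    def __init__(self, description, minimum_row_count):
        self.description = description
        self._minimum_row_count = minimum_row_count
        self.reset()

    def reset(self):
        # Forget everything collected from a previous set of data.
        self._row_count = 0

    def check_row(self, field_name_to_value_map, location):
        # Only collect information; nothing can be decided yet.
        self._row_count += 1

    def check_at_end(self, location):
        # Evaluate what check_row() collected for the whole set of data.
        if self._row_count < self._minimum_row_count:
            raise ValueError(self.description)


check = MinimumRowCountCheck('data must contain at least 2 rows', 2)
data_sets = [
    [{'surname': 'Beck'}],
    [{'surname': 'Beck'}, {'surname': 'Rose'}],
]
results = []
for data_set in data_sets:
    check.reset()  # Called once per set of data.
    for row_number, row_map in enumerate(data_set, 1):
        check.check_row(row_map, row_number)
    try:
        check.check_at_end(None)
        results.append('ok')
    except ValueError as error:
        results.append(str(error))
```

Because reset() clears the counter before each set of data, the result for the second set does not depend on the rows seen in the first one.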

Because our FullNameLengthIsInRangeCheck does not need to do anything here, we can omit it and keep the empty implementation inherited from cutplace.checks.AbstractCheck.check_at_end().

Using your own checks and field formats

Now that you know how to write your own checks and field formats, it would be nice to actually use them in a CID. For this purpose, cutplace lets you import plugins that can define their own checks and field formats.

Plugins are standard Python modules that define classes based on cutplace.checks.AbstractCheck and cutplace.fields.AbstractFieldFormat. For our example, create a folder named ~/cutplace_plugins and store a Python module named myplugins.py in it with the following contents:

"""
Example plugins for cutplace.
"""
from cutplace import checks
from cutplace import errors
from cutplace import fields
from cutplace import ranges


class ColorFieldFormat(fields.AbstractFieldFormat):
    """
    Field format representing colors as their names.
    """
    def __init__(self, field_name, is_allowed_to_be_empty, length, rule, data_format):
        # HACK: Use super() in a way that works both in Python 2 and 3. If the code only has to work with Python 3,
        # use the cleaner `super().__init__(...)`.
        # FIXME: super(ColorFieldFormat, self).__init__(
        fields.AbstractFieldFormat.__init__(self,
            field_name, is_allowed_to_be_empty, length, rule, data_format,
            empty_value=(0.0, 0.0, 0.0))  # Use black as "empty" color.

    def validated_value(self, color_name):
        assert color_name
        if color_name == "red":
            result = (1.0, 0.0, 0.0)
        elif color_name == "green":
            result = (0.0, 1.0, 0.0)
        elif color_name == "blue":
            result = (0.0, 0.0, 1.0)
        else:
            raise errors.FieldValueError("color name is %r but must be one of: red, green, blue" % color_name)
        return result


class FullNameLengthIsInRangeCheck(checks.AbstractCheck):
    """
    Check that total length of customer name is within the specified range.
    """
    def __init__(self, description, rule, available_field_names, location=None):
        super(FullNameLengthIsInRangeCheck, self).__init__(description, rule, available_field_names, location)
        self._full_name_range = ranges.Range(rule)
        self.reset()

    def check_row(self, field_name_to_value_map, location):
        full_name = field_name_to_value_map["surname"] + ", " + field_name_to_value_map["first_name"]
        full_name_length = len(full_name)
        self._full_name_range.validate("length of full name", full_name_length, location)

The CID can now refer to ColorFieldFormat as Color (without FieldFormat) and to FullNameLengthIsInRangeCheck as FullNameLengthIsInRange (without Check). For example:

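+---+-------------------+---------+--------+--------+-------+------+
|   | Interface: colors |         |        |        |       |      |
+---+-------------------+---------+--------+--------+-------+------+
|   |                   |         |        |        |       |      |
+---+-------------------+---------+--------+--------+-------+------+
|   | Data format       |         |        |        |       |      |
+---+-------------------+---------+--------+--------+-------+------+
| D | Format            | CSV     |        |        |       |      |
+---+-------------------+---------+--------+--------+-------+------+
| D | Header            | 1       |        |        |       |      |
+---+-------------------+---------+--------+--------+-------+------+
|   |                   |         |        |        |       |      |
+---+-------------------+---------+--------+--------+-------+------+
|   | Fields            |         |        |        |       |      |
+---+-------------------+---------+--------+--------+-------+------+
|   | Name              | Example | Empty? | Length | Type  | Rule |
+---+-------------------+---------+--------+--------+-------+------+
| F | item              | tree    |        |        |       |      |
+---+-------------------+---------+--------+--------+-------+------+
| F | color             | green   |        |        | Color |      |
+---+-------------------+---------+--------+--------+-------+------+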

See: cid_colors.ods

Here is a data file where all but one row conform to the CID:

Item,Color
flower,red
tree,green
sky,blue
gras,yellow

See: colors.csv

To tell cutplace where the plugins folder is located, use the command line option --plugins. Assuming that your myplugins.py is stored in ~/cutplace_plugins you can run:

cutplace --plugins ~/cutplace_plugins cid_colors.ods colors.csv

The output is:

ERROR:cutplace:field error: colors.csv (R5C2): field 'color' must match format: color name is 'yellow' but must be one of: red, green, blue
colors.csv: rejected 1 of 5 rows. 0 final checks failed.

If you are unsure which plugins exactly cutplace imports, use --log=info. For example, the output could contain:

INFO:cutplace:import plugins from "/Users/me/cutplace_plugins"
INFO:cutplace:  import plugins from "/Users/me/cutplace_plugins/myplugins.py"
INFO:cutplace:    fields found: ['ColorFieldFormat']
INFO:cutplace:    checks found: ['FullNameLengthIsInRangeCheck']