pycerberus is a framework to check user data thoroughly so that you can protect your application from malicious (or just garbled) input data.
pycerberus is just a Python library which uses setuptools so it does not require a special setup. It has no dependencies besides the standard Python library. Python 2.3-2.6 is supported.
In any software you must check carefully that untrusted user input matches your expectations. Unvalidated user input is a common source of security flaws. However, many checks are repetitive and validation logic tends to be scattered all around the code. Because basic checks are duplicated, developers forget to check for uncommon edge cases as well. Finally, there is often also some code to convert the input data (usually strings) to more convenient Python data types like int or bool.
pycerberus is a framework that tackles these common problems and allows you to write tailored validators to perform additional checks. Furthermore the framework also has built-in support for less common (but important) use cases like internationalization.
The framework itself is heavily inspired by FormEncode by Ian Bicking. Therefore most of FormEncode’s design rationale is directly applicable to pycerberus. However several things about FormEncode annoyed me so much that I decided to write my own library when I needed one for my SMTP server project pymta.
pycerberus separates validation rules (“Validators”) from the objects they validate against. It might be tempting to derive the validation rules from restrictions you specified earlier (e.g. from a class which is mapped by an ORM to a database). However, that approach completely ignores that validation typically depends on context: in an API you typically have a lot more freedom with regard to allowed values than in a public web interface, where input needs to conform to many more checks. With a model where you declare the validation explicitly, this is possible. It is also quite easy to write some code that generates a baseline of validation rules automatically from your ORM model and adds additional restrictions depending on the context.
As pycerberus is completely context-agnostic (not being bundled with a specific framework), you can use it in many different places (e.g. web applications with different frameworks, server applications, check parameters in a library, ...).
Further reading: FormEncode’s design rationale - most of the design ideas are also present in pycerberus.
Currently (March 2010, version 0.3) pycerberus is at a very basic stage - though with very solid foundations. The API for single validators is basically complete, i18n support is built in and there is decent documentation covering all important aspects. You can check multiple values (e.g. a web form) easily using a validation Schema (“compound validator”).
The future development will focus on repeating fields (list of values). After that, I’ll try to increase the number of built-in validators for specific domains (e.g. correct email address validation, validating host names, localized numbers). Another interesting topic will be integration into different frameworks like TurboGears and trac.
However, I have to say that I’m pretty satisfied with the current status, so adding more features to pycerberus won’t be my #1 priority in the next months. The current API and functionality were well suited even for validating input parameters of an SMTP server, so I think most use cases should actually be covered.
In pycerberus “Validators” are used to specify validation rules which ensure that the input matches your expectations. Every basic validator validates just a single value (e.g. one specific input field in a web application). When the validation is successful, the validated and converted value is returned. If something is wrong with the data, an exception is raised:
from pycerberus.validators import IntegerValidator
IntegerValidator().process('42') # returns 42 as int
pycerberus puts conversion and validation together in one call because of two main reasons:
Every validation error will trigger an exception, usually an InvalidDataError. This exception will contain a translated error message which can be presented to the user, a key so you can identify the exact error programmatically and the original, unmodified value:
from pycerberus.errors import InvalidDataError
from pycerberus.validators import IntegerValidator
try:
    IntegerValidator().process('foo')
except InvalidDataError, e:
    details = e.details()
    details.msg() # u'Please enter a number.'
    details.key() # 'invalid_number'
    details.value() # 'foo'
    details.context() # {}
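To illustrate why a structured error is useful, here is a minimal standalone sketch (Python 3 syntax) that does not use pycerberus at all. SketchInvalidDataError, process_integer and describe_failure are hypothetical names made up for this example; the real InvalidDataError exposes its information through e.details() as shown above.

```python
class SketchInvalidDataError(Exception):
    """Hypothetical stand-in for an error that carries structured details."""

    def __init__(self, msg, value, key, context):
        super().__init__(msg)
        self.msg = msg
        self.value = value
        self.key = key
        self.context = context


def process_integer(value, context=None):
    """Convert ``value`` to int or raise an error with structured details."""
    context = {} if context is None else context
    try:
        return int(value)
    except (TypeError, ValueError):
        raise SketchInvalidDataError(
            'Please enter a number.', value, 'invalid_number', context)


def describe_failure(value):
    """Return (key, original value) for invalid input, or None if it is valid."""
    try:
        process_integer(value)
    except SketchInvalidDataError as error:
        # The key is stable, so application code can branch on it
        # without parsing the human-readable message.
        return (error.key, error.value)
    return None
```

The message is meant for the user, while the key and the original value are meant for your code.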
You can configure the behavior of the validator when instantiating it. For example, if you pass required=False to the constructor, most validators will also accept None as a valid value:
IntegerValidator(required=True).process(None) # -> validation error
IntegerValidator(required=False).process(None) # None
Validators support different configuration options which are explained alongside each validator’s description.
All validators support an optional context argument (which defaults to an empty dict). It is used to plug validators into your application and make them aware of the overall system state: For example a validator must know which locale it should use to translate an error message to the correct language without relying on some global variables:
context = {'locale': 'de'}
validator = IntegerValidator()
validator.process('foo', context=context) # u'Bitte geben Sie eine Zahl ein.'
The context variable is especially useful when writing custom validators - locale is the only context information that pycerberus itself cares about.
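The idea of passing the locale through the context instead of a global variable can be sketched without pycerberus. MESSAGES and message_for are hypothetical names for this example only, and the dictionary lookup is an assumption that stands in for a real translation mechanism.

```python
# Hypothetical message catalogs, keyed by locale and then by error key.
MESSAGES = {
    'en': {'invalid_number': 'Please enter a number.'},
    'de': {'invalid_number': 'Bitte geben Sie eine Zahl ein.'},
}


def message_for(key, context=None, default_locale='en'):
    """Pick the message for ``key`` based on the locale found in ``context``."""
    context = context or {}
    locale = context.get('locale', default_locale)
    # Unknown locales fall back to the default catalog.
    catalog = MESSAGES.get(locale, MESSAGES[default_locale])
    return catalog[key]
```

Because the locale travels with every call, two requests in different languages can share the same validator instance.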
After all, using only built-in validators won’t help you much: You need custom validation rules which means that you need to write your own validators.
pycerberus comes with two classes that can serve as a good base when you start writing a custom validator: The BaseValidator only provides the absolutely required set of API so you have maximum freedom. The Validator class itself is inherited from the BaseValidator and defines a more sophisticated API and i18n support. Usually you should use the Validator class.
The BaseValidator implements only the minimally required methods. Therefore it does not put many constraints on you. Most users probably want to use the Validator class which already implements some commonly used features.
Return all messages which are defined by this validator as a key/message dictionary. Alternatively, you can create a class-level dictionary which contains these keys/messages.
You must declare all your messages here so that all keys are known after this method was called.
Calling this method might be costly when you have a lot of messages and returning them is expensive. You can reduce the overhead in some situations by implementing message_for_key().
This is the method to validate your input. The validator returns a (Python) representation of the given input value.
In case of errors an InvalidDataError is raised.
The Validator is the base class of most validators and implements some commonly used features like required values (raise exception if no value was provided) or default values in case no value is given.
This validator splits conversion and validation into two separate steps: When a value is passed to process(), the validator first calls convert(), which performs some checks on the value and finally returns the converted value. Only if the value was converted correctly can the validate() function do additional checks on the converted value and possibly raise an exception in case of errors. If you only want to do additional checks (but no conversion) in your validator, you can implement validate() and simply assume that you get the correct Python type (e.g. int).
Of course you can also raise a ValidationError inside convert() - often errors can only be detected during the conversion process.
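The two-step flow can be sketched in a few lines of standalone Python. This is not pycerberus’ implementation: SketchValidator, SketchError, PositiveInteger and error_key_for are hypothetical names, and the sketch only mirrors the order of calls described above.

```python
class SketchError(Exception):
    """Hypothetical error type; its message doubles as the error key."""


class SketchValidator:
    """Hypothetical base class: process() runs convert(), then validate()."""

    def process(self, value, context=None):
        context = {} if context is None else context
        converted_value = self.convert(value, context)
        # Only reached if convert() returned without raising.
        self.validate(converted_value, context)
        return converted_value

    def convert(self, value, context):
        return value

    def validate(self, converted_value, context):
        pass


class PositiveInteger(SketchValidator):
    def convert(self, value, context):
        try:
            return int(value)
        except (TypeError, ValueError):
            raise SketchError('invalid_number')

    def validate(self, converted_value, context):
        # convert() already guarantees an int here.
        if converted_value <= 0:
            raise SketchError('too_low')


def error_key_for(validator, value):
    """Return the error key raised for ``value``, or None if it is valid."""
    try:
        validator.process(value)
    except SketchError as error:
        return str(error)
    return None
```

Input like 'abc' fails in convert(), while '-3' converts fine and is only rejected in validate().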
By default, a validator will raise an InvalidDataError if no value was given (unless you set a default value). If required is False, the default is None. All exceptions thrown by validators must be derived from ValidationError. Exceptions caused by invalid user input should use InvalidDataError or one of the subclasses.
In order to prevent programmer errors, an exception will be raised if you set required to True but provide a default value as well.
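The rules for required and default can be written down as a small decision function. This is only my reading of the two paragraphs above, not code from pycerberus: NOT_SET, EmptyValueError, result_for_empty_input and outcome are hypothetical names, and using TypeError for the programmer error is an assumption.

```python
NOT_SET = object()  # sentinel: "this option was not passed at all"


class EmptyValueError(Exception):
    """Hypothetical stand-in for the error raised on missing input."""


def result_for_empty_input(required=NOT_SET, default=NOT_SET):
    """Return what a validator would yield when no value was given."""
    if required is True and default is not NOT_SET:
        # Contradictory configuration: a programmer error, not a user error.
        raise TypeError('required=True cannot be combined with a default')
    if default is not NOT_SET:
        return default
    if required is False:
        return None
    raise EmptyValueError('empty')


def outcome(**options):
    """Return the result, or the name of the exception that was raised."""
    try:
        return result_for_empty_input(**options)
    except (TypeError, EmptyValueError) as error:
        return type(error).__name__
```

The sentinel is needed because None is itself a legitimate default value.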
Perform additional checks on the value which was processed successfully before (otherwise this method is not called). Raise an InvalidDataError if the input data is invalid.
You can implement only this method in your validator if you just want to add additional restrictions without touching the actual conversion.
This method must not modify the converted_value.
pycerberus uses simple_super so you can just say ‘self.super()’ in your custom validator classes. This will call the super implementation with just the same parameters as your method was called.
Validators need to be thread-safe as one instance might be used several times. Therefore you must not add additional attributes to your validator instance after you called Validator’s constructor. To prevent inexperienced programmers from falling into that trap, a ‘Validator’ will raise an exception if you try to set an attribute. If you don’t like this behavior, you can set ‘_is_internal_state_frozen’ to False before calling Validator’s constructor.
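One common way to get such a “frozen” object in Python is to override __setattr__. The sketch below shows that general technique only; FrozenAfterInit and rejects_mutation are hypothetical names, and I make no claim that pycerberus implements its freeze exactly like this.

```python
class FrozenAfterInit:
    """Hypothetical class that rejects attribute changes after construction."""

    _is_frozen = False  # class-level default, consulted during __init__

    def __init__(self, max_length=None):
        self._max_length = max_length
        self._is_frozen = True  # from here on, assignments are rejected

    def __setattr__(self, name, value):
        if self._is_frozen:
            raise AttributeError('cannot set %r: instance is frozen' % name)
        object.__setattr__(self, name, value)


def rejects_mutation(instance):
    """Return True if assigning an attribute on ``instance`` fails."""
    try:
        instance.last_value = 'per-call data does not belong here'
    except AttributeError:
        return True
    return False
```

Data that belongs to a single call should live in local variables or in the context dictionary, not on the shared instance.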
Now it’s time to put it all together. This validator demonstrates most of the API as explained so far:
from pycerberus.api import Validator
from pycerberus.i18n import _

class UnicodeValidator(Validator):
    def __init__(self, max=None):
        self.super()
        self._max_length = max

    def messages(self):
        return {
            'invalid_type': _(u'Validator got unexpected input (expected string, got %(classname)s).'),
            'too_long': _(u'Please enter at maximum %(max_length)s characters.')
        }
    # Alternatively you could also declare a class-level variable:
    # messages = {...}

    def convert(self, value, context):
        try:
            return unicode(value, 'UTF-8')
        except Exception:
            classname = value.__class__.__name__
            self.error('invalid_type', value, context, classname=classname)

    def validate(self, converted_value, context):
        if self._max_length is None:
            return
        if len(converted_value) > self._max_length:
            self.error('too_long', converted_value, context, max_length=self._max_length)
The validator will convert all input to unicode strings (using the UTF-8 encoding). It also checks for a maximum length of the string.
You can see that all the conversion is done in convert() while additional validation is encapsulated in validate(). This can help you keep your methods small.
In case there is an error the error() method will raise an InvalidDataError. You select the error message to show by passing a string constant key which identifies the message. The key can be used later to adapt the user interface without relying on the message itself (e.g. show an additional help box in the user interface if the user typed in the wrong password).
The error messages are declared in messages(). You’ll notice that the message strings can also contain variable parts. You can use these variable parts to give the user some additional hints about what was wrong with the data.
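The variable parts use Python’s standard dictionary-based string formatting. A minimal standalone sketch, with MESSAGES and render_message as hypothetical names for this example:

```python
# Each placeholder needs a conversion type (here: s) after the closing
# parenthesis, otherwise the % operator cannot format the value.
MESSAGES = {
    'invalid_type': 'Validator got unexpected input (expected string, got %(classname)s).',
    'too_long': 'Please enter at maximum %(max_length)s characters.',
}


def render_message(key, **values):
    """Look up the message for ``key`` and fill in its variable parts."""
    return MESSAGES[key] % values
```

Keyword arguments that a message does not mention are simply ignored, so you can pass the same values to differently worded messages.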
Modern applications must be able to handle different languages. Internationalization (i18n) in pycerberus refers to validating locale-dependent input data (e.g. different decimal separator characters) as well as validation errors in different languages. The former aspect is not yet covered by default but you should be able to write custom validators easily.
All messages from validators included in pycerberus are translated into different languages using the standard gettext library. The language of validation error messages will be chosen depending on the locale which is given in the context dictionary.
i18n support in pycerberus is a bit broader than just translating existing error messages. i18n becomes interesting when you write your own validators (based on the ones that come with pycerberus) and your translations need to play along with the built-in ones:
All i18n support in pycerberus aims to provide custom validators with a nice, simple-to-use API while maintaining the flexibility that serious applications need.
If you want to get translated error messages from a validator, you set the correct ‘context’. pycerberus looks for a key named ‘locale’ in the context dictionary:
validator = IntegerValidator()
validator.process('foo', context={'locale': 'en'}) # u'Please enter a number.'
validator.process('foo', context={'locale': 'de'}) # u'Bitte geben Sie eine Zahl ein.'
Usually you don’t have to know much about how pycerberus uses gettext internally. Just for completeness: The default domain is ‘pycerberus’. By default translations (.mo files) are loaded from pycerberus.locales, with a fallback to the system-wide locale dir ‘/usr/share/locale’.
To translate messages from a custom validator, you need to declare them in the messages() method and mark the message strings as translatable:
from pycerberus.api import Validator
from pycerberus.i18n import _

class MyValidator(Validator):
    def messages(self):
        return {
            'foo': _('A message.'),
            'bar': _('Another message.'),
        }
    # your validation logic ...
Afterwards you just have to start the usual gettext process:
Assume your custom validator is a subclass of a built-in validator but you don’t like the built-in translation. Of course you can replace pycerberus’ mo files directly. However there is also another way where you don’t have to change pycerberus itself:
class CustomValidatorThatOverridesTranslations(Validator):
    def messages(self):
        return {'empty': _('My custom message if the value is empty'),
                'custom': _('A custom message')}
    # ...
This validator will use a different message for the ‘empty’ error and you can define custom translations for this key in your own .po files.
The gettext framework is configurable, e.g. in which directory your .mo files are located and which domain (.mo filename) should be used. In pycerberus this is configurable per validator:
class ValidatorWithCustomGettextOptions(Validator):
    def messages(self):
        return {'custom': _('A custom message')}

    def translation_parameters(self, context):
        return {'domain': 'myapp', 'localedir': '/home/foo/locale'}
    # ...
These translation parameters are passed directly to the ‘gettext’ call so you can read about the available options in the gettext documentation. Your parameters will be applied to all messages which were declared in your validator class (but not in others). So you can modify the parameters for your own validator but keep all the existing parameters (and translations) for built-in validators.
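To see what ‘domain’ and ‘localedir’ mean on the gettext side, here is a standalone sketch using only the standard library’s gettext.translation(). The helper translator_for is a hypothetical name, and how pycerberus itself combines the parameters with the locale is an assumption of this sketch.

```python
import gettext
import tempfile


def translator_for(locale, translation_parameters):
    """Build a gettext translation object from a dict of parameters."""
    return gettext.translation(
        translation_parameters['domain'],
        localedir=translation_parameters.get('localedir'),
        languages=[locale],
        fallback=True,  # no .mo file found -> NullTranslations, not an error
    )


# An empty temporary directory stands in for a real locale directory.
parameters = {'domain': 'myapp', 'localedir': tempfile.mkdtemp()}
translator = translator_for('de', parameters)
# Without a catalog the message comes back untranslated.
message = translator.gettext('A custom message')
```

With a real catalog, gettext would look for the file de/LC_MESSAGES/myapp.mo below the given locale directory.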
Sometimes you don’t want to use gettext. For instance you could store translations in a relational database so that your users can update the messages themselves without fiddling with gettext tools:
class ValidatorWithNonGettextTranslation(FrameworkValidator):
    def messages(self):
        return {'custom': _('A custom message')}

    def translate_message(self, key, native_message, translation_parameters, context):
        # fetch the translation for 'native_message' from somewhere
        translated_message = get_translation_from_db(native_message)
        return translated_message
You can use this mechanism to plug arbitrary translation systems into pycerberus. Your translation mechanism is (again) only applied to keys which were defined by your specific validator class. If you want to use your translation system also for keys which were defined by built-in validators, you need to re-define these keys in your class as shown in the previous section.
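A database lookup like the one hinted at above could look roughly like this. The sketch uses an in-memory SQLite database from the standard library; the table layout, the German sample text and the extra parameters of get_translation_from_db are all assumptions made for this example.

```python
import sqlite3


def create_translation_store():
    """Create a hypothetical in-memory translation table with one entry."""
    connection = sqlite3.connect(':memory:')
    connection.execute(
        'CREATE TABLE translations (locale TEXT, native TEXT, translated TEXT)')
    connection.execute(
        'INSERT INTO translations VALUES (?, ?, ?)',
        ('de', 'A custom message', 'Eine eigene Meldung'))
    return connection


def get_translation_from_db(connection, native_message, locale):
    """Return the stored translation, or the native message if none exists."""
    row = connection.execute(
        'SELECT translated FROM translations WHERE locale = ? AND native = ?',
        (locale, native_message)).fetchone()
    # Fall back to the untranslated message when no row matches.
    return row[0] if row is not None else native_message
```

Falling back to the native message keeps validation errors readable even when a translation is missing.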
Especially in web development you often get multiple values from a form and you want to validate all these values easily. This is where “compound validators” / “schemas” come into play. A schema contains multiple validators, one validator for every field. There’s nothing special about these validators - they are just validators like the ones I explained in the previous section. Every field validator only cares about a single value and does not see the rest of the values.
You can define a schema like this:
from pycerberus.schema import SchemaValidator
from pycerberus.validators import IntegerValidator, StringValidator
schema = SchemaValidator()
schema.add('id', IntegerValidator())
schema.add('name', StringValidator())
Afterwards the schema behaves mostly like any basic validator - instead of a single input value it just gets a dictionary:
validated_values = schema.process({'id': '42', 'name': 'Foo Bar'})
If you declared a validator for a key which is not present in the input dict, the validator will get its ‘empty’ value instead:
id_optional = SchemaValidator()
id_optional.add('id', IntegerValidator(required=False))
id_optional.process({}) # -> {'id': None}
id_required = SchemaValidator()
id_required.add('id', IntegerValidator(required=True))
id_required.process({}) # raises an exception because None is not acceptable for 'id'
Do not mix up the ‘default’ value with the ‘empty’ value:
IntegerValidator(default=42)
The ‘default’ value in this case is 42 but the ‘empty’ value is still None.
Please note that Schemas are ‘secure by default’ which means that the returned dictionary contains only values that were validated. If you did not add a validator for a specific key, this key won’t be included in the result.
If you need to ensure that no values with unknown keys are passed to the schema (even if those would be just dropped), you can call the method ‘set_allow_additional_parameters(False)’. After that the schema will raise an exception if it finds any unknown keys.
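The difference between dropping and rejecting unknown keys can be sketched in standalone Python. process_schema is a hypothetical function, plain callables like int stand in for field validators, and raising ValueError is an assumption of this sketch.

```python
def process_schema(validators, values, allow_additional_parameters=True):
    """Convert known keys; drop or reject keys without a validator."""
    unknown = set(values) - set(validators)
    if unknown and not allow_additional_parameters:
        raise ValueError('unknown keys: %s' % ', '.join(sorted(unknown)))
    # Only keys with a declared validator make it into the result.
    # Missing keys are passed to the validator as None.
    return {key: convert(values.get(key))
            for key, convert in validators.items()}


validators = {'id': int, 'name': str}
values = {'id': '42', 'name': 'Foo Bar', 'is_admin': 'yes'}

# Default behavior: the unknown key is silently dropped.
lenient_result = process_schema(validators, values)

# Strict behavior: the unknown key is an error.
try:
    process_schema(validators, values, allow_additional_parameters=False)
    strict_outcome = 'accepted'
except ValueError as error:
    strict_outcome = str(error)
```

Either way, a value like 'is_admin' never reaches your application code unvalidated.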
Schemas can be an important part in your application security. Also they define some kind of interface (which parameters does your application expect). Besides the algorithmic way to build a schema there is a ‘declarative’ way so that you can review and audit your schemas easily:
class MySchema(SchemaValidator):
    id = IntegerValidator()
    name = StringValidator()

# using it...
schema = MySchema()
It’s absolutely the same schema but the definition is way easier to read.
All schema validators are executed even if one of the previous validators failed. Because of that you can show the user all errors at once:
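from pycerberus.errors import InvalidDataError
from pycerberus.schema import SchemaValidator
from pycerberus.validators import IntegerValidator, StringValidator
schema = SchemaValidator()
schema.add('id', IntegerValidator())
schema.add('name', StringValidator())
try:
    schema.process({'id': 'invalid', 'name': None})
except InvalidDataError, e:
    e.error_dict() # {'id': <id validation error>, 'name': <name validation error>}
    e.error_for('id') # id validation error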
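Collecting all errors instead of stopping at the first one takes just a loop with a try/except inside. The following standalone sketch shows the idea; SketchSchemaError, require_int, require_text, process_all and error_keys are hypothetical names and not part of pycerberus.

```python
class SketchSchemaError(Exception):
    """Hypothetical error that bundles one error key per failed field."""

    def __init__(self, errors):
        super().__init__('%d field(s) failed validation' % len(errors))
        self.errors = errors


def require_int(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        raise ValueError('invalid_number')


def require_text(value):
    if value is None:
        raise ValueError('empty')
    return str(value)


def process_all(validators, values):
    """Run every field validator and report all failures together."""
    results, errors = {}, {}
    for key, validate in validators.items():
        try:
            results[key] = validate(values.get(key))
        except ValueError as error:
            errors[key] = str(error)  # remember the failure and keep going
    if errors:
        raise SketchSchemaError(errors)
    return results


def error_keys(validators, values):
    """Return {field: error key} for invalid input, or {} if all is valid."""
    try:
        process_all(validators, values)
    except SketchSchemaError as error:
        return error.errors
    return {}
```

A web form can then mark every broken field in one round trip.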
Sometimes you need to validate multiple fields in a schema - e.g. you need to check in a ‘change password’ action that the password is entered the same twice. Or you need to check that a certain value is higher than another value in the form. That’s where formvalidators come into play.
formvalidators are validators like all other field validators but they get the complete field dict as input, not a single item. Also formvalidators are run after all field validators successfully validated the input - therefore you have access to reasonably sane values, already converted to a handy Python data type. Unlike simple field validators, the validation process fails immediately if one formvalidator fails.
You can add formvalidators to a form like this:
class NumbersMatch(Validator):
    def validate(self, fields, context):
        if fields['a'] != fields['b']:
            self.error('no_match', fields, context=context)

schema.add_formvalidator(NumbersMatch)
Of course there is also a declarative way to use form validators:
class MySchema(SchemaValidator):
    # ...
    formvalidators = (NumbersMatch, )
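Why it matters that formvalidators see converted values can be shown with a small standalone sketch. numbers_match, process_form and form_error_for are hypothetical names, and plain callables like int stand in for the field validators.

```python
def numbers_match(fields):
    """Form-level check: sees all converted fields at once."""
    if fields['a'] != fields['b']:
        raise ValueError('no_match')


def process_form(field_validators, formvalidators, values):
    # Step 1: convert every field on its own.
    converted = {key: convert(values[key])
                 for key, convert in field_validators.items()}
    # Step 2: form-level checks see the converted values; the first
    # failing check aborts processing.
    for check in formvalidators:
        check(converted)
    return converted


def form_error_for(values):
    """Return the form-level error key for ``values``, or None if valid."""
    try:
        process_form({'a': int, 'b': int}, [numbers_match], values)
    except ValueError as error:
        return str(error)
    return None
```

The raw strings '1' and '01' differ, but after conversion both are the integer 1, so the form-level comparison succeeds.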
Validation schemas are an important piece of information: On the one hand they can serve as a kind of API specification (which parameters are accepted by your application) and on the other hand they are important for security audits (which constraints are put on your input values). Obviously this is something that you want to get right - duplicating this information only increases the likelihood of bugs.
The issue becomes especially annoying when you have a web application with a complex form (e.g. a new user registration process) that you want to split into multiple steps on different pages so that your users won’t drop out immediately when they see the huge form. It is good HTTP/ReST design practice to keep state on the client side. Therefore you pass fields from previous pages in hidden input fields to the next and for the final page it looks like there was one big form. This also has the advantage that you can shuffle the fields on the different pages without changing real logic.
With that approach you’re pretty much set - however you need a separate validation schema for every single page which is a huge duplication. With pycerberus you can avoid that by using ‘schema inheritance’:
class FirstPage(SchemaValidator):
    id = IntegerValidator()
    formvalidators = (SomeValidator(), )

class SecondPage(FirstPage):
    # this schema also contains the 'id' validator
    name = StringValidator()
    # formvalidators are implicitly appended so actually this schema has
    # these formvalidators: (SomeValidator(), AnotherValidator(), )
    formvalidators = (AnotherValidator(), )

class FinalPage(SecondPage):
    # this schema also contains the 'id' and 'name' validators
    age = IntegerValidator()
    # This page again contains both formvalidators
As you can see, every page adds some validators while keeping the old ones. This eliminates the duplication problem described above.
What happens if SecondPage declares a different validator for ‘id’? In this case it will just replace the ‘IntegerValidator()’ declared by ‘FirstPage’!
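Both rules (fields are replaced by name, formvalidators are appended) can be reproduced by walking the class hierarchy. This standalone sketch is one possible way to get that behavior, not pycerberus’ actual mechanism; FieldValidator, collect_fields and collect_formvalidators are hypothetical names, and strings stand in for formvalidator instances.

```python
class FieldValidator:
    """Hypothetical marker class for declared fields."""

    def __init__(self, kind):
        self.kind = kind


class FirstPage:
    id = FieldValidator('integer')
    formvalidators = ('some_validator',)


class SecondPage(FirstPage):
    id = FieldValidator('string')  # replaces the inherited 'id' field
    name = FieldValidator('string')
    formvalidators = ('another_validator',)


class FinalPage(SecondPage):
    age = FieldValidator('integer')


def collect_fields(schema_class):
    """Gather declared fields; subclasses override fields of the same name."""
    fields = {}
    # Walk from the most basic class to the most derived one.
    for klass in reversed(schema_class.__mro__):
        for name, value in vars(klass).items():
            if isinstance(value, FieldValidator):
                fields[name] = value
    return fields


def collect_formvalidators(schema_class):
    """Concatenate the formvalidators declared on each class, base first."""
    collected = []
    for klass in reversed(schema_class.__mro__):
        collected.extend(vars(klass).get('formvalidators', ()))
    return tuple(collected)
```

Looking only at vars(klass) rather than using getattr() is what makes appending work: each class contributes just the formvalidators it declares itself.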
So far I did not bother setting up a mailing list. If you have questions, please send an email to Felix.Schwarz@oss.schwarz.eu. When there are some users for pycerberus, I’ll create a mailing list.
pycerberus is licensed under the MIT license. As there are no other dependencies (besides Python itself), you can easily use pycerberus in proprietary as well as GPL applications.