Character Encodings

When writing applications with WebStack, you should try to use Python's Unicode objects as much as possible. However, there are a number of places where plain Python strings can be involved:

When Web pages (and other types of content) are sent to and from users of your application, the text will be in some kind of character encoding. For example, in English-speaking environments, the US-ASCII encoding is common and contains the basic letters, numbers and symbols used in English, whereas in Western Europe encodings like ISO-8859-1 and ISO-8859-15 are typically used, since they contain additional letters and symbols in order to support other languages. Often, UTF-8 is used to encode text because it covers most languages simultaneously and is therefore flexible enough for many applications.

When URLs are received by applications, the situation is a bit more awkward, because some of the request parameters must be interpreted before they can be used. The URL itself is encoded in US-ASCII, but it will contain special numeric codes (written as %XX) that indicate character values in the original text encoding; see the description of query strings for more information.
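The two situations above can be illustrated with a short sketch. It is written for Python 3 and uses only the standard library, not WebStack; in Python 3 terms, `str` plays the role of the Unicode object and `bytes` the role of the plain string. The variable names are illustrative only.

```python
from urllib.parse import parse_qs, quote, unquote

text = "café"

# The same text becomes different bytes in different encodings...
utf8_bytes = text.encode("utf-8")          # é takes two bytes
latin1_bytes = text.encode("iso-8859-1")   # é takes one byte

# ...and therefore different %XX codes when placed in a URL.
utf8_escaped = quote(text, encoding="utf-8")
latin1_escaped = quote(text, encoding="iso-8859-1")

# Interpreting request parameters requires knowing which encoding was used.
fields = parse_qs("name=" + latin1_escaped, encoding="iso-8859-1")

# Using the wrong encoding yields the wrong characters without any error.
wrong = unquote(utf8_escaped, encoding="iso-8859-1")

print(utf8_escaped, latin1_escaped, fields, wrong)
```

The last step is the typical source of "incorrect characters" in Web pages: nothing fails, but the decoded text no longer matches what the user entered.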

Recommendations

The following recommendations should help you avoid issues with incorrect characters in the Web pages (and other content) that you produce:

Use Unicode Objects for Textual Content

Handling text in specific encodings using normal Python strings can be difficult, and handling text in multiple encodings in the same application can be highly error-prone. Fortunately, Python has support for Unicode objects which let you think of letters, numbers, symbols and all other characters in an abstract way. For example, the request stream can be wrapped so that it yields Unicode objects instead of strings:

import codecs

class MyResource:

    encoding = "utf-8"

    def respond(self, trans):
        stream = trans.get_request_stream()                       # only reads strings
        unicode_stream = codecs.getreader(self.encoding)(stream)  # reads Unicode objects

        # [Some activity...]

        out = trans.get_response_stream()                         # writes strings and Unicode objects
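The wrapping technique can be tried without WebStack. The following is a minimal sketch for Python 3, assuming an in-memory `io.BytesIO` object as a stand-in for the request and response streams; the variable names are hypothetical.

```python
import codecs
import io

encoding = "utf-8"

# Stand-in for the request stream: a byte stream containing encoded text.
raw_stream = io.BytesIO("naïve café\n".encode(encoding))

# Wrap the byte stream so that reading yields decoded text.
unicode_stream = codecs.getreader(encoding)(raw_stream)
text = unicode_stream.read()

# The reverse: wrap an output byte stream so that text is encoded on writing.
raw_out = io.BytesIO()
unicode_out = codecs.getwriter(encoding)(raw_out)
unicode_out.write(text.upper())

print(repr(text), raw_out.getvalue())
```

Between the two wrappers the application only ever handles decoded text, so the encoding is named in exactly one place.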

Use Strings for Binary Content

If you are reading and writing binary content, Unicode objects are inappropriate. Make sure to open files in binary mode, where necessary.
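As a small illustration for Python 3, the sketch below writes and reads arbitrary bytes in binary mode using a temporary file. The file name and the data are made up for the example.

```python
import os
import tempfile

# Arbitrary binary data, including bytes that are not valid UTF-8.
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 0xFF, 0x00])

path = os.path.join(tempfile.mkdtemp(), "example.bin")

# "wb" and "rb" prevent any decoding or newline translation.
with open(path, "wb") as f:
    f.write(payload)

with open(path, "rb") as f:
    data = f.read()

print(data == payload)
```

Opening the same file in text mode would attempt to decode the bytes, and possibly translate newline sequences, which is not wanted for this kind of content.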

Use Explicit Encodings and Be Consistent

Although WebStack has some support for detecting character encodings used in requests, it is often best for your application to exercise control over which encoding is used when inspecting request parameters and when producing responses. The best way to do this is to decide which encoding is most suitable for the data presented and received in your application and then to use it throughout.

One approach which works acceptably for smaller applications is to define an attribute (or a global) which is conveniently accessible and which can be used directly with various transaction methods. Here is an outline of code which does this:

from WebStack.Generic import ContentType

class MyResource:

    encoding = "utf-8"  # We decide on "utf-8" as our chosen encoding.

    def respond(self, trans):
        # [Do various things.]

        fields = trans.get_fields_from_body(encoding=self.encoding)      # Explicitly use the encoding.

        # [Do other things with the Unicode values from the fields.]

        trans.set_content_type(ContentType("text/html", self.encoding))  # The output Web page uses the encoding.

        # [Produce the response, making sure that self.encoding is used to
        # convert Unicode to raw strings.]
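The same discipline can be shown in a self-contained form. The following is only a sketch for Python 3 using the standard library rather than WebStack's transaction API; `ENCODING` and `handle_form` are hypothetical names introduced for illustration.

```python
from html import escape
from urllib.parse import parse_qs

ENCODING = "utf-8"  # the single encoding chosen for the whole application

def handle_form(body):
    """Decode a URL-encoded form body and build an encoded response."""
    # Incoming: interpret the %XX codes using the chosen encoding.
    fields = parse_qs(body.decode("us-ascii"), encoding=ENCODING)
    name = fields.get("name", [""])[0]

    # Outgoing: declare the encoding and use the same one to produce bytes.
    content_type = "text/html; charset=%s" % ENCODING
    page = "<html><body><p>Hello, %s!</p></body></html>" % escape(name)
    return content_type, page.encode(ENCODING)

content_type, payload = handle_form(b"name=Zo%C3%AB")
print(content_type, payload)
```

Because the declared charset and the bytes actually sent both come from the one `ENCODING` value, they cannot drift apart.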

Use EncodingSelector to Set the Default Encoding

An arguably better approach is to use the EncodingSelector, one of the selectors described in "Selectors - Components for Dispatching to Resources", typically in a "site map" arrangement (as described in "Deploying a WebStack Application"):

from WebStack.Generic import ContentType
from WebStack.Resources.Selectors import EncodingSelector  # Module path assumed; check your WebStack version.

class MyResource:

    def respond(self, trans):
        # [Do various things.]

        fields = trans.get_fields_from_body()             # Encoding set by EncodingSelector.

        # [Do other things with the Unicode values from the fields.]

        trans.set_content_type(ContentType("text/html"))  # The output Web page uses the default encoding.

        # [Produce the response, making sure that the default encoding set by
        # EncodingSelector is used to convert Unicode to raw strings.]

def get_site_map():
    return EncodingSelector(MyResource(), "utf-8")
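The idea behind such a selector can be sketched in a few lines of Python 3. This is not WebStack's implementation: all three classes below, and the `default_charset` attribute, are hypothetical stand-ins intended only to show how a wrapper can set a default before handing over to a resource.

```python
class SketchTransaction:
    """Hypothetical stand-in for a transaction object."""
    def __init__(self):
        self.default_charset = None
        self.output = []

class SketchEncodingSelector:
    """Sets a default encoding, then hands over to the wrapped resource."""
    def __init__(self, resource, encoding):
        self.resource = resource
        self.encoding = encoding

    def respond(self, trans):
        trans.default_charset = self.encoding
        self.resource.respond(trans)

class SketchResource:
    def respond(self, trans):
        # The resource relies on the default rather than naming an encoding.
        trans.output.append("Olá".encode(trans.default_charset))

trans = SketchTransaction()
SketchEncodingSelector(SketchResource(), "utf-8").respond(trans)
print(trans.default_charset, trans.output)
```

The benefit of this arrangement is that the encoding is chosen once, where the application is assembled, and individual resources do not need to repeat it.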

Tell Other Components Which Encoding to Use

When using other components to generate content (see "Integrating with Other Systems"), such components may write the generated content straight to a normal stream (rather than one wrapped by a codecs module function). In such cases, for textual content such as XML or related formats (XHTML, SVG, HTML), you will probably need to instruct the component to use your chosen encoding; for example:

        # In the respond method, xml_document is an xml.dom.minidom.Document object...
        xml_document.toxml(self.encoding)

This will then generate the appropriate characters in the output and specify the correct encoding for the XML document.
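A minimal runnable sketch of this behaviour, written for Python 3 with the standard library's minidom, is shown below. The document content and the choice of ISO-8859-1 are illustrative assumptions.

```python
from xml.dom.minidom import getDOMImplementation

encoding = "iso-8859-1"

# Build a small document containing a non-ASCII character.
xml_document = getDOMImplementation().createDocument(None, "p", None)
xml_document.documentElement.appendChild(xml_document.createTextNode("café"))

# Passing the encoding produces encoded bytes and an XML declaration naming it.
serialised = xml_document.toxml(encoding)

print(serialised)
```

The result is a byte string, so it can be written directly to a normal stream without further conversion.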