S E

 

A Python Stream Editor

by

Frederic Rentsch

 

Version 2.3

© 2006

 

 

*

 


    1. SE—the Stream Editor

    2. Substitution Definitions

        2.1. The SE Object

        2.2. Defining Substitutions

            2.2.1. Search and Replace

            2.2.2. Deletion

            2.2.3. Regular Expression Targets

            2.2.4. Replacing Something with the Contents of a File

        2.3. Substitution Sets

        2.4. Notation

            2.4.1. The ASCII Notation

            2.4.2. Handling Space Inside Definitions

            2.4.3. Target Repeat

            2.4.4. Special Characters

    3. Working with Definition Sets

        3.1. Substitution Precedence

        3.2. Definition Files

        3.3. Merging Substitution Sets

        3.4. Redefinitions

            3.4.1. Exception

            3.4.2. Target Lock

        3.5. Filters (A short and well-known story with a happy ending)

            3.5.1. Filtering by Deletion

            3.5.2. Filtering by Extraction

            3.5.3. Null-Edit Modes: PASS and EAT

        3.6. Multiple-Pass Runs

            3.6.1. Cascading Runs

            3.6.2. Nesting Calls

        3.7. Setting Marks

            3.7.1. Temporary Runtime Marks

            3.7.2. Split Marks

        3.8. Dynamic Targeting

    4. Working with SE Objects

        4.1. The SE Object

            4.1.1. Maximum Target Length

            4.1.2. Data Paths

            4.1.3. Protecting and Altering Existing Files

            4.1.4. Backup File Extension

            4.1.5. Intermediate Data

            4.1.6. Clipping the Cascade Display

            4.1.7. Saving an Editor's Definitions

        4.2. The Translator Object

        4.3. Editing a Translator Interactively

            4.3.1. Adding Definitions

            4.3.2. Deleting Definitions

            4.3.3. Reversing a Translator

    5. Input, Output

        5.1. Input Types and Output Options

            5.1.1. Input: String

            5.1.2. Input: File Name

            5.1.3. Input: File Object

    6. The Message Log

 

    7. SE Light

        7.1. Type Casting Back and Forth

   

    8. Some Examples

        8.1. Programming

            8.1.1. Revising Names

            8.1.2. Deleting Tracing Statements

            8.1.3. Listing Names of Classes, Functions, Methods, Imports, Globals ...

        8.2. HTML

            8.2.1. A Link Extractor

            8.2.2. A Tag Stripper

        8.3. Expanding

            8.3.1. A Two-Pass Expanding Editor

            8.3.2. Combining two One-Pass Editors

        8.4. Siphoning (Reasonably) Current Stock Quotes from an Internet Site

        8.5. Two-Step Conversions through a Generic Format

        8.6. A Day in the Life of a Stenographer

 

    9. Closing Remarks

 

 

 

 

1. SE—the Stream Editor

 

We are all familiar with the search-and-replace function of a text editor. Search and replace is what SE does. But unlike text editors, it can do an arbitrary number of different replacements in one pass and it is designed to integrate into Python programs.

       SE stands for Stream Editor. It handles all 256 octets. It does any (practical) number of substitutions in one pass. It can do any (practical) number of consecutive passes. It is simple and intuitive for both quick hacks and production applications.

      Its simplicity is the result of keeping all programming-language constructs out of the interface: there is one single Editor object with no operational methods and no function library to memorize. SE is all about strings. It processes strings, and a string—the simplest data structure there is—is the argument of the Editor’s constructor. The simplicity of the design translates to extremely concise code, whether in a program or on the command line. It also translates to a high degree of building-block modularity: definition files can be combined when constructing Editor objects, and Editor objects can be combined at run time.

      Considering that searching and replacing is all the module does, its paucity of functions may not convey the impression of much capability. The impression misses the point because we tend to associate capability with massive arsenals of objects, methods and functions. SE is an enabling tool that offers instant solutions to a host of relatively simple yet annoying problems inhabiting the vacuum just beyond basic language functionality and beneath the specialized high-power functionality of production tools. Its power lies in a surprising range of application techniques that are intuitively obvious as well as trivial to express, even if combined to considerable functional complexity.

      This manual discusses the most important of these techniques.

     

 

 

 

2. Substitution Definitions

 

 

2.1. The SE Object

 

>>> import SE

>>> SE_Object = SE.SE (substitution_definitions)

>>> output = SE_Object (input, [output])

 

The definitions, the input and the output may each be either strings or files. Type identification is automatic. Strings are ideal for quick fixes and interactive development. Files are ideal for production. The details of IO typing are covered in chapter 5.

 

Terminology: The SE object is also referred to as Editor.

 

 

 

2.2. Defining Substitutions

 

 

2.2.1. Search and Replace

 

A substitution definition states a target and a substitute, the two separated by an equal sign: old=new. If the target is found in an input stream the substitute is written to the output stream. In between matching targets the input stream gets copied as is to the output stream.

 

>>> Old_New = SE.SE ('old=new')

>>> Old_New ("If 'old' reads 'new' we know it works.")

"If 'new' reads 'new' we know it works."

 

 

2.2.2. Deletion

 

Replacing a target with nothing amounts to a deletion:

 

>>> SE.SE ('HORROR=')('This >HORROR< must be deleted!')

'This >< must be deleted!'

 

 

2.2.3. Regular Expression Targets

 

If a target is placed between tildes ('~') it is compiled as a regular expression.

 

>>> Hide_Account_Number = SE.SE ('~123-45678-[0-3][012ABF]?~=***')

>>> Hide_Account_Number ('Client: 123-45678, Account Numbers: 123-45678-00, 123-45678-01 and 123-45678-0B')

'Client: 123-45678, Account Numbers: ***, *** and ***'

 

A couple of points to be aware of when using regular expressions are covered at the end of chapter 3.1.

 

Terminology: a target that is not a regular expression is referred to as a fixed target. It is part of a fixed definition.

  

 

2.2.4. Replacing Something with the Contents of a File

 

A substitute can be the contents of a file. To differentiate a file name from a literal substitute, the file name is placed in angled brackets.

 

>>> Letter = SE.SE ('LETTERHEAD=<correspondence/private/letterhead.1> INTRO=<correspondence/confirm_date> BYE=<correspondence/bye_bye>')

>>> print Letter ('LETTERHEAD\n\nINTRO\n\n\nSounds good. Looking forward.\n\nBYE')

(... mail-merged letter ...)

 

 


2.3. Substitution Sets

 

 

The first three of the previous examples use single definitions. The last one states three. An arbitrary number of definitions can be stated in a single string with white space between them. They constitute a substitution set. The definitions can be written in any order. A substitution run processes the entire set.

 

>>> Nicer = SE.SE ('cop=officer joker=person chickenshit=triviality guy=gentleman yell=smile')

>>> Nicer ('The guy yelled at the cop: What kind of a joker are you anyway to bother with such chickenshit?')

'The gentleman smiled at the officer: What kind of a person are you anyway to bother with such triviality?'

 

 

 

2.4. Notation

 

 

2.4.1. The ASCII Notation

 

SE operates on streams of bytes each of which can have a value between 0 and 255. Most byte values associate with a typable character by some standard. Typing a character into a definition inserts that character's ascii code.

      So typable characters represent themselves. But a few characters have a special meaning (2.4.4.). The equal sign, for instance, separates the target from the substitute, and white-space characters separate one definition from the next. If one of these special characters forms part of a target or a substitute, it cannot be typed per se. Some byte values, again, have no typable character assigned at all. Both cases call for a way to write such values into a definition: all octets can be written as parenthesized ascii codes, (0) to (255), or hex, (x0) to (xff).

 

>>> Unix2Windows = SE.SE ('(10)=(13)(10)')             # Either decimal

>>> Windows2Unix = SE.SE ('(xd)(xa)=(xa)')             # Or hex (case-insensitive)

>>> Line_Unwrap  = SE.SE ('(13)(10)=(32) (10)=(32)')

 

Line Unwrap will do both Windows and Unix.
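For reference, the same three conversions written with plain Python string methods. This is only a sketch of what the definitions above do, not of how SE does it:

```python
def unix2windows(s):
    # '(10)=(13)(10)': assumes Unix input (no '\r\n' pairs already present)
    return s.replace('\n', '\r\n')

def windows2unix(s):
    # '(xd)(xa)=(xa)'
    return s.replace('\r\n', '\n')

def line_unwrap(s):
    # '(13)(10)=(32) (10)=(32)': the longer target is handled first,
    # mirroring SE's longest-match rule (3.1.)
    return s.replace('\r\n', ' ').replace('\n', ' ')
```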

 

Python has its own method for writing ascii (e.g. \x07) and, in addition, back-slashed letters for the control characters (\a). The functional difference is that the Python notation gets decoded before SE receives it, whereas the SE notation gets past the Python interpreter unaffected.

 

    SE notation         Python notation     Function

    (7)    (x7)        '\a'   '\x07'        bell (alert)

    (8)    (x8)        '\b'   '\x08'        back space

   (12)    (xc)        '\f'   '\x0c'        page feed

   (10)    (xa)        '\n'   '\x0a'        line feed

   (13)    (xd)        '\r'   '\x0d'        carriage return

    (9)    (x9)        '\t'   '\x09'        tab

   (11)    (xb)        '\v'   '\x0b'        vertical tab

 

The Python backslash notation is tricky, because it functions in a minimal mode: if the character following a backslash is unambiguous, the backslash represents itself and stays. If, on the other hand, an ambiguous character follows, its disambiguation consumes the backslash. SE ascii is more straightforward and always works (with one very rare exception explained in 2.4.4.).

      Parenthesized groups have a different meaning in regular expressions. Consequently SE ascii cannot be used in regular expressions.

 

 

2.4.2. Handling Space inside Definitions

 

Working with text, a means is needed to type space characters inside a definition. The ascii format (32) would work, but it is awkward and destroys legibility. It is better to enclose the entire definition in double quotation marks. Note: the entire definition, not just one side.

 

>>> spaces   = ' "          =ten spaces" '

>>> spread   = ' "spread=s p r e a d" '

>>> unspread = ' "u n s p r e a d"=unspread '  # Will not compile as intended. Quote entire definition.

 

Double quotes span multiple lines:

 

>>> Addresses = SE.SE ('''

"TB=Mr. Tony Blair

Downing Street 10

London

"

"GB=Mr. George Bush

The White House

Washington DC

" ''')

>>> print Addresses ('\nGB\nTB')

 

Mr. George Bush

The White House

Washington DC

 

Mr. Tony Blair

Downing Street 10

London

 

Newline characters ('\n') and tab ('\t') don't look like white space, but they are. (So would be '\v', '\f' and '\r' if we needed them.)

 

>>> SE_Object = SE.SE ('word\tanother_word\nnew_line=whatever')

>>> SE_Object ('word\tanother_word\nnew_line')  # Doesn’t work as expected

'word\tanother_word\nwhatever'

 

The compiler gets a single word 'word' and doesn't know what to do with it. The same occurs again with the next word, 'another_word'. The words are ignored and the fact is recorded in the object’s message log (6.).

 

>>> SE_Object = SE.SE (' "word\tanother_word\nnew_line=whatever" ')   # Quotes work

>>> SE_Object = SE.SE (' word(9)another_word(10)new_line=whatever ')  # SE ascii works quoted or unquoted

>>> C_Block_Comment_Eater = SE.SE (' "~[\n\s]*/\*(.|\n)*?\*/~=" ')

 

Double-quoting regular expressions may not be necessary, but quoting them habitually saves the time it takes to find out.
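The splitting behavior is roughly that of shell-style tokenizing, which Python's standard shlex module can illustrate. This is an analogy only, not SE's actual compiler:

```python
import shlex

definitions = ' "word\tanother_word\nnew_line=whatever" single=def '
tokens = shlex.split(definitions)
# each token is one definition; the double quotes keep the embedded
# tab and newline inside the first definition together
```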

 

 

2.4.3. Target Repeat

 

The primary function of the equal sign is that of separating target and substitute. Other equal signs may follow, but their function changes to that of a place holder for the target.

 

>>> Double_Space = SE.SE ('(10)=(10)(10)  (13)(10)=(13)(10)(13)(10)')  # Does both Windows and Unix

>>> Double_Space = SE.SE ('(10)===        (13)(10)===')  # Same thing

                                ^^                 ^^

With regular expressions the target place holder '=' is a necessity, because the match is not known in advance. The following writes the match to the output and adds a newline:

 

>>> image_files = ' ~([Hh][Tt][Tt][Pp]:)?//([A-Za-z]:)?[A-Za-z_/][A-Za-z_0-9/.]*.[bBJjGg][mMPpIi][pPGgFf]"?~==(10)'

 

The target place holder '=' combines with other characters:

 

>>> Long_Lat = SE.SE ('"New York==: (-73.96, 40.76)" "Tokyo==: (139.77, 35.68)"')  

>>> Long_Lat ('New York, Tokyo')

'New York: (-73.96, 40.76), Tokyo: (139.77, 35.68)'

 

 

2.4.4. Special Characters

 

Two characters have a special meaning:

 

'='   (61)  (x3d)    # 2.2.1. Separator between target and substitute. Leftmost position.

'='   (61)  (x3d)    # 2.4.3. Target place holder anywhere in substitute.

' '   (32)  (x20)    # 2.3.   Separator between definitions (also '\t', '\n' (all c.isspace ()))

 

These characters identify the nature of the character group that follows:

 

'#'   (35)  (x23)    #        Starts a comment.

'"'   (34)  (x22)    # 2.4.2. Starts a definition containing spaces.

'~'  (126)  (x7e)    # 2.2.3. Starts a regular expression target.

'<'   (60)  (x3c)    # 2.2.4. Starts a substitution file name.

'('   (40)  (x28)    # 2.4.1. Starts an ascii notation (only in conjunction with a valid ascii number).

 

Special characters meant literally can always be ascii-ized, though this may be cumbersome and illegible. ‘(0)’, for instance, compiles to ascii 0. Meant verbatim, the expression ‘(40)0)’ would work. More legible is a backslash: ‘\(0)’. However, a literalizing backslash works only in conjunction with characters that are special per se or that start a group with a meaning other than plain text. Its existence is temporary: after taking its clue the compiler discards it. If, on the other hand, a backslash has no effect on the character that follows, the backslash stands for itself and stays.

      This sounds awfully intimidating. And it would be, if the system didn’t offer interactive facilities that juxtapose cause and effect. The only thing to memorize is that there can be snags with special characters. If, for whatever reason, you prefer back-slashing over SE ascii and are unsure, consider your character unspecial and see how it comes out, either by means of the method show (show_translators = 1) (4.2.) or by processing short test strings, or both.

 

Just for the record here’s a summary of tricky cases. To memorize them is a waste of time. Skim over them and focus on the two exceptions to interchangeability following the summary. (->)

 

 '\ =space'                   not valid (white space cannot be back-slashed)

 

 '#=comment'                  not valid (a comment)

 '\#=comment'                 valid: #->comment

 '(35)=comment'               valid: #->comment

 

 '~=tilde'                    valid: ~->tilde      (not a regular expression)

 '~~=2-tilde'                 valid: ~~->2-tilde   (not a regular expression)

 '(126)=tilde'                valid: ~->tilde

 '(126)(126)=2-tilde'         valid: ~~->2-tilde

 '\~~=2-tilde'                valid: ~~->2-tilde   

 

 '\~[a-z]~=not(32)regex'      valid: ~[a-z]~->not regex  (matches literally as written)

 '(126)[a-z]~=not(32)regex'   valid: ~[a-z]~->not regex

 

 '"=double quote"'            not valid: no target

r'\"=double quote"'           valid unintended: "->double

 '""=double quote"'           valid: "->double quote

 '\""=double quote"'          valid: "->double quote    (Not a raw string: Python consumes the backslash)

r'\""=2-double-quotes"'       valid: ""->2-double-quotes

 '(34)=double-quote"'         valid: "->double-quote"

 '"(34)=double quote"'        valid: "->double quote

 

r'\=backslash'                not valid: Backslash literalizes the separator

r'\\=backslash'               not valid: Backslash literalizes the separator: no target

 '(92)=backslash'             valid: \->backslash  # SE ascii required

 

 '\=\==\=\=:(32)equality'     valid: ==->==: equality   (show () displays |==|->|\=\=: equality|)

 '\=\===:(32)equality'        valid: ==->==: equality   (show () displays |==|->|=: equality|. Unslashed '=' stands for target)

 '(x3d)(x3d)==:(32)equality'  valid: ==->==: equality

 

 '\(88)=X'                    valid: (88)->X   (Backslash cancels ascii-ization. show () displays ->|\(88)|->|X|)
 '(88\)=X'                    valid: (88\)->X  (Backslash has no function and stays) 
r'\\(88)=X'                   valid: \(88)->X  (show () displays ->|\\(88)|->|X|)

 ':B:=<b>'                    valid: :B:->(...contents of file b...)    (show () displays ->|<b>|)
 ':B:=\<b>'                   valid: :B:-><b>                           (b is not a file name (show () displays ->|\<b>|)    

 ':B:=<b\>'                   valid: :B:->(...contents of file b\...)   (Expect 'no-such-file error')    

 

 '(257)\#~"~=unambiguous' valid: (257)\#~"~->unambiguous

 

Exception 1: SE ascii  ‘(61)’ cannot literalize the target place holder ‘=’, because with regular expressions lurking, the target is not known at compile time and so the backslash-equality combo needs to be available at run time.

 

 'assignment=='               valid: assignment->assignment  (Second equal sign is always a target place holder...)

 'assignment=(x3d)'           valid: assignment->assignment  (... even ascii-ized)

 'assignment=\='              valid: assignment->=           (Okay. show () displays ->|\=|)

 

Exception 2: A backslash ending a target cannot be typed, because it would literalize the separator. Here SE ascii is a requirement.

 

r'C:\temp\=TEMPDIR'           not valid: no separator, no target

r'C:\temp(92)=TEMPDIR'        valid: C:\temp\->TEMPDIR

 

 

 

 

3. Working with Definition Sets

 

 

 

3.1. Substitution Precedence

 

 

The substitution mechanism works like this: A read pointer moves downstream byte by byte and checks whether, beginning at its position, there are matches among the defined substitution targets. If no match is found, the byte at the pointer is written to the output stream and the pointer advanced by one byte. If a match is found, the substitute associated with the matching target is written to the output stream and the read pointer advanced by the length of the matched target. If multiple targets match, the longest one applies. This logic is the only one that resolves overlapping targets sensibly. It also means that substitutions don't stack on top of one another in the manner of a C compiler expanding #define macros.

 

>>> overlapping_targets = 'be=BE being=BEING been=BEEN bee=BEE belong=BELONG long=LONG longer=LONGER'

>>> story = "There was a bee belonging to hive nine longing to be a beetle and thinking that being a bee was okay, but she had been a bee long enough and wouldn't be one much longer."

>>> SE.SE (overlapping_targets)(story)

"There was a BEE BELONGing to hive nine LONGing to BE a BEEtle and thinking that BEING a BEE was okay, but she had BEEN a BEE LONG enough and wouldn't BE one much LONGER."

 

The translation is formally correct, although BELONGing, LONGing and BEEtle exemplify hits that might cause collateral damage in a text. It shows that processing free text is hazardous if freedom from error is critical. Some techniques help to avoid stray hits. More on this as we go along.
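The scan described at the top of this chapter can be sketched in a few lines of plain Python. This is a simplification for fixed targets only and certainly not SE's actual implementation:

```python
def substitute(text, table):
    """One pass of longest-match substitution over a dict of fixed targets."""
    targets = sorted(table, key=len, reverse=True)   # longest match wins
    out, i = [], 0
    while i < len(text):
        for t in targets:
            if text.startswith(t, i):
                out.append(table[t])        # write the substitute ...
                i += len(t)                 # ... and skip the matched target
                break
        else:
            out.append(text[i])             # no match: copy one byte, advance by one
            i += 1
    return ''.join(out)
```

Applied to overlapping targets, the sketch picks 'bee' over 'be' and 'belong' over 'long', just as the example above demands.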

 

Regular expressions may introduce precedence contests at run time, because they may match targets that other definitions also match. Precedence resolution rules with identical matches are: fixed-over-regex and last-defined-regex-over-previously-defined-regex.

 

>>> SE.SE ('~[aeiou]+~=vowels  "~a+~=string of a"') ('aaaaa')             # Regex contest: Last defined takes

'string of a'

>>> SE.SE ('"~a+~=string of a" ~[aeiou]+~=vowels') ('aaaaa')              # Swapped, the other one takes

'vowels'

>>> SE.SE (' aaaaa=5a  "~a+~=string of a" ~[aeiou]+~=vowels') ('aaaaa')   # Regex-fixed contest: fixed takes

'5a'

 

The fixed-over-regex rule makes it possible to state exceptions to sweeping matches.

 

>>> print SEL.SEL (' "~[aeiou][aeiou]~=[=]"')('Between brackets you see double vowels')

Betw[ee]n brackets y[ou] s[ee] d[ou]ble vowels

 

>>> print SEL.SEL (' aa== ee== ii== oo== uu== "~[aeiou][aeiou]~=[=]"')('Between brackets you see double vowels but no repeating pairs')

Between brackets y[ou] see d[ou]ble vowels but no rep[ea]ting p[ai]rs

 

Regular expressions should not be used as a shorter alternative to a manageable number of fixed definitions. Besides potentially clouding the precedence situation, they are also slower. They are best used in ways that keep them from competing for targets.

 

 

 

3.2. Definition Files

 

 

>>> french_ctime = '''

      Sun=dimanche, Mon=lundi, Tue=mardi, Wed=mercredi,

      Thu=jeudi, Fri=vendredi, Sat=samedi,

      Jan=janvier  Feb=février Mar=mars   Apr=avril  May=mai  Jun=juin

      Jul=juillet  Aug=août  Sep=septembre Oct=octobre Nov=novembre Dec=décembre'''

>>> French_Ctime = SE.SE (french_ctime)

>>> French_Ctime (time.ctime())

'mardi, mars 28 15:55:50 2006'

 

Having to write such a definition list just to get ctime in French would be impractical to the point of rendering the system unusable. Definition sets can be stored in text files, either edited from scratch or generated from an SE object using its method save () (4.1.7.)

 

>>> file ('se/french_ctime.se', 'w').write (french_ctime)    # either this way ...

>>> French_Ctime.save ('se/french_ctime.se')                 # or this way (4.1.7.)

 

Naming the file in a definition string is equivalent to writing its contents:

 

>>> SE.SE ('se/french_ctime.se')(time.ctime ())

'lundi, avril 10 11:05:43 2006'

 

The definitions can be arranged in any way as long as white space separates them. Comments can be added. A comment starts with ‘#’ and ends at the end of the line.

      Naming files is recursive. Definition files may name other files. Special-purpose sets can be assembled in seconds from basic theme modules:

     

$ echo 'astro/stars_coord.se astro/messier_coord.se astro/geography/cities_coord.se' > astro/all_coordinates.se

 

A library of files for frequently used translations can be built little by little and each file extended when a definition appears to be missing, or tweaked when it slips up.

 

 

 

3.3. Merging Substitution Sets

 

 

Since the definitions' format is a single string with individual definitions separated from one another by white space and since the order of the definitions doesn't matter, merging sets is a simple matter of joining strings.

 

>>> Ids_To_Symbol = SE.SE ('finance/se/cusip2symbol.se finance/se/isin2symbol.se finance/se/sec2symbol.se')

 

This example shows three definition files which translate investment-title ids to their respective symbols. There are several id standards in use. Each file defines one of them. Lining their names up as shown will merge their definitions into one single set that exchanges ids for symbols whatever standard an id happens to belong to. Note that such ids can be relied on to be unique in any data (except American stock symbols in texts), so that a reliable system can be built with large substitution sets.


File names and definitions combine freely.

 

>>> sentence = "Trinitrotoluene was added to the 'Chemicals Register Part 19-A' of the United States Department of Transportation."

>>> SE.SE ('se/common_abbreviations.se')(sentence)

"Trinitrotoluene was added to the 'Chemicals Register Part 19-A' of the U.S. DoT."

 

>>> SE.SE ('se/common_abbreviations.se Trinitrotoluene=TNT "Chemicals Register=CR"')(sentence)

"TNT was added to the 'CR Part 19-A' of the U.S. DoT."

 

Beginning and ending sets with white space ensures that no two definitions fuse into something dysfunctional when splicing sets.
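In Python, merging is plain string work. A small helper (hypothetical, not part of SE) can do the white-space padding systematically:

```python
def merge_sets(*sets):
    # surrounding spaces ensure no definition fuses with a neighbor
    return ' ' + ' '.join(sets) + ' '

definitions = merge_sets('se/common_abbreviations.se',
                         'Trinitrotoluene=TNT',
                         '"Chemicals Register=CR"')
```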

 

 

 

3.4. Redefinitions

 

 

Defining a substitute for a target that has been defined previously is not a mistake. The new definition simply replaces the previous one.

 

>>> Translate = SE.SE ('old=antique old=aged')

>>> Translate ('old people')

'aged people'

 

This example does look pointless, but it is contrived only for the sake of illustration. Redefinitions can be useful for defining ad hoc exceptions to definition files.

 

 

3.4.1. Exception

 

>>> SE.SE ('se/unabbreviations.se') ('personal/cv.txt', '')   # '' means output string (5.1.)

'... October 1958: Bachelor of Arts ... March 1976: Bachelor of Arts fellowship ...'

 

se/unabbreviations.se would expand B.A. as Bachelor of Arts. This misfires when B.A. means British Academy. There must be a way to handle exceptions quickly and without altering boilerplate files. Defining an exception following the file name is fast and preserves the file. Order matters in this case. The exception must follow the rule.

 

>>> SE.SE ('se/unabbreviations.se "B.A. fellow=British Academy fellow"') ('personal/cv.txt', '')

'... October 1958: Bachelor of Arts ... March 1976: British Academy fellowship ...'

 

 

3.4.2. Target Lock

 

>>> Line_Unwrap  = SE.SE ('(13)(10)=(32) (10)=(32)')

 

The Line Unwrapper in 2.4.1. wouldn’t be very useful because it unwraps an entire text when it should unwrap only paragraphs. We must somehow tell it to preserve line feeds that are part of a paragraph break. Locking a target is a simple matter of replacing it with itself:

 

>>> Line_Unwrap = SE.SE ('(13)(10)=(32) (10)=(32) (13)(10)(13)(10)=(13)(10)(13)(10) (10)(10)=(10)(10)')

 

If we expect indented paragraphs we would protect those too. The target-repeat place holder '=' comes in handy.

 

>>> Line_Unwrap  = SE.SE ('(13)(10)=(32) (10)=(32) (13)(10)(13)(10)== (10)(10)== "~\r?\n[ \t]~=="')

 

Locked targets acquire precedence over action targets quite naturally: since they include adjacent characters, they are longer.

 

>>> SE.SE ('foundation/anonymize_member_names.se "Barbara Kycenuk=="') ('foundation/board/Feb-19', '')

'List of donors: $1000: ***. $500: ***, ***. $100: ***, ***. Signed Barbara Kycenuk, President'

 

 

 

3.5. Filters (A short and well-known story with a happy ending)

 

 

3.5.1. Filtering by Deletion

 

>>> kingdom = 'woods, rivers, castle, king, princess, mountains, terrifying dragon, people, towns, lakes'

>>> whacking_the_dragon = ' "terrifying dragon=" '

>>> Heroic_Task = SE.SE (whacking_the_dragon)

>>> Heroic_Task (kingdom)

'woods, rivers, castle, king, princess, mountains, , people, towns, lakes'

 

"Well done!" the king lauds. "Good riddance!"

"No big deal." the prince brags. "I usually do bigger ones don't you know. Anyway, where's my date?"

"Not so fast, young man. Your task is to pick her out together with the dead dragon. My cook gets the dragon. You get the princess. Okay?"

"Is that all?"

"And just how many princesses might be satisfactory?"

"I'm sorry. I meant to ask is it all I have to do?"

"It is all you have to do and you have one hour to do it."

 

 

3.5.2. Filtering by Extraction

 

>>> Heroic_Task = SE.SE ('"terrifying dragon=dead dragon" woods= rivers= castle= king= mountains= people= towns= lakes=')

>>> Heroic_Task (kingdom)

', , , , princess, , dead dragon, , , '

 

"There you are. Forty-seven seconds flat."

"I will make allowance for your juvenile rashness. You must be aware the a kingdom has more than a half dozen ... system components. A lot more. I intend to run my entire kingdom through your filter."

"Well ... may I ... have the inventory?"

"Certainly. Ask the Chancellor. It has grown to four volumes, I believe."

" ... Four ... vo ..."

"Four volumes! ... One hour!"

 

We skip the prince's mental processes and rejoin him fifty-nine minutes later:

 

>>> whacking_the_dragon = ' "terrifying dragon=dead dragon" '

>>> everyone_stay_home = ' '.join (['(x%x)=' % n for n in range (256)])

>>> print everyone_stay_home [:30], '...', everyone_stay_home [-35:]

(x0)= (x1)= (x2)= (x3)= (x4)=  ...  (xfb)= (xfc)= (xfd)= (xfe)= (xff)=

>>> except_the_princess = ' princess== (32)== '

>>> Heroic_Task = SE.SE (whacking_the_dragon + everyone_stay_home + except_the_princess)

>>> test_string = "I don't give a damn what kind of kingdom this is and what's in it as long as there's a terrifying dragon and a princess."

>>> Heroic_Task (test_string)

'                    dead dragon   princess'

 

 

3.5.3. Null-Edit Modes: PASS and EAT

 

>>> Null_Editor = SE.SE ('')

 

Here is an Editor that doesn't do any editing. What should come out?

 

>>> Null_Editor ('We expect the input stream to come out unaltered.')

'We expect the input stream to come out unaltered.'

 

And so it does. As chapter 3.1. explains, SE's null-edit mode lets unmatched input pass. This mode is suitable for translations, and that is why it is the default mode. It is also suitable for deletion filters, but not for extraction filters. The prince's second task is to write an extraction filter, and he does it by coaxing the pass mode into a no-pass mode, preconditioning the Translator with a wall-to-wall set of deletions. This suits his purpose: marrying up. It would suit our purpose, too, if we saved everyone_stay_home in a file, named it say eatall.se, and pulled it in first thing every time we need an extraction filter. From a system-design point of view, though, that would be an unsatisfactory subterfuge and unlikely to be the most efficient one.

      So we introduce the null-edit mode eat as the opposite of the default mode pass. If, instead of the file name eatall.se, we place the keyword <EAT>, a no-pass-mode translator results. The opposite keyword <PASS> exists but is redundant, because all it does is leave the default mode unchanged.

 

>>> SE.SE ('<EAT>')('The keyword <EAT> without definitions zaps everything.')

''

>>> Heroic_Task = SE.SE (whacking_the_dragon + ' <EAT> ' + except_the_princess)

 

The defined substitutions, of course, work the same in either mode.
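The two modes differ only in what happens to unmatched input. A toy model in plain Python (fixed targets only, not SE's implementation) makes the difference explicit:

```python
def translate(text, table, mode='PASS'):
    """Longest-match substitution with a null-edit mode of PASS or EAT."""
    targets = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for t in targets:
            if text.startswith(t, i):
                out.append(table[t])
                i += len(t)
                break
        else:
            if mode == 'PASS':
                out.append(text[i])   # PASS copies unmatched bytes through
            i += 1                    # EAT simply skips them
    return ''.join(out)
```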

 

 

 

3.6. Multiple-Pass Runs

 

 

3.6.1. Cascading Runs

 

>>> Heroic_Task (test_string)

'                    dead dragon   princess'

 

The king shakes his head. His daughter is very pretty and so he thinks he should raise the stakes a little: “I'm afraid this won't do! I want no extra spaces. You flunk. But I happen to be in a lenient mood today and so I will grant you another half hour.”

 

>>> deflate = ' "~ +~= " '

>>> Deflator = SE.SE (deflate)

>>> essentials_with_garbage = Heroic_Task (test_string)

>>> Deflator (essentials_with_garbage)

' dead dragon princess'

 

The prince resorts to two runs in succession. The first one extracts what is wanted, but also produces a lot of extra spaces as a consequence of keeping the princess at a safe distance from the cadaver. The extra spaces need to be removed in a second run. The prince then decides to chain the two runs in case the king objects to two separate filters. This is how he does it: he writes the definitions of both translations into the same string and separates them with a free-standing vertical bar. The bar starts a pass with the definitions preceding it. The output of that pass then becomes the input of the next pass, defined by the definitions following the bar.

 

>>> Heroic_Tasks = SE.SE (whacking_the_dragon + ' <EAT> ' + except_the_princess + ' | ' + deflate)

>>> Heroic_Tasks (test_string)

' dead dragon princess'

 

There is no systematic limit to the number of passes in a translation cascade.

 

>>> SE.SE ('A=B | B=C | C=D | D=E | E=F')('ABCDEF')

'FFFFFF'
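With one fixed definition per pass, a cascade like this reduces to chained replaces in plain Python; each pass consumes the complete output of the previous one, which is why every letter ends up as F:

```python
# one (old, new) pair per '|'-separated pass
passes = [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'E'), ('E', 'F')]
text = 'ABCDEF'
for old, new in passes:
    text = text.replace(old, new)   # output of one pass feeds the next
```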

 

 

3.6.2. Nesting Calls

 

>>> essentials_with_garbage = Heroic_Task (test_string)

>>> Deflator (essentials_with_garbage)

' dead dragon princess'

 

Quite obviously the variable essentials_with_garbage is not necessary to convey intermediate data from one call to the next if we nest the two calls:

 

>>> Deflator (Heroic_Task (test_string))

' dead dragon princess'

 

If cascading runs are the pre-compile implementation of the building-block paradigm, nested calls are its post-compile implementation. Nested calls are possible because an SE Editor by default type-matches output to input.

 

>>> Generic_To_Symbol = SE.SE ('generic_to_symbol.se')

>>> Cusip_To_Generic = SE.SE ('cusip_to_generic.se')

>>> Generic_To_Symbol (Cusip_To_Generic ('statement.txt'))

  'statement.txt.~se.~se'

 

 The nested calls are functionally equivalent to

 

>>> SE.SE ('cusip_to_generic.se | generic_to_symbol.se')('statement.txt')

  'statement.txt.~se'

 

 

 

3.7. Setting Marks

 

 

Setting marks is a pre-processing technique. It translates features ill-suited to the task at hand into features better suited to its processing capabilities, or it adds marks at locations which the intended process needs to find but would have difficulty finding unaided.

 

 

3.7.1. Temporary Runtime Marks

 

"Well done for the spaces, except the leading one," concedes the king. "But what am I to make of a dead dragon princess? No, no, no! This is getting from bad to worse. I want no extra spaces at all and at least a comma to keep my daughter away from this ghastly beast. I should really whip you out of my country at this point, but you have ten minutes left. So I will let you give it one last try."

 

>>> Heroic_Tasks = SE.SE ('<EAT> "terrifying dragon=+MARK+dead dragon+MARK+" princess=+MARK+princess+MARK+ | +MARK+= +MARK++MARK+=,(32)' )

>>> Heroic_Tasks (test_string)

'dead dragon, princess'

>>> Heroic_Tasks ('home/kingdom/chancellery/inventory', '')

'princess, dead dragon'

 

... happily ever thereafter.

 


3.7.2. Split Marks

 

Splitting strings on section points of various shapes is a simple matter after converting each one to the same separator, chosen to be distinct from anything in the text:

 

>>> inventory_line = 'Crank 2 x 1 6000-2RS1 10 26 8 460 lbs 0.019 * 2 = 0.038 GBP 124.60 (1)'

>>> #                 ^^^^^ ||||| ^^^^^^^^^             |||       | ^ | ^^^^^ ||| ^^^^^^        # ^^^ = data, ||| = split

>>> #                 item        part number                quantity   weight    price                 

>>> Inventory_Line_Splits = SE.SE ('lbs=| GBP=| \==| *=| "~[0-9]+\s*x\s*[0-9]+~=|')

>>> inventory_line_with_split_marks = Inventory_Line_Splits (inventory_line)

>>> print inventory_line_with_split_marks  # Verify

Crank | 6000-2RS1 10 26 8 460 | 0.019 | 2 | 0.038 | 124.60 (1)

>>> item_split = [x.strip () for x in inventory_line_with_split_marks.split ('|')]

>>> pn = item_split [1].split () [0]

>>> price = item_split [5].split (None, 1) [0]

>>> print 'Item: %s - PN: %s - Quantity: %s - Weight: %s kg – Price: %s' % (item_split [0], pn, item_split [3], item_split [4], price)

Item: Crank - PN: 6000-2RS1 - Quantity: 2 - Weight: 0.038 kg – Price: 124.60

 

Marks are temporary and so can be anything that isn’t part of the data. Bytes with ASCII values 0 through 6 (below the commonly used control characters) make good marks in text.
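The same trick is easy to reproduce in plain Python with re: convert every separator variant to one low-ASCII mark, then split once. (The mark byte and the sample line are invented for illustration.)

```python
import re

MARK = '\x01'   # ASCII 1: all but certain not to occur in ordinary text

line = 'Crank; 6000-2RS1, 0.038 | 124.60'
# Convert every kind of separator to the same mark, then split once.
marked = re.sub(r'[;,|]', MARK, line)
fields = [field.strip() for field in marked.split(MARK)]
print(fields)   # ['Crank', '6000-2RS1', '0.038', '124.60']
```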

 

 

 

3.8. Dynamic Targeting

 

 

Dynamic targeting consists in building, compiling and running substitution definitions for targets extracted from the source at run time.

      The inventory catalog in the previous example lists weights in pounds and prices in British currency. A new catalog should be made with weights in kilograms and prices in Euro, with the old figures in parentheses. Here is how it can be done using SE: for each line we extract the patterns with the relevant figures. With these patterns as targets we build substitution definitions and finally run the line through an Editor made with these definitions.

 

>>> inventory_line = 'Crank 2 x 1 6000-2RS1 10 26 8 460 lbs 0.019 * 2 = 0.038 GBP 124.60 (1)'

>>> #                              Targets to extract:  ^^^^^^^^^     ^^^^^^^ ^^^^^^^^^^

>>> Targets = SE.SE (r' <EAT> "~(lbs|\=|GBP) +[0-9]+(\.[0-9]+)?~==(10)" ')

>>> definitions = Targets (inventory_line).split ('\n')   # definitions: ['lbs 0.019', '= 0.038', 'GBP 124.60', '']

>>> figure = float (definitions [0].split ()[1]); definitions [0] = '"%s=kg %.3f (lbs %.3f)"' % (definitions [0], figure * lbs_to_kg, figure)

>>> figure = float (definitions [1].split ()[1]); definitions [1] = '"\\%s=\\= kg %.3f (lbs %.3f)"' % (definitions [1], figure * lbs_to_kg, figure)

>>> figure = float (definitions [2].split ()[1]); definitions [2] = '"%s=EUR %.2f (GBP %.2f)"' % (definitions [2], figure * gbp_to_eur, figure)

>>> # definitions: ['"lbs 0.019=kg 0.009 (lbs 0.019)"', '"\\= 0.038=\\= kg 0.017 (lbs 0.038)"', '"GBP 124.60=EUR 180.67 (GBP 124.60)"', '']

>>> new_line = SEL.SEL (' '.join (definitions)) (inventory_line)

>>> print new_line

Crank 2 x 1 6000-2RS1 10 26 8 460 kg 0.009 (lbs 0.019) * 2 = kg 0.017 (lbs 0.038) EUR 180.67 (GBP 124.60) (1)

 

The Editor object becomes obsolete after one call and so it doesn’t need to be assigned to a variable.
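The same effect can be sketched in plain Python with re.sub callbacks, which build each substitute from the matched figure at run time. LBS_TO_KG is the standard conversion factor; GBP_TO_EUR = 1.45 is an assumed rate chosen to reproduce the figures above:

```python
import re

LBS_TO_KG = 0.45359237   # standard pounds-to-kilograms factor
GBP_TO_EUR = 1.45        # assumed 2006-era exchange rate

def lbs_repl(m):
    v = float(m.group(1))
    return 'kg %.3f (lbs %.3f)' % (v * LBS_TO_KG, v)

def sum_repl(m):
    v = float(m.group(1))
    return '= kg %.3f (lbs %.3f)' % (v * LBS_TO_KG, v)

def gbp_repl(m):
    v = float(m.group(1))
    return 'EUR %.2f (GBP %.2f)' % (v * GBP_TO_EUR, v)

line = 'Crank 2 x 1 6000-2RS1 10 26 8 460 lbs 0.019 * 2 = 0.038 GBP 124.60 (1)'
line = re.sub(r'lbs ([0-9]+\.[0-9]+)', lbs_repl, line)
line = re.sub(r'= ([0-9]+\.[0-9]+)', sum_repl, line)
line = re.sub(r'GBP ([0-9]+\.[0-9]+)', gbp_repl, line)
print(line)
```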

 

 

 

4. Working with SE Objects

 

 

 

4.1. The SE Object

 

 

The method show () displays the object's runtime settings:

 

>>> SE_Object.show ()

 

SEL.DEFINITION_PATH  > None (module attribute)

 

<SE.SE instance at 0x00F8AEB8>

 

Compiling

  MAX_TARGET_LENGTH  > 1024

 

Processing

  INPUT_PATH         > None

  OUTPUT_PATH        > None

  FILE_HANDLING_FLAG > 0    (No existing file may be replaced)

  BACKUP_EXTENSION   > .bak (Replaced files take this extension)

 

Developing

  KEEP_CASCADE       > 0 (no)

  CASCADE_DATA_CLIP  > 128

 

The topmost item SEL.DEFINITION_PATH is a module attribute, because it needs to be available when the SE object is created. It supplies a path component to definition files that don’t have their own path (slash). By default it doesn’t supply anything; a pathless name is then taken to refer to the current working directory.

 

>>> SE.set (definition_path = 'I:/Expeditions/Africa/Sahara/2003/se')

>>> Species_Latin = SE.SE ('reptiles.se insects.se coleopteran.se rodents.se canids.se birds.se')

 

The other items are instance variables. They can be changed with the method set (), which takes keyword arguments.

 

>>> SE_Object.set (max_target_length = 2000)

 

The keywords are the same as the names of the settings, spelled in lower case. The keyword reset restores the defaults of all settings that are not simultaneously being changed.

 

>>> SE_Object.set (reset = 1, max_target_length = 2000)

 

 

4.1.1. Maximum Target Length

 

MAX_TARGET_LENGTH  > 1024   (Default: 1024)

 

Longer targets will compile without warning, but may not translate correctly. The setting actually determines the length of the buffer for processing files.

 

 

4.1.2. Data Paths

 

INPUT_PATH    > None      (Default: None)

OUTPUT_PATH   > None      (Default: None)

 

>>> SE_Object.set (input_path = 'patents/portable_aura_energizer/19.3',

                   output_path = 'patents/applications/portable_aura_energizer/20.0')

>>> SE_Object ('cip_17', 'cip_17')

'patents/applications/portable_aura_energizer/20.0/cip_17'

 

The path settings apply only to file names without a path component. Any file name with its own path (slash) is deemed an exception to the setting and is used as is. The setting stays put.

 

>>> SE_Object ('scratchpad/retrofits')

'scratchpad/retrofits.~se'

 

>>> SE_Object ('retrofits', 'temp/retrofits')

'temp/retrofits'

 

 

4.1.3. Protecting and Altering Existing Files

 

FILE_HANDLING_FLAG   > 0   (Default is 0: no existing data files may be changed)

 

>>> SE_Object ('some_file_name', 'existing_file')

'existing_file'

>>> SE_Object ('some_file_name', 'existing_file')

'~QK11342'

 

The second time around the named output file exists. The file handling flag doesn’t allow it to be replaced, so an artificial name is used and returned.

 

>>> SE_Object.set (file_handling_flag = 2)

>>> SE_Object ('some_file_name', 'existing_file')

'existing_file'

 

Done!

      Flag 1: The output is appended to the named output file.

      Flag 2: Named output may be overwritten. If the name of the output file is the same as the name of the input file, a translation in place results.

      Flag 3: A translation in place results if no output file is named.

      Actually, no data ever gets overwritten, only file names change hands. If altered in any way, the existing file lends its name to the new file while taking on a backup extension (next chapter).

      The file handling flag can also be set from the start: the constructor has a keyword argument file_handling_flag = n.

 

>>> New_Names = SE.SE ('MODULE_X=MODEX "Module X=Modex" Module_X=Modex MODULE_0=M-0', 2)

>>> import os
>>> def get_pys_and_htms (collector_list, directory, file_names):

        for name in file_names:

           if name.endswith ('.py') or name.endswith ('.htm'):

              collector_list.append (os.path.join (directory, name))

>>> file_names = []

>>> os.path.walk ('src/modex', get_pys_and_htms, file_names)

>>> for name in file_names:

        New_Names (name)

'src/modex/module_x.py'

'src/modex/module_0.py'

'src/modex/doc/module_x.htm'

'src/modex/doc/module_x_examples.htm'

 

All originals are preserved with the backup extension. (Next chapter).

 

 

4.1.4. Backup File Extension

 

BACKUP_EXTENSION  > .bak    (Default: '.bak')

 

If an output file would overwrite an existing file, the new file takes on the name of the source file and the source file gets the backup extension appended to its name. No data ever gets overwritten. The default .bak is a fairly common extension for backup files. One might want to change it to avoid confusion with other applications' backups.

      Cleaning up obsolete backups is left entirely to the user. Neglect it and backups accumulate with ever longer chains of extensions (file_name.bak.bak.bak...): the oldest backups carry the longest names. The command rm file_name*bak.bak.bak would delete all backups save the two most recent ones.
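The renaming chain can be sketched in plain Python (rotate_backups and the file name report are invented for illustration):

```python
import os, tempfile

def rotate_backups(path, ext='.bak'):
    # Sketch of SE's backup rule: before 'path' can be rewritten, an
    # existing file of that name moves to path + ext; a pre-existing
    # backup first moves one step further down the chain, and so on.
    if os.path.exists(path + ext):
        rotate_backups(path + ext, ext)
    if os.path.exists(path):
        os.rename(path, path + ext)

workdir = tempfile.mkdtemp()
report = os.path.join(workdir, 'report')
for version in ('first', 'second', 'third'):
    rotate_backups(report)
    with open(report, 'w') as f:
        f.write(version)
print(sorted(os.listdir(workdir)))   # ['report', 'report.bak', 'report.bak.bak']
```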

 

 

4.1.5. Intermediate Data

 

KEEP_CASCADE  > 0     (Default: delete)

 

Setting 1 (not 0) will preserve an entire translation cascade for debugging purposes. If the flag is set and a run has been done, show () will display the intermediate translations.

 

>>> Heroic_Tasks.set (keep_cascade = 1)

>>> Heroic_Tasks (kingdom)

'princess, dead dragon'

>>> Heroic_Tasks.show ()

(... runtime settings ...)

 

Translation Cascade

----------------------------------------------------------------------------------

  woods, rivers, castle, king, princess, mountains, terrifying dragon, people, towns, lakes

0 --------------------------------------------------------------------------------

  +MARK+princess+MARK++MARK+dead dragon+MARK+

1 --------------------------------------------------------------------------------

  princess, dead dragon

 

The translation cascade shows all translation stages from the original through the intermediate stages to the final output. The numbers are the indexes of the Translator list. By default each line is clipped at 128 characters.

      If input is a string, intermediate data are also strings. If input is a file name or a file object, intermediate data are files. The display starts with the respective file name.

 

Another, similar debugging technique consists in breaking the cascade prematurely. When calling a run, the keyword argument cascade_break = n limits the run to n passes.

 

>>> Heroic_Tasks (kingdom, cascade_break = 1)

'+MARK+princess+MARK++MARK+dead dragon+MARK+'
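Both debugging aids are easy to model in plain Python on the substitution cascade idea from 3.6 (run is an invented name):

```python
def run(text, passes, cascade_break=None, keep_cascade=False):
    # keep_cascade records every intermediate translation stage;
    # cascade_break stops the cascade after the given pass count.
    stages = [text]
    for substitutions in passes[:cascade_break]:
        for target, substitute in substitutions:
            text = text.replace(target, substitute)
        stages.append(text)
    return stages if keep_cascade else text

passes = [[('princess', '\x01princess\x01')], [('\x01', '|')]]
print(run('king, princess', passes, keep_cascade=True))
print(run('king, princess', passes, cascade_break=1))
```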

 

 

4.1.6. Clipping the Translation Cascade Display

 

CASCADE_DATA_CLIP  > 128    (Default: 128)

 

To maintain legibility with substantial output volumes the cascade lines get clipped at this length.

 

 

4.1.7. Saving an Editor's Definitions

 

For his CD project "The Royal Upstart" the prince takes the opportunity to save his Heroic_Tasks:

 

>>> Heroic_Tasks.save ('upward_mobility/whacking_a_dragon.se')

$ cat upward_mobility/whacking_a_dragon.se

# SE Definitions Tue May 23 11:43:25 2006

# upward_mobility/whacking_a_dragon.se

<EAT>

# Multi-Byte Targets

 princess=+MARK+princess+MARK+

 "terrifying dragon=+MARK+dead dragon+MARK+"

   |

# SE Definitions Tue May 23 11:43:25 2006

# upward_mobility/whacking_a_dragon.se

# Multi-Byte Targets

 +MARK+=

 "+MARK++MARK+=, "

 

(Not shown are comment lines thrown in to help organize later edits.)

 

Now a candidate prince facing a dragon can make his Kingdom_Fixer from the CD like this:

 

>>> Kingdom_Fixer = SE.SE ('upward_mobility/whacking_a_dragon.se')

 

Translator objects (next chapter) can be saved individually. They have their own save () method which works identically.

 

 


4.2. The Translator Object

 

 

>>> Heroic_Tasks.Translators

[<SE.SE instance at 0x02821AA8>, <SE.SE instance at 0x00F78418>]

 

The SE object contains a list of Translator objects. Each Translator processes the data stream and passes its output on to the next Translator in the list. Let’s make a one-pass Editor to demonstrate its methods.

 

>>> Demo = SE.SE ('''

"ABC=(=): multi byte "

"~[A-Z]+~=(=): soft re "

"~A[BCD]+~=(=): hard re "

"X=(=): single byte "

FILE=<file_name>''' )

 

>>> Demo.Translators [0].show ()

SE.Translator <SE.Translator instance at 0x00E15490>

NULL_MODE > 0 (PASS: unmatched data passes)

Single-Byte Targets

    1: |X|->|(=): single byte |

Multi-Byte Targets

    2: |ABC|->|(=): multi byte |

    3: |FILE|->|<file_name>|

Hard Regex Targets

    4: |A[BCD]+|->|(=): hard re |

Soft Regex Targets

    5: |[A-Z]+|->|(=): soft re |

 

The display shows the definitions alphabetically ordered in four categories. The categories are ordered by processing speed, fast to slow. Soft regexes rank slowest: they cannot be put into an array indexed on the target's initial, because they can match more than one initial. Some hard-to-interpret hard expressions may also go to the soft list.

      Calling the editor's method show () with the argument 1—show (show_translators=1)—will display its Translator(s) too. This is particularly useful in conjunction with the cascade display, juxtaposing cause and effect.
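The ordering logic can be approximated in plain Python. This is only a sketch (META and classify are invented names, and SE’s real compiler is more discriminating about which regexes count as hard):

```python
META = set('.^$*+?[]()|\\{}')   # regex metacharacters

def classify(target):
    # Plain strings dispatch on length; a regex is 'hard' when its
    # first character is a literal (indexable on the initial) and
    # 'soft' when it can begin with many different characters.
    if not META & set(target):
        return 'single-byte' if len(target) == 1 else 'multi-byte'
    return 'hard re' if target[0] not in META else 'soft re'

for target in ('X', 'ABC', 'A[BCD]+', '[A-Z]+'):
    print(target, '->', classify(target))
```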

 

 

 

4.3. Editing a Translator Interactively

 

 

>>> T = Demo.Translators [0]

 

 

4.3.1. Adding Definitions

 

>>> T.add ('''

"Y=(=): a new single byte "

"X=(=): revised single byte "

"XYZ=(=): a new multi byte "

go_away=

(=parenthesis  )=parenthesis''')


>>> T.show ()

SE.Translator <SE.Translator instance at 0x00E15490>

NULL_MODE > 0 (PASS: unmatched data passes)

Single-Byte Targets

    1: |(|->|parenthesis|

    2: |)|->|parenthesis|

    3: |X|->|(=): revised single byte |

    4: |Y|->|(=): a new single byte |

Multi-Byte Targets

    5: |ABC|->|(=): multi byte |

    6: |FILE|->|<file_name>|

    7: |XYZ|->|(=): a new multi byte |

    8: |go_away|->||

Hard Regex Targets

    9: |A[BCD]+|->|(=): hard re |

Soft Regex Targets

   10: |[A-Z]+|->|(=): soft re |

 

 

4.3.2. Deleting Definitions

 

>>> T.drop (3,7)

 

Deletions go by the numbers which show () displays. They are sequential numbers that get reassigned after each deletion or addition. So show () needs to be called prior to each deletion. If drop () is called without argument, the show () display comes up automatically.

 

>>> T.show ()

SE.Translator <SE.Translator instance at 0x00E15490>

NULL_MODE > 0 (PASS: unmatched data passes)

Single-Byte Targets

    1: |(|->|parenthesis|

    2: |)|->|parenthesis|

    3: |Y|->|(=): a new single byte |

Multi-Byte Targets

    4: |ABC|->|(=): multi byte |

    5: |FILE|->|<file_name>|

    6: |go_away|->||

Hard Regex Targets

    7: |A[BCD]+|->|(=): hard re |

Soft Regex Targets

    8: |[A-Z]+|->|(=): soft re |

 

 

4.3.3. Reversing a Translator

 

>>> T_Reversed = T.reverse (report_irreversibles = 1)

Skipping irreversible definitions:

   5: |FILE|->|<file_name>|                                                                 

   6: |go_away|->||

   7: |A[BCD]+|->|(=): hard re |

   8: |[A-Z]+|->|(=): soft re |

   1: |\(|->|parenthesis|

   2: |)|->|parenthesis|

 

Irreversible are: substitute files, regular expressions, deletions and multiple targets with identical substitutes. The null-edit mode stays the same.

 

>>> T_Reversed.show ()

SE.Translator <SE.Translator instance at 0x0285E030>

NULL_MODE > 0 (PASS: unmatched data passes)

Multi-Byte Targets

   1: |(ABC): multi byte |->|ABC|

   2: |(Y): a new single byte |->|Y|

 

With long lists of reversible definitions, the reverse method is a great help:

 

>>> Generic_To_Isin_T = SE.SE ('se/isin_to_generic.se').Translators [0].reverse ()

>>> Generic_To_Isin_T.save ('se/generic_to_isin.se')

 

Or as a one-liner:

 

>>> SE.SE ('se/isin_to_generic.se').Translators [0].reverse ().save ('se/generic_to_isin.se')

 

SE objects don't have the method reverse (). With cascades running backwards and the likelihood of irreversible definitions, a properly functioning reversal is too shaky a prospect. Reversing each translator separately and reassembling them in reverse order is a fairly straightforward task that ensures that any problems don't go unnoticed. A Translator list can be attached to an SE object hacker’s style:

 

>>> Reversed_SE_Object = SE.SE ('')

>>> Reversed_SE_Object.Translators = [Reversed_T2, Reversed_T1]

>>> Reversed_SE_Object.save ('se/reversed_whatever.se')
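The per-translator reversal can be sketched in plain Python (reverse_table is an invented name; this sketch detects only deletions and duplicate substitutes, not regexes or substitute files):

```python
from collections import Counter

def reverse_table(table):
    # Swap target and substitute. Deletions (empty substitute) and
    # targets sharing a substitute are irreversible and get skipped.
    counts = Counter(table.values())
    reversed_table, skipped = {}, []
    for target, substitute in table.items():
        if substitute == '' or counts[substitute] > 1:
            skipped.append(target)
        else:
            reversed_table[substitute] = target
    return reversed_table, skipped

forward = {'ABC': '(ABC): multi byte ',
           'Y': '(Y): a new single byte ',
           'go_away': ''}
backward, skipped = reverse_table(forward)
print(backward)   # {'(ABC): multi byte ': 'ABC', '(Y): a new single byte ': 'Y'}
print(skipped)    # ['go_away']
```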

 

 

 

 

5. Input, Output

 

 

>>> translation = SE_Object (input, [output])

 

Both input and output may be a string, an open file object or a file name.

      Without the second argument the output will match the type of the input. If input is a file name, an output file name is auto-generated by appending the extension .~se to the input file name.

      A returned file name is for a file that has been written to the disk.

      Providing an output parameter overrides type-matching, so that each input type can be combined with any output type.
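The type matching can be sketched in plain Python (translate is an invented name; a real SE run performs the substitutions, str.upper merely stands in for one):

```python
import os, tempfile

def translate(inp, edit):
    # A string comes back as a string; an existing file name comes
    # back as the name of a new file with '.~se' appended, written
    # to disk, as in SE.
    if isinstance(inp, str) and os.path.isfile(inp):
        out_name = inp + '.~se'
        with open(inp) as f:
            data = edit(f.read())
        with open(out_name, 'w') as f:
            f.write(data)
        return out_name
    return edit(inp)

workdir = tempfile.mkdtemp()
name = os.path.join(workdir, 'letter.txt')   # invented file name
with open(name, 'w') as f:
    f.write('dear sir')
print(translate('dear sir', str.upper))   # DEAR SIR
print(translate(name, str.upper))         # .../letter.txt.~se
```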

 

 


5.1. Input Types and Output Options

 

 

5.1.1. Input: String

 

Strings are ideal for speedy interactive work and system development.

 

>>> SE_Object ('string')                        -> a string

>>> SE_Object (' name_of_existing_file ')       -> same as string. Leading or ending space means: edit name, not file

>>> SE_Object ('string', 'file_name')           -> the file_name

>>> SE_Object ('string', (open file object))    -> the open file object

 

 

5.1.2. Input: File Name

 

File names are appropriate for routine work and volume data. The SE object has an existing-file-change flag that can be set to handle any contingency.

 

>>> SE_Object ('file_name')                     -> 'file_name.~se'  if file-change-permission is 0

>>> SE_Object ('file_name')                     -> 'file_name' if file-change permission flag > 0

>>> SE_Object ('file_name', 'file_name')        -> 'file_name' if file-change-permission > 0

>>> SE_Object ('file_name', 'file_name_2')      -> 'file_name_2' if file_name_2 does not exist or flag > 0

>>> SE_Object ('file_name', 'file_name_2')      -> '~???????.~se' (automatic name) if file_name_2 exists and flag is 0

>>> SE_Object ('file_name', '')                 -> a string

>>> SE_Object ('file_name', (open file object)) -> the open file object

 

 

5.1.3. Input: File Object

 

A programmer might at some point see the need to pass a file object through a translator. He would insert an SE translation at the appropriate location in his code, hand it his object, get one returned and be on his way without having to go through unnecessary disk reads and writes.

      Working with file objects the programmer retains full control. File cursors are taken as they come in and left as they are. Nested calls will not work with file objects, because the file cursor needs to be reset explicitly between calls (f.seek (0)).

 

>>> SE_Object ((open file object))                       -> a new open file object

>>> SE_Object ((open file object), (open file object 2)) -> returns the file object 2

>>> SE_Object ((open file object), 'file_name')          -> 'file_name'

>>> SE_Object ((open file object), '')                   -> a string

 

 

 

 

6. The Message Log

 

 

All objects have a log that accumulates compiler and processor reports, chiefly about compiling and IO snags. The SE method show_log () displays whatever has accumulated since the object’s creation. Calling it with argument 1 shows the Translators’ logs as well.

 

>>> SE_Object = SE.SE (' /se/htm2iso.se ==equal')

>>> SE_Object.show_log ()

Mon Jun 26 13:27:16 2006 - Compiler - Ignoring single word '/se/htm2iso.se'. Not an existing file '/se/htm2iso.se'.

Mon Jun 26 13:27:16 2006 - Compiler - Ignoring single word '==equal'. Not an existing file '==equal'.

>>> SE_Object (99)

>>> SE_Object.show_log ()

Mon Jun 26 13:34:47 2006 - Compiler - Ignoring single word '/se/htm2iso.se'. Not an existing file '/se/htm2iso.se'.

Mon Jun 26 13:34:47 2006 - Compiler - Ignoring single word '==equal'. Not an existing file '==equal'.

Mon Jun 26 13:34:53 2006 - Editor - Illegal argument type: 99

 

 

 

 

 

7. SE Light

 

 

SE's extensive debugging facilities are dead weight for proven production routines and trivial little tasks.

 

>>> import SEL

>>> SEL.SEL ('genius=expert ingenious=dedicated admirable=agreeable', file_handling_flag = 3)('job_application')

'job_application'

 

SEL offers the processing functionality of SE without any of its development functions (show, add, drop, save, reverse). The methods set () and show_log () are available, though.

 

 

 

7.1. Type Casting Back and Forth

 

 

If the SE.SE constructor receives an SE Light object instead of a substitutions list, it becomes an interactive copy of the SEL object. To go the other way, the SE object has a method SEL () which returns a light copy of itself.

 

>>> SE_Light      = SE_With_Bells.SEL ()

>>> SE_With_Bells = SE.SE (SE_Light)

 

 

 

 

8. Some Examples

 

 

 

8.1. Programming

 

 

8.1.1. Revising Names

 

As the number of attributes, constants and variables grows in the course of a coding project, keeping track of them becomes increasingly difficult. When mix-ups start to happen, a sweeping revision of names aiming at more suggestive accuracy may help.

 

>>> function_names = '''

    "ie (=interest_earned ("

    "i_e (=interest_earned ("

    "intr (=interest_earned ("

    "icost (=interest_cost ("

    "i_cost (=interest_cost ("

    "int_3 (=interest_accrued ("

    "int_proj (=interest_projected ("

    "est_intr (=interest_estimated ("

    "intr_ (=interest_average ("

    "intr_pa_ (=interest_estimated_per_annum ("

    "intr_pa (=interest_per_annum (" '''

>>> Revise_Function_Names = SEL.SEL (function_names)

>>> Revise_Function_Names.set (file_handling_flag = 3, input_path = 'project_7/preliminary/src')

>>> for name in file_names: Revise_Function_Names (name)

'project_7/preliminary/src/_setup.py'

'project_7/preliminary/src/_constants.py'

'project_7/preliminary/src/_setup.py'

'project_7/preliminary/src/_utilities.py'

'project_7/preliminary/src/P7.PY'

 

In-place translations don’t hold up workflow unduly. If something goes awry, the old files are still there with the backup extension added to their names. SE never deletes any source data; it just adds another backup extension to all prior backups.

 

 

8.1.2. Deleting Tracing Statements


By the time a program is finished, it may have accumulated a lot of print statements that got commented in and out as needed to trace execution, something like this:

 

   # STOCKS_.PY

   (...)

   keep = '\n'.join (['"~>%s<.*?[0-9][0-9]%%~==(10)"' % s for s in symbols])           

   ## print 'keep', keep

   eat_tags = ' "~<.*?>~= " >= '

   ## print 'eat_tags', eat_tags

   Data_Extractor = SE.SE (EAT + keep + RUN + eat_tags)

   ## Data_Extractor.set (keep_cascade = 1, cascade_data_clip = 500)

   url = 'http://finance.yahoo.com/q/cq?d=v1&s=' + '+'.join (symbols)

   (... etc ...)

 

These statements need to be removed in the end. In anticipation of this necessity, a distinctive comment pattern (‘## ’) was used.

 

>>> Finisher = SEL.SEL ('"~\n[ \t]*## .*~="')

>>> Finisher.set (input_path = 'finance/src', file_handling_flag = 2)

>>> for file_name in ('stocks_.py', 'statement_.py', 'balance_.py'):  # Underscore identifies a development file

       Finisher (file_name, file_name.replace ('_', ''))  # No underscore: release

'finance/src/stocks.py'

'finance/src/statement.py'

'finance/src/balance.py'
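The same regex works verbatim with Python’s re module; here it is applied to a tiny invented source snippet:

```python
import re

# Two code lines, each followed by an indented '## ' tracing line.
src = ("x = 1\n"
       "   ## print 'x', x\n"
       "y = 2\n"
       "\t## print 'y', y\n")
# Remove the newline, the indentation and the whole comment in one go.
cleaned = re.sub(r'\n[ \t]*## .*', '', src)
print(cleaned)
```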

 

 

8.1.3. Listing Names of Classes, Functions, Methods, Imports, Globals ...

 

>>> List = SEL.SEL ('''<EAT>

      "~[ \t]*class (.|\n)+?:~==(10)"

      "~[ \t]*def (.|\n)+?:~==(10)"

      "~[ \t]*import .+~==(10)"

      "~[ \t]*from .+~==(10)"

      "~[ \t]*global .+~==(10)"''')

>>> print List ('src/Python/SE.PY')

src/Python/SE.PY.~se

 

The following is the complete structure of SE.PY as captured in src/Python/SE.PY.~se with a few quick hand edits.

 

import SEL

import sys

 

def set (reset = NO, definition_path = None):

def show ():

def about ():

def version ():

 

class SE (SEL.SEL):

   def __init__ (self, substitutions, file_handling_flag = 0):

   def SEL (self):

   def show (self, show_translators = NO):

   def set (self, reset = NO,

            input_path         = None,

            output_path        = None,

            file_handling_flag = None,

            max_target_length  = None,

            cascade_data_clip  = None,

            backup_extension   = None,

            keep_cascade       = None,

            verbose            = None ):

   def __call__ (self, input, output = None, cascade_break = None):

   def _do_file (self, in_file, out_file_path, cascade_break = None):

   def do_string (self, s, output = None, cascade_break = None):

   def save (self, file_name):

 

class Translator (SEL.Translator):

   def __init__ (self, TL = None):

   def reverse (self, report_irreversibles = NO):

   def show (self):

      def show_single_byte (show_count):

      def show_multi_byte (show_count):

      def show_hard_re (show_count):

      def show_soft_re (show_count):

   def add (self, definitions):

   def drop (self, *target_numbers):

   def save (self, file_name):

      def save_single_byte ():

      def save_multi_byte ():

      def save_hard_re ():

      def save_soft_re ():

def _char_to_ascii (c):

def _translator_to_definition (x, initials, specials):

def _target_translator_to_definition (t):

def _substitute_translator_to_definition (s):

def _target_translator_to_display (t):

def _substitute_translator_to_display (s):

def _visualize_re (s):

 

 

 

8.2. HTM

 

 

8.2.1. A Link Extractor

 

>>> Link_Extractor = SEL.SEL ('<EAT> ~[hH][tT][tT][pP][A-Za-z0-9/_\-.,:]+~==(10)')

>>> f = urllib.urlopen ('http://www.mozilla.com/firefox/central/')

>>> print Link_Extractor (f.read ())

http://www.mozilla.com/firefox/search.html

http://www.mozilla.com/firefox/search.html

http://www.mozilla.com/firefox/search.html

https://addons.mozilla.org/

https://addons.mozilla.org/

(... many more ...)
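The pattern between the tildes is an ordinary regular expression; re.findall reproduces the extraction on a snippet of invented HTML:

```python
import re

html = ('<a href="http://www.mozilla.com/firefox/search.html">Search</a>\n'
        '<a href="https://addons.mozilla.org/">Add-ons</a>')
# Same character class as the SE definition above; the closing quote
# of the href attribute ends each match.
links = re.findall(r'[hH][tT][tT][pP][A-Za-z0-9/_\-.,:]+', html)
print(links)
```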

 

 

8.2.2. A Tag Stripper

 

>>> Tag_Stripper = SEL.SEL ('"~<(.|\n)*?>~= " se/htm2iso.se | "~\n[ \t\n]*~=(10)" "~ +~= "')

 

htm2iso.se is a definition file decoding the HTM ampersand escapes. The second pass strips all empty lines and deflates white space.

      Anything that has a distinct beginning and a distinct ending can be stripped or extracted this way, even if it spans many lines.
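Each of the three regexes works unchanged with re.sub; here the two SE passes are played back as three plain substitutions on a tiny invented snippet (the htm2iso.se decoding step is left out):

```python
import re

html = '<p>Hello,\n<b>world</b>!</p>\n\n\n'
text = re.sub(r'<(.|\n)*?>', ' ', html)   # pass 1: tags to spaces
text = re.sub(r'\n[ \t\n]*', '\n', text)  # pass 2: strip empty lines
text = re.sub(r' +', ' ', text)           #         deflate white space
print(repr(text))
```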

 

 

 

8.3. Expanding

 

 

8.3.1. A Two-Pass Expanding Editor

 

>>> paths = 'FROM=photo/pictures TO=projects/sun_dial/pictures'

>>> File_Mover = SEL.SEL ('"~[^ ,]*|.*$~=mv FROM/= TO/=" ,=(10) ,(32)=(10) | ' + paths)

>>> print File_Mover ('general_view.jpg, dial.jpg, time_zones.bmp, time_equation.data')

mv photo/pictures/general_view.jpg projects/sun_dial/pictures/general_view.jpg

mv photo/pictures/dial.jpg projects/sun_dial/pictures/dial.jpg

mv photo/pictures/time_zones.bmp projects/sun_dial/pictures/time_zones.bmp

mv photo/pictures/time_equation.data projects/sun_dial/pictures/time_equation.data

 

                                   

8.3.2. Combining two One-Pass Editors

 

Revisiting 2.2.3: If the assembled merge components contain generic place holders for case-related data, an Editor made for the purpose will finish the document in a second pass. We start with the body of the letter which the target TEXT will pull in. The rest is routine templates. This is the letter:

 

>>> letter = 'PL_OPEN\n\n\nTEXT\n\nBYE'

 

PL_OPEN is the routine opening template (letterhead, address and date). BYE is the sign-off. The names of the templates associate with mnemonic targets in the file corr/se/merge_components.se.

 

$ cat corr/se/merge_components.se

(...)

PL_OPEN=<corr/se/private_open.se>

BYE=<corr/se/private_signoff.se>

(... many more ...)

 

A first Editor assembles the merge components:

 

>>> Merge_Expander = SE.SE ('corr/se/merge_components.se TEXT=<corr/letters/invitation-feb-3.txt>')

 

Next we define the place holders which we know are contained in the templates, for instance like this:

 

>>> place_holders = '''

"@DATE@=January 18, 2006"

"@NAME@=Skip Spank"

 @SUBJECT@=

"@STREET@=Main Street 1000 W"

"@CITY@=Jiggs"

 @STATE@=NY

 @ZIP@=12345-6789'''

 

With these we create another Editor:

 

>>> Specs_Expander = SEL.SEL (place_holders)

 

Finally we call a two-pass run using two nested calls:

 

>>> Specs_Expander (Merge_Expander (letter), 'corr/private/invitation-feb-3')

'corr/private/invitation-feb-3'

 

When we need to define the place holders and don’t exactly know which ones the merge brings together we can extract them like this:

 

>>> print SEL.SEL ('<EAT> ~@[A-Z0-9]+@~==(10)')(Merge_Expander (letter))

@DATE@

@NAME@

@STREET@

@CITY@

@ZIP@

@SUBJECT@

@NAME@

 

 

 

8.4. Siphoning (Reasonably) Current Stock Quotes from an Internet Site

                                   

 

>>> def get_stock_quotes (symbols):

      import urllib

      url = 'http://finance.yahoo.com/q/cq?d=v1&s=' + '+'.join (symbols)

      htm_page = urllib.urlopen (url)

      import SEL

      keep = '"~[A-Z]+ [JFMAJSOND].+?%~==(10)"  "~[A-Z]+ [0-9][0-2]?:[0-5][0-9][AP]M.+?%~==(10)"'          

      Data_Extractor = SEL.SEL ('<EAT> ' + keep)

      Tag_Stripper = SEL.SEL ('"~<(.|\n)*?>~= " se/htm2iso.se | "~\n[ \t\n]*~=(10)" "~ +~= "')

      data = Data_Extractor (Tag_Stripper (htm_page.read ()))

      htm_page.close ()

      return data

>>> print get_stock_quotes (('GE','IBM','AAPL', 'MSFT', 'AA', 'MER'))

GE 3:17PM ET 33.15 0.30 0.90%

IBM 3:17PM ET 76.20 0.47 0.61%

AAPL 3:22PM ET 55.66 0.66 1.20%

MSFT 3:22PM ET 23.13 0.37 1.57%

AA 3:17PM ET 31.80 1.61 4.82%

MER 3:17PM ET 70.24 0.82 1.15%

 

Here’s how to hack it: 1. Get an htm source for some symbol.  2. Run it through the Tag Stripper and see if you can define a regular expression that matches all of your data. keep does that in this case. If it is not possible, try again with the unstripped data which puts more locating features at your disposal. Depending on which, you strip the tags first or last. In either case, you have a two-pass run. This example does two passes with two nested single-pass Editors.

      keep has two definitions. The first one catches listings after trading hours. Those have the name of the week day follow the symbol. The regex targets their initials. When the function was run some other time and returned nothing, the reason turned out to be that during trading hours they show a different format: the symbol is followed by the time of the day. So another regex was added to catch that format. Both formats could certainly be put into one regex, with a certain speed benefit even. The point to retain is that a single regex that has to do ‘everything’ gets conceptually intractable at some point towards universality. SE encourages building up functionality incrementally by assembling one by one regexes that are simple to conceive and to test. 

      The last definition of the Tag_Stripper (‘... “~ +~= “’) would make a CSV file if the substitute were a suitable CSV separator (comma, tab, ...) instead of a space: (‘... “~ +~=,“’) or (‘... “~ +~=(9)“’).

 

>>> Tag_Stripper = SEL.SEL ('"~<(.|\n)*?>~= " se/htm2iso.se | "~\n[ \t\n]*~=(10)" "~ +~=,"') # comma

 

GE,3:20PM ET,33.6115,0.0885,0.26%

IBM,3:20PM ET,78.41,0.42,0.54%

AAPL,3:25PM ET,58.50,1.03,1.79%

 

 

 

8.5. Two-Step Conversions through a Generic Format

 

 

Working with several sets of symbols that all identify the same set of items, one would like to be able to translate each set into and out of every other one. The number of translators required grows quadratically with the number of sets: n sets take n·(n−1) one-way translators. In other words, past three sets the number of one-to-one correspondences quickly grows to overwhelming magnitude. If, on the other hand, one translates each set into and out of the same generic set, that generic set can connect every set with every other one, and the number of translators now grows only in proportion to the number of sets: 2·n. The following example elaborates the technique on stock symbols.
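
The arithmetic can be checked in a few lines of Python. This is a sketch under the assumption stated above, namely one .se file per direction:

```python
def direct(n):
    """One-way translators needed to connect every set with every other."""
    return n * (n - 1)

def via_generic(n):
    """One-way translators needed when every set goes through one generic set."""
    return 2 * n

for n in (3, 5, 10):
    print(n, direct(n), via_generic(n))
# -> 3 6 6
#    5 20 10
#    10 90 20
```

At three sets the two schemes cost the same; past three the generic set wins, and increasingly so.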

 

>>> symbol_to_isin = 'se/symbol_to_generic.se (32)=(10) | se/generic_to_isin.se'

>>> Symbol_To_Isin = SE.SE (symbol_to_isin)

>>> print Symbol_To_Isin ('ABX ADS MMM')

CA0679011084

DE0005003404

US88579Y1010

 

>>> symbol_to_name = 'se/symbol_to_generic.se | se/generic_to_name.se (32)=(10)'

>>> Symbol_To_Name = SE.SE (symbol_to_name)

>>> print Symbol_To_Name ('ABX ADS MMM')

Barrick Gold

Adidas

3M Company

 

>>> symbol_to_cusip = 'se/symbol_to_generic.se | se/generic_to_cusip.se (32)=(10)'

>>> Symbol_To_Cusip = SE.SE (symbol_to_cusip)

>>> print Symbol_To_Cusip ('ABX ADS MMM')

ABX

ADIDAS

88579Y101

 

A second benefit of having a generic format has to do with the fact that, in reality, standard sets never coincide exactly, either because some items simply are undefined in some sets or because some of our lists are incomplete (the rule). Ids also change on us occasionally. If we design our own generic ids, we can make them a superset of all real sets and we can be sure they stay put. If some standard set adds or changes an id, we need to update just two files: into and out of the generic set. A system of direct translations would gradually decay because its maintenance is prohibitive.

      A case in point: the last example, symbol to cusip, shows a need for system maintenance and it also shows what needs to be done. Barrick Gold (ABX) is missing in symbol_to_generic.se, as its unchanged symbol in the output shows. Adidas (ADS) is missing in generic_to_cusip.se: it got as far as generic but no further. We make a note to add Barrick Gold to symbol_to_generic and Adidas to generic_to_cusip. That's all we need to do to keep the entire system coherent.
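
The fall-through behavior can be mimicked with ordinary dictionaries. This is a sketch with hypothetical entries; in SE the tables live in .se files, and an id missing from a table simply passes through the stream unchanged, which is exactly what exposes the gaps:

```python
# Deliberately incomplete tables, mirroring the example above:
symbol_to_generic = {'ADS': 'ADIDAS', 'MMM': '3M'}   # ABX missing
generic_to_cusip  = {'3M': '88579Y101'}              # ADIDAS missing

def to_cusip(symbol):
    # An id absent from a table falls through unchanged, like an
    # unmatched SE target:
    generic = symbol_to_generic.get(symbol, symbol)
    return generic_to_cusip.get(generic, generic)

print([to_cusip(s) for s in ('ABX', 'ADS', 'MMM')])
# -> ['ABX', 'ADIDAS', '88579Y101']
```

The output reproduces the diagnosis: ABX stalled at the first step, ADIDAS at the second, and only MMM made it all the way to a cusip.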

      If a conversion is called for which we have never done before, writing the new translator is a matter of thirty seconds:

 

$ echo "se/cusip_to_generic.se | se/generic_to_name.se" > se/cusip_to_name.se

 

Revisiting 3.3. Merging Substitution Sets:

 

>>> Ids_To_Symbol = SE.SE ('finance/se/cusip2symbol.se  finance/se/isin2symbol.se  finance/se/sec2symbol.se  finance/se/valor2symbols.se')

 

With a two-step system the preceding line would cascade five runs in succession: four conversions to generic followed by generic to symbol. We get by with two runs if we merge all the to-generic sets into a single first pass and follow with generic to symbols, like this:

 

>>> Ids_To_Symbol = SE.SE ('cusip2generic.se isin2generic.se sec2generic.se valor2generics.se | generic2symbol.se')

 

 

 

8.6. A Day in the Life of a Stenographer

 

 

 

$ cat board/drafts/minutes-Apr-1-05

... WW's motn to complement t MP w t discretionry suspensn of ...

>>> SE.SE ('se/steno_expander.se')('board/drafts/minutes-Apr-1-05', '')

"... Chairman Walter Whiteknuckle's motion to complement the Motivation Plan with the discretionary suspension of overweight employees' elevator privilege is applauded, but Mr. Patrick Ficklepenney (Finance) points out that the motivational effect, while predictably excellent at the Manhattan office, might prove negligible at most field offices, the majority of which don't have a second floor and of these none has an elevator. Dr. Yves-Jerome P. Harcourt-DeVries Jr. (Legal) calls attention to the practical difficulty of acquiring personal weight data by legal means. Mr. Patrick Ficklepenney (Finance): What's wrong with illegal means? Mr. Hanshorst Schloumeier (Engineering): I'd say it isn't any easier to get. Anyway I suspect zat ze real difficulty is ze Legal Department's location in Manhattan ..."

 

>>> cut_embarrassing_stuff = '"Mr. Patrick Ficklepenney (Finance): What\'s wrong with illegal means? ="  "I\'d say it isn\'t any easier to get. Anyway ="'

>>> cut_the_bs = ' " zat = that " " ze = the " '

>>> SE.SE ('se/steno_expander.se | ' + cut_embarrassing_stuff + cut_the_bs)('board/drafts/minutes-Apr-1-05', 'board/minutes-Apr-1-05')

   'board/minutes-Apr-1-05'

 

This example concludes this document, hopefully in an entertaining manner. While the steno_expander is plausible to some degree, the second pass would be quite an absurd alternative to cleaning up the text while reading through it, a task a proofreader has to do anyway.

      Without much practical merit, the example may nevertheless inspire the insight that not all data processing tasks require rigorous performance. Editing a text is precisely such an example of fault tolerance: the objective here is to save the considerable time spent on repetitive interaction, not the little time it takes a proofreader to smooth out a few rough spots.

 

 

 

 

9. Closing Remarks

 

 

The unsurpassable functional simplicity of a stream editor is probably the reason why stream editing isn’t considered a data processing technique in its own right. As a rule, data is more than just a stream: it usually has a structure whose meaning a stream editor utterly disregards. On the other hand, the conceptual simplicity of stream editing promises a corresponding simplicity of use.

      SE is an experiment that attempts to make this simplicity of use available. The benefit comes at the price of placing the responsibility for handling context appropriately with the user. Fortunately, the simplicity of the paradigm carries over to conceptualizing its use. Fortunately again, SE addresses programmers, who handle context as part of their job. In addition, SE offers all the help it can in terms of system transparency. (4, 5)

      A second design goal in favor of the user was modularity. It was achieved by keeping all programming-language-specific constructs out of the interface, sticking with the simplest format possible: single strings (2.2). Making file names stand for file contents, moreover, creates an extremely flexible, recursively expandable building-block system with respect to substitution definitions (3.2), and with respect to data files it allows nestable calls (7.3.1, 3.6.2).

 

 

 

 

 

 

Frederic Rentsch

 

October 13, 2006