Marc XML / Marc OAI parser

Module for parsing and high-level processing of MARC XML records.

About the format and how the class works: a standard MARC record is made of three parts:

  • leader - fixed-length header data, you can probably ignore it
  • controlfields - MARC fields with tag < 10
  • datafields - the important information you actually want

Basic MARC XML scheme uses this structure:

<record xmlns=definition..>
    <leader>optional_binary_something</leader>
    <controlfield tag="001">data</controlfield>
    ...
    <controlfield tag="010">data</controlfield>
    <datafield tag="011" ind1=" " ind2=" ">
        <subfield code="scode">data</subfield>
        <subfield code="a">data</subfield>
        <subfield code="a">another data, but same code!</subfield>
        ...
        <subfield code="scode+">another data</subfield>
    </datafield>
    ...
    <datafield tag="999" ind1=" " ind2=" ">
    ...
    </datafield>
</record>

<leader> is optional and is parsed into MARCXMLRecord.leader as a string.

<controlfield>s are optional and are parsed into the MARCXMLRecord.controlfields dictionary. For the data from the example above, the dictionary would look like this:

MARCXMLRecord.controlfields = {
    "001": "data",
    ...
    "010": "data"
}

<datafield>s are mandatory and are parsed into MARCXMLRecord.datafields, which is a slightly more complicated dictionary. The complication comes mainly from the fact that the tag parameter is not unique, so there can be several <datafield>s with the same tag!

scode (subfield code) is always a single character: an ASCII lowercase letter or a digit.

Example dict:

MARCXMLRecord.datafields = {
    "011": [{
        "ind1": " ",
        "ind2": " ",
        "scode": ["data"],
        "scode+": ["another data"]
    }],

    # real example
    "928": [{
        "ind1": "1",
        "ind2": " ",
        "a": ["Portál"]
    }],

    "910": [
        {
            "ind1": "1",
            "ind2": " ",
            "a": ["ABA001"]
        },
        {
            "ind1": "2",
            "ind2": " ",
            "a": ["BOA001"],
            "b": ["2-1235.975"]
        },
        {
            "ind1": "3",
            "ind2": " ",
            "a": ["OLA001"],
            "b": ["1-218.844"]
        }
    ]
}

As you can see in the 910 example, a single tag can hold multiple records in its list.

Warning

Note that records are always stored in an array, no matter whether there is just one record or several. The same applies to subfields.

The 910 entry in the example above corresponds to this piece of real-world XML:

<datafield tag="910" ind1="1" ind2=" ">
<subfield code="a">ABA001</subfield>
</datafield>
<datafield tag="910" ind1="2" ind2=" ">
<subfield code="a">BOA001</subfield>
<subfield code="b">2-1235.975</subfield>
</datafield>
<datafield tag="910" ind1="3" ind2=" ">
<subfield code="a">OLA001</subfield>
<subfield code="b">1-218.844</subfield>
</datafield>
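
The mapping between this XML and the dictionary can be sketched with the standard library alone. This is only an illustration of the data layout described above, not the parser used by the module; the helper name parse_datafields and the wrapping <record> element without a namespace are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XML = """<record>
<datafield tag="910" ind1="1" ind2=" ">
<subfield code="a">ABA001</subfield>
</datafield>
<datafield tag="910" ind1="2" ind2=" ">
<subfield code="a">BOA001</subfield>
<subfield code="b">2-1235.975</subfield>
</datafield>
<datafield tag="910" ind1="3" ind2=" ">
<subfield code="a">OLA001</subfield>
<subfield code="b">1-218.844</subfield>
</datafield>
</record>"""


def parse_datafields(xml):
    """Hypothetical helper: build the nested datafields dictionary."""
    datafields = {}
    for field in ET.fromstring(xml).iter("datafield"):
        record = {"ind1": field.get("ind1"), "ind2": field.get("ind2")}
        for sub in field.iter("subfield"):
            # Subfield codes may repeat, so values are collected in a list.
            record.setdefault(sub.get("code"), []).append(sub.text)
        # Tags may repeat too, so each tag maps to a list of records.
        datafields.setdefault(field.get("tag"), []).append(record)
    return datafields


datafields = parse_datafields(XML)
print(datafields["910"][1]["b"])
```

Both levels use setdefault(...).append(...), which is why a single record or a single subfield value still ends up inside a list.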

OAI

To keep things from being too simple, there is also another type of MARC XML document: the OAI format.

OAI documents are slightly different, but almost the same in structure.

leader is optional and is stored in MARCXMLRecord.controlfields["LDR"], but also in MARCXMLRecord.leader for backward compatibility.

<controlfield> is renamed to <fixfield> and its “tag” parameter to “id”.

<datafield> is named <varfield>, its “tag” parameter is named “id”, and ind1/ind2 are named i1/i2, but they work the same way.

The “code” parameter of <subfield> is renamed to “label”.

Real world example:

<oai_marc>
<fixfield id="LDR">-----nam-a22------aa4500</fixfield>
<fixfield id="FMT">BK</fixfield>
<fixfield id="001">cpk19990652691</fixfield>
<fixfield id="003">CZ-PrNK</fixfield>
<fixfield id="005">20130513104801.0</fixfield>
<fixfield id="007">tu</fixfield>
<fixfield id="008">990330m19981999xr-af--d------000-1-cze--</fixfield>
<varfield id="015" i1=" " i2=" ">
<subfield label="a">cnb000652691</subfield>
</varfield>
<varfield id="020" i1=" " i2=" ">
<subfield label="a">80-7174-091-8 (sv. 1 : váz.) :</subfield>
<subfield label="c">Kč 182,00</subfield>
</varfield>
...
</oai_marc>
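
The renaming can be illustrated with a short standard-library sketch that reads a shortened version of the document above. The helper name parse_oai is hypothetical, and the sketch keeps the indicators under the attribute names found in the OAI document (i1/i2); how the module names them internally is not shown here.

```python
import xml.etree.ElementTree as ET

OAI_XML = """<oai_marc>
<fixfield id="LDR">-----nam-a22------aa4500</fixfield>
<fixfield id="001">cpk19990652691</fixfield>
<varfield id="015" i1=" " i2=" ">
<subfield label="a">cnb000652691</subfield>
</varfield>
</oai_marc>"""


def parse_oai(xml):
    """Hypothetical helper: read an OAI document into plain dictionaries."""
    root = ET.fromstring(xml)
    # <fixfield id="..."> plays the role of <controlfield tag="...">.
    controlfields = {f.get("id"): f.text for f in root.iter("fixfield")}
    datafields = {}
    for field in root.iter("varfield"):
        # Indicators are read from i1/i2 instead of ind1/ind2.
        record = {"i1": field.get("i1"), "i2": field.get("i2")}
        for sub in field.iter("subfield"):
            # The subfield code lives in "label" instead of "code".
            record.setdefault(sub.get("label"), []).append(sub.text)
        datafields.setdefault(field.get("id"), []).append(record)
    return controlfields, datafields


controlfields, datafields = parse_oai(OAI_XML)
print(controlfields["LDR"])
print(datafields["015"])
```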

Full documentation

A description of the simplified MARCXML schema can be found at http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd

A full description of MARCXML with a definition of each element can be found at http://www.loc.gov/standards/marcxml/mrcbxmlfile.dtd (19492 lines of code)

A description of MARC OAI can be found at http://www.openarchives.org/OAI/oai_marc.xsd

class aleph.marcxml.Corporation[source]

Bases: aleph.marcxml.Corporation

Some information about corporations (fields 110, 610, 710, 810).

Properties:
.name .place .date
class aleph.marcxml.MARCXMLRecord(xml=None)[source]

Class for serialization/deserialization of MARCXML and MARC OAI documents.

This class parses everything between <root> elements. It checks whether a root element is present, so please give it the full XML.

The internal format is described in the module docstring. You can access the internal data directly, or use a few handy methods on two different levels of abstraction:

No abstraction at all

You can choose to access the data directly; for this purpose, there are a few important properties:

leader string

leader of MARC XML document

oai_marc bool

True/False, depending on whether the document is an OAI document or not

controlfields dict

Controlfields stored in dict.

datafields dict of arrays of dict of arrays of strings ^-^

Datafields stored in nested dicts/arrays.

controlfields is a simple and easy to use dictionary whose keys are field identifiers (strings of three characters, all of them digits). The value is always a string.

datafields is a little more complicated; it is a dictionary of arrays of dictionaries, each of which holds arrays of strings and two special parameters.

It sounds horrible, but it is not that hard to understand:

.datafields = {
    "011": [{"ind1": " ", "ind2": " "}],  # array of 0 or more dicts
    "012": [
        {
            "a": ["a) subsection value"],
            "b": ["b) subsection value"],
            "ind1": " ",
            "ind2": " "
        },
        {
            "a": [
                "multiple values in a) subsections are possible!",
                "another value in a) subsection"
            ],
            "c": [
                "subsection identificator is always one character long"
            ],
            "ind1": " ",
            "ind2": " "
        }
    ]
}

Notice the ind1/ind2 keys, which are reserved for indicators and are used in a few cases throughout the MARC standard.

The dict structure is not that hard to understand, but it is rather long-winded to access, so there are also slightly higher-level access methods.
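
To show what "long-winded to access" means in practice, here is a minimal sketch that walks the nested structure by hand. It works on a plain dictionary shaped like .datafields; the helper name subfield_values is hypothetical and is not part of the module.

```python
datafields = {
    "012": [
        {"a": ["first"], "b": ["second"], "ind1": " ", "ind2": " "},
        {"a": ["third", "fourth"], "c": ["fifth"], "ind1": " ", "ind2": " "},
    ]
}


def subfield_values(datafields, tag, code):
    """Hypothetical helper: flatten every value of one subfield code."""
    values = []
    # Outer list: all records sharing the same tag.
    for record in datafields.get(tag, []):
        # Inner list: all values stored under the same subfield code.
        values.extend(record.get(code, []))
    return values


print(subfield_values(datafields, "012", "a"))  # ['first', 'third', 'fourth']
```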

Lowlevel abstraction

To make access a little easier, there are two methods for reading data from the internal dictionaries and two methods for adding data to them:

.addControlField(name, value)
.addDataField(name, i1, i2, subfields_dict)

The names are, imho, self-describing. subfields_dict is expected and enforced to be a dictionary with one-character keys and lists of strings as values.

Getters are also simple to use:

.getControlRecord(controlfield)
.getDataRecords(datafield, subfield, throw_exceptions)

.getControlRecord() is basically just a wrapper over .controlfields and works the same way as accessing .controlfields[controlfield].

.getDataRecords(datafield, subfield, throw_exceptions) returns a list of MarcSubrecord objects* with the information stored in the given subfield of the given datafield.

If the throw_exceptions parameter is set to False, the method returns an empty list instead of raising KeyError.
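
The effect of throw_exceptions can be sketched on a plain dictionary. The function get_values below is a hypothetical stand-in, not the method itself; in particular, raising KeyError as soon as one record lacks the subfield is an assumption of this sketch.

```python
def get_values(datafields, datafield, subfield, throw_exceptions=True):
    """Hypothetical stand-in that mimics the documented error handling."""
    try:
        values = []
        for record in datafields[datafield]:  # KeyError for unknown datafield
            values.extend(record[subfield])   # KeyError for unknown subfield
        return values
    except KeyError:
        if throw_exceptions:
            raise
        return []


sample = {"910": [{"ind1": "1", "ind2": " ", "a": ["ABA001"]}]}
print(get_values(sample, "910", "a"))
print(get_values(sample, "999", "a", throw_exceptions=False))
```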

*As noted above, the function returns a list of MarcSubrecord objects. They are almost the same thing as normal strings (they are actually subclasses of str), but they define a few important methods which can make your life a little easier:

.getI1()
.getI2()
.getOtherSubfiedls()

.getOtherSubfiedls() returns a dictionary with the other subfields from the same record as the subfield requested by calling .getDataRecords().
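
The idea of a string that remembers its context can be sketched as follows. SubrecordSketch is a hypothetical class written for illustration; it follows the documented constructor arguments but is not the module's MarcSubrecord.

```python
class SubrecordSketch(str):
    """Hypothetical str subclass that remembers its MARC context."""

    def __new__(cls, value, ind1, ind2, other_subfields):
        # str is immutable, so the text itself has to be set in __new__.
        obj = super().__new__(cls, value)
        obj.ind1 = ind1
        obj.ind2 = ind2
        obj.other_subfields = other_subfields
        return obj

    def getI1(self):
        return self.ind1

    def getI2(self):
        return self.ind2


name = SubrecordSketch("BOA001", "2", " ", {"a": ["BOA001"], "b": ["2-1235.975"]})
print(name.lower(), name.getI1(), name.other_subfields["b"])
```

Because the object is a real str, it can be compared, sliced and formatted like any other string, while the indicators stay available when they are needed.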

Highlevel abstractions

There are also a lot of high-level getters:

.getName()
.getSubname()
.getPrice()
.getPart()
.getPartName()
.getPublisher()
.getPubDate()
.getPubOrder()
.getFormat()
.getPubPlace()
.getAuthors()
.getCorporations()
.getDistributors()
.getISBNs()
.getBinding()
.getOriginals()
addControlField(name, value)[source]
addDataField(name, i1, i2, subfields_dict)[source]

Add new datafield into self.datafields.

Parameters:
  • name – name of the datafield
  • i1 – value of the i1/ind1 parameter
  • i2 – value of the i2/ind2 parameter
  • subfields_dict – dictionary containing subfields in this format:

{
    "field_id": ["subfield data",],
    ...
    "z": ["X0456b"]
}

field_id can be only one character long!

Function takes care of OAI MARC.
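
The shape required of subfields_dict can be expressed as a small check. The function check_subfields and the choice of ValueError are assumptions made for this sketch; the exception actually raised by addDataField is not stated here.

```python
def check_subfields(subfields_dict):
    """Hypothetical validator for the documented subfields_dict shape."""
    if not isinstance(subfields_dict, dict):
        raise ValueError("subfields_dict must be a dictionary")
    for key, values in subfields_dict.items():
        # Keys are subfield codes, so exactly one character is allowed.
        if not isinstance(key, str) or len(key) != 1:
            raise ValueError("subfield key %r must be one character" % (key,))
        # Values are always lists of strings, even for a single value.
        if not isinstance(values, list) or not all(
            isinstance(value, str) for value in values
        ):
            raise ValueError("values for %r must be a list of strings" % (key,))
    return True


print(check_subfields({"a": ["subfield data"], "z": ["X0456b"]}))
```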

getAuthors()[source]
Returns:list – authors represented as Person objects
getBinding()[source]
Returns:list – array of strings with bindings ([“brož.”]) or blank list
getControlRecord(controlfield)[source]

Return record from given controlfield. Returned type: str.

getCorporations(roles=['dst'])[source]
Parameters:roles (list, optional) – specify which types of corporations you need. Set to [“any”] for any role, [“dst”] for distributors, etc. See http://www.loc.gov/marc/relators/relaterm.html for details.
Returns:list – Corporation objects specified by roles parameter.
getDataRecords(datafield, subfield, throw_exceptions=True)[source]

Return content of given subfield in datafield.

Parameters:
  • datafield (str) – Section name (for example “001”, “100”, “700”)
  • subfield (str) – Subfield name (for example “a”, “1”, etc.)
  • throw_exceptions (bool) – If True, KeyError is raised if the method couldn't find the given datafield/subfield. If False, a blank array [] is returned.

Returns a list of MarcSubrecord objects. MarcSubrecord is practically the same thing as a string, but it also defines the .getI1() and .getI2() getters. Believe me, you will need them, because MARC XML depends on the indicators from time to time (for names of authors, for example).

getDistributors()[source]
Returns:list – distributors represented as Corporation objects
getFormat(undefined='')[source]
Returns:str – dimensions of the book (‘23 cm’ for example)
getI(num)[source]

This method is used mainly internally, but it can be handy if you work with the raw MARC XML object and do not use the getters.

Returns:str – current name of the i1/ind1 parameter, based on self.oai_marc.
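
What such a lookup amounts to can be sketched in a few lines. The function indicator_name is hypothetical and takes the OAI flag as an argument instead of reading self.oai_marc; the ValueError for other numbers is also an assumption.

```python
def indicator_name(num, oai_marc):
    """Hypothetical equivalent: pick the indicator attribute name."""
    if num not in (1, 2):
        raise ValueError("indicator number must be 1 or 2")
    # OAI documents use i1/i2, plain MARC XML uses ind1/ind2.
    return ("i" if oai_marc else "ind") + str(num)


print(indicator_name(1, oai_marc=True), indicator_name(2, oai_marc=False))
```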
getISBNs()[source]
Returns:list – array with ISBN strings
getName()[source]
Returns:str – Name of the book.
Raises:KeyError – when name is not specified.
getOriginals()[source]
Returns:list – of original names
getPart(undefined='')[source]
Returns:str – Which part of the book series is this record.
getPartName(undefined='')[source]
Returns:str – Name of the part of the series.
getPrice(undefined='')[source]
Returns:str – Price of the book (with currency).
getPubDate(undefined='')[source]
Returns:str – date of publication (month and year usually)
getPubOrder(undefined='')[source]
Returns:str – information about the order in which the book was published
getPubPlace(undefined='')[source]
Returns:str – name of city/country where the book was published
getPublisher(undefined='')[source]
Returns:str – name of the publisher (“Grada” for example)
getSubname(undefined='')[source]
Returns:str – Subname of the book.
toXML()[source]

Convert object back to XML string.

Returns:str – string which should be the same as the original parsed input, if everything works as expected
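
The reverse direction, from the dictionaries back to XML, can be sketched with the standard library. The function to_xml is hypothetical, covers only the plain MARC XML layout (no namespace, no OAI naming) and makes no attempt to reproduce the exact formatting that toXML() produces.

```python
import xml.etree.ElementTree as ET


def to_xml(leader, controlfields, datafields):
    """Hypothetical serializer for the plain MARC XML layout."""
    record = ET.Element("record")
    if leader is not None:
        ET.SubElement(record, "leader").text = leader
    for tag in sorted(controlfields):
        ET.SubElement(record, "controlfield", {"tag": tag}).text = controlfields[tag]
    for tag in sorted(datafields):
        for rec in datafields[tag]:
            attrs = {"tag": tag, "ind1": rec["ind1"], "ind2": rec["ind2"]}
            field = ET.SubElement(record, "datafield", attrs)
            # Every key other than the two indicators is a subfield code.
            for code in sorted(k for k in rec if k not in ("ind1", "ind2")):
                for value in rec[code]:
                    ET.SubElement(field, "subfield", {"code": code}).text = value
    return ET.tostring(record, encoding="unicode")


output = to_xml(
    None,
    {"001": "data"},
    {"910": [{"ind1": "1", "ind2": " ", "a": ["ABA001"]}]},
)
print(output)
```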
class aleph.marcxml.MarcSubrecord(arg, ind1, ind2, other_subfields)[source]

Bases: str

This class is used to store data returned from the .getDataRecords() method of MARCXMLRecord.

It may look like overkill, but when you are parsing values from MARC XML subrecords, you need to know the context in which the subrecord is placed.

Specifically the i1/i2 values, but sometimes it is useful to have access even to the other subfields from this subrecord.

This class provides this access through the .getI1()/.getI2() and .getOtherSubfiedls() getters. As a bonus, it is also fully convertible to a string, in which case only the value of the subrecord is preserved.

arg str

value of subrecord

ind1 char

indicator one

ind2 char

indicator two

other_subfields dict

dictionary with other subfields from the same subrecord

getI1()[source]
getI2()[source]
getOtherSubfiedls()[source]
class aleph.marcxml.Person[source]

Bases: aleph.marcxml.Person

This class represents information about persons as defined in the MARC standards.

name str
second_name str
surname str
title str
aleph.marcxml.resorted(values)[source]

Sort values, but put numbers after alphabetically sorted words.

This function is here for outputs, to be diff-compatible with aleph.

Example:

>>> sorted(["b", "1", "a"])
['1', 'a', 'b']
>>> resorted(["b", "1", "a"])
['a', 'b', '1']
Parameters:values (iterable) – any iterable object/list/tuple/whatever.
Returns:list of sorted values, but with numbers after words
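
One way to get this ordering is to sort words and numbers separately and concatenate the results. The function resorted_sketch below is a hypothetical re-implementation that assumes all values are strings and treats a value as a number when str.isdigit() is true; the real function may decide differently.

```python
def resorted_sketch(values):
    """Hypothetical re-implementation: words first, numbers last."""
    values = list(values)  # accept any iterable, including generators
    words = sorted(v for v in values if not v.isdigit())
    numbers = sorted(v for v in values if v.isdigit())
    return words + numbers


print(resorted_sketch(["b", "1", "a"]))  # ['a', 'b', '1']
```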
