User’s Guide, Chapter 11: Corpus Searching
One of music21’s important features is its capability to help users examine large bodies of musical works, or corpora.
Music21 comes with a substantial corpus called the core corpus. When you download music21 you can immediately start working with the files in the corpus directory, including the complete chorales of Bach, many Haydn and Beethoven string quartets, three books of madrigals by Monteverdi, thousands of folk songs from the Essen and various ABC databases, and many more.
To load a file from the corpus, call corpus.parse and assign the result to a variable:
from music21 import *
bach = corpus.parse('bach/bwv66.6')
Users can also build their own corpora to index and quickly search their own collections on disk, including multiple local corpora for different projects that can be accessed individually.
This user’s guide will cover more about the corpus’s basic features soon. This chapter focuses on music21’s tools for extracting useful metadata: titles, locations, composers’ names, the key signatures used in each piece, total durations, ambitus (range), and so forth.
This metadata is collected in metadata bundles for each corpus. The corpus module has tools to search these bundles and persist them to disk for later research.
Types of corpora
Music21 works with three categories of corpora, made explicit via the corpus.Corpus abstract class.
The first category is the core corpus, a large collection of musical works packaged with most music21 installations, including many works from the common practice era and innumerable folk songs, in a variety of formats:
coreCorpus = corpus.CoreCorpus()
len(coreCorpus.getPaths())
2567
Note
If you’ve installed a “no corpus” version of music21, you can still access the core corpus with a little work. Download the core corpus from music21’s website, and install it on your system somewhere. Then, teach music21 where you installed it like this:
>>> coreCorpus = corpus.CoreCorpus()
>>> coreCorpus.manualCoreCorpusPath = 'path/to/core/corpus'
Music21 also has the notion of a virtual corpus: a collection of musical works to be found at various locations online which, for reasons of licensing, haven’t been included in the core corpus. There are not too many files in there, but it is something we hope to expand. Here’s one such path:
virtualCorpus = corpus.VirtualCorpus()
virtualCorpus.getPaths()[0]
'http://kern.ccarh.org/cgi-bin/ksdata?l=cc/bach/cello&file=bwv1007-01.krn&f=xml'
Finally, music21 allows for local corpora: bodies of works provided and configured by individual music21 users for their own research. Local corpora behave identically to the core and virtual corpora, and can be searched and cached in the same manner:
localCorpus = corpus.LocalCorpus()
You can add and remove paths from a local corpus with the addPath() and removePath() methods:
localCorpus.addPath('~/Desktop')
localCorpus.directoryPaths
('/Users/josiah/Desktop',)
localCorpus.removePath('~/Desktop')
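The directoryPaths output above shows the tilde expanded to an absolute home-directory path. As a minimal sketch of that kind of expansion, using only the Python standard library (this is an illustration of the idea and makes no claim about how music21 performs it internally):

```python
import os.path

# '~' is shorthand for the current user's home directory.
raw_path = '~/Desktop'
expanded = os.path.expanduser(raw_path)

# The printed path depends on the machine and user running the code.
print(expanded)
```

Passing an already-absolute path through os.path.expanduser leaves it unchanged, so the call is safe to apply to any path before comparing it with the entries in directoryPaths.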
By default, a call to corpus.parse or corpus.search will look for files in any corpus: core, local, or virtual.
Simple searches of the corpus
When you search the corpus, music21 examines each metadata object in the metadata bundle for the whole corpus and attempts to match your search string against the contents of the various search fields saved in that metadata object.
You can use corpus.search() to search the metadata associated with all known corpora (core, virtual, and each local corpus):
sixEight = corpus.search('6/8')
sixEight
<music21.metadata.bundles.MetadataBundle {2162 entries}>
To work with all those pieces, you can treat the MetadataBundle like a list and call .parse() on any element:
myPiece = sixEight[0].parse()
myPiece.metadata.title
'Quick Step 43d. Regt.'
This will return a music21.stream.Score object, which you can work with like any other stream. Or if you just want to see it, there’s a convenience .show() method you can call directly on a MetadataEntry.
You can also search against a single Corpus instance, like this one, which ignores anything in your local corpus:
corpus.CoreCorpus().search('6/8')
<music21.metadata.bundles.MetadataBundle {2162 entries}>
Because the result of every metadata search is also a metadata bundle, you can search your search results to do more complex searches. Here, bachBundle is a collection of all works whose composer metadata names Bach; we then limit the results to those pieces in 3/4 time:
bachBundle = corpus.search('bach', 'composer')
bachBundle
<music21.metadata.bundles.MetadataBundle {21 entries}>
bachBundle.search('3/4')
<music21.metadata.bundles.MetadataBundle {4 entries}>
Note
There are actually many more pieces by Bach in the music21 corpus, but many of them are without the metadata specifying him as a composer; his name is only in the filename. To get all the pieces by Bach use:
>>> allBach = corpus.search('bach')
This will search filenames as well. We will aim to get more complete metadata in the core corpus in the near future, and would appreciate community help to achieve this goal.
Metadata search fields
When you search metadata bundles, you can search either through every search field in every metadata instance, or through a single, specific search field. As we mentioned above, searching for “bach” as a composer gives different results from searching for the word “bach” in general:
corpus.search('bach', 'composer')
<music21.metadata.bundles.MetadataBundle {21 entries}>
corpus.search('bach', 'title')
<music21.metadata.bundles.MetadataBundle {20 entries}>
corpus.search('bach')
<music21.metadata.bundles.MetadataBundle {150 entries}>
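To make the difference between these counts concrete, here is a toy model in plain Python. It is not music21’s implementation: the entries are ordinary dictionaries, the function toy_search and the keys 'timeSignature' and 'path' are hypothetical, and every title and file name is made up. It only illustrates two ideas from this chapter: a field-restricted search sees less than an unrestricted one, and a search result can itself be searched again.

```python
# Toy stand-ins for metadata entries: plain dicts, NOT music21 objects.
# All composers, titles, and paths below are invented for illustration.
entries = [
    {'composer': 'J.S. Bach', 'title': 'Chorale',
     'timeSignature': '4/4', 'path': 'bach/choraleA'},
    {'composer': 'J.S. Bach', 'title': 'Minuet',
     'timeSignature': '3/4', 'path': 'bach/minuetB'},
    {'composer': '', 'title': 'Untitled',
     'timeSignature': '3/4', 'path': 'bach/untitledC'},
    {'composer': 'Haydn', 'title': 'Quartet',
     'timeSignature': '6/8', 'path': 'haydn/quartetD'},
]


def toy_search(items, query, field=None):
    """Return the items whose chosen field (or any field) contains query."""
    query = query.lower()
    results = []
    for item in items:
        # Restrict to one field if requested, otherwise look at every value.
        values = [item[field]] if field is not None else list(item.values())
        if any(query in str(value).lower() for value in values):
            results.append(item)
    return results


bach_by_composer = toy_search(entries, 'bach', 'composer')
bach_anywhere = toy_search(entries, 'bach')
# Searching a previous result narrows it further.
bach_in_three_four = toy_search(bach_by_composer, '3/4')

print(len(bach_by_composer), len(bach_anywhere), len(bach_in_three_four))
# prints: 2 3 1
```

The third toy entry has no composer recorded, so only the unrestricted search finds it, through its path. That mirrors the situation described in the note above about Bach pieces whose composer is named only in the filename.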
So what fields can we actually search through? You can find out like this:
for field in corpus.Corpus.listSearchFields():
    print(field)
alternativeTitle
ambitus
composer
date
keySignatureFirst
keySignatures
localeOfComposition
movementName
movementNumber
noteCount
number
opusNumber
pitchHighest
pitchLowest
quarterLength
tempoFirst
tempos
timeSignatureFirst
timeSignatures
title
This list will grow in the future, now that the development team is seeing how useful this searching method can be! Now that we know what the search fields are, we can search through some of the more obscure corners of the core corpus:
corpus.search('taiwan', 'locale')
<music21.metadata.bundles.MetadataBundle {27 entries}>
What if you are not searching for an exact match? If you’re searching for short pieces, you probably don’t want to find pieces with exactly 1 note, then union that set with pieces with exactly 2 notes, etc. Or for pieces from the 19th century, you won’t want to search for 1801, 1802, etc. What you can do is set up a “predicate callable”, which is a function (either a full Python def statement or a short lambda function) that filters the results. Each piece will be checked against your predicate, and only those for which it returns true will be kept. Here we’ll search for pieces with more than 400 and fewer than 500 notes, only in the core corpus:
predicate = lambda x: 400 < x < 500
corpus.CoreCorpus().search(predicate, 'noteCount')
<music21.metadata.bundles.MetadataBundle {49 entries}>
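The same predicate can be written as a named def function, which is easier to document and to test on its own. The sketch below checks it against a handful of made-up note counts without touching music21; the name is_medium_length and the sample numbers are hypothetical.

```python
def is_medium_length(note_count):
    """Return True for note counts strictly between 400 and 500."""
    return 400 < note_count < 500


# Made-up note counts standing in for the noteCount field of several pieces.
sample_note_counts = [12, 400, 401, 450, 499, 500, 2300]

kept = [count for count in sample_note_counts if is_medium_length(count)]
print(kept)  # prints: [401, 450, 499]

# With music21 available, the named function can be passed in place of the
# lambda, exactly as in the search shown above:
# corpus.CoreCorpus().search(is_medium_length, 'noteCount')
```

Note that the chained comparison excludes both endpoints, so pieces with exactly 400 or exactly 500 notes are not matched. Use <= in the predicate if the boundaries should be included.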
You can also pass compiled regular expressions to the search. In this case we will use a regular expression likely to find Handel and Haydn and perhaps not much else:
import re
haydnOrHandel = re.compile(r'ha.d.*', re.IGNORECASE)
corpus.search(haydnOrHandel)
<music21.metadata.bundles.MetadataBundle {176 entries}>
Unfortunately this really wasn’t a good search, since we mostly got folk songs with the title of “Shandy”. Next time, it would be better to start the pattern with a ‘^’ anchor so that it only matches at the beginning of the text being searched.
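The effect of anchoring can be checked with the standard re module alone. This sketch assumes the pattern is allowed to match anywhere in the text (search semantics), which the “Shandy” results suggest; the sample strings are made up, and the \b variant is an additional option not mentioned above.

```python
import re

# Made-up sample values standing in for searchable metadata text.
samples = ['Haydn', 'Handel', 'Shandy', 'Tristram Shandy', 'Joseph Haydn']

loose = re.compile(r'ha.d.*', re.IGNORECASE)        # matches anywhere
anchored = re.compile(r'^ha.d.*', re.IGNORECASE)    # only at the very start
word_start = re.compile(r'\bha.d.*', re.IGNORECASE)  # at the start of any word


def matches(pattern, values):
    """Return the values in which pattern is found anywhere."""
    return [value for value in values if pattern.search(value)]


print(matches(loose, samples))
# prints: ['Haydn', 'Handel', 'Shandy', 'Tristram Shandy', 'Joseph Haydn']
print(matches(anchored, samples))
# prints: ['Haydn', 'Handel']
print(matches(word_start, samples))
# prints: ['Haydn', 'Handel', 'Joseph Haydn']
```

The ‘^’ anchor removes the “Shandy” matches but also misses a name that appears after a first name, while \b keeps such cases. Which one is appropriate depends on how composer names are stored in the metadata being searched.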