Comparing classical music interpretations

Posted on November 9, 2018 by Boris Smus

I built an audio player to easily compare multiple interpretations of the same piece. Here’s an interactive demo, and a video to give you a sense of how it works:

What does it mean to interpret classical music?

At first glance, sheet music is prescriptive: the composer has provided all of the notes, the dynamics (forte, piano), tempo (lento, presto) and changes in tempo (de/accelerando).

In practice, however, the interpreter has a lot of leeway. In some extreme cases, such as the cadenza in solo concertos, the performer gets to improvise a melody based on a chord progression. Some pieces include ornamentation (e.g. trills), which is largely left up to the performer to interpret.

That said, cadenzas and ornaments are somewhat rare. In general, every piece is under-specified by the composer. This gives the performer a lot of leeway to express themselves through the performance, selecting tempo, phrasing, articulation and tone.

Example: Bach’s Goldberg Variations

The Goldberg Variations were composed by Johann Sebastian Bach in 1741, and then popularized by Glenn Gould in his debut album in 1955, transforming a work once considered esoteric into one of the most iconic piano recordings.

In 1981, a year before his death, Gould recorded the pieces again. After a long period of reclusion, he was able to revisit the variations and produce a completely different take. In an interview, he said:

…since I stopped playing concerts, about 20 years, having not played it in all that time, maybe I wasn’t savaged by any over-exposure to it…

Compare Gould’s 1955 and 1981 recordings

Both the 1955 and 1981 recordings are available on YouTube, of course. I found that listening to two distinct performances is not the same as having one integrated player. So I built one: a player specifically for comparing multiple interpretations of the same piece.

Here is a demo that lets you compare the first variation from the Goldberg Variations. Try it out here. You can use your keyboard to skip between interpretations (↑, ↓) just as easily as you can seek within a track (←, →). The mouse works as well. Note that I haven’t tested at all on mobile. Sorry, it’s just a prototype and I’m on paternity leave 😇

I also tried it on Mozart’s Requiem

I am a huge fan of Mozart’s Requiem, and once came across an online thread debating which conductor’s performance was the best. I soon found myself listening to a dozen or so different versions of the same piece. When I was a younger music appreciator, I would often wonder what the point of a conductor really was. I no longer have this question.

Just to give you a taste for how different the interpretations are, here’s an example of three conductors performing the Introitus, the first movement in the Requiem. Check it out here, but be patient as it may take a minute to load and decode the audio. Böhm’s brooding tempo and lumbering chorus (ugh) contrast especially well with Levin’s crisp and minimalist take.

Technical details

For this prototype, I focused on creating a reasonable UI to play back and interact with multiple time-aligned performances of the same piece. An index file specifies metadata for each track, most importantly the URL to the label file and the URL to the audio file. Each label file is a text file with lines in the format START_TIME END_TIME BAR_NUMBER.
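As a rough illustration of the label format described above, here is a minimal Python sketch of a parser. It assumes whitespace-separated fields, times in seconds, and integer bar numbers; the names `BarLabel` and `parse_labels` and the sample timings are made up for this example and are not taken from the prototype.

```python
from typing import List, NamedTuple


class BarLabel(NamedTuple):
    start: float  # start time in seconds (assumed unit)
    end: float    # end time in seconds (assumed unit)
    bar: int      # bar number in the score


def parse_labels(text: str) -> List[BarLabel]:
    """Parse lines of the form START_TIME END_TIME BAR_NUMBER."""
    labels = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        start, end, bar = fields
        labels.append(BarLabel(float(start), float(end), int(bar)))
    # Keep the labels in playback order.
    return sorted(labels, key=lambda label: label.start)


# Made-up timings, for illustration only.
example = """\
0.00 2.10 1
2.10 4.05 2
4.05 6.30 3
"""
labels = parse_labels(example)
```

Splitting on any whitespace means the same function accepts both space-separated and tab-separated files.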

To create the label files, I manually annotated the waveform. Even with Audacity’s extremely useful label track feature, it was a lot of manual work to go through the score, and find each bar’s time range in each recording. At the end of the day, I had start and end times for each bar. For times that don’t fall exactly on bar lines, I linearly interpolate between the bar boundaries, which works reasonably well, but is sometimes a bit off. More granular timing references would address this better, but that currently means doing more manual labor. No thanks!
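The interpolation step could look something like the sketch below: convert a time in one recording to a fractional bar position, then convert that position back to a time in another recording. This is only a guess at the shape of the logic, not the prototype's code. It assumes labels are sorted by start time and that bar numbers are unique (so no repeats); the function names and timings are invented for the example.

```python
from bisect import bisect_right
from typing import List, Tuple

Label = Tuple[float, float, int]  # (start seconds, end seconds, bar number)


def time_to_position(labels: List[Label], t: float) -> float:
    """Map a time to a fractional bar position, e.g. 2.5 = halfway through bar 2."""
    starts = [start for start, _, _ in labels]
    # Index of the last bar that starts at or before t, clamped to a valid bar.
    i = min(max(bisect_right(starts, t) - 1, 0), len(labels) - 1)
    start, end, bar = labels[i]
    fraction = min(max((t - start) / (end - start), 0.0), 1.0)
    return bar + fraction


def position_to_time(labels: List[Label], position: float) -> float:
    """Map a fractional bar position back to a time in this recording."""
    first_bar, last_bar = labels[0][2], labels[-1][2]
    bar = min(max(int(position), first_bar), last_bar)
    fraction = min(max(position - bar, 0.0), 1.0)
    for start, end, number in labels:
        if number == bar:
            # Linear interpolation between the bar's boundaries.
            return start + fraction * (end - start)
    raise ValueError(f"bar {bar} is not labeled")


def map_time(source: List[Label], target: List[Label], t: float) -> float:
    """Find the time in `target` that corresponds to time `t` in `source`."""
    return position_to_time(target, time_to_position(source, t))


# Two made-up performances of the same two bars: one brisk, one slow.
brisk = [(0.0, 2.0, 1), (2.0, 4.0, 2)]
slow = [(0.0, 3.0, 1), (3.0, 7.0, 2)]
print(map_time(brisk, slow, 3.0))  # halfway through bar 2 in both: 5.0
```

Because tempo is treated as constant within each bar, any rubato inside a bar shows up as the "a bit off" error mentioned above.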

Science, help me automate this, please

An obvious question is how to automate the labor of synchronizing a recording to a score. In general, I think this is an unsolved problem, especially for complex tracks containing hundreds of instruments and varying levels of background noise.

A promising approach that could work for solo piano music might be to use something like Onsets and Frames to extract piano rolls and then apply something like Dynamic Time Warping (DTW) in piano roll space. A more general approach might be to synthesize each bar into raw audio (from MIDI), and then align recordings to synthesized audio using something like DTW based on a Constant-Q transform (CQT).
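To make the DTW idea concrete, here is a textbook dynamic time warping sketch in plain Python. It does not run Onsets and Frames or compute a CQT; it assumes the two sequences of feature frames (piano roll columns, CQT frames, or similar) already exist. The function names, the choice of L1 distance, and the toy data are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one feature vector, e.g. a piano roll column


def frame_distance(a: Frame, b: Frame) -> float:
    """L1 distance between two frames (an arbitrary choice for this sketch)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_path(reference: List[Frame], recording: List[Frame]) -> List[Tuple[int, int]]:
    """Return the lowest-cost alignment as (reference index, recording index) pairs."""
    n, m = len(reference), len(recording)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i reference frames
    # with the first j recording frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(reference[i - 1], recording[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Walk back from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


# Toy two-pitch "piano rolls": the performance holds the first note twice as long.
score = [[1, 0], [0, 1], [1, 0]]
performance = [[1, 0], [1, 0], [0, 1], [1, 0]]
print(dtw_path(score, performance))  # [(0, 0), (0, 1), (1, 2), (2, 3)]
```

This version fills the full cost table, so time and memory grow with the product of the two sequence lengths. Real recordings would likely need a windowed or multi-resolution variant.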

My brief and ill-guided attempts to do something like this on real-world examples didn’t yield good enough results. Any ML/DSP experts want to take this on?

Source:

This is a post by Boris Smus, originally from Boris’ website, posted to XRDS with permission of the author.

Evolution of NLP Techniques based on the Google Books Corpus

Posted on June 17, 2018 by Talia Kohen

Great Ideas in current Computer Science Research

Computer Science (CS) Research is an emergent and exciting area. Classical parts of CS are being reshaped to fit a more modern concept of computing. One domain that is experiencing a renaissance is Natural Language Processing (NLP). Classical NLP tasks are being expanded to include time-series information allowing us to capture evolutionary dynamics, and not just static information. For example, the word “bitch” was historically synonymous with a female dog, and more recently became (pejoratively) synonymous with the word “feminist.”

Fig1: The Trend of “Feminist” Over Time and Its Close Relatives

Traditional thesauruses do not contain information on when this synonymy was generated, nor the surrounding events that gave rise to this. This additional information about the historicity of the linguistic change is so innovative that it blurs the boundary between disparate disciplines: NLP and Computational Linguistics. This added dimension also allows us to challenge the foundations of traditional NLP research.

Language is the foundation of civilization. The story of the Tower of Babel in the Bible describes language as the uniting force among humanity, the key to its technological advancement and ability to become like G-d. Speaking the same language, Babel’s inhabitants were able to work together to develop a city and build a tower high enough to reach heaven. Seeing this, G-d mixes up their language, taking away the source of the inhabitants’ power by breaking down their mutual understanding. This story illustrates the power and cultural significance of universal language. Continue reading →

Web-based voice command recognition

Posted on March 14, 2018 by Boris Smus

Last time we converted audio buffers into images. This time we’ll take these images and train a neural network using deeplearn.js. The result is a browser-based demo that lets you speak a command (“yes” or “no”), and see the output of the classifier in real-time, like this:

Curious to play with it, see whether or not it recognizes yay or nay in addition to yes and no? Try it out live. You will quickly see that the performance is far from perfect. But that’s ok with me: this example is intended to be a reasonable starting point for doing all sorts of audio recognition on the web. Now, let’s dive into how this works. Continue reading →

Audio features for web-based ML

Posted on March 6, 2018 by Boris Smus

One of the first problems presented to students of deep learning is to classify handwritten digits in the MNIST dataset. This was recently ported to the web thanks to deeplearn.js. The web version has distinct educational advantages over the relatively dry TensorFlow tutorial. You can immediately get a feeling for the model, and start building intuition for what works and what doesn’t. Let’s preserve this interactivity, but change domains to audio. This post sets the scene for the auditory equivalent of MNIST. Rather than recognize handwritten digits, we will focus on recognizing spoken commands. We’ll do this by converting sounds like this:

Into images like this, called log-mel spectrograms, and in the next post, feed these images into the same types of models that do handwriting recognition so well:

The audio feature extraction technique I discuss here is generic enough to work for all sorts of audio, not just human speech. The rest of the post explains how. If you don’t care and just want to see the code, or play with some live demos, be my guest! Continue reading →

The Power of Bisection Logic

Posted on January 11, 2018 by Abhineet Saxena

Bisection, or binary logic, is an example of a simple yet powerful idea in computer science that has today become an integral part of every computer scientist’s arsenal. It is synonymous with logarithmic time complexity, which is nothing less than music to a programmer’s ears. Yet the technique never fails to surprise us with all the creative ways it has been put to use to solve some tricky programming problems. This blog post will endeavour to acquaint you with a few such problems and their solutions, to delight you and make you appreciate its ingenuity and efficacy. Continue reading →

XRDS

Crossroads – The ACM Magazine for Students

Category Archives: Algorithms