Join the challenge!

Overview

The PSST Challenge consists of two tasks:

  • In Task A: Phoneme Recognition, participants phonemically transcribe speech from recordings of people with aphasia
  • In Task B: Correctness, participants determine whether a target word is both present and correctly pronounced in the same recordings

Contestants may focus on either Task A or B, or they may take on the entire challenge. Although Task B depends on the output of Task A, you are most welcome to use our baseline speech recognition model as the foundation of your system.

Papers must be submitted in softconf by Saturday, April 9, 2022 (extended, originally April 4), formatted according to the author’s kit. Please refer to RaPID submission details for more information.

Challenge Data

Access to the data

We’ve created a unique dataset for phonemic ASR for the challenge, derived from the recordings in the English AphasiaBank. Please fill out this form to gain access to the data.

We have prepared a set of scripts and utilities for downloading and using the data, once you have received access permissions.

Notes on Usage

All users of this dataset must follow the appropriate AphasiaBank protocols for data management. The dataset is intended for use solely as part of the PSST Challenge, and it must not be re-distributed, shared, or repurposed without permission.

Furthermore, we ask that users of this dataset cite both AphasiaBank and us in any resulting publications, as follows:

  1. MacWhinney, B., Fromm, D., Forbes, M., & Holland, A. (2011). AphasiaBank: Methods for Studying Discourse. Aphasiology, 25(11), 1286–1307. https://doi.org/10.1080/02687038.2011.589893
  2. Gale, R., Fleegle, M., Bedrick, S., & Fergadiotis, G. (2022). Dataset and tools for the PSST Challenge on Post-Stroke Speech Transcription (Version 1.0.0) [Data set]. https://doi.org/10.5281/zenodo.6326002

A quick look at the data

To get started with the data using our toolkit, have a look at this short example, where we simply print data from the first four records in the corpus:

import psstdata

data = psstdata.load()

for utterance in data.train[:4]:
    # The key ingredients
    utterance_id = utterance.utterance_id
    transcript = utterance.transcript
    correctness = "Y" if utterance.correctness else "N"
    filename_absolute = utterance.filename_absolute

    print(f"{utterance_id:26s} {transcript:26s} {correctness:11s} {filename_absolute}")

Running this produces output like the following (the absolute paths will reflect your own local setup):

# utterance_id             transcript                 correctness  filename_absolute

ACWT02a-BNT01-house        HH AW S                    Y            /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav
ACWT02a-BNT02-comb         K OW M                     Y            /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT02-comb.wav
ACWT02a-BNT03-toothbrush   T UW TH B R AH SH          Y            /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT03-toothbrush.wav
ACWT02a-BNT04-octopus      AA S AH P R OW G P UH S    N            /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT04-octopus.wav

These four fields (utterance_id, transcript, correctness, and filename_absolute) are the pieces you’ll need for both challenge tasks. Transcripts are in ARPAbet, and can be reliably tokenized using spaces as a separator (phonemes = utterance.transcript.split(" ")). For Task B, we provide an additional resource: all the pronunciations we consider “correct” (more on that below). You can learn more about the psstdata tool on its GitHub page, where you can also find a copy of the README provided with the data packs.
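For instance, a Task A training example pairs the ARPAbet phonemes with the corresponding audio file. The short sketch below is our own illustration rather than part of the psstdata toolkit; it uses the soundfile package to read the waveform, but any audio library will do.

import psstdata
import soundfile  # our choice for this illustration; any audio reader works

data = psstdata.load()
utterance = data.train[0]

# Tokenize the ARPAbet transcript on spaces, as described above.
phonemes = utterance.transcript.split(" ")

# Read the waveform referenced by filename_absolute.
waveform, sample_rate = soundfile.read(utterance.filename_absolute)

print(utterance.utterance_id, phonemes, sample_rate, len(waveform))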

Background on the PSST corpus

This dataset includes audio recordings and phonemic transcriptions of research participants with post-stroke aphasia, specifically their responses to two confrontation picture naming tests, the Boston Naming Test - Short Form (BNT-SF) and the Verb Naming Test (VNT). For more background on what aphasia is and the clinical context of this work, please visit our clinical background page. The audio data was sourced from the AphasiaBank database (MacWhinney et al., 2011), and participant first responses were selected, segmented, and transcribed by our team at Portland Allied Laboratories for Aphasia Technologies (PALAT).

Task A: Phoneme Recognition

The primary task of this challenge (Task A) is to develop an automatic phoneme recognition (APR) system that accurately identifies the phonemes produced by people with aphasia and presents them as they are, rather than smoothing them over to fit whatever real word the speaker may have intended. This task has important clinical applications, including automatic scoring of picture naming test responses (Task B).

Evaluation: Phoneme & Feature Error Rates (PER and FER)

Task A will be evaluated in terms of two metrics: phoneme error rate (PER) and feature error rate (FER), with the emphasis on the latter. PER is defined as the number of phoneme errors (substitutions, deletions, and insertions) divided by the number of phonemes in the reference transcript. It is the standard metric for phoneme recognition, and without it we would have little perspective on the performance of the APR systems built for this challenge. However, PER has particular weaknesses, which we attempt to overcome with FER. FER is similar to PER, but instead of phoneme errors, we count errors in terms of distinctive phonological features. Without worrying too much about the theory behind phonological features, we want to emphasize this intuition: we’re looking for speech recognizers that produce transcripts which sound closest to the truth.

Consider an example utterance where someone says /V AE N/  (“van”), but a speech recognizer predicts /F AE N/  (“fan”). In terms of PER, we find 1 error out of a target length of 3, so 1/3 is about 33%. Now let’s say another speech recognizer predicts /K AE N/  (“can”), which has the same PER of 33%. However, we insist that /F/ is a preferable error to /K/, since it sounds closer. Phonological features can quantify the difference. Phonologically, /F/ and /V/ differ in one respect: the latter uses the voice while the former does not. Otherwise, they’re both consonants produced with the lower lip against the upper teeth, and a steady stream of air (labiodental fricatives). Like /F/, /K/ is also unvoiced; however, it is produced with the back of the tongue against the soft palate, and a sudden burst of air (velar plosive). In terms of phonological features, going from /V/ to /F/ is [-voice], while /V/ to /K/ is [-continuant, -voice, -anterior, -labial, +high]. So in this example, “fan” amounts to 1 feature error while “can” totals 5. If our feature system utilizes 20 distinctive features per phoneme, the FER of “fan” is 1/60 (about 2%), while “can” is 5/60 (about 8%).
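To make the arithmetic above concrete, here is a small self-contained sketch of both metrics. It is not the official scoring tool (which will be released separately), and the feature sets are a hand-made toy covering only the phonemes in this example, standing in for a full distinctive-feature system.

def phoneme_error_rate(predicted, reference):
    # Levenshtein distance over phoneme sequences, normalized by reference length.
    d = [[0] * (len(predicted) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(predicted) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(predicted) + 1):
            cost = 0 if reference[i - 1] == predicted[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(reference)][len(predicted)] / len(reference)

# Toy feature table for just the phonemes in the example (not the official feature system).
FEATURES = {
    "V": {"voice", "continuant", "anterior", "labial"},
    "F": {"continuant", "anterior", "labial"},
    "K": {"high"},
    "AE": {"voice", "syllabic", "low"},
    "N": {"voice", "nasal", "anterior"},
}
FEATURES_PER_PHONEME = 20

def feature_error_rate(predicted, reference):
    # Simplified to substitution-only alignments: count mismatched features per position,
    # then normalize by (reference length x features per phoneme).
    errors = sum(len(FEATURES[p] ^ FEATURES[r]) for p, r in zip(predicted, reference))
    return errors / (len(reference) * FEATURES_PER_PHONEME)

reference = "V AE N".split(" ")
print(phoneme_error_rate("F AE N".split(" "), reference))  # 1/3, about 33%
print(phoneme_error_rate("K AE N".split(" "), reference))  # 1/3, about 33%
print(feature_error_rate("F AE N".split(" "), reference))  # 1/60, about 2%
print(feature_error_rate("K AE N".split(" "), reference))  # 5/60, about 8%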

If you’ve never studied phonology, this might seem like a lot. But rest assured that if you work on a speech recognizer, we’ll provide the linguistic analysis. Soon we will provide tools to compute your PER and FER metrics, as well as a feature distance breakdown similar to the above.

Task B: Correctness

Task B tests how well the APR developed in Task A performs in a clinically relevant context: automatic binary scoring (correct vs. incorrect) of automatically transcribed picture naming responses. If the transcribed response contains the name of the picture, or one of its accepted alternative names, the response is scored as correct.

Correctness Data

To simplify the task, we provide a set of accepted “correct” pronunciations (e.g., send-mail, send-sendin’, etc.), since Task B will be evaluated in terms of response scoring agreement and assembling this dictionary is outside the scope of the present challenge. The dictionary can be retrieved from our Python tool at psstdata.ACCEPTED_PRONUNCIATIONS, or in JSON format.

The main challenge of Task B is distinguishing mispronunciations from APR errors. With an ideal APR (0% phoneme errors), a simple substring function would achieve about 99% accuracy on this task. (The other 1% is mostly part-of-speech errors—e.g. on the VNT, the verb “mail” is correct, but the noun “mailbox” is not—and probably not worth pursuing during the challenge.) Conversely, if you’re relying only on the 1-best transcript from an unreliable speech recognizer, you’ll score poorly with little room to improve. For contestants taking on Task B without also working on Task A, we are sharing our baseline speech recognizer, which first produces a matrix of per-frame, per-phone likelihoods prior to decoding into an ARPAbet transcript.
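As a reference point, the naive substring baseline described above might look something like the sketch below. We assume here that each accepted pronunciation is a list of ARPAbet phonemes; see the psstdata README for the exact format of ACCEPTED_PRONUNCIATIONS.

def contains_pronunciation(predicted_phonemes, accepted_pronunciations):
    # Score "correct" if any accepted pronunciation appears as a contiguous
    # run of phonemes within the predicted transcript.
    for pronunciation in accepted_pronunciations:
        n = len(pronunciation)
        for start in range(len(predicted_phonemes) - n + 1):
            if predicted_phonemes[start:start + n] == pronunciation:
                return True
    return False

# A prediction with a leading false start still contains the target "HH AW S" (house).
predicted = "AH HH AW S".split(" ")
print(contains_pronunciation(predicted, [["HH", "AW", "S"]]))  # True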

Evaluation

Task B will be evaluated in terms of classification performance. Specifically, we will use F1 to quantify how well automated scores obtained from APR transcriptions agree with our team’s scores obtained from manual transcriptions.
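Concretely, treating “correct” as the positive class (an assumption on our part; the evaluation scripts will make this explicit), F1 can be computed as follows.

def f1_score(automated, manual):
    # Compare automated correctness scores against the manual reference scores.
    tp = sum(a and m for a, m in zip(automated, manual))
    fp = sum(a and not m for a, m in zip(automated, manual))
    fn = sum(not a and m for a, m in zip(automated, manual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f1_score([True, True, False, False], [True, False, False, True]))  # 0.5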

Tasks A & B: Submission

The file format for results submission for the two tasks is currently being finalized, and will be posted in the second week of March. We will also provide validation scripts that can be used before submitting your results, to ensure their compatibility with our evaluation process.

Papers must be submitted in softconf by Saturday, April 9, 2022 (extended, originally April 4), formatted according to the author’s kit. Please refer to RaPID submission details for more information.

We look forward to seeing your work!

Frequently Asked Questions

Q: Will <sil> and <spn> be included or excluded in the evaluation of phoneme recognition results?

A: All <sil> and <spn> tokens will be filtered out prior to the phoneme recognition evaluation.
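In practice, that means a transcript can be pre-filtered like this before comparison (a small illustration, not the official evaluation code):

NON_PHONEME_TOKENS = {"<sil>", "<spn>"}

def strip_non_phonemes(transcript):
    # Drop <sil> and <spn> tokens, mirroring what the evaluation will do.
    return [p for p in transcript.split(" ") if p not in NON_PHONEME_TOKENS]

print(strip_non_phonemes("<sil> HH AW S <spn>"))  # ['HH', 'AW', 'S']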

Q: Can additional data be used for training?

A: Yes, additional/external data is allowed in the challenge, with one caveat: if you are using AphasiaBank data, you may not use any of the sessions in this list, which includes the sessions in the test set. Be sure to include a discussion of any additional data used in your paper’s methods section.