Cold, Coons or Punks: The Selective Perception of Recorded Speech

by Sam Guiberson


In the news coverage of Trayvon Martin’s death, a 911 recording of George Zimmerman has taken center stage. Advocates for Zimmerman’s claim of self-defense, as well as advocates for his prosecution, have embraced this recording as incontrovertible evidence of their assertions about the actions and reactions of these two individuals during their encounter on that ill-fated night.

As the days of coverage have progressed, this 911 recording has been replayed over and over again on cable news and repeatedly enhanced by forensic audio experts. News commentators, lawyers, family proxies, and even experts have heard different words in the very same utterance on tape. This case presents one more learning experience about the neurological phenomenon we know as “hearing.”

What the ear feeds the brain to digest is a well-cooked meal that satisfies our hunger to communicate with each other almost all of the time. When hearing deficiencies, or a lack of ample or unambiguous auditory cues, get stirred into the physics, neuroscience and psychology of why we hear what we hear, the routine reliability of communicating with speech evaporates. When the easy consensus of what many ears hear as the same word is lost in the selective perceptions of individual listeners struggling to decide what word they think they hear, the tape recordings many proclaim as unassailably objective evidence become as subjective as your favorite color.

My personal experience with the transcription of thousands of hours of recorded evidence has taught me that there is no such thing as universal, objective comprehension of language. At the remote edges of marginally intelligible speech, where the brain fails at matching a pattern of sonic information against a standard set of sound-to-language translations, our regular and mostly successful means of recognizing words is abandoned for more creative processes.

If the brain can’t attach a word to the sound drawn from its auditory “muscle memory” of what words match what sounds, it begins to apply assumptions based on recollections from similar past experiences with language, contextual cues from contemporaneous conversation and our own individual expectations of what we think we should be hearing. In other words, we try to define experientially what we cannot define acoustically. Our own present and past experience is our brain’s last resort in its desperate effort to decrypt ambiguous aural input. Individual perceptions that have little to do with what sonic impulses move through the ear canals ultimately decide what we think we hear.

We have all encountered such a phenomenon with the elderly. When someone who is hard of hearing misinterprets our words and substitutes other phrases for what we clearly said, we laugh at the disparity. When “I’ve got pilates class” is heard by a confused listener as “I’ve got potato gas,” we know that this listener took in less sound information than we communicated.

The minds of both the able and the impaired listener are functioning in exactly the same way, but because the latter is working with much less information to associate sounds with words, the impaired listener will have to compensate with higher-risk, and more likely inaccurate, associations to the most similar words they imagine might fit the context of the conversation at that time. We construct an interpretation of an obscure spoken word by choosing to hear that word as what we most expected to hear, or what we have heard in similar situations in the past.

Since the invention of voice recording and the admission of tape-recorded speech as evidence in courts, hard-to-hear recordings have become a more challenging forensic issue, less respectfully acknowledged and less well understood than the infirmities of the hard of hearing.

Microphones are dumb listeners. They only present the sounds they record without understanding what is more or less important to hear. The unpredictable and chaotic acoustic environments in which covert audio recordings are produced are not filtered into a hierarchy of minimum or maximum attention by a microphone as they are by a person engaged in conversation. When we converse with each other in a loud and distracting place, we automatically give less attention to the ambient noise around us, such as background noises and people speaking over one another. Because our brains have evolved into much more sophisticated instruments for speech processing than microphones, humans can isolate the vocal range of sounds and selectively hear the speech that is important to us rather than the random sounds that are not.

Because everything is presented without selective perception, and without a contemporaneous contextual assessment of the reason for listening, i.e., to hear words, the recorded conversation is less stimulating and less dimensional than if we had experienced that same conversation in person. The concealed location of a microphone, low voice levels, intrusive background noises and poorly enunciated words can also diminish our ability to capture the less intelligible recorded language.

It is also a cruel trick of nature that the most important language evidence is often found in the most problematic zones of marginal intelligibility. When it comes to juries evaluating the probative value of words spoken on recorded evidence, the devil is in the details, often located deep down in the most obscure facets of a recorded conversation.

Everyone listening to the coverage of the Trayvon Martin case is a juror of sorts, trying to assess Mr. Zimmerman’s motives and behaviors and how they might have contributed to the fatal outcome. Once the 911 tape was available to the news outlets, the coverage began to focus on cries for help that advocates for Zimmerman and for Martin have each identified as the voice of their man. Because the passage on the 911 recording doesn’t include language content, the dispute turns upon the quantifiable dissimilarities in the distinct sonic profiles of the two voices. If their voices are dissimilar enough in range and pitch, the issue is settled in a scientific framework that is less disputable than the accurate recognition of recorded speech.
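The kind of range-and-pitch comparison described above can be sketched in a few lines. This is a minimal illustration only, using synthetic tones in place of real voice samples and a simple autocorrelation in place of the far more involved methods a forensic examiner would use; all names and values here are hypothetical:

```python
# A minimal sketch (not any examiner's actual method) of comparing two
# voices by estimated pitch: find the autocorrelation lag with the
# strongest self-similarity and convert it to a frequency in Hz.
import math

SAMPLE_RATE = 8000  # Hz, assumed for this illustration

def make_tone(freq_hz, seconds=0.5):
    """Synthesize a pure tone standing in for a voiced sound."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

def estimate_pitch(samples, lo_hz=60, hi_hz=400):
    """Return the frequency whose period best matches the signal."""
    best_lag, best_score = None, float("-inf")
    for lag in range(SAMPLE_RATE // hi_hz, SAMPLE_RATE // lo_hz + 1):
        # Correlate the signal with a delayed copy of itself.
        score = sum(samples[i] * samples[i - lag] for i in range(lag, len(samples)))
        if score > best_score:
            best_lag, best_score = lag, score
    return SAMPLE_RATE / best_lag

low_voice = make_tone(120)   # a lower-pitched speaker
high_voice = make_tone(220)  # a higher-pitched speaker
print(estimate_pitch(low_voice), estimate_pitch(high_voice))
```

If the two estimates are far enough apart, attributing an isolated cry to one speaker or the other becomes a measurable question rather than a matter of each listener's ear.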

In the ensuing days of coverage, the focus turned to an under-the-breath utterance of George Zimmerman. On MSNBC, Lawrence O’Donnell announced that he could clearly hear the word “coons” being spoken in the version of the recording he aired. A lawyer for the Martin family agreed; other guests were equally sure or equally uncertain. Abruptly, the Martin family’s attorney backtracked, indicating that the word “coons” was clearly heard on a tape that was not the same as the one just played, and the scandal over Mr. Zimmerman’s words was mired in the uncertain provenance of the recording that lent itself most definitively to that damning interpretation.

This presents another fissure in the bedrock of certainty in evaluating recorded language evidence. The perception of what words are spoken can vary with the characteristics of a duplicate version of a recording and with the audio technology we choose to hear it on. The technical properties of the playback equipment used, or even the bass, treble and mid-range biases of the speakers or headphones we use to listen, can influence what we hear when we are operating at the fringes of intelligibility. When dealing with obscure audio content, the medium can definitely affect the message.

Playing defense in the 24/7 news cycle, Mr. Zimmerman’s spokespeople offered a distinctly different interpretation of that word uttered in the 911 call, one with a less prejudicial take. The word in question wasn’t “coons,” but “punks.” Soon thereafter, different forensic enhancements gave us both an endorsement of “punks” and a fresh alternative, “cold.” These words are not particularly similar, and yet even after enhancement using sophisticated forensic audio technology, the experts are hearing completely different words spoken in but a single second of sound.

When forensic audio technicians shape voice audio, rendering it more unnatural sounding, they do so for the purpose of making the spoken phrases more discernible, but the distortion has its own effect on the brain’s analytical process. The manipulated audio introduces tonal variations that can either reveal or obscure the sounds we need to recognize a word.

Listen to two different forensic renderings looping repetitions of the single enhanced word (mp4 format):


These enhancements produce different experiences for the listener. In the Owen enhancement, there is a sharper edge to the tone that is not as severe in the Brian Stone enhancement. When recorded speech is as aggressively modified as it is here, it may lead us away from accurate interpretation precisely because it steals away the tonal information about the pronunciation of vowels and consonants that our brains depend upon.

In the Owen enhancement, the resonant quality of the long “O” sound is flattened to approach an “ooh” sound that lends false credence to solving the word as “coon.” Mr. Owen is quoted in news reports as hearing “punk,” a choice that also depends on the absence of a long “O” vowel.

One way we can try to escape our selective perception and our intuitive process for recognizing obscure language on tape is to trick the brain into avoiding interpreting words at all. When we isolate the task to recognizing only phonetic units, we can eliminate the more complex engagement with the brain’s contextual processes of resolving what words are being spoken. These steps reduce the word identification process to simply identifying what individual sounds make up the word, and then deducing what the word can or cannot be from the sequence of sounds we identify as letters in the alphabet.

Each competing option for the word in question has four component speech sounds, known as “phonemes.” For “cold” to be validated, we must identify, in sequence, a /k/ sound for a hard “C”, a long “O”, an “L” sound and a brief closing “D” sound. For “punk” to prevail, we must hear, again in sequence, a “P” /p/, a soft “U”, an “N” sound and a closing “K” sound, a /k/. “Coons” would require an opening /k/, a soft and extended “O”, an “N” sound and a closing “S.” Since “punks,” “coons,” and “cold” all have only one syllable, syllabic stress is not a factor as it might be if our competing alternatives were “canyon” and “Cancun.” In the clips below, I have slowed down the progression of sounds on each of the two enhanced versions using identical settings in order to offer more experience with the phonemes that enunciate the word in question.
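The crudest form of the slowing described above can be done with nothing more than a header change. A minimal sketch, assuming WAV-format source audio; this is not the procedure used on the actual enhancements, and unlike real forensic time-stretching it lowers the pitch as it slows the speech:

```python
# A minimal sketch, not the enhancement method actually applied to the
# 911 recording: slow a WAV file by halving the playback rate stamped
# in its header. Pitch-preserving forensic tools are far more subtle;
# this crude approach also drops the pitch an octave.
import wave

def slow_down(in_path, out_path, factor=2):
    """Write a copy of in_path that plays `factor` times slower."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        # Same samples, lower playback rate: longer duration, lower pitch.
        dst.setframerate(params.framerate // factor)
        dst.writeframes(frames)
```

Even this simple transformation illustrates the essay’s point: the stretched audio gives the ear more time with each phoneme, but the altered tonality changes what the brain has to work with.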


The Owen enhancement seems to suggest a closing “S” even more overtly in the slowed version, while the slowed Stone enhancement seems more likely to arrive at a closing “D.” How can we explain the contrasts other than by the manner of the enhancement? A 911 tape isn’t governed by quantum physics – two distinct sounds cannot occupy the same place in a phoneme sequence at the same time.

On the three audio clips below, I have isolated and substantially slowed down the brief intervals in which each phoneme in the sequence must occur. The third and fourth are so brief that they must be heard together to make sense of them. If we can successfully identify the four sounds, or at least the ones that exclude the other phonemes, the correct interpretation of the word between the choices should become apparent. There is a little overlap in the phonemes to orient the listener.

Phoneme 1   Phoneme 2   Phonemes 3 & 4

All four phonemes don’t have to be definitive for us to rate one solution over the others as the most probable choice. If the phoneme sequence begins, or ends, with a hard “C”, a /k/ sound, the more likely choice among the alternatives “punks,” “coons” and “cold” becomes apparent. In other words, no “P”, no “punks.” If the second phoneme is a long “O” or a “U” sound, another outcome is equally favored, since “coons” doesn’t have a long “O” as “cold” does.

After listening, it is evident that the word in question begins with a /k/ sound, a hard “C”. Most of the sound energy in the waveform is devoted to enunciating that /k/ sound. The long “O” sound is also apparent. So far, the word sounds like “coe.” This is where the sonic train comes off the tracks. The next phoneme of the two contiguous ones that run together is likely to be an “N” followed by… an “S” sound. What certainly never sounded like an ending “K”, and has flip-flopped between sounding like an “S” or a “D”, depending on which enhancement one relied on, now sounds more like an “S”. Before the excerpt was isolated, it sounded more like a “D”.

Is it possible that Zimmerman committed a speech performance error when he said the word “coons,” mispronouncing the long “O”? Is it possible that another phrase, “f’ing codes,” is in play, and the “N” sound we register is just a poorly enunciated “D”? Could it be “f’ing cones”? Is the method of analysis influencing our perception of the results of the experiment? Traveling from the macro to the micro scales of audio analysis, we find ourselves still questioning what we are hearing.

Our scientific knowledge has taken us to the moon and to the extreme depths of the ocean, but it cannot remove all doubt about a single spoken word. Language can be that complicated, less about certain outcomes and more about the elusively subjective human perceptions of what is heard.


Sam Guiberson advises and consults with other defense attorneys in cases involving undercover operations, electronic surveillance and recorded evidence. For more information about his work, see or email