The Complexities of Watermarking Audio Speech Signals

Nick Gaubitch

author • 3rd September 2024 (UPDATED ON January 17, 2025)

Background 

Digital audio watermarking received a great deal of attention in the early 2000s as a means of protecting the intellectual property of digital media with the advent of file-sharing and media streaming. While there were some publications discussing speech watermarking, the vast majority of research from this period focused on music, where copyright protection requirements have been most significant.

In the past year, audio watermarking has seen a resurgence of interest, and this time the focus is on speech. The key driver has been the vast improvement in text-to-speech and voice conversion technologies, which has given rise to what are now known, with a somewhat negative connotation, as ‘deepfakes’. It quickly became apparent that deepfakes can be a vehicle for misinformation, media manipulation, social engineering and fraud, to mention but a few. It has therefore become increasingly important to be able to decide quickly and accurately whether a speech signal is real or not, something that by now is far beyond the capabilities of a human listener. This is where watermarking comes in: it has been proposed to watermark synthetically generated or manipulated audio and then use the watermark to validate the authenticity of the speech signal.

In Pindrop’s contribution to Interspeech 2024, one of the flagship conferences in speech science, we present an improved method of watermarking based on the classic spread-spectrum approach. In this blog post, we provide a summary of the main findings of this work. The interested reader is referred to the paper for details [REF TO PAPER].

Fundamentals of watermarking

A watermark typically consists of a pseudo-random sequence of +1s and -1s. This sequence is embedded in a carrier signal, which in our case is speech. The watermarked signal may then be compressed for storage, transmitted over telephone networks or replayed through a loudspeaker. At some point, a user of the speech signal can probe its authenticity by trying to detect a watermark and, if the watermark is present, it would indicate a deepfake. 

Figure 1. Generic watermarking system diagram.
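
The embed-then-detect idea can be illustrated with a minimal sketch. It is not the method from the paper: it assumes simple additive embedding in the time domain, uses Gaussian noise as a stand-in for speech, and picks the names `make_watermark`, `embed` and `detect`, the strength of 0.1 and the threshold purely for illustration.

```python
import random

def make_watermark(length, key):
    """Pseudo-random sequence of +1/-1 values; `key` seeds the generator."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def embed(carrier, watermark, strength):
    """Additive embedding: add the scaled watermark to the carrier samples."""
    return [x + strength * w for x, w in zip(carrier, watermark)]

def detect(signal, watermark, threshold):
    """Correlate the signal with the watermark and compare to a threshold."""
    score = sum(x * w for x, w in zip(signal, watermark)) / len(watermark)
    return score, score > threshold

# Stand-in carrier: Gaussian noise instead of real speech (assumption).
rng = random.Random(0)
carrier = [rng.gauss(0.0, 1.0) for _ in range(20000)]

watermark = make_watermark(len(carrier), key=1234)
strength = 0.1  # illustrative value, not taken from the paper
marked = embed(carrier, watermark, strength)

# Threshold halfway between the expected scores with and without a watermark.
score_marked, found_marked = detect(marked, watermark, threshold=strength / 2)
score_clean, found_clean = detect(carrier, watermark, threshold=strength / 2)
```

Because the carrier is uncorrelated with the pseudo-random sequence, the correlation score of the unmarked signal stays close to zero, while the marked signal scores close to the embedding strength. Lowering the strength improves imperceptibility but moves the two scores closer together, which is the robustness trade-off discussed next.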

Although conceptually straightforward, watermarking presents conflicting requirements that must be satisfied for a watermark to be useful in practice. These requirements are illustrated in Fig. 2. A balance must be struck between robustness to anticipated and deliberate attacks, imperceptibility (or inaudibility) to a listener or an observer, and information-bearing capacity.

Figure 2. Triangle of conflicting requirements of watermarking.

What makes watermarking speech more challenging than music?

There are a number of factors that make watermarking of speech more challenging than watermarking music. The most important of these factors are listed below:

  • Speech communication channels: a typical speech communication channel includes several stages where the speech signal is degraded through, for example, downsampling, additive noise, reverberation, compression, packet loss and acoustic echoes. All of these may be viewed as non-deliberate attacks, and thus they form the base of minimum requirements for watermark robustness. 
  • Tolerance to degradations: the objective of a speech signal is to convey two pieces of information: (i) who is speaking and (ii) a message between a speaker and a listener. Both of these can be achieved successfully even in large amounts of background noise and reverberation. This may be exploited by bad actors to make a watermark undetectable. 
  • Limited spectral content: speech signals generally have much less spectral content than music. This makes it more difficult to find space for embedding a watermark in a manner that makes it imperceptible. 
  • Short frame stationarity: speech can be considered stationary only over frames of 20-30 ms, which is at least two to three times shorter than for music signals. As will be discussed later in this post, this has implications for the length of the watermark that can be embedded.
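
To put rough numbers on the last point, the sketch below converts frame durations into sample counts. The sampling rates of 8 kHz and 16 kHz are assumptions (typical telephone and wideband rates, not stated in this post), as is the idea of counting one watermark value per frequency bin of a real-valued DFT frame.

```python
def samples_per_frame(frame_ms, sample_rate_hz):
    """Number of samples in a frame of the given duration."""
    return round(frame_ms * sample_rate_hz / 1000)

def real_dft_bins(num_samples):
    """Number of distinct frequency bins of a real-valued DFT frame."""
    return num_samples // 2 + 1

frame_table = {
    (ms, rate): samples_per_frame(ms, rate)
    for rate in (8000, 16000)  # assumed telephone / wideband sampling rates
    for ms in (20, 30, 100)    # speech-sized frames vs. a 100 ms frame
}

for (ms, rate), n in sorted(frame_table.items()):
    print(f"{ms:>3} ms at {rate} Hz: {n} samples, {real_dft_bins(n)} bins")
```

Under these assumptions, a 20 ms frame at 8 kHz offers only a fifth of the coefficients of a 100 ms frame, so fewer watermark values fit in each frame.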

Improved spread-spectrum watermarking of speech

Spread-spectrum watermarking is one of the most prominent solutions available in the scientific literature. However, it was developed with a focus on music and, as described earlier, watermarking of speech imposes a different set of requirements. Below we summarize the important improvements and, thus, the novel contributions of our work.

  • Frame-length analysis: in the original spread-spectrum work, frame sizes of 100 ms were used for embedding the watermark. We demonstrated empirically that the optimal frame size for speech is in the range of 20-30 ms; with longer frames the watermark becomes audible and its intensity must be reduced, which in turn reduces robustness. We also showed that frame sizes greater than 100 ms may be used for music without compromising robustness or imperceptibility.
  • LPC-based weighting: one commonly used technique to improve imperceptibility without compromising robustness is to embed the watermark in the high-magnitude frequency components of the carrier signal. While this has proven to work for music, we demonstrate in our work that it is detrimental for speech. The reason is that the high-magnitude frequency components in speech typically correspond to formant frequencies, and when these are disturbed the speech quality is adversely affected. Hence, we derive a weighting function from the linear predictive coding (LPC) spectral envelope, which is closely related to the formants, and use it to weight the watermark such that it is attenuated within the spectral peaks but emphasized elsewhere. Our results show that the intensity of a watermark may be doubled (thereby increasing robustness) when this method is applied.
  • Deep spectral shaping: from classical detection theory, the optimal detector for the watermark (or for any known signal in general) is a matched filter, that is, the correlation between the watermarked signal and the watermark. This holds if the carrier signal is spectrally white and the interference is simple, such as additive white Gaussian noise. As discussed above, this is rarely the case for speech signals. Applying a pre-whitening filter, such as a cepstral filter, can improve detection accuracy by compensating for the inherent spectral slope of speech; however, it does not deal with more complex degradations. Hence, we considered two different deep neural network (DNN)-based architectures for preprocessing the signal prior to the matched filter operation. The models were trained on anticipated degradations such as downsampling and compression down to 8 kbit/s. We showed that this could significantly improve detection accuracy in these more challenging cases, with an equal error rate improvement of up to 98%.
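
The LPC-based weighting idea can be sketched as follows. This is an illustration only: the exact weighting function is given in the paper, and the inverse-envelope weighting used here (`inverse_envelope_weights`), the LPC order of 8, the 65 frequency bins and the synthetic 1 kHz "formant" are all assumptions made for the example.

```python
import cmath
import math
import random

def autocorrelation(frame, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i - k] for i in range(k, n))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return the LPC polynomial a (a[0] = 1) via Levinson-Durbin."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # updated prediction error
    return a

def lpc_envelope(a, num_bins):
    """Sample 1/|A(e^{jw})| at num_bins frequencies from 0 to pi."""
    env = []
    for b in range(num_bins):
        w = math.pi * b / (num_bins - 1)
        A = sum(c * cmath.exp(-1j * w * j) for j, c in enumerate(a))
        env.append(1.0 / abs(A))
    return env

def inverse_envelope_weights(env):
    """Hypothetical weighting: inverse envelope, normalised to unit mean."""
    inv = [1.0 / e for e in env]
    mean = sum(inv) / len(inv)
    return [v / mean for v in inv]

# Synthetic 30 ms frame at 8 kHz: a 1 kHz tone plus a little noise stands in
# for a speech frame with a single strong formant (assumption).
sample_rate = 8000
rng = random.Random(0)
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate)
         + 0.01 * rng.gauss(0.0, 1.0) for n in range(240)]

a = levinson_durbin(autocorrelation(frame, 8), 8)
weights = inverse_envelope_weights(lpc_envelope(a, 65))

# Shape one watermark value per bin: small near the peak, larger elsewhere.
chips = [rng.choice((-1.0, 1.0)) for _ in range(65)]
shaped = [0.1 * w * c for w, c in zip(weights, chips)]
```

With 65 bins spanning 0 to 4 kHz, bin 16 corresponds to 1 kHz, so `weights[16]` comes out well below the average weight of 1: the watermark is attenuated at the spectral peak and carried mostly by the remaining bins.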

Summary

Watermarking has been proposed as a possible solution to the detection of synthetically generated or modified speech. While many methods were originally developed for music, they are not directly applicable to speech. We have highlighted the differences between speech and music and addressed several of them in this work. Specifically, we defined an optimal frame-size range for embedding a watermark, derived an LPC-based weighting function for improved watermark embedding, and developed a DNN-based decoding strategy that is robust to complex degradations. This work thus shows that reasonably robust watermarking strategies can be obtained for speech signals. However, there is still work to be done in order to fully understand the extent to which this can help combat the misuse of deepfakes.
