Edit: The method utilized here is called "delay-and-sum beamforming", which I've learned a fair bit more about since writing this post. When I wrote this post, I was trying to deduce the working principle without reading too much literature. Still, the basic principle holds true, and I think my post here does a fine job of intuitively explaining the basics of things.
In this post I'm planning on investigating the very basic principles on which an acoustic camera might work. By an acoustic camera, I mean an array of microphones that, after some DSP, will produce an image showing the direction from which sound is arriving at the array.
I'll assume the sound source is relatively far away from the microphones. I'll also approach the principles from a 2D perspective, to keep the points as plain as possible.
The starting point
The setup will in this case be the simplest one I can think of: an array of microphones adjacent to each other in the same plane, with a constant distance between neighbours. I've attempted to visualize what I'm thinking of in the following figure:
We have a set of microphones, which record the changing sound pressures in the environment. We want to somehow convert this information into a picture, using digital signal processing. This post will focus on the DSP-part of the figure without delving into the specifics of it all too much. I'll present some principles for one possible approach.
We will for now assume that there's only one sound source, which is relatively far away from the array of microphones.
Let's also assume that the sound source sends out an ideal impulse. It will become evident later why this is important, but for now let's just assume that the sound pressure at the microphones is zero at all times except when this ideal impulse reaches the microphones.
The figure to the left visualizes the sound source relative to the array of microphones. As the sound source is relatively far away from the array, one can see that it starts to resemble a plane wave from the point of view of the array.
I've tried to visualize this further in the next picture. The angle marked in the figure is the angle at which the sound arrives, relative to the array of microphones. Remember that the incoming sound wave represents an ideal impulse; the microphones will only record something when the wave is directly at their position.
The figure displays the resulting output of the microphones. The same ideal impulse, carried by the wave front, arrives at all the microphones, but at different times. Because our microphones are evenly spaced, and thanks to our plane-wave assumption, the delay between adjacent microphones will be equal. The delay for this particular case is:
$$\Delta t = \frac{d \sin\theta}{c}$$
in which $d$ is the distance between two adjacent microphones, $c$ is the speed of sound, and $\theta$ is the arrival angle, here taken to be measured from the array's normal (if the angle is instead measured from the line of the array, the sine becomes a cosine). In practice this means, for example, that the sound will arrive at microphone e four such delays ($4\Delta t$) later than it did at microphone a.
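To make the delay concrete, here is a minimal Python sketch of the arrival times at each microphone. The function name `arrival_delays`, the 5 cm spacing, the 30 degree angle and the 343 m/s speed of sound are illustrative assumptions of mine, as is the convention of measuring the angle from the array's normal.

```python
import math


def arrival_delays(num_mics, spacing_m, angle_deg, speed_of_sound=343.0):
    """Arrival delay of a plane wave at each microphone, relative to the first.

    Assumes the angle is measured from the array's normal, so 0 degrees
    means the wave front reaches every microphone at the same time.
    """
    step = spacing_m * math.sin(math.radians(angle_deg)) / speed_of_sound
    return [k * step for k in range(num_mics)]


# Five microphones (a to e), 5 cm apart, sound arriving at 30 degrees.
delays = arrival_delays(num_mics=5, spacing_m=0.05, angle_deg=30.0)
for name, delay in zip("abcde", delays):
    print(f"microphone {name}: {delay * 1e6:.1f} microseconds")
```

Microphone e ends up exactly four adjacent-microphone delays behind microphone a, which is the equal-delay property the rest of the post relies on.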
Localizing a sound source
Let's now assume that we really do have a sound source at the same location as earlier (at the same angle relative to the microphone array). The microphones will record the sound arriving from the sound source, with the same delays as we calculated earlier. This time we won't be using an ideal impulse, but still something close to that.
How, then, do we know that the sound really did come from the direction we calculated earlier? If we shift each signal by the "ideal" amount of time, that is, by the delay that would line the signals up if the sound came from that direction, we get the result shown in the following picture:
Figure a shows the case where the sound source really is in the same direction as we calculated earlier. By summing each of the signals up, and taking the average of the summed signal, we get something that in this case very closely resembles the original signal (in this ideal case they actually are the same). Figure b shows a case where the direction of the sound source differs from the case calculated earlier. In case b, an averaging of the signal will significantly attenuate the result, as opposed to case a, where the averaged signal was identical to the original.
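The shift-and-average step can be sketched in a few lines of Python. This is a simplified illustration, not a full implementation: the helper names (`shift_right`, `record_impulse`, `delay_and_sum`) are hypothetical, delays are whole samples, and an ideal impulse stands in for the recorded pulse.

```python
def shift_right(signal, num_samples):
    """Delay a signal by prepending zeros, keeping the original length."""
    return [0.0] * num_samples + list(signal[:len(signal) - num_samples])


def record_impulse(num_mics, step_samples, length=64, start=5):
    """Simulate an impulse that reaches microphone k at start + k * step_samples."""
    recordings = []
    for k in range(num_mics):
        recording = [0.0] * length
        recording[start + k * step_samples] = 1.0
        recordings.append(recording)
    return recordings


def delay_and_sum(recordings, step_samples):
    """Align the recordings for an assumed per-microphone delay, then average.

    Microphone k is assumed to hear the sound k * step_samples late, so it
    receives the complementary delay (last - k) * step_samples.
    """
    last = len(recordings) - 1
    aligned = [
        shift_right(recording, (last - k) * step_samples)
        for k, recording in enumerate(recordings)
    ]
    return [sum(column) / len(aligned) for column in zip(*aligned)]


recordings = record_impulse(num_mics=5, step_samples=3)
matched = delay_and_sum(recordings, step_samples=3)     # like figure a
mismatched = delay_and_sum(recordings, step_samples=1)  # like figure b
print(max(matched), max(mismatched))  # prints 1.0 0.2
```

When the assumed delay matches the real one, the five impulses land on the same sample and the average keeps the full amplitude. With the wrong assumption they land on five different samples, and each contributes only a fifth.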
There's more to it...
OK. So far, everything's been relatively logical and intuitive. But there's still more to it. For example: let's imagine that the recorded pulse shown earlier is wider (it has proportionally more low frequency content):
The delay is the same as earlier, but the difference between the averaged results for the two angles isn't that big anymore. Lower frequencies require longer delays, which in turn will require larger microphone arrays. I think localization is still theoretically possible with smaller arrays, but it would require very accurate microphones, so the limit will be based on both the distance between the microphones and the accuracy of the microphones.
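The effect of pulse width can be checked numerically. In this rough sketch I assume Gaussian pulses as a stand-in for the recorded pulse, and a leftover misalignment of 4 samples between neighbouring microphones after steering to the wrong angle. All names and numbers are illustrative.

```python
import math


def gaussian_pulse(center, width, length):
    """A smooth pulse; a larger width means more low-frequency content."""
    return [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(length)]


def averaged_peak(num_mics, residual_step, width, length=400):
    """Peak of the averaged signal when the pulses are still misaligned.

    residual_step is the leftover delay, in samples, between neighbouring
    microphones after shifting for the wrong angle.
    """
    middle = (num_mics - 1) / 2
    pulses = [
        gaussian_pulse(200 + (k - middle) * residual_step, width, length)
        for k in range(num_mics)
    ]
    return max(sum(column) / num_mics for column in zip(*pulses))


narrow = averaged_peak(num_mics=5, residual_step=4, width=2.0)
wide = averaged_peak(num_mics=5, residual_step=4, width=20.0)
print(f"narrow pulse: {narrow:.2f}, wide pulse: {wide:.2f}")
```

With these assumed numbers the narrow pulse is attenuated to roughly a quarter of its amplitude, while the wide pulse survives almost untouched. The same misalignment barely matters once the pulse is much wider than the delay.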
Another case to consider is very high frequencies. Intuitively, very high frequencies mean very quick changes in the signal. If the microphones don't record the phase of the signal correctly, localization won't be possible. Interference at high frequencies will also present problems.
As such, localization using the presented method will only work in some frequency range. The frequency range will have a lower and upper limit.
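For the upper limit, a commonly used rule of thumb for evenly spaced arrays (not something derived in this post) is that the spacing should be at most half a wavelength, which gives a highest usable frequency of c / (2d). A small sketch, with an assumed speed of sound of 343 m/s and a hypothetical function name:

```python
def spatial_aliasing_limit_hz(spacing_m, speed_of_sound=343.0):
    """Highest frequency for which the spacing is at most half a wavelength."""
    return speed_of_sound / (2.0 * spacing_m)


for spacing in (0.02, 0.05, 0.10):
    limit = spatial_aliasing_limit_hz(spacing)
    print(f"{spacing * 100:.0f} cm spacing: about {limit:.0f} Hz")
```

The lower limit has no equally sharp formula, as far as I know. It depends on how long the whole array is compared with the wavelength, and on how accurate the microphones are.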
Theoretical thoughts and practical realization
One can think of an identical signal, delayed and added to itself, as a FIR filter. A quick Google search will return a lot of resources on what this is and how it works. I won't go into the details here, so if you're interested in the theory behind it, you'll need to look it up.
The most important thing to understand is that the FIR perspective can be used both to examine the behaviour of the system from a frequency perspective and to construct the final DSP part of the system. Here I will examine the frequency behaviour of the system. I will leave specific frequencies and sample rates out of the investigation, to keep things as simple as possible.
Remember the "ideal impulse" I mentioned at the very beginning of the post? It's also known as the Dirac delta; its sampled counterpart, the unit impulse, is a single sample with an amplitude of 1 surrounded by zeros. The impulse response of the FIR filter can be built up using this. Each microphone will receive an ideal impulse, at a specific point in time. Let's assume the microphone is ideal, and as such the impulse will travel on into the DSP part of the system with the exact amplitude of 1 and at the exact time it arrived at the microphone.
After the impulse arrives at the DSP part of the system, the system will route the signal to another part of the system, which is responsible for investigating the sounds coming from a specific angle. Each microphone's signal is delayed by the "ideal" amount, calculated as shown earlier. In the case of the ideal impulse, this means that a specified number of samples (zeros) is added in front of the impulse. The sample rate of the system therefore needs to be quite high, so that the delays for all the angles we are interested in correspond to distinguishably different numbers of zeros.
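To see why the sample rate matters, here is a small sketch that rounds the per-microphone delay to whole samples and counts how many different delays the angles from -90 to 90 degrees map to. The 5 cm spacing, the sample rates, the speed of sound and the function names are assumptions for illustration, and the angle is measured from the array's normal.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def delay_in_samples(spacing_m, angle_deg, sample_rate_hz):
    """Delay between adjacent microphones, rounded to a whole number of samples."""
    seconds = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return round(seconds * sample_rate_hz)


def prepend_zeros(samples, count):
    """Delay a sampled signal by putting `count` zeros in front of it."""
    return [0.0] * count + list(samples)


def distinct_delays(spacing_m, sample_rate_hz):
    """Number of different whole-sample delays for angles -90..90 degrees."""
    return len({
        delay_in_samples(spacing_m, angle, sample_rate_hz)
        for angle in range(-90, 91)
    })


print(prepend_zeros([1.0], 3))  # the unit impulse, delayed by three samples
for rate in (48_000, 192_000):
    print(rate, "Hz:", distinct_delays(0.05, rate), "distinguishable delays")
```

Angles that round to the same number of samples cannot be told apart by whole-sample shifts, so a higher sample rate (or fractional-delay filtering, which I don't cover here) gives finer angular steps.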
Using these methods and assumptions, let's investigate one of these parts of the system, focusing on sounds coming from some specific angle. In the very ideal case, if the microphones were perfect and the sound arrived at exactly that angle, the frequency response would be flat. The further away from this ideal angle we go, the further away from the ideal frequency response we go. This is how we approximate how much sound comes from each angle.
In practice, the frequency response will never be ideal. This is illustrated in the figure below. It shows the response for an ideal sound arriving at exactly the investigated angle, compared with the response when the sound arrives at almost the correct angle, and lastly the response when the sound should be attenuated as much as possible.
The meaning of the different responses is perhaps clearer when we put one figure, which shows the frequency response when the incoming sound arrives from almost the correct angle, on top of another, where the difference between the angles is somewhat larger. As the difference in angles increases, so does the difference in the frequency responses.
The circles show the critical spots, at which the system won't work correctly. The result is correct as long as the blue plot is somewhat beneath the red one (mostly sound coming from the correct angle will contribute to the final result). Some unwanted peaks can be seen, which will result in the final response not being correct at all frequencies. At some specific frequencies, the system will interpret sound coming from the wrong angle (blue plot) as coming from the angle we're investigating (red plot).
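The shape of these responses can be reproduced with a short sketch that treats the delayed and averaged microphone signals as a FIR filter and evaluates its gain at a few frequencies. The 5 cm spacing, the angles (measured from the array's normal), the speed of sound and the function names are my own assumptions, not values taken from the figures.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def residual_delay(arrival_deg, steer_deg, spacing_m=0.05):
    """Per-microphone delay left over after steering, in seconds."""
    difference = (math.sin(math.radians(arrival_deg))
                  - math.sin(math.radians(steer_deg)))
    return spacing_m * difference / SPEED_OF_SOUND


def response_magnitude(freq_hz, arrival_deg, steer_deg, num_mics=5, spacing_m=0.05):
    """Gain of the averaged output for a sinusoid at freq_hz (1.0 = unattenuated)."""
    tau = residual_delay(arrival_deg, steer_deg, spacing_m)
    total = sum(
        cmath.exp(-2j * math.pi * freq_hz * k * tau) for k in range(num_mics)
    )
    return abs(total) / num_mics


for freq in (500, 2000, 4000, 8000):
    on_target = response_magnitude(freq, arrival_deg=0, steer_deg=0)
    off_target = response_magnitude(freq, arrival_deg=60, steer_deg=0)
    print(f"{freq:5d} Hz  on-target {on_target:.2f}  off-target {off_target:.2f}")
```

The on-target gain stays at 1 for every frequency, which is the flat response. The off-target gain dips and then climbs back to 1 wherever the frequency times the leftover delay is a whole number, which with these assumed values first happens at roughly 7.9 kHz. Those are the unwanted peaks.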
Clearly the simplified case we have investigated here won't give good results as such, but the system should already work for noise-type sounds or sounds that we heavily band-pass before processing. This was to be expected, as we're only using 5 linearly spaced microphones. For better results, we need more microphones, for one. The microphones should also be placed in such a way that the frequency response is as close as possible to flat when the sound arrives from (approximately) the direction we're investigating, and as attenuated as possible when moving away from it. Evenly spaced microphones bring unwanted filter-type effects, so I would imagine evenly spaced microphones always being a bad thing, although I might be wrong.
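As a rough numerical check of that hunch, the sketch below compares the largest off-target gain of an evenly spaced array with that of one hypothetical uneven layout of the same length. It covers a single pair of angles and a single frequency band, so it is only an illustration, not evidence that uneven spacing is always better. The positions, angles, band and function names are all assumptions.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def response_magnitude(freq_hz, positions_m, arrival_deg, steer_deg):
    """Delay-and-sum gain for microphones at arbitrary positions along a line."""
    difference = (math.sin(math.radians(arrival_deg))
                  - math.sin(math.radians(steer_deg)))
    total = sum(
        cmath.exp(-2j * math.pi * freq_hz * x * difference / SPEED_OF_SOUND)
        for x in positions_m
    )
    return abs(total) / len(positions_m)


def worst_leak(positions_m, arrival_deg=60, steer_deg=0):
    """Largest off-target gain between 2 kHz and 20 kHz, scanned in 10 Hz steps."""
    return max(
        response_magnitude(freq, positions_m, arrival_deg, steer_deg)
        for freq in range(2000, 20001, 10)
    )


even = [0.00, 0.05, 0.10, 0.15, 0.20]    # 5 microphones, assumed 5 cm apart
uneven = [0.00, 0.03, 0.08, 0.15, 0.20]  # hypothetical irregular layout

print(f"even spacing:   {worst_leak(even):.2f}")
print(f"uneven spacing: {worst_leak(uneven):.2f}")
```

In this band the evenly spaced array lets the off-target sound through at practically full gain at its peak frequencies, while the uneven layout never lines all five signals up at once. Other layouts, angles or bands could behave differently.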
The system is still far from ready, but I think this is a good starting point. A lot of questions are left unanswered. Focusing on the parts shown in the last chapter should at least enable me to build a working "virtual" prototype (it's here!).