Deep learning networks prefer the human voice — just like us
The digital revolution is built on a foundation of invisible 1s and 0s called bits. As decades pass, and more and more of the world’s information and knowledge morph into streams of 1s and 0s, the notion that computers prefer to “speak” in binary numbers is rarely questioned. According to new research from Columbia Engineering, this could be about to change.
A new study from Mechanical Engineering Professor Hod Lipson and his PhD student Boyuan Chen proves that artificial intelligence systems might actually reach higher levels of performance if they are programmed with sound files of human language rather than with numerical data labels. The researchers discovered that in a side-by-side comparison, a neural network whose “training labels” consisted of sound files reached higher levels of performance in identifying objects in images, compared to another network that had been programmed in a more traditional manner, using simple binary inputs.
“To understand why this finding is significant,” said Lipson, James and Sally Scapa Professor of Innovation and a member of Columbia’s Data Science Institute, “It’s useful to understand how neural networks are usually programmed, and why using the sound of the human voice is a radical experiment.”
When used to convey information, the language of binary numbers is compact and precise. In contrast, spoken human language is more tonal and analog, and, when captured in a digital file, non-binary. Because numbers are such an efficient way to digitize data, programmers rarely deviate from a numbers-driven process when they develop a neural network.
Lipson, a highly regarded roboticist, and Chen, a former concert pianist, had a hunch that neural networks might not be reaching their full potential. They speculated that neural networks might learn faster and better if the systems were “trained” to recognize animals, for instance, by using the power of one of the world’s most highly evolved sounds — the human voice uttering specific words.
One of the more common exercises AI researchers use to test out the merits of a new machine learning technique is to train a neural network to recognize specific objects and animals in a collection of different photographs. To check their hypothesis, Chen, Lipson and two students, Yu Li and Sunand Raghupathi, set up a controlled experiment. They created two new neural networks with the goal of training both of them to recognize 10 different types of objects in a collection of 50,000 photographs known as “training images.”
One AI system was trained the traditional way, by uploading a giant data table containing thousands of rows, each row corresponding to a single training photo. The first column was an image file containing a photo of a particular object or animal; the next 10 columns corresponded to 10 possible object types: cats, dogs, airplanes, etc. A “1” in any column indicates the correct answer, and nine 0s indicate the incorrect answers. More