Mars spacecraft… squared by Google
June 22nd, 2009 at 9:55 pm (Books, Computers, Dancing, Spacecraft)
I recently discovered Google Squared, an interesting combination of web search and automated information extraction. Actually, it reminds me strongly of the strictly formatted report we had to write in 6th grade English class on frogs. You had to draw a square chart and then label the rows with different kinds of information about frogs, like how they reproduce, what kind of food they eat, and where they live. You then labeled the columns with different information sources, like “Encyclopedia Britannica”, and then you filled in each square with what source X reported about property Y. You then used this chart to write the report itself. This was supposed to teach you how to do research, in the “look up information” sense of the word.
With Google Squared, though, the system figures out what the rows should be (different examples of the category you searched on) and what the columns should be (different properties of each of the examples). It’s fascinating, although you immediately run up against the limitations of current state-of-the-art IE (Information Extraction) technology.
Exhibit A: mars spacecraft
This produces a nice collection of Mars spacecraft, with columns for “mass”, “launch vehicle”, and “launch date”. The first thing I wanted to do was sort by launch date. Unfortunately, the columns aren’t sortable. You can, however, add your own columns, so I tried “cost”. This looked mostly reasonable, except that “Phoenix” was cited as $350M while “Mars Phoenix Lander” (the same spacecraft) was $420M, Mariner was $2.6 (dollars?), and the Spirit rover was $10,000 (if only!). On the plus side, a really neat feature is that each factoid reports its source if you hover over it, and you can click to see other candidate values as well as a confidence rating. All of the values under “cost” were rated low-confidence, even the ones that looked accurate to me.
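The odd “cost” values are the sort of thing a simple sanity check could catch. Here is a rough Python sketch of that idea. It is my own illustration, not anything Google Squared is known to do: the `parse_cost` helper, the suffix table, and the one-million-dollar threshold are all assumptions I made up for the example. Only the four cost strings come from the square itself.

```python
import re

# Cost strings exactly as the square reported them.
reported_costs = {
    "Phoenix": "$350M",
    "Mars Phoenix Lander": "$420M",
    "Mariner": "$2.6",
    "Spirit": "$10,000",
}

# Assumed multipliers for a trailing K/M/B suffix.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}

# Assumed threshold: no Mars mission costs less than a million dollars.
MIN_PLAUSIBLE_DOLLARS = 1e6


def parse_cost(text):
    """Turn a string like '$350M' or '$10,000' into dollars, or None if unparseable."""
    match = re.fullmatch(r"\$([\d,]*\.?\d+)([KMB]?)", text.strip())
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    return number * SUFFIXES.get(match.group(2), 1)


def suspicious_entries(costs):
    """Return the names whose cost is unparseable or implausibly small."""
    flagged = []
    for name, text in costs.items():
        dollars = parse_cost(text)
        if dollars is None or dollars < MIN_PLAUSIBLE_DOLLARS:
            flagged.append(name)
    return flagged


print(suspicious_entries(reported_costs))
```

With the values above, this flags Mariner and Spirit. It would not notice that the two Phoenix entries disagree with each other, which needs the extra step of recognizing that both rows describe the same spacecraft.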
Exhibit B: science fiction authors
This yielded a combination of books and authors, with the auto-chosen columns being “publisher”, “language”, and “Australia” (?!). Switching to the singular, “science fiction author”, yields the same list of items, but with different columns: “publisher”, “ISBN”, and “language”.
Exhibit C: ballroom dance
This yielded an excellent list of ballroom dances. Unfortunately, the columns (“typical instrument”, “mainstream popular” (?), and “stylistic origins”) were almost entirely unpopulated with data. I tried to add “tempo”, but this yielded a result for tango only (33 mpm, presumably measures per minute). That one could definitely use work!
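To put a number on “almost entirely unpopulated”, here is a small Python sketch that treats one column of a square as a mapping from item to value and reports how much of it is filled in. The representation and the `fill_rate` name are my own assumptions. Only the tango value comes from my search; the other dance names are placeholders I picked for illustration, not the actual rows.

```python
# One column of a square: item -> value, with None marking an empty cell.
# Only the tango entry comes from the search; the other rows are illustrative.
tempo_column = {
    "waltz": None,
    "tango": "33 mpm",
    "foxtrot": None,
    "quickstep": None,
}


def fill_rate(column):
    """Return the fraction of cells in a column that hold a value."""
    if not column:
        return 0.0
    filled = sum(1 for value in column.values() if value is not None)
    return filled / len(column)


print(f"tempo: {fill_rate(tempo_column):.0%} populated")
```

A number like this for each column would make it easy to see at a glance which auto-chosen properties the extraction actually managed to fill.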
In summary, I’d say it’s a very cool idea (and fun to play with), but still definitely at the beta level. Doing a good job of information extraction from unrestricted text (the Web) is a really hard task. Keep at it, Google!