Google Cleans Up Messy Data with Refine

Mashable: If you live for data, slave over spreadsheets and constantly find yourself sifting through endless rows and columns of facts and figures, Google’s got a lovely new product just for you — and it’s free and open-source, too.

Google Refine is a project born of Freebase Gridworks, a data-cleaning tool Google acquired when it bought Metaweb during the summer. Google has since renamed Gridworks and relaunched it as Refine.

Basically, Refine makes it much easier for data geeks to clean up and use big sets of data.

For example, if you’re writing an academic paper, government study or news article that requires you to download and parse spreadsheets from Data.gov or a similar source of free information, you might notice all kinds of inconsistencies when you try to sort the data. This is a particular problem when you’re using free, open-to-the-public data that no one has maintained or cleaned up in the past.

Google Refine builds on its Gridworks roots by helping its users correct inconsistencies, change data formats, extend data sets with data from web sources and other databases, and much more. Refine also brings “a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase) and a ton of new transformation commands and expressions,” according to the official Google Open Source blog.
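To give a rough feel for what “correcting inconsistencies” can mean in practice, here is a minimal Python sketch that groups spellings of the same value and rewrites each one to the most common variant. It is only an illustration, not Refine’s own code: the key-based grouping approach, the function names (`fingerprint`, `normalize`) and the sample city names are all assumptions made up for this example.

```python
import re
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a comparison key that ignores case, punctuation, and word order.

    Hypothetical helper for this sketch; not part of any Refine API.
    """
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def normalize(values):
    """Replace each value with the most common spelling in its group."""
    # Group the (whitespace-trimmed) spellings that share a fingerprint.
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value.strip())

    # Pick the most frequent spelling per group; ties go to the first seen.
    canonical = {
        key: max(members, key=members.count)
        for key, members in groups.items()
    }
    return [canonical[fingerprint(value)] for value in values]


if __name__ == "__main__":
    # Made-up sample column with inconsistent entries.
    cities = ["New York", "new york", "New York", "York, New", "Boston "]
    print(normalize(cities))
```

With the sample column above, the four New York variants all collapse to one spelling and the stray trailing space on "Boston " is trimmed. Real data usually needs a human to review the proposed groups before anything is merged, which is the kind of interactive workflow a tool like Refine is meant for.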

Here’s the first of three demo videos showing off Refine’s new and improved data-cleaning capabilities:

We can imagine this tool will allow non-programmers who deal with lots of data, including students and journalists doing research, to manipulate and sort data much more quickly.

What do you think of Google Refine so far?
