From Rough Download to Smart Data

We all do it but we all scratch our heads and wonder how to do it well. I mean downloading data from the internet. Just look at the two screenshots: before, during and after the download of part of a large database of accounting data.

From the Forbes Magazine web site, I wanted to download their Forbes Global 2000 database:

What you see on the web page

Raw Data on the Internet

When I copied and pasted this, it was a complete mess in Excel so I copied and Pasted Special as Text, this is the result:

Copy ... Paste Special ... Text

Initial Worksheet after Paste Special, Text

You can see the problems: the rank, the company name and the County have all been pasted vertically into the same column as each other whilst the quantitative data have been pasted correctly.

I used a variety of techniques to clean this initial worksheet: i needed to set up columns to separate out rank, company and country. Then I needed to put the columns in the right order … I could have done the organisations of columns first, of course.

You will then see that each cell contains a Dollar sign and the letter B or M … this latter point means that some data is shown as $ billions and some are shown as $ millions. I then needed to identify which is which and then develop a strategy for dealing with that. As an exercise you should try to solve this problem yourself … look at this blog and you will see that I have had questions from friends and delegates asking me to show them how to sort this out.

What I did is not difficult but you have to be systematic. Here is the result of my efforts with the database now being ready for analysis:

The Final Version

The Final Version: the clean worksheet

There you are! If you get stuck with this but want to know how I did what I did, go to this page … Rough to Smooth!