Importing Data into Excel
Download
Report
Transcript Importing Data into Excel
Chapter 18
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a
publicly accessible website, in whole or in part.
BUSINESS ANALYTICS:
DATA ANALYSIS AND
DECISION MAKING
Importing Data into Excel
Introduction
Any statistical analysis presumes that you have the appropriate data
in a format suitable for analysis.
The data might exist:
In an Excel® file—which might still need to be rearranged to get it in the
form of a rectangular data set.
In a text file (or ASCII file)—which is any file that can be opened and
read in a text editor such as Notepad; it can be imported into Excel
using Excel’s text import wizard.
In a relational database (such as Access, SQL Server, Oracle)—which
can be imported into Excel by forming a query using the Microsoft
Query package.
A query specifies exactly which data you want to import.
On the Web—which can be imported into Excel by creating a query
and then running it in Excel.
Once the data is imported, it may need to be cleansed to fix wrong
values.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Rearranging Excel Data
Even if the data already exists in Excel, it may need
to be rearranged in the form of a data set—a
rectangular array of data with observations in rows,
variables in columns, and variable names in the top
row.
Sometimes
simple cutting and pasting works.
In other cases, advanced Excel functions are required.
In all cases, it is best to map out a plan and then decide
how to implement it.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.1:
Baseball Salaries Original.xlsx
Objective: To rearrange the data from the
baseball Web queries into a single data set.
Solution: Data on baseball salaries was
imported into Excel from a Web site, with a
separate Web query for each of the 30 teams.
The results for a typical team are shown to the
right, with only a few players listed.
To rearrange all of the data into four long columns with the headings Player, Team,
Salary, and Position, follow these steps:
1. Insert a blank column before column B, and enter the label Team in cell B2.
2. Cut the Arizona Diamondbacks team name from cell A1 and paste it next to
the first Arizona player in cell B3. Then copy it down for the other Arizona
players.
3. Repeat step 2 for each of the other teams.
4. Delete unnecessary rows of labels for the other teams.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.2:
CPI.xlsx (slide 1 of 2)
Objective: To rearrange the monthly data into two long columns, one with
month-year and one with the CPI.
Solution: Monthly data on the Consumer Price Index (CPI) was imported
from the Web using a Web query. A few rows appear below.
The desired results after rearranging are
shown to the right.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.2:
CPI.xlsx (slide 2 of 2)
To rearrange the data into two long columns, follow these steps:
Create the range name Data (for all the CPI values, not the headings in row 1
or column A).
Add a new worksheet for the rearranged data, create the column headings in
row 1, enter 1 in cells A2 and B2, and enter 1913 in cell C2.
To generate the recurring pattern of 1 to 12 in column B, enter the formula
=IF(B2<12,B2+1,1) in cell B3 and copy this down as far as necessary.
To generate the pattern in column A, enter the formula =IF(B3=1,A2+1,A2) in
cell A3 and copy it down.
To generate the years in column C, enter the formula =IF(B3=1,C2+1,C2) in
cell C3 and copy it down.
To generate the month-year values in column D, enter the formula
=DATE(C2,B2,1) in cell D2 and copy it down.
To generate the CPI values in column E, enter the formula =INDEX(Data,A2,B2)
in cell E2 and copy it down.
Copy columns D and E and paste them over themselves as values. Then delete
columns A-C.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Importing Text Data
Most software packages store documents in a proprietary
binary format that is readable only by that package.
However, most software packages allow you to save a file
as a text file, usually with a .txt extension.
A text file can be fixed width or delimited.
With fixed width, each variable’s value starts and stops at fixed
positions (columns) in the line.
With delimited data, there is a delimiter character, usually a tab,
comma, semicolon, or space, that separates the values in a line.
Each line of data has the same length, and the columns line up.
The lines are typically of different lengths and do not line up nicely.
Excel can import either fixed-width or delimited formats
with its text import wizard.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.3:
srn_tmp.txt (slide 1 of 2)
Objective: To import the fixed-width text file data into Excel by using Excel’s text
import wizard.
Solution: The text file srn_tmp.txt was downloaded from the Web. It contains state,
regional, and national (srn) annual data (1895-2005) on temperature. A small
portion of the data is shown below.
The Web site from which the text file
was downloaded also has a data
dictionary, srn_data.txt, that indicates
what the variables are and how they
are stored in columns. Part of this data
dictionary is shown to the right.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.3:
srn_tmp.txt (slide 2 of 2)
Open the srn_tmp.txt file within Excel, and the first step of the text import wizard
appears. Select Fixed width.
The second step of the wizard allows you to separate (or parse) the columns as
listed in the data dictionary. Click on the third, fourth, and fifth positions on the ruler.
The last step of the wizard allows you to fine-tune the import, column by column, but
bypass this step and simply click Finish.
The data is imported into Excel, as shown below.
Create column headings in row 1, using the data dictionary as a guide.
Use “Save As” to save the .txt file as an .xlsx (or .xls) file.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.4:
Mobile Subscriptions.txt
(slide 1 of 2)
Objective: To see how delimited text data can be imported into Excel with
the import text wizard.
Solution: Annual data by country on the number of mobile subscribers
during 2002-2009 was downloaded from the Web into the file Mobile
Subscriptions.txt. A portion of the file is shown below.
This time there are column headings in row 1, but the ragged lines indicate that
this file must be delimited, not fixed width.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.4:
Mobile Subscriptions.txt
Open the Mobile Subscriptions.txt file in Excel, and check the Delimited
option in step 1 of the wizard.
In step 2, Excel guesses that the file is tab-delimited, which is correct, so
click on Finish to accept this.
Some of the imported data is shown below.
(slide 2 of 2)
The Flags column contains no data, so it can be deleted.
Column A can also be deleted because it includes a constant value, Mobile
Subscribers.
The numbers in column D can be reformatted to Numeric with 0 decimals.
Use “Save As” to save the file in .xlsx (.xls) format.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Comments About Importing Text Data
If a text file is comma-delimited and is saved as a .csv file, you can
open it directly into Excel, without the import text wizard.
Excel automatically parses the values between the commas into columns.
Sometimes you will find a table of data on the Web where there is
no option to save it in some type of format, text or otherwise.
You can try copying and pasting the data into Excel, but it is possible
that everything will be pasted into a single column.
If this happens, highlight the data in this column and click on the Text to
Columns button on Excel’s Data ribbon.
The purpose of this button is to parse delimited data in a single column into
several columns.
Whenever you parse text data into Excel columns, there is always
the chance that the data will not line up properly—that is, the data
will get into the wrong columns.
Look closely at the parsed data before proceeding.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Importing Relational Database Data
Database packages such as Access, SQL Server,
Oracle, and many others are extremely complex and
powerful packages.
For database creation, querying, manipulation, and
reporting, they have many advantages over spreadsheets.
However, they are not nearly as powerful as spreadsheets
for statistical analysis.
It is often necessary to import data from a database
package into Excel, where the statistical analysis can be
performed.
Microsoft includes software called Microsoft Query in its
Office suite that makes the importing relatively easy.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Introduction to Relational Databases
The Excel “databases” are often called flat files or, more
simply, tables.
They are also called single-table databases, where table is the
database term for a rectangular range of data, with rows
corresponding to records and columns corresponding to fields.
Flat files are fine for relatively simple database applications, but
they are not powerful enough for more complex applications.
A relational database is a set of related tables, where
each table is a rectangular arrangement of fields and
records, and the tables are linked explicitly.
The linked fields are called keys.
A primary key must contain unique values, whereas a foreign key
can contain duplicate values.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Using Microsoft Query
There are two ways to import data into Excel in Excel 2007
or later.
The first method uses the From Access button in the Get External
Data group on the Data ribbon.
It is limited to importing whole tables or saved queries.
The second method employs the Microsoft Query software, which
allows you to import all or part of the data from many database
packages into Excel.
1.
2.
3.
It comes with the Office package, but may need to be installed.
Once Microsoft Query is installed, importing data from Access (or any
other supported database package) is essentially a three-step
process:
Define the source, so that Excel knows what type of database the data is
in and where the data is located.
Use Microsoft Query to define a query.
Return the data to Excel.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.5:
Shirt Orders.mdb
(slide 1 of 3)
Objective: To illustrate how Microsoft Query can be used to import the
results of queries on the Shirt Orders database into Excel.
Solution: Fine Shirt Company has created an Access database file that has
information on its sales to its customers during the period of 2005 through
2009.
There are three related tables in this database, Customers, Products, and
Orders, with a link between the CustomerID fields in the Customers and
Orders tables and a link between the ProductID fields in the Products and
Orders tables, as shown in the diagram below.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.5:
Shirt Orders.mdb
(slide 2 of 3)
The entire import procedure is done within Excel and Microsoft
Query, not Access.
First, define a data source.
In the Choose Data Source dialog box, make sure the Shirt Orders item
is selected and the bottom checkbox is unchecked, and click OK. This
brings up the Add Tables dialog box and begins the second step, where
you define the query.
Open a blank spreadsheet in Excel and select From Microsoft Query from the
From Other Sources dropdown menu on the Data ribbon.
Select the top <New Data Source> item in the Choose Data Source dialog
box and then click OK.
Fill in the Create New Data Source dialog box, and then the ODBC Microsoft
Access Setup dialog box to indicate which database file you want to use.
Click OK to return to the Choose Data Source dialog box.
Specify which tables are relevant for the query, which fields you want to
return to Excel, and which records meet the criteria you spell out.
The final step is to get these data back into Excel.
Select the Return Data to Microsoft Excel menu item from the File menu.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.5:
Shirt Orders.mdb
The results of one query (Find all of the
records in the Orders table that correspond
to orders for at least 80 units made by the
customer Shirts R Us for the product Longsleeve Tunic, and return the dates and units
ordered for these orders) are shown to the
right.
Once the results of the query data are
returned to Excel, you can then begin the
statistical analysis of the data.
(slide 3 of 3)
The data are still linked to the query. This
means that you can refresh the data in Excel
if the Access data change.
You can also get back to Microsoft Query so
that you can edit your query.
If your ultimate goal is to create a pivot
table based on the database data, you can
do this directly, as shown in the next
example.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.5 (Continued):
Shirt Orders.mdb (slide 1 of 2)
Objective: To illustrate how Microsoft Query can be used to
import data directly into a pivot table.
Solution: Fine Shirt Company would like to break down
revenue from its various customers and products by using
pivot tables.
To base a pivot table on external data, go through
Microsoft Query, not through the usual PivotTable button on
the Insert menu.
Get into Microsoft Query and define a query.
When you select the Return Data to Microsoft Excel menu item
from the File menu, you see a dialog box where you can specify
the type of report you want and where you want it. Select
PivotTable Report (or PivotChart and PivotTable Report).
From here, you can create any pivot tables in the usual way.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.5 (Continued):
Shirt Orders.mdb (slide 2 of 2)
One possible pivot table, showing total revenue broken down by
product, customer (using the Report Filter area at the top), and
quarter of year, is shown below.
You have the option of obtaining corresponding pivot charts
automatically.
The pivot table is linked to the query. This means that you can go back
to Microsoft Query, edit the query, and return to Excel to update the
pivot table.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
SQL Statements
Queries represent a large part of the power behind relational databases.
To standardize queries across packages, SQL (structured query language)
was developed several decades ago.
It is often called the “language of databases.”
Behind each query developed in Microsoft Query is an SQL statement.
The statements can be viewed by clicking the SQL button in the Query toolbar once you
have created a query.
The statements include keywords such as SELECT, FROM, WHERE, and AND, as shown in
the example below.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Web Queries
(slide 1 of 2)
Web sites that contain data are structured in all sorts of
ways, and the steps required to import the data into Excel
for analysis vary greatly.
Many Web sites provide buttons that allow you to download the
data directly into Excel.
On some Web sites, the only way to get the data into Excel is to
cut and paste.
Web query is an Excel tool that lies between these two extremes.
Web queries search for HTML <Table> tags, find the corresponding
data, and bring them into Excel in the usual row and column format.
Many data sets on the Web import beautifully with Web queries, but
some return virtually nothing.
Sometimes you need to run several Web queries on the same basic
site to get all of the data you want.
In a URL, the part to the right of the question mark (if any) is called
the query string, and it specifies exactly which data you want.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Web Queries
(slide 2 of 2)
To get data into Excel using a Web query:
Make sure you have an active connection to the Web, and
open a new workbook in Excel.
Click the From Web button on the Data ribbon.
Fill in the dialog box, the most important part of which is the
URL (the address of the page) at the top.
Once you enter the URL, click Go. You will see the Web
page with yellow arrows next to all of the tables.
Click any of these yellow arrows to change them to green
checkmarks. The selected tables will then be imported into
Excel.
After you click Import, specify where to place the results.
A link to the Web page remains, so you can refresh to
obtain the latest data.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Cleansing Data
You cannot count on real-world data sets to be perfect.
This is especially the case when you obtain data from
external sources such as the Web.
It is your responsibility to correct any problems before you
do any serious analysis.
Cleansing data requires careful detective work to
uncover all possible errors that might be present.
Once an error is found, it is not always clear how to correct
it (for example, missing data).
Some subjectivity and common sense must be used when
cleansing data sets.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.6:
Data Cleansing.xlsx
(slide 1 of 2)
Objective: To find and fix errors in this company’s data set.
Solution: The data file has data on 1500 customers of a
particular company. A portion of these data appears below.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Example 18.6:
Data Cleansing.xlsx
The data set has a number of problems, all of which you might encounter in
real data sets:
Duplicate SSNs
Duplicate birthdates (as a result of the code 9999 being used for missing data
and then formatted as a date)
Negative ages (as a result of the Y2K problem)
Too many categories (as a result of a space being entered before a value in a
text field)
Quotes around numbers (causing them to be interpreted as text, not numbers)
Suspiciously small incomes (as a result of incomes being entered without trailing
zeroes)
Suspicious outliers (as a result of the average spent being entered rather than
the total amount spent)
Use Excel tools to search for the suspicious data values.
(slide 2 of 2)
Sort on columns and look for unusual or duplicate values.
Create a scatterplot of the data and look for suspicious outliers.
Then use other Excel tools, such as Find and Replace, to fix the errors.
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.