Referring To Your Reference Table with Fuzzy Lookup
Referring To Your Reference Table with Fuzzy Lookup
By: Kasim Wirama, MCDBA, MVP SQL Server
One of problem of ETL world is referencing non existent data. For example when one column of fact table that is foreign key to a dimension. It is usually caused by bad data in the fact table. So how to handle it more appropriately in cleansing data with less effort. The answer is that using Fuzzy Lookup. This component will help most of painful process in dealing with data cleansing.
This component provides 2 important informations besides cleansed data:
1. Similarity
Is a number between 0 and 1 representing similarity. You can define how close the similarity is. The close the number to 1, the closer similarity should be met.
2. Confidence
Is a number between 0 and 1 representing confidence of match relative to set of matched results.
I will show you how to use this component.
First you make a flat file named employeelist.txt, its content is as below, separated with tab among columns.
EMPNAME EMPTITLE
HOWARD GEN MANAGER
MATTHEW GENERAL MANAGER
LINDA GENERAL MGR
SUSAN JUNIOR MANAGER
ERIC JUNIOR MGR
JOHN JUN MANAGER
Next you created below script in tempdb database, a lookup table :
create table dbo.OccupationTitle
(
OccupationTitleID int not null,
OccupationTitleDesc varchar(30) not null
);
go
insert into dbo.OccupationTitle values (1,'GENERAL MANAGER');
insert into dbo.OccupationTitle values (2,'JUNIOR MANAGER');
Now, you create SSIS package with Visual Studio 2005, more details how to create this, just refer to early links about SSIS.
Drag Data Flow task into Control Flow designer, double click on the task to switch to Data Flow designer. Drag Flat File Source, double click on it, create new Flat File Connection Manager by clicking New button. And create your flat file connection manager as previous article I ever write before, with Column Names in the first data row checked.
Drag Fuzzy Lookup and connect to it. Double click on it, specify OLEDB connection manager and select reference table that you have just created on previous step
On columns tab connect arrow between EMPTITLE and Occupation TitleDesc, specify input column to EMPTITLE, Lookup column to OccupationTitleDesc, Mapping Type to Fuzzy, and check all comparison Flag check box. And check all column on Available Lookup Column and Available Input Column.
On Advanced tab, you can define your own Similarity threshold, if you have no idea, specify it to 0.5.
Drag Recordset Destination, connect Fuzzy Lookup component to it. Create variable with type Object. Double click on Recordset Destination, specify VariableName property to the variable. On Available Input Columns, check all columns. And click OK.
If you want to see contenct of Recordset Destination, you can add Data Viewer component by right clicking arrow between Recordset Destination and Fuzzy Lookup. Now you can run the package.
You can compare original data and cleansed data between EMPTITLE column and OccupationTitleDesc column, and Fuzzy Lookup will find its OccupationTitleID.
What a useful component for data cleansing.