| Thomas's profileThomas.I Microsoft BI & ...PhotosBlogLists | Help |
|
Thomas.I Microsoft BI & AnalyticsJune 19 Sampling data with the Excel 2007 data mining add inI have used the sample tool in the Excel 2007 before but I never thought that I could do this directly on a data source, only on an Excel 2007 table, but you can. I am using SSAS 2008, as the data source, and the Excel 2007 data mining tools for SSAS 2008. This can be useful even if you do not want to use the data mining add ins and only collect data for a pivot table. Think about building a pivot table on top of 100 thousand records or do it with a random sample of 10 thousand records.
You cannot join tables in this tool, only select single tables or views.
Edit: I had the wrong order of the pictures but that is fixed now.
Open Excel 2007 with an empty sheet and go to the data mining menu.
Click on the button next to the data source name box. You will see the select data source page.
You will se the empty data source query editor below.
Click on the button next to server data source listbox. You will see the new analysis services data source form. Enter the data source name and the taget server name and browse for the database at the bottom in the catalog name listbox. I will select the AdventureWorksDW database in the cataloge name listbox below.
After we have selected the database we will return to the data source query editor. I will select the well known target mail view in the AdventureWorksDW database on the next page below.
You will come back to the select data source page and see the query in the bottom test box. Click next. You will arrive to the select sample type page where you do not have any choice but accepting random sampling.
In the random sampling option page you will only have the option to select the number of records for the random sample.
In the final page you enter the name of the spreadsheet tab. I will not show that. Click finish to get the random sample records into the spreadsheet.
Another approach is to write a TSQL select and use the random sampling features that has been available since SQL Server 2005. The built in sampling in the data mining tools can be interesting for end users without TSQL coding skills. May 20 The ITEM() function with MDX code examplesI usually write MDX blog posts with a business problem as the start point and seldom posts about how MDX functions might work. Here is one exception, the Item function that is useful in many analytic scenarios, will be discussed with examples but only in an attempt to explain how it works. I thank Chris Webb for helping me on some questions that I had.
We will start with a query to get a base result set to use with the Item function in the later examples. Run this MDX in management studio:
Select {[Measures].[Internet Sales Amount]} on 0, CrossJoin (Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Category])) On 1 From [Adventure Works];
Now let us continue with a simple Item example on the same MDX select that we have started with.
Select {[Measures].[Internet Sales Amount]} on 0, CrossJoin (Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Category])).Item(0) On 1 From [Adventure Works];
The result is the first tuple in the set that you have seen in the starting query. The index that you enter into the Item() function starts with zero so this is a clear result.
If you change the index in the Item() function in the pervious query to 1 you will get the second tuple in the starting query.
If you increase the index one by one you will walk down the first result set tuple by tuple, so there is nothing unclear about how the Item() function works with one Index.
Now we will add a second argument to the Item() function like Item().Item() and se how this works. Remember to refer to my first example at the top of this blog post.
Select {[Measures].[Internet Sales Amount]} on 0, CrossJoin (Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Category])).Item(0).Item(1) On 1 From [Adventure Works];
What you get as a result might require an explanation. The query asks for the second member from the first tuple. That means that from the tuple (2001, Accessories) we are asking for Accessories. What we get as a result is the sum of all Accessories in all the years in the first query.
We can repeat the same behaviour for the second tuple with this query.
Select {[Measures].[Internet Sales Amount]} on 0, CrossJoin (Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Category])).Item(1).Item(1) On 1 From [Adventure Works];
Finally if we execute this statement we will see the total for the components product category that is null.
Select {[Measures].[Internet Sales Amount]} on 0, CrossJoin (Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Category])).Item(3).Item(1) On 1 From [Adventure Works];
I hope that these code examples can help with understanding how the Item functions works. This function is easy to understand as long as you use one argument or index but gets a little harder with two arguments or indexes. March 08 Empty values in MDXEmpty values in an SSAS cube can sometimes be valuable information but according to my experience, but in most scenarios, you would like to somethng about them. The scenarios I will discuss assumes that you do not have entered zeros for non existing measures in the source fact table.
Let us start with the first query:
Select Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]) On 0,
CrossJoin(Descendants([Customer].[Customer Geography],[Customer].[Customer Geography].[Customer]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Product])) On 1 From [Adventure Works] Where([Customer].[State-Province].&[FL]&[US],[Measures].[Internet Sales Amount]) It will show customers in Florida(USA) and what they have bought during the calendar years.
The first obvious tool in the MDX toolbox is to use the NON EMPTY key word to remove empty rows and columns from the result.
So far nothing new. Most client tools use the NON EMPTY option to remove empty rows and columns.
Update: I forgot the NonEmpty function. Run this MDX and it will give you the same result as in the previous picture, but probably much faster.
Select NON EMPTY Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]) On 0,
NONEMPTY(CrossJoin(Descendants([Customer].[Customer Geography],[Customer].[Customer Geography].[Customer]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Product])), [Measures].[Internet Sales Amount]) On 1 From [Adventure Works] Where([Customer].[State-Province].&[FL]&[US],[Measures].[Internet Sales Amount]); Lets have a look at some other MDX keywords and functions that can be useful. What can the EXISTS function do? Run the MDX below and have a look at the result.
Select NON EMPTY Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]) On 0,
EXISTS(CrossJoin(Descendants([Customer].[Customer Geography],[Customer].[Customer Geography].[Customer]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Product])), [Measures].[Internet Sales Amount]) On 1 From [Adventure Works] Where([Customer].[State-Province].&[FL]&[US],[Measures].[Internet Sales Amount]); The empty rows appears again but that is because EXISTS will use the members from the first set, Customers in Florida, that have bought products but only show the members from the Customers(in Florida) dimension. The products these customers have bought will not be shown.
Using the EXISTING key word will return the same result as EXIST.
Select NON EMPTY Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]) On 0,
EXISTING CrossJoin(Descendants([Customer].[Customer Geography],[Customer].[Customer Geography].[Customer]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Product])) On 1 From [Adventure Works] Where([Customer].[State-Province].&[FL]&[US],[Measures].[Internet Sales Amount]); Another option is to use the FILTER function. It will return the same result as the NON EMPTY key word.
Select NON EMPTY Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]) On 0,
Filter(CrossJoin(Descendants([Customer].[Customer Geography],[Customer].[Customer Geography].[Customer]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Product])), [Measures].[Internet Sales Amount] > 0) On 1 From [Adventure Works] Where([Customer].[State-Province].&[FL]&[US],[Measures].[Internet Sales Amount]); Finally you can see what the COALESCEEMPTY function can do with null values. This function do not remove nulls or zeros but change their values.
WITH Member Measures.InternetSalesX as
CoalesceEmpty([Measures].[Internet Sales Amount],"x") Select Descendants([Date].[Calendar],[Date].[Calendar].[Calendar Year]) On 0,
CrossJoin(Descendants([Customer].[Customer Geography],[Customer].[Customer Geography].[Customer]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Product])) On 1 From [Adventure Works] Where([Customer].[State-Province].&[FL]&[US],Measures.InternetSalesX) These examples have been written without paying any attention to performance issues. They are simple examples of how to change or remove empty values from a MDX query.
February 08 Text Data Mining on Mosha's blogWith the permission of the blog owner I have collected various blog posts since late 2004 and have excluded posts about subjects outside of core MDX discussions. This have been a copy and paste work into a single large text file of the text and not the HtML source tags. I have marked the main text in each blog entry of interest and have simply done a copy and paste into a text file.
Text data mining in SQL Server 2005 or 2008 starts with using SSIS and the Term Extraction and Term Lookup transforms in the data flow. You start by creating a dictionary of terms that you think are relevant. Create a data flow task in the control flow and in the data flow you build something like in the picture below.
I have a flat file connection to my text file with all the blog entries. Import all text in a single column. One error that can occur with this task is truncation errors so in the next picture you can open the the advanced editor of the flat file source and set it to ignore truncation errors. I recommend you to set up a error flow but that is outside of the scope of this blog post.
The next task is the data conversion task since the text data mining algorithm only accepts unicode.
The term extraction task is used to create the dictionary of interesting MDX key words. The first time you run it you will get a long list of frequent words were 85 percent of the are rubbish. You set up the term extraction as in the following picture. I have selected Noun Phrase only because it returned the most interesting and clear results. Have a look at Books On Line for more information about this.
Now follows the long work of removing less interesting key words. If not you will see that [Internet Sales Amount] is the most frequent key word i´n Mosha's blog. You build a table of rejected words that will filter out them from the list of key words each time you run the term extraction task. You refer to this rejected key words list in the same task on the first tab. I have done selects on the results of the term extraction task and inserted the rejected words by a simple Insert Into TSQL statement.
When you have a dictionary of interesting key words you can create a new data flow task and set it up like the first with a connection to the same flat file. The only difference is that you use term lookup instead of term extraction task. All other task remain the same except for that you will need a second OLE DB destination. Here you keep the key words and their frequency in significant parts of the text file. The reference table is the dictionary that we have created in the first part(term extraction output).
The relation between the flat file output and the reference table is set up in the second tab of the term lookup task.
Finally, what were the results of this text data mining of Moshas blog?
First we have the most significant key words that requires a lot of work with the rejected words table.
This dictionary is the reference table that I have used in the terms lookup task.
The result from the terms lookup is the following.
It can be a little harder to se but to the left you have the keywords from the dictionary and to the right the text fragment where they have appeared. The frequency in the middle is the number of times the key word appears in the text fragment.
My model can be improved by adding dates for the blog entries and to run a data mining algorithm on top of these two tables that I have created, but that is the subject of another blog entry. February 01 The Performance Point turmoilIt is more than a week since the news broke out. PP Planning will be discontinued and PP M&A will go into the next version of MOSS Enterprise Edition.
I am lucky to have focused only on PP M&A and only a few hours on planning.
My first reaction was why MS would move a cheap product like PP M&A and ProClarity into the expensive MOSS. The next question was why MS would cancel PP Planning since the market for budget applications(even if PP Planning have more features than that) must be big since this i handled manually today even in large organizations?
Budgets and forecast are done by by every level of management, not only the top, and many times this is handled by sending out Excel spreadsheets with many revisions before a budget/forecast is approved. The market must be large enough even for MS.
What is going on? During the last week I have read blog post, articles and comments on the PP Planning forum to get a view about what has happened and the future.
This is only fragments of comments based on sources open to everyone with internet access.
I have always been told to find the simple explanation and to avoid the more complex ones. Maybe this change is only about revenue? I also have been thinking about another scenario. MS has a strong market share in Office applications, Windows on the desktop and Exchange as the communication server. If I am not wrong I have seen that Windows as a server platform is closed to 50 percent of the market. If MS continues to release service applications on top of Office ,new law suits can be a threat. Since MOSS is not dominant on its market that can be a safe habour for PP M&A. Since SQL Servers market share is the third in size it can be safer to sell and develop planning applications as a part of this platform instead. It is a new regime in the White House so maybe this was a preventive action. Regarding Office 14 I still guess that we will see it on the market next year. January 18 A simple MDX ratio problem"Ratio to parent" in MDX is a frequent discussion in the SSAS newsgroup and on the blogs. Mosha's and Darren Gosbell's thoughts are linked here and involves the MDX Axis function. This blog entry is about a less advanced problem in the "ratio to parent scenario". It is about how you can get a total percentage distribution of members from two crossjoined dimensions on rows in a MDX select statement. I think it can be help if your boss demands a quick solution and do not accept the default behaviour in a SSAS client.
The default behaviour can be seen if you run this MDX in Management Studio.
Select {[Measures].[Internet Sales Amount],[Measures].[Internet Ratio to All Products]} On Columns, NON EMPTY CrossJoin(Descendants([Customer].[Customer Geography],[Customer].[Customer Geography].[Country]), Descendants([Product].[Product Categories],[Product].[Product Categories].[Category])) On Rows FROM [Adventure Works] WHERE ([Date].[Calendar Year].&[2004]); The Internet Ratio to all products is defined like this in the Adventure Works cube.
[Measures].[Internet Sales Amount] The result will look like this: For each country you will get the percentage distribution of [Internet Sales Amount] in the product categories for each country. A client will accept the dimension you have crossjoined and calculate the percentages according to this. And now to the simple problem that might not have an obvious simple solution. If I would like to to see the percentage distribution for all the countries and the product categories as a ratio to total. What was the percentage contribution of [Internet Sales Amount] for Bikes in Australia to the total sales for all markets? This analysis is an issue for the SSAS clients that I know about. My solution is to build a calculated measure that references All Products and All Customers members like this. WITH Member Measures.AllInternetSales2004 As Member Measures.RatioInternetSales2004AllGeo As Select {[Measures].[Internet Sales Amount],[Measures].[Internet Ratio to All Products], The result will look like this. In order to solve this simple ratio to parent problem you will not need to use the MDX Axis()-function but build a calculated measure with the two top members for each dimension on rows. The All-member is useful in other scenarios as well and I will return to that later. The problem with my approach is that you will have to make this ratio for each measure that the ratio applies to. If this takes less or more time than learning about the Axis function is up to you to decide. Edit: The pictures are updated. To validate the percentages you copy the resultset in Management Studio with a right click in the upper left corner. Paste it in to Excel but you might have to change the decimal sign from (.) to (,). I have only one decimal so you will see a minor gap. Add decimals in the format_string property of the calculated measure. January 06 Book Review: MS SQL Server 2008 Analysis Services UnleashedThis is the second edition of the previous title "MS SQL Server 2005 Analysis Services" and includes some updated chapters on dimensions strucrures, since that was a total change since the AS 2000 cubes. There are still projects going on that migrates AS 2000 cubes to SSAS 2005 or 2008 without paying attention to the fact that dimensions are much different in the two latest versions. For you who will be facing a migration project I recommend you to buy this book and read chapter 4 to 8 because it will save you from a lot of problems.
Both the chapters about MDX and the ones about dimensions are vital for any professional SSAS cube developer and they do not assume any high technical skills in the interior of the engine. The second halph of the book requries a slightly higher technical level and, I assume, are mainly written for developers that writes client software or extensions to SSAS 2008.
The new chapters are a few due to the fact that the changes between SSAS 2005 and SSAS 2008 are limited and mainly about impovements in performance. Chapter 41 is about the new dynamic management views for monitoring SSAS 2008 resources. Chapter 28, Thread Management and 29, Architecture of Query Execution - Calculating MDX Expressions are also new. The main news regaring MDX in SSAS 2008, "Dynamic Named Sets", is covered on page 213 to 215 with some clear examples. The improvements in BIDS are covered in chapter 9.
What kind of BI-developers can find this book useful? The first chapters about dimensions, attribute relations, measure groups and MDX will be suitable for developers with some previous experience of SSAS 2005 and 2008. The second halph is a great help for developers that write code for SSAS 2008 applications and for developers interested in optimizations and advanced maintenance topis.
If you are still on SSAS 2005, like myself in my daily work, you can buy the SSAS 2008 edition because there are still many things that applies to SSAS 2005. December 18 Searching the web for information about the next MS SSAS browser for SSAS(2008?)When MS bought ProClarity in early 2006 I never expected that this analytical software would be sent into the deep freezer for 3-4 years. Obviously I have been spoilt with the ProClarity development team that added good innovative capabilities in one year or two years release cycles.
The only new innovative end user tool I have seen from MS is the data mining add ins for Excel 2007.
Performance Point Monitoring and Analytics 2007, added some capabilities, like balanced scorecards, to the PPS Dashboard Designer, but no other innovative visualizations tools were added in that release.
What will happen in the near future? Since I am a MVP since october this year I can only refer to information already published on the web. What will a search about Office 14, the next Office release, or Performance Point(next version) reveal. Not much but some information can point to a direction.
Since ProClarity(that have seased to exist) and Performance Point are parts of the next Office 14 release you can follow what directions this next Office release is taking.
If I do a summary from blogs that i trust they say that Office 14 will both be a full webb release and a Windows release.
According to Paul Thurrot the Live version of the next Office release is delayed. That is one of the web versions. The web version is the only real indicator of the next release of Office I have found this far.
Other blogs connects the next Office with the release of Windows 7, that is the next O/S after Vista.
The time table, at present, for the next version of Performance Point seems to be the end of 2009 or early 2010.
And in 2010 the next version of SQL Server is supposed to be released. December 12 MDX and response time in ProClarity and Excel 2007I have heared some talk, from other consultants here in Sweden, about the MDX generated by Excel 2007, when querying SSAS 2005 cubes, is slow in some scenarios. The worst performance killers, from what I have been told are arbitraty shape queries and calculated members.
In this first blog post about client performance I will compare the MDX generated by ProClarity 6.3 and Excel 2007 (sp1) in two other scenarios.
It will help with finding a baseline of each clients performance under one standard scenario and one that is more extreme.
The first is a simple query with the customers and products dimensions nested together on higher levels in the hierarchy in the Adventure Works cube.
The second scenario is the "query from hell" when customer and products are nested on leaf levels. I will take the MDX generated by the clients and paste it into MDX studio and run each query on a cold and warm cache. What we will miss is the time for each client to render the result sets graphically in grids. We will only measure the performance of the MDX generated in each client.
The first version of ProClarity was released at the same time as OLAP Services, the first SSAS version that shipped with SQL Server 7 in 1999. Excel 2007 was released one year after SSAS 2005, at the end of 2006.
For Excel 2007 I have used the Pivot table add in that you can download from Codeplex. The hardware is a one year old laptop with Windows XP sp2 and 4 GB RAM.
This is the first query in ProClarity.
This is the MDX generated by ProClarity.
SELECT { [Date].[Calendar].[All Periods].CHILDREN } ON COLUMNS ,
{ { { [Customer].[Customer Geography].[All Customers].CHILDREN } * { [Product].[Product Categories].[All Products].CHILDREN } } } ON ROWS
FROM [Adventure Works]
WHERE ( [Measures].[Internet Sales Amount] )
CELL PROPERTIES VALUE, FORMATTED_VALUE, CELL_ORDINAL On a cold cache in MDX I have got the following counters.
Time : 3 min 25 sec 334 ms
Calc covers : 182114 Cells calculated : 198643 Sonar subcubes : 3664 SE queries : 182110 Cache hits : 185750 Cache misses : 18 Cache inserts : 11 Cache lookups : 185768 Memory Usage KB : 65408 On a warm cache I have got these counters.
Time : 109 ms
Calc covers : 3 Cells calculated : 96 Sonar subcubes : 1 SE queries : 1 Cache hits : 1 Cache misses : 0 Cache inserts : 0 Cache lookups : 1 Memory Usage KB : 0 In Excel 2007 the same tablee looks like this.
The MDX generated by Excel 2007 looks like this.
SELECT NON EMPTY Hierarchize({DrilldownLevel({[Date].[Calendar].[All Periods]})}) DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON COLUMNS ,
NON EMPTY CrossJoin(Hierarchize({DrilldownLevel({[Customer].[Customer Geography].[All Customers]})}), Hierarchize({DrilldownLevel({[Product].[Product Categories].[All Products]})})) DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON ROWS FROM [Adventure Works] WHERE ([Measures].[Internet Sales Amount]) CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS In MDX studio I have got these counters for the Excel 2007 MDX on a cold cache.
Time : 3 min 26 sec 540 ms
Calc covers : 182159 Cells calculated : 198687 Sonar subcubes : 3675 SE queries : 182117 Cache hits : 185757 Cache misses : 18 Cache inserts : 11 Cache lookups : 185775 Memory Usage KB : 63936 And for the warm cache I have got these counters.
Time : 124 ms
Calc covers : 48 Cells calculated : 140 Sonar subcubes : 8 SE queries : 8 Cache hits : 8 Cache misses : 0 Cache inserts : 0 Cache lookups : 8 Memory Usage KB : 0 When we compare ProClarity and Excel 2007 we can hardly see any differences in performance. Some counters are a little bit higher in Excel 2007 but this can depend on that I have the grand totals active in Excel but not in ProClarity.
Now to the second scenario, combining the leaf levels in the customer and product dimensions.
The result look like this in ProClarity.
The MDX generated by ProClarity looks like this.
SELECT { [Date].[Calendar].[Calendar Year].&[2003] } ON COLUMNS ,
NON EMPTY { { { DESCENDANTS( [Customer].[Customer Geography].[All Customers], [Customer].[Customer Geography].[Customer] ) } *
{ DESCENDANTS( [Product].[Product Categories].[All Products], [Product].[Product Categories].[Product] ) } } } ON ROWS FROM [Adventure Works]
WHERE ( [Measures].[Internet Sales Amount] )
CELL PROPERTIES VALUE, FORMATTED_VALUE, CELL_ORDINAL The counters for the cold cache
Time : 3 min 28 sec 296 ms
Calc covers : 182114 Cells calculated : 222671 Sonar subcubes : 3665 SE queries : 182110 Cache hits : 185750 Cache misses : 18 Cache inserts : 11 Cache lookups : 185768 Memory Usage KB : 53696 And the counters for the warm cache.
Time : 1 sec 781 ms
Calc covers : 3 Cells calculated : 24124 Sonar subcubes : 1 SE queries : 1 Cache hits : 1 Cache misses : 0 Cache inserts : 0 Cache lookups : 1 Memory Usage KB : 3328 The Excel 2007 generated for "the query from hell" scenario is not possible to show here because it is a very large list. A fragment of that code is possible to show.
SELECT NON EMPTY Hierarchize({DrilldownLevel({[Date].[Calendar].[All Periods]})}) DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON COLUMNS ,
NON EMPTY CrossJoin(Hierarchize(DrilldownMember({{DrilldownMember({{DrilldownMember({{DrilldownMember({{DrilldownLevel({[Customer].[Customer Geography].[All Customers]})}}, {[Customer].[Customer Geography].[Country].&[Australia]})}}, {[Customer].[Customer Geography].[State-Province].&[NSW]&[AU], ---
FROM (SELECT ({[Date].[Calendar].[Calendar Year].&[2003]}) ON COLUMNS FROM [Adventure Works] --This is the subcube statement at the end
Performance counters on a cold cache are these.
Time : 3 min 27 sec 375 ms
Calc covers : 182399 Cells calculated : 224175 Sonar subcubes : 3739 SE queries : 182158 Cache hits : 185842 Cache misses : 18 Cache inserts : 11 Cache lookups : 185860 Memory Usage KB : 62656 And on a warm cache they are these.
Time : 3 sec 93 ms
Calc covers : 288 Cells calculated : 25628 Sonar subcubes : 52 SE queries : 48 Cache hits : 92 Cache misses : 0 Cache inserts : 0 Cache lookups : 92 Memory Usage KB : 9216 On a warm cache the response time is three times higher for Excel 2007 than for ProClarity, but we are talking about a few seconds. On a cold cache responsetime are the same for both clients, like the number of cells calculated and SE queries.
Summary: For these two query scenarios the performance of the MDX generated by the two clients are almost identical, except for the difference for the cold cache scenario for the "query from hell". It i more or less what you should expect.
The next step is to test how the two clients handle calculated measures and calculated members.
December 03 Data mining - Key influencersThis is my first blog post that uses SSAS 2008. It is also valid for SSAS 2005 users.
It is about a data mining components for Excel 2007 and a simple algorithm, "Table Analysis Tools and "Analyze key Influencers", that I like because it is simple. You will only have to decide what column you would like to predict and the columns that you think might decide the value of that predicted column. It will only work with discrete or categorized columns and not continous data like sales or income.
Instead of using the known example of bike buyer patterns in the customer dimensions I would like to use this algorithm to predict what attributes that are most imprtant for explaining the number of cars each customer has bought.
This is the start table in the data mining samples that is part of the installation of the Excel 2007 data mining add ins for SSAS 2005 and SSAS 2008.
I select the "Table Analysis Tools and "Analyze Key Influencers"
The first step in that wizard is to decide on what column to explain.
And optionally you can unselect columns that are of no interest like a source table key (like ID in the next picture). You should also deselect columns with duplicate information like a avoid using both the key and the description columns(CustomerId and CustomerName). Select one of these columns.
And this is the result were I have filtered on the most important attributes below. The Naive Bayes algorithm that is used here uses the influencing attributes one by one. This means that you can only look at each number of cars and their influencers and not across different number of cars.
If the customers have zero cars this can be explained by education and the commute distance. One car is explained by another education degree and commute distance. More than one car is explained by education, the number of children, commute distance and income. |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|