Aug 17 2010

Handling Conditions on Outer Joins

Categories: Join Techniques,Universe Design Dave Rathbun @ 4:25 pm

I don’t like outer joins in my reporting universes. Never have. Sure, if I am creating a universe against an application system I might consider using outer joins because of the normalized nature of the data. But if I am reporting against a warehouse schema of some kind, I really prefer to use inner joins. That way I avoid any potential performance issues caused by outer joins, but more importantly I avoid questions about report data. That being said, outer joins do have a specific purpose, and if I need to use them in my universe I certainly can.

One of the biggest challenges with outer joins (other than potential performance issues) is explaining to a user why their query results changed because they added a condition to their query. Remember that users don’t (typically) look at the SQL, so they won’t know that I have created an outer join. It can be confusing. Fortunately I have options as to how my outer joins are executed, so once I determine their usage requirements I can change the way my universe behaves.

Defining the Problem

For this post I will am going to use a very simple universe with only three tables, shown here.

Summit Sporting Goods Universe screen shot

This universe joins a customer to an order, and an order to order lines. In my database I have one customer that does not yet have any orders. If I run a query against the current universe structure, this new customer will not show up. My requirement is to show all customers, whether they have orders or not. This must be true even if I put a condition on the order table. That’s where it gets tricky. 🙂

ANSI_92 Parameter Setting

In order to make this work, I have to make sure that my universe parameter setting for ANSI_92 is set properly. From the File menu I will select Parameters, and on the tabbed dialog box that appears I will click on the Parameters tab. I need to set the ANSI_92 parameter to “Yes” in order for this technique to work. Without this setting, my joins will be created using Oracle-specific syntax and that isn’t what I want here.

Universe parameters screen shot

Once I have verified this setting I can close the parameter screen. Next I need to define my outer join.

Defining an Outer Join

The outer join definition is done the same way with all databases, but it’s not always done in the same direction. In some databases I have to check the primary table and in others I need to check the secondary or optional table. Rather than trying to remember which database does what, I will pick a side and check to make sure that the tiny “o” appears on the expected side of my join. For example, I want to reconfigure my existing inner join between customers and orders and make it an outer join as shown here.

Outer join checkbox screen shot

The structure window looks a bit different now.

Outer join screen shot

As mentioned earlier, the tiny “o” on the right side of the join denotes the optional side. If the “o” is on the wrong side, I will open the join properties window and select the other checkbox.

That’s how I create an outer join in my universe. Let me show how it works first, then I will show how I can break it.

Outer Join Queries

I have exported my outer join universe and created a very simple query. I have included the customer ID, name, and their order total. There is one customer (appropriately named “New Customer”) that has no orders and their order amount shows up blank because of the outer join setting.

Query results from Web Intelligence using an outer join

So far everything is working as I expected.

Breaking an Outer Join

An outer join is designed to allow rows that don’t completely match to still show up on my report. What happens if I apply a condition to the outer relationship? For example, I might want to add a condition to limit orders to a particular date range.

Query panel screen shot

Here are the results after running this query.

Query results from conditions

What went wrong? (Answered on next page…)

33 Responses to “Handling Conditions on Outer Joins”

  1. Comment by Ton

    The date restriction is in the Where-clause which limits the output to customers with orders only. If the condition is moved to the outer join definition this query would produce customers without orders and customers that meet the condition on order date.

  2. Comment by Yoav

    Hi Dave,

    The row is missing because the customer doesn’t have an order date, therefore that row will not be found in the range you picked.
    To my opinion only solid RI (relationship integrity) is the real answer for getting rid of outer joins.
    Negative results when using Outer joins:
    1. Bad performance
    2. Writing Free hands (sometime the outer join is opposite to the one in the universe)
    3.Miss understanding of the users the business logic rather than using RI and giving “Unknown” or zero values for unmatched rows between the fact table to the dimension table .

  3. Comment by Dave Rathbun

    Hi, folks, make sure you read “page 2” of the post, because I do answer the question left at the bottom of page one. 🙂

  4. Comment by Bob

    Thanks for the tip Dave, that was a feature that I was unaware of. I think that this also illustrates the value of using properly designed star schemas for reporting. A lot of people are hesitant to snap their data off into a warehouse structure because they fear the complexity and time required to do it. In my experience, 90% of most reporting requirements can be dealt with using a very simple datamart that is loaded nightly. And the time spent building this is saved by the speed with which reports can be designed.

  5. Comment by Ton

    Thank’s Dave. I never explored the advanced tab (simply overlooked it for many years). That’s what you get when you think you know it all 🙂

  6. Comment by Al Gulland

    Hi, great article and always good to learn new tricks! My initial thought on this would to resolve the problem in the report: have two queries, one listing all customers and the other getting customers with orders in the data range. Then after merging on customer dimension, and setting the ‘extend merged dimensions’ option, we should be able to build table of required result.

  7. Comment by Dave Rathbun

    Hi, Al, thanks for your comment and welcome. Using two data providers is indeed one way to answer the question of “which customers have no orders” but it’s not the most efficient. If I want to see all customers and be able to identify those with no orders it works fine. But if I want only a list of customers with no orders, it’s going to download too much data and overwhelm me with data that exists when I am interested only in finding those customers where data does not exist.

    I’ll write up the techniques I have in mind in another blog post at some point.

  8. Comment by Chandra sekhar

    Hi,

    We can achieve the same using Derived table by pushing the filter condition (between) to the definiton of derived table and apply the outer join with customer table? isn’t it?

    We have overlooked this option for these years. Thanks throwing light on this option (reason being most cases we wont edit PARAMETER options (ANSI 92))

    Please share your thought on this view.

  9. Comment by Thani

    HI Dave..,
    I have encountered the same issue, i used to handle this by adding OR condition in Where Clause as below.This will get the result as expected.
    SSG.ORD.ORDERDATE BETWEEN ’01-01-1987 00:00:00′ AND ’31-03-1987 00:00:00′ or SSG.ORD.ORDERDATE is null

    Thanks,
    Thani

  10. Comment by Chuck

    Okay Dave….. I got a related issue that I had hoped would be solved with Nested Derived Tables and forcing the filter in the FROM clause. I had hoped you had an idea for me.

    I have 1a table that I alias twice to join to itself for VARIANCE reporting. There are 5 keys that may or may not be in both data sets. Thus I need to full outer join. However, I need to restrict the table BEFORE doing the full outer join (otherwise, the full outer join is negated.) I was hoping that I could expose a NESTED DERIVED table that the user can add FILTER CONDITIONS to and that “table” would be included in the derived table with the conditions nested, before I outer join. For the longest time I had belived Netsed Derived tables were going to work like this. They do not.

    Any ideas on getting user supplied Filters into the inline tables From cluase?

    thansk!

  11. Comment by Raj

    Dave, your blog provides so much information and very much happy with that. Thanks a ton.

    But i would like to request you to cover a topic in Designer for “Aggregate awareness for other purposes”, if possible.
    Some of the people think that it is just confined to store pre-calculated data to use sum() but i heard, not sure, it can be used for many other purposes.

    Hope you will take up this subject in your next post.

    Thanks
    Raj

  12. Comment by Dave Rathbun

    Hi, the @AggregateAware() has been on my list of topics to cover for quite some time. Unfortunately it never seems to bubble to the top.

  13. Comment by py

    This is good information. But I have a situation where there is a complex join (from two primary key columns) between the tables, and then a condition. Having the complex join, the Advanced tab is not allowing me to choose this drop down. Any ideas on how I can workaround this issue?

  14. Comment by Dave Rathbun

    Can you turn your complex join into two simple joins of one column each?

  15. Comment by py

    I tried doing that. But since all the other joins of the universe are in non-ANSI, and only this joins are ANSI type, my old reports starts to fail.

  16. Comment by Praveen Karnam

    Hi Dave, All the conditions move to ‘From’ instead of ‘ON’ when i use ‘OR’ in the filter conditions instead of ‘AND’. Is this a bug?

  17. Comment by Dave Rathbun

    Unfortunately I do not know. I have not tested it that thoroughly. Have you tried contacting SAP about this?

  18. Comment by franlin dsuza

    Hi Dave,
    You have mentioned earlier(refer comment Dave Rathbun August 20th, 2010 at 8:52 am ) that you would post as some point about techniques regarding getting customers with no orders, in cases where the customer table is huge and if one is interested in getting only customers with no orders. Could not locate the post.
    I have a similar situation and I have used a sub-query e.g select customer from cust_table where cust_no not in (select cust_no from orders). Would appreciate your views.
    Have a good day ahead.
    Thanks
    Franklin.

  19. Comment by Kuntal Bhowmick

    I have tables A and B. They are joined like this… A right outer join B on (A.PK_ID=B.PK_ID).Now I have a self restricting join on B as B.FK_ID=2.So when I use objects from A and B that generates a join like this…A right outer join B on (A.PK_ID=B.PK_ID and B.FK_ID=2).But because of the outer join,self join attributes are getting ignored by oracle.I think one solution can be creation of a derived table of B with self join.
    Please let me know your view on this and if any other solution is possible for this.

  20. Comment by Dave Rathbun

    In the “old days” we might use the (+) syntax on a self restricting (stub) join to include it as an outer restriction. Since I have been primarily working with Teradata lately I have not had a need to update this technique. If I get time, I will try some experiments. But I don’t have a solution for you right now.

  21. Comment by Dave Rathbun

    The fastest way is generally to use a relational operator like MINUS or EXCEPT, whichever your database offers. If this is required often I will built a pre-defined condition object in my universe that contains the logic required. Something like:

    cust.cust_id in (select cust.cust_id c MINUS select cust.cust_id from cust, orders where cust.cust_id = orders.cust_id)

    That will generate a list of customers that have no orders.

  22. Comment by Ramanathan S

    Hi,

    Hope you are fine 🙂

    Thanks for this useful post.

    Can you please clarify/ give suggestion for the following:

    There are only two tables in universe which are joined with one to one cardinality.

    One of the table has 33 million of records, so that even when we ran the report for one region we are not able to see the data in infoview. The report keeps on running and not able to see the data (report keeps on running more than 5 mins).

    Note: In report level there is no calculations it is a drag and drop report

    In that huge table no constraints defined but index defined for some of the columns.

    Any suggestions to improve the performance of the report, so that the user able to see the data in 1 or 2 mins.

    How to determine/set the array fetch size? Is it possible to increase it more than 500?

    Thanks,
    Ramanathan S

  23. Comment by Dave Rathbun

    Performance questions like that are best presented to the DBA team which should have access to table and index structures, database statistics, and so on. Array fetch size is specified as a part of the connection, and yes, it is possible to set it to more than 500, but it has very little impact on query execution speed. It only impacts how big of a “chunk” of data is retrieved from the database in each request.

  24. Comment by Ganesh

    1000 Likes for this post.

  25. Comment by Boppana

    One more way of doing this. Create a data provider with customer (in this case). Merge the new data provider with the main data provider on customer and use the Customer merged dimension in report. This way helps to show all the customer even if they don’t have data also applying any type of filter on the main data provider wont affect the result.

  26. Comment by Peter

    Comment by Ramanathan S September 11th, 2013 at 9:10 pm
    Hi,

    Hope you are fine 🙂

    Thanks for this useful post.

    Can you please clarify/ give suggestion for the following:

    There are only two tables in universe which are joined with one to one cardinality.

    One of the table has 33 million of records, so that even when we ran the report for one region we are not able to see the data in infoview. The report keeps on running and not able to see the data (report keeps on running more than 5 mins).

    Note: In report level there is no calculations it is a drag and drop report

    In that huge table no constraints defined but index defined for some of the columns.

    Any suggestions to improve the performance of the report, so that the user able to see the data in 1 or 2 mins.

    How to determine/set the array fetch size? Is it possible to increase it more than 500?

    Thanks,
    Ramanathan S

    I hope you found an answer already…..
    but one first thing, is there a index on the filter for region?
    second..is there a function on the region filter, like trim, substr, etc?

  27. Comment by Dave Rathbun

    Hi, typically I would talk to the DBA team about this. If you’re trying to do a full table scan of 33 million rows, it’s going to take a while. Do you have any conditions at all? Are the columns referenced in those conditions indexed?

  28. Comment by Allon

    Hi Dave,

    Thanks for a well written and helpful article. I have a related question. Using your example of the CUSTOMER, ORD and ITEM tables: Lets say we made, as you have done, the CUSTOMER to ORD join an outer join but left the ORD to ITEM join as an inner join. Correct me if I’m wrong but I would think any query that involved all 3 tables would only return customers that had orders. Even though the CUSTOMER to ORD join is an outer join, the inner join from ORD to ITEM would mean that the query across all 3 tables would not return customers without orders.

    If I am right, and this is not what we desire, how would we avoid this? Would we need to make the ORD to ITEM join an outer join even though the business rule is that every ORD will have at least one item?

    Thanks,
    Allon

  29. Comment by Allon

    Hi,

    I have answered my own question. From what I can see, BO is clever enough that when a query has a combination of inner and outer joins, it builds the query in such a way that the outer joins will not fail as a result of other failed inner joins. It seems to do this by always putting all the inner joins before the outer joins. That way, there is no need for sub-queries.

    I haven’t found documentation explaining this but have found it by testing to see what BO does. I would be interested to know if anyone can point me to a more formal explanation of this.

    Thanks,
    Allon

  30. Comment by Ananth

    Hi Dave,

    Nice explanation..!

    I’m Using IDT and how do I implement the same in IDT ? I dont see similar options there.

  31. Comment by wenlei

    HI-, Dave,

    thanks a lot for your post on outer join. I face the similar problem and I do appreciate your insight in this regard.
    You said you prefer not using left outer join and there are some better way to show null record. I would very much to know the alternative.
    I try to find your blog on that but not able to. Would you give the link please?
    thanks a lot

  32. Comment by Dave Rathbun

    One way to find missing records is to do a “Minus” or “Except” query. That process can be far more effective than an outer join.

  33. Comment by wenlei

    Hi-, Dave,
    thank you very much for your reply.
    if we are using the minus set function to find out the missing records, that means we create a separate universe (or free hand query in webi), then create a query in webi for missing record, finally combine the result with the query from universe using inner join.
    is this the way you do it?
    again thank you very much.

    Wenlei