Aug 17 2010

Handling Conditions on Outer Joins

Categories: Join Techniques,Universe Design Dave Rathbun @ 4:25 pm

I submit the answer is that nothing went wrong. 🙂 When I set up an outer join, I told my universe that some customers did not have orders, and that was okay. When my user set up their query, they said they wanted to restrict their report to show customers and order totals but only for orders that fell between a certain date range. Where does a customer with no orders go in this case? Or in other words, what can I tell from an empty amount on this report? It doesn’t tell me if the customer has zero orders, or if they have zero orders within the date range. Those are two possible cases. In order to avoid the ambiguity, the outer join is turned off and my new customer is missing.

Altering the Outer Join Behavior

In the scenario I have outlined so far, I don’t think I would change anything. In my opinion, the answer provided is the correct answer. But there could be cases where I want to allow conditions on an outer join without breaking the outer join behavior. Fortunately it is possible to do that. What I have to do is move the condition (query filter) to the FROM clause instead of applying it to the WHERE clause.

Here is the SQL code that generated the report shown above.

SELECT
  SSG.CUSTOMER.CUSTID,
  SSG.CUSTOMER.NAME,
  SSG.CUSTOMER.STATE,
  sum(SSG.ORD.TOTAL)
FROM
  SSG.ORD RIGHT OUTER JOIN SSG.CUSTOMER ON (SSG.CUSTOMER.CUSTID=SSG.ORD.CUSTID)
WHERE
  SSG.ORD.ORDERDATE  BETWEEN  '01-01-1987 00:00:00'  AND  '31-03-1987 00:00:00'
GROUP BY
  SSG.CUSTOMER.CUSTID, 
  SSG.CUSTOMER.NAME, 
  SSG.CUSTOMER.STATE

To fix this, I need to go back to my universe and fine tune the join settings for my outer join. I showed this same screen earlier but this time I am going to click the Advanced button so I can fine-tune my join behavior.

Join properties screen shot

Clicking that button gets me to this screen.

Advanced Outer Join Properties screen shot

Of the options shown, I can be selective or all-inclusive. If I want every single object from the impacted table to appear in the FROM clause for a condition, there is an easy setting to do that. In this example I am going to specify that only the Order Date object goes into the FROM clause.

The “Fix” In Action

After making these changes and exporting my universe, I can open my Web Intelligence report and see what happens. Here is what the report looks like.

Outer join modified results

And here is the SQL code.

SELECT
  SSG.CUSTOMER.CUSTID,
  SSG.CUSTOMER.NAME,
  SSG.CUSTOMER.STATE,
  sum(SSG.ORD.TOTAL)
FROM
  SSG.ORD RIGHT OUTER JOIN SSG.CUSTOMER ON (SSG.CUSTOMER.CUSTID=SSG.ORD.CUSTID  AND
  SSG.ORD.ORDERDATE  BETWEEN  '01-01-1987 00:00:00'  AND  '31-03-1987 00:00:00')
GROUP BY
  SSG.CUSTOMER.CUSTID, 
  SSG.CUSTOMER.NAME, 
  SSG.CUSTOMER.STATE

Is this better or worse?

Remember that earlier I talked about how an outer join condition can make a report ambiguous. I remember (from looking at my full data set earlier) that there is only one customer in my database that has no orders. Yet in the results shown above there are two customers showing with no orders. One of them has no orders. The other has no orders within the time frame specified in the query filter.

Before moving the order date condition into the FROM clause here are the steps my database was using to create my report dataset:

  1. Get a list of orders
  2. Join the customers to the orders; include all customers even if they do not have a matching order
  3. Apply the WHERE clause to restrict the results to those orders in the date range

Since the WHERE clause happens after the join has been done, it essentially invalidates the outer join process. Customers with no orders have a NULL order date, and NULL orders do not exist between any date range. Therefore the rows are dropped.

Here is the modified process with the order date moved into the from clause:

  1. Get a list of orders, limited to those orders within the date range (the date condition is in the FROM clause now)
  2. Get a list of customers, and join that list to the list of orders returned in the first step. Customers do not have to match up with an order

In the modified version, the order date filter is still used to reduce the list of order rows, but it happens before the join to the customer list. That means every customer still shows up. It’s all in the sequence of operations, and that sequencing is controlled by whether conditions appear in the FROM or WHERE clause.

Which version is correct?

Conclusion

Ultimately it is up to the business user community and the universe designer to figure out which answer is the right answer. I would submit that removing the outer join in this case is the best solution, rather than trying to tweak how it behaves. If I am asking for oder-related information, to me it stands to reason that I only want to see customers with orders. I would rather do that than try to explain all of the different possible answers that users could get with an outer join.

If I want to get a list of customers that have no orders, there are more effective ways to do that than use an outer join. That may be another blog post. 🙂 But if I need to include an outer join in my universe, at least it’s nice to know I have options as to how it behaves. In this example I only tweaked a single object but as I mentioned I can extend the behavior to an entire table by selecting a different option. The “Advanced” button has been available since version 6 (if my memory is correct) and certainly appears in all versions of XI.

Addendum

Like many posts, this one was inspired by a question posted on BOB. The poster tried to use the Advanced tab, and was told that the join was “too complex” because it was a two part key. In my simple example I only had one-part keys for my joins. The easy way to work around this is to create two simple joins between the tables instead of a single complex join. By doing that the Advanced button works as I described in this post. Just make sure both simple joins are in the same context(s) if the universe includes that feature.

33 Responses to “Handling Conditions on Outer Joins”

  1. Comment by Ton

    The date restriction is in the Where-clause which limits the output to customers with orders only. If the condition is moved to the outer join definition this query would produce customers without orders and customers that meet the condition on order date.

  2. Comment by Yoav

    Hi Dave,

    The row is missing because the customer doesn’t have an order date, therefore that row will not be found in the range you picked.
    To my opinion only solid RI (relationship integrity) is the real answer for getting rid of outer joins.
    Negative results when using Outer joins:
    1. Bad performance
    2. Writing Free hands (sometime the outer join is opposite to the one in the universe)
    3.Miss understanding of the users the business logic rather than using RI and giving “Unknown” or zero values for unmatched rows between the fact table to the dimension table .

  3. Comment by Dave Rathbun

    Hi, folks, make sure you read “page 2” of the post, because I do answer the question left at the bottom of page one. 🙂

  4. Comment by Bob

    Thanks for the tip Dave, that was a feature that I was unaware of. I think that this also illustrates the value of using properly designed star schemas for reporting. A lot of people are hesitant to snap their data off into a warehouse structure because they fear the complexity and time required to do it. In my experience, 90% of most reporting requirements can be dealt with using a very simple datamart that is loaded nightly. And the time spent building this is saved by the speed with which reports can be designed.

  5. Comment by Ton

    Thank’s Dave. I never explored the advanced tab (simply overlooked it for many years). That’s what you get when you think you know it all 🙂

  6. Comment by Al Gulland

    Hi, great article and always good to learn new tricks! My initial thought on this would to resolve the problem in the report: have two queries, one listing all customers and the other getting customers with orders in the data range. Then after merging on customer dimension, and setting the ‘extend merged dimensions’ option, we should be able to build table of required result.

  7. Comment by Dave Rathbun

    Hi, Al, thanks for your comment and welcome. Using two data providers is indeed one way to answer the question of “which customers have no orders” but it’s not the most efficient. If I want to see all customers and be able to identify those with no orders it works fine. But if I want only a list of customers with no orders, it’s going to download too much data and overwhelm me with data that exists when I am interested only in finding those customers where data does not exist.

    I’ll write up the techniques I have in mind in another blog post at some point.

  8. Comment by Chandra sekhar

    Hi,

    We can achieve the same using Derived table by pushing the filter condition (between) to the definiton of derived table and apply the outer join with customer table? isn’t it?

    We have overlooked this option for these years. Thanks throwing light on this option (reason being most cases we wont edit PARAMETER options (ANSI 92))

    Please share your thought on this view.

  9. Comment by Thani

    HI Dave..,
    I have encountered the same issue, i used to handle this by adding OR condition in Where Clause as below.This will get the result as expected.
    SSG.ORD.ORDERDATE BETWEEN ’01-01-1987 00:00:00′ AND ’31-03-1987 00:00:00′ or SSG.ORD.ORDERDATE is null

    Thanks,
    Thani

  10. Comment by Chuck

    Okay Dave….. I got a related issue that I had hoped would be solved with Nested Derived Tables and forcing the filter in the FROM clause. I had hoped you had an idea for me.

    I have 1a table that I alias twice to join to itself for VARIANCE reporting. There are 5 keys that may or may not be in both data sets. Thus I need to full outer join. However, I need to restrict the table BEFORE doing the full outer join (otherwise, the full outer join is negated.) I was hoping that I could expose a NESTED DERIVED table that the user can add FILTER CONDITIONS to and that “table” would be included in the derived table with the conditions nested, before I outer join. For the longest time I had belived Netsed Derived tables were going to work like this. They do not.

    Any ideas on getting user supplied Filters into the inline tables From cluase?

    thansk!

  11. Comment by Raj

    Dave, your blog provides so much information and very much happy with that. Thanks a ton.

    But i would like to request you to cover a topic in Designer for “Aggregate awareness for other purposes”, if possible.
    Some of the people think that it is just confined to store pre-calculated data to use sum() but i heard, not sure, it can be used for many other purposes.

    Hope you will take up this subject in your next post.

    Thanks
    Raj

  12. Comment by Dave Rathbun

    Hi, the @AggregateAware() has been on my list of topics to cover for quite some time. Unfortunately it never seems to bubble to the top.

  13. Comment by py

    This is good information. But I have a situation where there is a complex join (from two primary key columns) between the tables, and then a condition. Having the complex join, the Advanced tab is not allowing me to choose this drop down. Any ideas on how I can workaround this issue?

  14. Comment by Dave Rathbun

    Can you turn your complex join into two simple joins of one column each?

  15. Comment by py

    I tried doing that. But since all the other joins of the universe are in non-ANSI, and only this joins are ANSI type, my old reports starts to fail.

  16. Comment by Praveen Karnam

    Hi Dave, All the conditions move to ‘From’ instead of ‘ON’ when i use ‘OR’ in the filter conditions instead of ‘AND’. Is this a bug?

  17. Comment by Dave Rathbun

    Unfortunately I do not know. I have not tested it that thoroughly. Have you tried contacting SAP about this?

  18. Comment by franlin dsuza

    Hi Dave,
    You have mentioned earlier(refer comment Dave Rathbun August 20th, 2010 at 8:52 am ) that you would post as some point about techniques regarding getting customers with no orders, in cases where the customer table is huge and if one is interested in getting only customers with no orders. Could not locate the post.
    I have a similar situation and I have used a sub-query e.g select customer from cust_table where cust_no not in (select cust_no from orders). Would appreciate your views.
    Have a good day ahead.
    Thanks
    Franklin.

  19. Comment by Kuntal Bhowmick

    I have tables A and B. They are joined like this… A right outer join B on (A.PK_ID=B.PK_ID).Now I have a self restricting join on B as B.FK_ID=2.So when I use objects from A and B that generates a join like this…A right outer join B on (A.PK_ID=B.PK_ID and B.FK_ID=2).But because of the outer join,self join attributes are getting ignored by oracle.I think one solution can be creation of a derived table of B with self join.
    Please let me know your view on this and if any other solution is possible for this.

  20. Comment by Dave Rathbun

    In the “old days” we might use the (+) syntax on a self restricting (stub) join to include it as an outer restriction. Since I have been primarily working with Teradata lately I have not had a need to update this technique. If I get time, I will try some experiments. But I don’t have a solution for you right now.

  21. Comment by Dave Rathbun

    The fastest way is generally to use a relational operator like MINUS or EXCEPT, whichever your database offers. If this is required often I will built a pre-defined condition object in my universe that contains the logic required. Something like:

    cust.cust_id in (select cust.cust_id c MINUS select cust.cust_id from cust, orders where cust.cust_id = orders.cust_id)

    That will generate a list of customers that have no orders.

  22. Comment by Ramanathan S

    Hi,

    Hope you are fine 🙂

    Thanks for this useful post.

    Can you please clarify/ give suggestion for the following:

    There are only two tables in universe which are joined with one to one cardinality.

    One of the table has 33 million of records, so that even when we ran the report for one region we are not able to see the data in infoview. The report keeps on running and not able to see the data (report keeps on running more than 5 mins).

    Note: In report level there is no calculations it is a drag and drop report

    In that huge table no constraints defined but index defined for some of the columns.

    Any suggestions to improve the performance of the report, so that the user able to see the data in 1 or 2 mins.

    How to determine/set the array fetch size? Is it possible to increase it more than 500?

    Thanks,
    Ramanathan S

  23. Comment by Dave Rathbun

    Performance questions like that are best presented to the DBA team which should have access to table and index structures, database statistics, and so on. Array fetch size is specified as a part of the connection, and yes, it is possible to set it to more than 500, but it has very little impact on query execution speed. It only impacts how big of a “chunk” of data is retrieved from the database in each request.

  24. Comment by Ganesh

    1000 Likes for this post.

  25. Comment by Boppana

    One more way of doing this. Create a data provider with customer (in this case). Merge the new data provider with the main data provider on customer and use the Customer merged dimension in report. This way helps to show all the customer even if they don’t have data also applying any type of filter on the main data provider wont affect the result.

  26. Comment by Peter

    Comment by Ramanathan S September 11th, 2013 at 9:10 pm
    Hi,

    Hope you are fine 🙂

    Thanks for this useful post.

    Can you please clarify/ give suggestion for the following:

    There are only two tables in universe which are joined with one to one cardinality.

    One of the table has 33 million of records, so that even when we ran the report for one region we are not able to see the data in infoview. The report keeps on running and not able to see the data (report keeps on running more than 5 mins).

    Note: In report level there is no calculations it is a drag and drop report

    In that huge table no constraints defined but index defined for some of the columns.

    Any suggestions to improve the performance of the report, so that the user able to see the data in 1 or 2 mins.

    How to determine/set the array fetch size? Is it possible to increase it more than 500?

    Thanks,
    Ramanathan S

    I hope you found an answer already…..
    but one first thing, is there a index on the filter for region?
    second..is there a function on the region filter, like trim, substr, etc?

  27. Comment by Dave Rathbun

    Hi, typically I would talk to the DBA team about this. If you’re trying to do a full table scan of 33 million rows, it’s going to take a while. Do you have any conditions at all? Are the columns referenced in those conditions indexed?

  28. Comment by Allon

    Hi Dave,

    Thanks for a well written and helpful article. I have a related question. Using your example of the CUSTOMER, ORD and ITEM tables: Lets say we made, as you have done, the CUSTOMER to ORD join an outer join but left the ORD to ITEM join as an inner join. Correct me if I’m wrong but I would think any query that involved all 3 tables would only return customers that had orders. Even though the CUSTOMER to ORD join is an outer join, the inner join from ORD to ITEM would mean that the query across all 3 tables would not return customers without orders.

    If I am right, and this is not what we desire, how would we avoid this? Would we need to make the ORD to ITEM join an outer join even though the business rule is that every ORD will have at least one item?

    Thanks,
    Allon

  29. Comment by Allon

    Hi,

    I have answered my own question. From what I can see, BO is clever enough that when a query has a combination of inner and outer joins, it builds the query in such a way that the outer joins will not fail as a result of other failed inner joins. It seems to do this by always putting all the inner joins before the outer joins. That way, there is no need for sub-queries.

    I haven’t found documentation explaining this but have found it by testing to see what BO does. I would be interested to know if anyone can point me to a more formal explanation of this.

    Thanks,
    Allon

  30. Comment by Ananth

    Hi Dave,

    Nice explanation..!

    I’m Using IDT and how do I implement the same in IDT ? I dont see similar options there.

  31. Comment by wenlei

    HI-, Dave,

    thanks a lot for your post on outer join. I face the similar problem and I do appreciate your insight in this regard.
    You said you prefer not using left outer join and there are some better way to show null record. I would very much to know the alternative.
    I try to find your blog on that but not able to. Would you give the link please?
    thanks a lot

  32. Comment by Dave Rathbun

    One way to find missing records is to do a “Minus” or “Except” query. That process can be far more effective than an outer join.

  33. Comment by wenlei

    Hi-, Dave,
    thank you very much for your reply.
    if we are using the minus set function to find out the missing records, that means we create a separate universe (or free hand query in webi), then create a query in webi for missing record, finally combine the result with the query from universe using inner join.
    is this the way you do it?
    again thank you very much.

    Wenlei