Jul 14 2009

What does JOIN_BY_SQL do?

Categories: Universe Design Dave Rathbun @ 8:05 am

We got a new feature / parameter in Designer a while back called JOIN_BY_SQL. This feature allows a universe designer to turn on an option that changes how multiple SQL statements are processed. This is related to the multi-path SQL generation provided by Business Objects, which frankly I think is one of the most powerful features of the product. What is multi-path SQL? What does JOIN_BY_SQL do to change that? Read on for details. 🙂

Multi-Path SQL Generation

There are several types of “traps” that can occur in a relational database. These traps are recognized by Business Objects and even have special names like Fan and Chasm traps. In a nutshell, you cannot use certain tables in certain types of relationships in a single SQL statement and get the correct answer. What Business Objects does is determine how to split the SQL into multiple queries (paths), run each path separately (and sequentially, not in parallel), and then combine the results back into a single cube in the report. This is done in Desktop Intelligence as well as Web Intelligence. This can be used to solve chasm or fan traps, or to automatically generate separate queries for contractory conditions like “current year” and “prior year” results.

This is all great, but what happens if your individual queries bring back tens of thousands of rows? I’ve seen reports with Current Year and Prior Year queries that are synchronized on the report… and each query had about 20,000 rows of data. The results were not pretty. 🙂 The effort made by the client to keep those rows in sync can be substantial, so there is definitely overhead involved. In most cases the dimensional structure of each query is the same and it’s only measures that are different.

JOIN_BY_SQL Description

With the advent of the JOIN_BY_SQL option this can change, at the designer’s option. You can ask the database to do the synchronization of the two (or more) parts of the query paths. This is done by creating a full outer join of two (or more) SQL scripts on the database server rather than on the reporting client. So instead of this:

SELECT dimension, sum(measure1) 
FROM table1, table2 
WHERE table1.id = table2.id
GROUP BY dimension

and this:

SELECT dimension, sum(measure2) 
FROM table1, table3 
WHERE table1.id = table3.id
GROUP BY dimension

and finally a join (or synchronize) option in the client, you get this instead:

SELECT
  COALESCE( F__1.Axis__1,F__2.Axis__1  ),
  F__1.M__1281,
  F__2.M__1284
FROM
     ( 
     SELECT
       dimension1 AS Axis__1,
       sum(measure1) AS M__1281
     FROM
       table1,
       table2
     WHERE
       ( table1.id = table2.id  )
     GROUP BY 1
       )
      F__1
      FULL OUTER JOIN 
     ( 
     SELECT
       dimension1 AS Axis__1,
       sum(measure3) AS M__1284
     FROM
       table1,
       table3
     WHERE
       ( table1.id = table3.id  )
     GROUP BY 1
       )
      F__2
      ON ( F__1.Axis__1=F__2.Axis__1  )

What this does, in a nutshell, is run the two queries separately from each other and then combine the result rows using a full outer join process. A full outer join is required to allow rows to be missing from either of the two selects. This then returns a combined result set with values from dimension 1 combined with total values from measure 2 and measure 3, and saves the report client from having to manage the data merge / sort / display process.

JOIN_BY_SQL Bug Reports

In earlier versions of XIR2 there was a bug in this feature that would mix up dimension and measure column results. It was fairly easy to see this in action as there would be measure values in the dimension columns on a report and vice versa. There was an additional bug that would prevent the SQL from being executed if there were more than 9 passes, as the alias creation algorithm failed to generate a unique and correct SQL syntax at that point.

Query versus Report Performance Considerations

If the individual data providers return a lot of rows, it seems that setting this parameter in your universe can result in a substantial performance improvement. But as I wrote here for smaller data sets this feature can actually result in decreased performance. As a result, I can’t make a definite statement that performance will be better if you activate this parameter in a universe.

Tool Restrictions

This feature only works in Crystal and in Web Intelligence. If you turn this feature on and generate reports in Desktop Intelligence you will not see the alternate query path as shown above. (This is yet another indication to me that Desktop Intelligence is probably on its way out.) Web Intelligence will fall back to the standard query generation if you have a synchronization operation instead of a join. Crystal requires this feature as it will not generate / synchronize multiple query paths.

Conclusion

As with many things, your mileage may vary with this feature. It does allow Crystal to handle multi-pass SQL and it can result in a performance boost when large amounts of data are being merged by the client (or Web Intelligence server). But there are also cases where it may hinder performance as well. This post was intended to serve as a review of how the feature works. Whether you use it or not is ultimately best determined with appropriate performance testing in your specific environment.

21 Responses to “What does JOIN_BY_SQL do?”

  1. Comment by Josh Fletcher

    Good post Dave, I like this feature though haven’t had to use it too much yet. I’ve been compiling a list of universe best practices from my experience and the BusinessObjects training material that I’ll post up shortly.

    – Josh

  2. Comment by Ethan

    Thanks for this info,
    Can you direct me toward where I’d set this option and see it in action?

  3. Comment by Dave Rathbun

    Hi, Ethan, and welcome. Always nice to have another XKCD fan on board. 😉

    To see JOIN_BY_SQL in action, you need to add it to your universe parameters and then set the value to Yes. I covered how to do this in an earlier presentation and neglected to repeat it here. The presentation has screen shots that help make it all clear.

    If you go to my presentations page and download “Tales From A Universe Ninja, Part One” and review pages 45 through 51 it includes more details as well as pictures.

    Pictures help. 😎

  4. Comment by Udi

    Hi, and thanks for your posts, Dave.
    A question I have – the combination between the queries is with FULL OUTER JOIN.
    Can that behavior be changed to combine the queries with INNER JOIN ?
    This cause us lots of troubles…

    Thanks !

  5. Comment by Dave Rathbun

    The standard is to do a FULL OUTER JOIN in order to replicate what Web Intelligence does. In all the times I have used this, I have wanted a full outer join. 🙂 I can’t think of any reasons off the top of my head to want to change this.

  6. Comment by Ganesh

    Hi Dave,

    Thanks for this useful information. It’s really very useful. But just want to know if there is any way to use LEFT OUTER JOIN or RIGHT OUTER JOIN instead of FULL OUTERJOIN in this case. We have some performance issues with FULL OUTER JOIN. If I modify the same query to use LEFT OUTERJOIN it just takes 2 minutes whereas the FULL OUTERJOIN query takes around 20 min…

    Please share your inputs or any alternatives available.

    Thanks!

  7. Comment by Dave Rathbun

    Ganesh, thanks for your comment. As I am sure you realize, if you do a left join you’re not getting the same results. In my opinion, full outer join is the proper syntax to use for this feature. It is designed to replace the data provider synchronization; if you need some other result then you probably need to investigate views or derived tables or some changes to your ETL.

  8. Comment by Shiva

    Hi Dave,

    After reading your blog i know that JOIN_BY_SQL is different from Synchronize…..in term of performance(for large amount of data) as it occurs at database level.

    But what are the other benefits from it?
    Could you plase give another small example on what kind of problems it can resolve(Consider a real time DW scenario)? What is the best substitute for this if this is not giving the desired result?

    Thanks a ton.

    Regards,
    Shiva

  9. Comment by Andreas

    @Siva:
    It will create just ONE microbe at the report level, thereby increasing performance.
    And would do you mean exactly by: “not giving the right result”, please?

  10. Comment by Dave Rathbun

    JOIN_BY_SQL can have a performance impact but the primary function is to provide multi-pass SQL capabilities to tools (like Crystal) that don’t otherwise support it.

  11. Comment by Shiva

    Thankyou Andreas.

    Thankyou Dave.

  12. Comment by Rakesh_K

    Hi Dave,

    Can you put some light on disadvantage of this parameter. Why by default it’s not been set by BusinessObjects itself.

  13. Comment by Dave Rathbun

    I can only speculate as I don’t know for sure.

    Typically when Business Objects introduces a new parameter they leave it off unless it’s adressing a specific bug. The reason for that would be they have no idea how it’s going to impact all of the client universes on multiple different platforms. It is also possible for this parameter to have a negative impact on performance.

    The primary advantage of this feature was posted in comment #10 earlier.

  14. Comment by gobinath

    Hi,
    I have single report with multiple data providers(same database(3 sql)),I want to know if i schedule this report to user it will run parallel or serial.Shall i see the execution,If its is run serially is there any possibility to make it as parallel.

    My report take more than 3 hours to run how can i reduce the time

  15. Comment by Dave Rathbun

    Whether you use JOIN_BY_SQL or not, the data providers will still run serial.

  16. Comment by Neeta

    Hi Dave,

    Thanks for the detailed information.
    I have an observation. When I create a similar query as above using Query Panel from within the Designer I get “Incompatible Combination of Objects” error whereas the query runs successfully whether JOIN_BY_SQL is Yes or No from Webi. I checked the SQL from Query Panel in designer and I can see two select statements which is case in Webi also if JOIN_BY_SQL = No but webi generates no error of incompatible objects. Why so?

    Thanks,

  17. Comment by Dopple

    When I try to use combined queries with a query where join_by_sql is used I get an error “An internal error occured while calling ‘processDPCommands’ API. (Error: ERR_WIS_30270)”. Happens on all of our environments (Dev, Test, Production) and happens on all universe I tested which us JOIN_BY_SQL=’Yes’.

    Has anyone else experienced this?

  18. Comment by Phil Morris

    Hi Dave,

    Using this seem to clobber any Query Sorts. The order by simply doesn’t appear in the SQL – in V4.0 SP2 anyway! This isn’t usually a problem in WebI land, but I have a dashboard that depends on ordering, but has the exact problem that JOIN_BY_SQL solves. Just can’t win!
    Have you come across this issue, and possibly found a solution?

  19. Comment by Dave Rathbun

    Phil, I have run across a “hack” years back that allowed a sort to appear even when the requesting tool (query as a web service in this case) did not support it. It involves using the END_SQL parameter in the universe to append an Order By clause to every SQL generated by that universe. It’s extra overhead for non-dashboard queries, and in the case of Web Intelligence the order by is pointless because the report engine will re-sort the rows before displaying them anyway. But if you absolutely need an order by clause, that’s one option that I have seen used.

    You have to make sure that your desired sort column is the first result object in your query definition of course.

  20. Comment by Rene Brito

    Dave,
    What is the difference between MultipleFlows and Full Outer Join when generating multiple SQL statements on a Webi report? Are there any pros and cons?

  21. Comment by Neeraj Singla

    Hi Dave,
    is it possible to create flexible joins?
    We have one table named as busy hour table where we have busy hour data on different granularity like cell, wbts, rnc.
    we have another table named as PM table where we have hour raw data for whole network.
    Now, we want flexible joins as per user want.
    If user want cell level data then accordingly 3 level of joins along with date-time join should implement.
    if user select wbts level data then accordingly 2 level of joins along with date-time join must come in picture.
    and if user select rnc level data then only rnc and date-time joins should be used in the SQL.
    So basically can we manage joins efficiently?