Jul 14 2009
What does JOIN_BY_SQL do?
We got a new feature / parameter in Designer a while back called JOIN_BY_SQL. This feature allows a universe designer to turn on an option that changes how multiple SQL statements are processed. This is related to the multi-path SQL generation provided by Business Objects, which frankly I think is one of the most powerful features of the product. What is multi-path SQL? What does JOIN_BY_SQL do to change that? Read on for details.
Multi-Path SQL Generation
There are several types of “traps” that can occur in a relational database. These traps are recognized by Business Objects and even have special names like Fan and Chasm traps. In a nutshell, you cannot use certain tables in certain types of relationships in a single SQL statement and get the correct answer. What Business Objects does is determine how to split the SQL into multiple queries (paths), run each path separately (and sequentially, not in parallel), and then combine the results back into a single cube in the report. This is done in Desktop Intelligence as well as Web Intelligence. This can be used to solve chasm or fan traps, or to automatically generate separate queries for contractory conditions like “current year” and “prior year” results.
This is all great, but what happens if your individual queries bring back tens of thousands of rows? I’ve seen reports with Current Year and Prior Year queries that are synchronized on the report… and each query had about 20,000 rows of data. The results were not pretty.
The effort made by the client to keep those rows in sync can be substantial, so there is definitely overhead involved. In most cases the dimensional structure of each query is the same and it’s only measures that are different.
JOIN_BY_SQL Description
With the advent of the JOIN_BY_SQL option this can change, at the designer’s option. You can ask the database to do the synchronization of the two (or more) parts of the query paths. This is done by creating a full outer join of two (or more) SQL scripts on the database server rather than on the reporting client. So instead of this:
SELECT dimension, sum(measure1) FROM table1, table2 WHERE table1.id = table2.id GROUP BY dimension
and this:
SELECT dimension, sum(measure2) FROM table1, table3 WHERE table1.id = table3.id GROUP BY dimension
and finally a join (or synchronize) option in the client, you get this instead:
SELECT
COALESCE( F__1.Axis__1,F__2.Axis__1 ),
F__1.M__1281,
F__2.M__1284
FROM
(
SELECT
dimension1 AS Axis__1,
sum(measure1) AS M__1281
FROM
table1,
table2
WHERE
( table1.id = table2.id )
GROUP BY 1
)
F__1
FULL OUTER JOIN
(
SELECT
dimension1 AS Axis__1,
sum(measure3) AS M__1284
FROM
table1,
table3
WHERE
( table1.id = table3.id )
GROUP BY 1
)
F__2
ON ( F__1.Axis__1=F__2.Axis__1 )
What this does, in a nutshell, is run the two queries separately from each other and then combine the result rows using a full outer join process. A full outer join is required to allow rows to be missing from either of the two selects. This then returns a combined result set with values from dimension 1 combined with total values from measure 2 and measure 3, and saves the report client from having to manage the data merge / sort / display process.
JOIN_BY_SQL Bug Reports
In earlier versions of XIR2 there was a bug in this feature that would mix up dimension and measure column results. It was fairly easy to see this in action as there would be measure values in the dimension columns on a report and vice versa. There was an additional bug that would prevent the SQL from being executed if there were more than 9 passes, as the alias creation algorithm failed to generate a unique and correct SQL syntax at that point.
Query versus Report Performance Considerations
If the individual data providers return a lot of rows, it seems that setting this parameter in your universe can result in a substantial performance improvement. But as I wrote here for smaller data sets this feature can actually result in decreased performance. As a result, I can’t make a definite statement that performance will be better if you activate this parameter in a universe.
Tool Restrictions
This feature only works in Crystal and in Web Intelligence. If you turn this feature on and generate reports in Desktop Intelligence you will not see the alternate query path as shown above. (This is yet another indication to me that Desktop Intelligence is probably on its way out.) Web Intelligence will fall back to the standard query generation if you have a synchronization operation instead of a join. Crystal requires this feature as it will not generate / synchronize multiple query paths.
Conclusion
As with many things, your mileage may vary with this feature. It does allow Crystal to handle multi-pass SQL and it can result in a performance boost when large amounts of data are being merged by the client (or Web Intelligence server). But there are also cases where it may hinder performance as well. This post was intended to serve as a review of how the feature works. Whether you use it or not is ultimately best determined with appropriate performance testing in your specific environment.
Good post Dave, I like this feature though haven’t had to use it too much yet. I’ve been compiling a list of universe best practices from my experience and the BusinessObjects training material that I’ll post up shortly.
- Josh
Thanks for this info,
Can you direct me toward where I’d set this option and see it in action?
Hi, Ethan, and welcome. Always nice to have another XKCD fan on board.
To see JOIN_BY_SQL in action, you need to add it to your universe parameters and then set the value to Yes. I covered how to do this in an earlier presentation and neglected to repeat it here. The presentation has screen shots that help make it all clear.
If you go to my presentations page and download “Tales From A Universe Ninja, Part One” and review pages 45 through 51 it includes more details as well as pictures.
Pictures help.
Hi, and thanks for your posts, Dave.
A question I have – the combination between the queries is with FULL OUTER JOIN.
Can that behavior be changed to combine the queries with INNER JOIN ?
This cause us lots of troubles…
Thanks !
The standard is to do a FULL OUTER JOIN in order to replicate what Web Intelligence does. In all the times I have used this, I have wanted a full outer join.
I can’t think of any reasons off the top of my head to want to change this.
Hi Dave,
Thanks for this useful information. It’s really very useful. But just want to know if there is any way to use LEFT OUTER JOIN or RIGHT OUTER JOIN instead of FULL OUTERJOIN in this case. We have some performance issues with FULL OUTER JOIN. If I modify the same query to use LEFT OUTERJOIN it just takes 2 minutes whereas the FULL OUTERJOIN query takes around 20 min…
Please share your inputs or any alternatives available.
Thanks!
Ganesh, thanks for your comment. As I am sure you realize, if you do a left join you’re not getting the same results. In my opinion, full outer join is the proper syntax to use for this feature. It is designed to replace the data provider synchronization; if you need some other result then you probably need to investigate views or derived tables or some changes to your ETL.
Hi Dave,
After reading your blog i know that JOIN_BY_SQL is different from Synchronize…..in term of performance(for large amount of data) as it occurs at database level.
But what are the other benefits from it?
Could you plase give another small example on what kind of problems it can resolve(Consider a real time DW scenario)? What is the best substitute for this if this is not giving the desired result?
Thanks a ton.
Regards,
Shiva
@Siva:
It will create just ONE microbe at the report level, thereby increasing performance.
And would do you mean exactly by: “not giving the right result”, please?
JOIN_BY_SQL can have a performance impact but the primary function is to provide multi-pass SQL capabilities to tools (like Crystal) that don’t otherwise support it.
Thankyou Andreas.
Thankyou Dave.
Hi Dave,
Can you put some light on disadvantage of this parameter. Why by default it’s not been set by BusinessObjects itself.
I can only speculate as I don’t know for sure.
Typically when Business Objects introduces a new parameter they leave it off unless it’s adressing a specific bug. The reason for that would be they have no idea how it’s going to impact all of the client universes on multiple different platforms. It is also possible for this parameter to have a negative impact on performance.
The primary advantage of this feature was posted in comment #10 earlier.