Jul 14 2009
We got a new feature / parameter in Designer a while back called JOIN_BY_SQL. This feature allows a universe designer to turn on an option that changes how multiple SQL statements are processed. This is related to the multi-path SQL generation provided by Business Objects, which frankly I think is one of the most powerful features of the product. What is multi-path SQL? What does JOIN_BY_SQL do to change that? Read on for details.
Multi-Path SQL Generation
There are several types of “traps” that can occur in a relational database. These traps are recognized by Business Objects and even have special names like Fan and Chasm traps. In a nutshell, you cannot use certain tables in certain types of relationships in a single SQL statement and get the correct answer. What Business Objects does is determine how to split the SQL into multiple queries (paths), run each path separately (and sequentially, not in parallel), and then combine the results back into a single cube in the report. This is done in Desktop Intelligence as well as Web Intelligence. This can be used to solve chasm or fan traps, or to automatically generate separate queries for contractory conditions like “current year” and “prior year” results.
This is all great, but what happens if your individual queries bring back tens of thousands of rows? I’ve seen reports with Current Year and Prior Year queries that are synchronized on the report… and each query had about 20,000 rows of data. The results were not pretty. The effort made by the client to keep those rows in sync can be substantial, so there is definitely overhead involved. In most cases the dimensional structure of each query is the same and it’s only measures that are different.
With the advent of the JOIN_BY_SQL option this can change, at the designer’s option. You can ask the database to do the synchronization of the two (or more) parts of the query paths. This is done by creating a full outer join of two (or more) SQL scripts on the database server rather than on the reporting client. So instead of this:
SELECT dimension, sum(measure1) FROM table1, table2 WHERE table1.id = table2.id GROUP BY dimension
SELECT dimension, sum(measure2) FROM table1, table3 WHERE table1.id = table3.id GROUP BY dimension
and finally a join (or synchronize) option in the client, you get this instead:
SELECT COALESCE( F__1.Axis__1,F__2.Axis__1 ), F__1.M__1281, F__2.M__1284 FROM ( SELECT dimension1 AS Axis__1, sum(measure1) AS M__1281 FROM table1, table2 WHERE ( table1.id = table2.id ) GROUP BY 1 ) F__1 FULL OUTER JOIN ( SELECT dimension1 AS Axis__1, sum(measure3) AS M__1284 FROM table1, table3 WHERE ( table1.id = table3.id ) GROUP BY 1 ) F__2 ON ( F__1.Axis__1=F__2.Axis__1 )
What this does, in a nutshell, is run the two queries separately from each other and then combine the result rows using a full outer join process. A full outer join is required to allow rows to be missing from either of the two selects. This then returns a combined result set with values from dimension 1 combined with total values from measure 2 and measure 3, and saves the report client from having to manage the data merge / sort / display process.
JOIN_BY_SQL Bug Reports
In earlier versions of XIR2 there was a bug in this feature that would mix up dimension and measure column results. It was fairly easy to see this in action as there would be measure values in the dimension columns on a report and vice versa. There was an additional bug that would prevent the SQL from being executed if there were more than 9 passes, as the alias creation algorithm failed to generate a unique and correct SQL syntax at that point.
Query versus Report Performance Considerations
If the individual data providers return a lot of rows, it seems that setting this parameter in your universe can result in a substantial performance improvement. But as I wrote here for smaller data sets this feature can actually result in decreased performance. As a result, I can’t make a definite statement that performance will be better if you activate this parameter in a universe.
This feature only works in Crystal and in Web Intelligence. If you turn this feature on and generate reports in Desktop Intelligence you will not see the alternate query path as shown above. (This is yet another indication to me that Desktop Intelligence is probably on its way out.) Web Intelligence will fall back to the standard query generation if you have a synchronization operation instead of a join. Crystal requires this feature as it will not generate / synchronize multiple query paths.
As with many things, your mileage may vary with this feature. It does allow Crystal to handle multi-pass SQL and it can result in a performance boost when large amounts of data are being merged by the client (or Web Intelligence server). But there are also cases where it may hinder performance as well. This post was intended to serve as a review of how the feature works. Whether you use it or not is ultimately best determined with appropriate performance testing in your specific environment.