Friday, March 30, 2012

One More Question On Running Parallel Queries

An Execute SQL task takes 1 min to run a statement "insert into Mytable select * from view_using_joins"

Output: 10,225 rows affected.

But a Dataflow task configured to fetch data from the same view_using_joins into MyTable takes hours to do the same.

Could you please explain why this is so?

Thanks

Subhash Subramanyam

I am not surprised that the Execute SQL Task is quicker. When you're inserting from one table to another in the same database then SSIS isn't going to outperform the database engine.

I don't know why it is hours slower, though. There isn't really enough information here to say. What destination adapter are you using? Is the package running on the same machine as the database? Are you doing transformations on the way?

-Jamie

|||

Hi Jamie,

Thanks for your reply.

1) I am running packages on a different server.

2) I am using OLE DB adapters for the source and destination. The database as well as the server are the same for both source and destination.

3) No transformations in between

I wonder if we have to apply some extra settings here to achieve the same duration as the Execute SQL task?

Thanks

Subhash Subramanyam

|||

Have you chosen Fast Load on your destination?

Thanks.

|||

Subhash512525 wrote:

Hi Jamie,

Thanks for your reply.

1) I am running packages on a different server.

2) I am using OLE DB adapters for the source and destination. The database as well as the server are the same for both source and destination.

3) No transformations in between

I wonder if we have to apply some extra settings here to achieve the same duration as the Execute SQL task?

Thanks

Subhash Subramanyam

You're running the package on a different server? I'd suggest that's your problem right there. The data is going to have to go over the network - obviously this is going to take time.

Also, in your destination are you inserting with Fast Load?

I say again, in this scenario the data-flow isn't going to outperform the Execute SQL Task.

-Jamie

|||

Great, Jamie, you figured it out.

If you don't mind spending a few minutes here, I am coming back to my actual scenario.

I would certainly welcome specific views from the experts here on each of these questions: Phil B, Rafael S, Darren G, Jwelch, JayH, Scott B, Ashwin S, Brian, Bob, Donald F, and many others I am not yet aware of.

Scenario:

1) My SSIS packages run on a US server, scheduled during the night.

2) Each package runs 6-8 queries, each with joins, in parallel, pulling data from an Oracle database source (UNIX) in Europe. The total data extracted does not exceed 5 million rows.

3) The destination database is in the US.

4) Network bandwidth: 2 Mbps.

The problem is that these packages take ages to execute (ranging from 25 to 30 hours).

My questions are:

1) Where should I run the SSIS packages to get better performance?

2) How can I perform only an incremental load (using a Dataflow task), taking performance into consideration? (Any links would help.)

3) Does overlap of the schedules for SSIS packages affect performance?

4) Are there any limits on the number of queries running in parallel to pull data from the Oracle source?

5) Would it be best if I spool the query results into flat files on the local system where the source (Oracle database) runs in Europe, and then FTP them to a shared server in the US, which I can use for importing into the destination table?

Waiting for your reply,

Many Thanks and Regards

Subhash Subramanyam

|||

Thanks, Bob. Please give your answers to my questions, if you don't mind.

|||

The more work you can do to avoid keeping the data transmission "pipe" open, the better.

Perform your source query in Europe, export that to a file, compress it, and then FTP it to the US. Then uncompress it, and load it with SSIS.
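A minimal sketch of the compress/uncompress ends of that pipeline, assuming the extract has already been spooled to a flat file on the Europe side (file names and the FTP host below are hypothetical, not from the thread):

```python
import gzip
import shutil

def compress_extract(src_path: str, dest_path: str) -> None:
    """gzip the exported flat file before sending it across the WAN."""
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

def decompress_extract(src_path: str, dest_path: str) -> None:
    """Uncompress on the US side before the SSIS flat-file load."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# The FTP hop between the two would look roughly like this
# (hypothetical host and credentials):
#
# from ftplib import FTP
# with FTP("us-share.example.com") as ftp:
#     ftp.login("user", "password")
#     with open("extract.csv.gz", "rb") as f:
#         ftp.storbinary("STOR extract.csv.gz", f)
```

Text extracts from joined views tend to compress well, so the time the link is held open shrinks accordingly.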

The idea is to keep your transmissions across "the pond" as short as possible.

|||

From your question #2, I'm assuming you are pulling all rows every night. As Phil mentioned, you want to minimize how much data you are actually moving, so I'd definitely make this incremental. A common way to implement that is by checking modified dates on the source system via a WHERE clause in your source SELECT statements. Store the range of modified dates that you retrieve, and when the package runs the next night, start from the end of the previous range.
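That watermark idea can be sketched as follows; the table and column names (MY_SOURCE_TABLE, MODIFIED_DATE) are made up for illustration, not taken from the thread:

```python
from datetime import datetime

def build_incremental_query(last_loaded: datetime) -> str:
    """Return the source SELECT restricted to rows modified since the
    previous successful run (Oracle-style TO_DATE literal)."""
    return (
        "SELECT * FROM MY_SOURCE_TABLE "
        f"WHERE MODIFIED_DATE > TO_DATE('{last_loaded:%Y-%m-%d %H:%M:%S}', "
        "'YYYY-MM-DD HH24:MI:SS')"
    )

# After a successful load, persist the maximum MODIFIED_DATE you saw;
# the next night's run starts from that value instead of re-pulling
# every row across the WAN.
```

In SSIS this would typically be wired up by storing the watermark in a control table, reading it into a package variable, and using the variable in the source query expression.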

If you don't have modified dates in the source system, consider adding them. Alternatives are using triggers to track changes, or using a change data capture tool - I believe Oracle has one, and SQL Server will have one with SQL Server 2008.

|||

One more question:

6) If I have 6-8 queries running in parallel, does having a common connection manager (for an Oracle source) for all of them perform better than having distinct connection managers?

I am still expecting suggestions and views from the rest of the experts on the six questions listed here.

Regards

Subhash Subramanyam

|||

Subhash512525 wrote:

6) If I have 6-8 queries running in parallel, does having a common connection manager (for an Oracle source) for all of them perform better than having distinct connection managers?

It depends. :) Using a single one should result in the same performance as having several. Having a single connection manager doesn't mean that SSIS won't open multiple connections to the database: a connection manager manages multiple connections, unless you force it to reuse a single connection by setting RetainSameConnection.

A related note - in your scenario, have you tested whether performance is better if you run all queries sequentially or in parallel (by using precedence constraints on the data flow tasks)?

|||

jwelch wrote:

A related note - in your scenario, have you tested whether performance is better if you run all queries sequentially or in parallel (by using precedence constraints on the data flow tasks)?

Jwelch, this seems more practical. I'll test this and let you know.

Thanks

Subhash
