Bulk insert into OpenEdge database table

Posted by dbeavon on 05-Dec-2019 23:12

SQL Server has a network client API for inserting large numbers of rows in a single round trip to the database (aka "bulk insert").

Is there any capability in OpenEdge for something like this?  (I'm looking for something that exists either on the ABL *or* on the SQL92 side of things.)

Bulk deletes are possible via a delete statement in SQL92.  But I'm not certain how I can accomplish bulk inserts.  The best I've ever been able to do is use ABL, but I think that makes about three round trips over the network for every record created.  The performance of that tops out at around ~2,000 records per second.
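
For illustration, the record-at-a-time approach I'm describing looks roughly like this (ttOrder and the Order table are stand-ins for my actual schema):

/* each iteration is its own transaction; over a client/server
   connection every record created pays multiple network round trips */
DEFINE TEMP-TABLE ttOrder NO-UNDO LIKE Order.

FOR EACH ttOrder:
    DO TRANSACTION:
        CREATE Order.
        BUFFER-COPY ttOrder TO Order.
        RELEASE Order.
    END.
END.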

Ideally a bulk delete and insert could be performed within a single transaction.  But that isn't strictly necessary for my current use-case.  I don't mind separate transactions if they are fast.

I see that there is a Progress DataDirect driver that supports bulk load, but it doesn't seem that this driver supports the Progress database.

Any pointers would be appreciated.

All Replies

Posted by Mohd Sayeed Akthar on 06-Dec-2019 05:24

>> Is there any capability in OpenEdge for something like this?

You are right: OE-SQL does not support bulk insert.  We do have a feature in the pipeline to support bulk insert from OE-SQL.

>> Any pointers would be appreciated.

I am sure you have already thought about this, but I still wanted to share it, since the number of transactions does not matter here: the insert load can be divided into 3-4 groups and fed to 3-4 different SQL clients.  This does not reduce the overall number of round trips, but it might reduce the overall loading time a bit (see the sketch below).
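
A minimal sketch of the partitioning idea, in ABL terms (loader.p, orders.d, and the Order table are just examples; the same split works for SQL clients feeding INSERT statements):

/* loader.p -- load every Nth row of a plain export file.
   Run one session per partition, e.g. 0, 1, 2 with ipPartCount = 3. */
DEFINE INPUT PARAMETER ipPartition AS INTEGER NO-UNDO.  /* 0-based */
DEFINE INPUT PARAMETER ipPartCount AS INTEGER NO-UNDO.

DEFINE VARIABLE iLine AS INTEGER   NO-UNDO.
DEFINE VARIABLE cSkip AS CHARACTER NO-UNDO.

INPUT FROM VALUE("orders.d").
REPEAT:
    IF iLine MODULO ipPartCount = ipPartition THEN DO:
        CREATE Order.
        IMPORT Order.
    END.
    ELSE
        IMPORT UNFORMATTED cSkip.  /* row belongs to another partition */
    iLine = iLine + 1.
END.
INPUT CLOSE.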

Thanks,

Akthar.

Posted by Rob Fitzpatrick on 06-Dec-2019 13:36

> The best I've ever been able to do is use ABL, but I think that makes about three round trips over the network for every record created.

Does your bulk insert need to happen via client/server?  A shared memory connection will be faster.

> Ideally a bulk delete and insert could be performed within a single transaction.

Why is this ideal? A single long-running transaction with a lot of notes can cause BI growth and, in the event of a crash, could take a long time (and even more BI space) to undo.

> But that isn't strictly necessary for my current use-case.

Can you say more about your use-case?  Is this something you would do regularly in the normal course of business, or more once-in-a-blue-moon, e.g. an application upgrade scenario?  Do you have the option of taking any downtime around the time of the bulk insert?  

There are some techniques you can leverage, though you want to weigh their costs and benefits carefully.  First, you can increase the BI cluster size so you checkpoint less frequently (but you need downtime, as changing the cluster size requires a truncate bi both before and after the insert).  Second, you can use buffered I/O (-r) to speed updates, but this means taking a full backup before you begin (as well as a DB restart before and after the insert) and keeping the application offline during the bulk insert, as you never want clients doing work while you're running with -r.  It can make a mass load significantly faster, though.  Caveat: this is a very brief description, and you should research -r carefully to understand the risks of using it in production.  Some links:

slides: pugchallenge.org/.../321_Database_IO.pptx

audio: pugchallenge.org/.../321_Database_IO.mp3

I suggest you benchmark these techniques in a non-prod environment with prod-like configuration to weigh the performance benefits.
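
For reference, the overall sequence I have in mind would look something like this ("mydb" is a placeholder and the sizes are examples, not recommendations; verify the exact parameters for your version before using them):

proutil mydb -C truncate bi -bi 16384   # larger BI cluster size (db offline)
proserve mydb -r                        # start with buffered I/O; full backup taken first
# ... run the bulk load with the application offline ...
proshut mydb -by
proutil mydb -C truncate bi -bi 512     # restore the original cluster size
proserve mydb                           # restart normally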

> I am sure you have already thought about this, but I still wanted to share it, since the number of transactions does not matter here.

The number of transactions for a bulk insert determines the number of RL_TBGN and RL_TEND notes written to BI and AI.  Fewer transactions (via transaction batching) does mean fewer of these writes.  How much this matters would be a function of the size of a begin and end note, compared with the volume of notes written for one record insert and its index keys.  Note, again, that BI growth may be a side-effect of transaction batching, depending on your batch size.
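
As a sketch, transaction batching looks something like this (ttOrder and the batch size of 100 are just examples):

/* one RL_TBGN/RL_TEND pair per batch of 100 creates,
   instead of one pair per record */
DEFINE VARIABLE i AS INTEGER NO-UNDO.

REPEAT TRANSACTION:
    DO i = 1 TO 100:
        FIND NEXT ttOrder NO-ERROR.
        IF NOT AVAILABLE ttOrder THEN LEAVE.
        CREATE Order.
        BUFFER-COPY ttOrder TO Order.
    END.
    IF NOT AVAILABLE ttOrder THEN LEAVE.  /* all rows loaded */
END.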

Posted by dbeavon on 06-Dec-2019 15:08

@Mohd Sayeed Akthar

>> I still wanted to share it, since the number of transactions does not matter here: the insert load can be divided into 3-4 groups and fed to 3-4 different SQL clients

While the "delete" and "insert" operations can happen in separate transactions, it is better if each of those two operations is atomic.  I already use parallel operations quite often when I'm interacting with the OE database (especially when using "client/server" connectivity from ABL).  But normally this is NOT for database updates - it is just when I'm pulling NO-LOCK (uncommitted) data out as fast as possible via FIND statements.

@Rob Fitzpatrick

>> Why is this ideal? A single long-running transaction with a lot of notes can cause BI growth and, in the event of a crash, could take a long time (and even more BI space) to undo.

My point is that it should NOT be "long-running" in terms of duration ... but it must still insert lots of data.  Whatever is reasonable to do for a "shared memory" client should be just as reasonable for a remote client.  I.e. I'm looking for an insert operation that can happen over the network with performance comparable to doing it in "shared memory".  Our PASOE servers (where all our ABL logic is hosted) are remote from the database and are load-balanced and fault-tolerant.  They can accomplish just about everything that they could when connected in "shared memory", with a few exceptions (like large inserts).  Sometimes we have to fall back on SQL92 for these types of things ... although in the case of inserts, SQL does NOT seem to have the bulk insert functionality yet either.

>> Can you say more about your use-case?

Normal business operations, like calculating a large financial journal entry at the end of a fiscal week and inserting the calculated results quickly in bulk (i.e. hundreds or thousands of records).

Another would be for the purpose of running server-side joins.  Let's say you have a UI screen where the user has prepared a filtered list of 1000 arbitrary sales orders ... and they would like to see all the related details.  Instead of the (typical) approach of looping through these individually and gathering the data in separate round trips (parallel or otherwise), it would be nice to insert the order numbers back INTO the database in bulk (into a utility table of some kind), which would facilitate subsequent server-side-join queries that gather all the records related to those order numbers.
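
As a rough sketch of what I mean (OrderFilter would be the utility table; all names here are hypothetical) - the catch being that step 1 is itself the bulk insert that is slow today:

DEFINE VARIABLE cSessionId AS CHARACTER NO-UNDO.

/* step 1: push the user's filtered order numbers into the utility table */
DO TRANSACTION:
    FOR EACH ttSelectedOrder:
        CREATE OrderFilter.
        ASSIGN OrderFilter.SessionId = cSessionId
               OrderFilter.OrderNum  = ttSelectedOrder.OrderNum.
    END.
END.

/* step 2: one server-side join gathers every related detail record */
FOR EACH OrderFilter NO-LOCK
    WHERE OrderFilter.SessionId = cSessionId,
    EACH OrderLine NO-LOCK
    WHERE OrderLine.OrderNum = OrderFilter.OrderNum:
    /* accumulate results */
END.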

The bottlenecks are NOT really at the database level (I/O or transaction limitations).  They are related to application logic (limitations in "client-server" ABL ... and limitations in SQL features.)

Conceptually speaking, bulk inserts aren't profoundly new or interesting for database engines.  I think Progress should have this capability on *both* the SQL and ABL side of things.  For example, perhaps there could be a "COPY-BUFFER(S)" operation where you send an entire TT of data back into a database table from ABL in a single logical round trip, from the programmer's perspective.  (_mprosrv should not have to rely on a long, chatty conversation with PASOE before it inserts every single record.)
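
The closest thing available today, as far as I know, is shipping the whole temp-table to an AppServer/PASOE procedure (connected to the database in shared memory) in one logical round trip and doing the creates there - which is exactly the kind of hand-rolled utility I'd rather not need.  A sketch, with hypothetical names:

/* client side (hAppServer is assumed to be connected already,
   e.g. via CREATE SERVER + CONNECT) */
DEFINE VARIABLE hAppServer AS HANDLE NO-UNDO.
DEFINE TEMP-TABLE ttOrder NO-UNDO LIKE Order.
/* ... connect hAppServer, populate ttOrder ... */
RUN bulkInsert.p ON SERVER hAppServer (INPUT TABLE ttOrder).

/* server side: bulkInsert.p */
DEFINE TEMP-TABLE ttOrder NO-UNDO LIKE Order.
DEFINE INPUT PARAMETER TABLE FOR ttOrder.

DO TRANSACTION:
    FOR EACH ttOrder:
        CREATE Order.
        BUFFER-COPY ttOrder TO Order.
    END.
END.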

Of course ABL bulk inserts might seem easier than they really are.  I would guess the operation would have to be disallowed in the presence of ABL triggers and such.

Posted by Thomas Mercer-Hursh on 06-Dec-2019 15:24

One might note that a dictionary load is a form of bulk insert and could easily be automated with a file transfer.
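
E.g. if the data is transferred as a standard .d dump file, a couple of lines of ABL can drive the same loader the Data Dictionary uses (table and file names here are hypothetical):

/* load customer.d into the Customer table via the dictionary load utility */
RUN prodict/load_d.p ("Customer", "customer.d").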

Posted by Rob Fitzpatrick on 06-Dec-2019 15:41

> Conceptually speaking, bulk inserts aren't profoundly new or interesting for database engines.  I think Progress should have this capability on *both* the SQL and ABL side of things.

Definitely agree, it's an area for improvement.

Posted by gus bjorklund on 06-Dec-2019 19:04

perhaps - but i think that is stretching the definition of "bulk insert" a bit.

Posted by dbeavon on 06-Dec-2019 19:21

@gus I think you are saying that we shouldn't have to resort to using FTP and data-dictionary tools to accomplish our record insertions.   I agree.

Nobody should need to "reinvent the wheel" and create their own appserver-like utility just to insert a bunch of records from a remote client application.

What is quite frustrating to me are the limitations of the DataDirect drivers where the OpenEdge database is concerned.  This isn't the first time that the ODBC driver has excluded the Progress OE database from a feature that is available for other databases.  Here is another example (DataDirect-implemented connection pooling): knowledgebase.progress.com/.../000038726

It would be very interesting to hear the reasoning that is used by DataDirect when they take a pass on implementing an important feature for the OE database.  It would seem to me that the OE database should be getting *preferential* treatment rather than waiting at the bottom of the list.  It seems like DataDirect's goals aren't very well aligned with the goals for the OE database.

Posted by ske on 10-Dec-2019 13:13

You seem to run into various problems all the time due to connecting your PASOE servers to the database over the network instead of through shared memory.

I always thought, ever since I first read the documentation for the original Progress AppServer, that the intended/optimal architecture would be to run an AppServer on the same machine as the database, connected through shared memory, as an extension of the capabilities of the database server itself.  And if you really needed more AppServers on different machines - for performance, load balancing, extra computation power, or for end-user clients to connect to - it appeared to me that you could have those too, and let them talk to the main AppServer over the network for the database-heavy operations that really need the performance of running, in effect, on the database server itself through shared memory.

For instance, to do such things as the bulk inserts you mention now, at least while the database server does not support that operation natively.  And maybe some of the other things you have asked about previously too.  (Other operations can still connect directly to the database over the network, when the services of the main AppServer are not needed.)

Even though that main Appserver may be viewed as a possible single point of failure, I don't see how that is any worse than the database itself already being a possible single point of failure. It's just a way of being able to extend the functionality of the database server without having to wait for Progress to do it for you. Just don't put all processing of your application on the main Appserver, only the operations that really need to be there.

I haven't tried this architecture myself though, and I haven't seen it being discussed since then, and I have no idea if it fits PASOE too, so I don't know if it works for real. It is just an idea I always thought sounded interesting, and seemed to be implied by the documentation. It would be interesting to know if it works.

Posted by dbeavon on 10-Dec-2019 14:37

You are right that I prefer to treat PASOE as a middle tier for business logic.  Many of the features of web servers (load balancing and fault tolerance) work best in the middle tier.  Moreover, hardware is provisioned VERY differently for web servers vs. database servers.  (I.e. you build a big, expensive server to run the database, and it is configured with lots of good disk, RAM, etc.  Its failover model is quite different as well - based on failover clustering of the disk resource.)

The PASOE web servers, on the other hand, should typically live on very inexpensive hardware.  There should be redundant copies for failover and load balancing (via Citrix NetScaler or whatever).  Unlike the database server, they would often be provisioned by way of docker containers, and they would have virtually no disk; what disk they do have would be inexpensive and expendable.

Also, the installation of third-party software is quite different between the two.  The PASOE web servers are typically loaded with LOTS of third-party cruft that is a dependency of the given application (plugins for FTP, email, excel, custom services, custom utilities, etc).  You do NOT want that type of a mess living on your database server.  Nor do you want any inefficient or runaway ABL processes (_progres, _proapsv, _mproapsv with custom code) to compete for the RAM, CPU and network of the database server processes (ie. don't interfere with _mprosrv, _sqlsrv, and similar).

I agree with you that historically Progress ABL was NOT designed very well for running in the middle tier.  At times it seems like it has more in common with a "data manipulation language" that would otherwise run within the memory of the database (ie. more in common with T-SQL than with a general-purpose language like C#.Net).

In that vein we have often written "classic" appserver programs just to serve as a substitute for a better query processing engine.  (Ie. a replacement for an inadequate SQL.)

But I think the "new" PASOE/tomcat product is forcing Progress to prioritize their "client-server" connectivity and start improving it.  Sysadmins, especially ones who aren't familiar with the evolution of OpenEdge, are probably very hesitant to run tomcat and a database engine on the same server.  Remember that the most highly advertised features of OE12 are related to improvements in "client-server" connectivity (i.e. a 300% improvement due to multithreading, and server-side joins).

Posted by ChUIMonster on 10-Dec-2019 14:59

In many ways the classic server can be viewed as Progress' answer to "stored procedures".

Things have changed considerably since those days and application architectures are very different now.  Instead of monolithic general purpose servers we have more and more specialized servers, virtual machines and containers.

You should not be thinking in terms of running PASOE on the db server with a shared memory connection.  Yes, you can do it but that takes away many of the advantages - especially in terms of resiliency but also scalability and cost.

Posted by gus bjorklund on 10-Dec-2019 22:04

Posted by Thomas Mercer-Hursh on 10-Dec-2019 22:15

Loquacious Gus!

Posted by ske on 12-Dec-2019 16:07

What's with all those empty messages from gus the last couple of weeks, in this thread and others?

Posted by dbeavon on 13-Dec-2019 15:22

Probably butt-dialing responses from his phone.  

When I got my first smart phone I was doing that all the time.  It stopped after I figured out how to enable the lock screen.

Posted by gus bjorklund on 13-Dec-2019 16:20

> On Dec 12, 2019, at 11:10 AM, ske wrote:

> What's with all those empty messages from gus the last couple of weeks, in this thread and others?

sorry about that, folks. i had no idea that was happening. those messages were not actually blank when they left my hands. so everything i've said since oct 26 was blanked out by something.

i switched to using thunderbird for email. guess maybe that is responsible.

Posted by ske on 17-Dec-2019 13:03

@gus: So what was it you tried to say in your empty post after the discussion of appservers (as a solution, or not, for bulk insert closer to the database)?

Posted by gus bjorklund on 17-Dec-2019 15:22

> On Dec 17, 2019, at 8:05 AM, ske wrote:

> @gus: So what was it you tried to say in your empty post after the discussion of appservers (as a solution, or not, for bulk insert closer to the database)?

what i tried to say, in response to ChUIMonster, was

Posted by ske on 27-Dec-2019 20:51

@gus: Your post ends after "was". Still some problem posting?

Posted by gus bjorklund on 27-Dec-2019 22:50

bah! no idea why my posts don't work.

chuimonster said that the classic appserver could be viewed as Progress' answer to stored procedures.

i said that that was a fantasy of product management.

Posted by dbeavon on 27-Dec-2019 23:02

>> fantasy of product management

I think I agree.  I suppose chuimonster's point is that some see ABL as a niche language that serves the purpose of a stored-procedure language like T-SQL.  But most ABL programmers are trying to accomplish a lot more these days, rather than just add, edit, and delete some records.  All those other things they are trying to accomplish will eventually compete and conflict with resources that should be set aside for the database itself.

If ABL is analogous to T-SQL, then keeping it close to the database makes sense, but you have to accept a lot of restrictions.  You also have to bundle the ABL license, so that it is an integral part of the database.

Posted by ChUIMonster on 27-Dec-2019 23:58

ABL, as a whole and in today's world, is not analogous to T-SQL.  Nor was it 25 years ago.

But way back in the dark ages of the 90s developers were starting to see that there was a case to be made for separating layers and that it might not be such a good idea to have your UI, data access, and business logic all mushed together.  Especially if you were planning to deploy in a client-server environment.

Way back then one approach to that problem, in other environments, was to stick a lot of business logic into "stored procedures".  (I'm not saying that that was a good approach.)

From a certain point of view you could sort of look at Progress' Classic Appserver as a variety of that sort of thinking.

If you're a real glutton for punishment you could also probably trace the history of operating modes with footnotes explaining how they were allegedly related to various industry trends over the next 20+ years.

Thankfully I have no idea what product management was actually fantasizing about at the time.  I'm quite sure that I wasn't happy about any of it though ;)

This thread is closed