Bulk insert into OpenEdge database table - Forum - OpenEdge Development - Progress Community


SQL Server has a network client API for inserting large numbers of rows in a single round trip to the database.  (aka bulk insert).

Is there any capability in OpenEdge for something like this?  (I'm looking for something that exists either on the ABL *or* on the SQL92 side of things.)

Bulk deletes are possible via a delete statement in SQL92, but I'm not certain how I can accomplish bulk inserts.  The best I've ever been able to do is use ABL, but I think that does about 3 round trips over the network for every record that is created.  The performance of that hits a limit at around 2,000 records per second.
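    To make the round-trip cost concrete, here is a minimal sketch in Python using the stdlib sqlite3 module purely as a stand-in (OpenEdge drivers and their APIs differ): row-at-a-time inserts issue one statement per record, while a batched call pushes the whole set through a single API call - over a network connection that is the difference between N round trips and one.

```python
import sqlite3

# In-memory database as a stand-in for a real DBMS; over client/server,
# each execute() below would also be (at least) one network round trip.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE journal (id INTEGER PRIMARY KEY, amount REAL)")

rows = [(i, i * 1.5) for i in range(2000)]

# Row-at-a-time: one statement per record.
for row in rows[:1000]:
    conn.execute("INSERT INTO journal VALUES (?, ?)", row)

# Batched: the whole remaining list goes through a single call.
conn.executemany("INSERT INTO journal VALUES (?, ?)", rows[1000:])
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM journal").fetchone()[0]
print(count)  # 2000
```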

Ideally a bulk delete and insert could be performed within a single transaction.  But that isn't strictly necessary for my current use-case.  I don't mind separate transactions if they are fast.

I see that there is a Progress DataDirect driver that supports bulk load, but it doesn't seem that this driver supports the Progress database.

Any pointers would be appreciated.

All Replies
  • >> Is there any capability in OpenEdge for something like this?

    You are right that OE-SQL does not support bulk insert.  We do have a feature in the pipeline to support bulk insert from OE-SQL.

    >> Any pointers would be appreciated.

    I am sure you might have thought about this already, but I still wanted to share it, since the number of transactions does not matter here: the insert load can be divided into 3-4 groups and fed to 3-4 different SQL clients.  This might not reduce the overall number of round trips, but the overall loading time might be reduced a bit.
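    As a rough illustration of the suggestion above (the function names are hypothetical, and plain Python workers stand in for separate SQL clients), splitting the insert load into a few groups fed to parallel workers looks like this:

```python
from concurrent.futures import ThreadPoolExecutor

def split(rows, n):
    """Divide the insert load into n roughly equal groups."""
    return [rows[i::n] for i in range(n)]

def load_group(group):
    # Stand-in for one SQL client inserting its share of the rows;
    # here it just reports how many rows it would have loaded.
    return len(group)

rows = list(range(10_000))
groups = split(rows, 4)

# Each group is fed to its own worker, standing in for a separate client.
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = sum(pool.map(load_group, groups))

print(loaded)  # 10000
```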

    Thanks,

    Akthar.

  • > The best I've ever been able to do is use ABL but I think that is doing about 3 round trips over the network for every record that is created.

    Does your bulk insert need to happen via client/server?  A shared memory connection will be faster.

    > Ideally a bulk delete and insert could be performed within a single transaction.

    Why is this ideal? A single long-running transaction with a lot of notes can cause BI growth and, in the event of a crash, could take a long time (and even more BI space) to undo.

    > But that isn't strictly necessary for my current use-case.

    Can you say more about your use-case?  Is this something you would do regularly in the normal course of business, or more once-in-a-blue-moon, e.g. an application upgrade scenario?  Do you have the option of taking any downtime around the time of the bulk insert?  

    There are some techniques you can leverage, though you want to carefully weigh their costs and benefits.

    First, you can increase the BI cluster size, so you are checkpointing less frequently (but you need downtime for the truncate bi required to change the cluster size before and after the insert).

    Second, you can use buffered I/O (-r) to speed updates, but this means taking a full backup before you begin (as well as a DB restart before and after the insert), and keeping the application offline during the bulk insert, as you never want clients doing work while you're running with -r.  It can make a mass load significantly faster.  Caveat: this is a very brief description and you should research -r carefully to understand the risks of using it in production.  Some links:

    slides: pugchallenge.org/.../321_Database_IO.pptx

    audio: pugchallenge.org/.../321_Database_IO.mp3

    I suggest you benchmark these techniques in a non-prod environment with prod-like configuration to weigh the performance benefits.

    > I am sure you might have thought about this, still just wanted to share as number of transactions does not matter here.

    The number of transactions for a bulk insert determines the number of RL_TBGN and RL_TEND notes written to BI and AI.  Fewer transactions (via transaction batching) does mean fewer of these writes.  How much this matters would be a function of the size of a begin and end note, compared with the volume of notes written for one record insert and its index keys.  Note, again, that BI growth may be a side-effect of transaction batching, depending on your batch size.
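    For what it's worth, transaction batching itself is straightforward to sketch (again with stdlib sqlite3 standing in for an OE-SQL client; the batch size is an arbitrary example): commit once per batch of rows rather than once per row, so the per-transaction begin/end overhead is paid once per batch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE journal (id INTEGER)")

BATCH = 500  # arbitrary batch size: fewer commits per load, but each
             # open transaction holds more undo information until it ends
commits = 0
for i in range(2000):
    conn.execute("INSERT INTO journal VALUES (?)", (i,))
    if (i + 1) % BATCH == 0:
        conn.commit()  # one transaction end per batch, not per row
        commits += 1

conn.commit()  # flush any partial final batch
print(commits)  # 4
```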

  • >> share as number of transactions does not matter here. "Insert Load" can be divided into 3-4 groups and can be fed to 3-4 different SQL clients

    While the "delete" and "insert" operations can happen in separate transactions, it is better if each of those two operations is atomic.  I already use parallel operations quite often when I'm interacting with the OE database (especially when using "client/server" connectivity from ABL).  But normally this is NOT for database updates - it is just when I'm pulling NO-LOCK (uncommitted) data out as fast as possible via FIND statements.

    >> Why is this ideal? A single long-running transaction with a lot of notes can cause BI growth and, in the event of a crash, could take a long time (and even more BI space) to undo.

    My point is that it should NOT be "long-running" in terms of duration ... but it must still insert lots of data.  Whatever is reasonable to do for a "shared memory" client should be just as reasonable for a remote client.  I.e. I'm looking for an insert operation that can happen over the network with performance comparable to doing it in "shared memory".  Our PASOE servers (where all our ABL logic is hosted) are remote from the database and are load-balanced and fault-tolerant.  They can accomplish just about everything that they could when connected in "shared memory", with a few exceptions (like large inserts).  Sometimes we have to fall back on SQL92 for these types of things ... although in the case of inserts, SQL does NOT seem to have the bulk insert functionality yet either.

    >> Can you say more about your use-case?

    Normal business operations, like calculating a large financial journal entry at the end of a fiscal week and inserting the calculated results quickly in bulk (i.e. hundreds or thousands of records).

    Another would be for the purpose of running server-side joins.  Let's say you have a UI screen where the user has prepared a filtered list of 1000 arbitrary sales orders ... and they would like to see all the related details.  Instead of the (typical) approach of looping through these individually and gathering the data in separate round trips (parallel or otherwise), it would be nice to insert the order numbers back INTO the database in bulk (into a utility table of some kind) to facilitate subsequent server-side-join queries that would gather all the related records joined to those order numbers.
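    The utility-table idea above can be sketched in generic SQL (sqlite3 used as a stand-in again; the table and column names here are made up for illustration): bulk-insert the selected keys once, then let one join on the server gather all the related rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_line (order_num INTEGER, item TEXT);
    CREATE TEMP TABLE selected_orders (order_num INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO order_line VALUES (?, ?)",
                 [(n, "item-%d" % n) for n in range(1, 5001)])

# One bulk insert of the user's filtered order numbers...
conn.executemany("INSERT INTO selected_orders VALUES (?)",
                 [(n,) for n in range(1, 1001)])

# ...then a single server-side join instead of 1000 separate lookups.
related = conn.execute("""
    SELECT ol.order_num, ol.item
      FROM order_line ol
      JOIN selected_orders so ON so.order_num = ol.order_num
""").fetchall()
print(len(related))  # 1000
```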

    The bottlenecks are NOT really at the database level (I/O or transaction limitations).  They are related to application logic (limitations in "client-server" ABL ... and limitations in SQL features.)

    Conceptually speaking, bulk inserts aren't profoundly new or interesting for database engines.  I think Progress should have this capability on *both* the SQL and ABL side of things.  For example, perhaps there could be a "COPY-BUFFER(S)" operation where you send an entire TT of data back into a database table from ABL in a single logical round trip from the programmer's perspective.  (_mprosrv should not have to rely on a long, chatty conversation with PASOE before it inserts every single record.)

    Of course, ABL bulk inserts might seem easier than they really are.  I would guess they would have to be disallowed in the presence of ABL triggers and such.

  • One might note that a dictionary load is a form of bulk insert and could easily be automated with a file transfer.

    Consulting in Model-Based Development, Transformation, and Object-Oriented Best Practice  http://www.cintegrity.com

  • > Conceptually speaking, bulk inserts aren't profoundly new or interesting for database engines.  I think Progress should have this capability on *both* the SQL and ABL side of things.

    Definitely agree, it's an area for improvement.

  • perhaps - but i think that is stretching the definition of "bulk insert" a bit.

  • @gus I think you are saying that we shouldn't have to resort to using FTP and data-dictionary tools to accomplish our record insertions.   I agree.

    Nobody should need to "reinvent the wheel" and create their own appserver-like utility just to insert a bunch of records from a remote client application.

    What is quite frustrating to me are the limitations of the DataDirect drivers where the OpenEdge database is concerned.  This isn't the first time that the ODBC driver has excluded the Progress OE database from a feature that is available for other databases.  Here is another example (DataDirect-implemented connection pooling): knowledgebase.progress.com/.../000038726

    It would be very interesting to hear the reasoning that is used by DataDirect when they take a pass on implementing an important feature for the OE database.  It would seem to me that the OE database should be getting *preferential* treatment rather than waiting at the bottom of the list.  It seems like DataDirect's goals aren't very well aligned with the goals for the OE database.

  • You seem to run into various problems all the time due to connecting your PASOE servers to the database over the network instead of through shared memory.

    I always thought, since I first read the documentation for the original Progress AppServer, that the intended/optimal architecture would be to run an AppServer on the same machine as the database with shared memory, as an extension of the capabilities of the database server itself.

    And then, if you really needed more than one AppServer on different machines (for performance, for load balancing, for extra computation power, or whatever) for end-user clients to connect to, it appeared to me that you could have that too, and let those AppServers talk to the main AppServer over the network for the more database-heavy operations that really need the performance of basically running on the database server itself through shared memory.  For instance, to do such things as the bulk inserts that you mention now, at least while the database server does not have that operation natively.  And maybe for some other things you have asked about previously too.  (Other operations can still connect directly to the database over the network, when the services of the main AppServer are not needed.)

    Even though that main Appserver may be viewed as a possible single point of failure, I don't see how that is any worse than the database itself already being a possible single point of failure. It's just a way of being able to extend the functionality of the database server without having to wait for Progress to do it for you. Just don't put all processing of your application on the main Appserver, only the operations that really need to be there.

    I haven't tried this architecture myself though, and I haven't seen it being discussed since then, and I have no idea if it fits PASOE too, so I don't know if it works for real. It is just an idea I always thought sounded interesting, and seemed to be implied by the documentation. It would be interesting to know if it works.

  • You are right that I prefer to treat PASOE as a middle tier for business logic.  Many of the features of web servers (load balancing and fault tolerance) work best in the middle tier.  Moreover, hardware is provisioned VERY differently for web servers vs. database servers.  (I.e. you build a big, expensive server to run the database, and it is configured with lots of good disk, RAM, etc.  Its failover model is quite different as well - based on failover clustering of the disk resource.)

    The PASOE web servers, on the other hand, should typically live on very inexpensive hardware.  There should be redundant copies for failover and load balancing (via Citrix netscaler or whatever).  Unlike the database server, they would often be provisioned by way of docker containers.  They would have virtually no disk.  Unlike the database server, the disk would be inexpensive and expendable.

    Also, the installation of third-party software is quite different between the two.  The PASOE web servers are typically loaded with LOTS of third-party cruft that is a dependency of the given application (plugins for FTP, email, excel, custom services, custom utilities, etc).  You do NOT want that type of a mess living on your database server.  Nor do you want any inefficient or runaway ABL processes (_progres, _proapsv, _mproapsv with custom code) to compete for the RAM, CPU and network of the database server processes (ie. don't interfere with _mprosrv, _sqlsrv, and similar).

    I agree with you that historically Progress ABL was NOT designed very well for running in the middle tier.  At times it seems like it has more in common with a "data manipulation language" that would otherwise run within the memory of the database (ie. more in common with T-SQL than with a general-purpose language like C#.Net).

    In that vein we have often written "classic" appserver programs just to serve as a substitute for a better query processing engine.  (Ie. a replacement for an inadequate SQL.)

    But I think the "new" PASOE/Tomcat product is forcing Progress to prioritize their "client-server" connectivity and start improving it.  Sysadmins, especially ones that aren't familiar with the evolution of OpenEdge, are probably very hesitant to run Tomcat and a database engine on the same server.  Remember that the most highly advertised features of OE12 are related to improvements in "client-server" connectivity (i.e. a 300% improvement due to multithreading and server-side joins).

  • In many ways the classic server can be viewed as Progress' answer to "stored procedures".

    Things have changed considerably since those days and application architectures are very different now.  Instead of monolithic general purpose servers we have more and more specialized servers, virtual machines and containers.

    You should not be thinking in terms of running PASOE on the db server with a shared memory connection.  Yes, you can do it but that takes away many of the advantages - especially in terms of resiliency but also scalability and cost.

    --
    Tom Bascom
    tom@wss.com

  • Loquacious Gus!

    Consulting in Model-Based Development, Transformation, and Object-Oriented Best Practice  http://www.cintegrity.com

  • What's with all those empty messages from gus the last couple of weeks, in this thread and others?

  • Probably butt-dialing responses from his phone.  

    When I got my first smart phone I was doing that all the time.  It stopped after I figured out how to enable the lock screen.

  • > On Dec 12, 2019, at 11:10 AM, ske wrote:

    > What's with all those empty messages from gus the last couple of weeks, in this thread and others?

    sorry about that, folks. i had no idea that was happening. those messages were not actually blank when they left my hands. so everything i've said since oct 26 was blanked out by something.

    i switched to using thunderbird for email. guess maybe that is responsible.