Loading schema into a new, empty database is slow (prodict/load_df.p)

We have some build automation tools that create an empty OE database, prior to compiling ABL code.  It starts with a df file (or files) that contains all the desired target database schema.  

One problem I've noticed in the automated build operations is that the creation of an empty database is a bit slow.  Are there any tricks to improving performance (possibly loading certain segments of the schema concurrently on separate threads)?  When the schema is loading, the operation seems to be CPU-bound on a single CPU core.  I had first expected disk to be the bottleneck, but that doesn't appear to be the case.

I'm also taking a peek at PCT on GitHub, and it looks like it relies on prodict/dump_df.p and prodict/load_df.p as well.  It appears to have wrappers, dmpSch.p and loadSch.p.  I haven't tried running PCT myself yet, but I don't see any optimizations in there that would improve the performance of creating a full/empty database schema.  The task that I was investigating is called PCTCreateBase ... https://github.com/Riverside-Software/pct/wiki/PCTCreateBase

The only optimization I can think of is to attempt to manage *incremental* schema DFs and use those for automated builds, but that seems like it adds extra complexity.  I'd rather just send the full database DF as the input for our automated builds.

Another idea I had is to create some additional service that periodically checks for OE schema changes and extracts/creates the corresponding empty database with the same schema whenever a change is detected. That way a database is already prepared and ready to go whenever a new build is started.  This seems a bit overly complex as well.  

Yet another idea is to create multiple empty databases and split the schema arbitrarily across them.  Then we could compile code that connects to these multiple databases.  I think the resulting r-code would be the same as if it was all in a single database.  I haven't tried this myself.

The reason we create an empty database is because we need the OE database to be local while compiling or else the runtime environment does a lot more work to pull schema over the network on the fly.

Any ideas would be appreciated.

Verified Answer
  • How slow is slow?

    Since it is a new, empty database with no user data, you can just throw it away and try again if an unexpected problem occurs, so go ahead and use the -r startup parameter.  -r provides almost all of the same performance benefits as -i but gives somewhat more in the way of recovery options.

    Do NOT use -r or -i with any database that you are not prepared to throw away if something goes wrong.

    You might also want to ensure that you have adjusted the bi cluster size prior to loading the .df.

    --
    Tom Bascom
    tom@wss.com
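
    For reference, a minimal command-line sketch of the approach described above (database, file, and log names are examples; Unix-style shell shown; adjust paths and sizes for your environment):

    # create an empty 4 KB database (or prostrct create / procopy against your own .st)
    procopy $DLC/empty4 buildtmp
    # enlarge the bi cluster and block size before the load
    proutil buildtmp -C truncate bi -bi 16384 -biblocksize 16
    # single-user batch load of the schema, with -r for buffered (non-reliable) I/O
    _progres buildtmp -1 -b -r -p prodict/load_df.p -param schema.df > load_df.log 2>&1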

  • A proserve isn't necessarily the answer.  I tried various combinations and permutations of Type I, Type II, online, offline, different block sizes, etc.  Some of them made less difference than I expected, maybe because the process is CPU-bound and there's a limit to what my single-hamster-driven machine will do.

    You're using 4 KB block size.  You didn't mention the contents of SimpleStructure.st.  Are all the objects put into the schema area?  Or are they put in application areas?  Type I or II?  *In theory*, using Type II will do more I/O as each object occupies a minimum of one cluster (8 or 64 or 512 blocks), even if empty.  

    As an example, when I load my schema into the schema area of an empty 4 KB database, the area grows from 2 MB to 10 MB.  But when I have the tables and indexes assigned to Type II areas with 8 BPC, I get a 6 MB schema area, and areas 7 and 8 are 17 and 30 MB respectively.  There is more I/O to do.  That alone was good for about a 6.5 second difference (Type I being faster).

    So try loading a .df with no area assignments; if your schema has them already, grep them out to create a new .df and load that.  Add -r on your command line.  Move your truncate bi before the load; try specifying -biblocksize 16 -bi 16384.  See what difference those changes make for you.
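
    If your .df already carries area assignments, one way to produce the "no areas" variant is to filter them out (this assumes the stock dump format, where each assignment sits on its own indented AREA line; file names are examples):

    # drop the AREA lines so every table and index defaults to the Schema Area
    grep -v '^  AREA ' schema.df > schema-noarea.df
    # then load schema-noarea.df as in the sketch above, with -r and the larger bi settings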

All Replies
  • About the cores: it's just a 4GL program that loads the schema.  We all know that A) the 4GL is not particularly fast, and B) it always runs on a single core, since the AVM is not multi-threaded.  1200 tables in 60s is not that bad.

  • @onnodehaan

    For anything CPU-bound you should always be able to scale up the workload on a big server with multiple CPUs.  The first thing to do is just open the task manager while that "build street" is running and see if the CPU is fully utilized (all cores simultaneously).  If not, then there may be something fishy going on.

    In your case, the limiting factor might be some third-party dependencies that your colleagues might not be able to control (e.g. PCT or similar).  There may be limitations in these third-party dependencies that limit the utilization of the CPU (similar to the limitation of OE when it is creating an empty schema).

    Otherwise the steps for OE alone should be something like this:

    • Delete pre-existing r-code
    • Enumerate all the sources/programs you want to compile
    • Divide them up into batches (e.g. 1000 things to compile in each batch)
    • For each batch, output a new ABL program that compiles all the sources/programs (example below)
    • Compile all batches in separate concurrent _progres processes, while using -RO against a database connected with -1 (the db does *not* need to be a served database when using -RO)
    • Make sure to use load-balancing so if the batches aren't equal, then additional batches will be assigned to worker threads that finish early
      (i.e. in .NET you can use Parallel.ForEach and specify parallelization and load-balancing options)

    Here is an example of a compile program that would build and save all 1000 source programs in a batch.

    SESSION:SUPPRESS-WARNINGS-LIST = "15090".
    PROPATH = "Whatever".
    MESSAGE "...compiling... cyc\p\cyc0001.p".
    COMPILE cyc\p\cyc0001.p  GENERATE-MD5 SAVE.
    MESSAGE "...compiling... cyc\p\cyc0002.p".
    COMPILE cyc\p\cyc0002.p  GENERATE-MD5 SAVE.
    MESSAGE "...compiling... cyc\p\cyc0003.p".
    COMPILE cyc\p\cyc0003.p  GENERATE-MD5 SAVE.
    ...
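
    A hedged sketch of how those batch programs might then be fanned out concurrently (Unix-style shell shown for brevity; program and database names are hypothetical -- on Windows the same _progres invocations can be launched from Parallel.ForEach or PowerShell jobs):

    # each batch runs in its own background client; -RO allows concurrent read-only
    # connections without a broker
    _progres builddb -RO -b -p compile_batch_0001.p > compile_0001.log 2>&1 &
    _progres builddb -RO -b -p compile_batch_0002.p > compile_0002.log 2>&1 &
    ...
    # wait for all batches to finish
    wait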

    If there are no unusual third-party dependencies, then you should be able to double-check your colleague's findings.  Something good will probably come out of it.  There is always room for improvement.  If you spend some of your day waiting on builds to finish, then there is ROI in improving the build process.  Even saving 10 seconds can add up ... if you do builds for four staging environments, with five databases in each environment (which need their own r-code because of schema drift), and multiple iterations are required on a given day (a bad bug-day).

    Below is what the task manager looks like during compiles (eight cores with full utilization).  Make sure that disk isn't ever the bottleneck by using SSD.  Also, make sure you have plenty of RAM because the file system cache is even faster than the SSD.

    If you are the one that has to wait on builds, it is important that you have some control over improving them.  If someone else is managing your builds for you, then you will have the troublesome task of trying to convince them why 50 seconds is better than 60 seconds.  (Although they would probably understand if it was their own time that was being wasted ;)  Good luck.

  • > On Dec 6, 2018, at 10:33 AM, dbeavon wrote:

    >

    > It doesn't seem like there is any technical reason why the schema tables should be created one at a time rather than concurrently.

    Nor has it been thought to be important to optimize this process.  Also, if you are making schema changes so frequently that it interferes with your build process, maybe you should spend more time thinking and less time making schema changes.

  • >> nor has it been thought to be important to optimize this process

    Clearly.  

    Waiting on software builds is a real-world problem.  It is probably the least enjoyable aspect of software development.  And because OE source code has such tight dependencies on OE schema, it is difficult to tear apart these two concerns.  I should also say that getting the empty schema and making a *local* copy on a build server is NOT actually my first choice.  But it is necessary, since compiling ABL against a *remote* schema is so inefficient.  (I suppose the performance of compiling ABL code isn't thought to be important by Progress either ... PDSOE took over an hour to compile our ABL project, the last time I dared to test it.)

    I think my current plan is to wait for Progress to formally adopt (& support) PCT, and then start submitting all my build-related support issues through that channel.

  • A best-practice build process is to have everything that is required to build the artifacts inside source control, or buildable from items inside source control.  Relying on a fresh database is the best-practice approach.  In a world where there could be multiple builds all running concurrently on the same build slave, all potentially requiring different schema definitions, the only way to do it is to recreate the database each time.

    The database needs to be treated like code: versioned and reproducible from the repository.  I don't think a solution that relies on something outside of source control is workable.

  • dbeavon

    I think my current plan is to wait for Progress to formally adopt (& support) PCT and then start submitting all my build-related support issues thru that channel.  

    As far as I know, PCT will be part of OE 11.7.4. No idea about support. But why wait? You can already try it and submit your issues/ideas at Riverside-Software (aka Gilles). 
  • PCT IS part of 11.7.4. It was released end of October and is formally included in the download.

    The folks at Riverside-Software are very good at offering pro bono support for it too.

  • > making a *local* copy on a build server is NOT actually my first choice.

    > But it is necessary, since ABL compiles itself so inefficiently against a *remote* schema.

    In that case, maybe all you need is a local schema cache?

    documentation.progress.com/.../index.html
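
    For reference, a minimal sketch of the schema-cache mechanism being suggested (file, host, and service names are examples; whether the compiler actually consults the cache is exactly the question raised in the reply below):

    # one-time, from any ABL session connected to the database:
    #   SAVE CACHE COMPLETE mydb TO "mydb.csh".
    # then point remote client sessions at the cache file with -cache:
    _progres mydb -H dbhost -S mydbsvc -cache mydb.csh -b -p compile_batch_0001.p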

  • @ske

    That is a very good thought, and I have made some attempts in the past to leverage schema caching for these types of things.

    But the schema cache doesn't work well (or at least not consistently well) in my experience.  There must be a lot of special-case rules in the AVM about when/where it is allowed to use the schema cache.  Even "DYNAMIC QUERIES" will ignore the schema cache at run-time, and that seems to be the scenario where the cache is intended to have its greatest effect.  It has been a while, but I think I tried improving the performance of PDSOE compiles by using schema caching and wasn't very successful.  I suspect that ABL compile operations don't use the cache in general.  Perhaps it is only used by the AVM's runtime validation, whereby comparisons are made between the r-code and the database structure.

    The docs say that the cache is a "complete schema or subschema for a single OpenEdge database", but I don't know if it is used as the "authoritative" schema, or as a perfect substitute for accessing the database directly.

  • @James  

    I had seen that PCT started being installed as part of OE.  But Progress didn't seem to be ready to offer direct support.  I would be happy to talk to Riverside about the performance problems in the context of schema operations, but I suspect they would redirect me back to Progress.  I think I'll have more luck once there is a single point of contact that takes responsibility for the whole process (the build performance, along with the related schema operations).

    @danielb  

    I think you are making the same case I am.  If we want our automated builds to use inputs that consist of both code *and* schema from source control, then we will often be creating fresh OE databases from the related schema.  Optimizing the creation of fresh OE databases is key to having good build performance.

    My point earlier was that if I compile against local schema, there is a price to pay; but if I compile 25,000 programs against remote schema there is an even higher price to pay.  

    Ideally I'll find a way to optimize the creation of the local schema.  Perhaps we'll need to start using our source-control tools (or NuGet package management?) to check in the schema along with a full zip of the corresponding OE database.  Then the only CPU cost at build time would be the unzipping of the OE database.  It's a bit ugly, but our build operations would be faster.

  • We have a compressed, empty database.  When needed, we erase the current database and decompress the ZIP.  Very quick.

    Ezequiel Montoya
    Lima - Perú
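
    A hedged sketch of that zip-based pattern (file names are examples; the archive has to be unpacked into the same directory it was created in, since the database control area records the extent paths, and the database must not be in use):

    # one-time: archive the freshly prepared empty database (control area, extents, bi, log)
    zip empty_template.zip buildtmp*
    # per build: discard the current copy and unpack the prepared one
    rm -f buildtmp*
    unzip -o empty_template.zip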