Regarding IO, I see frequent discussion of write behavior in the documentation (direct IO, sync, buffer flush timing, etc). What about reads? Does Progress support any kind of concurrent or asynchronous reads?
> Does Progress support any kind of concurrent or asynchronous reads?
Another truss run found significant time spent in an OS "__semop" function. Sounds like maybe it's waiting on semaphores?
Did you ever get any information from the DBA side? Progress does use semaphores for shared memory functions.
The really interesting things are going to be in promon. Did you open a call with PSC?
We have an open call and the DBAs are working with them. Slow for the holidays. Just researching further.
Did you notice that promon hangs for some time while connecting to the database, or while updating statistics on the /Activity/ screens?
Processes that are sleeping in a __semop() call should be reported in promon/R&D/1/4/2. Status: Blocked Clients.
I would also check promon/R&D/debghb/6/15. Status: Buffer Lock Queue
Is there any kind of readahead at all? It looks like it requests an 8 KB block from disk and waits until it is returned, for every single read.
OpenEdge does nothing to specifically take advantage of “readahead” or “concurrent I/O”.
To give some feedback if anyone else has similar issues, I can confirm that the issues we are experiencing were not related to NUMA, the IBM AIX platform, or our IO layer. We are troubleshooting a long duration single threaded job and we are read latency bound.
The lack of any kind of asynchronous reads and the single threaded nature of the job meant we only ever had one I/O outstanding at a time. Though we are throwing SSD at the problem to reduce the read latency, I believe this does not solve the underlying issue of the IO model. SSD only provides some relief for the symptoms. Splitting up the job may also be an option.
I'm surprised the topic of NUMA came up before the single threading of IO.
I specialize in AIX, not Progress, so please interpret this as constructive criticism. I've never seen a database on the AIX platform that did not support some form of async I/O, whether provided by an OS library or internal to the application. We ultimately have too large a system and SAN for this application, because the database cannot effectively utilize the hardware.
An apt analogy would be that I am providing a fleet of hundreds of long-haul semi trucks, but I'm shipping a single can of soup at a time in one truck while the rest idle at the dock. That's a very poor ROI on the rest of the fleet.
I find Progress' programming language and text UI (hurrah!) appealing, but I now have doubts about the scalability of the backend database. Please correct me if I'm wrong; I don't mean to start an argument. I just wanted to make sure I posted that the issue was resolved, for future reference.
If I understand your last post, you have an _progres process that is doing massive database reads and none (or few) of those blocks are cached in the DB buffer pool. Therefore you are read latency bound: like a backup, you are reading a block from disk and never using it again in its lifetime in the buffer pool. Am I interpreting this correctly?
What are your DB start-up params? How much I/O (logical and physical) are you doing? If you watch the batch job(s) in real time with ProTop then you'll see fine grained stats right at the individual _progres process level including logical (table and index usage) and physical. Something smells bad here.
If my understanding is correct then there should be some tuning opportunities at the database or query level. I re-read the entire thread, and what struck me is that the task took much less time on the old hardware (4h vs. 12-16h). You did not answer whether a dump and load of the DB was done during the migration. Was it? You also seemed to say that it was a string of batch jobs, and I had asked if all these jobs were 3-4X slower or if maybe it was just a subset of jobs that was adding all the extra time. It was also not clear whether the DBA confirmed that all the DB start-up parameters are the same as before. What is the DB block size?
I guess I'm saying that you shouldn't need asynchronous reads and that, like SSDs, AIO would just mask the symptoms rather than address the core issue(s).
If he went from running the jobs in sequence to running the jobs all at once, that could be a big enough change to cause -B thrashing and explain the run-time difference.
Also, there have been changes in parameter behavior between versions such that using the same parameter value in both versions can result in a negative performance impact.
I wouldn't write off OpenEdge as lacking scalability just yet; that isn't the only reasonable conclusion based on the posts in this thread. You may be running into an OE architectural limit, but you may simply have a configuration issue.
We have no idea how your client application or DB are configured/structured. We never did get your client or broker startup parameters from prod and test.
@Paul: I entirely agree there are tuning opportunities at the database level. The old hardware was benefiting from SAN technology which moved disk hot spots to SSD, and the new system though on the same SAN did not yet have the "data temperature" to move those same hot spots up.
As the systems engineer, I can't speak to all the database portions. Let me just say that our team had IBM and EMC review these systems at length for a problem that was at the application layer.
We have since used a Progress consultant to help review the Progress side for tuning and after putting all SSD storage into service we hope to see significantly better performance.
This thread began with the problem framed as a potential issue with the new platform, and I just wanted to close that out with some feedback.
I will respectfully disagree regarding AIO. The SAN can perform over 40,000 IOPS without any significant increase in latency; we benchmarked it extensively during our troubleshooting. I primarily work with Oracle and InterSystems Caché customers on AIX, and I've never had a database that couldn't make use of the storage bandwidth provided.
There are tuning opportunities at the application, job, and database level, yet I don't believe further tuning will be able to leverage all the storage bandwidth this platform provides.
@Tim (sorry, quoting is different in every forum): the differences were minimized during the testing. We restored the DB and ran the same jobs between the two systems. Only one service pack level of difference in the Progress software, as I understand it.
@Rob: I certainly didn't say it's written off. I said I have concerns about the scalability of the DB level without asynchronous IO. My goal is to provide only some productive feedback and my opinions based on our troubleshooting of the issue. I am optimistic we will see performance improvements with the new tuning parameters and SSDs.
From my perspective at the system layer, we just threw hardware at a software issue (SSD vs. IO patterns). I have very large customers on other AIX systems with databases that dwarf this one, and they get excellent IO performance on similarly configured high-class storage. It was unusual that such a small (<1TB) database was having such difficulty on a new system, and it caused significant difficulty for everyone involved to isolate the cause.
I appreciate everyone's feedback while we were troubleshooting; this forum was a good place to get ideas.