The Case of the Frozen Database: How Tuning NVMe I/O Saved a Critical JBoss Application

November 21, 2025

In high-performance enterprise environments, the chain is only as strong as its weakest link. I recently faced a critical issue where a long-running task in an application deployed on JBoss Application Server on my development machine was failing consistently. The investigation led me down a path from application timeouts deep into the core I/O mechanics of the Oracle Database.

This post details the entire troubleshooting journey and the specific configuration changes that resolved the issue, moving our database from a state of constant freezing to peak performance on modern NVMe hardware.

The Symptom: JBoss Transaction Reaper

The issue began with a critical background task in a JBoss-deployed application failing repeatedly after running for exactly 5 minutes. The application logs pointed to a transaction timeout:

WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check processing TX ... in state RUN
WARN [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012121: ... successfully canceled TX ...
...
sun.nio.ch.FileDispatcherImpl.read0(Native Method)

Analysis:

  • Transaction Reaper: JBoss’s internal monitor was killing the transaction because it exceeded the default 300-second timeout.
  • The database issue: The stack trace showed the thread was stuck in sun.nio.ch.FileDispatcherImpl.read0. This indicated the application was blocked on a network socket read, waiting for a response from the Oracle database.

The application wasn’t the problem; it was the victim. The database was not responding fast enough.
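
A quick way to confirm this from the database side, assuming you have a privileged account and access to the v$ views, is to check what active sessions are waiting on while the task hangs (a minimal sketch, not the exact query I ran at the time):

-- List non-idle wait events across all sessions while the JBoss task is stuck
SELECT event, COUNT(*) AS sessions_waiting 
FROM v$session 
WHERE wait_class <> 'Idle' 
GROUP BY event 
ORDER BY sessions_waiting DESC;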

The Root Cause: Oracle Database “Freezes”

I turned to the Oracle Alert Log (/u01/app/oracle/diag/rdbms/orcl/orcl/trace/alert_orcl.log) and found the culprit immediately. The log was flooded with these errors during the time of the application failure:

Thread 1 cannot allocate new log, sequence 10043
Checkpoint not complete
Current log# 3 seq# 10042 mem# 0: /u01/.../redo03.log
...
Thread 1 cannot allocate new log, sequence 10044
Private strand flush not complete

Analysis:

This is a classic redo log bottleneck.

  1. Checkpoint not complete: The database tries to switch to a new log file, but the old one still contains data that hasn’t been written to the data files on disk. The DB must pause operations until the disk catches up.
  2. Private strand flush not complete: This is a more acute version of the same problem. The in-memory redo buffers for individual sessions couldn’t be cleared into the log file fast enough.

Essentially, application transaction volume was generating data faster than the database could write it to the disk. The database was periodically “freezing” to catch its breath, causing JBoss to time out.
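
To quantify how long the instance spent stalled on these redo waits, the cumulative wait statistics are one place to look (a sketch, assuming SELECT access to v$system_event; TIME_WAITED is reported in centiseconds):

-- Cumulative waits caused by redo log switches since instance startup
SELECT event, total_waits, ROUND(time_waited/100) AS seconds_waited 
FROM v$system_event 
WHERE event LIKE 'log file switch%' 
ORDER BY time_waited DESC;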

The Solution: A Multi-Layered Tuning Approach

Fixing this required addressing the problem at three distinct layers: Storage Hardware, Database Capacity, and Database Throughput.

Step 1: The Hardware Upgrade (Storage)

The initial redo logs were on a standard SSD. To support the high write throughput the application needed, I moved the redo logs to a dedicated, high-performance Kioxia NVMe drive (the main drive of the machine).

  • Action: I created a new directory for the logs on the NVMe drive mounted at /, e.g., /redologs/.
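
Before and after the move, it helps to verify where the current redo log members live and how big they are (a sketch, run as SYSDBA; adapt it to your own paths):

-- Show each redo log group, its member file path, size, and status
SELECT l.group#, f.member, l.bytes/1024/1024 AS size_mb, l.status 
FROM v$log l 
JOIN v$logfile f ON f.group# = l.group# 
ORDER BY l.group#;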

Step 2: Solving the Capacity Problem (Redo Log Sizing)

Moving to a faster disk didn’t solve the problem initially. I discovered that the redo log files were far too small (the default installation uses 200MB logs). A heavy transaction would fill a 200MB file in seconds, causing constant log switches and immediate freezing.

Check Log Switch Frequency

Run this query to see how often you are switching logs per hour:

SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS Hour, 
       COUNT(*) AS Switches 
FROM v$log_history 
GROUP BY TO_CHAR(first_time, 'YYYY-MM-DD HH24') 
ORDER BY 1 DESC;

If you see > 10 switches per hour: Your logs are too small.

If you see > 30 switches per hour: This is critical; resize immediately.
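
Oracle can also suggest a target size itself: when FAST_START_MTTR_TARGET is set, v$instance_recovery reports an OPTIMAL_LOGFILE_SIZE in megabytes for the current workload (a sketch; the column is NULL if the parameter is not set, so treat it as a hint rather than a rule):

-- Ask Oracle for a suggested redo log size (MB), based on FAST_START_MTTR_TARGET
SELECT optimal_logfile_size AS suggested_mb 
FROM v$instance_recovery;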

  • Action: I created new, much larger log groups on the NVMe drive and dropped the old ones.
  • Configuration, run in SQL*Plus as SYSDBA:
-- Created 3 new groups of 2GB each 
ALTER DATABASE ADD LOGFILE GROUP 4 ('/redologs/redo04.log') SIZE 2G; 
ALTER DATABASE ADD LOGFILE GROUP 5 ('/redologs/redo05.log') SIZE 2G; 
ALTER DATABASE ADD LOGFILE GROUP 6 ('/redologs/redo06.log') SIZE 2G; 
-- Switch to the new logs and retire the old groups (1, 2, 3) 
-- Execute the command below several times until all the old groups 1, 2, 3 show INACTIVE status
ALTER SYSTEM SWITCH LOGFILE; 
-- If a group stays ACTIVE, force a checkpoint with: ALTER SYSTEM CHECKPOINT;
-- Check the status with the commands below
SET LINESIZE 200
COL MEMBER FORMAT a60
SELECT GROUP#, MEMBERS, BYTES/1024/1024 AS SIZE_MB, STATUS 
FROM V$LOG;
-- remove the old logs ONLY if INACTIVE
ALTER DATABASE DROP LOGFILE GROUP 1; 
ALTER DATABASE DROP LOGFILE GROUP 2;
ALTER DATABASE DROP LOGFILE GROUP 3;

Finally, delete the old logs' physical files from the OS; dropping a log group removes it from the database but does not delete the file from disk.

Step 3: Solving the Throughput Problem (I/O Tuning)

Even with big logs on a fast disk, the errors persisted. This was the critical turning point. I realized Oracle was not configured to utilize the massive parallelism of the NVMe drive. It was like driving a sports car in first gear.

I had to enable Asynchronous I/O and increase the number of background writer processes.

  • Action: Modified system parameters in the SPFILE and performed a clean DB restart.
  • Configuration, run in SQL*Plus as SYSDBA:
-- Enable Asynchronous and Direct I/O for Linux filesystems 
ALTER SYSTEM SET filesystemio_options=setall SCOPE=SPFILE; 
-- Increase Database Writers to handle parallel writes (depends on CPU cores) 
ALTER SYSTEM SET db_writer_processes=8 SCOPE=SPFILE; 
-- Clean Restart Required 
SHUTDOWN IMMEDIATE; 
STARTUP;

Note: The SHUTDOWN IMMEDIATE took a significant amount of time as it had to flush a massive backlog of dirty buffers to disk, confirming the severity of the previous I/O bottleneck.
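
After the restart, it is worth confirming that both parameters actually took effect (a quick sketch, run as SYSDBA):

-- Verify the I/O settings that are now active
SELECT name, value 
FROM v$parameter 
WHERE name IN ('filesystemio_options', 'db_writer_processes');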

The Result: Stability and Performance

After bringing the database back up with the new configuration:

  1. Alert Log is Clean: The Checkpoint not complete and Private strand flush errors completely disappeared.
  2. Log Switches are Sane: Instead of switching every minute, log switches now occur comfortably once or twice an hour.
  3. JBoss Success: The long-running application task, which previously failed after 5 minutes, completed successfully. The database no longer freezes, and JBoss receives responses promptly.

Summary of Changes

Component | Original State | New State | Purpose
Redo Log Storage | Standard SSD | Dedicated Kioxia NVMe | Provide low-latency, high-bandwidth physical storage.
Redo Log Size | Small (e.g., 200MB) | Large (2GB) | Increase capacity to absorb heavy write bursts without frequent switching.
I/O Mode | Default (Sync/Async) | filesystemio_options = setall | Enable Async & Direct I/O to fully utilize NVMe capabilities.
DB Writers | db_writer_processes = 1 | db_writer_processes = 8 | Increase parallelism for flushing data from RAM to disk.

This case serves as a perfect example that high-performance hardware is useless without the correct software configuration to exploit it. By tuning the entire I/O stack, I resolved a critical business issue and unlocked the full potential of my infrastructure.
