ORA-1555 after switchover to standby

This week started with diagnosing an interesting situation that didn’t seem to be talked about much (maybe because it is a relatively “old” problem–explanation in a minute), so I thought I’d share the experience for everyone’s benefit.

The situation involves a 10.2.0.1 (<gasp>) database on Linux 32-bit RHEL 4. The database is in flashback mode and has a single physical standby database with Data Guard configured in maximum availability mode. The standby was configured relatively recently and had never been tested with a switchover. The switchover was scheduled and conducted without much issue.

The switchover to the standby database was done, application brought up against the now activated standby database and there were no issues reported. Some basic testing was done for about 30 minutes, no application errors, no database errors.

Once satisfied that the standby database was viable and worked, the customer wanted the original production database server to be put back into the primary role. Another switchover was executed without issue and the applications were again reconfigured to access the primary database and brought up. It was at this point that the alert log started showing many (100+ per hour, during low utilization period) errors like these:

ORA-01555 caused by SQL statement below (SQL ID: 74rfy9pxwabb4, Query Duration=0 sec, SCN: 0x0003.52c296c8):

This error was puzzling since the query duration is 0 sec, so it isn’t a undo_retention issue. These statistics from v$undostat were also puzzling.


                            Tuned Undo # Unexpired  # Expired   ORA-1555 Out-Of-space  Max Query
   INST_ID BEGIN_TIME        Retention      Stolen     Reused      Error        Error     Length
---------- ---------------- ---------- ----------- ---------- ---------- ------------ ----------
         1 09/02/2008 12:59       1500           0          0          7            0        136
         1 09/02/2008 13:09       1500           0          0          3            0        137
         1 09/02/2008 13:19       1500           0          0          4            0         37
         1 09/02/2008 13:29       1500           0          0         12            0         27
         1 09/02/2008 13:39       1500           0          0          5            0         19
         1 09/02/2008 13:49       1500           0          0          0            0         23
         1 09/02/2008 13:59       1500           0          0          0            0         66
         1 09/02/2008 14:09       1500           0          0         19            0        157
         1 09/02/2008 14:19       1500           0          0         22            0        142
         1 09/02/2008 14:29       1500           0          0         14            0          0
         1 09/02/2008 14:39       1500           0          0         10            0          0
         1 09/02/2008 14:49       1500           0          0         17            0        114
         1 09/02/2008 14:59       1500           0          0          3            0          0

Seeing that Max Query Length is at least an order of magnitude less than the undo_retention (1500), there’s no reason why the ORA-1555 errors should be occurring–especially more than 20 in a 10-minute interval. At this point, I concluded that we must be hitting a bug. After all, these statistics were complete inconsistent with logic and we are on 10.2.0.1, so it seems pretty likely that others have probably hit this issue before given the 3+ years of field experience that 10g Release 2 has.

It took a few tries, but eventually, I located Oracle BUG 5380055 which is documented in Metalink Note 386830.1. It affects environments on 10.1.0.5, 10.2.0.1 or 10.2.0.2 that have flashback database enabled and conduct two switchovers. There are some other scenarios that will trigger the bug as well, but this scenario fit our circumstances.

Luckily, the note suggests that there is a patch. I looked and was lucky to find that 10.2.0.1 does indeed have a patch available for our platform. Following the procedure to install the patch on both primary and standby environments is supposed to prevent the issue from occurring in the future. A second procedure after the patch is installed is required to “fix” the issues that were introduced in to the database by the bug. I don’t really understand the internals of the issue well enough to understand how this fixes it, but the fix is to set these two parameters in the primary instance:

*._allow_error_simulation = TRUE
*._smu_debug_mode = 268435456

Then, start the primary instance in restricted mode. Remove the parameters and restart the primary instance in normal mode and the problem(s) should be gone. The standby database will receive the fixes in the redo stream, so there’s no additional procedures to conduct on the standby side. In our case, it worked well and the entire downtime to install the patch and conduct the restart procedure was about 15 minutes.

Obviously, we long ago recommended to this customer that they update their patch level to 10.2.0.3 or 10.2.0.4. The application they use is custom written in-house, so the only barrier to the upgrade is the time necessary to test the patchset. They will upgrade soon–I’m sure this issue will help motivate them even more.

21 thoughts on “ORA-1555 after switchover to standby”

  1. How weird (or is it serendipitous) ? I have been lightly trying to figure out a similar issue on a standalone training database that is just keeping and keeping undo segments for weeks at a time. I found Metalink Doc #420525.1 which sounded right up my alley. (Okay, so it isn't the exact issue but it is very similar). I implemented work-around #1 and am now seeing the TUNED_UNDORETENTION value in v$undostat is mirroring what my undo_retention is set to. the UNEXPIREDBLKS and EXPIREDBLKS have flip-flopped as well. Now to see if the segments automatically start shrinking.

  2. I probably read that note too…there was no shortage of AUM-related
    bugs and issues when I started looking around. The hard part, as it
    turned out, was to see which of the many matches was actually an exact
    match for my issue. Sounds like you're in the same situation possibly.
    Good luck!

  3. Let's look at a few things you'll need to do after installing Beehive, but before you start migrating users… If you haven't installed Beehive yet, check out my last post. OK, here we go.

  4. I am thinking that there is a way to have CRS put the standby in managed recovery mode, but not finding it easy to find in the documentation. Why is it not made more clear, more obvious? Especially if Oracle is touting this thing called “Best Practices” for “Maximum Availability”

Comments are closed.