This week started with diagnosing an interesting situation that didn’t seem to be talked about much (maybe because it is a relatively “old” problem–explanation in a minute), so I thought I’d share the experience for everyone’s benefit.

The situation involves a 10.2.0.1 (<gasp>) database on Linux 32-bit RHEL 4. The database is in flashback mode and has a single physical standby database with Data Guard configured in maximum availability mode. The standby was configured relatively recently and had never been tested with a switchover. The switchover was scheduled and conducted without much issue.

The switchover to the standby database was done, application brought up against the now activated standby database and there were no issues reported. Some basic testing was done for about 30 minutes, no application errors, no database errors.

Once satisfied that the standby database was viable and worked, the customer wanted the original production database server to be put back into the primary role. Another switchover was executed without issue and the applications were again reconfigured to access the primary database and brought up. It was at this point that the alert log started showing many (100+ per hour, during low utilization period) errors like these:

ORA-01555 caused by SQL statement below (SQL ID: 74rfy9pxwabb4, Query Duration=0 sec, SCN: 0x0003.52c296c8):

This error was puzzling since the query duration is 0 sec, so it isn’t a undo_retention issue. These statistics from v$undostat were also puzzling.


                            Tuned Undo # Unexpired  # Expired   ORA-1555 Out-Of-space  Max Query
   INST_ID BEGIN_TIME        Retention      Stolen     Reused      Error        Error     Length
---------- ---------------- ---------- ----------- ---------- ---------- ------------ ----------
         1 09/02/2008 12:59       1500           0          0          7            0        136
         1 09/02/2008 13:09       1500           0          0          3            0        137
         1 09/02/2008 13:19       1500           0          0          4            0         37
         1 09/02/2008 13:29       1500           0          0         12            0         27
         1 09/02/2008 13:39       1500           0          0          5            0         19
         1 09/02/2008 13:49       1500           0          0          0            0         23
         1 09/02/2008 13:59       1500           0          0          0            0         66
         1 09/02/2008 14:09       1500           0          0         19            0        157
         1 09/02/2008 14:19       1500           0          0         22            0        142
         1 09/02/2008 14:29       1500           0          0         14            0          0
         1 09/02/2008 14:39       1500           0          0         10            0          0
         1 09/02/2008 14:49       1500           0          0         17            0        114
         1 09/02/2008 14:59       1500           0          0          3            0          0

Seeing that Max Query Length is at least an order of magnitude less than the undo_retention (1500), there’s no reason why the ORA-1555 errors should be occurring–especially more than 20 in a 10-minute interval. At this point, I concluded that we must be hitting a bug. After all, these statistics were complete inconsistent with logic and we are on 10.2.0.1, so it seems pretty likely that others have probably hit this issue before given the 3+ years of field experience that 10g Release 2 has.

It took a few tries, but eventually, I located Oracle BUG 5380055 which is documented in Metalink Note 386830.1. It affects environments on 10.1.0.5, 10.2.0.1 or 10.2.0.2 that have flashback database enabled and conduct two switchovers. There are some other scenarios that will trigger the bug as well, but this scenario fit our circumstances.

Luckily, the note suggests that there is a patch. I looked and was lucky to find that 10.2.0.1 does indeed have a patch available for our platform. Following the procedure to install the patch on both primary and standby environments is supposed to prevent the issue from occurring in the future. A second procedure after the patch is installed is required to “fix” the issues that were introduced in to the database by the bug. I don’t really understand the internals of the issue well enough to understand how this fixes it, but the fix is to set these two parameters in the primary instance:

*._allow_error_simulation = TRUE
*._smu_debug_mode = 268435456

Then, start the primary instance in restricted mode. Remove the parameters and restart the primary instance in normal mode and the problem(s) should be gone. The standby database will receive the fixes in the redo stream, so there’s no additional procedures to conduct on the standby side. In our case, it worked well and the entire downtime to install the patch and conduct the restart procedure was about 15 minutes.

Obviously, we long ago recommended to this customer that they update their patch level to 10.2.0.3 or 10.2.0.4. The application they use is custom written in-house, so the only barrier to the upgrade is the time necessary to test the patchset. They will upgrade soon–I’m sure this issue will help motivate them even more.

  • http://piontekdd.blogspot.com/ piontekdd

    How weird (or is it serendipitous) ? I have been lightly trying to figure out a similar issue on a standalone training database that is just keeping and keeping undo segments for weeks at a time. I found Metalink Doc #420525.1 which sounded right up my alley. (Okay, so it isn't the exact issue but it is very similar). I implemented work-around #1 and am now seeing the TUNED_UNDORETENTION value in v$undostat is mirroring what my undo_retention is set to. the UNEXPIREDBLKS and EXPIREDBLKS have flip-flopped as well. Now to see if the segments automatically start shrinking.

  • http://www.dannorris.com/ Dan Norris

    I probably read that note too…there was no shortage of AUM-related
    bugs and issues when I started looking around. The hard part, as it
    turned out, was to see which of the many matches was actually an exact
    match for my issue. Sounds like you're in the same situation possibly.
    Good luck!

  • Pingback: Log Buffer #114: A Carnival of the Vanities for DBAs()

  • http://www.mogs.com/aoc-gold age of conan gold

    Let's look at a few things you'll need to do after installing Beehive, but before you start migrating users… If you haven't installed Beehive yet, check out my last post. OK, here we go.

  • http://www.dannorris.com/ Dan Norris

    Huh? Did this get misposted here by mistake or something? I am familiar with Beehive, but this post certainly didn't have anything to do with it.

  • http://www.baliinc.net/ Busby SEO Test

    great report, I'll take a note on this

  • http://intersindo.com/seo-contest/ Busby SEO

    The report so clearly, Good Diagnose.

  • http://www.comparesimple.com/rac-breakdown-cover/ rac breakdown recovery uk

    i will take note of this.
    good report.

  • http://www.dataentry-va.com Data Processing

    really nice information.

  • http://www.yourdevoncottages.co.uk Devon Cottages

    Nice work with this… you seem to have handled the request really well. Better than most.

  • http://www.unlockphoneshop.com/ Unlocked Phone

    thanks for this great post and thanks for sharing this information.

  • http://www.hypnosis-training-devon.co.uk/ hypnosiscourse

    I am thinking that there is a way to have CRS put the standby in managed recovery mode, but not finding it easy to find in the documentation. Why is it not made more clear, more obvious? Especially if Oracle is touting this thing called “Best Practices” for “Maximum Availability”

  • http://www.dannorris.com/ Dan Norris

    I'm not sure why you almost quoted Charles Schultz, but I'm guessing you're actually a bot. Hope this comment gets you the hits you're looking for. See Charles' comment at http://www.dannorris.com/2009/03/12/start-datab

  • cash_advance

    That is wierd that 10. IPs would cause errors but that it would not effect other IP

  • http://www.club-penguin.org/ Club Penguin Cheats

    I have been lightly trying to figure out a similar issue on a standalone training database that is just keeping and keeping undo segments for weeks at a time.

  • increasewebsitetraffic

    Good information…keep it flowing.
    http://www.seotrafficspider.com

  • increasewebsitetraffic

    I'm glad to have visited your blog and good to know you! I find it interesting and informative.
    http://www.seotrafficspider.com

  • increasewebsitetraffic

    Your blog is like an encyclopedia for those who want to know more about this. Thanks for the interesting information.
    http://www.seotrafficspider.com/

  • affordableseoservices1

    Good advice in your blog. Thanks a lot.
    http://www.seotrafficspider.com/

  • affordableseoservices1

    Thanks for the good info…very good blog site.
    Very nice information. Thanks for this.

  • http://pulse.yahoo.com/_IGD6SBWEKSZQKT2GJXEJCGH4BE Lonson Ricardo

    Great that you have managed to fix that issue. Rezultate live