04-Feb-2018

Hyperspi++

BEWARE: This is a technical post on how to analyze a program error using (built-in!) compiler and linker options. To follow it, you’ll need some knowledge of how a program on VMS looks internally.
Listing fragments have been slightly edited to fit on the screen.

The issue itself – and what’s related – can be found on the WASD mailing list for 2018 in entry 8 and following. This file contains the data of the analysis as taken from the files (not edited).

As mentioned, the HyperSpi++ collector hadn’t collected any data since the last reboot. The log showed why:

%SYSTEM-F-HPARITH, high performance arithmetic trap,
 Imask=00000000, Fmask=00080000, summary=02, PC=0000000000031448, PS=0000001B
-SYSTEM-F-FLTINV, floating invalid operation,
 PC=0000000000031448, PS=0000001B

  Improperly handled condition, image exit forced.
    Signal arguments:   Number = 0000000000000006
                        Name   = 0000000000000504
                                 0008000000000000
                                 0000000000080000
                                 0000000000000002
                                 0000000000031448
                                 000000000000001B

Register dump:
R0 =0000000000040990 R1 =00B2816B42BD488F R2 =0000000000010250
R3 =0000000000000001 R4 =00000000000201F0 R5 =0000000000040860
R6 =0000000000040870 R7 =0000000000040870 R8 =0000000000000005
R9 =0000000000040907 R10=00000000000408F5 R11=000000007FFCDC18
R12=000000007FFCDA98 R13=000000007AEC33A0 R14=0000000000000000
R15=000000007AEC29B0 R16=0000000000000000 R17=0000000000010848
R18=000000007BF260F8 R19=FFFFFFFFFFFFFFF9 R20=FF4D7E94BD42B771
R21=000000007B63DA10 R22=000000007ADB9940 R23=0000000000000002
R24=0000000000000001 R25=000000000006C004 R26=FFFFFFFF80FAFB60
R27=000000007BEF4820 R28=FFFFFFFF80CAFE40 R29=000000007ADB9930
SP =000000007ADB9860 PC =0000000000031448 PS =200000000000001B

But since the program wasn’t built with the options /LIST/MACHINE_CODE on compilation and /MAP/FULL on link, you’re in the dark. IMHO, these options should be mandatory for programs that run without user intervention, such as a detached program or a service.

But the kit contains the command procedures to build the agent, so I added these options, built the agent and ran it – interactively. Same error (of course) but now I could pinpoint the location.

The MAP file shows the sections (PSECTs) of a program and their offsets in the image file. The important part is close to the beginning:

DEFAULT_CLUSTER
  0     5    00010000    3 0 READ WRITE NON-SHAREABLE ADDRESS DATA
  0     2    00020000    8 0 READ WRITE COPY ON REF
  0    30    00030000   10 0 READ ONLY  EXECUTABLE    <<<<-------
  0     5    00040000    0 0 READ WRITE DEMAND ZERO
  0     2    00050000   40 0 READ WRITE FIXUP VECTORS
253    20    7FFF0000    0 0 READ WRITE DEMAND ZERO

Most important here is the third line, where the last column says "EXECUTABLE". This is where the executable code, i.e. what the program actually does, is located. It starts at offset 30000 (hexadecimal).

Since the crash line states the program counter (PC) where the error happens:

%SYSTEM-F-HPARITH, high performance arithmetic trap, Imask=00000000, Fmask=00080000, summary=02, PC=0000000000031448, PS=0000001B
-SYSTEM-F-FLTINV, floating invalid operation, PC=0000000000031448, PS=0000001B

it means the program counter refers to offset 31448, which lies within that section (obviously, since the error happens when the program executes an instruction). Within the section, that is offset 1448 (31448 - 30000, in hexadecimal).

But which instruction, and which line of code?
Now the listing is important, and especially the /MACHINE_CODE switch. It lists the actual instructions that the program executes, located at their offsets in the executable section of the program. In the listing, this is the second part; it shows the offset (within the section), the instruction, and the source line number. For this particular case, look around offset 1448:

0234  L$227:
1420          LDAH    R25, 7(R31)                         ; 040554
1424          LDL     R16, (R4)                           ; 040553
1428          LDQ     R17, -528(R2)                       ; 040554
142C          LDA     R25, -16380(R25)
1430          LDQ     R18, -552(R2)
1434          LDQ     R26, -544(R2)
1438          LDF     F1, (R0)                            ; 040552
143C          LDQ     R27, -536(R2)                       ; 040554
1440          LDA     R17, 936(R17)
1444          ADDF    F1, F18, F19                        ; 040552
1448          STF     F19, (R0)
144C          UNOP
1450          BEQ     R16, L$122                          ; 040553
1454          LDL     R16, DECC$GA_STDOUT ; R16, (R18)    ; 040554
1458          JSR     R26, DECC$GXFPRINTF ; R26, R26
145C  L$122:                                              ; 040555

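The first column shows the offset in the section; the section starts at 30000, which is offset 0. So the PC points at offset 1448: the STF instruction. STF only stores a value, though. The arithmetic is done by the ADDF just above it, at 1444, and an arithmetic trap on Alpha is reported with a PC beyond the instruction that caused it, so the ADDF is the likely culprit.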
The last column, which is not present on all lines, is the line number of the source code that the programmer wrote. In this case that is line 40552, shown on the ADDF instruction just above the STF. The source is in the first part of the listing file, so look around this line number:

1   40538 /**************************/
1   40539 /* calculate elapsed time */
1   40540 /**************************/
1   40541
1   40542 lib$sub_times (&CurrentBinTime, 
                         &PreviousBinTime,
                         &DiffBinTime);
1   40543
1X  40544 #ifdef __RMIDEF_LOADED
1X  40545 status = lib$cvts_from_internal_time 
                  (&cvtf_mode, &DeltaSeconds,
1X  40546          &DiffBinTime);
1X  40547 #else
1   40548 status = lib$cvtf_from_internal_time
                 (&cvtf_mode, &DeltaSeconds,
1   40549         &DiffBinTime);
1   40550 #endif
1   40551
1   40552 TotalDeltaSeconds += DeltaSeconds;   <<===HERE
1   40553 if (Debug)
1   40554  fprintf (stdout,
          "DeltaSeconds: %f TotalDeltaSeconds: %f\n",
1   40555  DeltaSeconds, TotalDeltaSeconds);
1   40556

Both TotalDeltaSeconds and DeltaSeconds are defined as FLOAT, which is consistent with the error text. The question is: WHAT IS WRONG? To find out, you will have to do some digging: compile the code with the /DEBUG option and set a breakpoint on, or just above, this line to see what is going on.

In this case, it turned out that the compiler switch /OPTIMIZE (perhaps implied by /NODEBUG) quite likely made the error occur; building with /NOOPTIMIZE solved the problem and the program ran flawlessly.
