IA32Lib

Document Revision 1.01.030609

This is intended to be a short description of the ia32lib library. You can use the links below to jump to the different sections. Use the HOME key on your keyboard to return to the top. If you want to get the big picture without reading too much, I would suggest skipping the Reference section. If you don't feel like reading this at all, scan through the Examples section.

The IA32Lib toolkit has been designed and developed by Kamen Yotov. Direct your questions, suggestions and concerns to kamen@yotov.org.

We will appreciate any possible feedback you might have at every stage (usage, design, source, documentation...). Enjoy!

Download!

[ Introduction | Overview | Requirements | Installation | Reference | Examples | Future Work ]

Introduction

Modern processors based on IA-32 have performance counter registers that let programmers collect statistics about their running applications. Programming these counters to count exactly what a programmer wants, and reading their values, requires access to the so-called Model Specific Registers (MSRs) of the processor. Several instructions in the IA-32 ISA provide access to these special registers (e.g. RDMSR, WRMSR, RDPMC), but they are either privileged (executable only in kernel mode, ring 0) or subject to other restrictions which, combined with the security model of the operating system, prevent the application programmer from using them. The main purpose of the ia32lib library is to export a user-programmable interface to these crucial performance measurement facilities and to provide appropriate ways for detailed processor detection (family, model, cache configuration, ...).

Both the library and its full source code are free for personal use and can be freely downloaded. I have not yet figured out the policy for commercial uses, but who knows...

Overview

ia32lib consists of two main parts plus some examples:

The sources to build ia32.sys are provided for completeness and educational purposes only. It is not advisable to try rebuilding ia32.sys unless you really know what you are doing. Furthermore, you will need the Microsoft Windows NT Driver Development Kit (freely available from Microsoft's site).

Requirements

NOTES: The compilers ia32lib has been tested with so far are Microsoft C++ 6.0 and 7.0 and Intel C++ Optimizing Compiler 5.0. Every effort has been made to make the code portable to other compilers, but no tests have been performed so far. The next compilers to look at will probably be Watcom C++ 11.0c and Borland C++ 5.5. Compatibility with the first three mentioned above is guaranteed as long as the development effort continues.

Installation

  1. Download the distribution (If you have not already done so);
  2. Start ia32lib.exe - this will unpack all files to a directory of your choice;
  3. Install the ia32.sys Kernel-Mode Driver on your Windows system (step-by-step instructions);
  4. Open ia32lib.dsw workspace or ia32lib.sln solution with the appropriate version of Microsoft Visual C++ (6.0 or 7.0.NET respectively);
  5. You are ready to go! Try building the two sample programs (ia32detect and ia32p6).

NOTES: Installation of the ia32.sys driver does not require system restart on Windows XP. The installation instructions are also for Windows XP, but the steps should be isomorphic to the steps for Windows NT and Windows 2000. Note that at this point I have not tried the driver on Windows NT and Windows 2000, but it should work :). If you have any problems, mail me!

The directory structure of the distribution is as follows:

Reference

[ ia32def.h | ia32size.h | ia32error.h | ia32driver.h | ia32ring0.h | ia32cache.h | ia32detect.h | ia32counter.h | p6counter.h ]

This part is mostly a top-down description of all features in the library. Each header file is discussed separately and in detail. If you don't feel like reading, this is the part to skip :).

ia32def.h

Types
  Name Equivalent
  byte unsigned char
  word unsigned
  bit unsigned
  uint8 unsigned __int8
  uint16 unsigned __int16
  uint32 unsigned __int32
  uint64 unsigned __int64

Notes:


ia32size.h

Constants
  Name Value
  B (uint64)1
  KB (1024 * B)
  MB (1024 * KB)
  GB (1024 * MB)
  TB (1024 * GB)
Classes
  Name Definition
  ia32size
class ia32size
	{
	    uint64 size;
	public:
	    ia32size (uint64);
	    operator const string () const;
	    operator const uint64 () const;
	};

Notes:

Example:

#include <stdio.h>
	#include "ia32size.h"

	int main ()
	{
	    // The string conversion picks a unit suffix (B, KB, MB, GB, TB).
	    printf("%8s\n", ((string)ia32size(16)).c_str());
	    printf("%8s\n", ((string)ia32size(1024)).c_str());
	    printf("%8s\n", ((string)ia32size(4096)).c_str());
	    printf("%8s\n", ((string)ia32size(3 * 1024 * 1024)).c_str());
	    printf("%8s\n", ((string)ia32size((uint64)13 * 1024 * 1024 * 1024 * 1024)).c_str());
	    printf("%8s\n", ((string)ia32size((uint64)13 * 1024 * 1024 * 1024 * 1024 + (uint64)7 * 1024 * 1024 * 1024)).c_str());

	    // The uint64 conversion returns the raw byte count; a 64-bit value needs the I64 length modifier.
	    printf("%8I64u\n", (uint64)ia32size(12345678));
	}
	

Output:

    16 B
	     1KB
	     4KB
	     3MB
	    13TB
	 13319GB
	12345678
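
The output above suggests that the string conversion picks the largest unit that divides the size exactly, which would explain why 13 TB plus 7 GB prints as 13319GB rather than as a rounded number of terabytes. The sketch below reproduces that behavior under this assumption. format_size is a hypothetical helper, not part of ia32lib, and the real implementation may well differ.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of ia32lib): formats a byte count using
// the largest binary unit that divides it without a remainder.
std::string format_size(std::uint64_t bytes)
{
    // "B" carries a leading space so that every suffix is two characters wide,
    // matching the "16 B" line in the sample output.
    static const char *const suffix[] = { " B", "KB", "MB", "GB", "TB" };

    int unit = 0;
    while (unit < 4 && bytes != 0 && bytes % 1024 == 0)
    {
        bytes /= 1024;
        ++unit;
    }
    return std::to_string(bytes) + suffix[unit];
}
```

With this rule, format_size(4096) yields "4KB", and a value that is not a multiple of 1024, such as 12345678, stays in bytes.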
	


ia32error.h

Classes
  Name Definition
  ia32error
class ia32error
	{
	public:
	    enum err_
	    {
	        err_generic,
	        err_ring0_cpu,
	        err_ring0_create,
	        err_ring0_ioctl,
	        err_ring0_size,
	        err_ring0_close,
	        err_counter_overflow,
	        err_counter_family,
	        err_counter_MMX,
	        err_counter_SSE,
	        err_counter_counter,
	        err_invalid
	    };

	    ia32error (err_);
	    operator const char * () const;
	protected:
	    err_ v;
	};

Notes:


ia32driver.h

Constants
  Name Value
  IA32CPU_TYPE 40000
  IOCTL_IA32CPU_READ_MSR CTL_CODE(IA32CPU_TYPE, 0x900, METHOD_BUFFERED, FILE_READ_ACCESS)
  IOCTL_IA32CPU_WRITE_MSR CTL_CODE(IA32CPU_TYPE, 0x901, METHOD_BUFFERED, FILE_WRITE_ACCESS)

Notes:
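
For readers without the DDK headers at hand, here is a sketch of how the two IOCTL codes above come out numerically. It restates the CTL_CODE bit layout (device type in bits 16-31, access in bits 14-15, function in bits 2-13, method in bits 0-1) as a portable function. The names ctl_code, kReadMsr and so on are made up for this illustration; the numeric values of METHOD_BUFFERED, FILE_READ_ACCESS and FILE_WRITE_ACCESS are the ones the Windows headers define.

```cpp
#include <cstdint>

// Values as defined by the Windows headers.
constexpr std::uint32_t kMethodBuffered  = 0; // METHOD_BUFFERED
constexpr std::uint32_t kFileReadAccess  = 1; // FILE_READ_ACCESS
constexpr std::uint32_t kFileWriteAccess = 2; // FILE_WRITE_ACCESS

// Portable restatement of the CTL_CODE macro (hypothetical name).
constexpr std::uint32_t ctl_code(std::uint32_t type, std::uint32_t function,
                                 std::uint32_t method, std::uint32_t access)
{
    return (type << 16) | (access << 14) | (function << 2) | method;
}

constexpr std::uint32_t kIa32CpuType = 40000; // IA32CPU_TYPE

// Numeric values of IOCTL_IA32CPU_READ_MSR and IOCTL_IA32CPU_WRITE_MSR.
constexpr std::uint32_t kReadMsr =
    ctl_code(kIa32CpuType, 0x900, kMethodBuffered, kFileReadAccess);
constexpr std::uint32_t kWriteMsr =
    ctl_code(kIa32CpuType, 0x901, kMethodBuffered, kFileWriteAccess);
```

Under these definitions the read code evaluates to 0x9C406400 and the write code to 0x9C40A404.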


ia32ring0.h

Classes
  Name Definition
  ia32ring0
class ia32ring0
	{
	    HANDLE h;
	public:
	    ia32ring0 ();
	    uint64 rdmsr (uint32 i) const;
	    void wrmsr (uint32 i, uint64 d) const;
	    ~ia32ring0 ();
	};
	

Notes:


ia32cache.h

Classes
  Name Definition
  ia32cache
class ia32cache
	{
	public:
	    enum type_
	    {
	        type_reserved,
	        type_unified,
	        type_instruction,
	        type_trace,
	        type_data,
	        type_invalid
	    };

	    enum _
	    {
	        level_TLB = -1,
	        associativity_Full = -1,
	        block_AnySize = 0
	    };

	    const byte descriptor;
	    const type_ type;
	    const int level;
	    const ia32size capacity;
	    const ia32size block;
	    const int associativity;

	    ia32cache (byte, type_, int, ia32size, ia32size, int);
	    operator const string () const;
	protected:
	    const char * type_text () const;
	    const string associativity_text () const;
	};
Variables
  Name Declaration
  ia32caches extern const ia32cache ia32caches[];
Functions
  Name Prototype
  _ia32cache const ia32cache &_ia32cache (byte);

Notes:
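
As a rough illustration of what a descriptor lookup such as _ia32cache might do, the sketch below searches a small table keyed by the CPUID descriptor byte. This is an assumption about the approach, not the library's code: cache_entry, cache_table and find_cache are hypothetical names, and the three entries are copied from the ia32detect sample output in the Examples section rather than from the full Intel descriptor list.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for ia32cache.
struct cache_entry
{
    std::uint8_t  descriptor;    // descriptor byte reported by CPUID
    int           level;         // 1 = L1, 2 = L2
    std::uint64_t capacity;      // total size in bytes
    std::uint64_t block;         // line size in bytes
    int           associativity; // number of ways
};

// Three entries taken from the sample output in this document.
static const cache_entry cache_table[] =
{
    { 0x08, 1,  16 * 1024, 32, 4 }, // L1 instruction cache
    { 0x0C, 1,  16 * 1024, 32, 4 }, // L1 data cache
    { 0x83, 2, 512 * 1024, 32, 8 }, // L2 unified cache
};

// Linear search; returns a null pointer for an unknown descriptor.
const cache_entry *find_cache(std::uint8_t descriptor)
{
    for (std::size_t i = 0; i < sizeof cache_table / sizeof cache_table[0]; ++i)
        if (cache_table[i].descriptor == descriptor)
            return &cache_table[i];
    return nullptr;
}
```

The real function returns a reference, so it presumably falls back to some "invalid" entry instead of a null pointer; the pointer is used here only to keep the sketch short.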


ia32detect.h

Classes
  Name Definition
  ia32detect
class ia32detect
	{
	public:
	    enum type_
	    {
	        type_OEM,
	        type_OverDrive,
	        type_Dual,
	        type_reserved
	    };

	    enum brand_
	    {
	        brand_na,
	        brand_Celeron,
	        brand_PentiumIII,
	        brand_PentiumIIIXeon,
	        brand_reserved1,
	        brand_reserved2,
	        brand_PentiumIIIMobile,
	        brand_reserved3,
	        brand_Pentium4,
	        brand_invalid
	    };

	    struct version_
	    {
	        bit Stepping  : 4;
	        bit Model     : 4;
	        bit Family    : 4;
	        bit Type      : 2;
	        bit Reserved1 : 2;
	        bit XModel    : 4;
	        bit XFamily   : 8;
	        bit Reserved2 : 4;
	    };

	    struct misc_
	    {
	        byte Brand;
	        byte CLFLUSH;
	        byte Reserved;
	        byte APICId;
	    };

	    struct feature_
	    {
	        bit FPU       : 1; // Floating Point Unit On-Chip
	        bit VME       : 1; // Virtual 8086 Mode Enhancements
	        bit DE        : 1; // Debugging Extensions
	        bit PSE       : 1; // Page Size Extensions
	        bit TSC       : 1; // Time Stamp Counter
	        bit MSR       : 1; // Model Specific Registers
	        bit PAE       : 1; // Physical Address Extension
	        bit MCE       : 1; // Machine Check Exception
	        bit CX8       : 1; // CMPXCHG8B Instruction
	        bit APIC      : 1; // APIC On-Chip
	        bit Reserved1 : 1;
	        bit SEP       : 1; // SYSENTER and SYSEXIT instructions
	        bit MTRR      : 1; // Memory Type Range Registers
	        bit PGE       : 1; // PTE Global Bit
	        bit MCA       : 1; // Machine Check Architecture
	        bit CMOV      : 1; // Conditional Move Instructions
	        bit PAT       : 1; // Page Attribute Table
	        bit PSE36     : 1; // 36-bit Page Size Extension
	        bit PSN       : 1; // Processor Serial Number
	        bit CLFSH     : 1; // CLFLUSH Instruction
	        bit Reserved2 : 1;
	        bit DS        : 1; // Debug Store
	        bit ACPI      : 1; // Thermal Monitor and Software Controlled Clock Facilities
	        bit MMX       : 1; // Intel MMX Technology
	        bit FXSR      : 1; // FXSAVE and FXRSTOR Instructions
	        bit SSE       : 1; // Intel SSE Technology
	        bit SSE2      : 1; // Intel SSE2 Technology
	        bit SS        : 1; // Self Snoop
	        bit Reserved3 : 1;
	        bit TM        : 1; // Thermal Monitor
	        bit Reserved4 : 2;
	    };

	    string vendor;
	    string brand;
	    version_ version;
	    misc_ misc;
	    feature_ feature;
	    byte *cache;

	    ia32detect ();
	    const string version_text () const;
	protected:
	    const char * type_text () const;
	    const string brand_text () const;
	private:
	    uint32 init0 ();
	    void init1 (uint32 *d);
	    void process2 (uint32 d, bool c[]);
	    void init2 (byte count);
	    void init0x80000000 ();
	};
	

Notes:
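
The version_ bit-field above mirrors the processor signature that CPUID returns in EAX, and it relies on the compiler allocating bit-fields starting at the least significant bit, as Microsoft's compilers do on x86. The sketch below decodes the same fields with explicit shifts and masks, which does not depend on bit-field layout. cpu_version and decode_version are hypothetical names introduced for this illustration only.

```cpp
#include <cstdint>

// Hypothetical plain-struct counterpart of ia32detect::version_.
struct cpu_version
{
    unsigned stepping; // bits 0-3
    unsigned model;    // bits 4-7
    unsigned family;   // bits 8-11
    unsigned type;     // bits 12-13
    unsigned xmodel;   // bits 16-19
    unsigned xfamily;  // bits 20-27
};

// Splits a raw processor signature into its fields.
cpu_version decode_version(std::uint32_t eax)
{
    cpu_version v;
    v.stepping = eax & 0xF;
    v.model    = (eax >> 4) & 0xF;
    v.family   = (eax >> 8) & 0xF;
    v.type     = (eax >> 12) & 0x3;
    v.xmodel   = (eax >> 16) & 0xF;
    v.xfamily  = (eax >> 20) & 0xFF;
    return v;
}
```

For example, the signature 0x000006B1 decodes to family 6, model 11, stepping 1, which corresponds to the "6.11.1" version string shown in the Examples section.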


ia32counter.h

Classes
  Name Definition
  ia32counter
class ia32counter
	{
	protected:
	    static uint32 count;
	    uint32 index;
	public:
	    ia32counter (uint32 counters);
	};
	

Notes:


p6counter.h

Classes
  Name Definition
  p6counter
class p6counter: public ia32counter
	{
	public:
	    enum event_
	    {
	        // Data Cache Unit (DCU)
	        DCU_MEMORY_REFERENCE         = 0x43, // DATA_MEM_REFS
	        DCU_LINES_IN                 = 0x45,
	        DCU_M_LINES_IN               = 0x46,
	        DCU_M_LINES_OUT              = 0x47,
	        DCU_MISS_OUTSTANDING         = 0x48,

	        // Instruction Fetch Unit (IFU)
	        IFU_IFETCH                   = 0x80,
	        IFU_IFETCH_MISS              = 0x81,
	        IFU_TLB_MISS                 = 0x85, // ITLB_MISS
	        IFU_MEMORY_STALL             = 0x86,
	        IFU_ILD_STALL                = 0x87, // ILD_STALL

	        // L2 Cache
	        L2_IFETCH                    = 0x28,
	        L2_LOADS                     = 0x29, // L2_LD
	        L2_STORES                    = 0x2A, // L2_ST
	        L2_LINES_IN                  = 0x24,
	        L2_LINES_OUT                 = 0x26,
	        L2_M_LINES_IN                = 0x25,
	        L2_M_LINES_OUT               = 0x27,
	        L2_REQUEST                   = 0x2E, // L2_RQSTS
	        L2_ADDRESS_STROBE            = 0x21, // L2_ADS
	        L2_DATA_BUS_BUSY             = 0x22, // L2_DBUS_BUSY
	        L2_DATA_BUS_BUSY_READ        = 0x23, // L2_DBUS_BUSY_RD

	        // External Bus Logic (EBL)
	        EBL_DATA_READY               = 0x62, // BUS_DRDY_CLOCKS
	        EBL_LOCK                     = 0x63, // BUS_LOCK_CLOCKS
	        EBL_REQ_OUTSTANDING          = 0x60, // BUS_REQ_OUTSTANDING
	        EBL_TRANS_BURST_READ         = 0x65, // BUS_TRAN_BRD
	        EBL_TRANS_READ_OWNER         = 0x66, // BUS_TRAN_RFO
	        EBL_TRANS_WRITEBACK          = 0x67, // BUS_TRANS_WB
	        EBL_TRANS_IFETCH             = 0x68, // BUS_TRAN_IFETCH
	        EBL_TRANS_INVALIDATE         = 0x69, // BUS_TRAN_INVAL
	        EBL_TRANS_PARTIAL_WRITE      = 0x6A, // BUS_TRAN_PWR
	        EBL_TRANS_PARTIAL            = 0x6B, // BUS_TRANS_P
	        EBL_TRANS_IO                 = 0x6C, // BUS_TRANS_IO
	        EBL_TRANS_DEFERRED           = 0x6D, // BUS_TRAN_DEF
	        EBL_TRANS_BURST              = 0x6E, // BUS_TRAN_BURST
	        EBL_TRANS_ANY                = 0x70, // BUS_TRAN_ANY
	        EBL_TRANS_MEMORY             = 0x6F, // BUS_TRAN_MEM
	        EBL_DATA_RECEIVE             = 0x64, // BUS_DATA_RCV
	        EBL_DRIVE_BNR                = 0x61, // BUS_BNR_DRV
	        EBL_DRIVE_HIT                = 0x7A, // BUS_HIT_DRV
	        EBL_DRIVE_HITM               = 0x7B, // BUS_HITM_DRV
	        EBL_SNOOP_STALL              = 0x7E, // BUS_SNOOP_STALL

	        // Floating-Point Unit (FPU)
	        FPU_FLOPS_RETIRED            = 0xC1, // FLOPS,           Counter 0 only
	        FPU_FLOPS_EXECUTED           = 0x10, // FP_COMP_OPS_EXE, Counter 0 only
	        FPU_ASSIST                   = 0x11, // FP_ASSIST,       Counter 1 only
	        FPU_MUL                      = 0x12, // MUL,             Counter 1 only
	        FPU_DIV                      = 0x13, // DIV,             Counter 1 only
	        FPU_DIV_BUSY                 = 0x14, // CYCLES_DIV_BUSY, Counter 0 only

	        // Memory Ordering (MO)
	        MO_LOAD_BLOCKED              = 0x03, // LD_BLOCKS
	        MO_STORE_BUFFER_DRAIN        = 0x04, // SB_DRAINS
	        MO_MISALLIGNMENT             = 0x05, // MISALIGN_MEM_REF
	        SSE_PREFETCH_DISPATCHED      = 0x07, // EMON_KNI_PREF_DISPATCHED
	        SSE_PREFETCH_MISS            = 0x4B, // EMON_KNI_PREF_MISS

	        // Instruction Decoding and Retirement (IDR)
	        IDR_INSTRUCTION_RETIRED      = 0xC0, // INST_RETIRED
	        IDR_UOP_RETIRED              = 0xC2, // UOPS_RETIRED
	        IDR_INSTRUCTION_DECODED      = 0xD0, // INST_DECODED
	        SSE_INSTRUCTION_RETIRED      = 0xD8, // EMON_KNI_INST_RETIRED
	        SSE_COMPUTATION_RETIRED      = 0xD9, // EMON_KNI_COMP_INST_RET

	        // Interrupts (INT)
	        INT_HW_RECEIVED              = 0xC8, // HW_INT_RX
	        INT_MASKED                   = 0xC6, // CYCLES_INT_MASKED
	        INT_PENDING_AND_MASKED       = 0xC7, // CYCLES_INT_PENDING_AND_MASKED

	        // Branches (BR)
	        BR_INSTRUCTION_RETIRED       = 0xC4, // BR_INST_RETIRED
	        BR_MISSPREDICT_RETIRED       = 0xC5, // BR_MISS_PRED_RETIRED
	        BR_TAKEN_RETIRED             = 0xC9,
	        BR_MISSPREDICT_TAKEN_RETIRED = 0xCA, // BR_MISS_PRED_TAKEN_RET
	        BR_INSTRUCTION_DECODED       = 0xE0, // BR_INST_DECODED
	        BR_BTB_MISS                  = 0xE2, // BTB_MISSES
	        BR_BOGUS                     = 0xE4,
	        BR_BACLEAR                   = 0xE6, // BACLEARS

	        // Stalls (STALL)
	        STALL_RESOURCE               = 0xA2, // RESOURCE_STALLS
	        STALL_PARTIAL                = 0xD2, // PARTIAL_RAT_STALLS

	        // Multimedia Extensions (MMX)
	        MMX_INSTRUCTION_EXECUTE      = 0xB0, // MMX_INSTR_EXEC
	        MMX_SATURATING_EXECUTE       = 0xB1, // MMX_SAT_INSTR_EXEC
	        MMX_UOP_EXECUTE              = 0xB2, // MMX_UOPS_EXEC
	        MMX_TYPE_EXECUTE             = 0xB3, // MMX_INSTR_TYPE_EXEC
	        MMX_FPU_TRANSITION           = 0xCC, // FP_MMX_TRANS
	        MMX_ASSIST                   = 0xCD,
	        MMX_INSTRUCTION_RETIRED      = 0xCE, // MMX_INSTR_RET

	        // Segment Register Renaming (SRR)
	        SRR_STALL                    = 0xD4, // SEG_RENAME_STALLS
	        SRR_COUNT                    = 0xD5, // SEG_REG_RENAME
	        SRR_COUNT_RETIRED            = 0xD6, // RET_SEG_RENAMES

	        SEGMENT_REGISTER_LOADS       = 0x06, // SEGMENT_REG_LOADS
	        CPU_CLOCKS_UNHALTED          = 0x79  // CPU_CLK_UNHALTED
	    };

	    enum mask_
	    {
	        NONE                      = 0x0,

	        L2_M                      = 0x8,
	        L2_E                      = 0x4,
	        L2_S                      = 0x2,
	        L2_I                      = 0x1,
	        L2_MESI                   = 0xF,

	        EBL_SELF                  = 0x00,
	        EBL_ANY                   = 0x20,

	        SSE_PREFETCH_NTA          = 0x00,
	        SSE_PREFETCH_T1           = 0x01,
	        SSE_PREFETCH_T2           = 0x02,
	        SSE_WEAKLY_ORDERED_STORES = 0x03,

	        SSE_PACKED_AND_SCALAR     = 0x00,
	        SSE_SCALAR                = 0x01,

	        MMX_PACKED_MULTIPLY       = 0x01,
	        MMX_PACKED_SHIFT          = 0x02,
	        MMX_PACK                  = 0x04,
	        MMX_UNPACK                = 0x08,
	        MMX_PACKED_LOGICAL        = 0x10,
	        MMX_PACKED_ARITHMETIC     = 0x20,
	        MMX_ANY                   = 0x3F,

	        MMX_TO_FPU                = 0x0,
	        MMX_FROM_FPU              = 0x1,

	        SRR_ES                    = 0x1,
	        SRR_DS                    = 0x2,
	        SRR_FS                    = 0x4,
	        SRR_GS                    = 0x8,
	        SRR_ANY                   = 0xF
	    };

	    struct
	    {
	        bit event    : 8;
	        bit mask     : 8;
	        bit ring123  : 1;
	        bit ring0    : 1;
	        bit edge     : 1;
	        bit pin      : 1;
	        bit int_     : 1;
	        bit reserved : 1;
	        bit enable   : 1;
	        bit invert   : 1;
	        bit count    : 8;
	    } config;

	    p6counter (event_ event, mask_ mask = NONE, byte count = 0, bool invert = false);
	    operator const uint64 () const;
	protected:
	    ia32ring0 r0;
	};
	

Notes:
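
The anonymous config structure above follows the layout of the P6 event select register: event code in bits 0-7, unit mask in bits 8-15, the flag bits in 16-23 (bit 21 reserved) and the counter mask in bits 24-31. The sketch below packs the same fields into a 32-bit value with explicit shifts. evtsel and pack_evtsel are hypothetical names, and the choice of flags in the usage note is only an example, not necessarily what p6counter programs.

```cpp
#include <cstdint>

// Hypothetical plain-struct mirror of the config bit-field in p6counter.
struct evtsel
{
    std::uint8_t event;   // bits 0-7:   event code
    std::uint8_t mask;    // bits 8-15:  unit mask
    bool         ring123; // bit 16:     count while in rings 1-3
    bool         ring0;   // bit 17:     count while in ring 0
    bool         edge;    // bit 18:     edge detect
    bool         pin;     // bit 19:     pin control
    bool         int_;    // bit 20:     interrupt on overflow
    bool         enable;  // bit 22:     enable (bit 21 is reserved)
    bool         invert;  // bit 23:     invert counter mask comparison
    std::uint8_t count;   // bits 24-31: counter mask
};

// Assembles the 32-bit register image from the individual fields.
std::uint32_t pack_evtsel(const evtsel &f)
{
    return  std::uint32_t(f.event)
         | (std::uint32_t(f.mask)    <<  8)
         | (std::uint32_t(f.ring123) << 16)
         | (std::uint32_t(f.ring0)   << 17)
         | (std::uint32_t(f.edge)    << 18)
         | (std::uint32_t(f.pin)     << 19)
         | (std::uint32_t(f.int_)    << 20)
         | (std::uint32_t(f.enable)  << 22)
         | (std::uint32_t(f.invert)  << 23)
         | (std::uint32_t(f.count)   << 24);
}
```

As a usage example, L2_REQUEST (0x2E) with the L2_MESI mask (0xF), counting in all rings with the enable bit set, packs to 0x00430F2E.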


Examples

ia32detect

This example fully exploits the CPU detection features; all supported features are demonstrated here. The complete source code (not much) is provided below.

#include "ia32.h"

	int main ()
	{
	    ia32detect ia32;

	    printf("Vendor  = %s\n\n", ia32.vendor.c_str());
	    printf("Brand   = %s\n\n", ia32.brand.c_str());
	    printf("Version = %s\n\n", ia32.version_text().c_str());
	    printf("Cache: \n\n");

	    for (int i = 0; ia32.cache[i]; i++)
	        printf("%s\n", ((string)_ia32cache(ia32.cache[i])).c_str());

	    printf("\nFeatures:\n\n");

	    printf("%c %s\n", ia32.feature.FPU   ? '+' : '-', "Floating Point Unit On-Chip");
	    printf("%c %s\n", ia32.feature.VME   ? '+' : '-', "Virtual 8086 Mode Enhancements");
	    printf("%c %s\n", ia32.feature.DE    ? '+' : '-', "Debugging Extensions");
	    printf("%c %s\n", ia32.feature.PSE   ? '+' : '-', "Page Size Extensions");
	    printf("%c %s\n", ia32.feature.TSC   ? '+' : '-', "Time Stamp Counter");
	    printf("%c %s\n", ia32.feature.MSR   ? '+' : '-', "Model Specific Registers");
	    printf("%c %s\n", ia32.feature.PAE   ? '+' : '-', "Physical Address Extension");
	    printf("%c %s\n", ia32.feature.MCE   ? '+' : '-', "Machine Check Exception");
	    printf("%c %s\n", ia32.feature.CX8   ? '+' : '-', "CMPXCHG8B Instruction");
	    printf("%c %s\n", ia32.feature.APIC  ? '+' : '-', "APIC On-Chip");
	    printf("%c %s\n", ia32.feature.SEP   ? '+' : '-', "SYSENTER and SYSEXIT instructions");
	    printf("%c %s\n", ia32.feature.MTRR  ? '+' : '-', "Memory Type Range Registers");
	    printf("%c %s\n", ia32.feature.PGE   ? '+' : '-', "PTE Global Bit");
	    printf("%c %s\n", ia32.feature.MCA   ? '+' : '-', "Machine Check Architecture");
	    printf("%c %s\n", ia32.feature.CMOV  ? '+' : '-', "Conditional Move Instructions");
	    printf("%c %s\n", ia32.feature.PAT   ? '+' : '-', "Page Attribute Table");
	    printf("%c %s\n", ia32.feature.PSE36 ? '+' : '-', "36-bit Page Size Extension");
	    printf("%c %s\n", ia32.feature.PSN   ? '+' : '-', "Processor Serial Number");
	    printf("%c %s\n", ia32.feature.CLFSH ? '+' : '-', "CLFLUSH Instruction");
	    printf("%c %s\n", ia32.feature.DS    ? '+' : '-', "Debug Store");
	    printf("%c %s\n", ia32.feature.ACPI  ? '+' : '-', "Thermal Monitor and Software Controlled Clock Facilities");
	    printf("%c %s\n", ia32.feature.MMX   ? '+' : '-', "Intel MMX Technology");
	    printf("%c %s\n", ia32.feature.FXSR  ? '+' : '-', "FXSAVE and FXRSTOR Instructions");
	    printf("%c %s\n", ia32.feature.SSE   ? '+' : '-', "Intel SSE Technology");
	    printf("%c %s\n", ia32.feature.SSE2  ? '+' : '-', "Intel SSE2 Technology");
	    printf("%c %s\n", ia32.feature.SS    ? '+' : '-', "Self Snoop");
	    printf("%c %s\n", ia32.feature.TM    ? '+' : '-', "Thermal Monitor");
	}
	

Below is the output from my laptop. If you decide to install the package, please run this small program and e-mail me the results.

Vendor  = GenuineIntel

	Brand   = Intel(R) Pentium(R) III Mobile CPU      1000MHz

	Version = 6.11.1 Intel OEM Processor XVersion(0.0)

	Cache: 

	0x01: TLB instruction, Entries( 32), PageSize(4KB), Associativity(4-way)
	0x02: TLB instruction, Entries(  2), PageSize(4MB), Associativity( Full)
	0x03: TLB        data, Entries( 64), PageSize(4KB), Associativity(4-way)
	0x04: TLB        data, Entries(  8), PageSize(4MB), Associativity(4-way)
	0x08: L1 instruction$, Size(  16KB), Block(  32 B), Associativity(4-way)
	0x0c: L1        data$, Size(  16KB), Block(  32 B), Associativity(4-way)
	0x83: L2     unified$, Size( 512KB), Block(  32 B), Associativity(8-way)

	Features:

	+ Floating Point Unit On-Chip
	+ Virtual 8086 Mode Enhancements
	+ Debugging Extensions
	+ Page Size Extensions
	+ Time Stamp Counter
	+ Model Specific Registers
	+ Physical Address Extension
	+ Machine Check Exception
	+ CMPXCHG8B Instruction
	- APIC On-Chip
	+ SYSENTER and SYSEXIT instructions
	+ Memory Type Range Registers
	+ PTE Global Bit
	+ Machine Check Architecture
	+ Conditional Move Instructions
	+ Page Attribute Table
	+ 36-bit Page Size Extension
	- Processor Serial Number
	- CLFLUSH Instruction
	- Debug Store
	- Thermal Monitor and Software Controlled Clock Facilities
	+ Intel MMX Technology
	+ FXSAVE and FXRSTOR Instructions
	+ Intel SSE Technology
	- Intel SSE2 Technology
	- Self Snoop
	- Thermal Monitor
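
Each '+' or '-' in the feature list corresponds to one bit of the feature word that CPUID returns in EDX, in the same order as the feature_ bit-field. The sketch below tests individual bits with a shift and a mask. The names are hypothetical, and sample_edx is a value reconstructed from the output above with all reserved bits assumed to be zero, so it is an illustration rather than a captured register dump.

```cpp
#include <cstdint>

// Bit positions of a few feature flags, following the feature_ layout.
enum feature_bit
{
    bit_FPU  = 0,
    bit_APIC = 9,
    bit_PSN  = 18,
    bit_MMX  = 23,
    bit_SSE  = 25,
    bit_SSE2 = 26
};

// Feature word reconstructed from the sample output (reserved bits assumed 0).
const std::uint32_t sample_edx = 0x0383F9FFu;

// Returns true if the given feature bit is set in the feature word.
bool has_feature(std::uint32_t edx, unsigned bit)
{
    return ((edx >> bit) & 1u) != 0;
}
```

Here has_feature(sample_edx, bit_SSE) is true and has_feature(sample_edx, bit_SSE2) is false, matching the '+' and '-' lines for SSE and SSE2.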
	

ia32p6

This example demonstrates the usage of the Intel P6 Hardware Performance Monitoring Counters. Processors from this family have two almost identical counters. In the source below, one of them is set up to count memory references and the other to count requests to the L2 cache (which are actually nothing else but L1 misses!).

#include "ia32.h"
	#include "p6counter.h"

	int main ()
	{
	    p6counter c1(p6counter::L2_REQUEST, p6counter::L2_MESI);
	    p6counter c2(p6counter::DCU_MEMORY_REFERENCE);

	    const int c = 10000000;
	    static int a[c];

	    for (int ai1 = 0; ai1 < c; ai1++)
	        a[ai1]++;

	    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

	    uint64 t1 = c1;
	    uint64 t2 = c2;

	    for (int ai2 = 0; ai2 < c; ai2++)
	        a[ai2] *= 13;

	    printf("L1 misses   = %I64d\nL1 accesses = %I64d\n", c1 - t1, c2 - t2);
	}
	

We walk an array of 10000000 integers, multiplying each element by 13 (a load access followed by a store access, i.e. 2 accesses per element). Because the L1 line size is 32 bytes and an integer takes 4 bytes, we have 8 elements per line, or about 1250000 cache lines accessed (all misses). This totals 20000000 memory accesses and 1250000 L1 misses. The excess of 1495 misses and 10728 memory accesses in the results below is due to OS noise, the amount of which (<<1%) is quite acceptable.

L1 misses   = 1251495
	L1 accesses = 20010728
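
The expected figures can be checked with a few lines of arithmetic. The sketch below assumes 4-byte integers, a 32-byte L1 line and an array that starts on a line boundary; expected_counts and expect are hypothetical names used only here.

```cpp
#include <cstdint>

struct expected_counts
{
    std::uint64_t accesses; // loads plus stores
    std::uint64_t misses;   // distinct cache lines touched
};

// Expected counts for one sequential pass over an array.
expected_counts expect(std::uint64_t elements, std::uint64_t element_size,
                       std::uint64_t line_size, std::uint64_t accesses_per_element)
{
    const std::uint64_t per_line = line_size / element_size; // 32 / 4 = 8

    expected_counts e;
    e.accesses = elements * accesses_per_element;
    e.misses   = (elements + per_line - 1) / per_line;       // rounded up
    return e;
}
```

For 10000000 elements with 2 accesses each, this gives 20000000 accesses and 1250000 misses, so the measured values exceed the expectation by 10728 accesses (about 0.05%) and 1495 misses (about 0.12%).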
	

The code of the example employs many techniques to reduce the noise during measurements. Here are the most important things you need to keep in mind when monitoring performance in this setting:

  1. Microsoft Windows NT / 2K / XP does not allocate all the memory your process requested instantly after the request. Rather, pages are allocated when they are first accessed. This means that when you access a memory page for the first time, a page fault occurs and the OS takes over. The OS exception handler can execute millions of instructions, resulting in excessive noise in the measurements. For this reason the code above walks the array in advance to make sure all pages are present in memory when the counting starts.
  2. Because Microsoft Windows NT / 2K / XP is a preemptive multitasking operating system, our program is not the only thing running on the machine. Performance counters are in the CPU and they count for all processes simultaneously. In order to reduce foreign code noise, it is advisable to boost the priority of your process to maximum level (real-time priority). This setting will reserve the machine almost exclusively to your application and the overall responsiveness might seem jerky until the program terminates. The code above achieves the priority boost by the SetPriorityClass Windows system call: SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
  3. Last but not least, make sure you avoid obvious counting overlaps. An example would be to split the final printf statement in the example above into two different function calls. Note that the current counter value is read when the '-' sign is evaluated. Thus if you print the delta of the first counter (cache misses in this case) in a separate function call to printf, the second counter (memory references in this case) will count the data accesses performed during this function call as well.
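
One way to make the third point hard to get wrong is to read both counters back to back before and after the measured code, and to print only afterwards. The sketch below does this for any counter type that, like p6counter, yields its current value when converted to a 64-bit integer. measure and fake_counter are hypothetical; fake_counter exists only so that the sketch can be compiled and tried without the driver.

```cpp
#include <cstdint>
#include <utility>

// Stand-in for a hardware counter: reads its value from a variable.
struct fake_counter
{
    const std::uint64_t *value;
    operator std::uint64_t () const { return *value; }
};

// Runs work() and returns the deltas of both counters across it.
template <typename CounterA, typename CounterB, typename Work>
std::pair<std::uint64_t, std::uint64_t>
measure(const CounterA &a, const CounterB &b, Work work)
{
    // Start readings, taken back to back.
    const std::uint64_t a0 = a;
    const std::uint64_t b0 = b;

    work();

    // End readings, again back to back, before any output is produced.
    const std::uint64_t a1 = a;
    const std::uint64_t b1 = b;

    return std::make_pair(a1 - a0, b1 - b0);
}
```

The caller can then print the two deltas in as many separate calls as it likes, since the counters are no longer read at that point. The page pre-touching and the priority boost from the first two points still have to happen before measure is called.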

Future Work

Although this document seems quite long, it is more of a draft than something completed.

There are many (orthogonal) directions this work can be extended.

The first priority is of course implementing ia32counter subclasses (like p6counter) for other processor families, such as Intel Pentium 4, Intel Itanium and different AMD models. I believe it is important to understand the specifics of Intel P4, as it is the first processor ever to provide precise event-based sampling performance monitoring. What this means is that one can get the processor state when an event (e.g. a cache miss) occurs, so the exact instruction causing the miss is known. This can further improve the precision of research methods in this area.

Another direction is to extend the CPU detection procedure with empirical measurements that can detect the memory hierarchy in conventional software (a la HW1 cs612). As processors become more and more sophisticated from a hardware point of view, this task becomes harder and harder, but I believe it is still doable. This is a very important step if we want to build compilers that dynamically tune themselves to the current CPU (possibly a CPU that did not exist when the compiler was released!).

Last, I am not sure how important this is, but this document is way too long and needs better structure and probably some factoring. If the library grows bigger, better documentation will be needed, or it will become yet another of those public domain things where you need to read all the headers before you can start using it. I said this before and I will repeat it: if you ever plan to use this thing, please, please give feedback. Contributions are also more than welcome, but if you have an idea I would suggest coordinating it with me, as there is a good chance it is already under way...

So far I am not worried if this piece of software is useful or not. For sure it is useful for me. I bet it would also be useful for cs612... I hope it is useful for you too. Good luck!

References

  1. Intel IA-32 Developer Manuals v.1 - 3, http://www.intel.com
  2. http://www.sandpile.org