With balloon(4) as the only current consumer of uvm_hotplug(9), I lacked the
knowledge needed to use the usual benchmarking tools in this context, and I had
not spent enough time with dtrace(1) to exploit its features to benchmark the
PHYS_TO_VM_PAGE() macro.

The only tool I had some confidence in using was ATF, and after some rounds of
discussion with Cherry (cherry@) we came up with a plan that gives a very rough
measure of performance. Here is how it goes.

The most frequent operation impacted by our change is the lookup call
uvm_physseg_find(), which now searches through an R-B Tree instead of a Static
Array. To simulate this, we copied the PHYS_TO_VM_PAGE() macro and its related
code over from uvm_page.c and then wrote some ATF tests that load some memory
segments and follow up with multiple calls to PHYS_TO_VM_PAGE(), each searching
for a random address within the plugged-in segments.
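
To make the operation being timed a little more concrete, here is a simplified
user-space sketch of a segment lookup. It is not the kernel's
uvm_physseg_find(): struct seg and seg_find() are hypothetical names, and a
sorted array searched by bisection is used only because it is compact to show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a physical segment: page frames [start, end). */
struct seg {
        uint64_t start;
        uint64_t end;
};

/*
 * Return the index of the segment containing page frame pf, or -1 if pf
 * falls into a hole.  segs[] must be sorted by start and non-overlapping.
 */
static int
seg_find(const struct seg *segs, size_t nsegs, uint64_t pf)
{
        size_t lo = 0, hi = nsegs;

        assert(segs != NULL || nsegs == 0);

        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;

                if (pf < segs[mid].start)
                        hi = mid;               /* look in the lower half */
                else if (pf >= segs[mid].end)
                        lo = mid + 1;           /* look in the upper half */
                else
                        return (int)mid;        /* start <= pf < end */
        }
        return -1;
}
```

Each PHYS_TO_VM_PAGE() call in the tests boils down to one such "which segment
holds this frame" query, so the tests effectively time a long series of them.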

ATF_TC(uvm_physseg_100);
ATF_TC_HEAD(uvm_physseg_100, tc)
{
        atf_tc_set_md_var(tc, "descr", "Load test uvm_phys_to_vm_page() with \
            100 calls, VM_PHYSSEG_MAX is 32.");
}
ATF_TC_BODY(uvm_physseg_100, tc)
{
        paddr_t pa;

        setup();

        for(paddr_t i = VALID_START_PFN_1;
            i < VALID_END_PFN_1; i += PF_STEP) {
                uvm_page_physload(i, i + PF_STEP, i, i + PF_STEP,
                    VM_FREELIST_DEFAULT);
        }

        ATF_REQUIRE_EQ(VM_PHYSSEG_MAX, uvm_physseg_get_entries());

        srandom((unsigned)time(NULL));
        for(int i = 0; i < 100; i++) {
                pa = (paddr_t) random() % (paddr_t) ctob(VALID_END_PFN_1);
                PHYS_TO_VM_PAGE(pa);
        }

        ATF_CHECK_EQ(true, true);
}

The above snippet tests 100 calls, as stated in its description. This
methodology is not a perfect load test, since each iteration also calls
random(), whose cost adds up in the measured runtime of the function we are
trying to load test. I could have spent a bit more time removing this overhead
by generating the random numbers outside the test. After some tweaking I managed
to write tests ranging from 100 calls to 100 million calls and then measure the
time they take. You might also notice the ATF_CHECK_EQ(true, true) at the bottom
of the test, which means the test will never fail. This is deliberate: the test
is NOT a check of the correctness of the function being called. We assume the
function works as expected when this test is running.
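
One way the random() overhead could be kept out of the measured loop is to
generate all the addresses up front. The following is only a sketch under
assumptions: fake_paddr_t, pregen_addrs(), run_lookups() and stub_lookup() are
hypothetical names that are not part of the actual tests, the lookup is passed
in as a function pointer instead of calling PHYS_TO_VM_PAGE(), and the standard
rand() is used in place of random() for portability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for paddr_t; the real type comes from the kernel headers. */
typedef uint64_t fake_paddr_t;

/* Hypothetical lookup: pretend every address below 1 MB is mapped. */
#define STUB_MAPPED_LIMIT ((fake_paddr_t)1 << 20)

static int
stub_lookup(fake_paddr_t pa)
{
        return pa < STUB_MAPPED_LIMIT;
}

/*
 * Fill addrs[0..n-1] with pseudo-random addresses in [0, limit).
 * Intended to run once, before the timed loop starts.
 */
static void
pregen_addrs(fake_paddr_t *addrs, size_t n, fake_paddr_t limit, unsigned seed)
{
        assert(addrs != NULL && limit > 0);

        srand(seed);
        for (size_t i = 0; i < n; i++)
                addrs[i] = (fake_paddr_t)rand() % limit;
}

/*
 * Timed part: only the lookups run here, no random number generation.
 * Returns how many lookups reported a hit.
 */
static size_t
run_lookups(const fake_paddr_t *addrs, size_t n, int (*lookup)(fake_paddr_t))
{
        size_t hits = 0;

        for (size_t i = 0; i < n; i++)
                hits += lookup(addrs[i]) ? 1 : 0;
        return hits;
}
```

The trade-off is memory: 100 million pre-generated 8-byte addresses need about
800 MB, so for the largest runs a smaller buffer would probably have to be
reused several times.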

This specific test was written so that it runs against both the R-B Tree and
the Static Array implementations, giving an apples-to-apples comparison of how
much they differ.

A second type of test we came up with is a search done on highly fragmented
memory segments that have been unplugged from a fixed chunk.

ATF_TC(uvm_physseg_1MB);
ATF_TC_HEAD(uvm_physseg_1MB, tc)
{
        atf_tc_set_md_var(tc, "descr", "Load test uvm_phys_to_vm_page() with \
            10,000,000 calls, VM_PHYSSEG_MAX is 32 on 1 MB Segment.");
}
ATF_TC_BODY(uvm_physseg_1MB, t)
{
        paddr_t pa = 0;

        paddr_t pf = 0;

        psize_t pf_chunk_size = 0;

        psize_t npages1 = (VALID_END_PFN_1 - VALID_START_PFN_1);

        psize_t npages2 = (VALID_END_PFN_2 - VALID_START_PFN_2);

        struct vm_page *slab = malloc(sizeof(struct vm_page) *
            (npages1 + npages2));

        setup();

        /* We start with zero segments */
        ATF_REQUIRE_EQ(true, uvm_physseg_plug(VALID_START_PFN_1, npages1, NULL));
        ATF_REQUIRE_EQ(1, uvm_physseg_get_entries());

        /* Post boot: Fake all segments and pages accounted for. */
        uvm_page_init_fake(slab, npages1 + npages2);

        ATF_REQUIRE_EQ(true, uvm_physseg_plug(VALID_START_PFN_2, npages2, NULL));
        ATF_REQUIRE_EQ(2, uvm_physseg_get_entries());

        srandom((unsigned)time(NULL));
        for(pf = VALID_START_PFN_2; pf < VALID_END_PFN_2; pf += PF_STEP) {
                pf_chunk_size = (psize_t) random() % (psize_t) (PF_STEP - 1) + 1;
                uvm_physseg_unplug(pf, pf_chunk_size);
        }

        for(int i = 0; i < 10000000; i++) {
                pa = (paddr_t) random() % (paddr_t) ctob(VALID_END_PFN_2);
                if(pa < ctob(VALID_START_PFN_2))
                        pa += ctob(VALID_START_PFN_2);
                PHYS_TO_VM_PAGE(pa);
        }

        ATF_CHECK_EQ(true, true);
}

This requires the boot process to be faked, since we need to invoke
uvm_physseg_unplug() to fragment the memory. After this, 10 million calls are
made to the PHYS_TO_VM_PAGE() macro. The memory segment sizes were varied from
1 MB to 256 MB because my VirtualBox instance of NetBSD had a total of 512 MB
to spare. This test is specific to the R-B Tree implementation and cannot be
run against the Static Array implementation: VM_PHYSSEG_MAX limits the number
of fragments the array can hold, and since that is a very small value like 32,
it would not make much sense to test it.

Here is a quick and dirty output from the atf-run + atf-report on the series of
tests run with uvm_hotplug(9) enabled.

t_uvm_physseg_load (1/1): 11 test cases
    uvm_physseg_100: [0.003286s] Passed.
    uvm_physseg_100K: [0.010982s] Passed.
    uvm_physseg_100M: [8.842482s] Passed.
    uvm_physseg_10K: [0.004398s] Passed.
    uvm_physseg_10M: [0.954270s] Passed.
    uvm_physseg_128MB: [2.176629s] Passed.
    uvm_physseg_1K: [0.002702s] Passed.
    uvm_physseg_1M: [0.094821s] Passed.
    uvm_physseg_1MB: [0.984185s] Passed.
    uvm_physseg_256MB: [2.485398s] Passed.
    uvm_physseg_64MB: [0.914363s] Passed.
[16.478686s]

Summary for 1 test programs:
    11 passed test cases.
    0 failed test cases.
    0 expected failed test cases.
    0 skipped test cases.
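
To compare runs of different sizes, the wall-clock times above can be turned
into an average cost per call. The helper below is a hypothetical illustration
(ns_per_call() is not part of the tests), and the result still includes
setup(), segment loading and the random() calls, so it is only a rough upper
bound on the lookup cost itself.

```c
#include <assert.h>

/* Average cost of one call in nanoseconds, given a test's wall-clock time. */
static double
ns_per_call(double seconds, double ncalls)
{
        assert(ncalls > 0.0);
        return seconds * 1e9 / ncalls;
}
```

For example, ns_per_call(8.842482, 100e6) comes to roughly 88 ns for
uvm_physseg_100M, while ns_per_call(0.003286, 100) is close to 33 microseconds
for uvm_physseg_100, which suggests that fixed overhead dominates the short
runs.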

I will need to spend some time creating an eye-candy report pretty soon to make
this presentable to a wider audience, so keep an eye out for updates.