Crypto Performance – Introduction

Before we dive into benchmarks and measurements, we need to get some basic understanding of the problem we are trying to solve and the techniques that are being used. Without that, we run the risk of comparing apples and oranges.

Cryptography is about mathematical techniques to encode messages in a way that we can

  • hide the content of the message from anyone but the recipient
  • allow both sides of the conversation to verify the other is who they claim to be

To do this in a practical way, we deploy different techniques. They build on the same mathematical framework and it is easy to get confused. The following brief introduction should be enough to understand the performance implications.

Asymmetric encryption
Asymmetric encryption gives each party in the conversation a pair of keys. One key is private and known only to the owner; the other key is public and shared with everyone who wishes to communicate. The theory is that a message encoded with the public key of user X can only be decoded using the private key of user X. When user Y sends a message to X, Y first encodes the message with their own private key, and then with the public key of X. User X decodes the message with their private key and then with the public key of Y. Messages from X to Y follow the same process, but with the roles of the keys reversed. Because both sides need their private key to participate, we avoid interference from anyone without that private key.

Though this process ensures both privacy and authenticity, the encryption takes a lot of computing resources and is impractical for large data volumes. The CPU cost of the algorithms is deliberate, because that is what makes it impractical to break the code by brute force.

The major factor in CPU cost is the length of the key used in the process, which also determines the strength of the encryption. Popular key lengths are 1024, 2048 or even 4096 bits. Lacking detailed measurements for now, understand that encoding and decoding with a 4096-bit key is much, much more than 4 times as costly as using a 1024-bit key. Comparing performance of cryptographic routines makes no sense if we don’t know what key length is being used.
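
As a rough illustration of how steeply the cost grows with key length, the small Python sketch below times the private-key operation for different key sizes. It uses the Python cryptography package (not the openssl path discussed elsewhere in this article) and the absolute numbers depend entirely on your hardware and library version, so treat it as a sketch, not a benchmark.

import time
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# OAEP padding for the encrypt/decrypt round trip (illustration only)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

for bits in (1024, 2048, 4096):
    key = rsa.generate_private_key(public_exponent=65537, key_size=bits)
    ciphertext = key.public_key().encrypt(b"x" * 16, oaep)
    n = 50
    start = time.process_time()            # CPU time, not wall-clock time
    for _ in range(n):
        key.decrypt(ciphertext, oaep)      # the private-key operation
    cpu_ms = (time.process_time() - start) / n * 1000
    print(f"RSA-{bits}: {cpu_ms:.2f} ms CPU per private-key decrypt")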

Symmetric Encryption
The ciphers for symmetric encryption involve a single key that is used both for encoding and decoding of the message. The algorithms for encoding and decoding still require a fair amount of computing power, but are efficient enough to use for large data volumes.

While the encoded message can only be decoded with that particular key, the challenge is how to keep that key secret. If you want to communicate with many different parties, you need to keep a separate key for each of them. And when you break up with someone, both of you would have to generate new secret keys and share them with all your partners.

The CPU cost for symmetric encryption and decryption varies with the chosen algorithm and normally grows linearly with the amount of data encrypted (so with a given cipher, encoding 100 MB takes about twice as much CPU as encoding 50 MB). Symmetric ciphers normally operate on groups of bits (blocks) and are referred to as block ciphers. In general a larger block reduces the overhead, but messages need to be padded to a multiple of the block size. When dealing with lots of short messages, the padding may cause us to encode much more data than necessary and waste resources. Padding can be avoided by using a stream cipher mode (such as CFB).
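
As a quick illustration of that padding overhead, the Python sketch below computes how much a message grows with PKCS#7-style padding and a 16-byte block cipher like AES; the message sizes are made up.

BLOCK = 16          # AES block size in bytes

def padded_len(n, block=BLOCK):
    # PKCS#7 always adds at least one byte and rounds up to a full block
    return (n // block + 1) * block

for msg in (5, 16, 100, 1000):
    enc = padded_len(msg)
    print(f"{msg:5d} byte message -> {enc:5d} bytes encrypted "
          f"({100 * (enc - msg) / msg:.0f}% overhead)")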

Message Digest
The message digest or hash function is different from the ciphers in that it does not encode the message in a way that would let anyone recover the original message. Digest functions compute a message signature (like a checksum). The signature is very short in comparison to the message, but the bits of the message are mixed up well enough that nobody can produce the signature without having the original message. This property is vital in secure communication protocols because parties can compare the result of operations without exchanging the actual result.

The same theory is used to produce the md5 checksum for ISO images after transfer over the network. When the checksum computed by md5sum matches the published one, we can be confident no bits were dropped or corrupted in the process.
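
For those who want to see the mechanics, this is roughly what md5sum does, sketched in Python; the file name is just an example.

import hashlib

def md5_of(path, chunk=1024 * 1024):
    # hash the file in chunks so we never need it all in memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# print(md5_of("dvd1.iso"))    # compare against the published checksum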

Hybrid Crypto System
Using the techniques shown above, secure communication is typically done in such a way that the parties start the conversation using asymmetric encryption with their private and public keys. Within this secure conversation a temporary secret key is exchanged between the parties. This secret key is then used to encode and decode the actual payload of the conversation. Depending on the data volume and required level of confidentiality, a new secret key may be established several times during the conversation. Popular protocols for this security handshake are RSA, DSA and DHE. Note that such a handshake involves several of these cryptographic operations and requires a fair amount of CPU cycles. For conversations that transfer only a small amount of data, the cost of encoding the actual payload of the conversation may be negligible.

In addition to exchanging the secret key for the next part of the conversation, the protocol may also involve verification of the authenticity of one or both parties (validating the chain of signed certificates). From an end-user perspective, all these components add to the perceived transaction response time.

Economics of Crypto Express

We learned before that Crypto Express hardware can save CPU cycles. The more cryptographic operations you do, the more CPU cycles you save. We would expect that at some level of utilization, the saved CPU cycles make up for the cost of a Crypto Express. It’s hard to provide exact numbers without knowing what a CEX or IFL would cost (remember that an extra IFL also raises your software bill) but at least we can show the approach to make that trade-off.

One of my friends provided some data from a system where CEX4S cards are configured as CEX4A processors. I was told that the benefit of CEX4A was such that the customer would notice when the hardware was not available. That surprised me a bit since my other measurements suggested the CEX saved CPU cycles but did not make it faster.

 ESACRYPT 0/1-----Request Rate/Second------->         <-NQ to DQ> <-Rejects->
          <-NQ Rate-> <-DQ Rate-> <Completed>   Queue <-ms/oper-> <-per sec->
 Time      Virt  Real  Virt  Real  Virt  Real  Length  Virt  Real  Virt  Real
 -------- ----- ----- ----- ----- ----- -----  ------ ----- ----- ----- -----
 18:48:00   5.0   5.0  17.9  11.0   5.0   5.0       0  4.79  3.10     0     0
 18:49:00   5.0   5.0  17.9  11.1   5.0   5.0       0  4.74  3.10     0     0
 18:50:00   5.4   5.4  19.4  12.0   5.4   5.4       0  4.74  3.06     0     0
 18:51:00   4.8   4.8  17.5  10.8   4.8   4.8       0  4.87  3.19     0     0
 18:52:00   4.6   4.6  17.3  10.6   4.6   4.6       0  4.85  3.15     0     0
 18:53:00   4.9   4.9  17.4  10.9   4.9   4.9       0  4.70  3.12     0     0


The data in ESACRYPT shows that the z/VM system does on average 5 cryptographic operations per second (the “Completed” column). The “NQ to DQ” columns show response time observed by z/VM (the “Real” value) and observed by the virtual machine (the “Virt” column). So for the Linux guest, these cryptographic operations complete in slightly under 5 ms.

Unfortunately this does not tell me "what" kind of operations were performed. Without knowing that, we can’t tell how much CPU time it would take to do these operations in software. We might ask the application owners, but in the case of the CEX4A we can actually get that from the hardware metrics of the card.

 ESACRYPA 2/3            <--RSA 2048 ME------------> <--RSA 2048 CRT----------->
 Time     AP# Type Util% Crypto Time(s)  Ops/s ms/op Crypto Time(s)  Ops/s ms/op
 18:53:00   0   10  0.52      0       0      0     .    145   0.313    2.4  2.16
 18:53:00   1   10  0.52      0       0      0     .    144   0.311    2.4  2.16
 18:53:00   2   10  0.51      0       0      0     .    142   0.306    2.4  2.16
 18:53:00   3   10  0.52      0       0      0     .    144   0.311    2.4  2.16


The display above shows the data from the 4 installed CEX4A processors. The operations performed are 2048 bit RSA CRT operations, the CPU intensive part of the SSL handshake. z/VM appears to balance the work over the available processors, with each of them doing some 140-145 cryptographic operations in a minute. For this interval, the total of 575 gives an average of 9.6 per second. This is twice the amount we got from the previous screen. The reason is that the system has two LPARs that run a very similar workload and share the Crypto Express cards. Since the IFLs are also shared (and software licensed for the two LPARs will be the same) we can work with the combined numbers for the two LPARs.

An interesting observation is that these 2048 bit CRT operations appear to take 2.16 ms, so 144 of them keep the card busy for 311 ms during this minute. Divided by 60 seconds gives the utilization of about 0.51% for each of the cards.

Since the customer runs SLES11, we know from earlier measurements that this operation takes 17.4 ms in software on a z12B, so about 13 ms on the z12EC. If we need to do those 9.6 handshakes per second in software, that would cost us about 130 ms every second, or 13% of a CPU. So in this case, the 4 CEX4A cards save the customer 13% of a CPU.
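
For reference, here is the same arithmetic as a few lines of Python, using only the figures quoted above (no new measurements; rounding explains the small differences).

ops_per_sec   = 9.6     # RSA 2048-bit CRT operations/s over both LPARs
sw_ms_per_op  = 13.0    # estimated CPU cost per operation in software (SLES11)
cex_ms_per_op = 2.16    # busy time on the CEX4A card per operation
cards         = 4

cpu_saved = ops_per_sec * sw_ms_per_op / 1000        # fraction of one CPU
card_util = ops_per_sec * cex_ms_per_op / 1000 / cards
print(f"CPU saved by offloading : {cpu_saved:.1%} of one IFL")   # about 13% in round numbers
print(f"utilization per CEX4A   : {card_util:.2%}")              # matches the ~0.5% observed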

Since the CEX4A cards are not utilized much by this workload, we could have done this with two cards (for redundancy reasons, you probably don’t want to have just one card). That might make the business case a bit easier.

Knowing that it takes 13 ms to do the operation in software and just under 5 ms when using the CEX4A processor, it is quite possible that response time measurements at the application level would indeed show the difference. Note that even if both took the same time, the reduced CPU usage might still improve application response time because there is less CPU contention when using the Crypto Express.

If the customer upgrades to SLES12 with openssl-1.0.1, the optimized code will take about 3 ms of CPU per handshake. In that case the 9.6 CRT operations per second save less than 30 ms, or 3% of a CPU. But it is not unlikely they will upgrade to z13 before that happens. While the z13 will make the software implementation of the CRT operation a little faster still, we’re also told the CEX5S card will be considerably faster. While that does not change the economics of the matter, it does help to speed up the application even more.

Do understand that Crypto Express is not just about saving CPU cycles or speeding up your handshake. The cards also offer security advantages for secret key operations and can function as a tamper-proof device to store the keys.

Blind Men and an Elephant

Last week I had a pleasant conversation in which I explained the reason for performance measurements both on the hypervisor level as well as within the virtual machine.

Over the weekend, I was reminded of this conversation when reading a version of the story of the Blind Men and an Elephant. If you don’t know the story, it’s worth checking out the Wikipedia page.

[Figure: the blind men and the elephant]

"All of you are right. The reason every one of you is telling it differently is because each one of you touched a different part of the elephant. So, actually, the elephant has all the features you mentioned."

This applies very much to performance measurement of virtual machines. You can’t make sense of it when you only have part of the information. But the same is true when you look at the applications within that virtual machine. You need to extend your view into the application subsystem to understand the interaction with the virtual machine operating system and the hypervisor. When you focus on a small part without the context, you’re likely to be wrong when trying to tune the system.

The Long Tail

I don’t like averages. Sure, when you have a production workload of thousands of transactions per second, it is not an option to keep measurements about each of them for a longer period of time. You want to aggregate the measurements in a way that makes sense. Too often, blindly computing the average does not make sense.

Computing the mean value or average only makes sense when you know how samples are distributed. Using just the mean value to monitor your service level or steer your tuning efforts can be very misleading. Some people tell me they also compute the standard deviation that shows how samples are distributed around the mean value. That is true, but only when samples follow the normal distribution.
[Figure: probability density function of the normal distribution]
The normal or Gaussian distribution is the result of purely random variation. The density chart shows the typical bell-shaped curve around the mean value. As long as the samples are distributed like that, knowing the mean value and standard deviation defines the behavior.

Unfortunately, a lot of things we deal with don’t follow the normal distribution. This is even more true when some abnormality in the system is causing problems that require your attention. The chart below shows latency measurements for some 100,000 transactions. Looking at the blue curve, it is tempting to assume this is pretty much a bell shape centered around 100 us. Rather counter-intuitively, the mean value of the samples is 203 us.
[Figure: latency distribution on a linear scale]
The reason for this strange effect is more obvious in the next chart. Using a logarithmic X-axis we can show a larger part of the latency scale. The blue curve shows some “ripple” at 1000 us and beyond. The mean or average is shifted because of a relatively small number of outliers with very large value.
[Figure: latency distribution on a logarithmic scale, with percentiles]
In some cases, these outliers are the result of failures. A common approach is to leave those outliers out to avoid spoiling the mean value. Obviously you would then have to measure and report the failure rate, and maybe act on it. Most Service Level Agreements for traditional applications acknowledge that “things happen” and specify, for example, that 99% of the transactions should complete within a specified time. The orange curve in the chart, for example, shows the 50th percentile at 103 us, so 50% of all transactions complete within 103 us. And 80% complete within 130 us. The 90th percentile lies at 300 us, and the 99th at 1600 us.
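
If you want to compute such percentiles yourself, numpy makes it trivial. The sketch below uses synthetic data just to show the mechanics (and how a small long tail drags the mean up); it is not the data behind the charts.

import numpy as np

rng = np.random.default_rng(0)
latency = rng.normal(100, 15, 99_000)                         # bulk of transactions around 100 us
latency = np.append(latency, rng.uniform(1000, 5000, 1_000))  # 1% long-tail outliers

print(f"mean: {latency.mean():.0f} us")
for p in (50, 80, 90, 99):
    print(f"p{p}: {np.percentile(latency, p):.0f} us")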

Understand that those outliers are a lot larger than the most common case. I remember a case where I/O response was believed to be “a little slow” but it turned out that once in a while an I/O took more than a second, while the rest typically completed within half a millisecond. The problem was not that the average response was 2 ms; the problem was the rare I/O that took 1000 times longer.

Traditionally, the approach is to accept that once in a while there is a slow transaction. When resources are shared, it may happen that just when you wanted to run a transaction there were a few others ahead in the queue, and just at that time they all happened to use their entire time slice. Knowing that, you exclude the top 1% or 0.1% from the SLA measurements. This may be acceptable when you’re dealing with end users who initiate a dozen or maybe 100 transactions per day and occasionally encounter a slow one. But if you look at modern web applications that may require 1000 interactions with a host for a single user “transaction”, then most of those end-user interactions will hit one of the bad transactions. This makes everything feel slow for the user.
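
A quick back-of-the-envelope calculation shows how fast this escalates when a user transaction consists of many host interactions; the 1% figure is just an assumption for illustration.

p = 0.01                      # assume 1% of host interactions are "slow"
for n in (1, 10, 100, 1000):
    hit = 1 - (1 - p) ** n    # chance that at least one of n interactions is slow
    print(f"{n:5d} interactions per user transaction: {hit:.1%}")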

The way to address this challenge is twofold:

  • Work on having fewer slow transactions, to reduce the chance of hitting one of them
  • Reduce the length of the tail, making those rare slow transactions less slow

This all applies to Linux on z/VM because we share resources: insufficient (political) tuning sometimes makes it very likely that you encounter a slow transaction, and thoughtless overcommitting of resources makes the tail very long. This is why I am looking at latency measurements and expect to talk about them a lot more in the future.

RSA Speedup in OpenSSL

To prepare for my Crypto Express measurements, I wanted to repeat the measurements for RSA private decrypt in software. That private decrypt is the CPU-intensive part of the SSL handshake. I have a personal affection for the algorithms after spending some time rewriting the cryptography in an application that was causing a performance problem (and being able to reduce CPU usage by almost a factor of 100).

Although openssl speed can do a simple benchmark, I wanted to have my own code that I can steer to do the same cryptographic operation for a longer period, rather than a mix of different operations. After I wrote my benchmark program, the CPU time used per RSA decrypt appeared much lower than what I remembered. Obviously I double-checked my code and compared my notes from 2 years ago, but the difference seemed real. For entirely unrelated reasons I moved my tools development to a SLES12 server a while ago, which comes with openssl-1.0.1 rather than the openssl-0.9.8 that SLES11 has. After running my test program on the old server, the difference shows very clearly.
[Figure: RSA private decrypt CPU time, openssl-0.9.8 vs openssl-1.0.1]
The logarithmic scale makes it less impressive, but on a linear scale you can’t see anything at all. For the 2048-bit private decrypt, CPU usage dropped from 17.4 to 3.1 ms on the same hardware; more than a factor of 5. The two curves are pretty much parallel on the logarithmic scale, which means the savings for a 4096-bit key are even more drastic (a factor of 14). If these numbers are valid, that makes a big difference for the economic benefits when you’re looking at Crypto Express hardware to do the RSA. The numbers also troubled me because my own code, at 7.2 ms, is no longer twice as fast but more than twice as slow as openssl.

After picking up a copy of the openssl-1.0.0 source, the reason for this difference became very clear. Just as was already observed for the symmetric encryption in openssl that uses the CPACF instructions on z Systems, the Montgomery multiplication for the RSA decryption is handcrafted in assembler and uses the 64-bit multiply instructions that are normally hard to get from C source. The current source is at https://github.com/openssl/openssl/blob/master/crypto/bn/asm/s390x-mont.pl and you will see it does a proper 64×64 multiply, something our CPU does very well.

Indeed, with the software implementation getting much cheaper, the break-even point for a Crypto Express card to save CPU cycles changes: you need to do a lot more RSA handshakes before the CPU savings pay for the card.

Understand that openssl is not all there is. Some applications use another implementation (like GSKit for example). And some applications don’t take advantage because they bring their own copy. I noticed for example that the node.js port indeed “uses” openssl, but comes with its own copy based on an older openssl that does not have these routines in assembler and does not use the CPACF instructions.

Crypto Express Measurements

For workloads that use cryptographic operations, Crypto Express cards can be installed in your z Systems hardware. Customers do this for several reasons:

  • Speed up cryptographic operations, improve end-user response time
  • Reduce CPU cycles used by cryptographic operations
  • Keep secret keys in tamper-proof device to increase security

The first two objectives should interest anyone dealing with performance and capacity planning. These objectives are actually different, and in most cases even conflicting. In a shared resource environment, you try to find a balance between efficiency and response times. This balance is unique for each configuration and workload. Fortunately we have performance metrics in different areas of the configuration.
Crypto Virtualization Metrics
The actual Crypto Express cards in this case are shared by the two z/VM LPARs and the Linux guests running there. Even though the Crypto Express is designed just to do those cryptographic operations, it does need time to perform the operations, so it has a limited capacity. Though IBM does not publish actual numbers, one could expect the hardware might need something in the order of 100 microseconds for an RSA 2048-bit sign operation (the time consuming part of an SSL handshake). Some vendors of comparable hardware do show numbers, but they may assume a fair percentage of “session resume” operations that avoid the RSA sign operation.

Passing the request and response between application and crypto processor involves polling in Linux, in z/VM, in PR/SM and probably in hardware. The response time observed by the application is not just the time it takes for the crypto processor, but includes latency due to polling. This also means that utilization of the crypto processor can’t be computed from the observed response time. While one application is taking its time to retrieve the results, the crypto processor can be working on requests from other applications.

Screen: ESACRYPT Velocity Software              ESAMON 4.201 03/06 09:34-10:36 
1 of 1  Crypto Facility Analysis                                     2827 00000
                                                                               
         <-------Request Rate/Second------->         <-NQ to DQ> <-Rejects->   
         <-NQ Rate-> <-DQ Rate-> <Completed>   Queue <-ms/oper-> <-per sec->   
Time      Virt  Real  Virt  Real  Virt  Real  Length  Virt  Real  Virt  Real   
-------- ----- ----- ----- ----- ----- -----  ------ ----- ----- ----- -----   
10:36:00   0.6   0.6 262.2   1.5   0.6   0.6       0  5.14  3.25     0     0   
10:35:00  36.5  36.5 949.9 146.5  36.5  36.5       0  7.02  5.44     0     0   
10:34:00 336.9 336.9 834.1 615.8 336.9 336.9       0  2.97  1.57     0     0   
10:33:00  33.6  33.6 793.3  53.3  33.6  33.6       0  2.53  1.15     0     0   

The ESACRYPT screen above shows both virtual and real crypto operations. In the 10:34 interval, the virtual machines in this z/VM system enqueued 336.9 requests per second, and CP enqueued the same requests to the Crypto AP. The same number of responses were “Completed” per second. The “DQ” rates show the amount of polling: the number of times per second the virtual machine and CP, respectively, checked whether the operation had completed yet. Dividing the numbers, we see that in this interval CP did almost 2 tests for each request, and the virtual machine about 2.5. Understand that these polling rates are basically “stacked” on top of each other: as long as CP has not checked and noticed the operation completed, the virtual machine can check all it wants, but will not get the response.
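
For the curious, those polling ratios follow directly from the screen data:

nq, dq_virt, dq_real = 336.9, 834.1, 615.8      # per second, 10:34 interval
print(f"CP polls per request   : {dq_real / nq:.1f}")   # about 1.8
print(f"guest polls per request: {dq_virt / nq:.1f}")   # about 2.5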

The columns on the right show the computed average response time observed by the virtual machine and by CP itself. The difference of 1.5 ms between the two is due to the polling frequency of Linux. That is significant when dealing with a real response time of 5 ms. We could imagine that if Linux checked for completion more frequently, the application would get the response quicker, and we would increase single-thread throughput. However, for z/VM to virtualize the test for completion, CP needs to massage the virtual AP queue and the real AP queue. This takes CPU time and increases the overhead. Even though a single test is not much CPU, if the Linux guest were checking 1000 times more often, the CPU overhead could slow down other parts of the workload (even in this same virtual machine).

While it is tempting to try and compute utilization and response times from these numbers, bear in mind that we are dealing with a mix of different cryptographic operations that vary 2 orders of magnitude in complexity (the data shown is from an openssl speed rsa benchmark that does sign and verify with different key lengths). With that kind of workload, computing averages can be very misleading.

We Want Your Data
With different workloads and different generations of Crypto Express cards configured either as Accelerator or Coprocessor, there are a lot of possible combinations. If you have Crypto Express cards installed and are using them (or would like to use them, or would like to know whether they are being used), drop me a note when you have an opportunity to collect some data for me to review.

Polling and Waiting

In non-trivial IT processes, we find asynchronous activities that need to be aligned. A very obvious case is when an application server needs to query a database. The application server asks for a particular record and can only resume the current task when that record has been delivered. The database server itself may have to start an I/O to the disk and can only return the answer once that I/O has completed. On the network level, the web server needs to align sending buffers of data with the client receiving those blocks. In a multi-threaded application, a process may spawn a child process to complete a transaction and needs to do some extra work when that transaction completes. Another common case is when exclusive access to a resource is arranged by software locks, with other processes waiting for the critical section to become free again.

A multi-threaded application or system will likely have some other task to perform while waiting to resume a suspended task. The ideal approach is when the requester is interrupted once the transaction has completed. To communicate with hardware devices, normally an interrupt mechanism is used. The System z CPU has different interrupt types that will invoke Interrupt Handler routines. The software equivalent of such a scheme involves callback routines to wake up a suspended thread.

[Figure: polling for completion of an asynchronous request]
Lacking a suitable interrupt mechanism, we have to resort to checking the status of the request more or less frequently until the transaction has completed. In most cases the status is observed by comparing the contents of a memory location. This approach is often referred to as polling. Unfortunately some applications use polling even when there are suitable alternatives available, sometimes because developers only worked on systems where it did not really matter.

One of the design choices in such a scheme is the polling frequency. As the chart shows, it will take some time before the polling process notices that the task has completed. If the polling frequency is very low, the completed transaction can go unnoticed for a long time until the status is inspected. If the polling frequency is very high, resources are wasted requesting the status of the transaction too often. The observed response time for the transaction is a combination of actual response time and polling latency.

Depending on the polling frequency, people sometimes use slightly different terms. When the polling delay is longer than the actual transaction response time, it’s often referred to as a test: almost every time the status is checked, the transaction has already completed. On the other side of the scale, when the CPU does nothing but check the transaction status in a tight loop, it’s referred to as a busy wait because the CPU does nothing but wait for the completion. The busy wait reduces the latency to a minimum, but at a very high cost.

When polling is not synchronized with the requests, it adds on average half the polling delay to the response time. In practice, many implementations start polling when a request is started. In that case the average latency depends on both the polling frequency and the actual response time. Worst case, we have to add the entire polling delay. We also see more complicated schemes where the delay is varied over time (for example by doubling the delay on each pass). Unless such a scheme can be tuned to match the expected actual response time, it appears to do nothing more than complicate things.
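
A small simulation illustrates both cases; this is a toy model with made-up numbers, not a measurement.

import math
import random

def observed(service, period, first_poll):
    """Return the time of the first poll at or after 'service',
    with polls at first_poll, first_poll+period, first_poll+2*period, ..."""
    if first_poll >= service:
        return first_poll
    k = math.ceil((service - first_poll) / period)
    return first_poll + k * period

service, period, n = 3.2, 1.0, 100_000          # milliseconds, made-up numbers

free_running = sum(observed(service, period, random.uniform(0, period))
                   for _ in range(n)) / n
with_request = observed(service, period, period)  # first check one period after the request

print(f"actual service time           : {service} ms")
print(f"free-running polling (average): {free_running:.2f} ms")   # about service + period/2
print(f"polling started with request  : {with_request:.2f} ms")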

The acceptable amount of polling depends a lot on the overhead of polling and the transaction response time. A polling delay in the same order of magnitude as the average response time appears reasonable when the test itself is not too expensive. But for example testing for completion 10,000 times per transaction is excessive. The good thing is that in a shared resource environment, we don’t need to have each single thread be able to fully saturate the system.

Linux CPU Time Measurements

I recently started to do some fine-granularity measurements on Linux to measure latency and efficiency effects of various tuning options and to put some numbers behind z/VM HiperDispatch and other forms of resource affinity. A dispatcher must decide whether to wait for the favorite resource to become available or run the workload right now, but with lower efficiency.

After a lot of experiments, it turns out that the clock_gettime() function for CLOCK_PROCESS_CPUTIME_ID includes steal time. This is very unfortunate when you want to distinguish between waiting for a resource and efficiency in using the resource. The jury appears to be still out as to whether this is a bug or not. There seem to be voices out there claiming that “elapsed time is all that matters”, so I’m less certain about the outcome.

There is another function, getrusage(), that appears to behave better in general, but it fails for very short transactions that use little CPU time. My impression is that it is rounding to full timer ticks (10 ms) somewhere in the process, or maybe missing the CPU usage in the current time slice. Unfortunately the affected kernel code is not entirely trivial, so I’m not sure when I will find time to dig into this. For the time being, I will have to separate the latency and efficiency experiments to make sense of things.
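
My experiments are in C, but Python exposes the same clocks, so a rough sketch of the comparison looks like this (Linux only; the workload is just a busy loop to burn some CPU).

import resource
import time

def burn(n=2_000_000):
    # simple CPU-bound loop as a stand-in for a short transaction
    s = 0
    for i in range(n):
        s += i * i
    return s

wall0 = time.clock_gettime(time.CLOCK_MONOTONIC)
cpu0  = time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID)
ru0   = resource.getrusage(resource.RUSAGE_SELF)

burn()

wall = time.clock_gettime(time.CLOCK_MONOTONIC) - wall0
cpu  = time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID) - cpu0
ru1  = resource.getrusage(resource.RUSAGE_SELF)
used = (ru1.ru_utime + ru1.ru_stime) - (ru0.ru_utime + ru0.ru_stime)

print(f"wall clock (CLOCK_MONOTONIC): {wall * 1000:.2f} ms")
print(f"CLOCK_PROCESS_CPUTIME_ID    : {cpu * 1000:.2f} ms")
print(f"getrusage() user+system     : {used * 1000:.2f} ms")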

CP Overhead in Mode VM

The IBMVM mailing list spawned a discussion about z/VM running in a Mode VM partition, so with a mix of CP and IFL processors, for example to run second level z/OS or z/VSE guests. Someone suggested that “CP itself would not run on IFL” which would be a concern when the CPs are throttled quite a lot. When guessing does not work, we look at the numbers.

The machine in this case is a z12-BC A02, which means the two CPs are the slowest you can buy for the z12-BC. The LPAR we run in has 1 logical CP and 2 logical IFLs. The ESACPUU screen shows the system is pretty idle, except for some system services that run on the primary type (this is something you might want to change when you’re on seriously throttled CPs).

 Screen: ESACPUU  Velocity Software              
         CPU Utilization Analysis (Part 1)       
                   
           Total  Emul   
 Time     Type  ID  util  time User Syst  Idle Steal
 -------- ---- --- ----- ----- ---- ---- ----- -----
 00:47:00 CP     0  13.0  10.1  0.7  2.2  86.7   0.3
          IFL    1   0.6   0.4  0.1  0.1  99.2   0.1
                 2   0.4   0.3  0.1  0.1  99.5   0.0

The “Syst” column shows system overhead on each of the logical CPUs. Simply looking at the percentages, we might be tempted to think that most of the system work runs on the IPL processor. When some workload is started in Linux that is known to create system overhead, we even see 45% system overhead on the CP.

Screen: ESACPUU  Velocity Software              
        CPU Utilization Analysis (Part 1)       
                  
          Total  Emul   
Time     Type  ID  util  time User Syst  Idle Steal
-------- ---- --- ----- ----- ---- ---- ----- -----
 00:54:00 CP     0  56.6  10.6  0.7 45.4  42.6   0.7
          IFL    1  25.0  17.9  5.8  1.3  69.7   5.3
                 2  26.1  18.8  6.1  1.3  68.5   5.3

But we should not let this mislead us. There is almost no other user work on the CP (about 11.5%), so that logical CP is idle more often than the logical IFLs are. When the workload on the CP increases (because we let the z/VSE guests run their batch jobs), the picture changes a lot.

 Screen: ESACPUU  Velocity Software              
         CPU Utilization Analysis (Part 1)       
                   
           Total  Emul   
 Time     Type  ID  util  time User Syst  Idle Steal
 -------- ---- --- ----- ----- ---- ---- ----- -----
 00:59:00 CP     0  96.4  76.6  4.2 15.6     0   3.6
          IFL    1  26.8  19.1  5.5  2.3  71.9   1.3
                 2  27.1  19.3  5.5  2.3  71.7   1.2

Now we see 15.6% system overhead on the CP and 2.3% on each logical IFL. If we correct for the fact that the IFL is 21 times faster than the CP, it’s pretty clear that the CP is leaving most of the system tasks to the IFLs (the 2.3% on an IFL is worth 48% of a CP). So it’s clear that z/VM system activities run pretty much on any of the available logical CPUs, not just on the IPL processor.
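
The weighting is simple enough to show explicitly; the 21x speed ratio is of course only an approximation.

ifl_vs_cp    = 21      # stated speed ratio between IFL and throttled CP
syst_on_cp   = 15.6    # % system overhead on the slow CP
syst_per_ifl = 2.3     # % system overhead on each fast IFL

print("one IFL's 2.3% expressed in CP capacity:",
      round(syst_per_ifl * ifl_vs_cp, 1), "%")
print("total system work in CP-equivalents    :",
      round(syst_on_cp + 2 * syst_per_ifl * ifl_vs_cp, 1), "%")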

Now there are some finer details in here. When you look at the “Emul time” column for the IFLs, you see the second case shows a total of 38.4%, which is a bit more than the 36.7% in the first case. This means we got a bit more work done in the second case (the Linux test program suggests that too). It appears that in this case, it is not an advantage to run the system tasks on the slow logical CP, even when it is available right now. The challenge is a bit like waiting for the train and deciding whether to take the slow train that is leaving right now, or wait for the faster train that will leave later. In the case of z/VM, you can’t really predict whether the users currently running on the fast logical IFLs will consume their full time slice, so you don’t know when the faster train will leave. Similar arguments can be made for HiperDispatch, where CP needs to decide whether the guest must be dispatched on a free logical CPU or whether it should wait for another logical CPU that might be more efficient because it still has data in cache for this guest.

YaST trouble in SLES12 using openssl

While I do many of the things here so you don’t have to, I thought I should give you a heads-up on a weird problem with YaST that I ran into. At least Google will find this when you’re searching for the error messages. I’m working with brighter folks on this, and will update the blog entry when I have more news.

The problem started after I installed the openssl-ibmca package, which makes openssl use the libica crypto library that uses CPACF and Crypto Express devices.

libica-2_3_0-2.3.0-15.2.s390x
openssl-ibmca-1.2.0-144.2.s390x

Note that the libica package is named libica-2_3_0. I consider this a bad habit because it lets you have multiple versions installed at the same time, while in most cases the last one you (re)installed is the active one. I strongly recommend keeping things clean and not having multiple versions installed.

When you install openssl-ibmca you also need to modify the /etc/ssl/openssl.cnf file to include the reference to the library that contains the openssl engine code. Once I had done that, some of the YaST functions stopped working. The one that caused me the most trouble was the “Software Management” function; it flashed the ncurses panel and returned to the menu. Trying to invoke yast -i from the command line failed without a visible error message. Only when I redirected stderr to stdout could I see the error messages.

Auto configuration failed
20943:error:26078067:engine routines:ENGINE_LIST_ADD:conflicting engine id:eng_list.c:116:
20943:error:2606906E:engine routines:ENGINE_add:internal list error:eng_list.c:288:
20943:error:260B6067:engine routines:DYNAMIC_LOAD:conflicting engine id:eng_dyn.c:540:
20943:error:0E07606D:configuration file routines:MODULE_RUN:module initialization error:conf_mod.c:235:module=engines, value=engine_section, retcode=-1

It turns out that these messages hint at a duplicate entry for an openssl engine. After digging through strace output, my best guess is that YaST somehow tries to initialize the openssl library twice. This initialization involves loading the default configuration file, which causes the engine library to be loaded. The second attempt fails because it is already loaded. It’s possible this happens with SLES12 because parts of YaST have now been rewritten in Ruby, which might also initialize the openssl libraries on its own.

If you run into this problem, you can bypass the menu by specifying the package to be installed, like

zypper --non-interactive install openssl

You can also temporarily comment out the line in the configuration file that specifies the engine to load:

openssl_conf = openssl_def

When the engine is not loaded anymore, all the configuration file does is set the path for certificates and some other things that are harmless to repeat.
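
For reference, the relevant part of openssl.cnf looks roughly like the sketch below. The section names match the ones that show up in the error messages above, but the exact library path differs per distribution and is only an assumption here; commenting out the first line is the workaround just described.

# sketch of the engine-related openssl.cnf entries (not a complete file)
openssl_conf = openssl_def            # comment this line out to disable the engine

[openssl_def]
engines = engine_section

[engine_section]
ibmca = ibmca_section

[ibmca_section]
engine_id = ibmca
dynamic_path = /usr/lib64/engines/libibmca.so   # path is an assumption, check your system
default_algorithms = ALL
init = 1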
