Practical Performance

This is an overview of issues that can influence the practical performance as you should have seen when measuring the time performance of the algorithms in task 1 to 4. Key factors in execution time are:

  • Instruction count
    As in the theoretical approach to the time complexity of algorithms the number of instructions executed heavily influences the time behaviour of your programs.
  • CPU speed
    Assuming every instruction requires approximately the same time on the CPU the speed of the CPU multiplied with the number of instructions determines the time your program takes.
  • Pipelining
    Pipelining in modern CPUs means that every branch instruction in your code can slow down the execution of your code. Necessarily branch prediction has to fail some of the time and if the branches taken depend on your input data for the algorithm, different data may expose different timing behaviour.
  • Data layout and access
    Usually this is not considered for time complexity and memory complexity usually just considers the amount of memory required for the algorithm.
  • Cache
    Multiple levels of memory caches in modern CPUs mean the order in which you access the data and the layout of that data in memory have a notable effect on the performance of your algorithms. The generic rule is to store your data in memory in the sequence you access it and access data in memory as much as possible in the sequence it is stored to avoid cache misses. More complex issues are beyond the scope of this course, but see http://www.lamarca.org/anthony/caches.html for more on this
So you must have some understanding of the hardware to fully optimise your code and different hardware platforms may require different optimisations. Moreover, the compiler (or virtual machine, interpreter, etc.) apply various automatic optimisations (partially controllable by the user or programmer). In particular dynamic compilation can have additional effects.
    • Dynamic compilation
      The presence of a virtual machine with dynamic compilation makes performance measurement more difficult. You must have an understanding of how the virtual machine executes the byte-code to properly assess the performance of your programs.
  • Compilation Effects on Timing

    • Static compilation
      Convert instructions directly into machine code optimised for the hardware platform (to some user-defined degree).
    • Dynamic compilation
    • Compiler converts instructions into intermediate byte-code as instructions for a virtual machine. Most optimisations are performed during runtime instead.
      • Interpretation
        • Initial JVMs simply interpreted the byte-code, which made them rather slow.
      • Just-in-time compilation
        • Lazy conversion of byte-code into machine code before execution
        • Time for code optimisation vs. patience of user
      • Hot-spot dynamic compilation
        • Combination of interprteation, profiling and dynamic compilation
        • Interpret first and compile hot code only
        • Use profiling data for optimisation decisions
        • JVM: client and server compiler (platform dependent)
          • Specify -server or -client when starting the JVM
          • Server to maximize peak operating speeds for long-running server applications
          • Client designed to reduce startup time and memory footprint with less time-consuming optimisations
          • Also see http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136373.html
      • Continuous recompilation
        • Keep gathering profiling information, even after compilation
        • Recompile with better optimisation later (very hot code, better optimisation)
        • Run JVM with -XX:+!PrintCompilation to see when compiler runs
      • On-stack replacement
        • Initially compile a method if it was executed more than certain number of iterations (~10,000?), but only use compiled code after interpreted code finished (so may never execute)
        • On-stack replacement switches from interpretation to compilation (during loop execution) or replaces code with recompiled code
    Timing
    • Measuring time of individual algorithm typical to assess performance
    • Dynamic compilation means timing of such code has to be done carefully to assess real performance
    Warmup
    •  Initially interpreted code is timed, but this is not so useful for full practical performance
    • With simple just-in-time compilation, run code to be timed for some iterations
      • JVM will switch from interpreted to compiled code then
      • A single "re-start" in the performance data
    •  With recompilation and on-stack replacement
      • Hard to tell when compiled code is executed
      • Even harder to tell when optimal code is found
      • You see multiple "re-starts" in the performance data
      • All mixed up with time required for compilation
      • Amount of warmup required is rather unknown
      • Run with "-XX:+!PrintCompilation" to get an idea of what is going on
    Garbage Collection
    • Code that allocates new objects may eventually trigger garbage collection
    • One iteration more can do this * Distorts average timing results
    • Use --verbose:gc with JVM to see when garbage collection takes place
    • Or (better) run for very long time to get average results that also account for allocation and garbage collection
    Dead Code
    • Code that has apparently no effect may not even be executed.
    • Some code can be made much faster by replacing "return RESULT" with "return 0.0" or otherwise ignoring the result, which in particular is an issue for pure timing tasks
    Deoptimisation
    • Compiler can inline instruction into calling code to optimise code (larger code chunks are easier to optimise)
    • If instructions being called are replaced with other instructions, the complete optimised code is invalidated and replaced with a slower version again. The slower version may be optimised again or not.
    • Check with -XX:+!PrintCompilation
    • Issue here is often the use of virtual methods
    • Cannot conclude anything from the timing results except that timing is very difficult to get right
  • Last modified: Thursday, 12 November 2015, 12:31 PM