Mitigate performance issues through cache configuration and other improvements. #215

Open
wants to merge 11 commits into base: master


@asw101 (Member) commented Sep 29, 2020

This PR mitigates performance and transient reliability issues that we identified during load testing with JMeter, using the Latency-Sensitive Stress Testing (time-gated-exam.jmx) exam scenario with tweaks and updates for the latest version. The changes are as follows:

  1. Sets the Moodle localcachedir to /tmp/localcachedir

    During testing of the Large size deployment, which defaults to Azure Premium Files as the external file share, we identified files in the /moodle/moodledata directory that caused increased latency. The first is the localcachedir directory, for which Moodle recommends a fast local file system when Moodle is clustered.

  2. Sets alternative_component_cache to /var/www/html/moodle/core_component.php

    This change works in conjunction with localcachedir and provides significant performance improvements when moodledata is located on an external file share such as Azure Premium Files (see the related issue "caching problem with gluster" #126 regarding GlusterFS). We chose this directory because the directory must already exist and the web server must have permission to write to it.

  3. Increases default osDisk size from 30 GB (120 IOPS / 3,500 burst IOPS / 25 MB/sec) to 256 GB (1,100 IOPS / 3,500 burst IOPS / 125 MB/sec)

    During load testing, we believe we may have hit IOPS or throughput limits at the disk level, the VM level, or both, which can cause a VM to become unavailable. Updates to the disk and VM metrics will make this clearer. To mitigate this, we chose a Premium SSD size with significantly more IOPS and throughput.

    We initially chose 1,024 GB (5,000 IOPS / 200 MB/sec) because it is the first size that does not rely on the 3,500 "burst" IOPS. Latency also decreased as the disk size increased. However, a smaller size such as 256 GB (1,100 IOPS / 3,500 burst IOPS / 125 MB/sec) may be suitable, so this PR changes the default from 30 GB to 256 GB.

    We applied this change to both the Virtual Machine Scale Set (VMSS) that handles the web traffic and the Controller VM we use for JMeter testing (after resizing it to match the VMSS), to maintain parity in IOPS and throughput.

  4. Defaults Load Balancer and Public IP to the Standard SKU.

    We upgraded our Load Balancer and Public IP to the Standard SKU to enable the multi-dimensional metrics and alerts, particularly "SNAT connections". These help us avoid issues such as SNAT port exhaustion and confirm that we are not experiencing them.
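To make the disk sizing in item 3 easier to compare, here is a minimal Python sketch that computes the relative gain of each size over the previous 30 GB default. The figures are copied from the description above; the `DiskTier` class, the `TIERS` table, and the `gain` helper are illustrative names only and are not part of this PR or of any Azure API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskTier:
    """Baseline (non-burst) figures for one disk size, as quoted in this PR."""
    size_gb: int
    iops: int
    throughput_mb_s: int


# Values taken from item 3 of the PR description (hypothetical table name).
TIERS = {
    "previous default": DiskTier(30, 120, 25),
    "new default": DiskTier(256, 1100, 125),
    "initial choice": DiskTier(1024, 5000, 200),
}


def gain(old: DiskTier, new: DiskTier) -> tuple[float, float]:
    """Return (IOPS multiplier, throughput multiplier) of `new` over `old`."""
    return new.iops / old.iops, new.throughput_mb_s / old.throughput_mb_s


if __name__ == "__main__":
    base = TIERS["previous default"]
    for name, tier in TIERS.items():
        iops_x, tput_x = gain(base, tier)
        print(f"{name:>16}: {tier.size_gb:>5} GB  "
              f"{iops_x:5.1f}x IOPS  {tput_x:4.1f}x throughput")
```

With these numbers, the 256 GB size gives roughly nine times the baseline IOPS and five times the throughput of the 30 GB disk, which is the trade-off behind picking it over 1,024 GB.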

These changes have been tested to deploy successfully against the current master, though load testing was performed against an earlier commit.

(Special thanks to @iennae for feedback and insights throughout!)

@iennae (Contributor) left a comment


I've reviewed and provided feedback to Aaron directly. LGTM so far.

@asw101 asw101 marked this pull request as ready for review September 30, 2020 12:34
@iennae (Contributor) left a comment


Awesome! Changes look great.

@asw101 (Member, Author) commented Oct 13, 2020

Thank you @naioja for your tweaks for NSG with Standard Load Balancer. I have merged the current changes from master and resolved the merge conflict. I have also included your suggested snippet to ensure the alternative_component_cache directory exists!
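For readers who want to check the cache locations before deploying, the following is a minimal Python sketch of a pre-flight check, not the snippet merged in this PR. It assumes the two paths from items 1 and 2; the function names `ensure_writable_dir` and `prepare_cache_paths` are hypothetical. Note that it tests writability for the user running the script, which may differ from the web server user.

```python
import os
import tempfile
from pathlib import Path


def ensure_writable_dir(path) -> bool:
    """Create `path` (and parents) if missing; report whether it is a writable directory."""
    directory = Path(path)
    try:
        directory.mkdir(parents=True, exist_ok=True)
    except OSError:
        # Covers permission errors and the case where a file already has that name.
        return False
    return directory.is_dir() and os.access(directory, os.W_OK | os.X_OK)


def prepare_cache_paths(localcachedir, component_cache_file) -> dict:
    """Check localcachedir itself and the directory that will hold the component cache file."""
    return {
        "localcachedir": ensure_writable_dir(localcachedir),
        "alternative_component_cache": ensure_writable_dir(
            Path(component_cache_file).parent
        ),
    }


if __name__ == "__main__":
    # Demo against a throwaway root so the sketch never touches real system paths.
    # On a real host the arguments would be /tmp/localcachedir and
    # /var/www/html/moodle/core_component.php, as set in this PR.
    with tempfile.TemporaryDirectory() as root:
        print(prepare_cache_paths(
            Path(root) / "tmp" / "localcachedir",
            Path(root) / "var" / "www" / "html" / "moodle" / "core_component.php",
        ))
```

Only the parent directory of core_component.php is created; the file itself is left for Moodle to write.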
