Load Balance your Fleet of FlashArrays

In this post, I’m going to discuss how to load balance your storage provisioning across a fleet of Pure Storage FlashArrays.

As an integral part of Pure’s Kubernetes integration, Pure Service Orchestrator has the ability to load balance across a fleet of Pure storage devices. This is great for your containerized space, but I wondered how you could do something similar for arrays in a non-containerized environment; for example, a vSphere environment where multiple Pure Storage FlashArrays are available to a vCenter and, when creating a new datastore, you want to put it on the least full array.

What I wanted to do was orchestrate which storage array a volume was provisioned on automatically without the storage administrator having to keep checking which array to use. Now, when I think of automation and orchestration I immediately think of Ansible.

Pure Storage has an amazing suite of modules for Ansible that we can leverage to do this work, so I created an Ansible load-balancing role called, logically enough, lb.

The role takes a list of arrays, interrogates them, works out which has the least used capacity, and then provides the information required for the rest of the playbook to provision against that array.
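Conceptually, the approach is straightforward: gather capacity information from every array in the list and pick the one with the most free space. The snippet below is just a minimal sketch of that idea, not the actual role code, using the purefa_info module; the capacity field name used in the sort is illustrative and may differ between module versions.

# Sketch only – not the real lb role. Assumes the Pure Storage FlashArray
# modules are installed; the 'total_used' field name is illustrative.
- name: Gather capacity information from each array
  purefa_info:
    gather_subset: capacity
    fa_url: "{{ item.url }}"
    api_token: "{{ item.api }}"
  register: capacity_data
  loop: "{{ arrays }}"

- name: Pick the array reporting the least used space
  set_fact:
    least_full: "{{ capacity_data.results | sort(attribute='purefa_info.capacity.total_used') | first }}"

- name: Expose the selected array's connection details
  set_fact:
    use_url: "{{ least_full.item.url }}"
    use_api: "{{ least_full.item.api }}"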

So where can you find this role and how do you use it?

The role can be found on the Pure Storage OpenConnect GitHub account in the ansible-playbook-examples repository, under the flasharray/roles directory.

To use it, you need to populate the variables file roles/lb/vars/main.yml with the management IP addresses of your fleet of arrays, together with an API token for a storage-admin-privileged user on each array. As far as I know there is no limit to the number of arrays you can load balance over, but the example below is for a fleet of six FlashArrays.

The populated file would look something like this (use your own array credentials):

---
arrays:
  - url: 10.21.200.xxx
    api: c6033033-fe69-2515-a9e8-966bb7fe4b40
  - url: 10.21.230.xxx
    api: 70e2e819-df9d-54d7-7af3-6aa9bd72b20b
  - url: 10.21.229.xxx
    api: f5dda863-5275-d092-f747-8eeac4d54006
  - url: 10.21.228.xxx
    api: 96abbb22-f23f-af72-14cb-982c41291e79
  - url: 10.21.228.xxx
    api: 4ea6b496-7408-12bb-6dac-2be96100c070
  - url: 10.21.229.xxx
    api: 473c4f2d-1bc1-0bac-4417-891d988a46c0

To use the role, just add it to your Ansible playbook.

If you are security-minded, all of these URL and API token entries can be encrypted using Ansible Vault. I wrote another blog post that includes details on how to implement Vault for these variables.
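As a rough illustration (the ciphertext below is only a placeholder), you can either encrypt the whole file with ansible-vault encrypt roles/lb/vars/main.yml, or encrypt individual values with ansible-vault encrypt_string '<api-token>' --name 'api' and paste the output into the file, which then looks something like this:

---
arrays:
  - url: 10.21.200.xxx
    api: !vault |
      $ANSIBLE_VAULT;1.1;AES256
      <ciphertext produced by ansible-vault encrypt_string>

Just remember to pass --ask-vault-pass (or --vault-password-file) to ansible-playbook when you run the play.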

When the role has run, two variables will have been defined: use_url and use_api. These identify the array with the lowest utilization level and therefore the one you should be provisioning to. There is also an additional variable, use_name, that identifies the true name of the selected array.

A super simple playbook that uses the lb role and then provisions a single volume to the least full array is shown here:

- name: Pure Storage load balancing example
  hosts: localhost
  gather_facts: no
  vars:
    array_usage: []    # Do not remove - required by the role
  roles:
    - role: lb

  tasks:
  - name: Provisioning to {{ use_name }}
    purefa_volume:
      fa_url: "{{ use_url }}"
      api_token: "{{ use_api }}"
      name: lb_test
      size: 50G
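
Running this (for example with ansible-playbook lb-example.yml, assuming that is what you saved it as) should create a 50G volume called lb_test on whichever array in the fleet currently has the least used capacity.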

I hope this short post and this role prove useful, and if you have any of your own roles or playbooks for Pure Storage devices that you think would be useful to other users, please feel free to contribute them to the ansible-playbook-examples GitHub repository.

Some Reality for us Infrastructure Peeps or Apps are cool too

Don’t you just love double titles?

For many years I have been an infrastructure guy. I really liked how the cables and processors and memory and blinking lights worked. Applications were often the necessary evil tolerated so that I could play with cool technology. During my own journey toward learning about the cloud, it has become increasingly important to consider the function of the application. Six-years-ago me would totally punch me in the face right now. Traitor. :)

1 – Don’t get your App messed up in my resource buckets of awesomeness

 

So the reality check to the infrastructure geek in me is this: the application teams really think of what you do as the network. That is why, when anything is ever wrong, it is always “the network’s” fault. What we love to do is getting abstracted away more and more, and I will still contend it is very important and very hard to do. But whether you are building reference architectures or deploying a converged infrastructure appliance, almost no one but us cares. They just want the data to do their jobs. So while we have really great discussions about speeds and feeds, the guy in the picture below just wants the app. From the hypervisor down, we need to design with the application in mind, or we risk becoming like that goth dude locked in the server room on The IT Crowd.

 

2 – Honey badger don’t care about FCoE

My next post will get into what I have been researching about what is out there, and hopefully it will help us infrastructure peeps understand our App/Dev brothers better.

You are probably an Infrastructure person if:

  1. You read this blog.
  2. You work mainly with virtualization.
  3. You are a storage admin.
  4. You are a network admin.
  5. You like to make fun of DBAs.

 

Dynamic Cluster Pooling

Dynamic Cluster Pooling is an idea that Kevin Miller (@captainstorage) and I came up with one day while we were just rapping out some ideas on the whiteboard. It is an incomplete idea, but it may have the beginnings of something useful. The idea is that clusters can be dynamically sized depending on expected workload. Today a VMware cluster is sized based on capacity estimates from something like VMware Capacity Planner. The problem is that this method requires you to apply one workload profile across all time periods and situations. What if only a couple of days a month require the full capacity of a cluster? Could those resources be used elsewhere the rest of the month?

Example Situation
Imagine a virtual infrastructure with multiple clusters. Cluster “Gold” has 8 hosts. Cluster “Bronze” has 8 hosts. Gold is going to require additional resources on the last day of the month to process reports from a database (or something like that). In order to provide additional resources to Gold, we will take an ESX host away from the Bronze cluster. This lets us deploy additional virtual machines to crunch through the processing, or simply gives the existing machines less contention.

You don’t have to be a PowerCLI guru to figure out how to vMotion all the machines off of an ESX host and place it in maintenance mode. Once the host is in maintenance mode, it can be moved to the new cluster and taken back out of maintenance mode, and DRS can redistribute the VMs.

Sample code, more to prove the concept than anything else:
#Connect to the vCenter
Connect-VIServer [vcenterserver]
#Identify the host; ideally pass the host or hosts you want to vacate into a variable
Get-Cluster Cluster-Bronze | Get-VMHost

#Find the least loaded host (skipping that step for now)

#vMotion the machines to somewhere else in that cluster
Get-VMHost lab1.domain.local | Get-VM | Move-VM -Destination [some other host in the bronze cluster]

#Move the host into maintenance mode, move it to the Gold cluster, then reconnect it
Set-VMHost lab1.domain.local -State Maintenance
Move-VMHost lab1.domain.local -Destination Cluster-Gold
Set-VMHost lab1.domain.local -State Connected

#Rebalance the VMs across the Gold cluster with DRS
Get-DrsRecommendation -Cluster Cluster-Gold | Apply-DrsRecommendation

I was able to manually make this happen in our lab. If this sparks any interest, maybe someone who is good with “the code” can make this awesome.

My Fun with the VMware Enterprise Administration and Design Exams

Sorry I have been missing for a few weeks. I know many were quite worried about why I hadn’t blogged for a couple of weeks (not really).

Back in February I sat for the Enterprise Administration Exam at PEX in Las Vegas. It was scheduled the day after the Super Bowl; what a bunch of distractions. Thankfully I passed, and I want to share my experience without violating any rules or anything I agreed to. This was a technical test: a lot of settings, configurations, and information like that. It is still multiple choice, so at least you know the right answer is on the screen (hopefully; I did have one question where I thought none of the answers were right). The lab section was actually about as fun as test taking can be. I wish there were more lab-practical elements in these kinds of tests. Overall there are more intricate settings and configuration questions than you will find on the VCP exam.

At the end of April I took the Design Exam. This was a much different experience. I had an extremely hard time finding a study list of things that would help; know the Exam Blueprint is all I would say. Also, I think this is where VMware can start finding out who does architecture work and who may be an administrator. You could read every PDF on VMware.com and still not know how to pass this test unless you have worked with the solutions multiple times. The design drawing was a challenge: I wasted too much time reading the requirements document and ran out of time, but I feel I was able to get a good portion of what I needed onto the page. Technically, the interface was kind of quirky.

I felt both exams were challenging but fair to the Exam Blueprints. Nothing on them made me scream, “They didn’t say they would test on THAT!” The design exam needs some technical improvement (the matching questions were buggy).

Now begins the harder and more involved process: the design submission and, hopefully, an invitation to a defense.

Storage Design and VDI

Recently I have spent time rethinking certain configuration scenarios and asking myself, “Why?” If there is something I do day to day during installs, is it still true when it comes to vSphere? Will it still be true in future versions?
Lately I have questioned how I deploy LUNs/volumes/datastores. I usually deploy multiple moderately sized datastores; in my opinion this has always been the best fit for MOST situations. I will also create datastores based on need afterward: some general-use datastores first, then a bigger or smaller store added based on performance or capacity needs. After all the research I have done and the questions I have asked on Twitter*, I still think this is a good plan in most situations.
I went over VMworld.com session TA3220 – VMware vStorage VMFS-3 Architectural Advances since ESX 3.0 and read this paper:
http://www.vmware.com/resources/techresources/1059
I also went over some blog posts at Yellow-Bricks.com and Virtualgeek.

An idea occurred to me about using extents in VMFS, SCSI reservations/locks, and VDI “boot storms”. First, some things I picked up:
1. Extents are not “spill and fill”; VMFS places VM files across all the LUNs. It is not quite what I would call load balancing, since it does not take I/O load into account when placing files, but in situations where all the VMs have similar loads this won’t be a problem.
2. Only the first LUN in a VMFS span gets locked by “storage and VMFS administrative tasks” (Scalable Storage Performance, p. 9). I am not sure if this applies to all locks.

Booting hundreds of VMs for VMware View will cause locking, and even though vSphere is much better about how quickly this process completes, there is still an impact. So I am beginning to think of a disk layout that eases administration for VDI and possibly lays the groundwork for improved performance. Here is my theory:

Create four LUNs of 200 GB each and use VMFS extents to group them together. The result is an 800 GB datastore with four disk queues and only one LUN that locks during administrative tasks.
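To put a rough number on the queue benefit (assuming, purely for illustration, the common default device queue depth of 32 outstanding I/Os per LUN): four LUNs × 32 = 128 outstanding I/Os available to the spanned datastore, versus 32 for a single large LUN backing the same 800 GB.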

Give this datastore to VMware View and let it have at it. Since the I/O load for each VM is mostly the same, and really at its highest during boot, other tasks performed on the LUN after the initial boot storm will have even less impact. So we can let desktops get destroyed and rebuilt/cloned all day while only locking that first LUN. This part I still need to confirm in the lab.

What I have seen in the lab is that, with same-sized clones, the data on disk was spread pretty evenly across the LUNs.

Any other ideas? Please leave a comment. Maybe I am way off base.

*(Thanks to @lamw, @jasonboche, and @sakacc for discussing or answering my tweets.)