Category: Cisco

Vendor and Cloud lock-in; Good? Bad? Indifferent?

Vendor lock-in, also known as proprietary lock-in or customer lock-in, is when a customer becomes dependent on a vendor for products and services. Thus, the customer is unable to use another vendor without substantial switching costs.

The evolving complexity of data center architectures makes migrating from one product to another difficult and painful regardless of the level of “lock-in.” As with applications, the more integrated an infrastructure solution, architecture and business processes, the less likely it is to be replaced.

The expression “If it isn’t broke, don’t fix it” is commonplace in IT.

I have always touted the anti-vendor lock-in motto. Everything should be Open Source and the End User should have the ability to participate, contribute, consume and modify solutions to fit their specific needs. However, is this always the right solution?

Some companies are more limited when it comes to resources. Others are incredibly large and complex, making the adoption of Open Source (without support) complicated. Perhaps a customer requires a stable and validated platform to satisfy legal or compliance requirements. If the Vendor they select has a roadmap that matches the company’s, there might be synergy between the two and thus Vendor lock-in might be avoided. However, what happens when a Company or Vendor suddenly changes their roadmap?

Most organizations cannot move rapidly between architectures and platform investments (CAPEX) which typically only occur every 3-5 years. If the roadmap deviates there could be problems.

For instance, again let’s assume the customer needs a stable and validated platform to satisfy legal, government or compliance requirements. Would Open Source be a good fit for them or are they better off using a Closed Source solution? Do they have the necessary staff to support a truly Open Source solution internally without relying on a Vendor? Would it make sense for them to do this when CAPEX vs OPEX is compared?

The recent trend is for Vendors to develop Open Source solutions; using this as a means to market their Company as “Open” which has become a buzzword. Such terms like Distributed, Cloud, Scale Out, and Pets vs Cattle have also become commonplace in the IT industry.

If a Company or individual makes something Open Source but there is no community adoption or involvement, is it really an Open Source project? In my opinion, just because I post source code to GitHub doesn’t mean it is a community project. There must be adoption and contribution to add features, fixes, and evolve the solution.

In my experience, the Open Source model works for some and not for others. It all depends on what you are building, who the End User is, regulatory compliance requirements and setting expectations in what you are hoping to achieve. Without setting expectations, milestones and goals it is difficult to guarantee success.

Then comes the other major discussion surrounding Public Cloud and how some consider it to be the next evolution of Vendor lock-in.

For example, if I deploy my infrastructure in Amazon and then choose to move to Google, Microsoft or Rackspace, is the incompatibility between different Public Clouds then considered lock-in? What about Hybrid Cloud? Where does that fit into this mix?

While there have been some standards put in place, such as the OVF format, the fact is that getting locked into a Public Cloud provider can be just as bad or even worse than being locked into an on-premise or Hybrid Cloud architecture, but it all depends on how the implementation is designed. Moving forward, as Public Cloud grows in adoption, I think we will see more companies distribute their applications across multiple Public Cloud endpoints and use common software to manage the various environments, giving them a “single pane of glass” view into their infrastructure. Solutions like Cloudforms are trying to help solve these current and frustrating limitations.

Recently, I spoke with someone who mentioned their Company selected OpenStack to prevent Vendor lock-in as it’s truly an Open Source solution. While this is somewhat true, the reality is moving from one OpenStack distribution to another is far from simple. While the API level components and architecture are mostly the same across different distributions the underlying infrastructure can be substantially different. Is that not a type of Vendor lock-in? I believe the term could qualify as “Open Source solution lock-in.”

The next time someone mentions lock-in ask them what they truly mean and what they are honestly afraid of. Is it that they want to participate in the evolution of a solution or product or that they are terrified to admit they have been locked-in to a single Vendor for the foreseeable future?

The future is definitely headed towards Open Source solutions and I think companies such as Red Hat and others will guide the way, providing support and validating these Open Source solutions helping to make them effortless to implement, maintain, and scale.

All one needs to do is look at the largest Software Company in the world, Microsoft, and see how they are aggressively adopting Open Source and Linux. This is a far cry from the Microsoft v1.0 which solely invested in their own Operating System and neglected others such as Linux and Unix.

So, what do you think? Is Vendor lock-in, whether software related, hardware related, Private or Public Cloud, truly a bad thing for companies and End Users or is it a case by case basis?

The day the systems administrator was eliminated from the Earth… fact or fiction?

As software becomes more complex and demands the scalability of the cloud, IT’s mechanic of today, the systems administrator, will disappear. Tomorrow’s systems administrator will be entirely unlike anything we have today.

For as long as there have been computer systems, there has been a group of individuals managing and monitoring them: system administrators. These individuals were the glue of data centers, responsible for provisioning and managing systems, from the monolithic platforms of the old ages to today’s mixed bag of hardware, storage, operating systems, middleware, and software.

The typical System Administrator usually possessed superhuman diagnostic skills and repair capabilities to keep a complex mix of disparate systems humming along happily. The best system administrators have always been the “Full Stack” individuals who were armed with all the skills needed to keep systems up and running, but these individuals were few and far between.

Data centers have become more complex over the past decade as systems have been broken down, deconstructed into functional components and segregated into groupings. Storage has been migrated to centralized blocks like a SAN and NAS thus inevitably forcing personnel to become specialized in specific tasks and skills.

Over the years, this same trend has happened with Systems Infrastructure Engineers/Administrators, Network Engineers/Administrators and Application Engineers/Administrators.

Everywhere you look, intelligence is being built directly into products. I was browsing the aisle at Lowe’s this past weekend and noted that clothes washers, dryers, and refrigerators are now shipped equipped with Wi-Fi and NFC to assist with troubleshooting problems, collecting error logs and opening service tickets. No longer do we need to pore over those thousand-page manuals looking for error code EC2F to tell us that the water filter has failed; the software can do it for us! Thus it has become immediately apparent that if tech such as this has made its way into basic consumer items, things must be changing even more rapidly at the top.

I obviously work in the tech industry and would like to think of myself as a technologist and someone who is very intrigued by emerging technologies. Electric cars, drones, remotely operated vehicles, smartphones, laptops that can last 12+ hours daily while fitting in your jeans pocket and the amazing ability to order items from around the globe and have them shipped to your door. These things astound me.

The modern car was invented in 1886 and in 1903, we invented the airplane. The first commercial air flight was not until 1914 but to see how far we have come in such a short time is astounding. It almost makes you think we were asleep for the last Century prior.

As technology has evolved there has been a need for software to also evolve at a similarly rapid pace. In many ways, we have outpaced software with hardware engineering over the last Score and now software is slowly catching up and surpassing hardware engineering.

Calm down, I know I am rambling again. I will digress and get to the point.

The fact is, the Systems Administrator as we know it is a dying breed, like the dinosaur, the caveman and the woolly mammoth. All of these were great at some things but never enough to stay alive and thus were wiped out.

So what happens next? Do we all lose our jobs? Does the stock market go into a free fall and we all start drinking Brawndo the Thirst Mutilator? (If you haven’t seen Idiocracy, I feel for you.) The fact is, it’s going to be a long, slow and painful death.

Companies are going to embrace cloud at a rapid rate and as this happens people will either adapt or cling to their current ways. Not every company is going to be “cloudy”.

Stop. Let me state something. I absolutely HATE the word Cloud. It sounds so stupid. Cloud. Cloud. Cloud. Just say it. How about we all instead embrace the term shared-nothing scalable distributed computing? That sounds better.

So, is this the end of the world? No, but it does mean “The Times They Are a Changin” to quote Mr. Dylan.

A fact is, change is inevitable. If things didn’t change we would still be living in huts, hunting with our bare hands and using horses as our primary methods of transportation. We wouldn’t have indoor toilets, governments, rules, regulations, or protection from others as there would be no law system.

Sometimes change is good and sometimes it’s bad. In this case, I see many good things coming down the road but I think we all need to see the signs posted along the highway.

Burying one’s head in the dirt like an ostrich is not going to protect you.

How to build a large scale multi-tenant cloud solution

It’s not terribly difficult to design and build a turnkey, integrated, pre-configured, ready-to-use SDDC solution. However, building one that completely abstracts the compute, storage and network physical resources, provides multiple tenants a pool of logical resources along with all the necessary management, operational and application level services, and scales with the seamless addition of new rack units is considerably harder.

The architecture should be a vendor agnostic solution with limited software tie-in to hardware vendor specifics but expandable to support various vendor hardware needs with plug-n-play architecture.

Decisions should be made early on whether the solution will come in various form factors, from appliances to quarter, half and full racks, providing different levels of capacity, performance, redundancy, HA, and SLAs. The ground-up architecture should be able to expand to a mega rack scale architecture in the future, with distributed infrastructure resources, without impacting the customer experience and usage.

The design should contain more than one physical rack, with each rack unit consisting of:

  • Compute Servers with direct attached storage (software defined)
  • Top of the Rack and Management Switches hardware
  • Data Plane, Control Plane and Management Plane software
  • Management plane software
  • Platform level Operations, Management and Monitoring software
  • Application-centric workload Services

Most companies have a solution based on a number of existing technologies, architectures, products, and processes that have been part of the legacy application hosting and IT operations environments. These environments can usually be repurposed for some of the scalable cloud components, which saves time and cost, and the result is a stable environment that operations can still manage with existing processes and solutions.

In order to evolve the platform to provide not only stability and supportability but also additional features such as elasticity and improved time to market, companies should immediately initiate a project to investigate and redesign the underlying platform.

In scope for this effort are assessments of the network physical and logical architecture, the server physical and logical architecture, the storage physical and logical architecture, the server virtualization technology, and the platform-as-a-service technology.

The approach to this effort will include building a mini proof of concept based on a hypothesized preferred architecture and benchmarking it against alternative designs. This proof of concept then should be able to scale to a production sized system.

Implement a scalable elastic IaaS – PaaS leveraging self-service automation and orchestration that enables end users the ability to self-service provision applications within the cloud itself.

Suggested phases of the project would be as follows:

Phase Description:

  • Phase I Implementation of POC platforms
  • Phase II Implementation of logical resources
  • Phase III Validation of physical and logical resources
  • Phase IV Implementation of platform as a service components
  • Phase V Validation of platform as a service components
  • Phase VI Platform as a service testing begins
  • Phase VII Review, document, and complete knowledge transfer
  • Phase VIII Present findings to executive management

Typically there are four fundamental components to cloud design: infrastructure, platform, applications, and business process.

The infrastructure and platform as a service components are typically the ideal starting place to drive new revenue opportunities, whether by reselling or enabling greater agility within the business.

With industries embracing cloud design at a record pace and technology corporations focusing on automation, there is a clear benefit in moving towards a cloud data infrastructure design.

Cloud Data infrastructure allows the ability to provide services, servers, storage, and networking on-demand at any time with minimal limits helping to create new opportunities and drive new revenue.

The “Elastic” pay-as-you-go data center infrastructure should provide a managed services platform allowing application owner groups the ability to operate individually while sharing a stable common platform.

Having a common platform and infrastructure model will allow applications to mature while minimizing code changes and revisions due to hardware, drivers, software dependencies and infrastructure lifecycle changes.

This will provide a stable scalable solution that can be deployed at any location regardless of geography.

Today’s data centers are migrating away from the client-server distributed model of the past towards the more virtualized model of the future.

· Storage: As business applications grow in complexity, the need for larger, more reliable storage becomes a data center imperative.
· Disaster Recovery / Business Continuity: Data centers must maintain business processes for the overall business to remain competitive.
· Dense server racks make it very difficult to keep data centers cool and keep costs down.
· Cabling: Many of today’s data centers have evolved into a complex mass of interconnected cables that further increase rack density and further reduce data center ventilation.

These virtualization strategies introduce their own unique set of problems, such as security vulnerabilities, limited management capabilities, and many of the same proprietary limitations encountered with the previous generation of data center components.

When taken together, these limitations serve as barriers against the promise of application agility that the virtualized data center was intended to provide.

The fundamental building block of an elastic infrastructure is the workload. Workloads should be thought of as the amount of work that a single server or ‘application gear/container/instance’ can provide given the amount of resources allocated to it.

Those resources encompass compute (CPU & RAM), data (disk latency & throughput), and networking (latency & throughput). A workload is an application, part of an application, or a group of applications that work together. There are two general types of workload that most customers need to address: those running within a Platform-as-a-Service construct and those running on a hypervisor construct. Bare metal should also be considered where applicable, but only in rare circumstances.

Much like database sharding, the design should be bounded by fundamental sizing limits. A subset of resources is configured at a maximum size and hosts multiple copies of virtual machines and application groups, distributed and load balanced across a cluster of hypervisors that share a common persistent storage back end.

This is similar to load balancing but not exactly the same, as a customer or specific application will only be placed in particular ‘Cradles’. A distribution system will be developed to determine where tenants are placed upon login and to direct them to the Cradle they were assigned.
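The placement idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: each Cradle has a fixed tenant capacity, a tenant keeps its Cradle once placed, and new tenants go to the Cradle with the most free slots. The names `CradleDirectory` and `route` are hypothetical.

```python
class CradleDirectory:
    """Assigns each tenant to one Cradle and remembers the assignment."""

    def __init__(self, capacities):
        # capacities: mapping of Cradle name -> maximum number of tenants (assumed model)
        self.capacities = dict(capacities)
        self.assignments = {}  # tenant -> Cradle name

    def _load(self, cradle):
        return sum(1 for c in self.assignments.values() if c == cradle)

    def route(self, tenant):
        """Return the Cradle for this tenant, placing the tenant on first login."""
        if tenant in self.assignments:
            return self.assignments[tenant]
        # Place new tenants in the Cradle with the most free slots.
        free = {c: cap - self._load(c) for c, cap in self.capacities.items()}
        cradle = max(sorted(free), key=lambda c: free[c])
        if free[cradle] <= 0:
            raise RuntimeError("all Cradles are full; add capacity")
        self.assignments[tenant] = cradle
        return cradle


directory = CradleDirectory({"cradle-a": 2, "cradle-b": 1})
print(directory.route("tenant-1"))
print(directory.route("tenant-1"))  # same Cradle on every later login
```

Unlike a load balancer, the lookup is sticky: the decision is made once per tenant rather than once per request.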

In order to aggregate as many workloads as possible in each availability zone or region, a specific reference architecture design should be made to determine the ratio of virtual servers per physical server.

The size will be driven by a variety of factors including oversubscription models, technology partners, and network limitations. The initial offering will result in a prototype and help determine scalability and capacity, and this design should scale in a linear, predictable fashion.
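As a worked example of that ratio, here is a rough Python calculation. The host size, VM size, oversubscription ratios and hypervisor RAM reservation are all made-up numbers for illustration, not recommendations.

```python
def vms_per_host(host_cores, host_ram_gb, vm_vcpus, vm_ram_gb,
                 cpu_oversubscription=4.0, ram_oversubscription=1.0,
                 reserved_ram_gb=16):
    """Return how many identical VMs fit on one host.

    The default ratios and the RAM reserved for the hypervisor are
    illustrative assumptions only.
    """
    by_cpu = int(host_cores * cpu_oversubscription // vm_vcpus)
    by_ram = int((host_ram_gb - reserved_ram_gb) * ram_oversubscription // vm_ram_gb)
    # The scarcer resource sets the ratio.
    return min(by_cpu, by_ram)


# Hypothetical host: 32 cores, 512 GB RAM; hypothetical VM: 2 vCPU, 8 GB RAM.
ratio = vms_per_host(32, 512, vm_vcpus=2, vm_ram_gb=8)
print(ratio)  # 62: RAM, not CPU, is the limit here
```

If the design scales linearly, total capacity is simply this ratio multiplied by the number of hosts, which is what the prototype should confirm or disprove.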

The cloud control system and its associated implementations will be composed of Regions and Availability Zones, similar in many ways to what Amazon AWS does currently.

The availability zone model makes it possible to isolate one fault domain from another. Each availability zone has isolation and redundancy in management, hardware, network, power, and facilities. If power is lost in a given availability zone, tenants in another availability zone are not impacted. Each availability zone resides in a single datacenter facility and is relatively independent. Availability zones are then aggregated into regions, and regions into the global resource pool.
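The zone, region and global pool hierarchy maps naturally onto a small data model. The Python sketch below is my own illustration; the class names, the capacity figures and the `healthy` flag are assumptions used to show that losing one zone removes only that zone's capacity.

```python
from dataclasses import dataclass, field


@dataclass
class AvailabilityZone:
    name: str
    datacenter: str      # each zone lives in a single facility
    capacity: int        # workloads the zone can host (made-up unit)
    healthy: bool = True


@dataclass
class Region:
    name: str
    zones: list = field(default_factory=list)

    def available_capacity(self):
        # A failed zone removes only its own capacity from the region.
        return sum(z.capacity for z in self.zones if z.healthy)


def global_capacity(regions):
    """Regions aggregate into the global resource pool."""
    return sum(r.available_capacity() for r in regions)


east = Region("east", [AvailabilityZone("east-1", "dc-a", 100),
                       AvailabilityZone("east-2", "dc-b", 100)])
west = Region("west", [AvailabilityZone("west-1", "dc-c", 50)])

east.zones[0].healthy = False          # simulate power loss in one zone
print(global_capacity([east, west]))   # 150
```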

The basic components would be as follows:

· Hypervisor and container management control plane
· Cloud orchestration
· Cloud blueprints/templates
· Automation
· Operating system and application provisioning
· Continuous application delivery
· Utilization monitoring, capacity planning, and reporting

Hardware considerations should be as follows:

· Compute scalability
· Compute performance
· Storage scalability
· Storage performance
· Network scalability
· Network performance
· Network architecture limitations
· Oversubscription rates & capacity planning
· Solid-state flash leveraged to increase performance and decrease deployment times

Business concerns would be:

· Cost-basis requirements
· Margins
· Calculating cost VS profits to show ROI (chargeback/show back)
· Licensing costs
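To make the chargeback/showback item concrete, here is a small Python sketch. It assumes a single monthly platform cost (hardware, licensing and operations lumped together), a flat price per consumed unit, and invented usage figures; real models are usually more granular.

```python
def showback(monthly_platform_cost, usage_by_tenant, price_per_unit):
    """Split platform cost across tenants by usage and compute margin and ROI."""
    total_units = sum(usage_by_tenant.values())
    report = {}
    for tenant, units in usage_by_tenant.items():
        report[tenant] = {
            # Showback: the share of the cost this tenant is responsible for.
            "allocated_cost": monthly_platform_cost * units / total_units,
            # Chargeback: what the tenant is actually billed.
            "billed": units * price_per_unit,
        }
    revenue = sum(row["billed"] for row in report.values())
    margin = revenue - monthly_platform_cost
    roi = margin / monthly_platform_cost
    return report, margin, roi


# Hypothetical numbers: 10,000 per month in cost, 400 units consumed at 30 each.
report, margin, roi = showback(10_000, {"team-a": 300, "team-b": 100}, 30)
print(margin, roi)  # 2000 0.2
```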

The extensibility of the solution dictates the ability to use third party tools for authentication, monitoring, and legacy applications. The best cloud control system should allow the ability to integrate legacy systems and software with relative ease. It’s my own personal preference to lead with Open Source software, but that decision is left to the user.

Monitoring, capacity planning, and resource optimization should consider the following:

· Reactive – Break-Fix monitoring where systems and nodes are monitored for availability and service is manually restored
· Proactive – Collect metrics data to maintain availability, performance, and meet SLA requirements
· Forecasting – Use proactive metric data to perform capacity planning and optimize capital usage
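As an illustration of the forecasting step, the Python sketch below fits a straight line to daily utilization samples and estimates how many days remain before a threshold is crossed. The 80% threshold and the sample data are assumptions, and a linear trend is only the simplest possible model.

```python
def forecast_days_until_full(daily_used_pct, limit_pct=80.0):
    """Estimate days remaining until utilization reaches limit_pct.

    Uses an ordinary least-squares line through the samples (needs at
    least two). Returns None if usage is flat or falling.
    """
    n = len(daily_used_pct)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_used_pct) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_used_pct))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_at_limit = (limit_pct - intercept) / slope
    # Count from the most recent sample, never below zero.
    return max(0.0, day_at_limit - (n - 1))


samples = [50, 52, 54, 56, 58]  # made-up daily utilization percentages
print(forecast_days_until_full(samples))  # 11.0
```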

Because cloud computing is a fundamental paradigm shift in how Information Technology services are usually delivered it will cause significant disruption inside most of the current organizations. Helping each of these organizations embrace the change will be key.

While the final impacts are currently impossible to measure, a self-service model is clearly the future and integral to delivering customer satisfaction, from both an internal and an external user perspective.

Some proof of concept initiatives would be as follows:

· Determine a go-forward architecture for the IaaS and PaaS offering inclusive of a software defined network
· Benchmark competing architecture options against one another from a price, performance, and manageability perspective
· Establish a “mini-cradle” that can be maintained and used for future infrastructure design initiatives and tests
· Determine how application deployment can be fully or partially automated
· Determine a cloud control system to facilitate provisioning of Operating Systems and multi-tiered applications
· Complete the delivery of FAC to generate metrics and provide statistics
· Show the value of self-service to internal organizations
· Measure the ROI based on cost of the cloud service delivery combined with the business value
· Don’t build complex for the initial offering
· Avoid spending large amounts of capital expenses on the initial design

After implementing a proof of concept, testing encompassing the following (and more) should be done:

Proof of Functionality

  • The solution system runs in our datacenter; on our hardware
  • The solution system can be implemented with multi-network configuration
  • The solution system can be implemented with as few manual steps as possible (automated installation)
  • The solution systems have the ability to drive implementation via API
  • The solution system provides a single point of management for all components
  • The solution system enables dynamic application mobility by decoupling the definition of an application from the underlying hardware and software
  • The solution system can support FAC production operating systems
  • The solution system Hypervisor and guest OS are installed and fully functional
  • The solution systems support internal and external authentication against existing authentication infrastructure.
  • The solution system functions as designed and tested

Proof of Resiliency

  • The solution system components are designed for high availability
  • The solution system provides multi-zone (inter-DC, inter-region, etc.) management
  • The solution system provides multi Data Center management

Integration Testing

  • The solution system is compatible with legacy, current, and future systems integration

Complexity Testing

  • The solution system has the ability to manage both simple and complex configurations

Metric Creation

  • The solution systems have metrics that can be monitored

How to configure MDS Port-Channels for Cisco UCS

First, enable the fibre-port-channel feature:

conf t
feature fport-channel-trunk

Next, configure the SAN port channel before adding ports to it:

interface san-port-channel 100
channel mode active
switchport trunk mode off

We set the channel mode to active. A SAN port channel only supports on or active: active negotiates the FC port channel with the peer, while on forces it on without negotiation. We then set the trunking mode to off; this might be different for you if you’re using both NPIV and trunking multiple VSANs.

Next, configure the actual ports to be members of the port channel:

interface fc1/29
switchport trunk mode off
channel-group 100 force

interface fc1/30
switchport trunk mode off
channel-group 100 force

Once you have done this, if your VSAN is not VSAN 1 you need to add the port channel to it:

vsan database
vsan XYZ interface san-port-channel 100

You can now bring the member ports and the port channel up with no shut:

interface fc1/29
no shut
interface fc1/30
no shut

interface san-port-channel 100
no shut
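If you repeat these steps for several port channels, the command sequence above can be templated. The Python sketch below only renders text using the commands already shown in this post; it does not talk to a switch. The function name and the VSAN number in the example are hypothetical, and it writes the interface name with a space (`san-port-channel 100`), matching the first interface command.

```python
def san_port_channel_config(channel_id, members, vsan=None):
    """Render the switch commands from this post for one SAN port channel."""
    lines = [
        "conf t",
        "feature fport-channel-trunk",
        f"interface san-port-channel {channel_id}",
        "channel mode active",
        "switchport trunk mode off",
    ]
    # Add each FC port to the port channel.
    for port in members:
        lines += [f"interface {port}",
                  "switchport trunk mode off",
                  f"channel-group {channel_id} force"]
    # Only needed when the port channel is not in the default VSAN 1.
    if vsan is not None and vsan != 1:
        lines += ["vsan database",
                  f"vsan {vsan} interface san-port-channel {channel_id}"]
    # Bring the members and then the port channel up.
    for port in members:
        lines += [f"interface {port}", "no shut"]
    lines += [f"interface san-port-channel {channel_id}", "no shut"]
    return "\n".join(lines)


print(san_port_channel_config(100, ["fc1/29", "fc1/30"], vsan=10))
```

Review the rendered output before pasting it into a switch; it is a convenience for consistency, not a substitute for checking the commands against your platform.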

Once this is done you need to configure the UCS to support this connectivity.

Troubleshooting vSphere Auto Deploy with the vCenter Appliance

I was pulled into an issue the other day where some ESXi Hosts were failing to boot via Auto Deploy. Now, this shouldn’t come as a shock to anyone, but I absolutely HATE the Auto Deploy product as I feel it’s flaky, doesn’t work properly and can be cumbersome to manage if one is not comfortable using the command line. I personally am not a GUI person and feel right at home in the command line, but I have seen enough random issues with Auto Deploy over the years that I rarely recommend using it for large scale deployment unless the environment has a requirement to be stateless and boot from SAN is not an option.

So. Let’s dig in and discuss Auto Deploy, how it works, find the issue and solve the problem! Below we have a diagram that shows the typical chain of events when a server boots and is directed to use Auto Deploy via PXE.

So, obviously the first thing we need to check is whether there is an error on the console output. This environment is using Cisco UCS, which is a stateless computing platform. UCS allows hardware to migrate between chassis or domains and the logical design of the server follows. The logical design contains everything that would make up a normal whitebox server, but the hardware details are specifically abstracted. This allows administrators to provision a nearly infinite number of logical designs. If you want to know more about UCS you can find details here:

https://www.youtube.com/user/bradhedlund/videos

So, when we open the KVM console of the UCS blade we see the following:

[Screenshot: error displayed on the KVM console]

Right off the bat we know that DHCP seems to be working and we can see what device is handing out the IP addresses. Looking into the DHCP device I was able to determine that it was a Windows Server OS responsible for the DHCP configurations:

[Screenshot: DHCP server configuration]

A quick look at the DHCP configuration shows it is correct, and we can see that the DHCP server is directing TFTP requests to 10.40.80.7, which is our vCenter Appliance. So let’s go and check what’s going on there! Opening a web browser to the vCenter Appliance admin page, we see that Auto Deploy is running, though looks can be deceiving.

[Screenshot: vCenter Appliance admin page]

The best way to determine if Auto Deploy TFTP is working is to log in and check, so I logged in via SSH and took a look. Immediately after logging in I noticed the TFTP daemon was not running. Further, when I checked chkconfig it was also not enabled to run at boot. So, I decided to check the TFTPD config file and that’s when I saw the issue.

[Screenshot: PuTTY session showing the TFTPD config file]

Someone had modified the config file to run the ATFTPD service under root. This was preventing the service from starting as it didn’t have the proper runtime. I made a quick change using vi and saved the file.

[Screenshot: PuTTY session after editing the config file]

I then adjusted the startup level for the service:

[Screenshot: chkconfig output]

A quick reboot of the server was then done to make sure things were working.
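A quicker sanity check than rebooting a host is to probe the TFTP service from another machine. The Python sketch below sends a standard TFTP read request (RFC 1350) and treats either a DATA or an ERROR reply as proof that a daemon is listening. The function names are mine, and the boot file name in the usage note is an assumption; use whatever file your DHCP scope hands out.

```python
import socket
import struct


def build_rrq(filename, mode="octet"):
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (struct.pack("!H", 1) + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def tftp_responds(server, filename, port=69, timeout=3.0):
    """Return True if the server answers a read request with DATA or ERROR.

    Either reply proves a TFTP daemon is listening; a timeout or socket
    error suggests the service is down or blocked.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_rrq(filename), (server, port))
            # Servers reply from a new ephemeral port, so accept any source.
            reply, _ = sock.recvfrom(516)
        except OSError:  # includes socket.timeout
            return False
    if len(reply) < 2:
        return False
    opcode = struct.unpack("!H", reply[:2])[0]
    return opcode in (3, 5)  # 3 = DATA, 5 = ERROR
```

For example, `tftp_responds("10.40.80.7", "undionly.kpxe.vmw-hardwired")` would have returned False while the daemon was stopped. The file name here is an assumed example of an Auto Deploy boot file.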

[Screenshot: KVM console for Chassis 1, Server 6]

Then it was time to celebrate!

vMotion over Distance and Stretched VLAN

I am only going to focus on a single option for stretching Layer 2. Cisco’s OTV provides a mechanism to transport native Layer-2 Ethernet frames to a remote site. With a standard Layer-3 WAN, there is no way to bridge layer-2 VLANs, and as a result, communication between two sites must be routed. Because of the routing aspect, it is not possible to define the same VLAN in two locations and have them both be actively transmitting data simultaneously.

Because OTV can operate over any WAN that can forward IP traffic, it can be used with a multitude of different underlying technologies. It provides mechanisms to control broadcasts at the edge of each site, just as with a standard Layer-3 WAN, but also gives you the ability to allow certain broadcasts to cross the islands. OTV only needs to be deployed at certain edge devices, and is only configured at those points, making it simple to implement and manage. It also supports many features to optimize bandwidth utilization, provide resiliency and scalability.

 

Here we have two sites separated by a standard Layer-3 WAN connection. OTV is deployed across the WAN by configuring it on an edge switch at both sites. Each end of the OTV “tunnel” is assigned an IP address. Both OTV switches maintain a MAC-to-Next Hop IP table so that they know where to forward frames in a multisite configuration.

When a host at one site sends a frame to a host at the other site, it can determine the MAC address of the other host, since it is on the same VLAN/network. The host sends the Ethernet frame, which is accepted by the OTV switch and then encapsulated in an IP packet, sent across the WAN, and subsequently decapsulated by the remote OTV switch. From here, the Ethernet frame is delivered to the destination as if it had been sent locally.
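The forwarding logic just described can be modeled in a few lines of Python. This is a toy model for intuition only, not how OTV is implemented: the class names are invented, the MAC and IP addresses are documentation examples, and real OTV learns remote MAC addresses through its control plane rather than a manual `learn` call.

```python
from dataclasses import dataclass


@dataclass
class EthernetFrame:
    src_mac: str
    dst_mac: str
    payload: bytes


@dataclass
class IpPacket:
    src_ip: str
    dst_ip: str
    inner: EthernetFrame   # the original Layer-2 frame, carried unchanged


class OtvEdge:
    """Toy model of an OTV edge switch."""

    def __init__(self, tunnel_ip):
        self.tunnel_ip = tunnel_ip
        self.mac_to_next_hop = {}   # remote MAC -> IP of the remote edge switch

    def learn(self, mac, next_hop_ip):
        self.mac_to_next_hop[mac] = next_hop_ip

    def encapsulate(self, frame):
        next_hop = self.mac_to_next_hop.get(frame.dst_mac)
        if next_hop is None:
            return None   # not a known remote MAC: nothing crosses the WAN
        return IpPacket(self.tunnel_ip, next_hop, frame)

    @staticmethod
    def decapsulate(packet):
        # The remote edge strips the IP wrapper and delivers the frame locally.
        return packet.inner


site_a = OtvEdge("192.0.2.1")
site_a.learn("00:50:56:aa:bb:02", "198.51.100.1")

frame = EthernetFrame("00:50:56:aa:bb:01", "00:50:56:aa:bb:02", b"hello")
packet = site_a.encapsulate(frame)
print(packet.dst_ip)                         # 198.51.100.1
print(OtvEdge.decapsulate(packet) == frame)  # True
```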

Concept in Practice: Workload Relocation Across Sites

In the event that there is a planned event that will impact a significant number of resources, services must be moved to an alternate location. Unfortunately, because the networks are disjointed, there is no way to seamlessly migrate virtual servers from one location to another without changing IP addresses.

As a result, Site Recovery Manager can be used to provide an offline migration to the second site, and update DNS records to reflect the new IP addresses for the affected servers. Once the event is complete, another offline migration is performed to restore services to the primary site.

Concept in Practice: vMotion over Distance with OTV & Stretched VLANs

By using OTV in this situation instead of having to use SRM to emulate a disaster situation, vMotion can be used to migrate the VMs from one site to the other. While this migration is still an offline event, it does provide a much simpler solution to implement and manage by allowing the VM to maintain its network identity in either location.

In addition to addressing the initial challenge, OTV provides additional benefits.

By having two functional sites with the same network attributes, it is possible to split workloads for services, providing fault tolerance and redundancy. If the primary site does have a planned event, failing resources over to the second site may not even be necessary. It also allows the infrastructure to scale by having added ESXi servers operational at the second location to distribute the load.

 

Cisco UCS Manager Java Errors – Unable to load source

I have literally spent the last while, including this morning, trying to figure out why I could no longer launch my Cisco UCS Manager console for the test environment. The production environment works without any issues, but when I attempt to load UCSM for the test environment it hangs at the Java loading screen and after a few minutes errors out.

Knowing that Java and UCS usually don’t play well together, I figured this issue was most likely related to the Java version not working with UCS 2.1.1C, so I assumed that my current version, Java 7 U51, was the culprit. Here is a list of the steps I took, but you can skip over it if you want the quick fix.

  • Removed Java 7 U51, cleaned up the Java leftovers, rebooted Windows
  • Installed Java 6 U37, tested, failed. Removed Java 6 U37, cleaned up Java leftovers, rebooted Windows
  • Installed Java 6 U45, tested, failed, removed, rebooted.
  • It was about this point when my blood pressure began to rise and things turned from rated G to rated R
  • Installed Java 7 U21 tested, failed, removed, rebooted
  • Installed Java 7 U25 32-bit and 64-bit… tested, failed, removed, rebooted.
  • I did some research on Google and came across JavaRa, which removes all traces of Java. Used the utility, and then used CCleaner followed by another reboot.
  • Installed Java 8 Beta, tested, and failed.

After looking around online I came across someone reporting a similar issue, not related to Cisco UCS but to another Java based application. In the post they said deleting the following locations resolved their problems.

  • \Users\username\AppData\Local\Sun
  • \Users\username\AppData\LocalLow\Sun
  • \Users\username\AppData\Roaming\Oracle

Sure enough, deleting those folders and rebooting fixed it!
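If you hit this often, the cleanup can be scripted. The Python sketch below looks for the three folders listed above under a given user profile and, only when asked, deletes them. The function name and the dry-run default are my own choices; close Java and your browser first, and run the dry run before deleting anything.

```python
import shutil
from pathlib import Path

# The three per-user locations from the list above, relative to the profile.
CACHE_DIRS = [
    Path("AppData") / "Local" / "Sun",
    Path("AppData") / "LocalLow" / "Sun",
    Path("AppData") / "Roaming" / "Oracle",
]


def clear_java_caches(profile_dir, dry_run=True):
    """Return the Java cache folders found under a Windows user profile.

    With dry_run=False the folders that were found are also deleted.
    """
    found = [Path(profile_dir) / rel for rel in CACHE_DIRS]
    found = [p for p in found if p.is_dir()]
    if not dry_run:
        for path in found:
            shutil.rmtree(path)
    return found
```

Calling `clear_java_caches(Path.home())` lists what would be removed for the current user; pass `dry_run=False` to actually delete.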

On a side note, I am now running Java 8 and it seems to work fine with UCS… but we will see how long that lasts.