VMs are dead. Long live containers.

In the early 2000’s, we were introduced to virtual machines promising the ability to layer multiple complex environments on top of the same hardware without interfering with one another. We graduated from single-purpose servers to multi-purpose servers, greatly improving efficiency and allowing data centers to outgrow their physical footprint in terms of traditional usefulness. Fast-forward to the present time and container technology, that has been around for decades as well, has matured to the point that most services that we would run on VMs can now run from containers.

What’s the big deal? Well, at the end of the day, virtual machines emulate hardware and containers do not. This gives containers an edge over VMs regarding performance. But it goes deeper than that. VMs are generally big and bulky. Containers are generally small and modular. Container technology allows for faster backup and recovery times, tighter standardization (through compose yaml files), all in addition to the aforementioned performance boost.

This is why I run as many services as I can within containers, and only the few that are not yet fully supported are run from VMs. But the clock is ticking for those few services.

Boeing’s 737 Max Problem

To understand Boeing’s challenge, consider if you were to build a house with 2000 sq. ft. You spec an air conditioner and heater out for it and run all of the ducts and everything is perfectly working as designed. Now you want to add a room. So you add a 500 sq. ft. room. you split a nearby duct and feed the new room. But your HVAC system is not designed for 2500 sq. ft. and the ducts are not run from your HVAC unit directly to the new room. So now you have a consequence of having diminished the HVAC effectiveness in the existing areas, and also in the new area. To fix the problem, you develop fancy software to control electronic louvers that route air to where it’s needed so that you can prevent hot/cold spots. Now you’ve introduced something else that, if it breaks, can jeopardize the effectiveness of your entire HVAC system. At the end of the day, you have compromised the initial design in a way that can only truly be corrected with a completely new design. Anything but, just adds complexity (more things that can break).

Back to the 737 issues – no matter what Boeing does to fix the 737 Maxes, and even if they are successful in their quest, at the end of the day, they’ve added complexity which reflexively makes their modified 737 Max design inferior to a new design/model. The introduction of and critical dependence on the MCAS software is a byproduct of their modification to a pre-established design. Put simply, MCAS is yet another potential point of failure of the entire aircraft. We want to minimize potential points of failure, not add to them. MCAS would never be a critical feature of a new design, and this is why I believe that Boeing is going to eventually need to discontinue production of the 737 max produce an entirely new plane.

Container Wars – Podman vs. Docker

This article is intended for developers familiar with Docker and container management who are curious about trying Podman.

I recently began a quest to test Podman to explore its features and assess its feasibility as a Docker replacement.

The first thing I had to do was add the docker.io registry to podman so that I could basically use podman as a drop-in replacement to manage my docker-compose.yml files.

sed -i /etc/containers/registries.conf -e 's/^.*unqualified-search-registries .*$/unqualified-search-registries = ["docker.io"]/'

Now, I wanted to use podman-compose, but quickly discovered that this application is still undergoing many bug fixes. Installing the apt version of podman-compose, for example, gave me version 1.0.3, but it was actually version 1.0.6+ that I needed in order to work past a bug that prevented one of my host-network containers from starting. As the apt version wasn’t suitable, I opted for a pip3 installation, which offered version 1.0.6+ (the newest version with less bugs).

pip3 install podman-compose

But when I ran it, I received the following.

error: externally-managed-environment.

So I tried this command – with success.

pip3 install podman-compose --break-system-packages

Great! Now I was finally moving along. But I wanted to run pihole (a DNS server) in a container, but when starting it, I received an error.

Error: cannot listen on the UDP port: listen udp4 :53: bind: address already in use
exit code: 126

Back to digging to figure out how to fix this. Apparently, Podman uses a DNS resolver called aardvark, and it’s configured in a file at /use/share/containers/containers.conf. It’s possible to change the DNS port. But, as I learned, the changes do not take effect until every pod/container is shut down. I made the following change…

sed -i /usr/share/containers/containers.conf -e 's/#dns_bind_port.*$/dns_bind_port=54/'

Now, after stopping all of my containers and starting them all again, I was almost there. I noticed something peculiar. The start order matters. If I started pihole first, then any pods started after it would fail due to the inability for them to resolve names of the other containers. The trick was simply to start pihole last!

And that’s it! I have over a dozen containers running now and they seem surprisingly more peppy than when I ran them in Docker. But that may just be my brain trying to justify all of the hours that I spent figuring out how to make this transition work.

Overall, transitioning to Podman presented challenges, but I gained valuable insights and found it surprisingly performant. While Docker remains familiar, Podman’s security focus and rootless operation are intriguing, especially for long-term use.

GlusterFS Optimized for VMs (Ultra-Low-Cost)

This is a 4-node glusterfs cluster setup as a replica 3 arbiter 1 cluster. I use 512MB shards, reducing fragmentation of VM disks without hurting performance. My disks are all backed by SSDs. Every node has up to 2 bricks on 2 different disks. Every node is an Android TV H96 MAX X3 box running Armbian with disks attached via the USB 3.0 port. I am able to reboot any node without data loss.

Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: amlogic1:/mnt/gluster1/brick2
Brick2: amlogic2:/mnt/gluster1/brick2
Brick3: amlogic4:/mnt/arbiter/arb2s1-2 (arbiter)
Brick4: amlogic3:/mnt/gluster1/brick2
Brick5: amlogic4:/mnt/gluster1/brick2
Brick6: amlogic2:/mnt/arbiter/arb2s3-4 (arbiter)


GlusterFS Volume Options
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.cache-refresh-timeout: 10
performance.cache-size: 2GB
storage.fips-mode-rchecksum: on
performance.strict-o-direct: on
features.scrub-freq: daily
features.scrub-throttle: lazy
features.scrub: Inactive
features.bitrot: off
storage.batch-fsync-delay-usec: 0
performance.nl-cache-positive-entry: off
performance.parallel-readdir: off
performance.cache-max-file-size: 512MB
cluster.server-quorum-type: server
performance.readdir-ahead: on
network.ping-timeout: 10
features.shard-block-size: 512MB
client.event-threads: 5
server.event-threads: 3
cluster.data-self-heal-algorithm: full
cluster.shd-max-threads: 16
cluster.shd-wait-qlength: 8192
server.allow-insecure: on
features.shard: on
cluster.quorum-type: auto
network.remote-dio: on
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
cluster.locking-scheme: granular
performance.low-prio-threads: 20
cluster.choose-local: off
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 10
network.inode-lru-limit: 32768
cluster.self-heal-window-size: 8
cluster.granular-entry-heal: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on


And a custom sysctl.d config:
#/etc/sysctl.d/999-gluster-tuning.conf
vm.admin_reserve_kbytes = 8192
vm.compact_unevictable_allowed = 1
vm.compaction_proactiveness = 20
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 5
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 5
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 3600
vm.extfrag_threshold = 500
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 32 32 32 0
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 36200
vm.mmap_min_addr = 65536
vm.mmap_rnd_bits = 18
vm.mmap_rnd_compat_bits = 11
vm.nr_hugepages = 0
vm.nr_overcommit_hugepages = 0
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 1
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.page_lock_unfairness = 5
vm.panic_on_oom = 1
vm.percpu_pagelist_high_fraction = 0
vm.stat_interval = 1
vm.swappiness = 10
vm.user_reserve_kbytes = 128364
vm.vfs_cache_pressure = 50
vm.watermark_boost_factor = 15000
vm.watermark_scale_factor = 10

Technology – Cost, Risk, and Reward

As a trusted decision maker in the technology space, I am constantly evaluating the following three things. Cost, risk, and reward. When choosing the right technology to solve a problem, not evaluating all three areas is, at best, wasteful – and at worst, exceedingly risky.



Take for example the tried and true example of telephone service. What is the cost, risk, and reward profile of a standard copper-delivered (POTS) telephone service to a company? Well, the cost is relatively high, the risk is relatively low, and the reward is relatively low – it does what it’s supposed to do, and not more. But if we consider VoIP, now we have a lower cost, a higher risk (Internet issues may influence quality or availability), and higher reward (flexible vendor choice, mobility options, etc.). Risk typically increases as cost decreases for different solutions. If cost is the primary driver of a solution, then risk will likely be higher. When stability is the driving factor for a project or application, expect to pay a premium cost.

Whenever choosing technology for your company, always remember the three main decision points – cost, risk, reward – and choose wisely.

Technology Matters

I recently had a candid conversation with one of our marketing executives, and I was shocked to learn her perspective on IT’s purpose within the company. The conversation was about web site performance. I was talking about how so many of the company’s objectives require IT, including her objectives to improve the web site’s performance. She had been working directly with our internal development team for over a year to get a project rolling to improve our web site’s performance with no traction and no success. Immediately after learning about her objective, I offered a solution. Since our developers were unable to prioritize a project for her, I suggested that we put a layer of caching in front of the site to meet her performance objectives without having our developers modify a single line of code. I explained to her how IT is entrenched in every aspect of the company, charged with the goal of making everyone more efficient and finding creative solutions to the company’s technical challenges. She had no idea that we could help her. She confessed to me that she thought that IT was just the people who sets up her new monitors and replaces her laptops every few years. Very sad.

The saddest thing, is that this marketing executive’s perspective is not an exception, it’s a common belief among other executives as well. By not knowing the role of IT, she excluded us from the conversation and paid for it with a year of lost clicks, lost leads, and lost revenue.

If a company was viewed as Amtrak, then the IT department would be Grand Central Station – the hub in which all paths cross. If a company was viewed as a car, then IT would be the oil and the gas (or batteries). You get the idea. IT is the force that keeps the company productive, on track, online, and competitive. Innovation starts here. If you think IT just sets up your monitors and then goes back to their caves, you are naive and it is very likely that you will struggle to succeed.

Advice – Learn the importance of IT and leverage your IT team to help you succeed! At a minimum, keep them informed. Any modern company without IT representation with a seat at the table, is destined for failure.

The Unprecedented Importance of Revisiting IT Precedents

Every IT leader would agree that a few of the top goals of any IT department are to provide reliable, secure, and sufficient services to meet the needs of the company that they support. Most IT leaders would also agree that yesterday’s solution is for yesterday’s needs. If today’s needs are different, then yesterday’s solutions are likely insufficient. This is clear to most people. The blind spot for some leaders is recognizing that solutions continuously evolve even if your needs remain static. Today’s unchanged solution to your unchanged needs might seem sufficient, but it might be hurting your company if it goes unchecked.

Whether you have worked with the company for 2 years or 20 years, it is important to revisit the precedents for solutions that have been set by you or others on a periodic basis. Each year, I add to my IT action plan to review at least one past large-scale solution with the same basic goals to achieve: better service, and better value.

Two years ago, I chose to revisit the company’s telecommunications infrastructure. As anticipated, I had the goals of providing better service and better value to the company and our customers. In pursuit of these goals, I met with our sales and support teams and asked about any pain points that they had and what an ideal solution would look like for them. I met with our finance team and compiled a complete TCO report for our existing telecommunications infrastructure. To provide better value, it was imperative to know the big picture of our total costs. Any new solution would need to provide compelling enough value and service for it to make sense. After a few months of research, and many SC led demo’s we found our solution. We tossed our very large footprint legacy on-premises PBX (including the PRI, and ACD, AES, SIP, etc.) in favor of software phones and a standards-based SIP solution with a managed cloud PBX. It has been two years since that project was completed and the our sales and support teams still tell me how happy they are with the new phone system. Not only did this project save the company $30K per year, but the sales team is enjoying mobility options that had never existed before. Our sales team can work from home and even take calls to their office line from the road.

The best part of the whole project was when I saw the smiles on our sales team’s faces a few weeks after the project was completed. The company’s needs had not changed since the last solution was implemented. But by contrasting the viable incumbent solution against leading modern solutions, I was able to sponsor and implement the change that unlocked unforeseen productivity gains and simultaneously improved the company’s bottom-line.

Creating a GlusterFS Cluster for VMs

The best GlusterFS layout that I have found to work with VMs is a distributed replicated cluster with shards enabled. I use laptops with 2 drives in each one. Let’s say we have 4 laptops with 2 drives on each one, we would do something like what I have listed below.

First, let’s create a partition on each disk. Use fdisk /dev/sda and fdisk /dev/sdb to create a partition. Now, each disk needs to be formatted as XFS. We format with the command.

mkfs.xfs -i size=512 /dev/sda1
mkfs.xfs -i size=512 /dev/sdb1
mkdir /mnt/disk1
mkdir /mnt/disk2

Now we can use blkid to list the UUID of sda1 and sdb1 so that we can add it to fstab. My fstab looks something like this (your UUIDs will be different. The allocsize is set to pre-allocate files 64MB at a time to limit fragmentation and improve performance. The noatime is to prevent the access time attribute being set every time the file is touched – this is to improve performance as well. The nofail is to prevent the system from not booting in the event of a disk failure.

#!/etc/fstab
UUID=3edc7ec8-303a-42c6-9937-16ef37068c72 /mnt/disk2 xfs defaults,allocsize=64m,noatime,nofail 0 1
UUID=b8906693-27ba-466b-9c39-8066aa765d2e /mnt/disk1 xfs defaults,allocsize=64m,noatime,nofail 0 1

Now I did something funky with my fstab because I wanted to mount my bricks under the volname so that I could have different volumes on the same disks. So I added these lines to my fstab (my volname is “prod”).

#!/etc/fstab
/mnt/disk1/prod/brick1 /mnt/gluster/prod/brick1 none bind 0 0
/mnt/disk2/prod/brick2 /mnt/gluster/prod/brick2 none bind 0 0
mkdir -p /mnt/gluster/prod/brick1
mkdir -p /mnt/gluster/prod/brick2
mount /mnt/disk1 mount /mnt/disk2
mkdir -p /mnt/disk1/prod/brick1
mkdir -p /mnt/disk2/prod/brick2

Now we can mount everything else.

mount -a

Make sure that everything is mounted properly.

df -h /mnt/gluster/prod/brick1

Make sure that you see /dev/sda1 next to it. If not, just reboot and fstab will mount everything appropriately.

Now let’s create the gluster cluster.

gluster volume create prod replica 2 gluster1:/mnt/gluster/prod/brick1 gluster2:/mnt/gluster/prod/brick1 gluster3:/mnt/gluster/prod/brick1 gluster4:/mnt/gluster/prod/brick1 gluster1:/mnt/gluster/prod/brick2 gluster2:/mnt/gluster/prod/brick2 gluster3:/mnt/gluster/prod/brick2 gluster4:/mnt/gluster/prod/brick2

By specifying the order as we did, we ensure that server gluster1 and server gluster2 are paired up with each other, and gluster3 and gluster4 are paired up with each other. If we did both bricks on gluster1 successively, then we would be unable to sustain a failure of the gluster1 node. So we alternate servers. This also improves performance.

Now, the only thing left is to tune some parameters meant for VMs. For each parameter below, we will use a command like the following:

gluster volume set prod storage.fips-mode-rchecksum on

Here are my options:

performance.readdir-ahead: on
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
cluster.granular-entry-heal: on
network.ping-timeout: 20
features.shard-block-size: 64MB
client.event-threads: 4
server.event-threads: 4
cluster.data-self-heal-algorithm: full
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
server.allow-insecure: on
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: on
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
cluster.locking-scheme: granular
performance.low-prio-threads: 32
cluster.choose-local: off
storage.fips-mode-rchecksum: on
config.transport: tcp

Now we can fire up the gluster cluster with the following command.

gluster volume start prod

I pity the fool…

Recently, there seems to be an inordinate amount of bad luck surrounding my family. My wife passed out in February and crashed her car with the girls in it – thankfully everyone was ok – but it could have been tragic. As a result of that medical event, my wife can no longer drive for at least a year, perhaps longer. As you can imagine, this puts a huge hardship on the entire family. To make things more complicated, she continues to have episodes of passing out every once in a while, which makes it prudent for her not to carry the kids up or down stairs anymore. Back to the driving – not only can she not drive, but I have to take her for many doctor appointments while the doctors try to figure out what is wrong, and I also have had to re-arrange my work arrangements so that I can spend more time working from home so that I can drive everyone places. I am working longer hours to make up the time, doing more chores at home, and pretty much burning the candle at both ends. Where am I going with all of this? It’s easy for someone in a predicament like this to ask for pity. It’s easy to lose track of all of the good things in the world and focus on the current uptick in negative things. But that’s just not in my DNA. I don’t want pity, though help is always welcome. I come home every day from work and give the family a big hug. I’m happy that everyone is healthy, and that I get to see and enjoy their presence another day. So in the face of all of the recent bad luck bestowed on my family, I still feel like the luckiest guy around to have, and be a part of, such an amazing family.

Create a UEFI bootable ISO on Debian

First thing’s first… let’s get a quick overview of what’s need, what goes where, and what to expect.

In order to make an ISO bootable, you need an .img file. Many tutorials call this efiboot.img. It’s basically a FAT formatted file that contains a specific folder structure and a specially named executable that UEFI will run. The folder structure should have just one file in it located in /efi/boot/bootx64.efi. The bootx64.efi file is the boot code, so it can be any boot loader that generates it. I like Grub, so that’s what my instructions below do.

Now you may wonder – if you use Grub as the bootx64.efi, then where the heck does it get grub.cfg from? It’s pretty plain and simple – it’s in /boot/grub/grub.cfg just as it would be in a regular filesystem. Except, /boot/grub/grub.cfg lives outside the .img file — it lives along-side it, really.

Ok, now that that’s out of the way, let’s get started. You need some packages first.

sudo apt update
sudo apt install grub-efi-amd64 grub-efi-amd64-bin grub-imageboot grub-legacy grub-pc-bin grub-pc

Now let’s create a rescue image, which will do the heavy lifting for us. Then we will mount it to a temp folder.

grub-mkrescue -o grub.iso
mkdir /tmp/grubcd
mount -o loop grub.iso /tmp/grubcd

Now let’s create the folder where we want our ISO to live. Then we will copy the contents of grub.iso to that folder.

mkdir myiso
cp -avR /tmp/grubcd/* myiso/

Next, let’s create a grub.cfg file so that you can tell it where and what to load. You will want to edit this and not leave it blank.

touch myiso/boot/grub/grub.cfg

Lastly, copy your vmlinuz and initrd files to myiso, along with any other files you need/want. Then you can create your iso with something like xorriso.

cd myiso
xorriso -volid "CDROM" -as mkisofs -isohybrid-mbr isohdpfx.bin -b isolinux.bin -no-emul-boot -boot-load-size 4 -boot-info-table -eltorito-alt-boot -e grub.img -isohybrid-gpt-basdat -o ../mycd.iso ./