Speeding up Web Scraping

Introduction

In this tutorial we’ll use a variety of techniques to speed up our toy web scraper.

Program Version                    Time (s)
Original                           17.7
Release mode                       4.7
DispatchQueue.concurrentPerform    1.1
URLSession                         0.6
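
The table names DispatchQueue.concurrentPerform, but this excerpt doesn't show how it was applied. As a rough sketch only, assuming each page can be processed independently, the pattern might look like the following. The wordCount function is a hypothetical stand-in for the real per-page parsing, and the lock is just one simple way to keep the shared results array safe.

```swift
import Foundation

// Hypothetical stand-in for the real per-page work (e.g. parsing one scraped page).
func wordCount(of page: String) -> Int {
    page.split(separator: " ").count
}

// Runs the per-page work across the available cores and returns results in input order.
func processConcurrently(_ pages: [String]) -> [Int] {
    let lock = NSLock()
    var results = [Int](repeating: 0, count: pages.count)
    // concurrentPerform blocks until every iteration has finished.
    DispatchQueue.concurrentPerform(iterations: pages.count) { index in
        let count = wordCount(of: pages[index])
        // Serialize writes to the shared array.
        lock.lock()
        results[index] = count
        lock.unlock()
    }
    return results
}
```

Because concurrentPerform waits for all iterations, the caller gets a plain array back and needs no completion handler.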
More …

Web Scraping with SwiftSoup

Introduction

In this tutorial we’ll use the open-source SwiftSoup library to scrape open-source houseplant data from the Wikipedia houseplants page.

Our program will scrape this URL: https://en.wikipedia.org/wiki/Houseplant

Our program will produce a JSON object containing the scraped data. It will look something like this:

{
    "Tropical and subtropical": {
        "Aglaonema": {
            "description": "These are evergreen perennial herbs with stems…"
        },
        …
    },
    …
}
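
One way to produce output of that shape, sketched here with Foundation's JSONEncoder, is to model the data as nested dictionaries. The Plant type, the Catalog alias, and the placeholder description text are assumptions made for illustration; the tutorial's actual types may differ.

```swift
import Foundation

// Hypothetical model for one scraped plant entry.
struct Plant: Codable, Equatable {
    let description: String
}

// Category name -> plant name -> plant details.
typealias Catalog = [String: [String: Plant]]

// Encodes the catalog as human-readable JSON with stable key order.
func encode(_ catalog: Catalog) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
    let data = try encoder.encode(catalog)
    return String(decoding: data, as: UTF8.self)
}

// Placeholder data, not real scraped text.
let sampleCatalog: Catalog = [
    "Tropical and subtropical": [
        "Aglaonema": Plant(description: "Placeholder description")
    ]
]
```

Calling try encode(sampleCatalog) returns a String holding the nested JSON object.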
More …

Advent of Code 2020

My college freshman son and I competed in this year’s Advent of Code coding competition. It was a battle of youthful energy against wisdom & experience. I am somewhat chastened to report that we scored about the same.😅

My son used Python, and didn’t even use any of the fancier Python libraries or language techniques. He also eschewed the debugger, doing all his debugging using print() statements.

I used Swift & Xcode, and tended to use every feature of the language and libraries.

That we scored similarly probably shows that the two languages are evenly matched for the kinds of problems that were given in this contest.

The Advent of Code puzzles typically come in two parts: the first part is usually easier, and the second part usually adds a twist that requires extending the original solution.

In looking back on the month, I think that, compared to my son, I typically over-engineered my solutions. I introduced enums and structs and helper functions where he used the built-in data types and copy-and-paste. This usually gave him a significant edge on the time to solve the first, “easier” problem.

My over-engineering sometimes paid off. I usually had easier debugging (due to more typechecking), and sometimes it was easier to refactor my code for the “twist” in problem two. Occasionally having a compiled language helped too, although not that often: the AoC problems are designed to be solvable in short run times in Python, even on old hardware.

My edge, the reason that I was competitive, was that I have enough experience that I could usually figure out how to solve a given problem. In later days of the contest, when the problems got harder, I felt that this was an unfair advantage, so I would give my son a hint about what web searches to use to figure out a good way of approaching the problem.

Overall we had fun. Shout out to the excellent r/adventofcode subreddit. Each day, after we solved the puzzles ourselves, we would check in with the subreddit to see other solutions.

Improving my Swift skills

Using Swift to compete in AoC has encouraged me to explore parts of Swift that I hadn’t had a need or a chance to learn before.

  • String processing, and the relationship between String, Substring, and Character.
  • Regular expressions
  • map, reduce, filter, forEach, compactMap and occasionally flatMap.
  • Classic Algorithms
  • SIMD for points.
  • typealias for briefer code.
  • Generics for reusable algorithms.
  • NSCountedSet
  • ArraySlice and the various Ranges.
  • Creating Sets and Dictionaries the functional way rather than imperatively.
  • Identifiable, RawRepresentable, Hashable, CaseIterable, CustomStringConvertible.
  • Sorting
  • Using value types as much as possible.
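
To illustrate a few of the items above (compactMap, reduce, and building Sets and Dictionaries functionally), here is a small sketch using only the standard library. The word list is made up for the example and is not taken from any puzzle.

```swift
let words = ["apple", "avocado", "banana", "blueberry", "cherry"]

// Group words by length without writing an explicit loop.
let byLength = Dictionary(grouping: words, by: { $0.count })

// Count words per first letter; compactMap would drop any empty strings.
let initials = words.compactMap { $0.first }
let countsByInitial = initials.reduce(into: [Character: Int]()) { counts, letter in
    counts[letter, default: 0] += 1
}

// Build a Set of distinct lengths directly from a mapped sequence.
let distinctLengths = Set(words.map { $0.count })

// compactMap keeps only the strings that parse as integers.
let numbers = ["1", "x", "3"].compactMap { Int($0) }
```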

Room for Improvement

Swift is a good language for coding contests. Certainly much less verbose than many other Java-like languages. However, there are still some speed-bumps compared to Python or F#.

  • String processing is verbose.
  • Regular expressions are very verbose.
  • No automated synthesis of Comparable.
  • Tuples can’t conform to Hashable, limiting their use as a general value type.
  • The Swift standard library is missing many useful algorithms and collection classes.
    • These can be added via third party libraries, but that takes time during a contest.
  • The tradeoff between debug builds and release builds.
    • Debug builds are slow.
    • Release builds are difficult to debug.
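
As an example of the Comparable and Hashable speed-bumps, a grid coordinate that could be a plain (x, y) tuple in Python usually ends up as a small struct in Swift. This is only a sketch; the Point type and its row-major ordering are assumptions chosen for illustration.

```swift
// A tuple can't be a Set element or Dictionary key, so wrap it in a struct.
// Hashable and Equatable are synthesized; Comparable must be written by hand.
struct Point: Hashable, Comparable {
    var x: Int
    var y: Int

    // Order by row (y) first, then column (x), using built-in tuple comparison.
    static func < (lhs: Point, rhs: Point) -> Bool {
        (lhs.y, lhs.x) < (rhs.y, rhs.x)
    }
}

var visited: Set<Point> = []
visited.insert(Point(x: 1, y: 2))
visited.insert(Point(x: 1, y: 2)) // duplicate, ignored by the Set

let ordered = [Point(x: 2, y: 0), Point(x: 0, y: 1), Point(x: 1, y: 0)].sorted()
```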

Previous AoC contests

The old Advent of Code contests are all still “live”. You can’t get a timed score, but you can still enter and complete any contest.

I first did Advent of Code last year, but this year in addition to entering the 2020 contest I also completed all the earlier years’ contests. r/adventofcode was a handy resource for this task. There are archived posts for discussing the solutions to each day’s puzzles for every year’s contest.

Blogging from an iPad

After a day’s hacking, I am pleased that I can update my github.io-based blog using an iPad. Here’s how I did it:

  1. I researched the topic, finding some good info on Avery Vine’s post.

  2. I wrote a script, ConvertBlogFromPublishToJekyll.swift, to convert my posts from the markdown flavor used by Publish to the markdown flavor used by Jekyll.

That’s it, there’s no step three. Whenever I want to post to my blog from my iPad, I just edit the sources in Working Copy. Because the blog is backed by a git project, I can also edit the blog from any other device that supports a git client, including a regular Mac or PC.

This works because Jekyll support is built into github.io web pages. Whenever a new commit is made, GitHub’s servers automatically run the Jekyll app to regenerate my blog.

Pro tip: The Working Copy text editor can be switched from “Programming” mode to “Natural” mode. Natural mode provides spell checking.

Picking the Right Tool for the Job

Over the past few months I have been coaching one of my daughters as she writes a data collection application for her model rocketry team.

Her team is competing in a yearly model rocketry contest called The American Rocketry Challenge (TARC). Teams of high school students compete to design model rockets that best meet contest rules. Like a road rally race, the goal is not to build the fastest or highest flying rocket, but rather to build a rocket that can most precisely fly to a given height, with a given flight duration.

In the course of a year, a team typically makes around 30 test flights, carefully modifying their rocket to more closely meet the contest criteria.

My daughter wanted to create an application to enable her team to record the flight data (altitude, flight time, weather, whether the payload egg cracked, etc.) for all their flights. Once collected, she wanted to be able to analyze the data. (Graphing it, computing averages and deviations, and so forth.)

She initially wanted to write a mobile phone app to do this, so I helped her investigate how to do this. We settled on Google Firebase. I helped her write a prototype iOS app using SwiftUI. It worked well, and looked great, but it turned out to have a few drawbacks:

  • Firebase servers are complicated to set up, and can’t easily be cloned. This steered us towards a design where we had one application for multiple contest teams. Once we started down that design path, we ended up with a hierarchical design with “organizations”, that had “teams”, that had “members”. There was an account system with user roles such as “organization administrator”, “team administrator”, “team member”, and so on. Complicated server-side rules and scripts enforced different permissions.

  • Apple’s “Sign in With Apple” product is attractive for users, but is difficult for app developers. Users typically don’t know their Apple account email, which makes it difficult to help them administer their accounts. Working around this required us to implement a complicated invitation system.

  • The app itself was about 1500 lines of Swift code, and we were not enthusiastic about porting it to Android.

We were not sure that we were going to be able to get both the iOS and Android versions of the app finished in time for the fall launch season. Plus we weren’t sure we wanted to deploy an app that required centralized administration.

So in late June we had a re-think. We came up with a simpler solution: Write the app as a Google Sheets spreadsheet. This is a clunkier UI, but it has a number of important benefits:

  • Works on Android, iPhone, and PC.
  • Each team has its own independent spreadsheet.
  • No central administration.
  • Uses the normal Google Docs account system.
  • Avoids the potential of hitting the “free tier” Firebase account limits.
  • Each team can customize the sheet to taste, using easy-to-understand spreadsheet formulas and scripts.
  • Powerful charting and data analysis tools are built in.
  • Allows quick-and-dirty end-user changes to UI.
  • Concentrates on solving the “Data” part of the problem, rather than the “Design” part.

Since she switched technologies, development has gone much more quickly. Partly because the spreadsheet provides so much built-in structure, and partly because it de-scoped all the multi-team and account role related work.

The main drawback of the new tech stack is that the Google Sheets mobile app UI is limited. The new app definitely looks like a spreadsheet app rather than a mobile app. For example, instead of pressing a button to invoke a script, users have to pick a menu item from a cell’s pop-up data validation menu.

But it’s such a relief to be essentially “done” with development of the first version of the app. Now she’s working on creating a web site, writing documentation, recording tutorial videos, running user tests, and all the other work that’s needed to polish and launch the app.

I guess her greatest challenge is waiting to see if next year’s TARC contest happens at all, given the COVID-19 pandemic.

The app’s web site is “Yes it’s Rocket Science”.

Building an image board browser using SwiftUI and Combine

As a hobby project, I’ve been writing an imageboard browser app to learn the SwiftUI and Combine libraries.

SwiftUI and Combine are available on many Apple platforms. So far I’ve gotten my imageboard browser working well on iPhone and iPad, and this weekend I got it working on Apple TV.

The iPhone and iPad run the same “Universal” app, which provides a vertical scrolling list and navigation stack UI that works well for both iPhone:

iPhone Screenshot

and iPad:

iPad Screenshot

The AppleTV app looks and acts quite differently:

AppleTV Screenshot

More …

Dropping a Dynabook: A comic that turned from Science Fiction to Science Fact

Some time around 1982, I saw an amazing comic on the wall of a Xerox Alto computer room at the MIT AI Lab. Given the subject matter, I assume the comic was originally created at Xerox PARC, possibly as part of the NoteTaker project, but can’t find any trace of it on the web. I have recreated it from memory, below.

The comic explains the events that happen when a Dynabook is accidentally dropped off the top of Half Dome in Yosemite National Park. Note that a free-fall calculator claims that it would take over 12 seconds for the Dynabook to hit the Yosemite Valley floor.
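
As a rough sanity check on that timing, the vacuum free-fall formula d = ½gt² gives the drop distance implied by a fall time. The sketch below ignores air resistance entirely, so it is only an approximation; the function names are made up for illustration, and the 12.816 s figure is the final timestamp in the comic's timeline.

```swift
import Foundation

// Standard gravity in m/s^2 (a physical constant, not a value from the comic).
let standardGravity = 9.80665

/// Distance fallen from rest after `seconds`, ignoring air resistance.
func dropDistance(after seconds: Double, gravity: Double = standardGravity) -> Double {
    0.5 * gravity * seconds * seconds
}

/// Time to fall `meters` from rest, ignoring air resistance.
func fallTime(from meters: Double, gravity: Double = standardGravity) -> Double {
    (2 * meters / gravity).squareRoot()
}

let impactTime = 12.816 // seconds, the comic's final timestamp
let impliedDrop = dropDistance(after: impactTime)
print(String(format: "%.0f m", impliedDrop)) // roughly 805 m
```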

Dropped Dynabook Comic

T+00.000 Dynabook accidentally dropped from top of Half Dome.

T+00.016 Dynabook notices that

  • It can’t sense its user.
  • It is in zero gravity.
  • There is a 200 MPH wind from below.

Dynabook concludes that it is falling.

  • Turns off the display to save memory.
  • Opens a radio connection to El Capitan radio tower.
  • Begins backing up the user’s recent changes.

T+05.000 Dynabook hits a glancing blow to the side of Half Dome, breaking 3 of its 6 CPUs. The Dynabook reconfigures itself to continue working with the 3 remaining CPUs.

T+10.000 User data backup finishes.

T+11.000 Dynabook orders the user a replacement Dynabook.

T+12.000 Dynabook turns on an emergency locator beacon.

T+12.816 Dynabook smashes into the rocks at the bottom of Half Dome.


Pro No Mo - I don't really need a MacBook Pro machine for hobby programming.

I’ve been trying to decide which Apple laptop to buy for hobby programming.

I’m leaning towards the cheapest laptop Apple sells, the 2019 MacBook Air. As far as I can tell, it is fine for my current needs.

I specced out a more powerful and future-proof laptop, a 2019 MacBook Pro with 2x the RAM and SSD storage, but it was 60% more expensive.

I think it makes more sense for me to buy the cheaper laptop today, and plan on replacing it sooner. Especially because I have a lot of family members who would be fine with the cheaper laptop as a hand-me-down.

It does feel a little weird to decide that I don’t need a “Pro” machine. When it comes down to it, Xcode, an SSD and a retina display are all the “Pro” I need for hobby programming, and Apple has made those features available in the budget Air line.

Follow-up

In the end I bought the more expensive MacBook Pro. I am a sucker. :-)