There and back again

When I first learned programming, I learned Javascript. For the last 6 months I mostly worked in Java, but now I'm back to working mostly in Javascript. I wanted to jot down a few thoughts now that I've worked extensively in a second programming language, particularly one that's actually very different despite the similarity in name.

I know this is over-generalizing, but I think it's fair to say that there are major cultural differences between Javascript programmers / programs and Java programmers / programs. I say programs, and not just programmers, because I've noticed myself switching mentalities when I switch programming languages.

Javascript

  • Multi-paradigm programming language that started with a focus on functional programming and more recently embraced classical OO programming with classes in ES6.
  • Focused on small scripts in the very early days, but for the last 10+ years it's been used for large-scale programs (e.g. Gmail, Google Maps).
  • More recently, there's been a push for static typing with language extensions / tools like Typescript and Flow. This is largely in response to the increasingly complex apps that people are building in JS.
  • With WebAssembly, Javascript engines are providing a low-level compilation target for systems languages like C++, Rust, etc. This means Javascript will increasingly interop with other languages.

Java

  • Multi-paradigm programming language that started out very focused on OO programming (i.e. everything must be in a class) and more recently embraced functional programming with lambdas in Java 8.
  • Focused on building large-scale systems that can be maintainable / reusable for many years and contexts.
  • The JVM has become a rich target environment for languages like Kotlin, Scala, and Clojure due to the massive investment in the runtime (JVM) and ecosystem (JDK libraries, community libraries).
  • Java has several promising projects to address some of its key limitations:
    • Type inference to minimize type boilerplate (local-variable type inference with var is now available as of Java 10).
    • Record types to eliminate the boilerplate of creating POJOs.
    • Pattern matching to allow more concise, less error-prone patterns (e.g. no need to manually cast after an instanceof check).
    • Fibers to allow very lightweight threads (sounds like goroutines?).
    • Value types, which let you "codes like a class, works like an int". They allow a much more efficient memory layout, which is important for high-performance code. This is the one area where C++ is much better than Java right now, as it's very hard to minimize cache misses in Java, which results in ugly patterns like shredding an object into arrays of primitives.

What's interesting is that when I look at how Javascript and Java have evolved and are still evolving, there are several commonalities:

  • Both languages have huge communities and deep investments in their VM runtimes. Both of these factors made them rich target environments for other languages.
  • Both languages have been incredibly careful about preserving backwards compatibility.
  • Static typing has proven its value for large-scale programs and its ability to enable powerful tooling (e.g. large JS apps are increasingly written in statically typed variants, especially at large tech companies like Google, Microsoft, and Facebook). See the Typescript sketch after this list.
  • On the flip side of static typing, type inference to minimize boilerplate has also been embraced by both. While too much reliance on type inference can obscure code, it's usually quite valuable, esp. when variable names are appropriate.
  • Both languages provide a significant amount of async support (e.g. Promises and async/await in Javascript vs. Futures and fibers in Java). Caveat: fibers in Java are much more about programming "like it's multi-threaded", whereas Javascript has continued to embrace the single-threaded programming model.
  • Both are focused on low-level performance through extensions / major infrastructure projects like WebAssembly for Javascript and Valhalla for Java.
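
To make the typing / inference / async points concrete, here's a small Typescript sketch (the User type and fetchUser endpoint are made up for illustration):

```typescript
// Hypothetical sketch: the endpoint and type are made up for illustration.
interface User {
  id: number;
  name: string;
}

// The return type is inferred as Promise<User>, so callers get full
// type checking without extra annotations.
async function fetchUser(id: number) {
  const response = await fetch(`/api/users/${id}`);
  return (await response.json()) as User;
}

async function greet(id: number) {
  // `user` is inferred as User; a typo like `user.nmae` fails to compile.
  const user = await fetchUser(id);
  console.log(`Hello, ${user.name}`);
}
```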
---

While I'm at it, I thought I might share a few thoughts on programming languages that I haven't used that much (e.g. only here and there when I need it for glue code), but still have opinions on :)

C++

  • C++'s biggest strength and weakness is its compatibility with C. Its initial interop with C (e.g. the original "C with Classes" preprocessor) was a major factor in its early success, and it continues to be useful for interoperating with legacy systems / other languages because C in many contexts is the lowest common denominator (e.g. many languages interop with C).
  • C++, as its creator Bjarne Stroustrup would say, has many barnacles. Like Java and Javascript, C++ has been very serious about maintaining backwards compatibility. To my knowledge, they make very few backwards-incompatible changes, and after 30+ years of existence it's not a surprise that there are many warts in the language. C++ has become a huge language and there are many ways to do the same / similar thing (e.g. char* vs. std::string, enum vs. enum class). Because of this it's easy to do things the "wrong" way or the non-modern way.
  • C++'s motto is zero-overhead abstraction, which means "you don't pay for what you don't use".
  • C++ is valuable in contexts where performance is critical (e.g. browsers, language runtimes, embedded systems, big data, etc.), but it tends to be less productive, IMO, compared to languages like Java / Javascript (this point might be controversial, but at the very least I think you can say that there are more people who can program productively in Java vs. C++).
  • C++ makes it easy to shoot yourself in the foot (e.g. dangling pointers), but it also gives programmers very low-level control of their program which is critical for performance (e.g. avoiding cache misses by avoiding heap allocations).

Python

  • Python was originally intended for small scripts / glue code, but it has evolved to build very large systems (e.g. YouTube). That said, I'm personally skeptical that Python is a great choice for large-scale programs. While Python has added type checking functionality through projects like mypy, based on my brief research it looks a lot less mature than projects like Typescript, which has a fully staffed team at Microsoft with deep language experience (e.g. Anders Hejlsberg, the creator of C#, also created / led the Typescript project).
  • Python doesn't have a great performance story, even compared to other dynamically typed languages. While Javascript got much faster because browsers innovated a lot, Python's VM never had a similar level of investment. IIRC, a lot of Python's core modules are written in C for this reason.
  • Python's biggest advantage is probably in the data science / ML space, where it has a rich ecosystem for those use cases (e.g. IPython for interactively creating data visualizations in notebooks). A lot of its popular libraries like NumPy and SciPy use languages like C / Fortran under the hood to achieve high performance.
  • Personally, I don't like Python that much for anything more than a single file of glue code / shell scripting, mostly because the typing / tooling story is not that great.

Go

  • Go is a relative newcomer and is only ~10 years old. Its initial focus was on building highly scalable cloud services. It's a minimal language and its spec is very small compared to languages like Java and C++.
  • The language is very opinionated and has many fewer features compared to the other languages: for example, there's no inheritance, no exceptions, no generics, etc. Some of these limitations will likely be there forever (e.g. inheritance) while others are being addressed in Go 2 (e.g. generics).
  • In my brief experiment with Go, having no generics led to some very hairy code. I ended up having to use libraries which accepted an empty interface (interface{}) and needed to carefully read the godoc to understand how the interface actually worked. Essentially, the empty interface opts you out of static type checking and it feels unsafe (a la dynamically typed languages).
  • The language / community very much encourages few abstractions, which, on the positive side, makes code pretty readable even for newcomers to the language or the project. The downside (from what I hear) is that it's boilerplate-heavy, particularly with error handling. This will hopefully be addressed in Go 2, but we'll have to see.
  • Go's performance is probably somewhere between Java and C++. Even though it has a garbage collector, you can still allocate objects on the stack instead of the heap, and you have more control over memory layout than in Java.
  • The garbage collector used to be problematic for large-scale users (e.g. Google), but the investments in the GC have led to very small GC pauses (esp. compared to the JVM).
  • In my brief experiment, I found Go very simple to read and write and fairly productive, but it's also quite tedious. The boilerplate with error handling is annoying but was still OK. My biggest issue was the lack of generics, because it inevitably leads to the empty interface in many use cases and you lose the benefits of static types. I think Go has a sweet spot, which is maybe for projects with lots of new contributors (e.g. a large project that's constantly hiring new engineers, or an open-source project), because it's a simple language and essentially forces you to follow a very consistent style guide without excessive abstractions.

---

You can probably tell from my comments that I have favorites when it comes to languages :) 

  1. I like Javascript the most because it has reasonable performance, it allows you to effectively do large-scale programming with static types using Typescript, and it has a very fast iteration cycle (e.g. 1-sec refresh).
  2. I also like Java because its tooling is phenomenal and the performance is good enough for most contexts. Java definitely has a slower iteration cycle than Javascript and tends to have a culture with heavy abstractions, but neither of those are a deal-breaker for me.
  3. C++ is great for high-performance contexts, but I don't really like doing low-level programming where you sweat all the performance details (e.g. avoiding cache misses). In my experience, when I've coded in C++ it's because some other system used C++ (and might have needed the performance win), but my program didn't need or care about the performance win and it felt like overkill to use C++. The tooling in C++, while improving, is still much worse than Java's. My guess is that the language itself is much more complex to parse and analyze (e.g. macros, templates, etc.) than Java.
  4. I'm fine with using Python for doing basic maintenance / infra work that you'd otherwise use a shell script for, but the lack of static types and its relatively worse performance makes it unappealing for me for production use cases.
  5. I have pretty mixed opinions on Go. I don't think I'd look into it again until Go 2 is released, which may be a while, because it feels like an overly restrictive language without a big win. Its concurrency feature, goroutines, is pretty interesting, but I think other languages like Java will catch up in this area, and I don't see it being compelling enough to win over legacy systems / companies entrenched in older languages like Java, which has a much bigger ecosystem.

Event Log Workflow

Why event logs are used by data systems and accountants

Mature data systems use an event log mechanism, which means they track every change.

Let’s consider the canonical example of a bank with a database of its customers’ financial records. A simple implementation of this database would be a row for each customer account and columns for the account id and the account balance. Whenever a transaction occurs, the account balance is adjusted. The information about each transaction, once it’s been accounted for, is promptly discarded to save space.
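
In code, a naive version of this might look like the following Typescript sketch (the names are made up for illustration):

```typescript
// Naive sketch (made-up names): one mutable balance per account,
// and no transaction history is kept.
const balances = new Map<string, number>();

function applyTransaction(accountId: string, amount: number): void {
  const current = balances.get(accountId) ?? 0;
  // Only the new balance survives; the transaction itself is discarded.
  balances.set(accountId, current + amount);
}
```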

This implementation is problematic for several reasons:

  • What if the customer wants to verify the accuracy of the account balance? The simplest way to do this would be to provide a record of each transaction since the inception of the account.

  • What if a credit card was stolen three days ago and the bank wants to undo those recent transactions? It can’t, since none of the transaction information was stored permanently.

  • How do you handle concurrent updates to the same account? If two threads both see $100 as the existing balance and they each try to update the balance to a new number, will you end up losing one of the updates?

An event log is a simple but powerful concept where you store each event permanently and then later generate views based on those events. This is much like how an accountant never “erases” a record when a debt is paid off; the accountant only adds a new entry to the general ledger.

The only two downsides are the storage cost of keeping each event and the computation cost of generating each view. In real-world systems, you may end up compacting old events into a snapshot. Likewise, an accountant might calculate the current balance by looking at last year’s audited account balance and then tallying all the transactions in the current year. In theory, one could generate it by tallying all the transactions since the inception of the audited firm, but that would be excessive work with negligible benefit.
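
Here’s a minimal Typescript sketch of the event log idea, with made-up names and shapes:

```typescript
// Event-log sketch (made-up shapes): every transaction is stored permanently,
// and the balance is just a view derived from the events.
interface Transaction {
  accountId: string;
  amount: number; // positive for deposits, negative for withdrawals
  timestamp: number;
}

const log: Transaction[] = [];

function recordTransaction(tx: Transaction): void {
  log.push(tx); // append-only: nothing is ever updated or deleted
}

// Generate the "account balance" view by replaying the log.
// A snapshot would simply cache this sum up to some point in time.
function balanceOf(accountId: string): number {
  return log
    .filter((tx) => tx.accountId === accountId)
    .reduce((sum, tx) => sum + tx.amount, 0);
}
```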

Using an event log to “Get Things Done” (GTD)

We can apply the same event log methodology to keep track of everyday activities, to maximize productivity and ensure correctness.

The actual category names aren't important but I'll tell you mine to make this description more concrete.

The first group is called in progress and these are the tasks that I’m working on right now. Typically I'm only going to have one task that I’m working on at any given point. Sometimes, I might be waiting for someone else before I finish a task (e.g. getting a code review) and in that case it's okay to have two tasks in progress. The goal is to limit the cognitive load by focusing your attention on one thing and doing one thing well: you might recognize this as the Unix philosophy.

The second group is called upcoming - this is the work I haven't done yet. Each of these tasks should be discretely defined with a clear definition of what “done” means. And if I can't precisely define the task yet, that’s the first sub-task that I will do for that task. The goal is to minimize the number of large, hairy tasks that I might feel inertia about starting, and instead have many small tasks that I can each accomplish in a day or so.

The last group is called completed and these are all the tasks that I’ve finished previously.  

What's important to note here is that I never delete a task. The lifecycle of a task is that it starts in the upcoming group and then it goes to the in progress group and then finally it lands in the completed group.

If one of the tasks that I'm working on isn't needed anymore, I won’t actually delete it. Instead I’ll move it to the completed group and mark it with the keyword “skip”. If I find out a month later that I actually do need to do the task, I haven’t lost any information.
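
Just to make the invariants concrete, here’s a toy Typescript sketch of that lifecycle (in practice I keep this in a doc, not in code):

```typescript
// Toy sketch of the task lifecycle: tasks only move forward and are never deleted.
type TaskStatus = "upcoming" | "in progress" | "completed";

interface Task {
  title: string;
  status: TaskStatus;
  skipped?: boolean; // completed without actually being done, marked "skip"
}

function skipTask(task: Task): Task {
  // Instead of deleting, move the task to completed and flag it as skipped.
  return { ...task, status: "completed", skipped: true };
}
```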

Lastly, I like doing this in a Google doc; it keeps things really simple because there's no fancy UI to distract me. Furthermore, everything is tracked in the revision history.


Front-end Architecture

Outline:

  • Import "concepts" not "implementations" - encourage a pattern where people import a generic Component interface
  • Encourage composition through decorator and mix-in patterns
  • Type safety as a first-class concern
  • Prefer fractal architecture (e.g. a big component is composed of smaller components)
  • Web standards over proprietary standards (e.g. use the normal DOM interface, make it compatible with web components)
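
Here's a rough Typescript sketch of the first two points; the Component interface and class names are made up, not from any real library:

```typescript
// Hypothetical generic Component interface: consumers import this concept,
// not a concrete implementation.
interface Component {
  render(container: HTMLElement): void;
}

// A small concrete implementation.
class Greeting implements Component {
  constructor(private name: string) {}
  render(container: HTMLElement): void {
    container.textContent = `Hello, ${this.name}`;
  }
}

// Composition via a decorator: wraps any Component without subclassing it.
class Bordered implements Component {
  constructor(private inner: Component) {}
  render(container: HTMLElement): void {
    container.style.border = "1px solid black";
    this.inner.render(container);
  }
}

// A "fractal" composition: a bigger component built from smaller ones,
// rendered against the standard DOM interface.
const widget: Component = new Bordered(new Greeting("world"));
widget.render(document.body);
```

The point is that callers depend only on the Component concept, and bigger components are just compositions of smaller ones.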

Review of "Large-scale Automated Visual Testing"

Just watched a video from Google's 2015 testing conference: a talk called "Large-scale Automated Visual Testing". It's an incredibly insightful talk by a cofounder of Applitools, a SaaS provider for visual diff testing.

I had heard of Applitools before when I was researching various visual diff tools for my team at work, and I was initially wary that the talk would be an extended infomercial for Applitools' product. My concern was quickly proven wrong. It's an incredibly informative talk filled with numerous examples and demos that demonstrate various tips the speaker has for doing visual testing in an efficient and effective manner. I was actually blown away by the demos of Applitools and how effective they were at identifying "structural changes", that is, substantive changes to a website / app, while being able to ignore minor differences between browsers or dynamic content (e.g. article blurbs that change each day).

I'm looking forward to trying out the free plan and seeing if we can incorporate Applitools into our team's continuous delivery workflow.

Data normalization

Data normalization is one of those terms that I've been intimidated by for a while. My initial reaction was that it's about making the data "normal", i.e. standardized, so you can't have some rows where the date is a timestamp (e.g. 14141231) and others where it's a string (e.g. "January 23, 2015"). I think that initial intuition was on the right track, but data normalization seems to be more focused on making sure no particular piece of data is stored in more than one place. Essentially, if I can boil it down, data normalization is about having a "single source of truth" for any given piece of information (e.g. Bill Clinton's date of birth).
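
For example, here's a made-up Typescript-flavored sketch of the same data before and after normalization:

```typescript
// Denormalized: the date of birth is repeated on every row, so a correction
// would have to be applied in multiple places.
const speechesDenormalized = [
  { speaker: "Bill Clinton", dateOfBirth: "1946-08-19", title: "Speech A" },
  { speaker: "Bill Clinton", dateOfBirth: "1946-08-19", title: "Speech B" },
];

// Normalized: the date of birth lives in exactly one place (single source of
// truth), and other records reference the person by id.
const people = {
  1: { name: "Bill Clinton", dateOfBirth: "1946-08-19" },
};
const speeches = [
  { speakerId: 1, title: "Speech A" },
  { speakerId: 1, title: "Speech B" },
];
```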

There are three common forms of data normalization that build on each other, with the second being more strict than the first, and so on. The examples in the Wikipedia pages were actually very easy to understand and I highly recommend skimming through the pages and reading through the examples:

https://en.wikipedia.org/wiki/Database_normalization

https://en.wikipedia.org/wiki/First_normal_form

https://en.wikipedia.org/wiki/Second_normal_form

https://en.wikipedia.org/wiki/Third_normal_form

I initially got interested in what "normalization" meant when Dan Abramov mentioned his library normalizr, which normalizes nested JSON data.
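
The rough idea, as a hand-rolled Typescript sketch (not normalizr's actual API), looks like this:

```typescript
// Hand-rolled sketch of the idea (not normalizr's actual API): flatten nested
// JSON so each entity is stored once, keyed by id.
interface Author { id: number; name: string; }
interface Post { id: number; title: string; author: Author; }

// Nested API response: the same author object is embedded under many posts.
const posts: Post[] = [
  { id: 10, title: "First post", author: { id: 1, name: "Ada" } },
  { id: 11, title: "Second post", author: { id: 1, name: "Ada" } },
];

const normalized = {
  authors: {} as Record<number, Author>,
  posts: {} as Record<number, { id: number; title: string; authorId: number }>,
};

for (const post of posts) {
  normalized.authors[post.author.id] = post.author;
  normalized.posts[post.id] = { id: post.id, title: post.title, authorId: post.author.id };
}
```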

As a business analyst in my last job, I think this notion of de-duplicating data is second nature and storing the same piece of information in multiple places is the bane of any data analyst managing a complex Excel workbook. For example, sometimes we had to build an Excel model really quickly and take some shortcuts. Later when our boss would ask us "what would be the impact if factors A and B were adjusted by 5%?", it wouldn't be as simple as changing a single cell in one place. The difficulty would be in remembering all the places where you would need to manually update the data. Of course, as you get better at Excel modeling, you would utilize cell references as much as possible, and try to consolidate all the various inputs ("levers" in consulting-speak) in one area, ideally the first worksheet of an Excel file.