1 YEAR OF TEACHING MYSELF COMPUTER SCIENCE

War Stories: What Are We Doing Here?

In the winter of 2023, I came to a humbling realization.

While I had a firm foundation in the mathematics behind data science, I didn't understand any of the software/hardware that enabled me to conduct data science. This made me think...

What the f*** is data science?

The “field of data science” itself is still very young and is still figuring out its identity. It lies in a weird limbo between computer science, mathematics, statistics, and some domain of application. Honestly, this awkwardness made me shy away from studying “Data Science”. I focused my energy on theory instead, spending my time at university studying applied mathematics and statistics, which I do not regret. However, as I left university and began work, I soon realized that I didn’t understand how to effectively use and optimize software/hardware for data mining and analysis. I thought that all I needed to know was the theory behind these models; understanding the tooling was someone else’s problem, and I adopted an “if it ain’t broke, don’t fix it” mentality. Ultimately, I realized that understanding it is my problem. Here are a few stories:

Version Control: My team used Dropbox as a version control system. If you have ever used git, you’re probably thinking “why would you ever put yourself through that?”. At the time, I didn’t know any better, and my unfortunate experience made me appreciate git even more. Suppose that two people (delightfully named person A and person B) both open a file like a Python script. Say that person A works on that script for 2-3 days and saves the file. Person A expects those changes to be there when they reopen the file. However, if person B opened the file before person A saved it, person B is holding a stale (“dirty”) copy of the Python script. As such, when person B saves the file, person A’s 2-3 days of work all go bye bye. The reality was that I was person A. Several times. Again, there’s a reason why nearly everyone uses git.
Data Processing Libraries: My team was primarily using pandas, a data processing library in Python. When I had to run complex queries on datasets ranging from 50GB to 300GB, I ran into deep performance problems. These queries would take hours, and I couldn’t figure out why. A friend told me to try out polars, a dataframe interface suited for OLAP. I didn’t know what any of that meant when I used the tool but, when I wrote some scripts with polars, my complex queries now took minutes. In my mind, polars query fast, go brrrr, me happy. The lingering question though was why polars go brrr?
Packages, Virtual Environments, Shell: I didn’t understand brew at all. Heck, I didn’t even know what a package manager was or why we needed one. All I knew was that I needed to import a bunch of things that are useful for data science (pandas, numpy, pytorch, scikit-learn, os, matplotlib). This led to issues with dependencies, especially when sharing scripts with my peers. I didn’t understand how to leverage virtual environments or environment variables for data science. A great example is API key management. Since we used Dropbox (and not GitHub), I never pushed an API key to a repository (thank god). However, this never should have been a risk in the first place, and it wouldn’t have been if I had learned how to properly set up environment variables and manage my shell config.
General Systems Concepts: My team needed standardized data from thousands of SEC filings in a single database. As such, I was tasked with developing a pipeline from scratch. Ultimately, I decided to use large language models (GPT-4 Turbo at the time), but I initially did not know how to process these filings while minimizing token usage, ensuring persistent storage, and keeping the whole thing efficient. For brevity, I will only focus on the LLM processing step, since the other stages of the pipeline (like string processing with regex and consistency checks) aren’t relevant to the story. With my naïve approach, I ran a script that literally iterated through each SEC filing and standardized the text. If there was a failure, too bad; I just skipped that entry with a generic error handler:
```
try:
    ...  # some processing
except Exception as e:
    # This was, unfortunately, the extent of my error handling.
    # Maybe I would also print(e)
    print("something went wrong")
```
Eventually, I made some optimizations like memoization, multithreading, better error handling, and so on. However, I still remember the nights in my apartment where I left my laptop lid half-open to keep my query running and would occasionally check the progress of my 12-24 hour scripts.
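For a rough idea of what those optimizations can look like together, here is a minimal sketch. It is not the pipeline I actually ran: `standardize` stands in for whatever function makes the LLM call, `TransientError` is a made-up exception type for failures worth retrying, and the JSON cache file is an assumption about how results get persisted. Filing IDs are assumed to be strings so they can serve as JSON keys.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. rate limits)."""


def load_cache(path):
    """Load previously saved results so a restart skips finished filings."""
    path = Path(path)
    return json.loads(path.read_text()) if path.exists() else {}


def with_retries(fn, arg, attempts=3, base_delay=1.0):
    """Call fn(arg), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn(arg)
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller record the failure
            time.sleep(base_delay * 2 ** attempt)


def run_pipeline(filings, standardize, cache_path, max_workers=8):
    """Standardize filings concurrently; return (results, failures)."""
    results = load_cache(cache_path)  # memoized output from earlier runs
    failures = {}
    pending = {fid: text for fid, text in filings.items() if fid not in results}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(with_retries, standardize, text): fid
            for fid, text in pending.items()
        }
        for future in as_completed(futures):
            fid = futures[future]
            try:
                results[fid] = future.result()
            except Exception as exc:
                # Record which filing failed and why instead of silently skipping it.
                failures[fid] = repr(exc)
            else:
                # Persist after each success so a crash loses only in-flight work.
                Path(cache_path).write_text(json.dumps(results))
    return results, failures
```

Threads fit here because the work is dominated by waiting on network calls rather than CPU. Rewriting the whole cache file after every success is wasteful at scale; an append-only log or a small SQLite table would be the natural next step.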

These are just a few of my experiences. The point is I soon realized that I could not avoid the gigantic Snorlax blocking the bridge; I needed to find that PokeFlute. Now it’s been one year, and while I've learned a lot, I'm still nowhere near where I want to be. Still, I hope to recount my experiences from the past year, outline the resources that I used, and share what I plan on doing this year!

One Year of Fun

To start off, I needed a frame of reference. The most helpful resource that I found was Oz Nova’s teachyourselfcs. While I did not use all the resources that the website provided, I used his “areas of computer science” as a guidepost of what I should know. Additionally, I had already developed a solid foundation in discrete mathematics and machine learning/deep learning through school, so I spent much of my time on the other fields of computer science.
An additional logistical note: I have generally excluded links for courses since they may be deprecated. However, the general strategy that I used to find lectures is as follows. If the course has public lecture recordings, there is no additional work to be done! If not, a “COVID” version of the course will usually have lecture recordings. The same applies to projects/labs and homework, which are more likely to be publicly available. I want to emphasize that most of my learning was done through projects/labs, so make sure you do them. I would also recommend choosing the best method for you! I learn best through reading and doing, so the textbook and project pair was ideal for me. As for textbooks, I would recommend using a fork of [0x6c,0x69,0x62,0x67,0x65,0x6e,0x0a] (where the string is UTF-8 encoded).

With that sorted out, here we go! I started with introductory computer architecture/systems. During this time, I primarily used two class resources…

MIT 6.1800 (formerly 6.033) – Computer Systems: a great course for understanding most areas of computer systems like databases, distributed systems, operating systems, computer networking, and security.
UC Berkeley’s CS61C – Great Ideas in Computer Architecture: a necessary course if you want to understand how your computer actually works. After the course, I could understand how people built a computer in Minecraft or in Terraria.

The following textbooks also proved useful:
Principles of Computer System Design: An Introduction
Computer Organization and Design RISC-V Edition
Computer Systems: A Programmer's Perspective, which also comes with labs

It was around this time that I switched from VSCode to Vim. In the end, a text editor is a text editor, but I really have been enjoying using Vim for all things programming. In the summer, I decided to delve deeper into operating systems and distributed systems. To this end, I used the following references…

MIT 6.1810 – Operating System Engineering: I really enjoyed this course as the labs help you understand the internals of an operating system by hacking on xv6.
MIT 6.5840 – Distributed Systems: To be honest, I jumped into this course way too soon. I did end up completing all the labs and lectures, but I think I could’ve saved this for after database systems and computer networking.
MIT 6.5660 – Computer Systems Security: in hindsight, I also jumped into this course too soon. I did half of the course and then decided to focus my energy on other things. Granted, I will be reexploring this course later this year.

I read one textbook for operating systems, which was Operating Systems: Principles and Practice, as well as a bunch of papers on distributed systems from A Distributed Systems Reading List. Then, I finished up 2024 by focusing on computer networking via UC Berkeley’s CS 168 – Introduction to the Internet.

What I’m Doing Now

As for learning, I have been focusing my energy on going through database systems. Specifically, Professor Andy Pavlo from CMU has a bunch of great, publicly available resources for database systems.

I have also been working through compilers (here is my current implementation of the Lox programming language in Zig).

Future Plans

Again, this is nowhere near where I want to be, and I consider this past year a first step in a lifelong journey of continuous learning. While I am focusing on database systems and compilers right now, I know that I want to explore additional topics this year: reinforcement learning, data compression/coding theory, advanced computer architecture, “graduate level” statistical inference, and machine learning. While this whole endeavor was sparked by my frustration at not understanding the software/hardware that enables productive data science, I have grown deeply interested in the world of computer systems and theoretical computer science, and I hope you stick with me for the ride!