compute_total_loss

You’re here because you can’t get the unit tests for the Coursera Improving Deep Neural Networks course, Week 3, exercise number 1, to pass. You are trying to correctly implement

def compute_total_loss(logits, labels):

And the unit test keeps failing because the computed loss is off by a large amount (i.e. not the usual “shapes are incompatible” error).

Sorry, this post isn’t going to just give you the answer.

But, since Coursera (instructions and code) and Tensorflow (documentation) have come together to make an incredibly frustrating riddle, this will provide hints on where they went wrong, and thus where you went wrong.

Another guess on why you’re here: you entered “tf.keras.losses.categorical_crossentropy example” into a search engine, followed the link to tf.keras.losses.CategoricalCrossentropy, and that page shows the first constructor argument as “from_logits=False”. And you just figured that was close enough.

Yet another guess on why you’re here: you read the instructions – “It’s important to note that the ‘y_pred’ and ‘y_true’ inputs of tf.keras.losses.categorical_crossentropy are expected to be of shape (number of examples, num_classes)” – and you noticed the code gives you “logits” and “labels”, in that order. So Coursera has said “y_pred/y_true”, then said “logits/labels”, and the first documentation link doesn’t even list those first two arguments. But even with all of these naming mismatches, you still entered the arguments in the order given (implied?) by the instructions.

Final clue: use the link provided by the instructions – tf.keras.losses.categorical_crossentropy – and pay attention to the order of the arguments.
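
To make that clue concrete without spelling out the exercise solution, here is a small, self-contained illustration (made-up tensors, nothing from the assignment) of why argument order matters: swapping y_true and y_pred raises no shape error, it just quietly computes a different number.

import tensorflow as tf

# Documented order: tf.keras.losses.categorical_crossentropy(y_true, y_pred, from_logits=False, ...)
y_true = tf.constant([[0.0, 1.0, 0.0]])   # one-hot ground-truth label
y_pred = tf.constant([[1.0, 3.0, 0.2]])   # raw scores (logits)

# Ground truth first, and from_logits=True because y_pred holds unscaled scores
good = tf.keras.losses.categorical_crossentropy(y_true, y_pred, from_logits=True)
print(good.numpy())   # small loss -- the largest logit lines up with the label

# Swapped arguments still run (the shapes match) but produce a much larger, meaningless loss
bad = tf.keras.losses.categorical_crossentropy(y_pred, y_true, from_logits=True)
print(bad.numpy())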

Then, channel Jeff Atwood on the “two hard things in computer science” (cache invalidation and naming things) and write your own post about how poor names can cause a ton of problems…

Posted in Machine Learning | Comments Off on compute_total_loss

Teraflops Comparison

Documenting various GPU hardware.

Name | TFLOPS (single precision) | TFLOPS (tensor perf, FP16) | TFLOPS (FP16-sparse) | Tensor cores | CUDA cores | RAM
RTX 3080Ti | 34.1 | 136 | 273 | 320 | 10,240 | 12 GB
V100 (specs) | 14 | 112 | | 640 | 5,120 | 16 GB
RTX 3070 | 20.31 | | | 184 | 5,888 | 8 GB
GTX 1080 | 8.8 | | | | 2,560 | 8 GB
PS5 | 10.3 | | | | |
Xbox X | 12.1 | | | | |
PS4 | 1.8 | | | | |
Xbox One | 1.4 | | | | |
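
As a rough sanity check on the single-precision column: theoretical FP32 throughput is approximately CUDA cores x boost clock x 2 FLOPs per cycle (one fused multiply-add). A minimal sketch, assuming the published boost clocks (roughly 1.665 GHz for the 3080Ti and 1.38 GHz for the PCIe V100):

def fp32_tflops(cuda_cores, boost_ghz):
    # cores * GHz * 2 FLOPs/cycle (fused multiply-add), reported in TFLOPS
    return cuda_cores * boost_ghz * 2 / 1000

print(fp32_tflops(10240, 1.665))   # RTX 3080Ti -> ~34.1
print(fp32_tflops(5120, 1.38))     # V100 (PCIe) -> ~14.1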

Note that the V100 is used in the AWS p3.2xlarge instance type. The V100 numbers are in general smaller than the 3080Ti’s, and with the WSL2 tensorflow 2.12 libraries the 3080Ti outperforms the V100 on the 50,000 epoch test, 736 seconds to 928 seconds (here the 3080Ti is 26% faster). (Caveat: an extremely small test set, only my ml-style-transfer code.)

(Using the “Windows Native tensorflow 2.11” libraries, the V100 outperformed the 3080Ti on the 50,000 epoch test, 928 seconds to 1,063 seconds; here the V100 is 12% faster.)

It looks like the p3.2xlarge has been around since late 2017. It started at $3.06/hour, and is still the same price today (2023/Apr). The V100 price seems to have dropped from $6,000 in 2019 to $3,500 today.

Node Replacement Factor (NRF) – nvidia documentation

Posted in Uncategorized | Comments Off on Teraflops Comparison

AWS EC2 GPU instance comparison

These are the results from running the ml-style-transfer project on three different AWS EC2 instance types.

Instance Name | Cost/hour | 250 epochs | 2,500 epochs | 50,000 epochs
p3.2xlarge | $3.06 | 14s ($0.0119) | 55s ($0.0468) | 928s ($0.7888)
t2.large | $0.093 | 849s ($0.0219) | 14,676s ($0.3791) | 293,520s extrapolated ($7.5826)
c5.4xlarge | $0.68 | 221s ($0.0417) | 2,152s ($0.4065) | 43,040s extrapolated ($8.1298)
Comparing p3.2xlarge GPU with t2.large non-GPU

The p3.2xlarge is 32x more expensive per hour. Yet, for the 2,500 epoch tests, it is 266x faster, which combines to be 8x more cost-effective (e.g. $0.05 versus $0.40 for 2,500 epochs).
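
The cost columns in the table are just wall-clock seconds converted to hours and multiplied by the on-demand rate. A minimal sketch of that arithmetic, using two rows from the table above:

def run_cost(seconds, dollars_per_hour):
    # cost of one run, billed at the on-demand hourly rate
    return seconds / 3600 * dollars_per_hour

print(run_cost(55, 3.06))       # p3.2xlarge, 2,500 epochs -> ~$0.047
print(run_cost(14676, 0.093))   # t2.large, 2,500 epochs -> ~$0.379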

The p3.2xlarge also gets results faster (wall clock) – the 50,000 epoch run on p3.2xlarge only took 15 minutes wall clock, yet the projected run on t2.large is 4,892 minutes (over 3 days).

The c5.4xlarge is a “compute optimized” instance type (many vCPUs, lots of RAM). It is 7x the hourly price of the t2.large and, on the 2,500 epoch test, delivers about that much wall-clock improvement; so, basically the same cost, but 7x faster delivery of results.

Note: the total test time was approximately 40 minutes on the p3.2xlarge ($2.04) and 280 minutes on the t2.large ($0.44). But, for a mere 5x total cost difference, the p3.2xlarge performed a 50,000 epoch run that would have taken forever on the t2.large.

In contrast, on the RTX 3080Ti purchased for $900, the 2,500 epoch run took 16 seconds (barely slower than the current record holder, the p3.2xlarge, at 14 seconds).

Next up is to upgrade my AWS account to allow “spot” pricing for p3.2xlarge, if Amazon will allow it for my non-commercial account. The $3.06 on-demand price seems to drop to about $1.01 for spot instances. Update (1 day later): “We have approved and processed your service quota increase request”. So my EC2 -> Limits -> “All P Spot Instance Requests” now says “8 vCPUs”, which is enough for one p3.2xlarge instance.

Just FYI: when that AMI is run on a non-GPU instance type, the first run takes an extra 5+ minutes, as the system reports “Matplotlib is building the font cache; this may take a moment.” It only does this on the first run. (This is an example of something that causes a difference between “python elapsed time” and “wall clock elapsed time”.)

Posted in Hardware, Machine Learning | Comments Off on AWS EC2 GPU instance comparison

Read, Do, Aha

This is a variation of “Be, Do, Have” (“Be, Do, Have” == Be a photographer, Do take a bunch of photos, and then Have/Buy expensive equipment). It records my recent epiphany in Machine Learning. This variation is “Read, Do, Aha”.

I have read a lot of material on machine learning (linear regression, logistic regression, neural nets, supervised learning, CNNs, etc.). I have done some certifications (mainly from Coursera, from Andrew Ng, in the Machine Learning Specialization).

I have already completed the Convolutional Neural Networks course (the fourth of the five courses in the Deep Learning Specialization) and recently decided to “go back” and start with the first course in the specialization, Neural Networks and Deep Learning.

In that course, one of the exercises is to build a simple neural net (simple = two layers total, no hidden layers). So simple it is equivalent to a Logistic Regression model. A very basic plot of a Logistic Regression model is:

In the example above, two input features (plotted as x and y) are given a binary classification (blue dot or orange dot), then a single linear decision boundary is computed and the graph is shaded into blue and orange areas. Thereafter, you can use that line to test the accuracy on known (training) data, and to make predictions on never-seen-before (test) data. In this example, the training accuracy is quite good, and it seems safe to say the test accuracy would be quite good as well.

The course uses “Cat or not-Cat” as the binary class definitions, and flattens a 64x64x3 image into an input of 12,288 values. After everything is implemented and run against the training data, you have a hyperplane in 12,288-dimensional space that divides the input space into Cat/not-Cat.
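
As a minimal sketch of what that flattening plus no-hidden-layer model looks like (illustrative NumPy only, with made-up data and untrained weights, not the course’s exact array layout):

import numpy as np

m = 10                               # a hypothetical batch of m images
X = np.random.rand(m, 64, 64, 3)     # 64x64 RGB images, values in [0, 1]

X_flat = X.reshape(m, -1)            # each image becomes a row of 12,288 raw pixel values

w = np.zeros((X_flat.shape[1], 1))   # one weight per pixel value -- no hidden layers
b = 0.0
z = X_flat @ w + b
p_cat = 1 / (1 + np.exp(-z))         # sigmoid gives P(cat) for each image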

The “Aha!” moment looks like this:

  1. Fact: It can be trained to 99% accuracy to correctly separate the Cats from the not-Cats, in a very small amount of time/iterations.
  2. Fact: It has terrible accuracy (70%) on the test set (aka never-seen-before data). Just guessing would give 50% accuracy if the test set were split 50-50; i.e. the hyperplane is terribly overfit.

  3. Aha! This makes sense, because you tried to use 12,288 individual pixels to decide between Cat/not-Cat (i.e. there were no hidden layers, no “build-up of knowledge”, no “higher-level constructs”; only individual pixel values were used). Looked at from that perspective, it is amazing that it can get 99% accuracy on the training data, and completely unamazing that it doesn’t generalize to the test data.
Posted in Machine Learning | Comments Off on Read, Do, Aha

Another Dell SFF

The goal of this machine was, well, it was basically too inexpensive to pass up. It was refurbished, from SJ Computers LLC. It arrived in a box that used the standard “conform to interior” expanding foam, and was very well packed. The CPU was “First Seen” in 2013.

Note that Windows 10 Pro OEM itself sells for approximately $145.

Some facts on the CPU: the Core i5-4570 @ 3.2GHz is currently #1390 on PassMark [5,220] and is an LGA1150 part at 84W. It is a 4-core/4-thread model with a turbo speed of 3.6GHz. It supports a maximum of 32 GB of DDR3 RAM.

All product links are from the actual vendor.

Item | Product | Cost
System | Dell OptiPlex 7020 SFF Computer | $138
CPU | Intel Core i5-4570 3.2GHz | incl.
RAM | 32GB (2x 16GB) DDR3 DRAM | incl.
Motherboard | | incl.
Power Supply | | incl.
Video | Intel graphics | incl.
DVD/CD | Thin form factor DVD | incl.
Case | | incl.
SSD Drive | 512GB SSD (TIMETEC brand) | incl.
HDD Drive | |
Monitor | |
Wi-Fi | USB wi-fi adapter | incl.
Keyboard/Mouse | Dell wired keyboard; Dell optical mouse | incl.
OS | Windows 10 Pro | incl.
Software | |
Total | | $138
Posted in Computer Builds | Comments Off on Another Dell SFF

Ubuntu 22.04 VNC stops working

My Ubuntu 22.04 virtual machine has been constantly turning off (crashing?) the VNC server. This is the VNC server that is built-in under Settings -> Sharing -> Remote Desktop -> Enable Legacy VNC Protocol.

One way to restart the VNC server is to uncheck the box, then re-check the box.

However, since that has to be done with a GUI, it is really annoying.

So, here is the script that will stop and then restart the VNC server:

#!/usr/bin/env bash
# Restart the built-in GNOME Remote Desktop VNC server
# (grdctl is the gnome-remote-desktop control utility)

grdctl vnc disable
sleep 5
grdctl vnc enable

Using this script, once your VNC server crashes, you can ssh into the machine, run the script from the command line, and then re-establish your VNC session.

Another tip: be very careful when closing the “Remote Desktop” dialog, because the “X” that closes it sits right over the red sliding “Sharing” on/off control; an accidental second click will disable all Sharing.

Posted in Ubuntu | Comments Off on Ubuntu 22.04 VNC stops working

Kaggle, JSON, Python, and pandas

While looking at various datasets from kaggle to do some experiments with Python and graph visualization with networkx, the arXiv dataset caught my attention. It has a (relatively) simple schema – authors, papers, and categories. It is also huge – there are 2,219,423 lines/records in its 3.5GB uncompressed .json file (1.2GB compressed).

The format of that .json file is also a practical example of an issue with JSON – you can’t start with a valid .json file, then add/write/append another record, and end up with valid JSON. (The CSV format does allow appending without rewriting the entire file, but CSV has many other issues that make it unattractive for most cases.)

There is a (semi-formal?) standard for the format used by arXiv (even though it is never mentioned by name) – “JSONL” or “JSON Lines”. See this 2022 entry from atatus.com and this jsonlines.org summary. The vocabulary is fun: “Each line has a valid JSON value”. Other vocabulary: “Line-delimited JSON”, “Newline-delimited JSON”, and “Concatenated JSON”. The suggested extension is “.jsonl”.

Tool support for JSONL seems to be hit-or-miss. jq supports JSONL (surprise!). The Python standard-library json module does not (it throws a JSONDecodeError: “Extra data: line 2 column 1 (char 1689)”). As the title says, this post is about pandas.
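
To make the json behavior concrete, a tiny illustration (a made-up two-record JSONL string, not the arXiv file):

import json

jsonl = '{"id": 1}\n{"id": 2}\n'

# Parsing the whole blob fails: JSONL is not a single JSON value
try:
    json.loads(jsonl)
except json.JSONDecodeError as err:
    print(err)                   # Extra data: line 2 column 1 ...

# Parsing one line at a time works fine
records = [json.loads(line) for line in jsonl.splitlines()]
print(records)                   # [{'id': 1}, {'id': 2}]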

Based on many of the code examples, you could easily conclude that pandas does not support JSONL. For a naive implementation, see Arxiv Data analysis (it loads every object into an array, then creates the DataFrame with pd.DataFrame.from_records(array)). The people who run this naive code have either (a) trimmed the number of lines in their input .json to something reasonable, like 100k lines, or (b) access to machines with 24GB+ of RAM in order to hold two copies in memory at the same time.

The GitHub gist that inspired this post, cj2001/load_arxiv_data.py, builds the entire “metadata” array, with a limit on the number of lines/objects created. It does limit the number of lines, but it is still “the hard way”.

For a slightly more sophisticated implementation, see Scientific Article Recommendation – see In[2] extract_data with “yield line” and In[3] fetch_n_records, which uses a list comprehension to load the file into memory:

return [json.loads(record) for record in islice(data_gen, chunksize)]

In conclusion:

The correct pattern is to use pd.read_json(filename, lines=True, orient='records', nrows=nnn) (as can be seen in ArXiv). Note that leaving off the nrows= argument can create an out-of-memory situation even when nrows=<a huge value> works just fine (e.g. for arXiv, nrows=5000000 works, since there are only 2,219,423 lines, but leaving off nrows crashed with the OOM killer).

So, add pandas.read_json() to the list of tools that support JSONL natively:

df = pd.read_json("arxiv-metadata-oai-snapshot.json",
                  nrows = 5000000,
                  orient='records',
                  lines=True )
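
If the file is too large to hold as a single DataFrame at all, read_json can also stream it in chunks (lines=True plus chunksize returns an iterator of DataFrames instead of one big frame). A minimal sketch, reusing the same file name; the chunk size is an arbitrary choice:

import pandas as pd

total = 0
with pd.read_json("arxiv-metadata-oai-snapshot.json",
                  lines=True, chunksize=100_000) as reader:
    for chunk in reader:         # each chunk is a DataFrame of up to 100,000 records
        total += len(chunk)
print(total)                     # 2,219,423 for the arXiv snapshot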

Posted in Machine Learning | Comments Off on Kaggle, JSON, Python, and pandas

Vue.js 2 versus 3

As of this post’s creation, Vue.js version 3 is available and is “being promoted” over version 2. For example, a stock “npm install -g @vue/cli” install will use Vue 3.2.41 by default.

Don’t use version 3 at this time.

The reason: Vue.js v3.0 was released 2020/Sep, but Vue.js v3.2 was released 2021/Aug, and there is some sort of “architecture war” (negative spin) or “massive improvement” (positive spin) happening with v3.2.

Specifically, v3.2 introduced the <script setup> coding style, which results in “… difference in module execution semantics”. The documentation and examples for <script setup> are very thin, and typically seem to assume you already know how to use it while it is being explained.

Just stick with Vue.js 2 (v2.5 is what I’m using, in “compatibility mode” with “vue init” from the Vue CLI tool suite). After my application is fully functional, I’ll try to upgrade to v2.7 (which is EOL 2023/Dec); hopefully a year from now v3.2 will have caught up in its documentation.

Posted in Software Engineering, Software Project | Comments Off on Vue.js 2 versus 3

Benchmark Disks NVMe vs SSD SATA vs HDD SATA

Benchmarked with Samsung Magician

Device | Seq Read (MB/s) | Seq Write (MB/s) | Random Read (IOPS) | Random Write (IOPS)
Samsung 980 Pro 1TB NVMe | 6,920 | 5,174 | 984,130 | 297,607
Samsung 980 Pro 2TB NVMe | 7,007 | 5,240 | 1,053,222 | 924,072
Samsung 980 Pro 2TB NVMe (PCIe 4.0 x2 mode) | 3,561 | 3,514 | 852,294 | 835,205
Samsung 850 Pro 256GB SATA | 560 | 526 | 94,970 | 81,298
WDC 2003FZEX 2TB SATA | 111 | 115 | 238 | 244
Samsung 970 Pro 512GB NVMe | 3,559 | 2,299 | 362,304 | 364,476
Intel 660P 2TB NVMe | 1,839 | 1,938 | 234,130 | 226,806
Samsung 860 Evo 1TB SATA | 561 | 531 | 97,167 | 86,669
WD Red WD40EFRX 4TB SATA | 146 | 140 | 244 | 488*

Note: the two 980 Pro 2TB rows show an important difference: PCIe 4.0 x4 mode versus PCIe 4.0 x2 mode. The 980 Pro 1TB in the first row is also plugged into an x4 slot.

Posted in Computer Builds, Hardware | Comments Off on Benchmark Disks NVMe vs SSD SATA vs HDD SATA

Core i5 Update

This is a rebuild of the Core i5 Sandy Bridge Build machine. This machine also started spontaneously shutting off, with BSOD and freeze events (just like what prompted the Core i5-10400 rebuild, previously).

The CPU, motherboard, and RAM were changed out. The old motherboard was USB 2 with DDR3; the new motherboard is USB 3 with DDR4, and the RAM got a bump from 16GB to 32GB.

Some facts on the new CPU: it is currently #279 on PassMark [20,970] cpubenchmark.net, with a turbo speed of 4.8 GHz and TDP of 65W to 117W. It is 6 cores/12 threads, first seen Q1 2022.

For now, the old GTX 970 video card is still being used (see Video Card Relative Benchmarks) – it is on its last legs, though, and needs upgrading too.

Next step: Windows was no longer activated (new hardware, which is expected). What is different is that the “New Hardware Installed” button is completely borked. It tried to log in to Microsoft, which changed my local account into the “linked email account”. In the end, I just decided to get a new OEM Windows 10 license and a new NVMe SSD (because at this point, this system is begging for it).

Update: The old Optiarc DVD drive was not recognized by the motherboard/BIOS as a valid boot device. So – swapped in a newer drive to get Windows installed. Update: splurged, purchased a Blu-ray burner with M-DISC support. Update: M-DISC blanks from Amazon arrived – true M-DISC 25x for $65, regular BD-R 25x for $22. Both are 25GB size.

Update: Bought a RTX 3080Ti for $900 (on sale from $1,200)

Item | Product | Cost
CPU | Intel Core i5 12600 3.3GHz LGA 1700 65W | $230
RAM | Corsair Vengeance RGB Pro 32GB (2x16GB) DDR4 3600 PC4-28800 | $130
Motherboard | ASUS PRIME B660-PLUS D4 LGA 1700, 2.5Gb LAN, Gen 1 Type-C | $140
Power Supply | Corsair 750CX ATX 12V 80 Plus EPS12V | ($60)
Video | NVIDIA GeForce RTX 3080 Ti 12GB GDDR6X, PCIe 4.0, 10,240 CUDA cores | $900
Case | Antec Three Hundred ATX Mid Tower | ($60)
SSD Drive | Samsung 980 Pro 1TB PCIe Gen 4 x4 NVMe, 7,000MB/s, M.2 2280 | $140
HDD Drive | Western Digital Black 1TB SATA 6.0Gb/s 7200RPM 64MB WD1002FAEX | ($88)
DVD/CD | ASUS BW-16D1HT Blu-ray burner with M-DISC support | $80
OS | Windows 10 Professional 64-bit | $130
Keyboard | Corsair Vengeance K70 Mechanical Gaming Keyboard, Red LED, Cherry MX Brown switches | ($130)
Mouse | Logitech G5 2-Tone 6 Buttons 1 x Wheel USB Wired Laser 2000 dpi Mouse | ($46)
Total | | $

Posted in Computer Builds, Core-i5 | Comments Off on Core i5 Update