Kaggle, JSON, Python, and pandas

While looking at various datasets from Kaggle to do some experiments with Python and graph visualization with networkx, the arXiv dataset caught my attention. It has a (relatively) simple schema – authors, papers, and categories. It is also huge – there are 2,219,423 lines/records in its 3.5GB uncompressed .json file (1.2GB compressed).

The format of that .json file is also a practical example of an issue with JSON – you can't start with a valid .json file, add/write/append another record, and still end up with valid JSON. (The CSV format allows appending without rewriting the entire file, but CSV has many other issues that make it unattractive for most cases.)

There is a (semi-formal?) standard for the format used by arXiv (even though it is never mentioned by name) – "JSONL" or "JSON Lines". See this 2022 entry from atatus.com and this jsonlines.org summary. The vocabulary is fun: "Each line has a valid JSON value". Other names for the same idea: "line-delimited JSON", "newline-delimited JSON", and "concatenated JSON". The suggested extension is ".jsonl".
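
As a small illustration of why the format is appendable (the file name and record fields below are made up, not taken from the arXiv schema), adding a record to a JSONL file is just writing one more line:

import json

record = {"id": "0000.00000", "title": "Example paper", "categories": "cs.LG"}

# Each record is a complete JSON value on its own line, so appending a new
# record never requires rewriting the rest of the file.
with open("example.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")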

Tool support for JSONL seems to be hit-or-miss. jq supports JSONL (surprise!). The Python standard library module json does not (it throws a JSONDecodeError: "Extra data: line 2 column 1 (char 1689)"). As the title says, this post is about pandas.
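
A quick sketch of the difference – the error above is what json.load() produces on a JSONL file, and parsing one line at a time is the workaround:

import json

path = "arxiv-metadata-oai-snapshot.json"

# json.load() expects the whole file to be one JSON value, so on a JSONL
# file it fails at the start of the second record with "Extra data: ...".
# Parsing one line at a time works, because each line is valid JSON.
with open(path, encoding="utf-8") as f:
    first = json.loads(next(f))   # parse just the first record
print(list(first.keys()))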

Based on many of the code examples out there, you could easily conclude that pandas does not support JSONL. For a naive implementation, see Arxiv Data analysis, which loads every object into an array and then creates the DataFrame with pd.DataFrame.from_records(array). The people who run this naive code have either (a) trimmed the number of lines in their input .json to something reasonable, like 100k lines, or (b) have access to machines with 24GB+ of RAM so they can hold two copies of the data in memory at the same time.
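
Roughly, that naive pattern looks like this (a sketch, not a copy of that notebook):

import json
import pandas as pd

records = []
with open("arxiv-metadata-oai-snapshot.json", encoding="utf-8") as f:
    for line in f:                       # parse every line into a Python dict
        records.append(json.loads(line))

# Both the list and the DataFrame exist at the same time here, which is
# what makes the full 3.5GB file blow past the memory budget.
df = pd.DataFrame.from_records(records)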

The GitHub gist that inspired this post was cj2001/load_arxiv_data.py, which creates the entire array of "metadata", but with a limit on the number of lines/objects created. That limit keeps memory in check, but it is still "the hard way".

For a slightly more sophisticated implementation, see Scientific Article Recommendation – see In[2] extract_data, with its "yield line" generator, and In[3] fetch_n_records, which uses a list comprehension to load a slice of the file into memory:

return [json.loads(record) for record in islice(data_gen, chunksize)]
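
Pieced together, that pattern looks roughly like this (a sketch based on the function names in that notebook, not a verbatim copy):

import json
from itertools import islice
import pandas as pd

def extract_data(path):
    # Generator that yields one raw JSON line at a time ("yield line").
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line

def fetch_n_records(data_gen, chunksize):
    # Parse the next `chunksize` lines into a list of dicts.
    return [json.loads(record) for record in islice(data_gen, chunksize)]

data_gen = extract_data("arxiv-metadata-oai-snapshot.json")
df = pd.DataFrame.from_records(fetch_n_records(data_gen, 100_000))

This keeps only the requested number of records in memory, but it still does by hand what pandas can do on its own.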

In conclusion:

The correct pattern is to use pd.read_json(filename, lines=True, orient='records', nrows=nnn) (as can be seen in ArXiv). Note that leaving off the nrows= argument can create an out-of-memory situation even when nrows=<a huge value> would work just fine. For example, on the arXiv file, nrows=5000000 works (there are only 2,219,423 lines), but leaving off nrows entirely crashed with the OOM killer.

So – add pandas.read_json() to the list of tools that support JSONL natively:

import pandas as pd

df = pd.read_json("arxiv-metadata-oai-snapshot.json",
                  nrows=5000000,     # more than the 2,219,423 lines in the file
                  orient='records',
                  lines=True)
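
If even nrows is more than the machine can hold, read_json also accepts a chunksize argument (only valid together with lines=True), which returns an iterator of DataFrames instead of one big frame. A sketch (the chunk size here is an arbitrary choice):

import pandas as pd

# Only one chunk of 100,000 records is in memory at a time.
total = 0
for chunk in pd.read_json("arxiv-metadata-oai-snapshot.json",
                          lines=True, chunksize=100_000):
    total += len(chunk)
print(total)   # 2219423 lines in this snapshot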
