Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: 2QdxY4RzWzUUiLuE@potatochowder.com
Newsgroups: comp.lang.python
Date: Wed, 2 Oct 2024 00:20 UTC
Message-ID: <mailman.25.1727828470.3018.python-list@python.org>

On 2024-10-01 at 23:03:01 +0200,
Left Right <olegsivokon@gmail.com> wrote:

> > If I recognize the first digit, then I *can* hand that over to an
> > external function to accumulate the digits that follow.
>
> And what is that external function going to do with this information?
> The point is you didn't parse anything if you just sent the digit.
> You just delegated the parsing further. Parsing is only meaningful if
> you extracted some information, but your idea is, essentially "what if
> I do nothing?".

If the parser detects the first digit of a number, then the parser can
read digits one at a time (i.e., "streaming"), assimilate and accumulate
the value of the number being parsed, and successfully finish parsing
the number when it reads a non-digit. Whether the function that accumulates
the value during the process is internal or external isn't relevant; the
point is that it is possible to parse integers from most significant
digit to least significant digit under a streaming model (and if you're
sufficiently clever, you can even write partial results to external
storage and/or another transmission protocol, thus allowing for numbers
bigger (as measured by JSON or your internal representation) than your
RAM).
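
A minimal sketch of that most-significant-digit-first accumulation,
assuming the input arrives as an iterable of single characters. The
names accumulate_digits and emit_digit_chunks are hypothetical (they are
not from this thread or from any JSON library), and only unsigned
decimal integers are handled:

```python
from typing import Iterable, Iterator, Optional, Tuple


def accumulate_digits(chars: Iterable[str]) -> Tuple[int, Optional[str]]:
    """Consume digits one at a time, most significant digit first.

    Returns the accumulated value and the first non-digit character
    read (None if the input ended before a non-digit appeared).
    """
    value = 0
    for ch in chars:
        if not ("0" <= ch <= "9"):
            return value, ch
        # Shift what we have one decimal place left, then add the new digit.
        value = value * 10 + (ord(ch) - ord("0"))
    return value, None


def emit_digit_chunks(chars: Iterable[str], chunk_size: int = 4) -> Iterator[str]:
    """Forward the digits in fixed-size pieces instead of accumulating them,
    so the whole number never has to be held in memory at once."""
    buf = []
    for ch in chars:
        if not ("0" <= ch <= "9"):
            break
        buf.append(ch)
        if len(buf) == chunk_size:
            yield "".join(buf)
            buf = []
    if buf:
        yield "".join(buf)
```

For example, accumulate_digits("12345,") consumes five digits and hands
back the comma that ended the number, while emit_digit_chunks could feed
a file or a socket piece by piece.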

At most, the parser has to remember the non-digit character it read so
that it (the parser) can begin to parse whatever comes after the number.
Does that break your notion of "streaming"?
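
The one character of memory described above could be sketched like
this. It is a deliberately simplified illustration, not a JSON parser:
stream_integers is a hypothetical name, the input is assumed to be a
text stream read one character at a time, and signs, fractions,
exponents, and digits inside strings are all ignored.

```python
import io
from typing import Iterator, TextIO


def stream_integers(stream: TextIO) -> Iterator[int]:
    """Yield each unsigned integer found in the stream, reading one
    character at a time and keeping only one character of look-ahead."""
    pending = stream.read(1)  # the single character the parser remembers
    while pending:
        if "0" <= pending <= "9":
            value = 0
            while pending and "0" <= pending <= "9":
                value = value * 10 + int(pending)
                pending = stream.read(1)
            # pending now holds the non-digit that ended the number (or ""
            # at end of input); it is where parsing resumes.
            yield value
        else:
            # Not part of a number: skip brackets, commas, whitespace.
            pending = stream.read(1)


numbers = list(stream_integers(io.StringIO("[12, 345, 6]")))
```

The state carried between characters is the running value and the one
pending character, no matter how long the input is.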

Why do I have to start with the least significant digit?

> > Under that constraint, I'm not sure I can parse anything. How can I
> > parse a string (and hand it over to an external function) until I've
> > found the closing quote?
>
> Nobody says that parsing a number is the only pathological case. You,
> however, exaggerate by saying you cannot parse _anything_. You can
> parse booleans or null, for example. There's no problem there.

My intent was only to repeat what you implied: that any parser that
reads its input until it has parsed a value is not streaming.

So how much information can the parser keep before you consider it not
to be "streaming"?

[...]

> In principle, any language that has infinite words will have the same
> problem with streaming [...]

So what magic allows anyone to stream any JSON file over SCSI or IP?
Let alone some kind of "live stream" that by definition is indefinite,
even if it only lasts a few tenths of a second?

> [...] If you ever pondered h/w or low-level
> protocols s.a. SCSI or IP [...]

I spent a good deal of my career designing and implementing all manner
of communications protocols, from transmitting and receiving single
bits over a wire all the way up to what are now known as session and
presentation layers. Some imposed maximum lengths in certain places;
some allowed for indefinite amounts of data to be transferred from one
end to the other without stopping, resetting, or overflowing. And yet
somehow, the universe never collapsed.

If you believe that some implementation of fsync fails to meet a
specification, or fails to work correctly on files containing JSON, then
file a bug report.
