Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Grant Edwards
Newsgroups: comp.lang.python
Date: Mon, 30 Sep 2024 18:41 UTC
Message-ID: <mailman.10.1727721708.3018.python-list@python.org>

On 2024-09-30, Dan Sommers via Python-list <python-list@python.org> wrote:
> On 2024-09-30 at 11:44:50 -0400,
> Grant Edwards via Python-list <python-list@python.org> wrote:
>
>> On 2024-09-30, Left Right via Python-list <python-list@python.org> wrote:
>> > [...]
>> > Imagine a pathological case of this shape: 1... <60GB of digits>. This
>> > is still a valid JSON (it doesn't have any limits on how many digits a
>> > number can have). And you cannot parse this number in a streaming way
>> > because in order to do that, you need to start with the least
>> > significant digit.
>>
>> Which is how Arabic numbers were originally parsed, but when
>> westerners adopted them from a R->L written language, they didn't
>> flip them around to match the L->R written language into which they
>> were being adopted.
>
> Interesting.
>
>> So now long numbers can't be parsed as a stream in software. They
>> should have anticipated this problem back in the 13th century and
>> flipped the numbers around.
>
> What am I missing? Handwavingly, start with the first digit, and as
> long as the next character is a digit, multiply the accumulated
> result by 10 (or the appropriate base) and add the next value.
> [...] But why do I need to start with the least significant digit?

Excellent question. That's actually a pretty standard way to parse
numeric literals. I accepted the claim at face value that in JSON
there is something that requires parsing numeric literals from the
least significant end -- but I can't think of why the usual algorithms
used by other languages' lexers for yonks wouldn't work for JSON.
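
For illustration, here is a minimal sketch of the multiply-and-add
approach Dan describes, reading the most significant digit first from
a stream of text chunks. The function name and the chunked input are
assumptions made for the example, not anything taken from a real JSON
library, and it handles only plain non-negative integers (no sign,
fraction, or exponent).

```python
from typing import Iterable


def parse_digits_streaming(chunks: Iterable[str]) -> int:
    """Accumulate a non-negative decimal integer, most significant digit first.

    `chunks` is a hypothetical stand-in for pieces of text arriving from
    a stream; each chunk may hold any number of digits.
    """
    value = 0
    seen_digit = False
    for chunk in chunks:
        for ch in chunk:
            # Accept ASCII digits only, as JSON number syntax does.
            if not ("0" <= ch <= "9"):
                raise ValueError(f"not a decimal digit: {ch!r}")
            # Shift what we have so far one decimal place and add the new digit.
            value = value * 10 + (ord(ch) - ord("0"))
            seen_digit = True
    if not seen_digit:
        raise ValueError("no digits in input")
    return value


# The digits arrive in reading order, split across arbitrary chunk boundaries.
print(parse_digits_streaming(["12", "345", "6"]))
```

Nothing here needs the least significant digit up front. The
accumulated integer still grows with the input, so this shows the
order of parsing rather than a way to hold a 60 GB number in memory.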

--
Grant
