Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: 2QdxY4RzWzUUiLuE@potatochowder.com
Newsgroups: comp.lang.python
Date: Tue, 1 Oct 2024 15:47 UTC
Message-ID: <mailman.21.1727797649.3018.python-list@python.org>

On 2024-09-30 at 21:34:07 +0200,
Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
Left Right via Python-list <python-list@python.org> wrote:

> > What am I missing? Handwavingly, start with the first digit, and as
> > long as the next character is a digit, multiply the accumulated result
> > by 10 (or the appropriate base) and add the next value. Oh, and handle
> > scientific notation as a special case, and perhaps fail spectacularly
> > instead of recovering gracefully in certain edge cases. And in the
> > pathological case of a single number with 60 billion digits, run out of
> > memory (and complain loudly to the person who claimed that the file
> > contained a "dataset"). But why do I need to start with the least
> > significant digit?
>
> You probably forgot that it has to be _streaming_. Suppose you parse
> the first digit: can you hand this information over to an external
> function to process the parsed data? -- No! because you don't know the
> magnitude yet. What about two digits? -- Same thing. You cannot
> leave the parser code until you know the magnitude (otherwise the
> information is useless to the external code).

If I recognize the first digit, then I *can* hand that over to an
external function to accumulate the digits that follow.
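
To make that concrete, here is a rough sketch, not code from anyone in this thread. The names DigitAccumulator and scan_integer are made up for illustration, and it assumes plain unsigned decimal integers (no sign, no fraction, no exponent). The scanner keeps no numeric state of its own; each digit goes to the external object as soon as it is read, most significant digit first.

```python
from typing import Callable, Iterable


class DigitAccumulator:
    """Hypothetical external consumer: builds an int one digit at a time."""

    def __init__(self) -> None:
        self.value = 0
        self.digits_seen = 0

    def feed(self, digit: str) -> None:
        # Most significant digit arrives first: shift left, then add.
        self.value = self.value * 10 + int(digit)
        self.digits_seen += 1


def scan_integer(chars: Iterable[str], on_digit: Callable[[str], None]) -> str:
    """Pass each leading ASCII digit to on_digit as soon as it is read.

    Returns the first non-digit character, or "" at end of input, so the
    caller can resume parsing from there.
    """
    for ch in chars:
        if ch in "0123456789":
            on_digit(ch)
        else:
            return ch
    return ""


acc = DigitAccumulator()
terminator = scan_integer(iter("12345, 678"), acc.feed)
print(acc.value, acc.digits_seen, repr(terminator))
```

Whether the accumulator counts as "inside" or "outside" the parser is exactly the terminology question; the work being done is the same either way.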

> So, even if you have enough memory and don't care about special cases
> like scientific notation: yes, you will be able to parse it, but it
> won't be a streaming parser.

Under that constraint, I'm not sure I can parse anything. How can I
parse a string (and hand it over to an external function) until I've
found the closing quote?
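
One possible answer, sketched under assumptions of my own: the parser can emit start/chunk/end events and let the external code assemble (or discard) the pieces. The function name stream_string and the event names are hypothetical, the input is assumed to begin at the opening quote, and escape sequences are passed through undecoded.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, str]


def stream_string(chunks: Iterable[str]) -> Iterator[Event]:
    """Yield events for one JSON-style string spread across input chunks.

    Only two flags of state are kept between chunks; the string itself is
    never buffered as a whole. A backslash protects the next character.
    """
    started = False
    escaped = False
    for chunk in chunks:
        piece = []
        for ch in chunk:
            if not started:
                if ch != '"':
                    raise ValueError("expected opening quote")
                started = True
                yield ("start", "")
            elif escaped:
                piece.append(ch)
                escaped = False
            elif ch == "\\":
                piece.append(ch)
                escaped = True
            elif ch == '"':
                if piece:
                    yield ("chunk", "".join(piece))
                yield ("end", "")
                return
            else:
                piece.append(ch)
        # Hand over whatever this input chunk contributed before reading on.
        if piece:
            yield ("chunk", "".join(piece))
    raise ValueError("input ended before closing quote")


for event in stream_string(['"hel', 'lo wo', 'rld" tail']):
    print(event)
```

The consumer sees the string before the closing quote has been found, at the cost of having to cope with partial values.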

How much state can a parser maintain (before it invokes an external
function) and still be considered streaming? I fear that we may be
getting hung up on terminology rather than solving the problem at hand.
