Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Left Right
Newsgroups: comp.lang.python
Date: Mon, 30 Sep 2024 19:34 UTC

> What am I missing? Handwavingly, start with the first digit, and as
> long as the next character is a digit, multiply the accumulated result
> by 10 (or the appropriate base) and add the next value. Oh, and handle
> scientific notation as a special case, and perhaps fail spectacularly
> instead of recovering gracefully in certain edge cases. And in the
> pathological case of a single number with 60 billion digits, run out of
> memory (and complain loudly to the person who claimed that the file
> contained a "dataset"). But why do I need to start with the least
> significant digit?

You probably forgot that it has to be _streaming_. Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data? -- No! because you don't know the
magnitude yet. What about two digits? -- Same thing. You cannot
leave the parser code until you know the magnitude (otherwise the
information is useless to the external code).

So, even if you have enough memory and don't care about special cases
like scientific notation: yes, you will be able to parse it, but it
won't be a streaming parser.
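
A minimal sketch of the point above, assuming plain ASCII decimal digits and ignoring signs and scientific notation. The function name `parse_ints` is made up for illustration. The accumulator is updated after every digit, yet nothing can be handed to the caller until a non-digit shows that the number has ended:

```python
from typing import Iterable, Iterator


def parse_ints(chars: Iterable[str]) -> Iterator[int]:
    """Yield each run of decimal digits as an int, reading one character at a time."""
    value = None  # None means "not currently inside a number"
    for ch in chars:
        if "0" <= ch <= "9":
            # Most significant digit first: shift what we have by one
            # decimal place, then add the new digit.
            value = (0 if value is None else value * 10) + int(ch)
        elif value is not None:
            # Only here, at the first non-digit, is the magnitude known.
            yield value
            value = None
    if value is not None:
        # The input ended while still inside a number.
        yield value
```

For example, `list(parse_ints("12,345 6"))` gives `[12, 345, 6]`; each value is emitted only once its terminating character (or the end of input) has been seen.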

On Mon, Sep 30, 2024 at 9:30 PM Left Right <olegsivokon@gmail.com> wrote:
>
> > Streaming won't work because the file is gzipped. You have to receive
> > the whole thing before you can unzip it. Once unzipped it will be even
> > larger, and all in memory.
>
> GZip is specifically designed to be streamed. So, that's not a
> problem (in principle), but you would need a streaming GZip
> decompressor. A quick search of PyPI turned up this package:
> https://pypi.org/project/gzip-stream/ .
>
> On Mon, Sep 30, 2024 at 6:20 PM Thomas Passin via Python-list
> <python-list@python.org> wrote:
> >
> > On 9/30/2024 11:30 AM, Barry via Python-list wrote:
> > >
> > >
> > >> On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:
> > >>
> > >>
> > >> import polars as pl
> > >> pl.read_json("file.json")
> > >>
> > >>
> > >
> > > > This is not going to work unless the computer has a lot more than 60 GiB of RAM.
> > > >
> > > > As later suggested, a streaming parser is required.
> >
> > Streaming won't work because the file is gzipped. You have to receive
> > the whole thing before you can unzip it. Once unzipped it will be even
> > larger, and all in memory.
> > --
> > https://mail.python.org/mailman/listinfo/python-list
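
A minimal sketch of streaming gzip decompression, using the standard library's `zlib` module rather than the PyPI package linked in the quoted text. The function name `gunzip_chunks` and the 64-byte chunk size are assumptions for illustration, and the in-memory payload stands in for a network response. The sketch handles a single-member gzip stream only:

```python
import gzip
import zlib
from typing import Iterable, Iterator


def gunzip_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Decompress a gzip stream piece by piece, yielding data as it becomes available."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decomp.decompress(chunk)
        if data:
            yield data
    tail = decomp.flush()
    if tail:
        yield tail


# Demonstration: compress a payload, then feed it back in small pieces.
payload = b'{"id": 1}\n' * 1000
blob = gzip.compress(payload)
pieces = (blob[i:i + 64] for i in range(0, len(blob), 64))
restored = b"".join(gunzip_chunks(pieces))
```

Joining the output here is only to check the round trip; a real consumer would process each decompressed piece as it arrives instead of collecting them.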

Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Greg Ewing
Newsgroups: comp.lang.python
Date: Tue, 1 Oct 2024 21:48 UTC

On 1/10/24 8:34 am, Left Right wrote:
> You probably forgot that it has to be _streaming_. Suppose you parse
> the first digit: can you hand this information over to an external
> function to process the parsed data? -- No! because you don't know the
> magnitude yet.

By that definition of "streaming", no parser can ever be streaming,
because there will be some constructs that must be read in their
entirety before a suitably-structured piece of output can be
emitted.

The context of this discussion about integers is the claim that
they *could* be parsed incrementally if they were written little
endian instead of big endian, but the same argument applies either
way.

--
Greg

Subject: RE: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: <avi.e.gross@gmail.com>
Newsgroups: comp.lang.python
Date: Tue, 1 Oct 2024 23:26 UTC

This discussion has become less useful.

We can all agree that in Computer Science, real infinities are avoided, and
frankly, need not be taken seriously in any serious program.

You can store all kinds of infinities quite compactly, as with a
transcendental number that you can derive to as many decimal places as you
like. Want 1/7 to a thousand decimal places? No problem. You can be given a
digit 1 and a digit 7 and asked to do a division to as many digits as you
wish in a deterministic manner. I can think of quite a few generators that
could easily supply the next digit, or just keep giving the next element
from 142857 each time in a circular loop.

Sines, cosines, pi, e and so on, can often be calculated to arbitrary
precision by evaluating things like infinite Taylor Series as many times as
needed up to the precision of the data holding the number as you move along.

Similar ideas allow generators to give you as many primes as you want, and
no more.
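
A minimal sketch of such a digit generator, using schoolbook long division. The function name `decimal_digits` is made up for illustration, and positive integers are assumed:

```python
from itertools import islice
from typing import Iterator


def decimal_digits(numerator: int, denominator: int) -> Iterator[int]:
    """Yield the digits after the decimal point of numerator/denominator, forever."""
    remainder = numerator % denominator
    while True:
        # One step of long division: bring down a zero, emit the quotient digit.
        remainder *= 10
        yield remainder // denominator
        remainder %= denominator


# Take only as many digits as needed; the generator itself never ends.
first_twelve = list(islice(decimal_digits(1, 7), 12))
```

Here `first_twelve` is `[1, 4, 2, 8, 5, 7, 1, 4, 2, 8, 5, 7]`, the repeating block 142857 twice. The state kept between digits is a single remainder, however many digits are requested.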

So, if you can store arbitrary Python code as part of your JSON, you can
send quite a bit of somewhat compressed data.

The real problem is how the JSON is set up. If you take umpteen data
structures and wrap them all in something like a list, then it may be a tad
hard to stream as you may not necessarily be examining the contents till the
list finishes gigabytes later. But if, instead, you send lots of smaller
parts, such as perhaps sending each row of something like a data.frame
individually, the other side can recombine them incrementally to a larger
structure such as a data.frame and do some logic on it as it streams, such
as keeping only some columns and discarding the rest, or applying filters
that only keep rows you care about. And, of course, all rows could also be
appended to one or more CSV files, so that if you need multiple passes
over the data, it can be processed locally in various modes,
including "streamed".
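
A minimal sketch of that row-at-a-time approach, assuming the sender emits one JSON object per line. The field names (`id`, `severity`, `notes`), the filter, and the function names are all hypothetical; nothing here reflects the actual Kenna API format:

```python
import csv
import io
import json
from typing import Callable, IO, Iterable, Iterator, List


def iter_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_to_csv(lines: Iterable[str], out: IO[str], columns: List[str],
                  keep: Callable[[dict], bool]) -> int:
    """Write the selected columns of the rows that pass `keep`; return the row count."""
    # extrasaction="ignore" silently drops any keys not listed in `columns`.
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for row in iter_rows(lines):
        if keep(row):
            writer.writerow(row)
            written += 1
    return written


# Demonstration with an in-memory stream standing in for the incoming data.
source = io.StringIO(
    '{"id": 1, "severity": 9, "notes": "x"}\n'
    '{"id": 2, "severity": 3, "notes": "y"}\n'
    '{"id": 3, "severity": 8, "notes": "z"}\n'
)
out = io.StringIO()
count = filter_to_csv(source, out, ["id", "severity"], lambda r: r["severity"] >= 8)
```

Only one row is held in memory at a time, and the resulting CSV can be re-read as often as needed for later passes.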

I think that for some purposes, it makes some sense to not stream anything
but results. I mean consider any database that allows a remote login and SQL
commands that only stream results. If I only want info on records about
company X between July 1 and September 15 of a particular year and only if
the amount paid remains zero or is less than the amount owed, ...
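
A minimal sketch of that kind of query, with an in-memory SQLite database standing in for the remote server. The table, its columns, the year, and the sample rows are all invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices "
    "(company TEXT, billed_on TEXT, amount_owed REAL, amount_paid REAL)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        ("X", "2024-06-30", 100.0, 0.0),    # outside the date range
        ("X", "2024-07-15", 100.0, 0.0),    # unpaid, in range
        ("X", "2024-08-01", 100.0, 100.0),  # fully paid
        ("X", "2024-09-15", 50.0, 20.0),    # partly paid, in range
        ("Y", "2024-08-01", 75.0, 0.0),     # different company
    ],
)

# ISO-formatted dates compare correctly as plain text.
query = """
    SELECT billed_on, amount_owed, amount_paid
    FROM invoices
    WHERE company = ?
      AND billed_on BETWEEN ? AND ?
      AND (amount_paid = 0 OR amount_paid < amount_owed)
    ORDER BY billed_on
"""
# The cursor yields matching rows one at a time; list() is only for the demo.
matches = list(conn.execute(query, ("X", "2024-07-01", "2024-09-15")))
```

The filtering happens where the data lives, so only the two matching rows cross over to the client.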


Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Greg Ewing
Newsgroups: comp.lang.python
Date: Wed, 2 Oct 2024 05:27 UTC

On 2/10/24 12:26 pm, avi.e.gross@gmail.com wrote:
> The real problem is how the JSON is set up. If you take umpteen data
> structures and wrap them all in something like a list, then it may be a tad
> hard to stream as you may not necessarily be examining the contents till the
> list finishes gigabytes later.

Yes, if you want to process the items as they come in, you might
be better off sending a series of separate JSON strings, rather than
one JSON string containing a list.

Or, use a specialised JSON parser that processes each item of the
list as soon as it's finished parsing it, instead of collecting the
whole list first.
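
A minimal sketch of the second option, built on the standard library's `json.JSONDecoder.raw_decode`. The function name `iter_array_items` is made up for illustration. This is not a validating parser: it is lenient about stray commas, and it re-scans an unfinished item on every new chunk, so a dedicated incremental parsing library would likely serve better for real workloads:

```python
import json
from typing import Any, Iterable, Iterator


def iter_array_items(chunks: Iterable[str]) -> Iterator[Any]:
    """Yield the items of one top-level JSON array as soon as each is complete."""
    decoder = json.JSONDecoder()
    buffer = ""
    started = False   # True once the opening '[' has been consumed
    finished = False  # True once the closing ']' has been seen
    for chunk in chunks:
        buffer += chunk
        while not finished:
            buffer = buffer.lstrip()
            if not buffer:
                break
            if not started:
                if buffer[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buffer, started = buffer[1:], True
                continue
            if buffer[0] == "]":
                finished = True
                break
            if buffer[0] == ",":
                buffer = buffer[1:]
                continue
            try:
                item, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # item not complete yet; wait for the next chunk
            rest = buffer[end:].lstrip()
            if not rest or rest[0] not in ",]":
                break  # e.g. "12" may still grow into "123"; wait for a delimiter
            yield item
            buffer = rest
        if finished:
            break
    if not finished:
        raise ValueError("input ended before the array was closed")
```

Fed the chunks `'[{"id": 1}, {"i'`, `'d": 2}, 12'`, `'3, "a,]b"'` and `' ]'`, it yields `{"id": 1}`, `{"id": 2}`, `123` and `"a,]b"` in turn. The number is held back until its trailing comma arrives, which is the same magnitude problem discussed earlier in the thread, confined to a single item.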

--
Greg

Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Left Right
Newsgroups: comp.lang.python
Date: Wed, 2 Oct 2024 06:05 UTC

> By that definition of "streaming", no parser can ever be streaming,
> because there will be some constructs that must be read in their
> entirety before a suitably-structured piece of output can be
> emitted.

In the same email you replied to, I gave examples of languages for
which parsers can be streaming (in general): SCSI or IP. For some
languages (e.g. everything in the context-free family) streaming
parsers are _in general_ impossible, because there are pathological
cases like the one with parsing numbers. But this doesn't mean that
you cannot come up with a parser that is only useful _sometimes_.
And, in practice, languages like XML or JSON do well with streaming,
even though in general it's impossible.

I'm sorry if this comes as a surprise. On one hand I don't want to
sound condescending, on the other hand, this is something that you'd
typically study in an automata theory class. Well, not exactly in the
very same words, but you should be able to figure this stuff out if
you have taken that class.

Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Chris Angelico
Newsgroups: comp.lang.python
Date: Wed, 2 Oct 2024 13:59 UTC

On Wed, 2 Oct 2024 at 23:53, Left Right via Python-list
<python-list@python.org> wrote:
> In the same email you replied to, I gave examples of languages for
> which parsers can be streaming (in general): SCSI or IP.

You can't validate an IP packet without having all of it. Your notion
of "streaming" is nonsensical.
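
As an illustration of the validation point (a sketch added for clarity, not part of the original message): the IPv4 header checksum is a one's-complement sum over every 16-bit word of the header, and the Total Length field has to be compared against the bytes actually received, so neither check can finish on a partial packet. The function name ipv4_header_ok is made up for this example, and the checksum covers only the header, not the payload.

```python
import struct


def ipv4_header_ok(packet: bytes) -> bool:
    """Return True if `packet` is a complete IPv4 packet with a valid header checksum.

    Hypothetical helper for illustration; options and payload are not inspected.
    """
    if len(packet) < 20:
        return False  # shorter than the minimal IPv4 header
    version_ihl, _tos, total_length = struct.unpack("!BBH", packet[:4])
    version, ihl = version_ihl >> 4, version_ihl & 0x0F
    header_len = ihl * 4
    if version != 4 or header_len < 20 or len(packet) < header_len:
        return False
    if total_length < header_len or len(packet) < total_length:
        return False  # the packet is truncated
    # One's-complement sum of all 16-bit header words, checksum field included.
    total = sum(struct.unpack(f"!{header_len // 2}H", packet[:header_len]))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF
```

A receiver can of course start reading header fields before the rest arrives; what it cannot do is declare the packet valid until the whole header, and the number of bytes promised by Total Length, are in hand.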

ChrisA

Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Chris Angelico
Newsgroups: comp.lang.python
Date: Wed, 2 Oct 2024 22:51 UTC

On Thu, 3 Oct 2024 at 08:48, Left Right <olegsivokon@gmail.com> wrote:
>
> > You can't validate an IP packet without having all of it. Your notion
> > of "streaming" is nonsensical.
>
> Whoa, whoa, hold your horses! "nonsensical" needs a little bit of
> justification :)
>
> It seems you don't understand the difference between words and
> languages! In my examples, IP _protocol_ is the language, sequences of
> IP packets are the words in the language. A language is amenable to
> streaming if the words of the language are repetition of sequences of
> symbols of the alphabet of fixed length. This is, essentially, like
> saying that the words themselves are regular.

One single IP packet is all you can parse. You're playing shenanigans
with words the way Humpty Dumpty does. IP packets are not sequences,
they are individuals.

ChrisA

Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Left Right
Newsgroups: comp.lang.python
Date: Wed, 2 Oct 2024 22:48 UTC

> You can't validate an IP packet without having all of it. Your notion
> of "streaming" is nonsensical.

Whoa, whoa, hold your horses! "nonsensical" needs a little bit of
justification :)

It seems you don't understand the difference between words and
languages! In my examples, IP _protocol_ is the language, sequences of
IP packets are the words in the language. A language is amenable to
streaming if the words of the language are repetition of sequences of
symbols of the alphabet of fixed length. This is, essentially, like
saying that the words themselves are regular.

So, the follow-up question from you to me should be: how come strictly
context-free languages can still be parsed with streaming parsers? --
And the answer to that is that it's possible to approximate context-free
languages with regular languages. In fact, this is a very interesting
subject, which unfortunately is usually overlooked in automata
classes. It's interesting in the sense that it's very accessible to
students who have already mastered the regular and context-free
formalisms.

So, streaming parsers (e.g. SAX) are written for a regular language
that approximates XML. This is because in practice we will almost
never encounter more than N nesting levels in an XML document, more
than N characters in an element name etc. (for some large enough N).
That bound is what allows us to create a regular language from a
context-free one.
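
A small sketch of the bounded-nesting idea (added for illustration, with made-up names): balanced brackets of arbitrary depth are the textbook context-free language, but once the depth is capped at some N the recognizer needs only the states 0..N, which makes it a finite automaton. The checker below consumes one symbol at a time and keeps nothing but the current state.

```python
def make_bounded_nesting_checker(max_depth):
    """Build an acceptor for balanced "[" "]" nested at most max_depth deep.

    Hypothetical helper for illustration. The only memory is `state`,
    an integer in 0..max_depth, so the machine has finitely many states.
    """
    def accepts(symbols):
        state = 0  # current nesting depth
        for s in symbols:
            if s == "[":
                if state == max_depth:
                    return False  # deeper than the approximation allows
                state += 1
            elif s == "]":
                if state == 0:
                    return False  # closing bracket with nothing open
                state -= 1
            # any other symbol is ignored in this sketch
        return state == 0
    return accepts
```

Because accepts takes any iterable, it can be fed from a generator that reads the input piece by piece. Inputs nested deeper than max_depth are rejected even though they are well-formed, which is exactly the sense in which the regular language only approximates the context-free one.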

NB. "Nonsensical" has a very precise meaning when it comes to
discussing the truth value of a proposition, which I think you also
somehow didn't know about. You seem to use "nonsensical" as a synonym
for "wrong". But, unbeknownst to you, you said something else. You
actually implied that there's no way to tell if my notion of streaming
is correct or not.

But, for future reference: my notion of streaming is correct, and
you would do better to study some material about it before jumping to
conclusions.

Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Left Right
Newsgroups: comp.lang.python
Date: Wed, 2 Oct 2024 22:56 UTC

> One single IP packet is all you can parse.

I worked for an undisclosed company which manufactures h/w for ISPs
(4- and 8-unit boxes you mount on a rack in a datacenter).
Essentially, big-big routers. So, I had the pleasure of writing
software that parses the IP _protocol_, and let me tell you: you have no
idea what you just wrote.

But, like I wrote earlier: you don't understand the distinction
between languages and words. And in general, you are just being stubborn
and rude because you are trying to prove a point to someone you don't
like, but, in reality, you just look more and more ridiculous.

On Thu, Oct 3, 2024 at 12:51 AM Chris Angelico <rosuav@gmail.com> wrote:
>
> On Thu, 3 Oct 2024 at 08:48, Left Right <olegsivokon@gmail.com> wrote:
> >
> > > You can't validate an IP packet without having all of it. Your notion
> > > of "streaming" is nonsensical.
> >
> > Whoa, whoa, hold your horses! "nonsensical" needs a little bit of
> > justification :)
> >
> > It seems you don't understand the difference between words and
> > languages! In my examples, IP _protocol_ is the language, sequences of
> > IP packets are the words in the language. A language is amenable to
> > streaming if the words of the language are repetition of sequences of
> > symbols of the alphabet of fixed length. This is, essentially, like
> > saying that the words themselves are regular.
>
> One single IP packet is all you can parse. You're playing shenanigans
> with words the way Humpty Dumpty does. IP packets are not sequences,
> they are individuals.
>
> ChrisA

Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Greg Ewing
Newsgroups: comp.lang.python
Date: Thu, 3 Oct 2024 07:08 UTC

On 3/10/24 11:48 am, Left Right wrote:
> So, streaming parsers (e.g. SAX) are written for a regular language
> that approximates XML.

SAX doesn't parse a whole XML document, it parses small pieces of it
independently and passes them on. It's more like a lexical analyser than
a parser in that respect.
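
For readers who have not used it, a minimal sketch of that behaviour with Python's standard xml.sax module (added for illustration; the EventLogger class and sax_events function are made-up names). The parser is fed chunks and reports each piece as an event as soon as it is recognised, without ever building a tree of the document.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record the events SAX emits, in order (hypothetical example handler)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Text may arrive split across several calls; skip pure whitespace.
        if content.strip():
            self.events.append(("text", content))


def sax_events(chunks):
    """Feed XML chunks to an incremental SAX parser and return the logged events."""
    handler = EventLogger()
    parser = xml.sax.make_parser()  # the default expat reader supports feed()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return handler.events
```

Calling sax_events(["<root><it", "em>hi</item>", "</root>"]) produces start and end events for root and item even though the item tag is split across two chunks. Any structure spanning several events, such as matching a record to its children, is left to the handler to track.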

--
Greg
