Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Left Right
Newsgroups: comp.lang.python
Date: Tue, 1 Oct 2024 21:03 UTC

> If I recognize the first digit, then I *can* hand that over to an
> external function to accumulate the digits that follow.

And what is that external function going to do with this information?
The point is you didn't parse anything if you just sent the digit.
You just delegated the parsing further. Parsing is only meaningful if
you extracted some information, but your idea is, essentially, "what if
I do nothing?".
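
To make the disagreement concrete, here is a minimal sketch of the
"hand over each digit" design. The names stream_digits and Accumulator
are hypothetical and not taken from any library. The parser pushes
digits out as they arrive, but the consumer has no usable value until
the end-of-number event fires, so the accumulation has only moved, not
disappeared.

```python
from typing import Callable, Iterable

DIGITS = "0123456789"

def stream_digits(chars: Iterable[str],
                  on_digit: Callable[[int], None],
                  on_end: Callable[[], None]) -> None:
    """Push each leading digit to the consumer as soon as it is read."""
    for ch in chars:
        if ch not in DIGITS:
            break              # first non-digit terminates the number
        on_digit(int(ch))
    on_end()                   # only now is the magnitude known

class Accumulator:
    """External consumer: does the multiply-by-10-and-add step itself."""
    def __init__(self) -> None:
        self.value = 0
        self.done = False

    def digit(self, d: int) -> None:
        self.value = self.value * 10 + d

    def end(self) -> None:
        self.done = True

acc = Accumulator()
stream_digits(iter("12345,"), acc.digit, acc.end)
print(acc.value, acc.done)
```

Whether this counts as "streaming" depends on whether the consumer can
do anything useful with acc.value before acc.done becomes true.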

> Under that constraint, I'm not sure I can parse anything. How can I
> parse a string (and hand it over to an external function) until I've
> found the closing quote?

Nobody says that parsing a number is the only pathological case. You,
however, exaggerate by saying you cannot parse _anything_. You can
parse booleans or null, for example. There's no problem there.
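
As a rough illustration of why the literals are unproblematic: true,
false and null have a fixed maximum length, so a recognizer needs at
most five characters of state before it can emit a value. The sketch
below is an assumption-laden toy (literal_events and SEPARATORS are
made-up names, and it handles nothing but these three literals), not a
JSON parser.

```python
from typing import Iterable, Iterator

LITERALS = {"true": True, "false": False, "null": None}
SEPARATORS = " \t\r\n,[]"

def literal_events(chars: Iterable[str]) -> Iterator[object]:
    """Yield each JSON literal as soon as its last character arrives."""
    buf = ""
    for ch in chars:
        if ch in SEPARATORS:
            if buf:
                raise ValueError(f"truncated literal: {buf!r}")
            continue
        buf += ch
        if buf in LITERALS:
            yield LITERALS[buf]    # bounded state: buf never exceeds 5 chars
            buf = ""
        elif not any(name.startswith(buf) for name in LITERALS):
            raise ValueError(f"unexpected input: {buf!r}")
    if buf:
        raise ValueError(f"truncated literal at end of input: {buf!r}")

events = list(literal_events("[true, false, null]"))
print(events)
```

The buffer is bounded by the longest literal, which is what separates
this case from numbers and strings of unbounded length.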

Again, I think you misunderstand what streaming is for. Let me remind you:
it's for processing information as it comes, potentially
indefinitely. This has implications that reach well beyond
computer science. For example, some mathematicians use the
same argument to show that real numbers are either fiction or useless:
consider adding two real numbers (where real numbers are potentially
infinite strings of decimal digits after the period) -- there's no way
to prove that such an addition is possible because you would need an
infinite proof for that (because you need to start adding from the
least significant digit).
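
The carry dependency behind that argument can be shown with finite
digit strings. In this sketch (add_fraction_digits is a hypothetical
helper, and the operands are finite stand-ins for the infinite case),
changing only the last digit of one operand flips every digit of the
sum, including the carry into the integer part.

```python
from typing import Tuple

def add_fraction_digits(a: str, b: str) -> Tuple[int, str]:
    """Add two equal-length fractional digit strings, right to left.

    Returns (carry into the integer part, digits of the fractional sum).
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same number of digits")
    carry = 0
    out = []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        out.append(str(total % 10))
        carry = total // 10
    return carry, "".join(reversed(out))

n = 20
a = "4" + "9" * n                      # 0.4999...9
low = add_fraction_digits(a, "5" + "0" * n)                # 0.5000...0
high = add_fraction_digits(a, "5" + "0" * (n - 1) + "1")   # 0.5000...01
print(low)
print(high)
```

With low the sum is 0.999...9 and no carry; with high it is 1.000...0.
No prefix of the operands shorter than the whole string decides which
one you get. This only illustrates the dependency; it does not settle
the argument about real numbers.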

In principle, any language that has infinite words will have the same
problem with streaming. If you have ever looked at hardware or low-level
protocols such as SCSI or IP, you'd see that they are specifically
designed never to have infinite words (because they
must be amenable to streaming). Consider also an interesting
consequence of SCSI not being able to have infinite words: this means,
among other things, that fsync() is nonsense! :) If you aren't
familiar with the concept: the UNIX filesystem API suggests that it's
possible to destage an arbitrarily large file (or a chunk of a file) to disk.
But SCSI is built of finite "words", and to describe an arbitrarily large
file you'd need to list all the blocks that constitute the file! And
that's why fsync() and family are so hated by people who deal with
storage: the only way to implement fsync() in compliance with the
standard is to sync _everything_ (and it hurts!)

On Tue, Oct 1, 2024 at 5:49 PM Dan Sommers via Python-list
<python-list@python.org> wrote:
>
> On 2024-09-30 at 21:34:07 +0200,
> Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API,"
> Left Right via Python-list <python-list@python.org> wrote:
>
> > > What am I missing? Handwavingly, start with the first digit, and as
> > > long as the next character is a digit, multiply the accumulated result
> > > by 10 (or the appropriate base) and add the next value. Oh, and handle
> > > scientific notation as a special case, and perhaps fail spectacularly
> > > instead of recovering gracefully in certain edge cases. And in the
> > > pathological case of a single number with 60 billion digits, run out of
> > > memory (and complain loudly to the person who claimed that the file
> > > contained a "dataset"). But why do I need to start with the least
> > > significant digit?
> >
> > You probably forgot that it has to be _streaming_. Suppose you parse
> > the first digit: can you hand this information over to an external
> > function to process the parsed data? -- No! because you don't know the
> > magnitude yet. What about two digits? -- Same thing. You cannot
> > leave the parser code until you know the magnitude (otherwise the
> > information is useless to the external code).
>
> If I recognize the first digit, then I *can* hand that over to an
> external function to accumulate the digits that follow.
>
> > So, even if you have enough memory and don't care about special cases
> > like scientific notation: yes, you will be able to parse it, but it
> > won't be a streaming parser.
>
> Under that constraint, I'm not sure I can parse anything. How can I
> parse a string (and hand it over to an external function) until I've
> found the closing quote?
>
> How much state can a parser maintain (before it invokes an external
> function) and still be considered streaming? I fear that we may be
> getting hung up on terminology rather than solving the problem at hand.

Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API
From: Greg Ewing
Newsgroups: comp.lang.python
Date: Tue, 1 Oct 2024 22:07 UTC

On 2/10/24 10:03 am, Left Right wrote:
> Consider also an interesting
> consequence of SCSI not being able to have infinite words: this means,
> among other things, that fsync() is nonsense! :) If you aren't
> familiar with the concept: the UNIX filesystem API suggests that it's
> possible to destage an arbitrarily large file (or a chunk of a file) to disk.
> But SCSI is built of finite "words", and to describe an arbitrarily large
> file you'd need to list all the blocks that constitute the file!

I don't follow. What fsync() does is ensure that any data buffered
in the kernel relating to the file is sent to the storage device.
It can send as many blocks of data over SCSI as required to
achieve this. There's no requirement for it to be atomic at the
level of the interface between the kernel and the hardware.

Some devices do their own buffering in ways that are invisible to
the software, so fsync() can't guarantee that the data is actually
written to the storage medium. But that's a problem stemming from
the design of the hardware, not the design of the protocol for
communicating with the hardware.
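
For reference, this is roughly what the call looks like from Python.
os.fsync() is the standard-library wrapper around the system call; the
helper name durable_write and the file name are made up for the
example. As noted above, a successful return means the kernel has
handed the data to the device, not necessarily that the device has
committed it to the medium.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the kernel to push it to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())    # ask the kernel to send this file's data to the device

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.bin")
    durable_write(target, b"x" * 4096)
    size = os.path.getsize(target)

print(size)
```

The f.flush() call matters: without it, os.fsync() may run while some
bytes are still sitting in Python's own buffer.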

> the only way to implement fsync() in compliance with the
> standard is to sync _everything_

Again, I'm not sure what you mean here. It may be difficult for the
kernel to track down exactly what data is relevant to a particular file,
and so the kernel programmers take the easy way out and just implement
fsync() as sync(). But again, that has nothing to do with the protocol.

--
Greg
