👤 🏝️ 🥅 PyDERASN: as I added big-data support 🤛🏽 🔼 🔵

I continue the last article about PyDERASN - the free ASN.1 DER / CER / BER codec in Python. Over the past year, from the time of its writing, in addition to all the little things, small corrections, and even more rigorous data verification (although before it was already the most stringent of free codecs known to me), a functionality for working with large volumes of data appeared in this library - not creeping into RAM. I want to talk about this in this article.

Problems / Tasks

CRL:
, (CA) X.509 . , CRL (certificate revocation list). CRL CACert.org, 8.72 MiB, PyDERASN- Python 3.6 ( asn1crypto pyasn1 ). 416 . . , . .
CMS:
CMS (Cryptographic Message Syntax)
// . ,
SignedData , ,
, - EnvelopedData
.

CMS. , CMS detached data, - , . 10 GiB 20 GiB : 10 GiB . .

What ASN.1 objects in practice can be resource intensive, take up a lot of memory space? Only SEQUENCE OF / SET OF containing numerous objects (hundreds of thousands in the case of CACert.org), and all kinds of * STRINGs (as in the second problematic case). Since one of the virtues of a programmer is laziness (according to Larry Wall), we will try to overcome the problems of resource consumption with a minimal change in the library code.

* STRING objects, from the point of view of the codec, almost do not differ from each other and are implemented in the library by one common class. What is encoding in OCTET STRING in DER?

return tag + len_encode(len(value)) + value

Obviously, value does not have to be in RAM all this time. It is required to be a bytes-like object from which you can find out the length.

memoryview satisfies these conditions. And yet memoryview can be done by mmap on any temporary file that is already reduced in the memory 20 GiB 10 GiB requirements for the CMS database copies in half: it is enough to specify as the value of OCTET STRING s (or any other * STRING s) similar to memoryview . To create it, PyDERASN has a helper function:

from pyderasn import file_mmaped
with open("dump.sql.zst", "rb") as fd:
    ... = OctetString(file_mmaped(fd))

If we want to decode a huge amount of data, then memoryview can also be used for decode:

from pyderasn import file_mmaped
with open("dump.sql.zst", "rb") as fd:
    obj = Schema.decode(file_mmaped(fd))

We consider the problem with * STRING to be partially resolved. Back to creating a huge CRL. What is its structure like?

CertificateList SEQUENCE
. tbsCertList: TBSCertList SEQUENCE
. . version: Version INTEGER v2 (01) OPTIONAL
. . signature: AlgorithmIdentifier SEQUENCE
. . issuer: Name CHOICE rdnSequence
. . thisUpdate: Time CHOICE utcTime
. . nextUpdate: Time CHOICE utcTime OPTIONAL
. . revokedCertificates: RevokedCertificates SEQUENCE OF OPTIONAL
. . . 0: RevokedCertificate SEQUENCE
. . . . userCertificate: CertificateSerialNumber INTEGER 17 (11)
. . . . revocationDate: Time CHOICE utcTime UTCTime 2003-04-01T14:25:08
. . . 1: RevokedCertificate SEQUENCE
. . . . userCertificate: CertificateSerialNumber INTEGER 20 (14)
. . . . revocationDate: Time CHOICE utcTime UTCTime 2002-10-01T02:18:01

                                 [...]

. . . 415753: RevokedCertificate SEQUENCE
. . . . userCertificate: CertificateSerialNumber INTEGER 1341859 (14:79:A3)
. . . . revocationDate: Time CHOICE utcTime UTCTime 2020-02-08T06:51:56
. . . 415754: RevokedCertificate SEQUENCE
. . . . userCertificate: CertificateSerialNumber INTEGER 1341860 (14:79:A4)
. . . . revocationDate: Time CHOICE utcTime UTCTime 2020-02-08T06:53:01
. . . 415755: RevokedCertificate SEQUENCE
. . . . userCertificate: CertificateSerialNumber INTEGER 1341861 (14:79:A5)
. . . . revocationDate: Time CHOICE utcTime UTCTime 2020-02-08T07:25:06
. signatureAlgorithm: AlgorithmIdentifier SEQUENCE
. signatureValue: BIT STRING 4096 bits

A long list of small RevokedCertificate structures. In this case, containing only the serial number of the certificate and the time of its revocation. What is its DER coding?

value = b"".join(revCert.encode() for revCert in revCerts)
return tag + len_encode(len(value)) + value

Just concatenation of DER representations of each individual element of this list. Do we need to have in advance all these hundreds of thousands of objects in the course of just being serialized into 15 bytes of DER representation? Obviously not. Therefore, we replace the list of objects with an iterator / generator. Most likely, the creation of the CRL will be based on data from the DBMS:

def revCertsGenerator(...):
    for row in db.cursor:
        yield RevokedCertificate((
            ("userCertificate", CertificateSerialNumber(row.serial)),
            ("revocationDate", Time(("utcTime", UTCTime(row.revdate)))),
        ))

crl["tbsCertList"]["revokedCertificates"] = revCertsGenerator()

Now we almost do not consume memory when creating such a CRL - we only need a place to work with its DER representation: in this case, we are talking about a couple of tens of megabytes (as opposed to a half-gigabyte list of RevokedCertificate objects). But there is a difference in behavior: the SEQUENCE OF / SET OF size check occurs only after the iterator is exhausted, and not at the moment of assigning a value.

DER streaming coding

It is impossible. After all, this is DER - the lengths of all TLV (tag + length + value) elements must be known in advance!

But I would like so much! After all, we still still need 10 GiB of memory to store the DER representation of the database copy: raw = cms.encode () ! Ideally, I want to convey a certain writer where the serialized representation would be written. Maybe transfer the file descriptor, leave placeholders in the file at the place of lengths and then fill them out by doing seek? Unfortunately, the length length (respectively, and placeholder) is also not known in advance.

PyDERASN has received the possibility of two-pass DER encoding. In the first pass, knowledge about the lengths of objects is collected, creating a temporary state. The second is streamingDER encoding into a certain writer , already possible due to the knowledge of lengths. The implementation of this came out simple, just adding a bit of two-pass methods to each of the ASN.1 base types. Since the traversal of objects is strictly determinate (D - distinguished!), Then, to store the necessary lengths, a simple list is maintained, to which, as the entire tree of objects is traversed, length values are added. For some types, the length is fixed. For some, it is needed only for reporting to an upstream container ( SEQUENCE / SET , SEQUENCE OF / SET OF , EXPLICIT TAG ). For example, the length state for a list of two revoked certificates looks like this:

revCert = RevokedCertificate((
    ("userCertificate", CertificateSerialNumber(123)),
    ("revocationDate", Time(("utcTime", UTCTime(datetime.utcnow())))),
))
revs = RevokedCertificates([revCert, revCert])

(42, [40, 18, 18])

In this case, we need to know the length of the value only for each of the RevokedCertificate and the entire list as a whole. The length of integers and encoded time (in DER it is a fixed length) in the state is not stored as unnecessary. As a result, for our CACert.org CRL, such a list takes a little more than 3.5 MiB, and for the giant CMS, in which almost all the weight falls on one single field with a copy of the database, it takes about 0.5 KiB.

Two-pass encoding is performed by two calls:

fulllen, state = obj.encode1st()
with open("result", "wb") as fd:
    obj.encode2nd(fd.write, iter(state))

The first pass also reports the full length of the data, which can be used to check the free space or allocate it by some posix_fallocate call.

Using the helper function encode2pass (obj), you can perform two-pass encoding into memory. What for? It can be significantly more economical in memory consumption, since it will not, in the case of CACert.org CRL, store 416k + small binary lines joined by a b "". Join () call. However, this requires more processor time, because all the objects will have to be walked twice.

Now we can encode an arbitrarily large CMS in DER, practically without consuming memory. But in the case of CRL, we used the certificate revocation generator, which will run out after the end of the first pass. What to do? Just reinitialize it again!

_, state = crl.encode1st()
crl["tbsCertList"]["revokedCertificates"] = revCertsGenerator()
crl.encode2nd(writer, iter(state))

Of course, we are obliged to ensure that the result of the iterator will be exactly the same, otherwise we will get a broken DER. If data is taken from the DBMS cursor, then do not forget about REPEATABLE READ transaction isolation level and sorting.

Real Stream Encoding: CER

As you know, DER (distinguished encoding rules) is a subset of BER (basic encoding rules) that strictly regulates encoding rules in one and only one way. This allows it to be used in cryptographic tasks. But there is another noteworthy subset of BER: CER (canonical encoding rules). Like DER, it has only one possible representation of the data. CER differs from DER in several details, but they allow you to perform truly streaming encoding of data. Unfortunately, CER did not become as popular as DER.

Omitting not so noticeable differences (such as sorting tags in SET), CER has two fundamental differences from DER:

(constructed, ) indefinite (LENINDEF PyDERASN). , SEQUENCE/SET, SEQUENCE OF/SET OF, EXPLICIT TAG- :
```
TAG_CONSTRUCTED || LEN(VALUE) || VALUE
```
LENINDEF (0x80) EOC ( , ) :
```
TAG_CONSTRUCTED || 80 || VALUE || 00 00
```

*STRING-, 1000-, chunk- 1000-. , , DER. , 512 DER:

TAG_PRIMITIVE || LEN(512) || 512B

2048 :

TAG_CONSTRUCTED || 80 ||
    TAG_PRIMITIVE || LEN(1000) || 1000B ||
    TAG_PRIMITIVE || LEN(1000) || 1000B ||
    TAG_PRIMITIVE || LEN(48) || 48B || 00 00

All this (plus a few little things) allows streaming (with 1000 bytes a buffer for strings) to encode any objects. Similarly, you can use mmap and iterators. CER encoding is done simply by calling the .encode_cer (writer) method. Unfortunately, PyDERASN is not yet able to verify the validity of CER during decoding, so we are forced to decode the data as BER.

The CMS standard, by the way, requires coding in BER (both DER and CER, automatic, are BER). Therefore, we can encode our huge copy of the database into CER CMS without a two-pass DER. However , SignedData is required to have a SignedAttributes element encoded in DER, as is the X.509 Certificatecertificates. PyDERASN allows you to force the use of DER in given structures by simply adding the der_forced = True attribute.

Stream decoding: evgen mode

We learned to encode, but only mmap will help with decoding . A “real” stream decoder, which has control knobs in the form of “give me more data”, “here you are”, some kind of state - would require a radical alteration of PyDERASN. And, personally, I don’t see it would be more convenient than the current solution.

And the current solution is extremely simple. In the process of decoding, we have in our hands various decoded primitive objects in our hands, from them are assembled superior components (constructed), of which other components, etc. ... We accumulate objects to make them composite and, reaching the very top give us one big object. Why not immediately “give out” all kinds of decodedobjects, as soon as they appear on our hands? That is, to return not a final object, but a generator that generates many decoded objects. In fact, in PyDERASN now all decoding methods have become generators of such "events" (event generation, evgen).

If we enable the evgen-decoding mode of our huge CACert.org CRL, we will see the following picture:

$ python -m pyderasn --schema tests.test_crl:CertificateList --evgen revoke.crl
[][T,L,  V len]
     10   [1,1,      1]   . . version: Version INTEGER v2 (01) OPTIONAL
     15   [1,1,      9]   . . . algorithm: OBJECT IDENTIFIER 1.2.840.113549.1.1.13
     26   [0,0,      2]   . . . parameters: [UNIV 5] ANY OPTIONAL
     13   [1,1,     13]   . . signature: AlgorithmIdentifier SEQUENCE
     34   [1,1,      3]   . . . . . . type: AttributeType OBJECT IDENTIFIER 2.5.4.10
     39   [0,0,      9]   . . . . . . value: [UNIV 19] AttributeValue ANY
     32   [1,1,     14]   . . . . . 0: AttributeTypeAndValue SEQUENCE
     30   [1,1,     16]   . . . . 0: RelativeDistinguishedName SET OF

                                 [...]

    188   [1,1,      1]   . . . . userCertificate: CertificateSerialNumber INTEGER 17 (11)
    191   [1,1,     13]   . . . . . utcTime: UTCTime UTCTime 2003-04-01T14:25:08
    191   [0,0,     15]   . . . . revocationDate: Time CHOICE utcTime
    191   [1,1,     13]   . . . . . utcTime: UTCTime UTCTime 2003-04-01T14:25:08
    186   [1,1,     18]   . . . 0: RevokedCertificate SEQUENCE
    208   [1,1,      1]   . . . . userCertificate: CertificateSerialNumber INTEGER 20 (14)
    211   [1,1,     13]   . . . . . utcTime: UTCTime UTCTime 2002-10-01T02:18:01
    211   [0,0,     15]   . . . . revocationDate: Time CHOICE utcTime
    206   [1,1,     18]   . . . 1: RevokedCertificate SEQUENCE

                                 [...]

9144992   [0,0,     15]   . . . . revocationDate: Time CHOICE utcTime
9144985   [1,1,     20]   . . . 415755: RevokedCertificate SEQUENCE
    181   [1,4,9144821]   . . revokedCertificates: RevokedCertificates SEQUENCE OF OPTIONAL
      5   [1,4,9144997]   . tbsCertList: TBSCertList SEQUENCE
9145009   [1,1,      9]   . . algorithm: OBJECT IDENTIFIER 1.2.840.113549.1.1.13
9145020   [0,0,      2]   . . parameters: [UNIV 5] ANY OPTIONAL
9145007   [1,1,     13]   . signatureAlgorithm: AlgorithmIdentifier SEQUENCE
9145022   [1,3,    513]   . signatureValue: BIT STRING 4096 bits
      0   [1,4,9145534]  CertificateList SEQUENCE

At the beginning of decoding, we saw the CertificateList SEQUENCE tag, the data length, but the object is not yet known whether it can be decoded to the end. So far we are only in the process of working on it.
SEQUENCE: version, INTEGER. . , . ( 10)
signature, SEQUENCE- : algorithm parameters. OBJECT IDENTIFIER ANY . ( 15, 26)
, , signature SEQUENCE , , : . ( 13)
, RevokedCertificate . ( 186, 206, ..)
tbsCertList . ( 5)
CertificateList, SEQUENCE, , , . ( 0)

Of course, all * STRING and lists ( * OF ) do not carry real meaning. In the case of DER, knowing .offset and .vlen, you can read the value of a line from a file (a piece of memory?). Sequence objects can be collected and aggregated as you need, while receiving all events.

Decoded path and evgen_mode_upto

How to understand what kind of object, what kind of INTEGER we have on hand? Each object has its own so-called decode path, which uniquely identifies a specific object in the structure. For example, for CACert.org CRL decode path events:

tbsCertList:version
tbsCertList:signature:algorithm
tbsCertList:signature:parameters
tbsCertList:signature
tbsCertList:issuer:rdnSequence:0:0:type
                                 [...]
tbsCertList:issuer:rdnSequence
tbsCertList:issuer
                                 [...]
tbsCertList:revokedCertificates:0:userCertificate
tbsCertList:revokedCertificates:0:revocationDate:utcTime
tbsCertList:revokedCertificates:0:revocationDate
tbsCertList:revokedCertificates:0
tbsCertList:revokedCertificates:1:userCertificate
tbsCertList:revokedCertificates:1:revocationDate:utcTime
tbsCertList:revokedCertificates:1:revocationDate
tbsCertList:revokedCertificates:1
                                 [...]
tbsCertList:revokedCertificates:415755:userCertificate
tbsCertList:revokedCertificates:415755:revocationDate:utcTime
tbsCertList:revokedCertificates:415755:revocationDate
tbsCertList:revokedCertificates:415755
tbsCertList:revokedCertificates
tbsCertList
signatureAlgorithm:algorithm
signatureAlgorithm:parameters
signatureAlgorithm
signatureValue

This is how we can print the list of certificate revocation serial numbers from this CRL:

raw = file_mmaped(open("....crl", "rb"))
for decode_path, obj, tail in CertificateList().decode_evgen(raw):
    if (len(decode_path) == 5) and (decode_path[-1] == "userCertificate"):
        print(int(obj))

The RevokedCertificate structure can contain a lot of information, including various extensions. Purely technically, we get all the data about the revoked certificate in evgen mode, but aggregating events related to one revokedCertificates element is not very convenient. Since each final RevokedCertificate , in practice, will not take up much space, it would be great to have it all the same, not all objects are so thoroughly “sorted” into events. PyDERASN allows the list to specify decoding paths at which the evgen mode is disabled. So we can set the decode path (any element from the ("tbsCertList", "revokedCertificates") list) on which we want to get the full RevokedCertificate object:

for decode_path, obj, _ in CertificateList().decode_evgen(raw, ctx={
    "evgen_mode_upto": (
        (("tbsCertList", "revokedCertificates", any), True),
    ),
}):
    if (len(decode_path) == 3) and (decode_path[1] == "revokedCertificates"):
        print(int(obj["userCertificate"]))

Aggregate strings: agg_octet_string

Now we have no problems decoding objects of any size. There is no problem decoding a DER-encoded CMS with a copy of the database either: we are waiting for an event with a decode path pointing to the signed / encrypted data in the CMS and processing the data from the file using offset + vlen. But what if the CMS was in CER format? Then offset + vlen will not help, since all our 10 GiBs are divided into pieces of 1000 bytes, between which by the DER header. But what if we have a BER in which the nesting of * STRINGs can be anything?

SOME STRING[CONSTRUCTED]
    OCTET STRING[CONSTRUCTED]
        OCTET STRING[PRIMITIVE]
            DATA CHUNK
        OCTET STRING[PRIMITIVE]
            DATA CHUNK
        OCTET STRING[PRIMITIVE]
            DATA CHUNK
    OCTET STRING[PRIMITIVE]
        DATA CHUNK
    OCTET STRING[CONSTRUCTED]
        OCTET STRING[PRIMITIVE]
            DATA CHUNK
        OCTET STRING[PRIMITIVE]
            DATA CHUNK
    OCTET STRING[CONSTRUCTED]
        OCTET STRING[CONSTRUCTED]
            OCTET STRING[PRIMITIVE]
                DATA CHUNK

When decoding in evgen mode, for each piece we get the corresponding event and it is enough for us to collect only primitive (not constructed) * STRING events, where offset + vlen contain real pieces of data. PyDERASN has a convenient helper function agg_octet_string that performs this. It is enough for her to pass an event generator, the decode path “below” of which it is necessary to aggregate a string, binary data (or memoryview ) and writer - a function that is called with each piece of data received. We want to calculate the SHA512 hash and at the same time save the contents of encapContentInfo.eContent CMS? Find the location of the content field (in which SignedData will be located), and then decode its CER contents, simultaneously writing to the FS and hashing:

fdIn = open("data.p7m", "rb")
raw = file_mmaped(fdIn)
for decode_path, obj, _ in ContentInfo().decode_evgen(raw, ctx={"bered": True}):
    if decode_path == ("content",):
        content = obj
        break
hasher_state = sha512()
fdOut = open("dump.sql.zst", "wb")
def hash_n_save(data):
    write_full(fdOut, data)
    hasher_state.update(data)
    return len(data)
evgens = SignedData().decode_evgen(
    raw[content.offset:],
    offset=content.offset,
    ctx={"bered": True},
)
agg_octet_string(evgens, ("encapContentInfo", "eContent"), raw, hash_n_save)

Here, another write_full utility is used , which calls writer until all data is written, since, in general, when writing to a file, the OS is not required to process all transferred data and you need to continue the process until it is completely written.

A couple of affectionate about SET OF

Purely technically, SET OF cannot be encoded on the fly in DER and CER encodings, since encoded representations of all elements must be sorted. Neither CER nor a two-pass DER will help here. The modern ASN.1 standard therefore does not recommend the use of both SET (requiring similar sorting in DER) and SET OF .

And what is this picture at the beginning of the article?

It was in PyDERASN that an interactive, unpretentious ASN.1 browser appeared , which personally helped me several times to write handlers for complex structures. Allows you to walk around the entire decoded structure, showing full detail about each object, its location in binary data, decode path. In addition, any item can be saved in a separate file, such as certificates or CRLs included in the CMS.

Sergey Matveev , cipherpunk , Python / Go-developer, chief specialist of FSUE “Scientific and Technical Center“ Atlas ”.

PyDERASN: as I added big-data support