Olfeo OEM documentation

Next Gen file format

File format

Our Next Gen file format is based on Protobuf.

Each internal file contains the binary representation of a Protobuf message.

  • Naming convention: <content-type>-<index>.proto

  • Generated files are delivered as .tar archives.

This structure standardizes the delivery process and ensures consistent alignment across all export formats.

File structure

Every content type is linked to a Protobuf message that defines its structure, so the file name tells you which message to use to decode the file's contents.

The mapping between content-type (file name) and Protobuf message (file contents) is provided in the following table:

Table 2. Content-type to Protobuf message mapping

content-type              Protobuf message
url-themes                olfeo.database.v1.ExportedDomainThemeInfo
url-categories            olfeo.database.v1.ExportedDomainCategoryInfo
url-categories            olfeo.database.v1.ExportedDomainInfo
applications              olfeo.database.v1.ExportedApplicationInfo
application-categories    olfeo.database.v1.ExportedApplicationCategoryInfo
application-domains       olfeo.database.v1.ExportedApplicationDomainInfo
application-ip-ranges     olfeo.database.v1.ExportedApplicationIpRangeInfo
application-logos         olfeo.database.v1.ExportedLogoData



General structure of the Protobuf messages

Each message described above shares the same structure.

The first element contains metadata describing the message's content. This includes the file's creation timestamp, as well as the timestamps of the first and most recent updates to the elements that follow.

The second element is repeated as needed; each occurrence represents one unique entry of the database. Each entry, defined as a sub-message called Entry, specifies the item it refers to, followed by its associated data.

For instance, the entries of olfeo.database.v1.ExportedApplicationIpRangeInfo contain three fields:

  • application_id, the unique ID of the application associated with this IP range.

  • ip_range_id, the unique ID of this IP range.

  • The information specific to this IP range, namely FirstIp and LastIp.

The metadata messages are defined in the common.proto file included with this documentation.
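To make the shape described above concrete, here is a minimal sketch that mirrors it with plain Python dataclasses. In practice you would presumably work with classes generated from the provided Protobuf schema; the field names and types below (`created_at`, `first_update`, `last_update`, `first_ip`, `last_ip`, integer IDs) are assumptions chosen for illustration, and the real definitions are the ones in the schema files.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Metadata:
    # Hypothetical field names; the real ones are defined in common.proto.
    created_at: int    # file creation timestamp
    first_update: int  # timestamp of the first update among the entries
    last_update: int   # timestamp of the most recent update among the entries


@dataclass
class IpRangeEntry:
    application_id: int  # ID type is an assumption
    ip_range_id: int
    first_ip: str        # stands in for FirstIp
    last_ip: str         # stands in for LastIp


@dataclass
class IpRangeExport:
    """Mirrors the two-part layout: one metadata element, then repeated entries."""
    metadata: Metadata
    entries: List[IpRangeEntry] = field(default_factory=list)


def ranges_by_application(message: IpRangeExport) -> Dict[int, List[IpRangeEntry]]:
    """Group the repeated entries by the application they refer to."""
    grouped: Dict[int, List[IpRangeEntry]] = {}
    for entry in message.entries:
        grouped.setdefault(entry.application_id, []).append(entry)
    return grouped


if __name__ == "__main__":
    export = IpRangeExport(
        metadata=Metadata(created_at=0, first_update=0, last_update=0),
        entries=[
            IpRangeEntry(1, 10, "192.0.2.0", "192.0.2.255"),
            IpRangeEntry(1, 11, "198.51.100.0", "198.51.100.255"),
            IpRangeEntry(2, 12, "203.0.113.0", "203.0.113.255"),
        ],
    )
    print({app: len(ranges) for app, ranges in ranges_by_application(export).items()})
```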

Categorizing domains using Olfeo OEM

Getting information on an entry in Olfeo OEM requires:

  1. knowing how to hash the value you'll be looking for

  2. implementing some logic to retrieve the "right" entry

Beware: Domain field is hashed!

For security purposes, the domain field inside each Entry sub-message contains the hexadecimal digest of the hash of the actual domain name. This means that the stored value is not the plain FQDN itself but the result of hashing the domain using a specific algorithm and key.

This applies to the following message types: olfeo.database.v1.ExportedDomainInfo and olfeo.database.v1.ExportedApplicationDomainInfo.

The default algorithm is xx64, which offers a good balance between compute cost, security, and runtime speed.

Obtaining the hashing algorithm and key

The hashing algorithm and key depend on the dataset you are working with (sample, production, legacy...).

Use the GetHashInfo endpoint of Olfeo OEM's Update API to get the hashing algorithm and the hashing key specific to your files. These values are stable and should not change over time.

However, it is a good idea to have your product check the hashing algorithm and key periodically and update them if they change.

NB: you will need to retrieve an API Key first in order to use this endpoint.

Hashing domains

Use the appropriate hashing algorithm and key to hash the domain you want information about. For instance, if you are using our sample file and want to get the category for the domain linuxcommand.org, you can do the following:

# The xxhash library provides an implementation of the xx64 hashing algorithm
import xxhash

KEY = "14dacd9b-5006-42f5-a8d6-825240a90b56"  # Static key shared by all samples
DOMAIN = "linuxcommand.org"

# xx64 case: the key comes first, then the domain
digest = xxhash.xxh64_hexdigest((KEY + DOMAIN).encode("utf-8"))
assert digest == "9f688147060c5cab"

Our sample data always uses the xx64 hashing algorithm; the hashing key is 14dacd9b-5006-42f5-a8d6-825240a90b56.

Specific cases using the sha1 hashing algorithm

For historical reasons, the order of KEY and DOMAIN is reversed when using sha1: the domain comes first in the hashed string, followed by the key.

This can be useful for customers migrating from legacy file format to Next Gen.
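The reversed order for sha1 can be sketched with Python's standard hashlib module. This assumes that, as in the xx64 case, the two values are concatenated without a separator, encoded as UTF-8, and stored as a hexadecimal digest; `sha1_domain_digest` is a hypothetical helper name.

```python
import hashlib


def sha1_domain_digest(key: str, domain: str) -> str:
    """Hash DOMAIN followed by KEY (the reverse of the xx64 order)."""
    # Assumption: plain concatenation, UTF-8 encoding, hexadecimal output.
    return hashlib.sha1((domain + key).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # The key shown here is the sample key; your dataset may use another one.
    print(sha1_domain_digest("14dacd9b-5006-42f5-a8d6-825240a90b56", "linuxcommand.org"))
```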

To keep the database manageable given the large number of domain names it contains, Olfeo OEM uses a hierarchical approach in which subdomains can inherit the category of their parent domain. In some cases this inheritance is not correct, so the Olfeo database provides a way to return more accurate results in those cases.

The general algorithm to find the category of a particular domain is to search the database starting from the full domain name and moving toward the top-level domain, stopping at the first match.

Example 1. Categorizing jira-frontend-bifrost.prod-east.frontend.public.atl-paas.net

For instance, if you want to know the category of this example's domain, you would perform the following lookups, in order:

  • jira-frontend-bifrost.prod-east.frontend.public.atl-paas.net

  • prod-east.frontend.public.atl-paas.net

  • frontend.public.atl-paas.net

  • public.atl-paas.net

  • atl-paas.net

This process continues until a match is found.

Note: when a match occurs, you must then check the prefix field of the matching entry. If the value of this field is DOMAIN_PREFIX_STATUS_PREFIX, then the category of that domain does not apply to its subdomains. If you were looking for this domain specifically, then the entry gives its category. However, if you were looking for one of its subdomains, the category does not apply and you should consider that subdomain as not categorized by the Olfeo database.

This mechanism exists to support domains that host a very broad range of content, such as blog hosting platforms, where the content of each subdomain cannot be inferred from its hosting platform.
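The lookup order and the prefix check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the index is a plain dictionary keyed by hashed domain, `DomainEntry` and its `category` field are hypothetical stand-ins for the real Entry sub-message, the prefix status is compared as a string, and lowercasing plus stripping a trailing dot is an assumed normalization. The hashing function is passed in so that any algorithm and key can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

DOMAIN_PREFIX_STATUS_PREFIX = "DOMAIN_PREFIX_STATUS_PREFIX"


@dataclass
class DomainEntry:
    category: str     # hypothetical field name
    prefix: str = ""  # any value other than the constant above means "inheritable"


def candidate_domains(fqdn: str) -> List[str]:
    """Return the names to look up, from the full name down to the top-level domain."""
    labels = fqdn.lower().rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def categorize(
    fqdn: str,
    index: Dict[str, DomainEntry],
    hash_domain: Callable[[str], str],
) -> Optional[str]:
    """Return the category of fqdn, or None if it is not categorized."""
    candidates = candidate_domains(fqdn)
    for candidate in candidates:
        entry = index.get(hash_domain(candidate))
        if entry is None:
            continue  # no match yet: try the parent domain
        is_exact = candidate == candidates[0]
        if not is_exact and entry.prefix == DOMAIN_PREFIX_STATUS_PREFIX:
            # The first match is a parent whose category must not be inherited.
            return None
        return entry.category
    return None


if __name__ == "__main__":
    def identity(domain: str) -> str:
        return domain  # stand-in for the real keyed hash

    demo_index = {
        "atl-paas.net": DomainEntry("it"),
        "blogs.example": DomainEntry("blog", DOMAIN_PREFIX_STATUS_PREFIX),
    }
    print(categorize("frontend.public.atl-paas.net", demo_index, identity))
    print(categorize("alice.blogs.example", demo_index, identity))
```

Passing the hash function as a parameter keeps the traversal independent of the algorithm and key in use for a given dataset.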

Retrieving the actual Olfeo OEM files

Next Gen format files must be downloaded through your Olfeo OEM customer portal. Go to the "My Files" page to display the files generated for your account and download them.

Each entry corresponds to a specific generation request tailored to your agreement and includes all related metadata.

Available actions

On this page, you can:

  • Download the generated files directly,

  • or copy the Client download link and use it in a script to automate the process.
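As a starting point for automating the download, the sketch below fetches a link and writes the response to disk using only the Python standard library. It assumes the Client download link can be retrieved with a plain request and no additional headers, which may not match your account setup; `download_archive` is a hypothetical helper name.

```python
import shutil
import urllib.request
from pathlib import Path


def download_archive(link: str, destination: Path) -> Path:
    """Stream the content behind link into destination and return the path."""
    # Assumption: the link needs no extra authentication headers.
    with urllib.request.urlopen(link) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

A call would look like `download_archive(client_download_link, Path("export.tar"))`, where `client_download_link` is the value copied from the portal.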

Available Resources

Access via portal

The following resources can only be accessed and downloaded through your Olfeo OEM customer portal.

The top section of the My Files page provides links to related reference materials:

  • Protobuf sample (.tar.gz): example archive in Protobuf format.

  • Protobuf schema (.tar.gz): describes the structure of Protobuf messages.

  • Protobuf Format Documentation (PDF): the document explains how files are organized, how Protobuf messages are structured, and how to manage and classify domains with the Olfeo database.