Botto's Art Engine
Botto is a fully autonomous artist with a closed-loop process and outputs that are unaltered by human hands. This page explains how its art engine works.
Botto combines several software models, including Stable Diffusion, VQGAN + CLIP, and GPT-3, with community voting and a number of other models and custom augmentations. The generative models are among the largest publicly available neural network architectures and have analyzed millions of works of art, faces, animals, objects, images, artistic movements, poems, prose, essays, etc. They have been trained on more content than any human being could study in a lifetime.
These models give Botto the largest possible latent space to work with, and therefore the widest variation of styles and themes, without locking it into a single area.
The machine creates its images based on text prompts generated by an algorithm. These prompts are a combination of random words and full sentences. The prompt is then sent to Stable Diffusion and VQGAN + CLIP, two generative models that Botto uses.
There are an infinite number of possible prompts and possible images. These models bridge textual and visual information, and can even be "empathic" and know what kind of emotional associations humans have in connection with imagery or text. The DAO may also decide to add a theme that Botto has proposed, in which case the theme is injected into every prompt.
Given all the different possible outputs, Botto needs direction to develop its artistic talent. That is where voting comes in: Botto will adjust its prompts based on what it thinks will be more likely to get popular results.
This process runs through 300 prompts a day, generating images in a range of styles. From that set, the engine uses a “taste model” to pre-select 350 images each week, which are presented to the community for voting in a new round. Rounds start every Tuesday at 22:00 CET / 16:00 EST / 13:00 PST.
To avoid settling into a niche too quickly, Botto is also directed to surprise and challenge the audience by selecting a number of images for voting that have different characteristics from what has been presented to date.
The Paradox Period started on February 21st with a voting pool of 1050 fragments, using the same ratio of VQGAN + CLIP to Stable Diffusion fragments as when the Fragmentation Period ended. Each week, the lowest-scoring fragments by VP are culled to maintain this pool size, and the ratio between VQGAN + CLIP and Stable Diffusion among the 350 new fragments is rebalanced in proportion to the number of unique votes.
If a generative model’s share of the taste model’s selection drops below 10% in a round and does not rise back above 10% in voting, it will be discontinued. The community can decide to re-add the model if it wishes, as well as reconsider the automatic threshold in future periods.
The 10% threshold will need to be reconsidered as more models are added.
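As a rough illustration of the weekly rebalancing and the 10% rule described above, the sketch below splits the 350 new fragments between models in proportion to their unique votes and flags a model for discontinuation. The function names, the example vote counts, and the largest-remainder rounding are assumptions made for illustration; the text does not say how Botto rounds the split or implements the check.

```python
from math import floor

NEW_FRAGMENTS_PER_ROUND = 350  # from the text: 350 new fragments each week
DISCONTINUE_THRESHOLD = 0.10   # from the text: the 10% rule


def rebalance(unique_votes, total=NEW_FRAGMENTS_PER_ROUND):
    """Split `total` slots across models in proportion to unique votes.

    Largest-remainder rounding is an assumption; it only guarantees
    that the integer counts add up to `total`.
    """
    vote_sum = sum(unique_votes.values())
    if vote_sum == 0:
        raise ValueError("no votes recorded for any model")
    exact = {m: total * v / vote_sum for m, v in unique_votes.items()}
    slots = {m: floor(x) for m, x in exact.items()}
    leftover = total - sum(slots.values())
    # Hand the remaining slots to the models with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda m: exact[m] - slots[m], reverse=True)
    for m in by_remainder[:leftover]:
        slots[m] += 1
    return slots


def should_discontinue(selection_share, voting_share,
                       threshold=DISCONTINUE_THRESHOLD):
    """One reading of the rule (an assumption): the model fell below the
    threshold in the taste model's selection and did not climb back
    above it in voting."""
    return selection_share < threshold and voting_share <= threshold


# Hypothetical vote counts, not real Botto data.
split = rebalance({"stable_diffusion": 700, "vqgan_clip": 300})
```

With these made-up counts, a 70/30 split in unique votes carries over to a 70/30 split of the 350 new fragments.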
Occasionally, parallel voting pools are set up for collaborations.
Botto uses voting feedback in two places: (1) curating the text prompts used to generate fragments, and (2) the taste model that pre-selects images for voting each week.
- 1. Text Prompts: Votes influence which aspects of text prompts are used to generate fragments. Characteristics of prompts that generate desirable images will be more likely to get reused, and vice versa.
- 2. Taste Model: The taste model used for pre-selection tries to replicate the voting behavior of the community. This is not a yes/no decision, but a gradient of probabilities such that each set has images with different chances of getting picked in voting (as voting behavior is a gradient as well).
For both points, all the votes on all the images are important and get used. The training of Botto is designed to not allow for an overly skewed voting weight. For example, 500 votes each cast by separate voters for one piece will have more weight in the training than 2000 votes by a single voter for the same piece. Other factors, like being the winner or the sale amount, are not currently used in the training.
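The text does not state the formula Botto uses to limit skewed voting weight. One common way to get the behavior in the example above is to dampen each voter's contribution with a concave function such as a square root; the sketch below assumes that approach, and the function name and voter IDs are hypothetical.

```python
from math import sqrt


def piece_weight(votes_by_voter):
    """Training weight of one piece under square-root damping.

    Assumption: each voter contributes sqrt(their vote count), so many
    small voters outweigh one large voter. This is not Botto's
    documented formula.
    """
    return sum(sqrt(n) for n in votes_by_voter.values() if n > 0)


# The two cases from the text, with made-up voter IDs.
many_voters = {f"voter_{i}": 1 for i in range(500)}  # 500 votes, 500 voters
single_voter = {"voter_a": 2000}                      # 2000 votes, 1 voter

weight_many = piece_weight(many_voters)
weight_single = piece_weight(single_voter)
```

Under this assumption the 500 separate voters yield a weight of 500, while the single voter's 2000 votes yield roughly 44.7, which matches the ordering described in the text.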
Botto will freely choose between 1:1, 9:16, and 16:9 aspect ratio artworks provided each model allows for it, and will adjust its selection of format based on voting.
Titles are created by an algorithm that generates random combinations of 2-4 words and passes them to CLIP to determine whether they match the image. New titles are generated until CLIP finds the combination that best matches the image. Each candidate is also checked against a list of existing titles so that Botto does not reuse one.
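The title search can be sketched as a generate-and-score loop. In the sketch below, `score` stands in for the CLIP image-text match and is passed in as a plain function; the vocabulary, the fixed number of attempts, and all names are assumptions for illustration, not Botto's actual code.

```python
import random
from typing import Callable, Iterable, List, Optional


def generate_title(
    vocabulary: List[str],
    score: Callable[[str], float],
    used_titles: Iterable[str],
    attempts: int = 200,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the best-scoring unused title made of 2-4 random words.

    `score` is a placeholder for a CLIP similarity between the title
    and the image. `vocabulary` must contain at least 4 words.
    """
    rng = rng or random.Random()
    used = {t.lower() for t in used_titles}
    best_title, best_score = None, float("-inf")
    for _ in range(attempts):
        words = rng.sample(vocabulary, k=rng.randint(2, 4))
        candidate = " ".join(words)
        if candidate.lower() in used:
            continue  # never reuse an existing title
        s = score(candidate)
        if s > best_score:
            best_title, best_score = candidate, s
    if best_title is None:
        raise RuntimeError("no unused title found")
    return best_title


# Stand-in scorer (title length), used only to make the sketch runnable.
words = ["silent", "orbit", "glass", "ember", "tide"]
first = generate_title(words, score=len, used_titles=[], rng=random.Random(0))
second = generate_title(words, score=len, used_titles=[first], rng=random.Random(0))
```

Passing the scorer in as a function keeps the sketch independent of any particular CLIP implementation; a real scorer would compare text and image embeddings.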
The descriptions are generated with GPT-3 and are the only part of the process that involves some direct human curation. As GPT-3 was trained on much of the internet, its language can be quite foul at times and is not ready to be out in the world without some supervision.
Until trustworthy text generation methods are developed, the DAO will pick from a series of 5-10 descriptions generated by GPT-3 that CLIP likes, choosing the one it feels best fits Botto’s voice. Beyond selecting the description, there is absolutely no editing other than correcting typos and punctuation. This final selection could eventually be passed along to voters.
The final title, description, metadata, and URL to the bitmap on IPFS are all on-chain and minted using Manifold.
One of the rules for Botto is that there be no direct human interference in the creation process. Botto is strictly against any “cheating” or human guidance other than the voting. That means the prompts are random, no existing real-world images are used as seed images, and the selection of fragments is entirely controlled by Botto.
The only direction Botto got at the outset came from adding a small number of pre-curated prompts to the entirely random ones generated by the algorithm. While providing more direct human guidance would generate more coherent compositions at the outset, this wouldn’t allow Botto to play in all the latent space available in Stable Diffusion and VQGAN + CLIP.
The one temporary exception is the curation of the artist description for the final piece (see Generating Titles and Artist Descriptions).
Quasimondo (aka Mario Klingemann) designed Botto based on a whitepaper he wrote back in 2018. He is the only person who works with the AI part of the code and enforces the rule for Botto that there be no direct human interference with the creations. As such, his only work is to adjust the way votes are implemented to ensure Botto is learning as well as possible. Quasimondo is also responsible for adding new capabilities, if and when that happens.
Anyone can propose adding a model to, or removing one from, Botto’s set. Some suggested (but not strictly required) guidelines are:
- The model is large enough that it is not a highly human-curated model that would violate Botto’s agency
- Fees are affordable for the DAO treasury
- Not adding more than one model at a time to Botto’s core process
Proposing to add/remove a model works like any other BIP proposal a Botto member can make.
The DAO could also decide to add a new model before the scheduled end of a period, cutting it short and starting a new period by default.
Themes will be generated by asking Botto via GPT-3 to propose a set of 10 themes. The DAO will then vote on the themes proposed by Botto using the same interface for voting on mint descriptions. The selected theme will be added by default to the prompts generated for that period, adding the theme verbatim at the end of the prompt.
The community may also decide to provide no theme for a period, for instance when a new model is being added and there is a desire to see its full range before narrowing in on a theme.
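The theme handling described above amounts to appending the selected theme, unchanged, to the end of each generated prompt, or leaving the prompt alone when no theme is set. A minimal sketch follows; the function name, the comma separator, and the example prompt are assumptions, since the text only says the theme is added verbatim at the end.

```python
from typing import Optional


def apply_theme(prompt: str, theme: Optional[str], separator: str = ", ") -> str:
    """Append the period's theme verbatim to the end of a prompt.

    If the DAO chose no theme for the period, the prompt is returned
    unchanged. The separator is an assumption.
    """
    if not theme:
        return prompt
    return f"{prompt}{separator}{theme}"


# Hypothetical prompt and theme, for illustration only.
themed = apply_theme("a quiet harbor at dusk", "paradox")
unthemed = apply_theme("a quiet harbor at dusk", None)
```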
Themes will be presented and voted on in the last week of a period unless the DAO decides otherwise. The DAO could also decide to cut a period short and go on to a different theme.