AI-Generated Photoshoots for Real Estate: Training AI on Content Hub Assets

Sergey Yatsenko
Sitecore Technology MVP & Sr. Director

## Introduction

Recent advancements in image-to-image AI models have opened new possibilities in digital asset management and content generation. Having worked with Sitecore Content Hub across numerous enterprise implementations, I've discovered an interesting application: using our extensive library of professional photography to train specialized AI models for generating new property images.

Content Hub's robust asset management capabilities make it an ideal platform for both sourcing training data and managing AI-generated content. Through years of implementations, we've accumulated a substantial collection of high-quality property photographs, complete with detailed metadata. This presented an opportunity to explore how modern AI could help solve a common challenge in real estate marketing.

## The Challenge

In real estate development, presenting yet-to-be-built properties requires compelling visual content that accurately represents the final product. Traditional approaches often involve basic architectural renders or reference photos, which may not fully capture the intended aesthetic. Our task was to generate photorealistic images of future properties using a combination of architectural renders, sketches, and text descriptions, while maintaining the professional quality and distinctive style present in our existing photography.

The complexity lies in translating architectural specifications into images that match the caliber of professional photography. Each new property visualization needs to reflect specific design elements, materials, and finishes while maintaining visual consistency with our existing portfolio. This requires an AI model that understands not just general architectural features, but also our unique brand aesthetic captured in thousands of previous property photos.

The core concept is performing style transfer from our curated Content Hub assets to newly generated images using the Flux image-to-image model.
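
To make the idea concrete, here is a minimal sketch of that style-transfer step, assuming the Hugging Face diffusers library's Flux image-to-image pipeline and a LoRA file produced by the training described later in this post; the paths, prompt, and strength value are illustrative rather than exact production settings.

```python
# Minimal sketch: turning an architectural render into a photorealistic image with
# FLUX.1 image-to-image plus a custom LoRA trained on Content Hub assets.
# Assumes the Hugging Face `diffusers` library; paths and prompt are illustrative.
import torch
from diffusers import FluxImg2ImgPipeline
from diffusers.utils import load_image

pipe = FluxImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("output/mh_flux_lora_v7.safetensors")  # hypothetical path to the trained LoRA
pipe.enable_model_cpu_offload()  # or pipe.to("cuda") if VRAM allows

render = load_image("renders/unit_12_living_room.png")  # architectural render (illustrative path)
image = pipe(
    prompt="photorealistic living room, natural light, wide-plank oak flooring, matte black fixtures",
    image=render,
    strength=0.6,            # lower values preserve more of the render's geometry
    guidance_scale=3.5,
    num_inference_steps=28,
).images[0]
image.save("generated/unit_12_living_room.jpg")
```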

## Technical Architecture

The solution leverages several cutting-edge technologies:

  1. Base Model: After evaluating various options, including Stable Diffusion 3.5, I selected FLUX.1 from Black Forest Labs as our foundation. Their image-to-image model demonstrated superior photorealism and high-resolution output, both crucial for architectural visualization.
  2. Custom LoRA Models: Through experimentation, I developed separate LoRA models for different contexts (interior and exterior shots); a short selection sketch follows this list. The training process required careful balancing of dataset size and training parameters, and finding the optimal learning rate, batch size, and number of training steps proved to be one of the more challenging aspects of the implementation.
  3. Infrastructure Stack:
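
As a rough illustration of how those context-specific LoRAs might be applied at generation time, here is a minimal sketch assuming the Hugging Face diffusers FluxPipeline; the LoRA file names are hypothetical placeholders rather than our actual trained weights.

```python
# Minimal sketch: applying a context-specific LoRA (interior vs. exterior) at
# generation time. Assumes the Hugging Face diffusers FluxPipeline; the LoRA
# file names below are hypothetical placeholders.
import torch
from diffusers import FluxPipeline

LORA_BY_CONTEXT = {
    "interior": "output/mh_flux_lora_interior.safetensors",
    "exterior": "output/mh_flux_lora_exterior.safetensors",
}

context = "exterior"  # chosen per request

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights(LORA_BY_CONTEXT[context])
pipe.enable_model_cpu_offload()  # or pipe.to("cuda") if VRAM allows

image = pipe(
    "craftsman-style front elevation at dusk, warm porch lighting, landscaped walkway",
    guidance_scale=3.5,
    num_inference_steps=28,
    width=1024,
    height=1024,
).images[0]
image.save(f"generated/{context}_sample.jpg")
```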

## Technical Implementation

The core of our solution involves fine-tuning LoRA models for FLUX.1. I chose the open-source ai-toolkit framework as a foundation, adapting it to our specific requirements.

Here's a glimpse into our training pipeline:

  1. Data Preparation: Quality metadata is crucial for training. I leveraged OpenAI's vision API to generate detailed descriptions, focusing on prompts that would help the image model render new photorealistic images closely resembling our existing portfolio. I use the same approach I described in another blog post, AI-Enhanced Interior Photography: Leveraging Content Hub for Image Generation.
  2. Model Training: here's my configuration file (refer to the README in the ai-toolkit documentation for more details on each option):

```yaml
---
job: extension
config:
  # this name will be used for the output folder and file names
  name: "mh_flux_lora_v7"
  process:
    - type: "sd_trainer"
      # root folder to save training sessions/samples/weights
      training_folder: "output"
      # uncomment to see performance stats in the terminal every N steps
      #      performance_log_every: 1000
      device: cuda:0
      # if a trigger word is specified, it will be added to captions of training data if it does not already exist
      # alternatively, in your captions you can add [trigger] and it will be replaced with the trigger word
      # trigger_word: "meritage"
      network:
        type: "lora"
        linear: 16
        linear_alpha: 16
      save:
        dtype: float16 # precision to save
        save_every: 250 # save every this many steps
        max_step_saves_to_keep: 4 # how many intermittent saves to keep
        push_to_hub: false #change this to True to push your trained model to Hugging Face.
        # You can either set up a HF_TOKEN env variable or you'll be prompted to log-in
      #       hf_repo_id: your-username/your-model-slug
      #       hf_private: true #whether the repo is private or public
      datasets:
        # datasets are a folder of images. captions need to be txt files with the same name as the image
        # for instance image2.jpg and image2.txt. Only jpg, jpeg, and png are supported currently
        # images will automatically be resized and bucketed into the resolution specified
        # on windows, escape back slashes with another backslash so
        # "C:\\path\\to\\images\\folder"
        - folder_path: "/workspace/ai-toolkit/bedrooms_empty"
          caption_ext: "txt"
          caption_dropout_rate: 0.05 # will drop out the caption 5% of time
          shuffle_tokens: false # shuffle caption order, split by commas
          cache_latents_to_disk: true # leave this true unless you know what you're doing
          resolution: [512, 768, 1024] # flux enjoys multiple resolutions
      train:
        batch_size: 16
        steps: 2000 # total number of steps to train 500 - 4000 is a good range
        gradient_accumulation_steps: 1
        train_unet: true
        train_text_encoder: false # probably won't work with flux
        gradient_checkpointing: true # need this on unless you have a ton of vram
        noise_scheduler: "flowmatch" # for training only
        optimizer: "adamw8bit"
        lr: 1e-4
        # uncomment this to skip the pre training sample
        #        skip_first_sample: true
        # uncomment to completely disable sampling
        #        disable_sampling: true
        # uncomment to use new bell curved weighting. Experimental but may produce better results
        #        linear_timesteps: true

        # ema will smooth out learning, but could slow it down. Recommended to leave on.
        ema_config:
          use_ema: true
          ema_decay: 0.99

        # will probably need this if gpu supports it for flux, other dtypes may not work correctly
        dtype: bf16
      model:
        # huggingface model name or path
        name_or_path: "black-forest-labs/FLUX.1-dev"
        is_flux: true
        quantize: true # run 8bit mixed precision
      #        low_vram: true  # uncomment this if the GPU is connected to your monitors. It will use less vram to quantize, but is slower.
      sample:
        sampler: "flowmatch" # must match train.noise_scheduler
        sample_every: 250 # sample every this many steps
        width: 1024
        height: 1024
        prompts:
          # you can add [trigger] to the prompts here and it will be replaced with the trigger word
          # - "[trigger] holding a sign that says 'I LOVE PROMPTS!'"\n
          - "a primary bedroom with a large bed, nightstands, and a dresser, soft lighting, and a window with sheer curtains"
          - "a cozy living room with a fireplace, a large sofa, and a coffee table, warm lighting, and a rug on the floor"
          - "a modern kitchen with stainless steel appliances, white cabinets, and a marble countertop, pendant lighting, and a tiled backsplash"
          - "a spacious bathroom with a freestanding tub, a double vanity, and a walk-in shower, natural light, and a potted plant"
          - "a minimalist dining room with a wooden table, chairs, and a chandelier, neutral colors, and artwork on the walls"
          - "a home office with a desk, a chair, and bookshelves, a window with a view, and a laptop on the desk"
          - "a walk-in closet with shelves, drawers, and hanging space, a mirror, and a chandelier"
          - "a sunroom with wicker furniture, plants, and a ceiling fan, large windows, and a view of the garden"
          - "a basement with a home theater, a bar, and a pool table, dim lighting, and a popcorn machine"
          - "an attic with exposed beams, a skylight, and a reading nook, cozy lighting, and a bookshelf"
        neg: "" # not used on flux
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 20
# you can add any additional meta info here. [name] is replaced with config name at top
meta:
  name: "[name]"
  version: "1.0"
```

  3. Training Script: here's what my training script looks like so far (this is still a work in progress that I'm improving):

```python
#!/usr/bin/env python3
"""
LoRA Training Script for AI Image Generation
This script processes images and generates captions using OpenAI's GPT-4 Vision API,
then runs LoRA training for AI image generation models.

Key features:
- Parallel image processing using Ray
- Caption generation with GPT-4 Vision
- Integration with AI-toolkit for LoRA training
"""

import os
import base64
from typing import List, Tuple
import time
from openai import OpenAI
import ray
from PIL import Image

# Initialize Ray for parallel processing
ray.shutdown()
ray.init(num_cpus=6)  # Adjust based on available CPU cores

@ray.remote
class ImageProcessor:
    """Handles image processing and caption generation using OpenAI's GPT-4 Vision."""

    def __init__(self, api_key):
        self.client = OpenAI(api_key=api_key)

    def encode_image(self, image_path: str) -> str:
        """Convert image to base64 encoding for API submission."""
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    def generate_caption(self, image_path: str) -> Tuple[str, str]:
        """
        Generate a detailed caption for an image using GPT-4 Vision.
        Returns tuple of (image_path, caption)
        """
        prompt = """
        Create a caption of this image (the materials and finishes) using a set of 5-15
        one-to-two-word descriptors that can be fed into a diffusion image generation model,
        starting with 'empty bedroom'.
        """
        try:
            base64_image = self.encode_image(image_path)
            response = self.client.chat.completions.create(
                model="gpt-4-vision-preview",
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/jpeg;base64,{base64_image}"
                                },
                            },
                        ],
                    }
                ],
            )
            return image_path, response.choices[0].message.content
        except Exception as e:
            print(f"Error generating caption for {image_path}: {e}")
            return image_path, None

def get_image_files(folder_path: str) -> List[str]:
    """Return list of supported image file paths in the given folder."""
    supported_formats = (".jpg", ".jpeg", ".png", ".bmp", ".tiff")
    return [
        os.path.join(folder_path, f) for f in os.listdir(folder_path)
        if f.lower().endswith(supported_formats)
    ]

def save_caption(result: Tuple[str, str]) -> None:
    """Save generated caption to a text file alongside the image."""
    image_path, caption = result
    if caption:
        txt_path = os.path.splitext(image_path)[0] + ".txt"
        with open(txt_path, "w") as txt_file:
            txt_file.write(caption)
        print(f"Caption saved for {os.path.basename(image_path)}.")
    else:
        print(f"Failed to generate caption for {os.path.basename(image_path)}.")

def process_images_parallel(folder_path: str, api_key: str, num_workers: int = 8) -> None:
    """
    Process images in parallel using Ray.

    Args:
        folder_path: Path to the folder containing images
        api_key: OpenAI API key
        num_workers: Number of parallel workers to use
    """
    # Get list of image files
    image_files = get_image_files(folder_path)

    # Skip images that already have captions
    image_files = [
        f for f in image_files
        if not os.path.exists(os.path.splitext(f)[0] + ".txt")
    ]

    if not image_files:
        print("No new images to process.")
        return

    # Create processor actors
    processors = [ImageProcessor.remote(api_key) for _ in range(num_workers)]

    # Distribute work among processors
    futures = []
    for i, image_path in enumerate(image_files):
        processor = processors[i % num_workers]
        futures.append(processor.generate_caption.remote(image_path))

    # Wait for all caption results, then save them
    for future in ray.get(futures):
        save_caption(future)

    print(f"Processed {len(image_files)} images using {num_workers} workers.")

def main():
    """Main execution function."""
    # Configuration
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("Please set OPENAI_API_KEY environment variable")

    folder_path = "/workspace/ai-toolkit/bedrooms_empty"

    # Process images
    process_images_parallel(folder_path, api_key)

    # Run LoRA training using AI-toolkit
    os.system("python3 run.py config/train_lora*")

if __name__ == "__main__":
    main()
```

## Next Steps: Integration with Sitecore Content Hub

The integration of generated images back into Content Hub deserves its own detailed discussion. The process involves automated asset ingestion, metadata enhancement, and maintaining proper versioning - topics I plan to cover in a future post.

## Looking Forward: Generative AI and Virtual Staging

This project represents an early exploration of combining enterprise DAM capabilities with generative AI. One particularly promising direction is virtual staging, where we can transform empty interior photos into fully staged spaces ready for marketing materials. The possibilities for streamlining real estate visualization while maintaining brand consistency are quite exciting.
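
As a rough illustration of how virtual staging could work with the same building blocks, the sketch below runs an empty-room photo through the image-to-image pipeline with a staging prompt; the paths, prompt, and strength value are assumptions rather than tested settings.

```python
# Rough sketch of virtual staging: start from a photo of an empty room and let the
# LoRA-tuned FLUX.1 image-to-image pipeline add furnishings while preserving the
# room's geometry. Paths, prompt, and strength are illustrative assumptions.
import torch
from diffusers import FluxImg2ImgPipeline
from diffusers.utils import load_image

pipe = FluxImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("output/mh_flux_lora_v7.safetensors")  # hypothetical interior LoRA
pipe.enable_model_cpu_offload()

empty_room = load_image("photos/empty_primary_bedroom.jpg")  # hypothetical Content Hub export
staged = pipe(
    prompt="primary bedroom staged with a king bed, nightstands, layered bedding, soft lighting",
    image=empty_room,
    strength=0.45,          # keep walls, windows, and flooring; change mostly the furnishings
    guidance_scale=3.5,
    num_inference_steps=28,
).images[0]
staged.save("generated/staged_primary_bedroom.jpg")
```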

## Conclusion

The intersection of Content Hub's enterprise-grade asset management and modern AI capabilities offers intriguing possibilities for real estate visualization. While the technology continues to evolve, our experiments with Flux and LoRA models demonstrate the potential for bridging the gap between architectural concepts and photorealistic presentations. The key to success lies in high-quality training data and careful model fine-tuning - areas where Content Hub's structured approach to digital asset management proves invaluable.

As we continue to explore these technologies, the focus remains on practical applications that deliver real value to our marketing and sales processes. The ability to generate consistent, high-quality property visualizations represents just the beginning of what is possible with this combination of enterprise DAM and generative AI.

## Useful Links

- [Flux 1](https://blackforestlabs.ai)
- [Stable Diffusion](https://stability.ai/stable-diffusion)
- [AI Toolkit](https://github.com/ostris/ai-toolkit)
- [Hugging Face](https://huggingface.co/)
- [RunPod.io](https://www.runpod.io/)
- [Sitecore Content Hub](https://www.sitecore.com/products/content-hub)
- [LoRA (Low-Rank Adaptation)](https://arxiv.org/abs/2106.09685)
- [fal.ai](https://fal.ai/)