Data Center, Scaling & Management with AI

Kossiso Udodi
Project Owner

Expert Rating: n/a
  • Proposal for: BGI Nexus 1
  • Funding Request: $50,000 USD
  • Funding Pools: Beneficial AI Solutions
  • Total: 2 Milestones

Overview

We are developing an AI-driven platform that automates data centre tasks such as server provisioning, patching, general maintenance, and monitoring. It uses reinforcement learning to handle standard operations and offer insights. We are currently building the datasets for the model that will power the platform, using Qwen 2.5 as the foundational model. Our dataset currently has 1.8M rows, and we expect it to reach 5M rows before April. Because data centres have low fault tolerance, we must structure and validate our datasets precisely. We aim to train a beta version of the model within the next two months and to start work on the platform's software in May.
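
To illustrate what this validation step can look like in practice, below is a minimal sketch that checks each dataset row for the required fields before it is accepted for training. The field names mirror the record format described under Milestone 2; the file name and the specific checks are illustrative assumptions, not our production rules.

    import json

    # Required fields for one training record; names mirror the
    # Invocation | Command | Context | Ansible Playbook Command format.
    REQUIRED_FIELDS = ("Invocation", "Command", "Context", "Ansible Playbook Command")

    def validate_record(record: dict) -> list[str]:
        """Return a list of problems; an empty list means the row passes."""
        problems = [f"missing or empty field: {f}"
                    for f in REQUIRED_FIELDS if not record.get(f)]
        # Commands reach live infrastructure, so reject obvious placeholders.
        command = record.get("Command", "")
        if "TODO" in command or "<" in command:
            problems.append(f"unresolved placeholder in command: {command!r}")
        return problems

    # "dataset.jsonl" is a hypothetical file name used for illustration.
    with open("dataset.jsonl") as f:
        for line_no, line in enumerate(f, start=1):
            for problem in validate_record(json.loads(line)):
                print(f"row {line_no}: {problem}")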

Proposal Description

How Our Project Will Contribute To The Growth Of The Decentralized AI Platform

Our AI tool helps data centres work smarter by automating tasks, which reduces waste and lowers costs. This efficiency means more resources can be dedicated to supporting good AI projects that BGI cares about, aligning with its mission of responsible and ethical AI development. By optimizing data centres, we help BGI enable more beneficial AI initiatives through better resource allocation.

Our Team

The team consists of me, Kossiso Udodi, and Farida Adamu. I have direct experience managing servers and infrastructure at the National Agency for Science and Engineering Infrastructure, building models in MLOps, and designing software architecture.

My co-founder, who has worked with the Rural Electrification Agency, PwC, and Generation, brings deep expertise in data engineering.

AI services (New or Existing)

Named Entity Recognition

How it will be used

We will use it to extract container and entity data for command execution. For example, if the model needs to configure a Kubernetes container, we will use the service to extract the container's name from the logs and adjust the command to be executed.
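
As a rough sketch of this flow, the snippet below runs a token-classification model over a log line and substitutes the recognised pod name into a command template. The checkpoint name "collosa/k8s-entity-ner" and the "POD" label are hypothetical; a stock NER model would need fine-tuning to recognise Kubernetes entities.

    from transformers import pipeline

    # Illustrative only: "collosa/k8s-entity-ner" is a hypothetical
    # fine-tuned checkpoint, since a stock NER model does not recognise
    # Kubernetes entities such as pod or container names.
    ner = pipeline("token-classification",
                   model="collosa/k8s-entity-ner",
                   aggregation_strategy="simple")

    log_line = "Pod payments-api-7d9f restarting in namespace prod due to OOMKilled"

    # Find the first entity tagged as a pod name and substitute it
    # into the command template before execution.
    entities = ner(log_line)
    pod = next((e["word"] for e in entities if e["entity_group"] == "POD"), None)
    if pod is not None:
        print(f"kubectl describe pod {pod} -n prod")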

Company Name (if applicable)

Collosa AI

The core problem we are aiming to solve

The core problem we're solving is the high costs and inefficiency of managing data centres. Traditional methods are time-consuming, error-prone, and expensive, making it hard for data centres to scale and grow sustainably. Our AI-driven platform automates server management and maintenance tasks, cutting costs and reducing errors. This makes data centres more efficient and environmentally friendly. By solving this problem, we help make AI and technology more accessible and sustainable.

Our specific solution to this problem

Using reinforcement learning, our AI-driven platform automates data centre tasks like server management and maintenance. This reduces human intervention and errors, cutting costs and freeing up staff. The platform continuously learns and adapts, optimizing operations as data centres grow. By improving efficiency, we lower environmental impact and support responsible AI development. This solution directly addresses inefficiencies, making data centres more scalable and sustainable.
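
The toy example below illustrates the reinforcement-learning idea on a deliberately tiny scale: a tabular Q-learning agent learns which maintenance action repairs which system state. The states, actions, and rewards are invented for illustration; the real platform operates over far richer state and action spaces.

    import random

    # Toy stand-in for the simulated data-centre environment: the agent
    # picks a maintenance action and the environment scores the outcome.
    ACTIONS = ("patch", "restart", "scale_up", "noop")
    STATES = ("degraded", "outdated", "healthy")

    def step(state, action):
        """Return (next_state, reward) for a toy transition."""
        if state == "degraded" and action == "restart":
            return "healthy", 1.0
        if state == "outdated" and action == "patch":
            return "healthy", 1.0
        return state, -0.1  # small penalty for unhelpful actions

    # Tabular Q-learning over the toy environment.
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    alpha, gamma, epsilon = 0.5, 0.9, 0.2
    for _ in range(2000):
        state = random.choice(["degraded", "outdated"])
        for _ in range(5):
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state

    # After training, the agent should have learned to restart degraded servers.
    print(max(ACTIONS, key=lambda a: q[("degraded", a)]))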

Existing resources

We will be leveraging Qwen 2.5 as our foundational model, which cuts the cost of building a model from scratch.
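
For reference, Qwen 2.5 checkpoints are published on the Hugging Face Hub and load through the standard transformers API. The sketch below uses the 7B instruct variant as one possible starting point; the prompt and node name are illustrative.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Qwen2.5 checkpoints are on the Hugging Face Hub; the 7B instruct
    # variant here is one option, not a committed choice for the project.
    model_name = "Qwen/Qwen2.5-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    # Illustrative prompt; "worker-3" is a made-up node name.
    messages = [{"role": "user",
                 "content": "Generate the kubectl command to drain node worker-3."}]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                           return_tensors="pt").to(model.device)
    output = model.generate(inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))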

Links and references

https://imara.collosaai.com

Additional videos

https://drive.google.com/file/d/1sYLk0LDeME-pSg6iIaCjDtaEDwuvTggr/view?usp=sharing

Proposal Video

Placeholder for Spotlight Day pitch presentations. Videos will be added by the DF team when available.

  • Total Milestones

    2

  • Total Budget

    $50,000 USD

  • Last Updated

    12 Feb 2025

Milestone 1 - Creation of datasets

Description

Completing the creation of the datasets is a key milestone, providing the essential data needed to train our AI model. Our dataset currently has 1.8 million rows and is on track to reach 5 million by April, covering a wide range of server logs and maintenance records. The data is structured to support both supervised and reinforcement learning, helping the AI understand cause-and-effect relationships between commands and system states. Achieving this sets the stage for training the AI model and developing a functional beta version of our platform.
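
To make the cause-and-effect framing concrete, the sketch below shows how a single row could feed both learning modes: a supervised input/target pair and a reinforcement-learning transition. The state_before/state_after fields and all values are invented for the example.

    # One illustrative row (field names follow the Milestone 2 format;
    # state_before/state_after are added here for the example).
    row = {
        "Invocation": "Restart the nginx service on web-01",
        "Command": "systemctl restart nginx",
        "Context": "web-01 reports 502 errors; nginx workers unresponsive",
        "state_before": "nginx: failed",
        "state_after": "nginx: active (running)",
    }

    # Supervised view: map natural-language intent plus context to a command.
    supervised_example = {
        "input": f"{row['Invocation']}\nContext: {row['Context']}",
        "target": row["Command"],
    }

    # Reinforcement-learning view: the same row as a state transition,
    # rewarding commands that moved the system to a healthy state.
    rl_transition = {
        "state": row["state_before"],
        "action": row["Command"],
        "next_state": row["state_after"],
        "reward": 1.0 if "running" in row["state_after"] else 0.0,
    }
    print(supervised_example["target"], rl_transition["reward"])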

Deliverables

A robust dataset containing millions of rows of data.

Budget

$30,000 USD

Success Criterion

When we have completed 5 million rows of data, covering everything from core infrastructure and server management to development and automation IDEs.

Milestone 2 - Model training

Description

Completing the model training is a crucial step in developing an AI-driven data centre management platform. This process involves teaching the AI to perform tasks like server provisioning, patching, and maintenance using a large dataset. The dataset, currently at 1.8 million rows and expected to grow to 5 million by April, is structured as Invocation | Command | Context | Ansible Playbook Command, capturing various aspects of command execution. The training process will use reinforcement learning, where the AI interacts with a simulated environment and receives feedback to optimize its actions. Once complete, the AI will efficiently manage data centre tasks, leading to cost savings and improved reliability.
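
As a small illustration of this row format, the snippet below splits one pipe-delimited record into named fields; the sample row and the lower-case field names are made up for the example.

    # Split one pipe-delimited dataset row into named fields; the column
    # order follows the format described above.
    raw = ("Restart the nginx service on web-01 | systemctl restart nginx | "
           "web-01 reports 502 errors | ansible web-01 -m systemd "
           "-a 'name=nginx state=restarted'")

    fields = ("invocation", "command", "context", "ansible_command")
    record = dict(zip(fields, (part.strip() for part in raw.split(" | "))))
    print(record["ansible_command"])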

Deliverables

This deliverable sets the stage for the beta release of the datacenter AI model.

Budget

$20,000 USD

Success Criterion

First, the model must achieve a high accuracy rate (95% or above) in executing and recommending optimal data centre tasks. This ensures that the platform can reliably perform its functions without frequent human intervention. Second, the model should demonstrate data efficiency by learning effectively from the structured dataset, which is expected to grow to 5 million rows. This ability to generalize from the data is crucial for handling diverse and complex tasks. Third, the model should exhibit scalability, maintaining its performance and accuracy as the complexity and scale of the tasks increase.
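
A minimal sketch of how the first criterion could be measured is shown below: compare the model's proposed command against the reference command for each held-out row. Exact string matching is a simplification; a production check would have to treat semantically equivalent commands (such as the two namespace flags below) as equal.

    # Command-level accuracy over a held-out evaluation set.
    def accuracy(predictions: list[str], references: list[str]) -> float:
        correct = sum(p.strip() == r.strip()
                      for p, r in zip(predictions, references))
        return correct / len(references)

    # Illustrative predictions and references; the second pair is
    # equivalent in effect but fails an exact string match.
    preds = ["systemctl restart nginx", "kubectl get pods -n prod"]
    refs = ["systemctl restart nginx", "kubectl get pods --namespace prod"]
    print(f"accuracy: {accuracy(preds, refs):.0%}  target: 95%")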

Join the Discussion (4)

  • Simon250
    Mar 9, 2025 | 1:17 PM

    I suggest we could reallocate some of the budget from Milestone 1 to create two additional milestones. For instance, Milestone 3 could focus on building a user-friendly UI, ensuring that our platform is accessible and intuitive for end users. Then, Milestone 4 could be dedicated to developing comprehensive documentation and planning for further expansion. This revised structure would not only streamline the development process but also enhance the overall scalability and usability of the platform. What are your thoughts on this approach?

    • Kossiso Udodi
      Mar 9, 2025 | 1:36 PM

      I agree with you. Thank you for the clarity on this. We'll make corrections!

  • Sky Yap
    Mar 9, 2025 | 12:42 PM

    Really like this idea! I think changing the milestone title from "Creation of datasets" to "Collection of datasets" is a good idea. "Collection" better emphasizes that you're gathering data from existing sources, like server logs and maintenance records, rather than generating it from scratch. This slight change clarifies the process and aligns well with the continuous growth of your dataset. What do you think?

    • Kossiso Udodi
      Mar 9, 2025 | 2:30 PM

      We "collect" for sections that are less error sensitive and format specially for critical sections. In early testing, one of the problems we encountered was that sometimes models sent the wrong initiation commands. There are certain functions where the system can fail and try again but in some areas like security and firewalls, we cannot afford even slight margins of error. We are trying to get around this by formatting some sections of data this way:   {        "Invocation": "Apply Pod Security Standards at Cluster level",        "Command": "kubectl apply -f podsecurity.yaml",        "NLP Context": "This is for when we need to apply Pod Security Standards at Cluster level",        "Ansible Task": {            "name": "Execute: Apply Pod Security Standards at Cluster level",            "module": "k8s",            "args": {                "definition": "kubectl apply -f podsecurity.yaml"            }We are also working with a few server centre workers to address specific scenarios with very low error margins. We want to get it right so that managers can trust it well enough to adopt it quickly.  Collection is also part of what we are doing, especially for sections with low sensitivity. 

Expert Ratings

Reviews & Ratings

    No reviews available.