Google is open-sourcing Magika, an AI system for file type recognition. It can quickly and accurately identify binary and text-based file types.
Accurately identifying file types is a challenging problem due to the diverse structures of file formats. Traditional recognition tools such as libmagic rely on hand-crafted heuristics and user-defined rules, which can be time-consuming and error-prone.
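To make the limits of rule-based detection concrete, here is a minimal sketch of a signature matcher in Python. It is not how libmagic is implemented; the table, the `guess_by_signature` name, and the labels are illustrative assumptions. Only the magic-byte prefixes themselves are standard.

```python
# Hand-written table of (offset, magic bytes, label). Every supported format
# needs its own rule, which is what makes this approach laborious to maintain.
SIGNATURES = [
    (0, b"%PDF-", "pdf"),
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"PK\x03\x04", "zip"),
    (0, b"\x7fELF", "elf"),
]


def guess_by_signature(data: bytes) -> str:
    """Return the label of the first matching signature, or "unknown"."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return "unknown"
```

A PDF header such as `b"%PDF-1.7\n"` matches the first rule, but a snippet of source code like `b"def main():\n    pass\n"` falls through to "unknown" because text-based formats rarely start with a fixed byte sequence. That gap is where hand-written rules tend to multiply.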
Magika addresses these issues with an AI-based model trained on a large dataset, which Google says provides a more reliable way to identify file types at scale. The tool uses a custom deep learning model that is only 1 MB in size and can identify a file in milliseconds, according to Google.
In a benchmark of one million files, Magika outperformed existing tools by 20 percent, according to Google, with even larger gains on text files.
Google says it uses Magika internally to route Gmail, Drive, and Safe Browsing files to the correct security and content policy scanners.
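Google has not published how this routing works. As a purely hypothetical sketch, a detected file type label could be mapped to a scanner with a lookup table like the one below. The scanner names, the labels, and the `pick_scanner` function are all invented for illustration.

```python
# Hypothetical mapping from a detected file type label to a scanner name.
# None of these names come from Google; they only illustrate the idea.
SCANNER_FOR_LABEL = {
    "pdf": "document_scanner",
    "javascript": "script_scanner",
    "zip": "archive_scanner",
}

# Fallback for labels that have no dedicated scanner.
DEFAULT_SCANNER = "generic_scanner"


def pick_scanner(label: str) -> str:
    """Choose a scanner for a file type label, falling back to a default."""
    return SCANNER_FOR_LABEL.get(label, DEFAULT_SCANNER)
```

The point of the sketch is that a misidentified file type sends the file to the wrong scanner, so detection accuracy directly affects everything downstream.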
By open-sourcing Magika, Google intends to help other software improve the accuracy of its file type detection and to give researchers a reliable tool for identifying file types at scale. The upcoming integration with VirusTotal is expected to improve the platform's efficiency and accuracy in detecting malicious code.
Users can try the web demo of Magika or install it from PyPI with the "pip install magika" command, which provides both a Python library and a standalone command-line tool.
Magika is available on GitHub under the Apache 2.0 license.
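Once the package is installed, using it from Python might look like the sketch below. The `identify` and `label_of` helpers are hypothetical wrappers, not part of Magika. The name of the result attribute is an assumption that depends on the installed release: project examples have shown both `output.label` and `output.ct_label`, so the helper checks for either.

```python
def label_of(result) -> str:
    """Extract the file type label from a Magika result object.

    Assumption: the label lives at result.output.label in newer releases
    and at result.output.ct_label in older ones.
    """
    output = result.output
    label = getattr(output, "label", None)
    if label is None:
        label = getattr(output, "ct_label", None)
    if label is None:
        raise AttributeError("no label attribute found on result.output")
    return str(label)


def identify(data: bytes) -> str:
    """Identify the file type of raw bytes with Magika."""
    # Imported here so this module still loads if magika is not installed.
    from magika import Magika

    magika = Magika()  # loads the bundled model
    return label_of(magika.identify_bytes(data))
```

Calling `identify(b"# Example\nThis is an example of markdown!")` returns the label the model predicts for that content. For files on disk, the library also offers `identify_path`, and the command-line tool can be run as `magika` followed by one or more file paths.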