Identify Scripting Languages in Malware: A Faster Alternative to Magika and Guesslang
Need to quickly identify the scripting language of a suspicious file? Lex Sleuther is a lightweight command-line tool for identifying scripting languages that focuses on speed and accuracy, using a dedicated lexer for each supported language. Unlike machine-learning models such as Guesslang and Magika, Lex Sleuther takes a lexing-based approach, which suits environments where resources are limited.
What is Lex Sleuther and How Does It Work?
Lex Sleuther analyzes source code using dedicated lexers for each supported language. Instead of relying on a heavy machine-learning model, it breaks code into tokens, counts their occurrences, and calculates per-language probabilities based on weights derived from real-world sample analysis. This method offers a distinct advantage when identifying malware components written in scripting languages.
In short, the tool works in three steps:
- Lexing: Breaking the code down into tokens.
- Token Counting: Tallying how many times each token occurs.
- Probability Calculation: Assigning the language with the highest probability, computed from the weighted token counts.
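The three steps can be illustrated with a toy sketch in Python. This is not Lex Sleuther's implementation: the regex "lexers", the token names, the weight values, and the softmax normalization are all hypothetical choices made for illustration, whereas the real tool uses full lexers and weights derived from sample analysis.

```python
import math
import re
from collections import Counter

# Hypothetical toy "lexers": each maps a token name to a regex.
# Real lexers are far more complete; these only illustrate the idea.
TOY_LEXERS = {
    "javascript": {
        "js_keyword": r"\b(?:function|var|let|const)\b",
        "js_strict_eq": r"===",
    },
    "powershell": {
        "ps_variable": r"\$[A-Za-z_]\w*",
        "ps_cmdlet": r"\b[A-Z][a-z]+-[A-Z][A-Za-z]+\b",
    },
}

# Hypothetical weights: how much one occurrence of a token moves each
# language's score. These numbers are made up, not trained.
TOY_WEIGHTS = {
    "javascript": {"js_keyword": 1.0, "js_strict_eq": 1.5,
                   "ps_variable": -0.5, "ps_cmdlet": -1.0},
    "powershell": {"js_keyword": -0.5, "js_strict_eq": -1.0,
                   "ps_variable": 1.0, "ps_cmdlet": 1.5},
}


def count_tokens(source):
    """Steps 1 and 2: run every toy lexer and tally token occurrences."""
    counts = Counter()
    for patterns in TOY_LEXERS.values():
        for name, pattern in patterns.items():
            counts[name] += len(re.findall(pattern, source))
    return counts


def classify(source):
    """Step 3: turn weighted token counts into per-language probabilities."""
    counts = count_tokens(source)
    scores = {
        lang: sum(weights.get(tok, 0.0) * n for tok, n in counts.items())
        for lang, weights in TOY_WEIGHTS.items()
    }
    # Softmax normalization (an assumption of this sketch); subtracting
    # the peak score keeps math.exp numerically stable.
    peak = max(scores.values())
    exps = {lang: math.exp(score - peak) for lang, score in scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}


if __name__ == "__main__":
    sample = "$name = Get-ChildItem; Write-Host $name"
    probabilities = classify(sample)
    print(max(probabilities, key=probabilities.get), probabilities)
```

The point of the sketch is that classification reduces to lexing, counting, and a small amount of arithmetic, with no model inference involved.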
Why Use Lex Sleuther? Key Benefits
- Speed: Lex Sleuther is designed for fast identification, crucial in time-sensitive malware analysis.
- Accuracy: For supported file types, Lex Sleuther rivals Magika's accuracy.
- Resource Efficiency: Avoids the overhead of large machine-learning models, making it a good fit for resource-constrained environments.
Getting Started with Lex Sleuther
Here's how to install and run Lex Sleuther:
- Prerequisites: Ensure you have Rust and Cargo installed on your system.
- Installation: Use the following command to install the CLI:
cargo install --git https://github.com/CrowdStrike/lex_sleuther.git
- Help: To learn about command usage, run:
lex_sleuther --help
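Before or after installing, it can help to confirm that the required tools are reachable from your shell. The following is a minimal sketch, assuming the installed binary is named lex_sleuther (as in the help command above) and that Cargo's binary directory is on your PATH; the helper name missing_tools is hypothetical.

```python
import shutil


def missing_tools(tools=("cargo", "rustc", "lex_sleuther")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Rust toolchain and lex_sleuther are available.")
```

If lex_sleuther is reported missing right after a successful install, the likely cause is that the directory Cargo installs binaries into is not on PATH.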
Lex Sleuther vs. Magika: A Key Difference
Lex Sleuther excels where older versions of Magika faltered: nested formats. Consider an HTML file that contains JavaScript, CSS, and other embedded languages. Lex Sleuther identifies the outermost language and ignores the embedded code snippets. The latest Magika version appears to have addressed this limitation.
Supported Scripting Languages
Lex Sleuther focuses on scripting languages frequently found in malware. This targeted approach allows for faster and more accurate analysis.
Understanding Limitations
- Limited Scope: Lex Sleuther supports fewer languages than tools like Guesslang or Magika.
- Experimental: This tool is considered a novel experiment.
Contributing and Expanding Lex Sleuther
Want to add support for more languages?
- New Lexers: Instructions for adding new lexers can be found in the lex_sleuther_multiplexer README.
- New recognized languages: Retrain using a new dataset. Follow the instructions in the dataset folder.
Performance Considerations
- Lexing Bottleneck: The lexing process is currently the most time-consuming aspect of Lex Sleuther.
- Possible Improvements: Optimizations like minimized state machines and memory-efficient scanning can further enhance performance.
Disclaimer
Lex Sleuther is not a formally supported CrowdStrike product. This project is for research purposes only. If you encounter issues, report them via the GitHub repository.