Using network flow data for network operations, performance and security mangaement is a large data problem, in that we're talking about collecting, processing and storing a large amount of data. Modern relational databases management (RDBM) technology works very well to assist in processing and mining all that data. Some thought and engineering, however, needs to be done to get the most benefit.
When using RDBM technology to support flow data auditing and analysis, the primary issue is database performance and data arrival rates. Can the RDBM keep up, inserting the flow data from the sensors and also serve queries against the data? For most companies, corporations and some universities and colleges, using an RDBM like MySQL to manage and process all the primitive data from their principal argus sensors (or netflow data sources) works well. Many database systems running on contemporary PC technology (dual/quad core, 2+GHz Intel, 8-16GB memory, 1+TB disk) can handle around 500-1000 argus record insertions per second, which is around 50-85M flows per day or around 70GB flow data/day), and have plenty of cycles to handle many standard forensic queries to the data, like "show me all the activity from this network address in the last month".
For larger sites, such as member universities of Internet2, where the flow record demand can get into the 20-50K flows per second range, there are databases that can keep up. The STANFORD Isis Project handles this type of load very well, but you are working with commercial systems that need a bit of thought and attention (as a general rule). For ISP's and national level monitoring systems, where the flow record demand can be in the 100K records per second range, databases are not currently capable of handling the transaction rates of the "primitive" data. By distributing the data, you can handle the flow demand, or you can use native file system methods and use RDBM technology to handle the "derived" data, such as the list of IP addresses seen. Hybrid architectures are how large scale organizations make using flow data successful.
The key programs that support the direct use of argus data and databases are radium(), which is used to collect the data and transport it to a central node, and the two MySQL programs, rasqlinsert() and rasql(). These example programs (yes, they are just examples of what you can do), are designed to deal with schema decisions, database and table generation, and provide some simple queries strategies that are relevant to flow data.
rasqlinsert() is the primary RDBM program. It inserts argus data into MySQL database tables, and it tries to deal with most of the issues that come up when inserting data into a database. Flexible table creation? Near real-time insertion? Bulk insertion optimization? Support flexible keying, indexing and schema specification? Aggregate/filter/label records before you insert? Control the record insertion rate to control the impact on the database engine? Multi-threaded to deal with large loads? All of this sounds like some of the improtant requirements. If you see some missing, please send them to the email list, and we'll add them.
Using a database for handling argus data provides some interesting solutions to some interesting problems. racluster() has been limited in how many unique flows it can process, because of RAM limitations. rasqlinsert() can solve this problem, as it can do the aggregation of racluster(), but use a MySQL table as the backing store, rather than memory. Programs like rasort() which read in all the argus data, use qsort() to sort the records, and then output the records as a stream, have scaling issues, in that you need to have enough memory to hold the binary records. Programs like rasqlinsert() can insert the data into a database, and the database can sort the table contents. So hopefully we'll provide a lot of flexibility with these simple techniques.
Getting data into a database table involves a few basic steps. You need to decide what specific data you want to put into the database, how you want the data to be processed (keys, indexes, paritioning, etc...), and then you have to transform the data into a format that the database system likes. rasqlinsert() attempts to handle these tasks with a simple set of configuration and command line options.
The program rasqlinsert() allows you to insert any type of data that is derived from argus data directly into MySQL database table(s). Think of rasqlinsert() as a collection of the basic ra* programs, but with a database backend. It can act like ra(), where it takes in an argus stream or file, and appeands it to a database table. rasqlinsert() can act like racluster() where it generates aggregated representations of the argus data stream, and it inserts the results into a database table. The output stream that rasqlinsert() generates can be directed to database tables using the same methods as rasplit(), such that the database table names can be dynamically controlled, and rasqlinsert() can act like ratop() but with the screen being duplicated to a database backend. Also, rasqlinsert() has a daemon mode, where it can do all of this near real-time database management operations, without having a CURSES based output window.
In this section, we'll describe the simplest operation, appending a live argus data stream to a set of database tables that are explicitly partioned by date. We use this very successfully when the number of flows are < 20M flows/day, which are most working group systems, and many, many enterprise networks.
Argus data inserted into a database table has two types of representation in the database:
1) the original primitive argus record, which is stored as a "BINARY BLOB" called "record".
2) additional fields that are the database table "COLUMNS" or attributes.
The argus record itself can contain up to 148 (at last count) fields, metrics, attributes, objects: far too many fields to expose in a database table, at least if you're worried about performance. But when you insert a record into a database, MySQL can't efficiently deal with "BINARY BLOB" attributes in that its hard to sort, index, select, key the values in the BLOB. So, its generally a good idea to provide some attributes that are inserted along with the binary record so that the database can efficiently work with the data.
The attributes can be any field that ra() can print, and you specify them using the "-s fields" option. Because all ra* programs have a default list of printable fields, rasqlinsert() will use these default fields as attributes in the database, if you don't specify any other list, either in the .rarc file, or on the command line. rasqlinsert() supports two(2) addtional fields, an 'autoid' field, which MySQL understands, and the 'record' field, that provides you the ability to make the BINARY BLOB an optional field.
Because rasqlinsert() is like racluster() and ratop(), it also has a default flow key specified. This keying specification is passed into the database, and so if you don't want keys in your schema, or if you want to change the keying strategy, you'll need to a "-m keys" option on the commandline.
So lets run rasqlinsert() as a daemon, appending records coming from a remote near real-time argus data source to a MySQL database:
rasqlinsert -S argus.data.source -w mysql://user@localhost/argusData/argusTable -m none -d
This will attach to argus.data.source, and append any argus data into the MySQL table 'argusTable' that is in the 'argusData' database, using the 'root' account on the local MySQL database engine. The table schema that is created is composed of the default output fields that ra() uses and field for the actual binary argus record. Here is the table schema that was created:
mysql> desc argusData;
| Field | Type | Null | Key | Default | Extra |
| stime | double(18,6) unsigned | NO | | NULL | |
| flgs | varchar(32) | YES | | NULL | |
| proto | varchar(16) | NO | | NULL | |
| saddr | varchar(64) | NO | | NULL | |
| sport | varchar(10) | NO | | NULL | |
| dir | varchar(3) | YES | | NULL | |
| daddr | varchar(64) | NO | | NULL | |
| dport | varchar(10) | NO | | NULL | |
| pkts | bigint(20) | YES | | NULL | |
| bytes | bigint(20) | YES | | NULL | |
| state | varchar(32) | YES | | NULL | |
| record | blob | YES | | NULL | |
12 rows in set (0.00 sec)
This appends argus data as it's received into a single table. The average size of an entry is 500 bytes, in this configuration, and so 20M flows/day will result in 10GB daily tables.
With this type of table, you are storing your argus records in a MySQL database table. No real need for putting them anywhere else, so this could be considered your archive. You can keep the data in this single table, and use "PARTIONING" statements to structure the data based on time. However, if you were interested in an explicit partitioning scheme, you can have rasqlinsert() manage tables based on time, like rasplit() does.
rasqlinsert -S argus.data -m none -d \
-M time 1d -w mysql://user@localhost/argusData/argusTable_%Y_%m_%d
rasqlinsert() will use the "-M time 1d" and the "%Y_%m_%d" directive in the table name to generate MySQL database tables that have the daily date in the table name, and like rasplit(), it will 'split' the argus data records such that no record in the table extends beyound the bounds of the table time, in this case, 1 day.
These are a few simple examples of how rasqlinsert() can append argus data records to tables in a mysql database. Additonal, more complex tasks, will be described at this link (to be constructed).
rasql() is a relatively simple program. It performs a SQL "SELECT" query for the "record" field in the specified database table. The "record" field is expected to be a "binary blob" that contains a single binary argus record. If there isn't a "record" field, rasql() quietly exists. After rasql() gets the argus record(s) from the database, it has the functionality of ra(), where it can print, filter, and/or output the records. Specifying the table for reading is done using the "-r url" format, where you indicate that you are looking for data from a "mysql:" database. The table can be explicit, or it can be programmatically derived.
By default, rasql() returns all the "records' from a table. There are a few strategies for "selecting" specific records from the table. There is a specific option for you to provide a "WHERE" clause to the SQL query, using the "-M sql='clause'" option. Using this option, you can use MySQL to select specific records from the table. When this opion is used, rasql() provides the "WHERE" and it takes your string and appends it. If there already is a "WHERE" clause being used, rasql() will append "AND " and then your phrase to the "WHERE" clause that will be used in the query. (so don't put in the "WHERE"). This is important if you are using a "-t start-stop" filter on the command line, as that will attempt to provide a "WHERE" clause, if the database table has a time field in the schema.
Now, for an example. Say you are interested in getting the flows seen from 00:00:00-00:12:00, but 2 days ago, from a specific IP address, 18.104.22.168. Lets assume that the data is stored in a set of tables that have the date in the table name, and the table schema contains both source and destination IP addresses, 'saddr' and 'daddr' respectively. This should work:
rasql -t -2d+20m -r mysql://user@localhost/argusData/argus_%Y_%m_%d \
-M sql="saddr='22.214.171.124' or daddr='126.96.36.199'"
This takes the time range, and the table naming strategy, and figures out which table(s) from the local argusData database to read from, to provide the "record" data you're looking for. The "-M sql="clase" option uses the 'saddr' and 'daddr' attributes which are ASCII strings. You can write these records to a file, or pipe to other programs, like racluster() to generate a view that you want. See ra.1 for how to specify time filters.
Page Last Modified: 14:22:39 EDT 13 Mar 2012 ©Copyright 2000 - 2012 QoSient, LLC. All Rights Reserved.