CSV.jl Documentation
GitHub Repo: https://github.com/JuliaData/CSV.jl
Welcome to CSV.jl! A pure-Julia package for handling delimited text data, be it comma-delimited (csv), tab-delimited (tsv), or otherwise.
Overview
To start out, let's discuss the high-level functionality provided by the package, which will hopefully help direct you to the more specific documentation for your use case:
- `CSV.File`: the most commonly used function for ingesting delimited data; it reads an entire data input, or a vector of data inputs, detecting the number of columns and rows along with the type of data in each column. Returns a `CSV.File` object, which is like a lightweight table/DataFrame. Assuming `file` is a `CSV.File` object, individual columns can be accessed like `file.col1`, `file[:col1]`, or `file["col1"]`, and the parsed column names are available via `file.names`. A `CSV.File` can also be iterated, producing a `CSV.Row` on each iteration, which allows access to each value in the row via `row.col1`, `row[:col1]`, or `row[1]`. You can also index a `CSV.File` directly, like `file[1]`, to return the entire `CSV.Row` at the provided index/row number. Multiple threads are used while parsing if the input is large enough, and full return column buffers are allocated to hold the parsed data. `CSV.File` satisfies the Tables.jl "source" interface, so it can be passed to valid sink functions like `DataFrame`, `SQLite.load!`, `Arrow.write`, etc. Supports a number of keyword arguments to control parsing, column types, and other file metadata options (see the sketch after this list).
- `CSV.read`: a convenience function identical to `CSV.File`, but used when the `CSV.File` will be passed directly to a sink function, like a `DataFrame`. In some cases, sinks may make copies of incoming data for their own safety; by calling `CSV.read(file, DataFrame)`, no copies of the parsed `CSV.File` are made, and the `DataFrame` takes direct ownership of the `CSV.File`'s columns, which is more efficient than `CSV.File(file) |> DataFrame`, which results in an extra copy of each column. Keyword arguments are identical to `CSV.File`. Any valid Tables.jl sink function/table type can be passed as the 2nd argument. Like `CSV.File`, a vector of data inputs can be passed as the 1st argument, which results in a single "long" table of all the inputs vertically concatenated; each input must have an identical schema (column names and types). An example follows this list.
- `CSV.Rows`: an alternative approach for consuming delimited data, where the input is consumed one row at a time, which allows "streaming" the data with a lower memory footprint than `CSV.File`. Supports many of the same options as `CSV.File`, except column type handling is a little different: by default, every column type is essentially `Union{Missing, String}`, i.e. no automatic type detection is done, though column types can be provided manually. Multithreading is not used while parsing. After constructing a `CSV.Rows` object, rows can be "streamed" by iterating, where each iteration produces a `CSV.Row2` object, which operates like `CSV.File`'s `CSV.Row` type: individual row values can be accessed via `row.col1`, `row[:col1]`, or `row[1]`. If each row is processed individually, additional memory can be saved by passing `reusebuffer=true`, which means a single buffer is allocated to hold the values of only the currently iterated row (sketched below). `CSV.Rows` also supports the Tables.jl interface and can likewise be passed to valid sink functions.
- `CSV.Chunks`: similar to `CSV.File`, but allows passing an `ntasks::Integer` keyword argument, which causes the input file to be split up into `ntasks` chunks. After constructing a `CSV.Chunks` object, each iteration of the object returns a `CSV.File` of the next parsed chunk; useful for processing extremely large files in "chunks" (see the sketch below). Because each iterated element is a valid Tables.jl "source", `CSV.Chunks` satisfies the `Tables.partitions` interface, so sinks that can process input partitions can operate by passing `CSV.Chunks` as the "source".
- `CSV.write`: a valid Tables.jl "sink" function for writing any valid input table out in a delimited text format. Supports many options for controlling the output, like the delimiter, quote characters, etc. Writes data to an internal buffer, which is flushed out when full; the buffer size is configurable. Also supports writing out partitioned inputs as separate output files, one file per input partition. To write out a `DataFrame`, for example, it's simply `CSV.write("data.csv", df)`; to write out a matrix, it's `using Tables; CSV.write("data.csv", Tables.table(mat))`.
- `CSV.RowWriter`: an alternative way to produce csv output; takes any valid Tables.jl input and, on each iteration, produces a single csv-formatted string for a row of the input table (sketched after this list).
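To make the list above concrete, here is a minimal sketch of `CSV.File` usage. The file name `data.csv` and the column names `col1`/`col2` are hypothetical stand-ins for your own data:

```julia
using CSV

# Parse the entire (hypothetical) file up front; column types are detected automatically.
file = CSV.File("data.csv")

file.names   # parsed column names, e.g. [:col1, :col2]
file.col1    # an entire column, accessed by name
file[1]      # the first CSV.Row

# Iterating produces a CSV.Row per iteration; values are accessible by name or position.
for row in file
    println(row.col1, " => ", row[2])
end
```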
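When the parsed data is headed straight into a sink, `CSV.read` avoids the extra per-column copy; a sketch with the same hypothetical file names:

```julia
using CSV, DataFrames

# The DataFrame takes direct ownership of the parsed columns; no extra copy is made,
# unlike CSV.File("data.csv") |> DataFrame.
df = CSV.read("data.csv", DataFrame)

# A vector of inputs yields one "long", vertically concatenated table,
# assuming every input has an identical schema.
long = CSV.read(["data1.csv", "data2.csv"], DataFrame)
```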
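A sketch of row-by-row streaming with `CSV.Rows`; since values arrive as `Union{Missing, String}` by default, the `parse` call below assumes `col1` happens to hold numbers:

```julia
using CSV

# reusebuffer=true recycles a single row buffer across iterations, saving memory;
# only safe when each row is fully processed before moving to the next.
for row in CSV.Rows("data.csv"; reusebuffer=true)
    value = parse(Float64, row.col1)  # values are strings unless column types are provided
    println(value)
end
```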
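A sketch of chunked processing with `CSV.Chunks`; the chunk count of 4 is arbitrary:

```julia
using CSV

# Each iteration yields a CSV.File holding the next parsed chunk of the input.
for chunk in CSV.Chunks("data.csv"; ntasks=4)
    println(length(chunk), " rows in this chunk")
end
```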
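And a sketch of the output side with `CSV.write` and `CSV.RowWriter`; here `df` is a small throwaway table:

```julia
using CSV, DataFrames, Tables

df = DataFrame(col1 = 1:3, col2 = ["a", "b", "c"])

CSV.write("out.csv", df)                        # write any Tables.jl table to a file
CSV.write("mat.csv", Tables.table(rand(3, 2)))  # wrap a matrix so it can be written

# CSV.RowWriter yields one csv-formatted string per row, header first.
for rowstring in CSV.RowWriter(df)
    print(rowstring)
end
```

Each emitted string already ends in the row delimiter, hence `print` rather than `println` above.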
That's quite a bit! Let's boil it down to a TL;DR:
- Just want to read a delimited file, or a collection of files, and do basic stuff with the data? Use `CSV.File(file)` or `CSV.read(file, DataFrame)`.
- Don't need the data as a whole, or want to stream through a large file row-by-row? Use `CSV.Rows`.
- Want to process a large file in "batches"/chunks? Use `CSV.Chunks`.
- Need to produce a csv? Use `CSV.write`.
- Want to iterate an input table and produce a single csv string per row? Use `CSV.RowWriter`.
For the rest of the manual, we're going to have two big sections, Reading and Writing, where we'll walk through the various options to `CSV.File`/`CSV.read`/`CSV.Rows`/`CSV.Chunks` and `CSV.write`/`CSV.RowWriter`.