Using C and C++ for data science

While languages like Python and R are increasingly popular for data science, C and C++ can be a strong choice for efficient and effective data science. In this article, we will use C99 and C++11 to write a program that uses the Anscombe’s quartet dataset, which I’ll explain next.

I wrote about my motivation for continually learning languages in an article covering Python and GNU Octave, which is worth reviewing. All of the programs are meant to be run on the command line, not with a graphical user interface (GUI). The full examples are available in the polyglot_fit repository.

The programming task

The program you will write in this series:

  • Reads data from a CSV file
  • Interpolates the data with a straight line (i.e., f(x)=m ⋅ x + q)
  • Plots the result to an image file

This is a common situation that many data scientists have encountered. The example data is the first set of Anscombe’s quartet, shown in the table below. The quartet consists of four artificially constructed data sets that give the same results when fitted with a straight line, but whose plots are very different. The data file is a text file with tabs as column separators and a few lines as a header. This task will use only the first set (i.e., the first two columns).

Anscombe’s quartet

I             II            III           IV
x     y       x     y       x     y       x     y
10.0  8.04    10.0  9.14    10.0  7.46    8.0   6.58
8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
13.0  7.58    13.0  8.74    13.0  12.74   8.0   7.71
9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
11.0  8.33    11.0  9.26    11.0  7.81    8.0   8.47
14.0  9.96    14.0  8.10    14.0  8.84    8.0   7.04
6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
4.0   4.26    4.0   3.10    4.0   5.39    19.0  12.50
12.0  10.84   12.0  9.13    12.0  8.15    8.0   5.56
7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89

The C way

C is a general-purpose programming language that is among the most popular languages in use today (according to data from the TIOBE Index, RedMonk Programming Language Rankings, Popularity of Programming Language Index, and State of the Octoverse of GitHub). It is quite an old language (circa 1973), and many successful programs were written in it (e.g., the Linux kernel and Git, to name just two). It is also one of the closest languages to the inner workings of the computer, as it is used to manipulate memory directly. It is a compiled language; therefore, the source code has to be translated by a compiler into machine code. Its standard library is small and light on features, so other libraries have been developed to provide the missing functionality.

It is the language I use the most for number crunching, mostly because of its performance. I find it rather tedious to use, as it needs a lot of boilerplate code, but it is well supported in various environments. The C99 standard is a recent revision that adds some nifty features and is well supported by compilers.

I’ll cover the necessary background of C and C++ programming along the way so both beginners and advanced users can follow along.

Installation

To develop with C99, you need a compiler. I normally use Clang, but GCC is another valid open source compiler. For linear fitting, I chose to use the GNU Scientific Library. For plotting, I could not find any sensible library, and therefore this program relies on an external program: Gnuplot. The example also uses a dynamic data structure to store data, which is defined in the Berkeley Software Distribution (BSD).

Installing in Fedora is as easy as running:

sudo dnf install clang gnuplot gsl gsl-devel

In C99, comments are formatted by putting // at the beginning of the line, and the rest of the line will be discarded by the compiler. Alternatively, anything between /* and */ is discarded as well.

// This is a comment ignored by the compiler.
/* Also this is ignored */

Necessary libraries

Libraries are composed of two parts:

  • A header file that contains a description of the functions
  • A source file that contains the functions’ definitions

Header files are included in the source, while the libraries’ compiled code is linked into the executable. Therefore, the header files needed for this example are:

// Input/Output utilities
#include <stdio.h>
// The standard library
#include <stdlib.h>
// String manipulation utilities
#include <string.h>
// BSD queue
#include <sys/queue.h>
// GSL scientific utilities
#include <gsl/gsl_fit.h>
#include <gsl/gsl_statistics_double.h>

Main function

In C, the program must be inside a special function called main():

int main(void) {
    ...
}

This differs from Python, as covered in the last tutorial, which will run whatever code it finds in the source files.

Defining variables

In C, variables have to be declared before they are used, and they have to be associated with a type. Whenever you want to use a variable, you have to decide what kind of data to store in it. You can also specify if you intend to use a variable as a constant value, which is not necessary, but the compiler can benefit from this information. From the fitting_C99.c program in the repository:

const char *input_file_name = "anscombe.csv";
const char *delimiter = "\t";
const unsigned int skip_header = 3;
const unsigned int column_x = 0;
const unsigned int column_y = 1;
const char *output_file_name = "fit_C99.csv";
const unsigned int N = 100;

Arrays in C are not dynamic, in the sense that their length has to be decided in advance (i.e., before compilation):

int data_array[1024];

Since you normally do not know how many data points are in a file, use a singly linked list. This is a dynamic data structure that can grow indefinitely. Luckily, the BSD provides linked lists. Here is an example definition:

struct data_point {
    double x;
    double y;

    SLIST_ENTRY(data_point) entries;
};

SLIST_HEAD(data_list, data_point) head = SLIST_HEAD_INITIALIZER(head);
SLIST_INIT(&head);

This example defines a data_point list comprised of structured values that contain both an x value and a y value. The syntax is rather complicated but intuitive, and describing it in detail would be too wordy.

Printing output

To print on the terminal, you can use the printf() function, which works like Octave’s printf() function (described in the first article):

printf("#### Anscombe's first set with C99 ####\n");

The printf() function does not automatically add a newline at the end of the printed string, so you have to add it. The first argument is a string that can contain format information for the other arguments that can be passed to the function, such as:

printf("Slope: %f\n", slope);

Reading data

Now comes the hard part… There are some libraries for CSV file parsing in C, but none seemed stable or popular enough to be in the Fedora packages repository. Instead of adding a dependency for this tutorial, I decided to write this part on my own. Again, going into details would be too wordy, so I will only explain the general idea. Some lines in the source will be omitted for the sake of brevity, but you can find the complete example in the repository.

First, open the input file:

FILE* input_file = fopen(input_file_name, "r");

Then read the file line-by-line until there is an error or the file ends:

while (!ferror(input_file) && !feof(input_file)) { ... }

The getline() function is a nice recent addition from the POSIX.1-2008 standard. It can read a whole line in a file and take care of allocating the necessary memory. Each line is then split into tokens with the strtok() function. Looping over the tokens, select the columns that you want:

char *token = strtok(buffer, delimiter);

while (token != NULL) {
    double value;
    sscanf(token, "%lf", &value);

    if (column == column_x) {
        x = value;
    } else if (column == column_y) {
        y = value;
    }

    column += 1;
    token = strtok(NULL, delimiter);
}

Finally, when the x and y values are selected, insert the new data point in the linked list:

struct data_point *datum = malloc(sizeof(struct data_point));
datum->x = x;
datum->y = y;

SLIST_INSERT_HEAD(&head, datum, entries);

The malloc() function dynamically allocates (reserves) some persistent memory for the new data point.

Fitting data

The GSL linear fitting function gsl_fit_linear() expects simple arrays for its input. Therefore, since you won’t know in advance the size of the arrays you create, you must manually allocate their memory:

const size_t entries_number = row - skip_header - 1;

double *x = malloc(sizeof(double) * entries_number);
double *y = malloc(sizeof(double) * entries_number);

Then, loop over the linked list to save the relevant data to the arrays:

SLIST_FOREACH(datum, &head, entries) { ... }

Now that you are done with the linked list, clean it up. Always release the memory that has been manually allocated to prevent a memory leak. Memory leaks are bad, bad, bad. Every time memory is not released, a garden gnome loses its head:

while (!SLIST_EMPTY(&head)) { ... }

Finally, finally(!), you can fit your data:

gsl_fit_linear(x, 1, y, 1, entries_number,
               &intercept, &slope,
               &cov00, &cov01, &cov11, &chi_squared);
const double r_value = gsl_stats_correlation(x, 1, y, 1, entries_number);

printf("Slope: %f\n", slope);
printf("Intercept: %f\n", intercept);
printf("Correlation coefficient: %f\n", r_value);

Plotting

You must use an external program for the plotting. Therefore, save the fitting function to an external file:

const double step_x = ((max_x + 1) - (min_x - 1)) / N;

for (unsigned int i = 0; i < N; i += 1) { ... }

The Gnuplot command for plotting both files is:

plot 'fit_C99.csv' using 1:2 with lines title 'Fit', 'anscombe.csv' using 1:2 with points pointtype 7 title 'Data'

Results

Before running the program, you must compile it:

clang -std=c99 -I/usr/include/ fitting_C99.c -L/usr/lib/ -L/usr/lib64/ -lgsl -lgslcblas -o fitting_C99

This command tells the compiler to use the C99 standard, read the fitting_C99.c file, link against the gsl and gslcblas libraries, and save the result to fitting_C99. The resulting output on the command line is:

#### Anscombe's first set with C99 ####
Slope: 0.500091
Intercept: 3.000091
Correlation coefficient: 0.816421

Here is the resulting image generated with Gnuplot.

The C++11 way

C++ is a general-purpose programming language that is also among the most popular languages in use today. It was created as a successor of C (in 1983) with an emphasis on object-oriented programming (OOP). C++ is commonly regarded as a superset of C, so a C program should be able to be compiled with a C++ compiler. This is not exactly true, as there are some corner cases where they behave differently. In my experience, C++ needs less boilerplate than C, but the syntax is more difficult if you want to develop objects. The C++11 standard is a recent revision that adds some nifty features and is more or less supported by compilers.

Since C++ is largely compatible with C, I will just highlight the differences between the two. If I do not cover a section in this part, it means that it is the same as in C.

Installation

The dependencies for the C++ example are the same as the C example. On Fedora, run:

sudo dnf install clang gnuplot gsl gsl-devel

Necessary libraries

Libraries work in the same way as in C, but the include directives are slightly different:

#include <cstdlib>
#include <cstring>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>

extern "C" {
#include <gsl/gsl_fit.h>
#include <gsl/gsl_statistics_double.h>
}

Since the GSL libraries are written in C, you must inform the compiler about this peculiarity.

Defining variables

C++ supports more data types (classes) than C, such as a string type that has many more features than its C counterpart. Update the definition of the variables accordingly:

const std::string input_file_name("anscombe.csv");

For structured objects like strings, you can define the variable without using the = sign.

Printing output

You can use the printf() function, but the cout object is more idiomatic. Use the operator << to indicate the string (or objects) that you want to print with cout:

std::cout << "#### Anscombe's first set with C++11 ####" << std::endl;

...

std::cout << "Slope: " << slope << std::endl;
std::cout << "Intercept: " << intercept << std::endl;
std::cout << "Correlation coefficient: " << r_value << std::endl;

Reading data

The scheme is the same as before. The file is opened and read line-by-line, but with a different syntax:

std::ifstream input_file(input_file_name);

while (input_file.good()) {
    std::string line;

    getline(input_file, line);

    ...
}

The line tokens are extracted with the same function as in the C99 example. Instead of using standard C arrays, use two vectors. Vectors are an extension of C arrays in the C++ standard library that allows dynamic management of memory without explicitly calling malloc():

std::vector<double> x;
std::vector<double> y;

// Adding an element to x and y:
x.emplace_back(value);
y.emplace_back(value);

Fitting data

For fitting in C++, you do not have to loop over the list, as vectors are guaranteed to have contiguous memory. You can directly pass the pointers to the vectors’ buffers to the fitting function:

gsl_fit_linear(x.data(), 1, y.data(), 1, entries_number,
               &intercept, &slope,
               &cov00, &cov01, &cov11, &chi_squared);
const double r_value = gsl_stats_correlation(x.data(), 1, y.data(), 1, entries_number);

std::cout << "Slope: " << slope << std::endl;
std::cout << "Intercept: " << intercept << std::endl;
std::cout << "Correlation coefficient: " << r_value << std::endl;

Plotting

Plotting is done with the same approach as before. Write to a file:

const double step_x = ((max_x + 1) - (min_x - 1)) / N;

for (unsigned int i = 0; i < N; i += 1) {
    const double current_x = (min_x - 1) + step_x * i;
    const double current_y = intercept + slope * current_x;

    output_file << current_x << "\t" << current_y << std::endl;
}

output_file.close();

And then use Gnuplot for the plotting.

Results

Before running the program, it must be compiled with a similar command:

clang++ -std=c++11 -I/usr/include/ fitting_Cpp11.cpp -L/usr/lib/ -L/usr/lib64/ -lgsl -lgslcblas -o fitting_Cpp11

The resulting output on the command line is:

#### Anscombe's first set with C++11 ####
Slope: 0.500091
Intercept: 3.00009
Correlation coefficient: 0.816421

And this is the resulting image generated with Gnuplot.

Conclusion

This article provides examples for a data fitting and plotting task in C99 and C++11. Since C++ is largely compatible with C, this article exploited their similarities for writing the second example. In some aspects, C++ is easier to use because it partially relieves the burden of explicitly managing memory. But the syntax is more complex because it introduces the possibility of writing classes for OOP. However, it is still possible to write software in C with the OOP approach. Since OOP is a style of programming, it can be used in any language. There are some great examples of OOP in C, such as the GObject and Jansson libraries.

For number crunching, I prefer working in C99 due to its simpler syntax and widespread support. Until recently, C++11 was not as widely supported, and I tended to avoid the rough edges in the previous versions. For more complex software, C++ could be a good choice.

Do you use C or C++ for data science as well? Share your experiences in the comments.
